Our goal is to define every ufunc in PyTorch as a simple templated inner loop function like the one below:
template <typename T>
C10_HOST_DEVICE T add(T self, T other, T alpha) {
  return self + alpha * other;
}
and transform this into a full torch.add kernel.
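As a rough illustration of what that transformation involves, here is a minimal host-only C++ sketch. The names `pointwise_binary_with_alpha` and `add_kernel` are hypothetical and are not part of PyTorch or the proposal; they only stand in for the code that would be generated around the inner loop function. A real kernel would also have to handle broadcasting, strides, dtype dispatch, vectorization, and CUDA, none of which is shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In PyTorch, C10_HOST_DEVICE comes from c10/macros/Macros.h. This fallback
// is an assumption that lets the sketch compile as plain host C++.
#ifndef C10_HOST_DEVICE
#define C10_HOST_DEVICE
#endif

// The inner loop function from the post: one scalar in, one scalar out.
template <typename T>
C10_HOST_DEVICE T add(T self, T other, T alpha) {
  return self + alpha * other;
}

// Hypothetical stand-in for generated code: applies a scalar function
// elementwise over two contiguous, same-sized inputs.
template <typename T, typename F>
std::vector<T> pointwise_binary_with_alpha(const std::vector<T>& self,
                                           const std::vector<T>& other,
                                           T alpha, F op) {
  assert(self.size() == other.size());
  std::vector<T> out(self.size());
  for (std::size_t i = 0; i < self.size(); ++i) {
    out[i] = op(self[i], other[i], alpha);
  }
  return out;
}

// Hypothetical "full kernel" entry point built from the inner loop function.
template <typename T>
std::vector<T> add_kernel(const std::vector<T>& self,
                          const std::vector<T>& other, T alpha) {
  return pointwise_binary_with_alpha(
      self, other, alpha,
      [](T a, T b, T c) { return add<T>(a, b, c); });
}
```

The point of the sketch is the split: the ufunc author writes only `add`, and everything in the two lower functions is boilerplate that could, in principle, be produced mechanically.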
The proposal for how to do this is in "Ufunc codegen for pointwise operators" (Google Docs).
The WIP PR is "ufunc codegen" by ezyang, Pull Request #65851 in pytorch/pytorch on GitHub.