Generalizing AMP to work on CPU

Intel is interested in bringing automatic mixed precision to CPU; see [RFC] Extend Autocast to CPU/CUDA with BF16 data type · Issue #55374 · pytorch/pytorch · GitHub. One big question is what the autocasting API should look like for CPU: should we provide a single, generalized API, torch.autocast (keeping in mind that CPU autocasting would use bfloat16, while the existing GPU autocasting uses float16), or provide separate APIs for CPU and CUDA? If you have any thoughts or opinions on the subject, please chime in on the issue.
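For concreteness, here is a sketch of what the two designs under discussion might look like in user code. The names and signatures are illustrative of the proposal, not settled API; only torch.cuda.amp.autocast exists today.

```python
import torch

# Option A: one generalized API, parameterized by device type and dtype.
# (Proposed signature from the RFC discussion -- illustrative only.)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    a = torch.randn(8, 8)
    b = torch.randn(8, 8)
    c = torch.mm(a, b)  # matmul would run in bfloat16 under CPU autocast

# Option B: separate per-backend context managers, mirroring the
# existing CUDA-only API:
#   with torch.cuda.amp.autocast():  # float16 on GPU (exists today)
#   with torch.cpu.amp.autocast():   # hypothetical bfloat16 CPU variant
```

Either way, ops on the autocast allowlist (like the matmul above) would execute in the lower-precision dtype, while precision-sensitive ops stay in float32.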