Improve the extension with `PrivateUse1` for custom device

Some more device-specific things to consider for GradScaler, in addition to the `_amp_foreach_non_finite_check_and_unscale_` op:

  • There is a MultiDeviceReplicator that lazily serves copies of found_inf and grad_scale to support the foreach op. (It, too, currently supports only XLA/CUDA.)
  • The other amp op, _amp_update_scale, is CUDA-only, though if the scale is allowed to remain on CUDA, supporting it on the custom device may be optional
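
To make the first point concrete, here is a minimal sketch of the lazy-replication pattern in plain Python. It is not the actual GradScaler code: `LazyDeviceReplicator`, `fake_copy`, and the device strings are hypothetical names chosen for illustration, and the copy function is a stand-in for a real device-to-device tensor copy.

```python
from typing import Callable, Dict, Generic, Hashable, TypeVar

T = TypeVar("T")


class LazyDeviceReplicator(Generic[T]):
    """Hypothetical sketch: serve per-device copies of one master value on demand."""

    def __init__(self, master: T, copy_to: Callable[[T, Hashable], T]) -> None:
        self.master = master
        self._copy_to = copy_to
        self._per_device: Dict[Hashable, T] = {}

    def get(self, device: Hashable) -> T:
        # Copy only the first time a device asks; later calls reuse the cache.
        if device not in self._per_device:
            self._per_device[device] = self._copy_to(self.master, device)
        return self._per_device[device]


# Stand-in for a real device copy (assumption); it records every copy made
# so that the laziness is observable.
copies_made = []


def fake_copy(value, device):
    copies_made.append(device)
    return (device, value)


found_inf = LazyDeviceReplicator(0.0, fake_copy)
found_inf.get("privateuseone:0")  # first request: triggers a copy
found_inf.get("privateuseone:0")  # second request: served from the cache
found_inf.get("privateuseone:1")  # new device: triggers another copy
```

With real tensors, the copy function would presumably be something like `lambda t, d: t.to(device=d, non_blocking=True, copy=True)`. The point for a custom device is that nothing in the pattern is CUDA- or XLA-specific, so the device restriction could likely be relaxed rather than reimplemented.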