DTensor random ops mesh support for a backend is checked with is_rng_supported_mesh, which simply tests hasattr(device_handle, "set_rng_state").
For a plain CPU RNG state this seems to return False. However, if a backend uses a CPU-style RNG state but does implement set_rng_state(), is_rng_supported_mesh returns True and the DTensor random mechanism tries to use OffsetBasedRNGTracker. The seed- and offset-based APIs assume the RNG state is a CUDA-like seed+offset blob, so a CPU-style RNG state does not work and fails as shown below -
[rank0]: File ".../.local/lib/python3.10/site-packages/torch/distributed/_tensor/random.py", line 176, in _distribute_region
[rank0]: old_offset = self.get_offset("parallel-rng")
[rank0]: File ".../.local/lib/python3.10/site-packages/torch/distributed/_tensor/random.py", line 195, in get_offset
[rank0]: return int(offset_tensor.item())
[rank0]: RuntimeError: a Tensor with 631 elements cannot be converted to Scalar
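For context, the 631 in the error is consistent with the tracker interpreting everything after the first 8 bytes of the state as an int64 offset, which only makes sense for a small CUDA-style seed+offset state. Below is a minimal sketch of what goes wrong with a CPU-style state; the state size and the exact slicing inside OffsetBasedRNGTracker are assumptions inferred from the error above, not verified against the source:

```python
import torch

# CPU RNG state is the full generator state (thousands of bytes),
# not an 8-byte seed followed by an 8-byte offset.
cpu_state = torch.get_rng_state()
print(cpu_state.shape)  # e.g. torch.Size([5056]) on a typical build

# Reading it the way an offset-based tracker reads a CUDA-style state
# (everything past the first 8 bytes viewed as int64) yields many elements,
# so .item() cannot convert it to a scalar.
offset_view = cpu_state[8:].view(dtype=torch.int64)
print(offset_view.numel())  # 631 when the state is 5056 bytes
# offset_view.item()  # would raise the RuntimeError shown above
```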
Is there a plan to enhance DTensor to support a CPU-like RNG state, or is it possible to make is_rng_supported_mesh return False for backends that use a CPU-style RNG state rather than an offset-based one?
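For illustration, a stricter check along the lines of the second option might look like the sketch below. This is hypothetical, not existing code: it assumes an offset-based state can be recognized by its small fixed seed+offset size (16 bytes), which may not hold for every backend, and it mirrors the getattr(torch, device_type) lookup only as an assumption about how the device handle is obtained.

```python
import torch
from torch.distributed.device_mesh import DeviceMesh

def is_offset_based_rng_mesh(device_mesh: DeviceMesh) -> bool:
    """Hypothetical stricter variant of is_rng_supported_mesh: only report True
    when the backend's RNG state actually looks like a seed+offset blob."""
    device_handle = getattr(torch, device_mesh.device_type, None)
    if device_handle is None or not hasattr(device_handle, "set_rng_state"):
        return False
    state = device_handle.get_rng_state()
    # A CUDA-style state is an 8-byte seed followed by an 8-byte offset;
    # a CPU-style state is the full generator state (thousands of bytes).
    return state.numel() == 16
```

With a check like this, backends that expose set_rng_state() but keep a CPU-style state would fall outside the offset-based RNG path instead of failing inside OffsetBasedRNGTracker.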