About "[FSDP] Don't move ignored params / buffers to device"

Hi,

I ran into the following error after upgrading PyTorch from 2.1.0 to 2.3.1:

"torch/nn/functional.py", line 2264, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:7! (when checking argument for argument index in method wrapper_CUDA__index_select)

This looks related to [RFC][FSDP] Don't move ignored params / buffers to device by rohan-varma · Pull Request #108033 · pytorch/pytorch · GitHub, where ignored_parameters are no longer moved to the device. Since the GitHub issue has been closed, I'm wondering whether the change could be discussed further here.

Ignored parameters are not sharded and their gradients are not reduced, but they are still likely to be part of the computation. Would it make more sense to stick with the previous logic, or am I missing something? Any comment is appreciated.
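
For reference, here is a minimal sketch of the kind of setup that hits this (the toy model, names, and the explicit `.to(device)` workaround are my own illustrative assumptions, not the actual training code); it assumes `torch.distributed` has already been initialized:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(1000, 64)  # will be ignored by FSDP
        self.proj = nn.Linear(64, 64)

    def forward(self, idx):
        # The ignored embedding still participates in forward/backward.
        return self.proj(self.embed(idx))


def wrap(model: ToyModel, device: torch.device) -> FSDP:
    fsdp_model = FSDP(
        model,
        device_id=device,
        ignored_modules=[model.embed],  # not sharded, grads not reduced
    )
    # Workaround sketch: since the change in PR #108033, ignored
    # params/buffers are left on their original device (CPU here), so
    # move them manually to avoid the "Expected all tensors to be on
    # the same device" error at the embedding lookup.
    model.embed.to(device)
    return fsdp_model
```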


I am also concerned about this issue.

In my training code, I freeze some params in my LLM model, but I still need those params in the forward/backward pass.
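
To be concrete, the pattern is roughly like this sketch (the toy module is a made-up stand-in, not my real model): some parameters are frozen, yet they still produce activations every step, so they must live on the same device as the inputs even if FSDP treats them as ignored.

```python
import torch.nn as nn

# Hypothetical stand-in for the LLM: the embedding is frozen but still
# used in every forward/backward pass.
llm = nn.Sequential(
    nn.Embedding(32000, 512),
    nn.Linear(512, 512),
)

# Freeze the embedding: no gradient updates for it...
for p in llm[0].parameters():
    p.requires_grad_(False)

# ...but it still runs in forward, so if it is handed to FSDP as ignored
# and left on CPU, the lookup fails against CUDA inputs.
```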
