About "[FSDP] Don't move ignored params / buffers to device"

Hi,

I ran into the following error after upgrading PyTorch from 2.1.0 to 2.3.1:

"torch/nn/functional.py", line 2264, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:7! (when checking argument for argument index in method wrapper_CUDA__index_select)

This looks related to [RFC][FSDP] Don't move ignored params / buffers to device by rohan-varma · Pull Request #108033 · pytorch/pytorch · GitHub, where ignored_parameters are no longer moved to the device. Since the GitHub issue has been closed, I'm wondering whether the change could be discussed further here.

Ignored parameters are not sharded and their gradients are not reduced, but they are still likely to be part of the computation. Would it make more sense to stick with the previous logic, or am I missing something? Any comment is appreciated.
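
For reference, here is a minimal sketch of the kind of setup that hits this (the toy model, names, and the explicit `.to(device)` workaround are my own illustrative assumptions, not the actual training code); it assumes `torch.distributed` has already been initialized:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(1000, 64)  # will be ignored by FSDP
        self.proj = nn.Linear(64, 64)

    def forward(self, idx):
        # The ignored embedding still participates in forward/backward.
        return self.proj(self.embed(idx))


def wrap(model: ToyModel, device: torch.device) -> FSDP:
    fsdp_model = FSDP(
        model,
        device_id=device,
        ignored_modules=[model.embed],  # not sharded, grads not reduced
    )
    # Workaround sketch: since the change in PR #108033, ignored
    # params/buffers are left on their original device (CPU here), so
    # move them manually to avoid the "Expected all tensors to be on
    # the same device" error at the embedding lookup.
    model.embed.to(device)
    return fsdp_model
```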


I am also concerned about this issue.

In my training code, I freeze some params in my LLM model, but I still need those params in the forward/backward pass.
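
To be concrete, the pattern is roughly like this sketch (the toy module is a made-up stand-in, not my real model): some parameters are frozen, yet they still produce activations every step, so they must live on the same device as the inputs even if FSDP treats them as ignored.

```python
import torch.nn as nn

# Hypothetical stand-in for the LLM: the embedding is frozen but still
# used in every forward/backward pass.
llm = nn.Sequential(
    nn.Embedding(32000, 512),
    nn.Linear(512, 512),
)

# Freeze the embedding: no gradient updates for it...
for p in llm[0].parameters():
    p.requires_grad_(False)

# ...but it still runs in forward, so if it is handed to FSDP as ignored
# and left on CPU, the lookup fails against CUDA inputs.
```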
