Internals of Deferred Module Initialization

Lately I have been getting a lot of questions about how fake tensors and deferred module initialization work. Most people find them a bit “too magical” and wonder how we implemented them. Since we now also have major external projects, such as PyTorch Lightning, depending on deferred module initialization, I decided to write down the internal mechanics of both features and published a “design notes” section on the public torchdistX website. If you have had similar questions, I encourage you to check out the documentation. Of course, I would also appreciate any feedback.
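For readers unfamiliar with the feature, here is a minimal pure-Python sketch of the core idea. The function names mirror the torchdistX API (`deferred_init`, `materialize_module`), but the mechanics below are heavily simplified: the real implementation runs constructors under a fake-tensor mode so parameter shapes and dtypes are known without allocating device memory, whereas this sketch merely records the constructor call and replays it later.

```python
# Conceptual sketch only -- NOT the torchdistX implementation.
# Deferred init: record a module's constructor call instead of running it;
# materialization replays the recorded call to build the real module.

class DeferredModule:
    """Stands in for a module whose construction has been deferred."""
    def __init__(self, module_cls, *args, **kwargs):
        self._module_cls = module_cls
        self._args = args
        self._kwargs = kwargs


def deferred_init(module_cls, *args, **kwargs):
    # torchdistX additionally traces the constructor with fake tensors so
    # shape/dtype metadata is available; here we only record the call.
    return DeferredModule(module_cls, *args, **kwargs)


def materialize_module(deferred):
    # Replay the recorded constructor; real allocation happens here.
    return deferred._module_cls(*deferred._args, **deferred._kwargs)


class Linear:
    """Toy stand-in for a module with a large parameter."""
    def __init__(self, in_features, out_features):
        # Pretend this allocates a big weight matrix.
        self.weight = [[0.0] * in_features for _ in range(out_features)]


m = deferred_init(Linear, 4, 2)   # no parameter memory allocated yet
real = materialize_module(m)      # actual allocation happens here
print(len(real.weight), len(real.weight[0]))  # → 2 4
```

The deferred handle is cheap to create no matter how large the module would be, which is what makes the pattern useful for initializing models that do not fit on a single device.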



Some naive questions:

  • Are there any advantages to meta tensors over fake tensors?
  • Would fake tensors ever “replace” meta tensors? Or could they both be unified at some point?
  • Is there any relationship to lazy tensors?