Internals of Deferred Module Initialization

Lately I have been getting a lot of questions about how fake tensors and deferred module initialization work. Most people find them a bit “too magical” and wonder how we implemented them. Since we now also have major external projects, such as PyTorch Lightning, depending on deferred module initialization, I decided to write down the internal mechanics of both features and published a “design notes” section on the public torchdistX website. If you have had similar questions, I encourage you to check out the documentation. Of course, I would also appreciate any feedback.
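For readers unfamiliar with the feature, here is a minimal pure-Python sketch of the core idea. The function names mirror the torchdistX API (`deferred_init`, `materialize_module`), but the mechanics below are heavily simplified: the real implementation runs constructors under a fake-tensor mode so parameter shapes and dtypes are known without allocating device memory, whereas this sketch merely records the constructor call and replays it later.

```python
# Conceptual sketch only -- NOT the torchdistX implementation.
# Deferred init: record a module's constructor call instead of running it;
# materialization replays the recorded call to build the real module.

class DeferredModule:
    """Stands in for a module whose construction has been deferred."""
    def __init__(self, module_cls, *args, **kwargs):
        self._module_cls = module_cls
        self._args = args
        self._kwargs = kwargs


def deferred_init(module_cls, *args, **kwargs):
    # torchdistX additionally traces the constructor with fake tensors so
    # shape/dtype metadata is available; here we only record the call.
    return DeferredModule(module_cls, *args, **kwargs)


def materialize_module(deferred):
    # Replay the recorded constructor; real allocation happens here.
    return deferred._module_cls(*deferred._args, **deferred._kwargs)


class Linear:
    """Toy stand-in for a module with a large parameter."""
    def __init__(self, in_features, out_features):
        # Pretend this allocates a big weight matrix.
        self.weight = [[0.0] * in_features for _ in range(out_features)]


m = deferred_init(Linear, 4, 2)   # no parameter memory allocated yet
real = materialize_module(m)      # actual allocation happens here
print(len(real.weight), len(real.weight[0]))  # → 2 4
```

The deferred handle is cheap to create no matter how large the module would be, which is what makes the pattern useful for initializing models that do not fit on a single device.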



Some naive questions:

  • Are there any advantages to meta tensors over fake tensors?
  • Would fake tensors ever “replace” meta tensors? Or could they both be unified at some point?
  • Is there any relationship to lazy tensors?