Delving into what happens when you `import torch`

Overview

I recently spent some time looking into how long `import torch` takes and where that time goes; this post shares some learnings from that investigation.

Python provides an environment variable, `PYTHONPROFILEIMPORTTIME` (equivalent to the `-X importtime` flag), that will print a breakdown of how long each import takes.


PYTHONPROFILEIMPORTTIME=1 python -c "import torch;" 2> import_torch_profile.txt

The output timings are visualized with tuna below (e.g. `tuna import_torch_profile.txt`). Note that the diagram is left-heavy, that is, the longest import within each package is on the left.

From the above, we can see that the top 10 modules by import time are:

  1. torch._C (41.1%)

  2. torch._meta_registrations (18.9%)

    • imports decomps, refs, prims, _custom_op.impl
  3. torch._masked (16.3%)

    • first module to import sympy (15.8%)
  4. torch.functional (4.3%)

  5. torch.export (3.2%)

  6. torch.quantization (2.1%)

  7. torch.utils.data (1.1%)

  8. torch.hub (1.1%)

  9. torch.optim (1.0%)

  10. torch.distributions (0.7%)

We can also see that the sympy (15.8%) and numpy (6.3%) imports are the external imports that take up the most time, together accounting for around 22% of the total import time.
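As a quick sanity check, one can confirm from Python that these third-party modules are pulled in eagerly (a minimal sketch using only the standard library; the exact set of eagerly imported modules depends on the torch version):

```python
import sys

import torch

# Per the profile above, both modules are imported eagerly, so they
# appear in sys.modules immediately after `import torch`.
for mod in ("sympy", "numpy"):
    print(mod, mod in sys.modules)
```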

Registrations to the dispatcher

Operators and operator schemas are registered to the dispatcher at static initialization time during import torch. We can visualize this by using py-spy with the --native option to obtain the C++ stack traces for import torch.


py-spy record -o profile.speedscope -f speedscope --native -r 2000 -- python -c "import torch;"

Note that since py-spy pauses the process while collecting samples, the timings in the profile are distorted. However, we can see from the profile that the C++ registrations to the dispatcher are the bulk of the torch._C import.

Zooming in on what happens within each `_GLOBAL__sub_I_Register{*}.cpp`, we can see many consecutive `impl` or `def` calls, mirroring the `m.impl` and `m.def` calls that a `TORCH_LIBRARY(_IMPL)` block contains.

Registrations to the dispatcher can be done both via C++ (e.g. the `TORCH_LIBRARY_{*}` macros, which correspond to the `_GLOBAL__sub_I_Register*.cpp` entries above) and via Python (e.g. the Python meta registrations done with `torch.library`). To understand how registrations split between C++ and Python, the following diagram (generated via this gist) breaks down schema and operator registrations, including how many of them are registered from Python.
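For reference, here is a minimal sketch of what a Python-side registration looks like with `torch.library` (the `myops`/`myadd` names are hypothetical; the mechanism is the same one `torch._meta_registrations` uses for real ATen operators):

```python
import torch

# Hypothetical operator namespace: "DEF" means this library owns
# the schema definitions for the namespace.
lib = torch.library.Library("myops", "DEF")
lib.define("myadd(Tensor a, Tensor b) -> Tensor")

# A meta kernel computes only output metadata (shape/dtype/device),
# never touching real data.
def myadd_meta(a, b):
    return torch.empty_like(a)

# Register the meta kernel under the Meta dispatch key.
lib.impl("myadd", myadd_meta, "Meta")
```

Each `define`/`impl` call here lands in the same dispatcher tables as the C++ `m.def`/`m.impl` calls above, just at module import time rather than at static initialization time.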

Effect of lazy imports

In #104368, PEP 562 was used to make the torch/_dynamo and torch/_inductor imports lazy. More recently, the torch/onnx and torch/_export imports were also made lazy. We can see the corresponding time breakdown when these imports are not lazy below.
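For context, here is a minimal sketch of how PEP 562 enables this, via a module-level `__getattr__` in a package's `__init__.py` (the submodule names mirror the PR; the actual torch implementation differs in the details):

```python
import importlib

_lazy_submodules = {"_dynamo", "_inductor"}

def __getattr__(name):
    if name in _lazy_submodules:
        # Imported only on first attribute access,
        # not when the package itself is imported.
        module = importlib.import_module(f".{name}", __name__)
        globals()[name] = module  # cache so __getattr__ isn't hit again
        return module
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```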

Conclusion

This post gave an overview of what happens when you import torch. If you have ideas or want to help reduce / monitor this, please reach out!

Note: The above analysis was done using an install of the 2.1.0 release binary on an Intel Xeon Platinum 8339HC.


Put up a stack of PRs to remove the import-time dependency on SymPy:


It may be possible to do the same thing with NumPy… perhaps. We do use NumPy in `numpy()` and a few other methods, so I'm not sure, but it may be worth exploring.

@Lezcano very cool! I ran the profile with your stack.

It seems like more large projects like Pydantic are moving towards making everything lazy (https://twitter.com/samuel_colvin/status/1717945904773108203), and I am wondering if we can do the same, at least for export, quantization, data, and hub.

Also, re torch._C and _meta_registrations: are those costs we just need to eat, or are there plausible ways of reducing their import times?

It might be possible to lazify the modules you mentioned (export, quantization, data, hub). The time shaved off would be on the order of 7%; would this be significant enough from a cold-start serving perspective?

In terms of the torch._C and _meta_registrations imports, these have side effects (registrations to the dispatcher) that need to happen, so I don't think we can decrease the number of registrations done at import time. Otherwise, if the dispatcher state were queried, it might be missing a bunch of ops.
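As a concrete illustration using only public APIs: the meta kernels registered at import time are immediately exercisable, e.g. for shape-only computation on meta tensors.

```python
import torch

# torch._meta_registrations has already populated the dispatcher,
# so this dispatches to a meta kernel and touches no real data.
x = torch.empty(2, 3, device="meta")
y = torch.empty(3, 4, device="meta")
out = torch.mm(x, y)
print(out.shape, out.device)  # torch.Size([2, 4]) meta
```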

From a brief chat with Ed previously, there might be some C++ code optimizations to speed up dispatcher registrations, but I don't have a sense of how much that would help or how fruitful it would be.