Delving into what happens when you `import torch`

Overview

I recently spent some time looking into how long `import torch` takes and where that time goes; this post shares some learnings from that investigation.

Python provides an environment variable, `PYTHONPROFILEIMPORTTIME` (equivalent to the `-X importtime` flag), that will print a breakdown of how long each import takes.


PYTHONPROFILEIMPORTTIME=1 python -c "import torch;" 2> import_torch_profile.txt

The output timings are visualized with tuna below (e.g. `tuna import_torch_profile.txt`). Note that the diagram is left-heavy, that is, the longest import within each package is on the left.

From the above, we can see that the top 10 modules by import time are:

  1. torch._C (41.1%)

  2. torch._meta_registrations (18.9%)

    • imports decomps, refs, prims, _custom_op.impl
  3. torch._masked (16.3%)

    • first module to import sympy (15.8%)
  4. torch.functional (4.3%)

  5. torch.export (3.2%)

  6. torch.quantization (2.1%)

  7. torch.utils.data (1.1%)

  8. torch.hub (1.1%)

  9. torch.optim (1.0%)

  10. torch.distributions (0.7%)

We can also see that the sympy (15.8%) and numpy (6.3%) imports are the external imports that take up the most time, together accounting for around 22% of the total import time.
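As a quick sanity check, one can confirm from Python that these third-party modules are pulled in eagerly (a minimal sketch using only the standard library; the exact set of eagerly imported modules depends on the torch version):

```python
import sys

import torch

# Per the profile above, both modules are imported eagerly, so they
# appear in sys.modules immediately after `import torch`.
for mod in ("sympy", "numpy"):
    print(mod, mod in sys.modules)
```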

Registrations to the dispatcher

Operators and operator schemas are registered to the dispatcher at static initialization time during import torch. We can visualize this by using py-spy with the --native option to obtain the C++ stack traces for import torch.


py-spy record -o profile.speedscope -f speedscope --native -r 2000 -- python -c "import torch;"

Note that since py-spy pauses the process while collecting samples, the timings in the profile are distorted. However, we can see from the profile that the C++ registrations to the dispatcher are the bulk of the torch._C import.

Zooming in on what happens within each `_GLOBAL__sub_I_Register{*}.cpp`, we can see many consecutive `impl` or `def` calls, mirroring the `m.impl` and `m.def` calls that a `TORCH_LIBRARY(_IMPL)` block contains.

Registrations to the dispatcher can be done both via C++ (e.g. the `TORCH_LIBRARY_{*}` macros, which correspond to the `_GLOBAL__sub_I_Register*.cpp` entries above) and via Python (e.g. the Python meta registrations done with `torch.library`). To understand how registrations split between C++ and Python, the following diagram (generated via this gist) breaks down schema and operator registrations, including how many of them are registered from Python.
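For reference, here is a minimal sketch of what a Python-side registration looks like with `torch.library` (the `myops`/`myadd` names are hypothetical; the mechanism is the same one `torch._meta_registrations` uses for real ATen operators):

```python
import torch

# Hypothetical operator namespace: "DEF" means this library owns
# the schema definitions for the namespace.
lib = torch.library.Library("myops", "DEF")
lib.define("myadd(Tensor a, Tensor b) -> Tensor")

# A meta kernel computes only output metadata (shape/dtype/device),
# never touching real data.
def myadd_meta(a, b):
    return torch.empty_like(a)

# Register the meta kernel under the Meta dispatch key.
lib.impl("myadd", myadd_meta, "Meta")
```

Each `define`/`impl` call here lands in the same dispatcher tables as the C++ `m.def`/`m.impl` calls above, just at module import time rather than at static initialization time.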

Effect of lazy imports

In #104368, PEP 562 was used to make the torch/_dynamo and torch/_inductor imports lazy. More recently, the torch/onnx and torch/_export imports were also made lazy. We can see the corresponding time breakdown when these imports are not lazy below.
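For context, here is a minimal sketch of how PEP 562 enables this, via a module-level `__getattr__` in a package's `__init__.py` (the submodule names mirror the PR; the actual torch implementation differs in the details):

```python
import importlib

_lazy_submodules = {"_dynamo", "_inductor"}

def __getattr__(name):
    if name in _lazy_submodules:
        # Imported only on first attribute access,
        # not when the package itself is imported.
        module = importlib.import_module(f".{name}", __name__)
        globals()[name] = module  # cache so __getattr__ isn't hit again
        return module
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```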

Conclusion

This post gave an overview of what happens when you import torch. If you have ideas or want to help reduce / monitor this, please reach out!

Note: The above analysis was done using an install of the 2.1.0 release binary on an Intel Xeon Platinum 8339HC.


Put up a stack of PRs to remove the import-time dependency on SymPy:


It may be possible to do the same thing with NumPy… perhaps. We do use NumPy in `numpy()` and a few other methods, so I'm not sure, but it may be worth exploring.

@Lezcano very cool! I ran the profile with your stack.

It seems like more large projects like Pydantic are moving towards making everything lazy (https://twitter.com/samuel_colvin/status/1717945904773108203), and I am wondering if we can do the same, at least for export, quantization, data, and hub.

Also, re torch._C and _meta_registrations: are those costs we just need to eat, or are there plausible ways of reducing their import times?

It might be possible to lazify the modules you mentioned (export, quantization, data, hub). The time shaved off would be on the order of 7%; would this be significant enough from a cold-start serving perspective?

In terms of the torch._C and _meta_registrations imports, these have side effects (registrations to the dispatcher) that need to happen, so I don't think we can decrease the number of registrations done at import time. Otherwise, if the dispatcher state were queried, it might be missing a bunch of ops.
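As a concrete illustration using only public APIs: the meta kernels registered at import time are immediately exercisable, e.g. for shape-only computation on meta tensors.

```python
import torch

# torch._meta_registrations has already populated the dispatcher,
# so this dispatches to a meta kernel and touches no real data.
x = torch.empty(2, 3, device="meta")
y = torch.empty(3, 4, device="meta")
out = torch.mm(x, y)
print(out.shape, out.device)  # torch.Size([2, 4]) meta
```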

From a brief chat with Ed previously, there might be some C++ code optimizations to speed up dispatcher registrations, but I don't have a sense of how much that would help or how fruitful it would be.