State of symbolic shapes branch

State of PT2: Mar 2, 2024 edition

  • Jeffrey and Nested Int (https://docs.google.com/document/d/1-CR1hP1d-rJgQzCnIRLrwfpm4RbGhXYCesCSnnHZ6r0/edit?usp=sharing). The design process has definitely been a bit of a slog, but we had a lot of forward progress: (1) our compromise between “cpu and cuda nested ints with equivalent data should compare equal” and “nested int API should not have implicit syncs or expensive compute” is that we will report true/false only if we know, without syncing, that two tensors are equal/non-equal, and otherwise we will raise an error (and then there’s an API for explicitly doing compute / trust me these are equal), (2) we should have a separate subsystem responsible for doing this on immutable tensors which is independent from nested ints, (3) we do need a concept of symbolic variable that ranges over nested ints. Some other aspects of the original PR can be replaced with more memoization of nested ints on fake tensor. According to Jeffrey: “put something together that passes preexisting NT tests and its looking a lot cleaner.”
  • Horace, Brian, Richard and I talked about Retracing Backward (Retrace the backward in AOTAutograd - Google Docs). This is the problem where AOTAutograd wants to trace the backwards of a graph ahead of time, but it doesn't know what memory layout / subclass the tangents will be ahead of time. In the end, we came up with three main proposals. (1) Easy to implement solution: the user just somehow tells us the layout of the tangents, we use it in AOTAutograd, everything stays the same. The main problem is the UI for how to actually do this: a manual API is difficult because it's not obvious what the outputs are (graph breaks). Could imagine some sort of machine-only format like PGO. (2) Trace the graph post-autograd but pre-subclass, and run the partitioner on this graph before desugaring subclasses. Kind of invasive: although Horace proposed it, he doesn't really like it. (3) Trace the tangent-dependent backwards graph pre-dispatch, desugar everything else. This proposal relies on a key property, which is that the partitioner isn't actually partitioning the entire graph: we can think of the graph as consisting of three parts: the compute for forward outputs, the compute that depends on tangents, and a no-man's land of intermediate compute. The partitioner chooses what forward compute to recompute, and whether or not to put intermediate compute in forward or backward, but cannot touch the compute that depends on tangents. In the absence of TorchDispatchMode, only the tangent-dependent compute can meaningfully change based on tangents. So you can still run the partitioner in full fidelity, and you just need to arrange for the tangent-dependent portion of the graph to be untraced (e.g., pre-dispatch tracing) to the degree that you can retrace it later when the tangents show up. This will be pretty difficult to implement, but we agree conceptually it should work, and it gives the best UX.
  • Avik and I knocked heads about what to do about everyone continuously stubbing their toe on guard on data-dependent shapes errors (psst, read Dealing with GuardOnDataDependentSymNode errors - Google Docs). Avik's big idea is that it would be a lot easier to fix models for these errors if we could closely correlate symbols to actual variable names in source code. This is not so bad in Dynamo, but it's a bit of a lift in non-strict export since we don't have enough hooks in Python to engineer this without a lot of black magic. But we could make it easier for people to probe what size variable any given source variable corresponds to. We talked about some recent problems people ran into: sometimes people are running into lazily evaluated expressions; my opinion is that this was a mistake and we should just guard instead. We also were able to follow up on some of the problems with some fixes.
  • Brian’s been busy!
    • In DTensor land, people are worrying about communication reordering passes, like changing a collective-select into select-collective. Must guarantee every node does the same thing. Graphs on different ranks are not guaranteed to be the same! Design is needed here.
    • Even in per-parameter FSDP (aka FSDP2), we are still using the storage resize to zero trick. Jason Ansel is adding storage resizing as a native concept to inductor, so it can be run as early as possible. The resize seems to be necessary, esp in partial graph cases, so gonna have to deal with it.
    • FSDP2 is facing an interesting input aliasing problem. We receive the sharded, unsharded and unsharded padded parameters all as arguments, and unsharded aliases unsharded padded (via an as_strided call). Then, when we copy_ the result of allgather into the unsharded view, this results in an as_strided_scatter (because it writes into the unsharded padded buffer). But if these are contiguous, we ought to be a lot faster. (A small sketch of this aliasing pattern is at the end of this post.)
  • Composability sync https://www.youtube.com/watch?v=ACR1WnRScCc
    • User defined Triton v custom ops: the resolution was that we will "release" user defined Triton kernels in PT2, but we will encourage people to go straight to the custom ops API when writing new code. (However, Richard is also investigating how to get rid of some of the main pain points people have about working with custom ops, more below.)
    • C++ FX: not gonna happen, we’ll probably do other stuff.
  • Down the grapevine from Michael Suo: we’re probably going to care about non-Intel CPU performance in PT2 soon… Meta only: https://docs.google.com/document/d/1m_aQaMFF6T62z2kH8yaS9SIuVijHUR7vNn6DutSU1Xo
  • Greg (channeling Soumith) asked me a big question that is worth thinking about in 2024: how can we make PyTorch less complex? (Alternately, how can we make development in PyTorch go faster?)
  • Richard has been trying to figure out how to get rid of the schema requirement for custom ops, and he pitched me a pretty interesting idea: what if we FX trace the inner implementation of an operator, stubbing out the (simpler) bits where we actually write to data pointers, and then use that graph to get the schema? You can potentially use this to get the meta function for free (since you've just "commented out" the actual compute). See the sketch at the end of this post.
  • Sparse arch enablement is getting to the point where we need a spreadsheet of doom. Edward to organize at least the first version of it. A snapshot of the enablement issues recently: unbacked SymInts related to guard size oblivious, a Dynamo list append regression, adjust_info_B_num_bits shenanigans.
  • Flavio’s baby is coming soon, I’ve been encouraging him to focus less on E2E enablement, and just landing stuff to main that we know is needed.
  • Some communication debugging with the new PL&R team: focus on time to first batch and megawatts saved metrics; metrics as a means to an end of improving compile time. Order of operations doesn’t matter too much, do the right thing. Cross-org comms is hard.
  • Animesh close to done with C++ guards. Not a clear way to measure the benefit. Some discussion about a new problem where because we don’t inline builtin torch nn modules, we guard on their ID, which leads to spurious recompiling when torch.compile is on smaller pieces of model rather than whole model. Best solution is to trace through (this helps Jack Cao too.)
  • Export is speccing out a “torch function” granularity operator stack info, similar to the nn module stack info, which you can use to unflatten a linear call. In non-strict export can be implemented as TorchFunctionMode, but needs Dynamo to understand where torch function interposition points are.
  • Structured logs landed! Set TORCH_TRACE and use tlparse to parse the result into an HTML report. Still MVP, lots of improvement possible.
  • Notable new bugs in dynamic shapes:
  • Landed stuff
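
As promised in the FSDP2 aliasing bullet above, here is a minimal sketch of that pattern with made-up sizes: the unsharded parameter is an as_strided view of the padded buffer, so functionalizing the copy_ of the allgather output turns into an as_strided_scatter on the padded base.

```python
import torch

# Hypothetical sizes: a padded flat buffer and an unpadded view aliasing it.
unsharded_padded = torch.zeros(16)
unsharded = unsharded_padded.as_strided((12,), (1,))  # aliases the first 12 elements

allgather_out = torch.randn(12)

# Eager mutation: write the allgather result into the view...
unsharded.copy_(allgather_out)

# ...which functionalization has to express as a scatter into the base buffer,
# roughly equivalent to:
unsharded_padded_updated = torch.as_strided_scatter(
    unsharded_padded, allgather_out, size=(12,), stride=(1,)
)
```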
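
And a hedged sketch related to Richard's schema idea above: it doesn't show the FX tracing itself, but it illustrates the "comment out the actual compute" part, since running an implementation whose data-pointer writes are stubbed out under fake tensors already gives you meta-function behavior. my_custom_op_impl is invented for illustration.

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

def my_custom_op_impl(x, y):
    # shape/dtype logic we'd like to reuse for the meta function
    out = x.new_empty(x.shape[0], y.shape[1])
    # ... the real kernel would write through out.data_ptr() here; that's the
    # part that gets stubbed out / "commented" away ...
    return out

# Running the implementation on fake tensors exercises only the metadata logic,
# which is exactly what a meta function needs to produce.
with FakeTensorMode():
    x = torch.empty(8, 4)
    y = torch.empty(4, 16)
    out = my_custom_op_impl(x, y)
    print(out.shape)  # torch.Size([8, 16]), computed without touching any data
```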

Are the design docs presented during PyTorch composability syncs available? Mainly interested in the PT2 dequant docs that @Chillee presented during the last meeting.

How can I learn more about the bullet

  • Export is speccing out a “torch function” granularity operator stack info, similar to the nn module stack info, which you can use to unflatten a linear call. In non-strict export can be implemented as TorchFunctionMode, but needs Dynamo to understand where torch function interposition points are.

@jeromeku Meeting notes are at Composability meeting notes - Google Docs ; sometimes people forget to make the docs public though. (Do you mean last meeting though? Maybe you’re referring to Feb 8?)

@thiagocrepaldi It’s this doc Rationalizing nn_module_stack and source_fn_stack - Google Docs


@ezyang @Chillee
Thanks! Found the aten.dequant doc in the meeting notes but it is not public.

State of PT2: Mar 7, 2024

Compiler and distributed offsite!

  • Benchmarking: Difficult to identify regressions that affect only a single benchmark, as oncalls only look at the accumulated statistics. Generally, Horace doesn't trust our benchmarks. Some ideas for making it better: measure only the CUDA kernel times to reduce variance (good for non overhead bound benchmarks); add synthetic benchmarks to test the overhead of specific subsystems, running them over and over in a loop microbenchmark style (it was important to jansel that we get a representative set of guards over all the benchmarks though); compute the hypothetical peak speed and compare against that instead of eager (best for operator level, or hard code a specific number to compare against). A bit difficult to prototype new HUD ideas, because no one is actually working on HUD.
  • Dynamo warm start: goal here is to deal with some use cases like (1) your training job was killed and you want to restart it without having to recompile, (2) you have 10k nodes and you don’t want to compile 10k times, (3) you have a custom operator that you want to run fast with PT2, (4) you are working on a model locally and you don’t want to wait to compile every time. Here’s the plan:
    • First, we need to make the compiled output of AOTAutograd serializable (maybe Sam Larsen will do this; it's a logical follow-up from making Inductor's output serializable). This gives you: run Dynamo every time (make this faster!) and then cache hit at AOTAutograd, 100% correct.
    • Next, we add a YOLO cache option, where we assume you didn't change any source code and we only check a limited set of easy-to-serialize guards (probably shape guards) to decide if a cache entry is OK. You have to explicitly ask to load from the cache, and you're expected to manually do cache invalidation yourself. Need to implement side effects like installing globals from Dynamo.
    • Finally (and we can choose not to do this in the end), we can try to do robust, always-on Dynamo caching. Combination of (1) build-system style checking whether the Python files Dynamo traced over have changed, by hashing them (see the sketch at the end of this post), and (2) Dynamo guard serialization (e.g., need to deal with ID_MATCH).
  • We had a User Empathy Day, where a bunch of us took popular OSS models and tried to torch.compile them. For some models, they worked and had speedups, but a lot of models failed. We filed a lot of bugs. Some more details: User Empathy Day - Google Docs
  • Some discussion about DTensor. Pretty interesting work from the external community: [DTensor] Open Source by leonardo0lyj · Pull Request #8 · volcengine/veScale · GitHub. There is some interest in moving the main class of DTensor to C++ to make it possible to use in eager mode without PT2. Wanchao is interested in an MoE use case for DTensor where you have irregular sharding; the theory is that this should be a subclass of the sharding spec, but this needs more investigation.
  • Horace has a proposal called “hierarchical compilation”. The motivating problem is when we have a loop over a basic block, we inline everything and uselessly recompile the same block over and over again. Instead, we want to compile it once and reuse it for all the calls. What makes this difficult is there might be Python state update, e.g., updating a loop counter or appending to a list, which you still need to apply in the outer loop. So intuitively, the idea is to recursively invoke Dynamo on the inner block, producing an opaque graph and some residual bytecode. Then, in the outer Dynamo session, you trace through this graph and bytecode however many times in the loop to update your state accordingly, and you keep going. The primary implementation complication is that at the time you do the inner Dynamo session, you need to directly use the outer Dynamo’s VariableTracker state, because there may have been updates to the Python state that you need to see, while at the same time not actually applying your inner updates to the outer variables until the very end.
  • Francisco presented simple FSDP, a very simple implementation of FSDP using only a custom autograd function for allgather, and a parametrization to allgather parameters before they are used (a rough sketch is at the end of this post). In combination with selective activation checkpointing and a small 150 LOC FSDP-specific optimization pass, they achieve memory usage and performance at parity with the complicated eager implementation. Assuming that you can deal with the compiler stack, this is a great starting point for more complicated userland ideas for scaling.
  • This spawned a discussion about a bigger tension that showed up in several contexts: hedging eager vs all-in compile. Right now, we have a compromise strategy, where we are working on torch.compile, but very much as something you put on top of existing working eager code. Francisco shows us that if you give up on eager mode performance and go all in on PT2, this is a really interesting point on the pareto curve. But on the other hand, we have projects like DTensor and FP8 where the folks we are engaging with cannot use compile, the eager perf needs to be good, and that means we need to write things in C++. I think we're still going to be doing the compromise strategy, but it is definitely worth continuing to ramp up docs and understanding about "green field PT2", as it can be the right choice when there is a champion for PT2 on a project that really benefits from a compiler.
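
A tiny sketch of the build-system style check from the warm start bullet (the helper name is made up): hash the Python sources Dynamo traced over and fold that into the cache key, so any source change invalidates the cached entry.

```python
import hashlib

def traced_sources_key(filenames):
    """Hypothetical helper: a stable digest over the Python files Dynamo traced.
    If any file changes, the key changes and the cached artifact is skipped."""
    h = hashlib.sha256()
    for fn in sorted(filenames):
        h.update(fn.encode())
        with open(fn, "rb") as f:
            h.update(f.read())
    return h.hexdigest()

# e.g. cache_key = (traced_sources_key(dynamo_traced_files), serialized_shape_guards)
```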
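
And a heavily simplified sketch of the simple FSDP idea (this is not Francisco's actual code): a custom autograd function that all-gathers the shard in forward and reduce-scatters the gradient in backward, hooked up via a parametrization so the full weight is materialized whenever the parameter is used.

```python
import torch
import torch.distributed as dist
from torch import nn

class AllGatherParam(torch.autograd.Function):
    @staticmethod
    def forward(ctx, shard):
        world = dist.get_world_size()
        full = shard.new_empty(world * shard.shape[0], *shard.shape[1:])
        dist.all_gather_into_tensor(full, shard)  # unshard for use in forward
        return full

    @staticmethod
    def backward(ctx, grad_full):
        world = dist.get_world_size()
        grad_shard = grad_full.new_empty(grad_full.shape[0] // world, *grad_full.shape[1:])
        dist.reduce_scatter_tensor(grad_shard, grad_full.contiguous())  # shard the grad
        return grad_shard

class Unshard(nn.Module):
    # parametrization: the stored parameter is the local shard, usage sees the full weight
    def forward(self, shard):
        return AllGatherParam.apply(shard)

# usage, assuming linear.weight has already been replaced by its local shard:
# nn.utils.parametrize.register_parametrization(linear, "weight", Unshard(), unsafe=True)
```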

State of PT2: Mar 16, 2024

Core offsite!


State of PT2: Mar 24, 2024

Back to work after two weeks of traveling.

  • We’re going to let PyTorch developers add calls to OpenTelemetry C++ API from inside the PyTorch codebase, to make it easier to add library level instrumentation that you might be interested in when running large training jobs. By default, OSS distribution of PyTorch will no-op all of these telemetry calls, but if you rebuild PyTorch from source you can point them to your favorite observability platform. We’ll also make calls from Python possible once we figure out the dependency situation. Meta only: Redirecting...
  • The AO folks were asking about an NF4 dtype. NF4 is a quantization scheme with 16 values which are normally rather than uniformly distributed around zero. The ask was whether or not it should get a dedicated dtype (like torch.bfloat16) in core. Unlike FP8, it doesn't have direct support in silicon, so it fails the test for core support put out in Supporting new dtypes in PyTorch - Google Docs. However, @cpuhrsch pointed out that not having an actual dtype for the type poses some UX problems for a tensor subclass implementing it. In particular, what does the dtype field of this subclass report? Right now (modeled off of FP8), this dtype field reports what floating point type the quantization scheme is seeking to simulate, typically bfloat16, even though it's actually fp8/nf4 under the hood. But this is very awkward: if I say fp8_tensor.to(torch.bfloat16), this will no-op, because Tensor.to looks at the dtype field before deciding to do the conversion, thinks "oh yes, this is already bfloat16" (because that's what it was lying about), and does nothing. (A minimal sketch of this dtype lie is at the end of this post.) An added complication to the problem is in autograd, where we expect the precision of gradients to match the precision of primals, to the point where autograd will automatically insert conversions to make things match up. So in fact, for a class like FP8Tensor, you want to lie that your dtype is the dtype of your gradient so the autograd engine doesn't reduce the precision of your gradients (which must typically be in higher precision than FP8). You could imagine an extremely BC-breaking change to the autograd engine to not require dtype matching, but this doesn't absolve you of the problem of determining what the precision of your gradients should be… and knowing what the precision of the primals was is very useful information.
  • We discussed hierarchical compilation at composability sync. Check out the very minimal design doc at Hierarchical compilation - Google Docs @anijain2305 and @laitho90 are working on phase 1.
  • William Wen talked about progress to Python 3.12 support at Dynamo team meeting. It seems changes to the eval frame API are the biggest blockers right now. Actually, once we finish 3.12, it’s time to immediately start working on 3.13, which is going beta soon
  • We’ve been working on getting a grip on all of the issues in the GitHub issues tracker. Apologies for any notification spam (and rip my inbox LOL.)
  • Jack Cao's still been working on inlining into NN modules for tracing backwards as a single graph, and one thing he's noticed is that quantization is relying on the NN module structure from Dynamo, so flattening it away is causing those tests to fail. Our current thinking is to keep the legacy behavior unless we see a backward() call, in which case we keep inlining, although it would be nice to go further and switch the behavior unless you explicitly opt into the legacy behavior… need to talk to the AO folks some more about this…
  • Michael Suo is being sent to llama land, wish him the best of luck :wink: . Horace has also been thinking about how to better grapple with large scale training. My personal take on the matter is that you probably want to do specific, purpose-built infrastructure at this scale, and then take the lessons to make generalized infrastructure adjustments.
  • AOTInductor lightweight wrapper from Python is collecting user requirements https://docs.google.com/document/d/1tP_7InSSKQ1zW1HDc2W1juJzIFhlmTPHqcTYLdgpWh0/edit although apparently Purpleberry is going to release something much simpler in the near future.
  • There’s a lot of interest in AOTAutograd recently, which might soon be an air traffic control problem for this part of the code. Brian still has a stack of DTensor fixes that will be landed soon. Todd Fiala is going to look into reducing fixed overheads from AOTAutograd prompted by this post from Jason AOTAutograd has high fixed overheads · Issue #122029 · pytorch/pytorch · GitHub and we also need to implement AOTAutograd level caching to get the next layer of caching on top of Inductor. There’s also interest in getting more hands on AOTAutograd.
  • I spent a chunk of the week working on the plan at Factor meta conversion into real tensor -> serializable metadata -> fake tensor; create fresh ShapeEnv when retracing · Issue #121085 · pytorch/pytorch · GitHub and while I completed the refactor, I am a lot more bearish on the idea of using this to solve problem (3). It's just a huge change to start having a separate ShapeEnv at each layer of the stack, and it doesn't even solve half of the reallocation problem (fake tensor repropagation in Inductor). I dusted off Preserve unbacked SymInt on SymNode by ezyang · Pull Request #120816 · pytorch/pytorch · GitHub to directly fix the thing that induced this; the "direct" fix seems to work OK now.
  • torchrec reported a major milestone in PT2 compatibility. Meta only: Redirecting... The way I'd characterize this work stream is that the "right" kinds of fixes are being landed, we are emerging into the AOTAutograd/Inductor neck of the enablement work, nothing is happening in JT/NJT land, and we are also getting to distributed land (e.g., Ivan ran into a problem with collectives that turned out to be wait being pointed to the wrong implementation). Yifu, when are we unifying the functional collectives?!
  • Notable new bugs:
  • Notable fixes:
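
To make the dtype-lying problem from the NF4 bullet concrete, here is a minimal made-up wrapper subclass in the FP8Tensor style: the packed payload is uint8, but the advertised dtype is the simulated one, which is exactly why Tensor.to(torch.bfloat16) sees nothing to do.

```python
import torch

class NF4Tensor(torch.Tensor):
    """Sketch only: a wrapper subclass that stores packed quantized data but
    advertises the dtype it is simulating (the FP8Tensor-style "lie")."""
    @staticmethod
    def __new__(cls, packed, simulated_dtype=torch.bfloat16):
        # The wrapper's metadata claims simulated_dtype, not uint8.
        return torch.Tensor._make_wrapper_subclass(
            cls, packed.shape, dtype=simulated_dtype, device=packed.device
        )

    def __init__(self, packed, simulated_dtype=torch.bfloat16):
        self.packed = packed  # the real (quantized) storage

t = NF4Tensor(torch.zeros(8, dtype=torch.uint8))
print(t.dtype)  # torch.bfloat16, so t.to(torch.bfloat16) thinks there is nothing to do
```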

State of PT2: Mar 29, 2024

I am on vacation next week, so you get this report early!


State of PT2: Apr 13, 2024

I’m feeling lazy so short report today.


State of PT2: Apr 20, 2024


State of PT2: Apr 28, 2024

  • We had another meeting with the OpenTelemetry crew to figure out what to do next. As written, the integration seems to be good for llama training (and we got some code pointers for some of the existing logging, see Meta only: https://docs.google.com/document/d/1qL7BwL5uK_AS8IqDTgULWiLwG2ssN8_RQsVndDMrqBY/edit ). It seems to be less good for more aggregate counters: James March related to us an abstraction called "wait counter" which doesn't exist in OpenTelemetry's API. Wait counters address the following use case: suppose you want a counter that ticks up while you are blocking on something. The naive way to implement this is to save the start and end time of the blocking operation, and bump the counter with the delta when the operation ends. But if the operation hangs, you will never find out about the time spent blocking, since the end event never occurred. The wait counter maintains a separate thread which continuously ticks up the counter while you are waiting, so this doesn't happen. (A rough sketch is at the end of this post.) James is concerned that we have a lot of pretty good abstractions in our internal stack already, and just doing OpenTelemetry may open the door to people just adding a lot of bad instrumentation. Something to keep an eye on.
  • FlexAttention by Driss and Horace is on main! FlexAttention lets you write custom attention functions with customized pieces like the score mod, without having to write Triton from scratch. PR: ScoreMod API by drisspg · Pull Request #121845 · pytorch/pytorch · GitHub
  • Natalia was telling me about an interesting problem where people want to customize the "gradient accumulation" operation that occurs when you implicitly reuse a variable multiple times. It's similar to how in a linear type system you would have to "dup" a value to use it multiple times, and then you would customize the backwards of the dup (a small sketch is at the end of this post).
  • Animesh was telling me about how we're making good progress on C++ guards: specifically, with C++ guards, we are beating the guard performance we have today, even when NN module guards are enabled (which we currently don't enable, because they are too slow). So this is paving the way for us enabling NN module guards by default. Animesh also pointed out to me that we don't need Jack Cao's inline-through-NN-modules logic, because we can just start inlining through NN modules, and so we only need to know how to inline through torch. Animesh was also telling me about how we're doing a bad job keeping track of internal bugs and issues; perhaps it is a good time to set up a task tracking system for the Q&A group.
  • I spent some time with Brian discussing exactly how we should be handling dynamic sizes and tensor subclasses, related to [test fix] try to re-use cached symnodes across dynamo and AOTAutograd by bdhirsh · Pull Request #120090 · pytorch/pytorch · GitHub . The general shape of the problem is that although all symbolic sizes are explicitly passed as inputs to the FX graph, when AOTAutograd creates a subclass on the fly on the inside of the traced function, it doesn’t necessarily reuse the SymNodes, and this results in “don’t have proxy for SymNode”. So the correct fix appears to be to ensure that when processing subclasses, we must also generate extra inputs as necessary in the input calling convention. Any sizes that are needed to construct the outputs should be int outputs of the compiled graph.
  • We hit another milestone this week: on Flavio's branch, we can successfully run inductor with forwards and backwards on Redirecting... This is thanks to fixing a number of Inductor codegen bugs, as well as some small improvements for backwards support specifically related to unbacked SymInts. I spent some time talking to Ivan Kobzarev about what's next. It looks like the low hanging fruit for symbolic shapes optimizations is all gone, so we are going to have to do some more in depth analysis. There are a few newer problems which Ivan will post to the WG. There's still some stuff to work out regarding streams, although the current plan is to use a simple pipeline to avoid needing to deal with this for now. There are some new paths that want to be enabled but are also suffering from GuardOnDataDependentSymNode problems; Add propagate_real_tensors mode for unbacked by ezyang · Pull Request #125115 · pytorch/pytorch · GitHub should help with this.
  • Alban related to me that MPS wants to support not actually doing anything to tensor data when you say to('mps'), since memory is unified, but this violates semantics, as to() is specified to return a fresh tensor. This is a good use for copy-on-write tensors, which Kurt has been continuously working on.
  • Notable new bugs:
  • Notable fixes:
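
A rough sketch (not the internal API) of the wait counter idea described above: a background thread keeps publishing the elapsed wait time while the blocking call is in flight, so a hang still shows up in the metrics.

```python
import threading
import time

class WaitCounter:
    """Sketch: publishes time-spent-waiting even if the awaited call never returns."""
    def __init__(self):
        self.waiting_seconds = 0.0
        self._started_at = None
        self._lock = threading.Lock()
        threading.Thread(target=self._publisher, daemon=True).start()

    def _publisher(self):
        while True:
            time.sleep(0.05)
            with self._lock:
                if self._started_at is not None:
                    # tick the counter up while we are still blocked
                    self.waiting_seconds = time.monotonic() - self._started_at

    def __enter__(self):
        with self._lock:
            self._started_at = time.monotonic()
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.waiting_seconds = time.monotonic() - self._started_at
            self._started_at = None

# usage: even if the guarded block hangs forever, waiting_seconds keeps growing
wc = WaitCounter()
with wc:
    time.sleep(0.2)  # stand-in for a blocking collective
print(round(wc.waiting_seconds, 1))
```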
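
And a hedged sketch of the gradient accumulation customization Natalia described, phrased the linear-types way: make the reuse explicit as a dup whose backward decides how the two incoming gradients are combined (here, accumulating in float64 purely as an example).

```python
import torch

class Dup(torch.autograd.Function):
    """Sketch: an explicit 'dup' so the usually-implicit gradient accumulation
    for a reused tensor becomes a customizable backward."""
    @staticmethod
    def forward(ctx, x):
        return x.clone(), x.clone()

    @staticmethod
    def backward(ctx, g1, g2):
        # default behavior would be g1 + g2; customize the accumulation here
        return (g1.double() + g2.double()).to(g1.dtype)

x = torch.randn(3, requires_grad=True)
a, b = Dup.apply(x)
(2 * a + 3 * b).sum().backward()
print(x.grad)  # 2 + 3 = 5 in every position, via the customized accumulation
```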

State of PT2: May 4, 2024

  • ASPLOS was this week. According to those who went, the PT2 tutorial was well received. One of the spicier moments was when Jason Ansel had a call for improved benchmarking in academic research… and then the next talk had a bunch of questionable evaluations lol.
  • We had a minor breakthrough regarding all the flaky tests in PyTorch: actually, a lot of them are not actually flaky, they just depend on test ordering. You can use GitHub - asottile/detect-test-pollution: a tool to detect test pollution to create minimal repros when this happens! We've already produced minimal repros for two bugs this way.
  • The OpenTelemetry saga just doesn't want to die LOL. Rajesh Nishtala from AI Infra Training is looking around in our corner of the woods to see how he can help. James March has concluded that he'd much prefer it if we weren't using the OpenTelemetry APIs directly, but had some indirection in front of it which we could route to our internal infra directly. Chirag and I spent some time comparing the OpenTelemetry and fb303 APIs, and the big delta is that modern fb303 is very macro-based to avoid repeated string parsing (whereas OpenTelemetry is much more virtualized / willing to tolerate indirections for ergonomics). James has enlisted Victoria and Andrii to come up with an API; we're interested in knowing how important it is for the API to be macro-based. I plan on paying some more attention to this.
  • Some stuff I learned from 1:1s:
    • Horace feels he understands Spark now. The innovation of Spark over traditional Hadoop is that in Hadoop, you had to put everything back to disk after every step. Spark lets you keep it in memory, but now you have a fault tolerance problem, which you solve with provenance. Horace is thinking that ubiquitous provenance is a good primitive for systems to offer researchers, and then researchers can build their parallelization strategies on it. A big crux of the problem is finding a good split between systems and research, so that the systems team has a leverage point, while researchers can figure out what they need.
    • James's update on AOTAutograd caching, as told by Brian: the big finding is that it's not so easy to cache the FX graph produced by AOTAutograd, because it contains stuff in the meta dict that is not easily serializable. There's some work on fake tensor serialization that could help here (at minimum, you must have accurate alias information, as Inductor depends on it; also, symbolic shapes will be difficult), but my personal perspective is it's better to go straight to a consolidated cache for everything. We'll see what James ends up doing.
  • Unbacked SymInts!
    • What I’m most worried about this week is Ivan’s latest set of issues involving repeatedly re-viewing tensors between 1d and 2d, for non-variable batch codepath. They seem to be not so easy to resolve, in part because it’s hard to crisply define what exactly needs to work and what doesn’t. Symbolic shapes unable to reason: Ne(Mod(u0*u2 + u1*u2, u0 + u1), 0) · Issue #125307 · pytorch/pytorch · GitHub
    • Ivan’s other issue is a stack overflow that happens when you have too many chained operations on ints Redirecting... . The root cause of this is that we trace int operations into the FX graph lazily, but this means you can end up with an arbitrarily large chain of thunks that then blows your stack. The current idea for fixing this is to only do lazy tracing inside meta functions; for regular tracing, don’t do it. Ivan to see if this works.
    • Greg had a really good idea, so now I’m working on some data dependent shapes puzzlers. The idea is they’ll be designed to help you speedrun the last year of data dependent learnings we’ve done and quickly get up to speed. I’ve only written two so far though, busy busy.
    • Shaz Qadeer is exploring the data dependent shapes area. So expect some stuff from him soon!
    • Propagate real tensor is here Add propagate_real_tensors mode for unbacked by ezyang · Pull Request #125115 · pytorch/pytorch · GitHub
    • We discussed unbacked SymInts at PT2 export again, mostly rehashing some points about proper usage of propagate real tensor, and using locals of frames to give better information about what size variables mean. Avik had a really good suggestion that propagate real tensor can generate deferred runtime asserts, I should add this.
  • There was an interesting SEV that occurred when someone fixed a meta function, causing a large amount of graph breaks. The root cause was that the newly added meta function forced specialization, so the frame in question started getting recompiled and hit the cache limit, and this caused cascading problems for inner frames.
  • tlparse 0.3.17 released, the big change is color coded compile ids in the stack trie. Meta only: Redirecting...
  • It turns out we spend a lot of time generating stack trace strings in Meta prod for exceptions that don't end up showing those stack traces to users. Some patches to make this generation lazy landed. It was more involved than it seemed, because it's important to ensure that the lazy generation process is thread safe. (A small sketch of the idea is at the end of this post.)
  • Notable bugs:
  • Notable fixes:
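
A small sketch (not the actual patch) of the lazy stack trace idea: capture the raw frames cheaply up front, but only render the string if something actually asks for it, with a lock so two threads don't race on the one-time formatting.

```python
import threading
import traceback

class LazyStackTrace:
    """Sketch: defer the expensive string formatting until someone reads it."""
    def __init__(self):
        self._frames = traceback.extract_stack()  # capture now, format later
        self._formatted = None
        self._lock = threading.Lock()

    def __str__(self):
        if self._formatted is None:
            with self._lock:
                if self._formatted is None:  # double-checked so formatting happens once
                    self._formatted = "".join(traceback.format_list(self._frames))
        return self._formatted

trace = LazyStackTrace()   # cheap if the string is never needed
print(str(trace)[:60])     # formatting only happens here
```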

State of PT2: May 12, 2024

  • Data dependent shapes
  • Horace was recently thinking about whether or not we really need data dependent shapes. This was prompted by some discussions he’d been having with some researchers who were doing some MoE style stuff with DtoH syncs (aka some sort of data dependent routing)… and it was just too expensive and they were eventually going to end up just rewriting their model so that it didn’t need those syncs. We talked about this a few times over the week.
    • First, Horace was curious whether or not data dependent shapes was really needed in the recommender system use case. Here, the dynamism arises from sparsity in features: we are given some inputs and we need to partition them for model parallelism, but there may be imbalance in how many features we send to each partition. Horace was curious whether or not we knew some sort of maximum partition size, so we could simply pad out or employ some sort of device-side sparsity on the other end. Dennis said that yes, in principle, you could OOM due to imbalance, but it never actually happens in practice, and people don't want to pad: it uses more memory and makes things slower. The extra memory is not technically a problem, but people evaluate modeling changes against memory usage (even if there is headroom), so they will get annoyed at you if you unnecessarily increase the memory usage of your model. Additionally, truncation is a big no no, since that means numerical difference and that means you have to do full on evaluation (as opposed to numerically identical optimizations). And yes, the D2H syncs aka comms are the main issue, but this is sort of understood and minimized to a single all2all communicating the splits. And we know the comms are very exposed, and a lot of the optimization around sparse arch is making sure you aren't screwed over by comms.
    • So then we talked a bit about block sparsity, which is the most recent tool Horace has been looking to use in situations where you wanted data dependent symbolic shapes. Horace brought up two: block sparse attention, and block sparsity to implement MoE a la MegaBlocks (https://arxiv.org/pdf/2211.15841). There is also some data-dependent modeling going on here too: you can implement MoE by performing a boolean mask to extract out the tokens that should go to a particular expert, but this incurs a DtoH sync. Instead, you can avoid syncs altogether by permuting the input tokens, grouping the tokens that should go to various experts together (see the sketch at the end of this post). MegaBlocks shows you can do this without dropping or padding, simply by designing a block sparse matrix multiply kernel that can take a permutation with uneven assignments (padded up to some reasonable multiple of tiling) and maintain occupancy. Now, there are no data dependent shapes at all: simply an irregularly sparse data structure and some custom kernels. So what does this mean for nonzero/boolean masking? My intuition here is that folks are always welcome to write their models in a careful block sparse way, but for the rest of us, we are hoping for some sort of higher level UX which can compile down to this. What does this UX look like? At least for the MoE case, it seems like some sort of vmap on a jagged data structure (where the jaggedness represents imbalance) could be one possible way to represent this.
  • Did you know that non_blocking device transfer doesn’t work in Inductor? We just… don’t do anything about it. It should likely be handled similarly to communication ops.
  • Jason Ansel brought up a spicy topic: is functionalization more trouble than it's worth? This was precipitated by set_ functionalization taking a long time. In particular, Jason pointed out that Inductor largely does understand how to deal with mutation. We discussed this a bit: it's probably still worth functionalizing most things by default, but for a certain limited subset of things, we may want to consider permitting mutating operations to propagate past AOTAutograd, and fix up all our FX passes to treat them as strict barriers.
  • AI Infra is working on a higher level observability library (which could interface on top of OpenTelemetry), that will have some higher level abstractions that we’ve found quite useful in our data preprocessing stack! More Meta only context: Redirecting...
  • H2 planning is on the horizon. Gregory is interested in working out how we could avoid putting too many commitments on our roadmap.
  • Notable bugs:
  • Notable fixes:
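
As promised in the MegaBlocks bullet, a tiny sketch of why the permutation trick avoids the DtoH sync that boolean masking forces: sorting by expert id keeps every intermediate at a static size, and the per-expert counts stay on device for a grouped/block-sparse kernel to consume. All names and shapes here are made up.

```python
import torch

tokens = torch.randn(8, 16)               # (num_tokens, hidden)
assignment = torch.randint(0, 4, (8,))    # expert id per token

# Boolean masking (tokens[assignment == e]) has a data-dependent output size,
# which forces a device-to-host sync just to allocate the result.

# The permutation formulation keeps everything statically shaped:
order = torch.argsort(assignment)         # permutation grouping tokens by expert
grouped = tokens[order]                   # expert 0's tokens, then expert 1's, ...
counts = torch.bincount(assignment, minlength=4)  # per-expert sizes, stays on device

# A grouped / block-sparse matmul kernel would consume (grouped, counts) directly;
# the inverse permutation scatters results back to the original token order.
restored = torch.empty_like(grouped)
restored[order] = grouped
```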