This is a little more than two weeks’ worth of updates, covering PSC week, Edward on vacation, and the July 4th holiday.
Dynamic shapes by default has landed. To be clear, this is “automatically enable dynamic shapes if recompiling due to size changes.” Most models running PT2 should not see any difference, as they are static already. If your model has dynamism, expect dramatically lower compilation times at the cost of some E2E performance. There may be performance regressions; please file bugs if you encounter any. You can use TORCH_LOGS=dynamic to diagnose whether dynamic shapes is doing something. Check also the Meta-only post.
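To make the new behavior concrete, here is a minimal sketch (my own toy example, not from the release notes) of what “automatic dynamic” looks like from the user side, together with the TORCH_LOGS incantation mentioned above:

```python
# Run as `TORCH_LOGS=dynamic python script.py` to see the dynamic shapes logs.
import torch

@torch.compile
def f(x):
    return x * 2

f(torch.randn(4))   # first call: compiled with static shapes
f(torch.randn(8))   # size changed: recompiles, this time with a dynamic dimension
f(torch.randn(16))  # served by the dynamic compilation; no further recompile expected
```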
Internal telemetry for dynamic shapes. Add signpost_event to dynamic_shapes adds a hook which we use internally to record all uses of dynamic shapes. You can check whether dynamic shapes was actually used by checking whether free_symbols is non-zero.
Roadmap review for H2 2023. We had roadmap review for PyTorch teams last week. Dynamic shapes’ presence on the roadmaps looks like this: (1) we have a bunch of internal enablement plans which require dynamic shapes to be well supported, so make sure we are on point here (Meta only), (2) we’re really interested in getting good inference performance on LLMs comparable to SOTA, e.g., llama (there are some kv-cache / CUDA graphs pieces here), (3) there’s still jagged/nested tensor work to do. On a more atomic level, the infra investments that dynamic shapes need to make are probably (a) two-level guards for backwards shape guards, (b) improved accuracy/compile time debugging tools, (c) more aggressive symbolic reasoning enabled by translation validation, (d) obvious Inductor compilation perf improvements, e.g., from split reductions, (e) unbacked integers for eager mode. I’d also like to finally get vision_maskrcnn and detectron2 working on PT2, but LLMs take priority over this.
Which operators specialize their inputs? In the old days, dynamic shapes enablement would typically fail because of missing meta functions. These days, things usually don’t fail, but you may end up having specialized and recompiling anyway. @anijain2305 has been working on sweeping operators to find out which arguments get specialized, to help folks have a better understanding of what will be dynamic versus not.
vision_maskrcnn went back to failing; seems flaky.
eca_botnext26ts_256 and mobilevit_s timed out due to translation validation being enabled. #104654 fixed it (the fix will be visible in the next perf run). The compilation time increase also appears to be due to TV.
Dynamic shapes now support mode=“reduce-overhead” (CUDA graphs). Conventional wisdom was that dynamic shapes are incompatible with CUDA graphs, because any given CUDA graph recording can only work for a single static shape, and CUDA graphs’ requirement of hard-coded memory addresses means that each CUDA graph takes up quite a lot of CUDA memory. However, this conventional wisdom is wrong: (1) multiple CUDA graphs can share the same memory pool, as long as you don’t have any live tensors from one pool to the next (this is precisely what CUDA graph trees by @eellison implements), and (2) recording a CUDA graph is much, much cheaper than running the entire PT2 compilation stack, so it is profitable to compile a dynamic program once and then CUDA graph it multiple times. Enable cuda graphs for dynamic shapes by ezyang · Pull Request #105064 · pytorch/pytorch · GitHub realizes these gains and switches our dynamic shapes benchmark configuration to use CUDA graphs, resulting in hefty performance gains with only a modest increase in compile time. Importantly, these benchmarks cover our _generate inference benchmarks, which actually make use of multiple sizes as sequence length varies. There’s more to be done here: our memory usage for this use case can be suboptimal, because the caching allocator doesn’t know that it’s OK to waste space for small allocations by fitting them inside larger allocations for a larger dynamic size. We also observed that folks using this CUDA graphs trick tend not to generate CUDA graphs for every size, but instead prefer to linearly sample sizes and pad; we should make it easier to do this (perhaps with a padding tensor subclass). One cool result is a 6x performance improvement on cm3leon, a newly announced multi-modal model from Meta AI.
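As a rough illustration of the workflow this enables (a toy module of my own, not one of the benchmarks): compile once with dynamic shapes, and let CUDA graph trees record one graph per size actually observed.

```python
import torch

mod = torch.nn.Linear(768, 768).cuda()
opt = torch.compile(mod, mode="reduce-overhead", dynamic=True)

for seq_len in (128, 256, 512):
    x = torch.randn(seq_len, 768, device="cuda")
    opt(x)  # one PT2 compile total; each new size only pays a (cheap) CUDA graph recording
```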
New API: torch._dynamo.maybe_mark_dynamic. Add torch._dynamo.maybe_mark_dynamic lets you suggest that we should try compiling a tensor dynamically, but doesn’t raise an error if it gets specialized (unlike mark_dynamic).
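A minimal usage sketch (the call pattern is my reading of the PR; treat the details as illustrative):

```python
import torch
from torch._dynamo import mark_dynamic, maybe_mark_dynamic

@torch.compile
def f(x):
    return x.sum()

x = torch.randn(8)
maybe_mark_dynamic(x, 0)  # suggestion: compile dim 0 dynamically; silently specializes if it must
# mark_dynamic(x, 0)      # by contrast, this errors out if dim 0 ends up getting specialized
f(x)
```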
Infer valid input sizes from programs. Horace has wanted this for some time, and with Yukio’s recent Z3 translation validation work landed, it turned out to be pretty easy to write a PoC to exhaustively search the space of valid inputs, using guards to turn us away from portions of the space we’ve seen before. Check it out at dinfer.py · GitHub. If anyone is interested in productionizing this, it would be a neat little project to (1) put this code in PyTorch and put a nicer API on it (note that as written, you have to specify the input dimensions and dtypes of input tensors, so you’ll need to figure out a good way of specifying or inferring this info), (2) improve the solving code to minimize the generated sizes for an equivalence class, and (3) use it for something cool; e.g., you could use it to automatically generate sample inputs for OpInfo tests. Tag me (@ezyang) as reviewer if you send a PR!
lit_llama has finally landed in torchbench: https://github.com/pytorch/benchmark/pull/1730. At time of writing this model is in canary models because the weight download is a little flaky. This is the only 7B model in our benchmark suite, and there’s a bit of pain associated with this; for example, we can’t run accuracy tests on this model, because accuracy tests are implemented by holding two copies of the model in memory, which we can’t do at 7B parameters.
Notable bug fixes.
Transmute refined SymInt into int makes it more likely you’ll get an int rather than a SymInt if the SymInt got specialized into a constant. This sometimes caused some bugs with downstream components that can handle SymInt but choke on int.
Inductor backend for CPU inference extremely slow - this bug actually seems to be fixed on main thanks to dynamic shapes. Moral of the story: if you want dynamic shapes, use a nightly! We have soooo many improvements.
Now passing: hf_Longformer (this used to fail with ValueError: Cannot view a tensor with shape torch.Size([4, 12, 1024, 513]) and strides (6303744, 513, 6156, 1) as a tensor with shape (48, 4, 256, 513); the fix is thanks to Brian Hirsh finally landing his AOTAutograd longformer fix), vision_maskrcnn (flaky), eca_botnext26ts_256 and mobilevit_s (used to timeout; maybe the speedup from CUDA graphs was enough to get it under the timeout again)
Now failing: DebertaV2ForQuestionAnswering (failing accuracy due to cudagraphs, failing on inductor_with_cudagraphs too), cait_m36_384 (OOMing on accuracy due to increased CUDA graph memory usage)
Speedups: The majority of our speedups are due to the enablement of CUDA graphs for dynamic shapes. Some notable models and their speedups: BERT_pytorch (1.7698 → 3.3071), hf_GPT2 (1.7728 → 2.0056), basic_gnn_gin (1.3151 → 2.4841). The improvements on HF and TIMM models are much more modest since these are not super overhead bound models. Note that these numbers are still behind inductor_with_cudagraphs, because we are still losing some optimizations from running the PT2 compiler stack without static shapes.
Slowdowns: dlrm (infra failure due to cudagraphs, failing on inductor_with_cudagraphs too), hf_T5 (2.0252 → 1.8939, oddly enough–could this be due to memory pressure? But even more weirdly, hf_T5_large improved perf)
Comptime/Memory: By and large compilation time did not increase, but for our training setup this is expected, as we only actually run at one batch size, so you are simply measuring the cost of a single CUDA graph recording. As expected, memory compression ratio gets worse, due to the standing allocation from CUDA graphs.
Speedups: torchbench numbers are actually a huge mixed bag. Here are some of the wins: BERT_pytorch (2.2317 → 2.4529), basic_gnn_edgecnn (1.7809 → 1.8732) (note that for some reason many of the GNN variants are failing performance on inference, but not accuracy), cm3leon_generate (1.3037 → 5.7822, WOW! This is consistent with some perf analysis Armen and I did months ago, where I concluded that cm3leon was hella overhead bound), hf_T5_generate (2.2619 → 8.2081), hf_T5_large (3.1690 → 5.1747)
Slowdowns: A lot more models did worse with CUDA graphs enabled, including LearningToPaint (1.9209 → 1.6812), resnet18 (1.7779 → 1.4028), shufflenet_v2_x1_0 (1.9882 → 1.6010), squeezenet1_1 (1.8625 → 1.0040), yolov3 (2.0997 → 1.8843). It’s not entirely clear what’s going on here, but we will note that there was a sizable dip in CUDA graphs performance without dynamic shapes this week on torchbench, too. There is an across-the-board performance regression on TIMM models (and a slight regression on HuggingFace too).
Comptime/Memory: Comptime generally got worse across the board, but not too much worse. Particularly notable are the generate models: hf_T5_generate (881 → 1032), cm3leon_generate (131 → 203). CUDA graphs is not free, but given that we’re running at much more than two sequence lengths, you can see the bulk of the compile cost is the PT2 stack. For the most part, memory usage stayed fairly stable, interestingly enough.
What’s next?
I think I want to investigate the memory planning situation with CUDA graphs a bit more; I also think it’s a good time to teach Inductor how to deal with data-dependent ops (without having to graph break on them.)
Whole model compilation for sparse architecture in recommendation models. @anijain2305 has been looking at improving the ability to slap torch.compile on an arbitrary function and have it just work. One of the more challenging situations is when we try to compile the sparse architecture of recommendation models; e.g., code that interacts with [torchrec.sparse](https://github.com/pytorch/torchrec/tree/main/torchrec/sparse). In one example, a KeyedJaggedTensor is being compiled, but it contains a list of 500 integers, each of which varies over time and participates in many guards. This is a worst-case scenario for dynamic shapes compile time. However, we are also running into lots of graph breaks, which are resulting in us trying to compile smaller fragments than we should. There will be a mix of fixing graph breaks (some of them are due to data-dependent output size operators like nonzero–time to fix this!) and otherwise figuring out what else needs to be done.
CUDA graphs memory planning is lower priority for now (@eellison may take a look, but higher priority is actually being able to turn on CUDA graphs in prod situations; a big problem here is when we fail to compile the entire extent of the model, causing CUDA graphs to increase overall memory usage.) It looks like we definitely need data-dependent op support in inductor though, based on sparse arch investigation.
Data-dependent shape support in Inductor. I got an end-to-end PoC of a pointwise and then a reduction with hacks working in Inductor: gist:1293a41299604c44310341b7540eabcb · GitHub. The main gaps: (1) optional optimizations failing to retrieve hints (Triton size hints (pick 8192 to prevent the block size from shrinking), multiple-of-16 hints (pick something not a multiple of 16), 32-bit indexing), (2) buffer reuse (keying on the str rep is fine, use sympy_str), (3) updating wrapper codegen to create bindings to i0 variables. In general, it seems it’s pretty useful to have accurate maximum size information, for which ValueRanges is an incomplete fix because we don’t support symbols (s0) in bounds. Another trick we plan to implement is a special irrefutable guard: if we guard on an unbacked symint, we instead just assume it is true and add a runtime assertion. One question is whether or not we can always get dynamic shapes working no matter what. It seems that in Inductor, we usually can just turn off optimizations to avoid guards. So it seems we just need to get host-side torch.cond working to handle everything else. Some fixes for these are in: If we can’t statically prove 32-bit indexing OK, only add guard if hint exists, Provide a refined upper bound for nonzero when original numel is static
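For orientation, this is roughly the kind of user program the PoC targets: a pointwise op followed by a reduction over a data-dependent size. The config flag is my assumption about what lets nonzero through Dynamo without a graph break; as of this update the Inductor side still needs the hacks described above.

```python
import torch
import torch._dynamo

# Assumed flag: lets data-dependent output shapes (e.g. from nonzero) through Dynamo.
torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile(fullgraph=True)
def f(x):
    nz = torch.nonzero(x).squeeze(-1)  # output length is an unbacked SymInt (i0)
    return (x[nz] * 2).sum()           # pointwise, then a reduction over that unbacked size

f(torch.randn(1024))
```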
An initial plan for KeyedJaggedTensor. After studying some of the models that use KJT and trying to get export working on them, here are some of the initial findings:
You can remove the list of integers from KJT before tracing a model, which will cause the model to perform a data-dependent access to populate these integers as unbacked integers. However, when we try to use these integers to do a tensor_split, we immediately hit guards we cannot prove (see the sketch after this list). The guards should be provable via sum(lengths) == values.shape[0], but our symbolic reasoning is not strong enough. These guards are for errors, so they should be bypassable by irrefutable guards (guards which, if they fail, imply you would have errored anyway; in this case you can convert the guard into a runtime test). This is worth pursuing further. In any case, you can expect to have 500 unbacked symints, so symbolic reasoning must be fast enough to deal with them.
If you don’t remove the list of integers, you need some way to prevent them from 0/1 specializing. In export, you can simply require every sparse feature to be populated to size 2 and hope it generalizes to 0/1. In eager, we probably will just have to specialize KJT to treat these integers specially. The big benefit of this strategy is that you’re not hard-blocked on guards on unbacked SymInts, since there’s always a hint; you don’t need any sum(lengths) reasoning since guards are discharged by checking the underlying values. We cannot actually do this in export because export does not support SymInt inputs–I plan to fix this.
Export with KJTs doesn’t work because KJTs are not a supported input. The direct fix is Add pytree support to KeyedJaggedTensor by ezyang · Pull Request #1287 · pytorch/torchrec · GitHub; the indirect fix is rewriting the export calling convention from pytree specs to a dictionary mapping “FQN” (really Source.name()) to Tensor. In that case, to pass a KJT named id_list_features, you would actually pass three tensors, id_list_features._values, etc.
More details at Meta-only doc (sorry, non-public due to details about Meta prod models).
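Here is a toy, hypothetical version of the tensor_split problem from the first finding above (the configs and names are illustrative, and this is heavily simplified from a real KJT):

```python
import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True  # assumed: lets .tolist() produce unbacked SymInts

@torch.compile(fullgraph=True)
def split_features(values, lengths):
    sizes = lengths.tolist()           # data-dependent: each entry becomes an unbacked SymInt
    return torch.split(values, sizes)  # needs a guard like sum(sizes) == values.shape[0]

values, lengths = torch.randn(6), torch.tensor([2, 3, 1])
# At the time of writing this is expected to trip an unprovable guard on the unbacked SymInts;
# the irrefutable-guard idea turns that guard into a runtime assert instead.
split_features(values, lengths)
```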
Export for QAT. QAT wants to do whole-graph transformations on a pre-autograd FX graph. Export sort of supports this with pre_dispatch export. What is likely going to happen is that this turns into the IR format that export is going to use. Pre-autograd functionalization is unlikely to happen; you only get some mild normalization. It’s still unresolved how to work this into the overall QAT workflow API, since export isn’t really keen on exposing this mid-point IR (which is kind of incoherent).
More on KJT/torchrec. I had a nice discussion with Dennis van der Staay about torchrec and work on sparse arch. Some new information: (1) this workstream is almost certainly going to involve distributed later, because applying PT2 to post-torchrec sharded models is going to involve tracing past communication primitives–this also implies I’m going to want to get FakePG working on torchrec, (2) working on unit tests should be a pretty good idea, but there’s still some basic infra work to do (laid out last week), (3) we’re not really expecting concrete performance improvements, as sparse arch is typically going to be communication bound, so this is mostly a “we think this is promising, and the investment is not too big, because we’ve already done so much with dynamic shapes so far” situation.
Pre-dispatch export. We’ve agreed to allow QAT to short-term publish a new export interface that produces a pre-dispatch FX graph with ATen operators which is suitable for graph transformations and training. The long term goal will be to have pre-dispatch functionalization, which is the invariant the export team wants, to allow this to be worked into torch.export proper. Pre-dispatch will generate an ExportedModule so that the APIs match.
Fake export. Export now supports exporting entirely fake modules/inputs. This means that to export a model you don’t have to actually load its weights into memory; you can load it in a fake mode and still export it. As a result, we have some delicate code in Dynamo for dealing with two concurrent fake modes (but it’s not so bad: the outer fake mode is typically disabled while we do Dynamo analysis). Only ONNX supports torch.load’ing models in fake mode at the moment.
Improved user stacks in Dynamo. torch._guards.TracingContext.extract_stack() now always accurately reports a user stack from anywhere in Dynamo, and we reliably use it for reporting real stacks for exceptions (previously, they used an entirely different mechanism).
Improved error messages for non-local inputs in export. See Improve error message when export encounters non-local input for the details. This isn’t complete; the follow-through is to also make this work for outputs, and to work a little harder with the pytree representation (probably this week).
Dynamo change in attitude. Many folks are concerned that Dynamo is just “endless” bugs. I pitched Animesh and Voz on a new attitude to fixing Dynamo bugs, which is that we should imagine the platonic ideal implementation of Dynamo as a faithful reimplementation of CPython in Python. Then, fixing a bug should not just be moving code around to fix a particular problem, but instead improving the local vicinity of the code to bring it closer in line with this ideal. An example I used a lot when explaining this was dict.keys support (the bug fix is changing its type from tuple to set; the real fix is to accurately model dict views). To do this well, you need to regularly look at CPython code, and Dynamo may need to grow some new abstractions (perhaps a proper implementation of Python’s object model, or Python traceable polyfills).
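For example, here is plain CPython behavior that a tuple- or set-based model of dict.keys gets wrong; a faithful model has to treat views as live, set-like objects:

```python
d = {"a": 1}
keys = d.keys()
d["b"] = 2
print(list(keys))          # ['a', 'b'] -- the view reflects later mutation
print(keys & {"a"})        # {'a'}      -- set-like operations work on views
print(keys == {"a", "b"})  # True
```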
The big perf increase in torchbench is due to maml getting removed from the benchmark set (it slows down a lot under PT2 and was depressing the score). clip, hf_Whisper, and llama_v2 are new models added thanks to @msaroufim!
What’s next?
There are a lot of things that need doing:
Finish overhauling export input/output pytree matching (probably not dumping the pytree in/out spec, but I think if we tree_map into positional identifiers we can reliably detect KJT missing situations)
I’m trying something a little different, expanding the update to cover a wider variety of topics beyond dynamic shapes, mostly centered around things that I personally have involvement in (this is a lot of things, so you should be getting pretty good coverage this way!)
Benchmarking
Inductor CI/perf is upgraded to CUDA 12 / gcc 9. This doesn’t seem to have any appreciable effect on perf, but we did it so we could do the next item.
torchrec_dlrm is back. They were disabled a few months ago because of fbgemm nightly related flakiness. The flakiness has been resolved by building fbgemm/torchrec from source in the Docker image. These are now installed as part of the general torchbench installs, and should help some of the work we are doing on jagged tensors (since many important operators are currently implemented in fbgemm).
Algorithmic efficiency. Frank Schneider posted about how PyTorch was slower than JAX in their upcoming algorithmic-efficiency benchmark suite. A bunch of us, spearheaded by @msaroufim, jumped in to take a look at what was going on. Status updates at https://docs.google.com/document/d/1okqKS32b0EhWQSFFoSV6IjGlYM4VhNYdxBPjdlFIw5w/edit (Meta-only). I personally have an interest in the dlrm side of things, since I’ve been working on sparse arch recently; after fixing some mild bugs, I was able to show parity on criteo1tb dlrm between PyTorch nightly and JAX on an A100x8 (PyTorch score: 7703.403180360794, JAX score: 7703.041719198227), although the number of evals varied, so I’m not sure if this is a threat to validity. Unfortunately, this does not necessarily help their problem, which was an OOM. To make further progress on this, we may need some tools to help us understand why torch.compile memory usage is higher.
Export
Pre-dispatch export, part 2. We had more discussion about pre-dispatch export in the Friday export meeting. @suo in particular was arguing that from a frontend perspective, it would make more sense to export pre-dispatch IR by default, and have the further post-dispatch lowerings be an extra pass on top that is opt-in by backends. One of the identified barriers to doing this is pre-dispatch functionalization; the other is non-differentiable decomps. nkaretnikov is going to take a look at core_aten_decompositions to see which of these are differentiable and which are not. In other news, torch.export is going platinum: Expose torch.export() API by gmagogsfm · Pull Request #106904 · pytorch/pytorch · GitHub
Tracing FSDP. @voz wrote a post (Meta-only) about the state of tracing FSDP in Dynamo. The key info is that, on a branch, he can trace everything through and get identical results on a single forward-backward to eager. There are a lot of fixes that need to land to main; from his post:
The value of various small changes to FSDP to make this work vs adding fixes in Dynamo (pretty easy; we prefer Dynamo of course, but for some mostly no-op shuffling we do FSDP as well)
TypedStorage - is it tensor-like/tensor-associated enough to go in the graph? Do we need to add some ops for doing tensor typed storage data ptr comparison / checking free, etc?
Working through the cudastream story, in particular around wait_stream and such
Lots of little bug fixes here and there
Coverage for missing comparisons, bytecode ops, general coverage gaps like attr access on FSDP modules, setting data on a tensor, etc.
pytrees slow again for DTensor. Junjie and Rodrigo have been trying to improve DTensor’s eager perf, and we spent the first half of composability sync talking about it. Rodrigo had a hack to pre-compile pytree applications into Python code, but apparently this doesn’t help that much: gist:5427cabfab6421d4e104905345f94a50 · GitHub. Another suggestion from the meeting was that after Brian’s subclass support lands, maybe you could torch.compile each op individually with backend=“eager”.
Data-dependent all2all. Will Feng got all2all collective working in inductor https://github.com/pytorch/pytorch/pull/106655/ This is notable because all2all collective has data-dependent output shape. It looks like unbacked symints worked here!
Custom ops
Custom ops. Richard tells me he is going to add a class-based API for custom ops, to make it easier to define everything all in one place. More on this soon I assume!
SkolemSymNodeImpl. @jw3468 is going to make size() work on jagged tensor by introducing a new concept to SymInt, provisionally called SkolemSymNodeImpl. This is a special SymInt which is not symbolic (it can show up in eager mode) but only compares equal to itself (aka it is a skolem variable). We will use this to represent jagged dimensions. All jagged tensors that have the same offsets tensor get assigned the same skolem variable; if you have different offsets tensors, you can’t add them together because their skolem variables don’t match. More details at https://docs.google.com/document/d/1e-R_818YA4VlVTlozu5eyzRIV6TzyvSPDm9DMEw_4xg/edit (Meta-only)
Time to get rid of functional VariableTracker? VariableTracker in Dynamo is an immutable data structure: when a mutation happens, you allocate a fresh VariableTracker and then replace old VariableTrackers with the new one. This is because we have checkpointing functionality that is used to rewind to old VariableTrackers. However, this is a bit of a pain from the modeling side, as every Python data structure has to be reimplemented to have purely functional operations. An alternate design is to allow direct mutation of VariableTrackers. To do checkpoints, we simply restart Dynamo analysis to “go back in time”, stopping execution at the point where we would have checkpointed (a deepcopy could also work, though I’m not a fan). Speculate subgraph would be implemented by simply denying all mutations or doing some crazy thermometer continuation thing. This would help make Dynamo more metacircular and reduce the work needed to support new container types, of which we often need to support a lot.
Dynamic shapes
expect_true irrefutable guards. I talked through this in the last 20 minutes of composability sync. Check out https://github.com/pytorch/pytorch/pull/106720; this is enough to make splits on unbacked SymInts work.
Boolean masking, at last. @yanboliang is looking into a pre-autograd FX transform that replaces boolean mask updates with torch.where calls. One annoying detail is how to deal with Dynamo tracing the boolean masks in the first place, when Inductor can’t deal with boolean masks that can’t be eliminated. Our idea, in lieu of fixing Inductor to work with data-dependent shapes (which we are working on), is to attempt to eliminate all data-dependent ops in a pre-dispatch pass, and if that is not possible, restart Dynamo analysis saying “you need to graph break on this op next time.”
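The rewrite in question, written out by hand on a toy example (this shows the shape of the transform, not the actual pass):

```python
import torch

def before(x, mask):
    y = x.clone()
    y[mask] = 0.0        # data-dependent boolean mask update
    return y

def after(x, mask):
    # shape-preserving equivalent that avoids the data-dependent indexing
    return torch.where(mask, torch.zeros_like(x), x)
```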
Notable fixes.
SymInt’ify tile. This one is needed for the algorithmic-efficiency criteo1tb dlrm.
Some accuracy regressions. torchbench: hf_BigBird, vision_maskrcnn (flaky). It’s not clear what broke hf_BigBird; possibly the CUDA 12 upgrade. Need to investigate. AlbertForQuestionAnswering improved accuracy!
The huge perf improvement across the board is thanks to Peter Bell’s work https://github.com/pytorch/pytorch/pull/106747 optimizing split reductions. This is not fully runtime split reductions: instead, Peter uses whatever the hint was at the time we compiled to plan the split reduction, and then we use that plan for all subsequent runs. This makes it more important to warm up Inductor with the “right” size hint to start; see also Padded tensor subclass · Issue #105325 · pytorch/pytorch · GitHub. Another user also complained about other cases where we made suboptimal decisions when the first kernel we compiled with wasn’t representative.
A lot of change on the last day; some improvements and some regressions (but mostly regressions). Maybe related to the CUDA 12 update; need to check. hf_BigBird is also failing here. RobertaForQuestionAnswering is now failing accuracy.
Trying to understand how to read PT2 logs? Check out Logging docs - Google Docs for some quick orientation. (In other news, guards logging this week has been improved to show you which line of user code caused a guard to be added! Take a look and don’t be afraid to give feedback on it.)
Have you ever wanted to store lots of tracebacks for logging/debugging purposes, but were afraid to do so by default because it might be too expensive? There is a new class in torch.utils._traceback called CapturedTraceback which makes it extremely fast to save a Python traceback (something like 20x faster than running a full traceback.extract_stack()), so it should change the calculation about whether or not you are willing to store tracebacks by default. We have already used this to keep fine-grained information about guard provenance, to start. Note that CapturedTraceback DOES hold references to code objects, and these references can cause cycles (because of co_extra), so good practice is to make sure you clear these traces once you know you no longer need them.
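A rough usage sketch based on the description above (the exact method names are my assumption, so treat them as illustrative):

```python
from torch.utils._traceback import CapturedTraceback

# Capture is cheap; symbolization is deferred until you actually need to print.
tb = CapturedTraceback.extract()
# ... much later, only if something goes wrong and you need the report:
print("".join(tb.format()))
# Once you no longer need it, drop the reference so the code objects it holds
# (via co_extra) can't participate in uncollectable cycles.
del tb
```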
I spent some time debugging reference cycles this week (due to CapturedTraceback), and Alban pointed me at objgraph for visualizing references/referents. It’s pretty cool, you should definitely check it out.
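For example (objgraph is a third-party package, and rendering needs graphviz; the toy cycle here is my own):

```python
import objgraph  # pip install objgraph

class Node:
    pass

a, b = Node(), Node()
a.other, b.other = b, a  # a simple reference cycle, like the ones described above
objgraph.show_backrefs([a], max_depth=3, filename="backrefs.png")  # draw who keeps `a` alive
```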
We are continuing to do a terrible job at not causing reference cycles in our compiler data structures. Folks have noticed that we leak compiled models even when the model objects are deleted. This generally emerges when a reference cycle goes through a non-traversable object (C++ shared references or co_extra on code objects); the result cannot be deallocated unless we explicitly break the reference cycle, which we generally do not do. Animesh is going to take a whack at the model finalization problem. On the bright side, Animesh did push a fix for a long-standing bug [dynamo][eval_frame] Set destroy_extra_state deleter as part of co_extra by anijain2305 · Pull Request #107117 · pytorch/pytorch · GitHub related to code object deallocation (admittedly rare, but it can happen).
Yidi Wu wanted to know if we could guard against backends changing https://github.com/pytorch/pytorch/pull/107337 so you don’t have to call dynamo.reset() anymore. The motivation was to allow people to seamlessly use the eager backend alongside their compiled backend. Difficult!
When doing passes on FX graphs, it can be annoying to keep fake tensor metadata up-to-date. Horace is looking into some incremental rebuilding of the metadata, stopping re-propagation once you notice that the fake tensor lines up with the old values.
Distributed
Handling backward hooks in Dynamo is kind of difficult. There is a discontinuity between hooks on inputs and hooks on intermediates; hooks on intermediates, in particular, have to somehow be reflected in whatever graph gets differentiated by autograd, but at the same time these hooks may have arbitrary Python bits that need handling by Dynamo. It seems the problem is made easier if we have combined forward-backward tracing in Dynamo, at which point Dynamo knows enough about the backward structure to bypass AOTAutograd entirely. It might also be possible to just do cuts prior to going to AOTAutograd, but this will impede optimization. It might be possible to bypass this problem for FSDP if hooks are only on parameters and outputs. Lots of difficulties…
dlrm is now passing on dynamic shapes, which is cool. RobertaForQuestionAnswering was fixed (not clear what fixed this; it’s in the 35cca799ff42182a1b7f1ee4d0225ee879b7c924…384e0d104fd077d31efafc564129660e9b7a0f25 range). Some other wins (and some regressions, most importantly sam) come from Unfuse bias add before pointwise ops by eellison · Pull Request #106912 · pytorch/pytorch · GitHub; there are also some other unexplained changes like convnext_base and jx_next_base in this same commit range (which sort of makes sense: @eellison landed a bunch of perf-related changes).
We were on break for two weeks because I went on vacation, and I didn’t have time to do a report before/after vacation lol.
Executive summary
PyTorch 2.1 branch cut. The cut was three weeks ago (right when I went on vacation lol) and we’re reaching the end of the cherry-pick window. Track ongoing cherry picks at: https://github.com/pytorch/pytorch/issues/108055
Blueberries offsite was this week! The blueberries workstream is focused on accelerating SOTA transformer models using PT2, quantization, sparsity and other techniques. Some highlights: MFU is coming to the benchmark suite, some direct improvements to important models, int8 dynamic quantization with tensor subclasses. Many of these are not published yet, keep your eyes peeled at PTC!
Aug 24 https://www.youtube.com/watch?v=H6EUSsvDmbw - we spent time going over recent KJT progress (to be reduxed below), and Voz reported progress on tracing FSDP with hooks (also to be reduxed below)
Aug 31 - not livestreamed publicly, I wasn’t there, but apparently there was some discussion about streams for tracing FSDP (no minutes alas)
Distributed and PT2
Tracing FSDP. Voz is deep in the weeds on backwards hooks support. We are attempting to implement hooks in a way that doesn’t require consolidated forward-backwards. The general strategy is (1) have Dynamo emit graphs that have register_hook calls on intermediates (register_hook calls on inputs must not go in the graph; they have to happen as part of residuals), (2) write these register_hook calls in such a way that when AOTAutograd runs, the actual hook code (which is arbitrary Python code and is not safe to run in tracing) is not run, but instead we run a meta function (which performs any needed metadata mutation) and then insert a call to the original Python function (which will show up in backwards), (3) have compiled backwards take care of compiling this call in the end.
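A toy example of the shape of program being targeted (not FSDP itself), just to pin down the input-vs-intermediate distinction:

```python
import torch

def f(x):
    y = x * 2
    # hook on an intermediate: the register_hook call must be reflected in the traced
    # graph, while the hook body itself only runs (and is compiled) in backwards
    y.register_hook(lambda g: g * 3)
    return y.sum()

x = torch.randn(4, requires_grad=True)  # a hook on `x` would instead be installed as a residual
f(x).backward()
print(x.grad)  # 2 * 3 = 6 everywhere
```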
Per-parameter FSDP is looking pretty legit. Andrew Gu has been looking at the performance of per-parameter sharding (where parameters managed by FSDP aren’t shoved into a single flat buffer) and has found that we only really pay a penalty of 5% with per-parameter sharding but get better memory usage. Meta-only post.
This is not really PT2 related, but there’s an interesting set of posts about the future of SymPy circulating around: Towards a new SymPy: part 1 - Outline — blog documentation. Funnily enough, the part of SymPy which Oscar calls out as “overused” (the symbolic expression system) is precisely the part we actually care about. Maybe it’s a good reason for us to figure out some way to not use this part (me, personally, I want a compact representation and hash consing).
I discussed this in a bit of detail in composability sync three weeks ago, but work on supporting fine-grained KJTs is going very well. This week, I worked with Michael Suo to get APS sparse arch tracing all the way through, and I managed to do it (though it failed on some seemingly unrelated problem). So fine-grained tracing definitely seems like it will work, even if we generate tons of crappy guards. My plan for next week is to make a serious attempt at tracing multi-node model parallel sharded torchrec_dlrm.
This week, when I had spare time in the offsites, I worked on fixing a few one-off bugs. There were several that were pretty easy to nail:
Peter Bell is very close to landing inductor IR support for scan https://github.com/pytorch/pytorch/pull/106581 which allows for native cumsum/cumprod support. Now all we need is for someone to add a higher order op that feeds into this and we will have torch.scan!
Someone should add a “realize” operator to PT2, which would force materializing a tensor rather than allowing fusions across it. Christian Puhrsch would find this useful for ensuring epilogue fusion occurs on int8 mm (today, regular fusion causes the pointwise operation to get fused into a later reduction, instead of fusing the pointwise into the matmul)
ABI compatibility for AOT Inductor is continuing to proceed, slowly, but one agreement is that we’re probably going to have only the ABI-compatible codegen for OSS as well.
In the PT2 weekly meeting, we discussed H100 benchmarking. There are a lot of interlocking parts to this: we need to upgrade Triton to get their H100 improvements, and not everyone on the PyTorch team has access to an H100. Still looking for someone to sign up for this.
CUDA graph updates are a thing now: 1. Introduction — CUDA C Programming Guide There may be some opportunities here. Elias says: “It mostly helps with eliding input copies. For the most part, removing input copies only really matters when you torch.compile only part of your model and leave the rest of the model in eager. This use case is pretty unlikely to train well anyway since you’ll still need to bifurcate the memory pool.” However, personally, I also think CUDA graph updates could be pretty useful for allowing you to deallocate the pool of memory needed by a CUDA graph, only reallocating it when it’s time to run the CUDA graph again.
Some big refactors that are in progress: refactoring skipfiles / allowed functions (talk to Yanbo), refactoring guard trees (talk to Animesh)
A bunch of new contributors being onboarded to Dynamo: Quansight is working more on Dynamo issues, and Jack Cao from PyTorch XLA is looking to help us with consolidated forwards-backwards-optimizer support in Dynamo as it is essential for XLA Dynamo perf.
KJT tracing updates: Tracing torchrec_dlrm with distributed sharding manages to get to the wait on the sharded embedding table lookups, at which point we are stuck on a complicated custom autograd function. Voz will take a look after finishing up intermediate backward hooks. In other news, the production folks on the workstream have finished getting rid of layer splitting for disables only, so they’re now quite interested in compiling through as well. There’s lots of foundational work that still needs to be done; we’re hoping for Q4, but that is very aggressive! Meta only: https://docs.google.com/document/d/1VTGEh0MqadAsuRy0s5u39wQhNwMSVgCgYewivMcBbuU/edit#heading=h.jknt1mqmztph
Animesh is going to be working on improving guard evaluation overhead, but there is still some disagreement among Voz, Jason and Edward about two major things: (1) should we port guards to C++ and do the rest of the scheme all in one go, and (2) should we stay in the “one compiled function, one check function” regime, or go straight to Voz’s one shared check function for everything.
Some folks from the Cinder team came to the PT2 weekly to talk about some challenges of running PyTorch with lazy imports. One big problem is the way Dynamo implements skipfiles by traversing modules to find all identifiers attached to them; this plays poorly with lazy imports. Other trouble points include decorators which put identifiers into global state, and our dispatcher registration mechanism.
Horace is complaining that compile time is still kinda slow while he’s been working on llama. Profiling shows pytree is still a big culprit (20%); we also spend a lot of time doing map_aggregate in FX (10%). There was some discussion about reviving our fake tensor propagation rules caching idea.
Chien-Chin told us about the new PT2 DDP plans. We cannot directly trace DDP because it is implemented in C++, and we cannot easily port it to Python because the implementation is complicated by bucketing. So the idea is to implement a Python non-bucketed DDP, and rely on compile to optimize it away.
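A hypothetical sketch of what such a Python non-bucketed DDP could look like (class name and structure are mine, not the actual plan’s code): allreduce each gradient as it is produced, and rely on the compiler to recover bucketing and overlap.

```python
import torch
import torch.distributed as dist

class NaiveDDP(torch.nn.Module):
    """Unbucketed, pure-Python DDP sketch: one allreduce per parameter gradient."""

    def __init__(self, module: torch.nn.Module):
        super().__init__()
        self.module = module
        self.world_size = dist.get_world_size()
        for p in self.module.parameters():
            if p.requires_grad:
                p.register_hook(self._allreduce)  # fires with the grad for this parameter

    def _allreduce(self, grad: torch.Tensor) -> torch.Tensor:
        dist.all_reduce(grad)              # no bucketing: the compiler is expected to fuse/overlap
        return grad / self.world_size      # returned tensor replaces the gradient

    def forward(self, *args, **kwargs):
        return self.module(*args, **kwargs)
```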
Horace told us about developments in LLMs. One thing he wants is dequant primitives in PT2: a way to take int3/int4 packed values and unpack them into a larger tensor, with the idea that PT2 would compile away the memory traffic. In general he doesn’t think we should directly do this in PT, as there are so many quantization formats.
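A hedged sketch of the kind of dequant primitive being described, assuming two int4 values packed per uint8 byte and a per-tensor scale/zero-point (the layout and names are illustrative, not a real format):

```python
import torch

def dequant_int4(packed: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    lo = packed & 0x0F                    # low nibble of each byte
    hi = (packed >> 4) & 0x0F             # high nibble
    vals = torch.stack([lo, hi], dim=-1).flatten(start_dim=-2)  # restore int4 element order
    return (vals.float() - zero_point) * scale

packed = torch.randint(0, 256, (4, 8), dtype=torch.uint8)
w = dequant_int4(packed, scale=0.05, zero_point=8)  # the hope: PT2 fuses this into the consuming matmul
```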
Dynamic shapes
Last week I mentioned opcheck testing is usable, but Richard Zou is still evolving it based on user feedback. A recent change is to put the xfails into a JSON file so it can easily be automatically updated. However, there are still complaints from folks that it’s too hard to understand what goes wrong when a test crashes. Richard is going to investigate a two-stage process now, whereby we separate generation of test inputs from actually running the tests. To ensure generation of test inputs is kept up to date, we only need a single new test which runs all of the tests in the test file in one go and xrefs which tests are exercised against what we have recorded.
Horace wants a version of Tensor where some of the sizes are stored on device. This would allow you to perform a data-dependent operation without synchronizing; and you would still save on memory traffic because you would have kernels mask out memory loads when they go out of bounds of the dynamic shape. In some sense, this is a specialization of jagged tensor where everything in the jagged dimension has the same size.
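A small sketch of the idea (my own illustration): pad to a maximum size, keep the true length as a tensor on the same device, and mask out-of-bounds elements instead of synchronizing to learn the actual size.

```python
import torch

def masked_sum(padded: torch.Tensor, length: torch.Tensor) -> torch.Tensor:
    # `length` is a 0-d integer tensor living on the same device as `padded`
    idx = torch.arange(padded.shape[0], device=padded.device)
    mask = idx < length                  # comparison stays on device; no host sync
    return (padded * mask).sum()

x = torch.randn(1024, device="cuda")
n = (x > 0).sum()                        # some data-dependent "size", produced on device
print(masked_sum(x, n))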