This is a little more than two weeks’ worth of updates, covering PSC week, Edward on vacation, and the July 4th holiday.
Dynamic shapes by default has landed. To be clear, this is “automatically enable dynamic shapes if recompiling due to size changes.” Most models running PT2 should not see any difference, as they are static already. If your model has dynamism, expect dramatically lower compilation times at the cost of some E2E performance. There may be performance regressions; please file bugs if you encounter any. You can use TORCH_LOGS=dynamic to diagnose whether dynamic shapes is doing something. Check also the Meta-only post.
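To make the new behavior concrete, here is a minimal sketch (my own toy example, not from the release notes) of what “automatic dynamic” looks like from the user side, together with the TORCH_LOGS incantation mentioned above:

```python
# Run as `TORCH_LOGS=dynamic python script.py` to see the dynamic shapes logs.
import torch

@torch.compile
def f(x):
    return x * 2

f(torch.randn(4))   # first call: compiled with static shapes
f(torch.randn(8))   # size changed: recompiles, this time with a dynamic dimension
f(torch.randn(16))  # served by the dynamic compilation; no further recompile expected
```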
Internal telemetry for dynamic shapes. Add signpost_event to dynamic_shapes adds a hook which we use internally to record all uses of dynamic shapes. You can check whether dynamic shapes was actually used by checking whether free_symbols is non-zero.
Roadmap review for H2 2023. We had roadmap review for PyTorch teams last week. Dynamic shapes’ presence on the roadmaps looks like this: (1) we have a bunch of internal enablement plans which require dynamic shapes to be well supported, so make sure we are on point here (Meta only), (2) we’re really interested in getting good inference performance on LLMs comparable to SOTA, e.g., llama (there are some kv-cache / CUDA graphs pieces here), (3) there’s still jagged/nested tensor work to do. On a more atomic level, the infra investments that dynamic shapes need to make are probably (a) two-level guards for backwards shape guards, (b) improved accuracy/compile time debugging tools, (c) more aggressive symbolic reasoning enabled by translation validation, (d) obvious Inductor compilation perf improvements, e.g., from split reductions, (e) unbacked integers for eager mode. I’d also like to finally get vision_maskrcnn and detectron2 working on PT2, but LLMs take priority over this.
Which operators specialize their inputs? In the old days, dynamic shapes enablement would typically fail because of missing meta functions. These days, things usually don’t fail, but you may end up having specialized and recompiling anyway. @anijain2305 has been working on sweeping operators to find out which arguments get specialized, to help folks have a better understanding of what will be dynamic versus not.
vision_maskrcnn went back to failing; seems flaky.
eca_botnext26ts_256 and mobilevit_s timed out due to translation validation being enabled. #104654 fixed it (the fix will be visible in the next perf run). The compilation time increase also appears to be due to TV.
Dynamic shapes now support mode=“reduce-overhead” (CUDA graphs). Conventional wisdom was that dynamic shapes are incompatible with CUDA graphs, because any given CUDA graph recording can only work for a single static shape, and CUDA graphs’ requirement of hard-coded memory addresses means that each CUDA graph takes up quite a lot of CUDA memory. However, this conventional wisdom is wrong: (1) multiple CUDA graphs can share the same memory pool, as long as you don’t have any live tensors from one pool to the next (this is precisely what CUDA graph trees by @eellison implements), and (2) recording a CUDA graph is much, much cheaper than running the entire PT2 compilation stack, so it is profitable to compile a dynamic program once and then CUDA graph it multiple times. Enable cuda graphs for dynamic shapes by ezyang · Pull Request #105064 · pytorch/pytorch · GitHub realizes these gains and switches our dynamic shapes benchmark configuration to use CUDA graphs, resulting in hefty performance gains with only a modest increase in compile time. Importantly, these benchmarks cover our _generate inference benchmarks, which actually make use of multiple sizes as sequence length varies. There’s more to be done here: our memory usage for this use case can be suboptimal, because the caching allocator doesn’t know that it’s OK to waste space for small allocations by fitting them inside larger allocations for a larger dynamic size. We also observed that folks using this CUDA graphs trick tend not to generate CUDA graphs for every size, but instead prefer to linearly sample sizes and pad; we should make it easier to do this (perhaps with a padding tensor subclass). One cool result is a 6x performance improvement on cm3leon, a newly announced multi-modal model from Meta AI.
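As a rough illustration of the workflow this enables (a toy module of my own, not one of the benchmarks): compile once with dynamic shapes, and let CUDA graph trees record one graph per size actually observed.

```python
import torch

mod = torch.nn.Linear(768, 768).cuda()
opt = torch.compile(mod, mode="reduce-overhead", dynamic=True)

for seq_len in (128, 256, 512):
    x = torch.randn(seq_len, 768, device="cuda")
    opt(x)  # one PT2 compile total; each new size only pays a (cheap) CUDA graph recording
```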
New API: torch._dynamo.maybe_mark_dynamic. Add torch._dynamo.maybe_mark_dynamic lets you suggest that we should try compiling a tensor dynamically, but doesn’t raise an error if it gets specialized (unlike mark_dynamic).
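A minimal usage sketch (the call pattern is my reading of the PR; treat the details as illustrative):

```python
import torch
from torch._dynamo import mark_dynamic, maybe_mark_dynamic

@torch.compile
def f(x):
    return x.sum()

x = torch.randn(8)
maybe_mark_dynamic(x, 0)  # suggestion: compile dim 0 dynamically; silently specializes if it must
# mark_dynamic(x, 0)      # by contrast, this errors out if dim 0 ends up getting specialized
f(x)
```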
Infer valid input sizes from programs. Horace has wanted this for some time, and with Yukio’s recent Z3 translation validation work landed, it turned out to be pretty easy to write a PoC to exhaustively search the space of valid inputs, using guards to turn us away from portions of the space we’ve seen before. Check it out at dinfer.py · GitHub. If anyone is interested in productionizing this, it would be a neat little project to (1) put this code in PyTorch and put a nicer API on it (note that as written, you have to specify the input dimensions and dtypes of input tensors, so you’ll need to figure out a good way of specifying or inferring this info), (2) improve the solving code to minimize the generated sizes for an equivalence class, and (3) use it for something cool; e.g., you could use it to automatically generate sample inputs for OpInfo tests. Tag me (@ezyang) as reviewer if you send a PR!
lit_llama has finally landed in torchbench: https://github.com/pytorch/benchmark/pull/1730. At time of writing this model is in canary models because the weight download is a little flaky. This is the only 7B model in our benchmark suite, and there’s a bit of pain associated with this; for example, we can’t run accuracy tests on this model, because accuracy tests are implemented by holding two copies of the model in memory, which we can’t do at 7B parameters.
Notable bug fixes.
Transmute refined SymInt into int makes it more likely you’ll get an int rather than a SymInt if the SymInt got specialized into a constant. This sometimes caused some bugs with downstream components that can handle SymInt but choke on int.
Inductor backend for CPU inference extremely slow - this bug actually seems to be fixed on main thanks to dynamic shapes. Moral of the story: if you want dynamic shapes, use a nightly! We have soooo many improvements.
Now passing: hf_Longformer (this used to fail with ValueError: Cannot view a tensor with shape torch.Size([4, 12, 1024, 513]) and strides (6303744, 513, 6156, 1) as a tensor with shape (48, 4, 256, 513); the fix is thanks to Brian Hirsh finally landing his AOTAutograd longformer fix), vision_maskrcnn (flaky), eca_botnext26ts_256 and mobilevit_s (used to timeout; maybe the speedup from CUDA graphs was enough to get it under the timeout again)
Now failing: DebertaV2ForQuestionAnswering (failing accuracy due to cudagraphs, failing on inductor_with_cudagraphs too), cait_m36_384 (OOMing on accuracy due to increased CUDA graph memory usage)
Speedups: The majority of our speedups are due to the enablement of CUDA graphs for dynamic shapes. Some notable models and their speedups: BERT_pytorch (1.7698 → 3.3071), hf_GPT2 (1.7728 → 2.0056), basic_gnn_gin (1.3151 → 2.4841). The improvements on HF and TIMM models are much more modest since these are not super overhead bound models. Note that these numbers are still behind inductor_with_cudagraphs, because we are still losing some optimizations from running the PT2 compiler stack without static shapes.
Slowdowns: dlrm (infra failure due to cudagraphs, failing on inductor_with_cudagraphs too), hf_T5 (2.0252 → 1.8939, oddly enough–could this be due to memory pressure? But even more weirdly, hf_T5_large improved perf)
Comptime/Memory: By and large compilation time did not increase, but for our training setup this is expected, as we only actually run at one batch size, so you are simply measuring the cost of a single CUDA graph recording. As expected, memory compression ratio gets worse, due to the standing allocation from CUDA graphs.
Speedups: torchbench numbers are actually a huge mixed bag. Here are some of the wins: BERT_pytorch (2.2317 → 2.4529), basic_gnn_edgecnn (1.7809 → 1.8732) (note that for some reason many of the GNN variants are failing performance on inference, but not accuracy), cm3leon_generate (1.3037 → 5.7822, WOW! This is consistent with some perf analysis Armen and I did months ago, where I concluded that cm3leon was hella overhead bound), hf_T5_generate (2.2619 → 8.2081), hf_T5_large (3.1690 → 5.1747)
Slowdowns: A lot more models did worse with CUDA graphs enabled, including LearningToPaint (1.9209 → 1.6812), resnet18 (1.7779 → 1.4028), shufflenet_v2_x1_0 (1.9882 → 1.6010), squeezenet1_1 (1.8625 → 1.0040), yolov3 (2.0997 → 1.8843). It’s not entirely clear what’s going on here, but we will note that there was a sizable dip in CUDA graphs performance without dynamic shapes this week on torchbench, too. There is an across-the-board performance regression on TIMM models (and a slight regression on HuggingFace too).
Comptime/Memory: Comptime generally got worse across the board, but not too much worse. Particularly notable are the generate models: hf_T5_generate (881 → 1032), cm3leon_generate (131 → 203). CUDA graphs is not free, but given that we’re running at much more than two sequence lengths, you can see the bulk of the compile cost is the PT2 stack. For the most part, memory usage stayed fairly stable, interestingly enough.
What’s next?
I think I want to investigate the memory planning situation with CUDA graphs a bit more; I also think it’s a good time to teach Inductor how to deal with data-dependent ops (without having to graph break on them.)
Whole model compilation for sparse architecture in recommendation models. @anijain2305 has been looking at improving the ability to slap torch.compile on an arbitrary function and have it just work. One of the more challenging situations is when we try to compile the sparse architecture of recommendation models; e.g., code that interacts with [torchrec.sparse](https://github.com/pytorch/torchrec/tree/main/torchrec/sparse). In one example, a KeyedJaggedTensor is being compiled, but it contains a list of 500 integers, each of which varies over time and participates in many guards. This is a worst-case scenario for dynamic shapes compile time. However, we are also running into lots of graph breaks, which are resulting in us trying to compile smaller fragments than we should. There will be a mix of fixing graph breaks (some of them are due to data-dependent output size operators like nonzero–time to fix this!) and otherwise figuring out what else needs to be done.
CUDA graphs memory planning is lower priority for now (@eellison may take a look, but higher priority is actually being able to turn on CUDA graphs in prod situations; a big problem here is when we fail to compile the entire extent of the model, causing CUDA graphs to increase overall memory usage.) It looks like we definitely need data-dependent op support in inductor though, based on sparse arch investigation.
Data-dependent shape support in Inductor. I got an end-to-end PoC of a pointwise and then a reduction with hacks working in Inductor: gist:1293a41299604c44310341b7540eabcb · GitHub. The main gaps: (1) optional optimizations failing to retrieve hints (Triton size hints (pick 8192 to prevent the block size from shrinking), multiple-of-16 hints (pick something not a multiple of 16), 32-bit indexing), (2) buffer reuse (keying on the str rep is fine, use sympy_str), (3) updating wrapper codegen to create bindings to i0 variables. In general, it seems it’s pretty useful to have accurate maximum size information, for which ValueRanges is an incomplete fix because we don’t support symbols (s0) in bounds. Another trick we plan to implement is a special irrefutable guard: if we guard on an unbacked symint, we instead just assume it is true and add a runtime assertion. One question is whether or not we can always get dynamic shapes working no matter what. It seems that in Inductor, we usually can just turn off optimizations to avoid guards. So it seems we just need to get host-side torch.cond working to handle everything else. Some fixes for these are in: If we can’t statically prove 32-bit indexing OK, only add guard if hint exists, Provide a refined upper bound for nonzero when original numel is static
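For orientation, this is roughly the kind of user program the PoC targets: a pointwise op followed by a reduction over a data-dependent size. The config flag is my assumption about what lets nonzero through Dynamo without a graph break; as of this update the Inductor side still needs the hacks described above.

```python
import torch
import torch._dynamo

# Assumed flag: lets data-dependent output shapes (e.g. from nonzero) through Dynamo.
torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile(fullgraph=True)
def f(x):
    nz = torch.nonzero(x).squeeze(-1)  # output length is an unbacked SymInt (i0)
    return (x[nz] * 2).sum()           # pointwise, then a reduction over that unbacked size

f(torch.randn(1024))
```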
An initial plan for KeyedJaggedTensor. After studying some of the models that use KJT and trying to get export working on them, here are some of the initial findings:
You can remove the list of integers from KJT before tracing a model, which will cause the model to perform a data-dependent access to populate these integers as unbacked integers. However, when we try to use these integers to do a tensor_split, we immediately hit guards we cannot prove (see the sketch after this list). The guards should be provable via sum(lengths) == values.shape[0], but our symbolic reasoning is not strong enough. These guards are for errors, so they should be bypassable by irrefutable guards (guards which, if they fail, imply you would have errored anyway; in this case you can convert the guard into a runtime test). This is worth pursuing further. In any case, you can expect to have 500 unbacked symints, so symbolic reasoning must be fast enough to deal with them.
If you don’t remove the list of integers, you need some way to prevent them from 0/1 specializing. In export, you can simply require every sparse feature to be populated to size 2 and hope it generalizes to 0/1. In eager, we probably will just have to specialize KJT to treat these integers specially. The big benefit of this strategy is that you’re not hard-blocked on guards on unbacked SymInts, since there’s always a hint; you don’t need any sum(lengths) reasoning since guards are discharged by checking the underlying values. We cannot actually do this in export because export does not support SymInt inputs–I plan to fix this.
Export with KJTs doesn’t work because KJTs are not a supported input. The direct fix is Add pytree support to KeyedJaggedTensor by ezyang · Pull Request #1287 · pytorch/torchrec · GitHub; the indirect fix is rewriting the export calling convention from pytree specs to a dictionary mapping “FQN” (really Source.name()) to Tensor. In that case, to pass a KJT named id_list_features, you would actually pass three tensors, id_list_features._values, etc.
More details at Meta-only doc (sorry, non-public due to details about Meta prod models).
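Here is a toy, hypothetical version of the tensor_split problem from the first finding above (the configs and names are illustrative, and this is heavily simplified from a real KJT):

```python
import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True  # assumed: lets .tolist() produce unbacked SymInts

@torch.compile(fullgraph=True)
def split_features(values, lengths):
    sizes = lengths.tolist()           # data-dependent: each entry becomes an unbacked SymInt
    return torch.split(values, sizes)  # needs a guard like sum(sizes) == values.shape[0]

values, lengths = torch.randn(6), torch.tensor([2, 3, 1])
# At the time of writing this is expected to trip an unprovable guard on the unbacked SymInts;
# the irrefutable-guard idea turns that guard into a runtime assert instead.
split_features(values, lengths)
```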
Export for QAT. QAT wants to do whole-graph transformations on a pre-autograd FX graph. Export sort of supports this with pre_dispatch export. What is likely going to happen is that this turns into the IR format that export is going to use. Pre-autograd functionalization is unlikely to happen; you only get some mild normalization. It’s still unresolved how to work this into the overall QAT workflow API, since export isn’t really keen on exposing this mid-point IR (which is kind of incoherent).
More on KJT/torchrec. I had a nice discussion with Dennis van der Staay about torchrec and work on sparse arch. Some new information: (1) this workstream is almost certainly going to involve distributed later, because applying PT2 to post-torchrec sharded models is going to involve tracing past communication primitives–this also implies I’m going to want to get FakePG working on torchrec, (2) working on unit tests should be a pretty good idea, but there’s still some basic infra work to do (laid out last week), (3) we’re not really expecting concrete performance improvements, as sparse arch is typically going to be communication bound, so this is mostly a “we think this is promising, and the investment is not too big, because we’ve already done so much with dynamic shapes so far” situation.
Pre-dispatch export. We’ve agreed to allow QAT to short-term publish a new export interface that produces a pre-dispatch FX graph with ATen operators which is suitable for graph transformations and training. The long term goal will be to have pre-dispatch functionalization, which is the invariant the export team wants, to allow this to be worked into torch.export proper. Pre-dispatch will generate an ExportedModule so that the APIs match.
Fake export. Export now supports exporting entirely fake modules/inputs. This means that to export a model you don’t have to actually load its weights into memory; you can load it in a fake mode and still export it. As a result, we have some delicate code in Dynamo for dealing with two concurrent fake modes (but it’s not so bad: the outer fake mode is typically disabled while we do Dynamo analysis). Only ONNX supports torch.load’ing models in fake mode at the moment.
Improved user stacks in Dynamo. torch._guards.TracingContext.extract_stack() now always accurately reports a user stack from anywhere in Dynamo, and we reliably use it for reporting real stacks for exceptions (previously, they used an entirely different mechanism).
Improved error messages for non-local inputs in export. See Improve error message when export encounters non-local input for the details. This isn’t complete; the follow-through is to also make this work for outputs, and to work a little harder with the pytree representation (probably this week).
Dynamo change in attitude. Many folks are concerned that Dynamo is just “endless” bugs. I pitched Animesh and Voz on a new attitude to fixing Dynamo bugs, which is that we should imagine the platonic ideal implementation of Dynamo as a faithful reimplementation of CPython in Python. Then, fixing a bug should not just be moving code around to fix a particular problem, but instead improving the local vicinity of the code to bring it closer in line with this ideal. An example I used a lot when explaining this was dict.keys support (the bug fix is changing its type from tuple to set; the real fix is to accurately model dict views). To do this well, you need to regularly look at CPython code, and Dynamo may need to grow some new abstractions (perhaps a proper implementation of Python’s object model, or Python traceable polyfills).
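For example, here is plain CPython behavior that a tuple- or set-based model of dict.keys gets wrong; a faithful model has to treat views as live, set-like objects:

```python
d = {"a": 1}
keys = d.keys()
d["b"] = 2
print(list(keys))          # ['a', 'b'] -- the view reflects later mutation
print(keys & {"a"})        # {'a'}      -- set-like operations work on views
print(keys == {"a", "b"})  # True
```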
The big perf increase in torchbench is due to maml getting removed from the benchmark set (it slows down a lot under PT2 and was depressing the score). clip, hf_Whisper, and llama_v2 are new models added thanks to @msaroufim!
What’s next?
There are a lot of things that need doing:
Finish overhauling export input/output pytree matching (probably not dumping the pytree in/out spec, but I think if we tree_map into positional identifiers we can reliably detect KJT missing situations)
I’m trying something a little different, expanding the update to cover a wider variety of topics beyond dynamic shapes, mostly centered around things that I personally have involvement in (this is a lot of things, so you should be getting pretty good coverage this way!)
Benchmarking
Inductor CI/perf is upgraded to CUDA 12 / gcc 9. This doesn’t seem to have any appreciable effect on perf, but we did it so we could do the next item.
torchrec_dlrm is back. They were disabled a few months ago because of fbgemm nightly related flakiness. The flakiness has been resolved by building fbgemm/torchrec from source in the Docker image. These are now installed as part of the general torchbench installs, and should help some of the work we are doing on jagged tensors (since many important operators are currently implemented in fbgemm).
Algorithmic efficiency. Frank Schneider posted about how PyTorch was slower than JAX in their upcoming algorithmic-efficiency benchmark suite. A bunch of us, spearheaded by @msaroufim, jumped in to take a look at what was going on. Status updates at https://docs.google.com/document/d/1okqKS32b0EhWQSFFoSV6IjGlYM4VhNYdxBPjdlFIw5w/edit (Meta-only). I personally have an interest in the dlrm side of things, since I’ve been working on sparse arch recently; after fixing some mild bugs, I was able to show parity on criteo1tb dlrm between PyTorch nightly and JAX on an A100x8 (PyTorch score: 7703.403180360794, JAX score: 7703.041719198227), although the number of evals varied, so I’m not sure if this is a threat to validity. Unfortunately, this does not necessarily help their problem, which was an OOM. To make further progress on this, we may need some tools to help us understand why torch.compile memory usage is higher.
Export
Pre-dispatch export, part 2. We had more discussion about pre-dispatch export in the Friday export meeting. @suo in particular was arguing that from a frontend perspective, it would make more sense to export pre-dispatch IR by default, and have the further post-dispatch lowerings be an extra pass on top that is opt-in by backends. One of the identified barriers to doing this is pre-dispatch functionalization; the other is non-differentiable decomps. nkaretnikov is going to take a look at core_aten_decompositions to see which of these are differentiable and which are not. In other news, torch.export is going platinum: Expose torch.export() API by gmagogsfm · Pull Request #106904 · pytorch/pytorch · GitHub
Tracing FSDP. @voz wrote a post (Meta-only) about the state of tracing FSDP in Dynamo. The key info is that, on a branch, he can trace everything through and get identical results on a single forward-backward to eager. There are a lot of fixes that need to land to main; from his post:
The value of various small changes to FSDP to make this work vs adding fixes in Dynamo (pretty easy; we prefer Dynamo of course, but for some mostly no-op shuffling we do FSDP as well)
TypedStorage - is it tensor-like/tensor-associated enough to go in the graph? Do we need to add some ops for doing tensor typed storage data ptr comparison / checking free, etc?
Working through the cudastream story, in particular around wait_stream and such
Lots of little bug fixes here and there
Coverage for missing comparisons, bytecode ops, general coverage gaps like attr access on FSDP modules, setting data on a tensor, etc.
pytrees slow again for DTensor. Junjie and Rodrigo have been trying to improve DTensor’s eager perf, and we spent the first half of composability sync talking about it. Rodrigo had a hack to pre-compile pytree applications into Python code, but apparently this doesn’t help that much: gist:5427cabfab6421d4e104905345f94a50 · GitHub. Another suggestion from the meeting was that after Brian’s subclass support lands, maybe you could torch.compile each op individually with backend=“eager”.
Data-dependent all2all. Will Feng got all2all collective working in inductor https://github.com/pytorch/pytorch/pull/106655/ This is notable because all2all collective has data-dependent output shape. It looks like unbacked symints worked here!
Custom ops
Custom ops. Richard tells me he is going to add a class-based API for custom ops, to make it easier to define everything all in one place. More on this soon I assume!
SkolemSymNodeImpl. @jw3468 is going to make size() work on jagged tensor by introducing a new concept to SymInt, provisionally called SkolemSymNodeImpl. This is a special SymInt which is not symbolic (it can show up in eager mode) but only compares equal to itself (aka it is a skolem variable). We will use this to represent jagged dimensions. All jagged tensors that have the same offsets tensor get assigned the same skolem variable; if you have different offsets tensors, you can’t add them together because their skolem variables don’t match. More details at https://docs.google.com/document/d/1e-R_818YA4VlVTlozu5eyzRIV6TzyvSPDm9DMEw_4xg/edit (Meta-only)
Time to get rid of functional VariableTracker? VariableTracker in Dynamo is an immutable data structure: when a mutation happens, you allocate a fresh VariableTracker and then replace old VariableTrackers with the new one. This is because we have checkpointing functionality that is used to rewind to old VariableTrackers. However, this is a bit of a pain from the modeling side, as every Python data structure has to be reimplemented to have purely functional operations. An alternate design is to allow direct mutation of VariableTrackers. To do checkpoints, we simply restart Dynamo analysis to “go back in time”, stopping execution at the point where we would have checkpointed (a deepcopy could also work, though I’m not a fan). Speculate subgraph would be implemented by simply denying all mutations or doing some crazy thermometer continuation thing. This would help make Dynamo more metacircular and reduce the work needed to support new container types, of which we often need to support a lot.
Dynamic shapes
expect_true irrefutable guards. I talked through this in the last 20 minutes of composability sync. Check out https://github.com/pytorch/pytorch/pull/106720; this is enough to make splits on unbacked SymInts work.
Boolean masking, at last. @yanboliang is looking into a pre-autograd FX transform that replaces boolean mask updates with torch.where calls. One annoying detail is how to deal with Dynamo tracing the boolean masks in the first place, when Inductor can’t deal with boolean masks that can’t be eliminated. Our idea, in lieu of fixing Inductor to work with data-dependent shapes (which we are working on), is to attempt to eliminate all data-dependent ops in a pre-dispatch pass, and if that is not possible, restart Dynamo analysis saying “you need to graph break on this op next time.”
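The rewrite in question, written out by hand on a toy example (this shows the shape of the transform, not the actual pass):

```python
import torch

def before(x, mask):
    y = x.clone()
    y[mask] = 0.0        # data-dependent boolean mask update
    return y

def after(x, mask):
    # shape-preserving equivalent that avoids the data-dependent indexing
    return torch.where(mask, torch.zeros_like(x), x)
```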
Notable fixes.
SymInt’ify tile. This one is needed for the algorithmic-efficiency criteo1tb dlrm.
Some accuracy regressions. torchbench: hf_BigBird, vision_maskrcnn (flaky). It’s not clear what broke hf_BigBird; possibly the CUDA 12 upgrade. Need to investigate. AlbertForQuestionAnswering improved accuracy!
The huge perf improvement across the board is thanks to Peter Bell’s work https://github.com/pytorch/pytorch/pull/106747 optimizing split reductions. This is not fully runtime split reductions: instead, Peter uses whatever the hint was at the time we compiled to plan the split reduction, and then we use that plan for all subsequent runs. This makes it more important to warm up Inductor with the “right” size hint to start; see also Padded tensor subclass · Issue #105325 · pytorch/pytorch · GitHub. Another user also complained about other cases where we made suboptimal decisions when the first kernel we compiled with wasn’t representative.
A lot of change on the last day; some improvements and some regressions (but mostly regressions). Maybe related to the CUDA 12 update; need to check. hf_BigBird is also failing here. RobertaForQuestionAnswering is now failing accuracy.
Trying to understand how to read PT2 logs? Check out Logging docs - Google Docs for some quick orientation. (In other news, guards logging this week has been improved to show you which line of user code caused a guard to be added! Take a look and don’t be afraid to give feedback on it.)
Have you ever wanted to store lots of tracebacks for logging/debugging purposes, but were afraid to do so by default because it might be too expensive? There is a new class in torch.utils._traceback called CapturedTraceback which makes it extremely fast to save a Python traceback (something like 20x faster than running a full traceback.extract_stack()), so it should change the calculation about whether or not you are willing to store tracebacks by default. We have already used this to keep fine-grained information about guard provenance, to start. Note that CapturedTraceback DOES hold references to code objects, and these references can cause cycles (because of co_extra), so good practice is to make sure you clear these traces once you know you no longer need them.
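A rough usage sketch based on the description above (the exact method names are my assumption, so treat them as illustrative):

```python
from torch.utils._traceback import CapturedTraceback

# Capture is cheap; symbolization is deferred until you actually need to print.
tb = CapturedTraceback.extract()
# ... much later, only if something goes wrong and you need the report:
print("".join(tb.format()))
# Once you no longer need it, drop the reference so the code objects it holds
# (via co_extra) can't participate in uncollectable cycles.
del tb
```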
I spent some time debugging reference cycles this week (due to CapturedTraceback), and Alban pointed me at objgraph for visualizing references/referents. It’s pretty cool, you should definitely check it out.
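For example (objgraph is a third-party package, and rendering needs graphviz; the toy cycle here is my own):

```python
import objgraph  # pip install objgraph

class Node:
    pass

a, b = Node(), Node()
a.other, b.other = b, a  # a simple reference cycle, like the ones described above
objgraph.show_backrefs([a], max_depth=3, filename="backrefs.png")  # draw who keeps `a` alive
```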
We are continuing to do a terrible job at not causing reference cycles in our compiler data structures. Folks have noticed that we leak compiled models even when the model objects are deleted. This generally emerges when a reference cycle goes through a non-traversable object (C++ shared references or co_extra on code objects); the result cannot be deallocated unless we explicitly break the reference cycle, which we generally do not do. Animesh is going to take a whack at the model finalization problem. On the bright side, Animesh did push a fix for a long-standing bug [dynamo][eval_frame] Set destroy_extra_state deleter as part of co_extra by anijain2305 · Pull Request #107117 · pytorch/pytorch · GitHub related to code object deallocation (admittedly rare, but it can happen).
Yidi Wu wanted to know if we could guard against backends changing https://github.com/pytorch/pytorch/pull/107337 so you don’t have to call dynamo.reset() anymore. The motivation was to allow people to seamlessly use the eager backend alongside their compiled backend. Difficult!
When doing passes on FX graphs, it can be annoying to keep fake tensor metadata up-to-date. Horace is looking into some incremental rebuilding of the metadata, stopping re-propagation once you notice that the fake tensor lines up with the old values.
Distributed
Handling backward hooks in Dynamo is kind of difficult. There is a discontinuity between hooks on inputs and hooks on intermediates; hooks on intermediates, in particular, have to somehow be reflected in whatever graph gets differentiated by autograd, but at the same time these hooks may have arbitrary Python bits that need handling by Dynamo. It seems the problem is made easier if we have combined forward-backward tracing in Dynamo, at which point Dynamo knows enough about the backward structure to bypass AOTAutograd entirely. It might also be possible to just do cuts prior to going to AOTAutograd, but this will impede optimization. It might be possible to bypass this problem for FSDP if hooks are only on parameters and outputs. Lots of difficulties…
dlrm is now passing on dynamic shapes, which is cool. RobertaForQuestionAnswering was fixed (not clear what fixed this; it’s in the 35cca799ff42182a1b7f1ee4d0225ee879b7c924…384e0d104fd077d31efafc564129660e9b7a0f25 range). Some other wins (and some regressions, most importantly sam) come from Unfuse bias add before pointwise ops by eellison · Pull Request #106912 · pytorch/pytorch · GitHub; there are also some other unexplained changes like convnext_base and jx_next_base in this same commit range (which sort of makes sense: @eellison landed a bunch of perf-related changes).
We were on break for two weeks because I went on vacation, and I didn’t have time to do a report before/after vacation lol.
Executive summary
PyTorch 2.1 branch cut. The cut was three weeks ago (right when I went on vacation lol) and we’re reaching the end of the cherry-pick window. Track ongoing cherry picks at: https://github.com/pytorch/pytorch/issues/108055
Blueberries offsite was this week! The blueberries workstream is focused on accelerating SOTA transformer models using PT2, quantization, sparsity and other techniques. Some highlights: MFU is coming to the benchmark suite, some direct improvements to important models, int8 dynamic quantization with tensor subclasses. Many of these are not published yet, keep your eyes peeled at PTC!
Aug 24 https://www.youtube.com/watch?v=H6EUSsvDmbw - we spent time going over recent KJT progress (to be reduxed below), and Voz reported progress on tracing FSDP with hooks (also to be reduxed below)
Aug 31 - not livestreamed publicly, I wasn’t there, but apparently there was some discussion about streams for tracing FSDP (no minutes alas)
Distributed and PT2
Tracing FSDP. Voz is deep in the weeds on backwards hooks support. We are attempting to implement hooks in a way that doesn’t require consolidated forward-backwards. The general strategy is (1) have Dynamo emit graphs that have register_hook calls on intermediates (register_hook calls on inputs must not go in the graph; they have to happen as part of residuals), (2) write these register_hook calls in such a way that when AOTAutograd runs, the actual hook code (which is arbitrary Python code and is not safe to run in tracing) is not run, but instead we run a meta function (which performs any needed metadata mutation) and then insert a call to the original Python function (which will show up in backwards), (3) have compiled backwards take care of compiling this call in the end.
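A toy example of the shape of program being targeted (not FSDP itself), just to pin down the input-vs-intermediate distinction:

```python
import torch

def f(x):
    y = x * 2
    # hook on an intermediate: the register_hook call must be reflected in the traced
    # graph, while the hook body itself only runs (and is compiled) in backwards
    y.register_hook(lambda g: g * 3)
    return y.sum()

x = torch.randn(4, requires_grad=True)  # a hook on `x` would instead be installed as a residual
f(x).backward()
print(x.grad)  # 2 * 3 = 6 everywhere
```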
Per-parameter FSDP is looking pretty legit. Andrew Gu has been looking at the performance of per-parameter sharding (where parameters managed by FSDP aren’t shoved into a single flat buffer) and has found that we only really pay a penalty of 5% with per-parameter sharding but get better memory usage. Meta-only post.
This is not really PT2 related, but there’s an interesting set of posts about the future of SymPy circulating around: Towards a new SymPy: part 1 - Outline — blog documentation. Funnily enough, the part of SymPy which Oscar calls out as “overused” (the symbolic expression system) is precisely the part we actually care about. Maybe it’s a good reason for us to figure out some way to not use this part (me, personally, I want a compact representation and hash consing).
I discussed this in a bit of detail in composability sync three weeks ago, but work on supporting fine-grained KJTs is going very well. This week, I worked with Michael Suo to get APS sparse arch tracing all the way through, and I managed to do it (though it failed on some seemingly unrelated problem). So fine-grained tracing definitely seems like it will work, even if we generate tons of crappy guards. My plan for next week is to make a serious attempt at tracing multi-node model parallel sharded torchrec_dlrm.
This week, when I had spare time in the offsites, I worked on fixing a few one-off bugs. There were several that were pretty easy to nail:
Peter Bell is very close to landing inductor IR support for scan https://github.com/pytorch/pytorch/pull/106581 which allows for native cumsum/cumprod support. Now all we need is for someone to add a higher order op that feeds into this and we will have torch.scan!
Someone should add a “realize” operator to PT2, which would force materializing a tensor rather than allowing fusions across it. Christian Puhrsch would find this useful for ensuring epilogue fusion occurs on int8 mm (today, regular fusion causes the pointwise operation to get fused into a later reduction, instead of fusing the pointwise into the matmul)
ABI compatibility for AOT Inductor is continuing to proceed, slowly, but one agreement is that we’re probably going to have only the ABI-compatible codegen for OSS as well.
In the PT2 weekly meeting, we discussed H100 benchmarking. There are a lot of interlocking parts to this: we need to upgrade Triton to get their H100 improvements, and not everyone on the PyTorch team has access to an H100. Still looking for someone to sign up for this.
CUDA graph updates are a thing now: 1. Introduction — CUDA C Programming Guide There may be some opportunities here. Elias says: “It mostly helps with eliding input copies. For the most part, removing input copies only really matters when you torch.compile only part of your model and leave the rest of the model in eager. This use case is pretty unlikely to train well anyway since you’ll still need to bifurcate the memory pool.” However, personally, I also think CUDA graph updates could be pretty useful for allowing you to deallocate the pool of memory needed by a CUDA graph, only reallocating it when it’s time to run the CUDA graph again.
Some big refactors that are in progress: refactoring skipfiles / allowed functions (talk to Yanbo), refactoring guard trees (talk to Animesh)
A bunch of new contributors being onboarded to Dynamo: Quansight is working more on Dynamo issues, and Jack Cao from PyTorch XLA is looking to help us with consolidated forwards-backwards-optimizer support in Dynamo as it is essential for XLA Dynamo perf.
KJT tracing updates: Tracing torchrec_dlrm with distributed sharding manages to get to the wait on the sharded embedding table lookups, at which point we are stuck on a complicated custom autograd function. Voz will take a look after finishing up intermediate backward hooks. In other news, the production folks on the workstream have finished getting rid of layer splitting for disables only, so they’re now quite interested in compiling through as well. There’s lots of foundational work that still needs to be done; we’re hoping for Q4, but that is very aggressive! Meta only: https://docs.google.com/document/d/1VTGEh0MqadAsuRy0s5u39wQhNwMSVgCgYewivMcBbuU/edit#heading=h.jknt1mqmztph
Animesh is going to be working on improving guard evaluation overhead, but there is still some disagreement among Voz, Jason and Edward about two major things: (1) should we port guards to C++ and do the rest of the scheme all in one go, and (2) should we stay in the “one compiled function, one check function” regime, or go straight to Voz’s one shared check function for everything.
Some folks from the Cinder team came to the PT2 weekly to talk about some challenges of running PyTorch with lazy imports. One big problem is the way Dynamo implements skipfiles by traversing modules to find all identifiers attached to them; this plays poorly with lazy imports. Other trouble points include decorators which put identifiers into global state, and our dispatcher registration mechanism.
Horace is complaining that compile time is still kinda slow while he’s been working on llama. Profiling shows pytree is still a big culprit (20%); we also spend a lot of time doing map_aggregate in FX (10%). There was some discussion about reviving our fake tensor propagation rules caching idea.
Chien-Chin told us about the new PT2 DDP plans. We cannot directly trace DDP because it is implemented in C++, and we cannot easily port it to Python because the implementation is complicated by bucketing. So the idea is to implement a Python non-bucketed DDP, and rely on compile to optimize it away.
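A hypothetical sketch of what such a Python non-bucketed DDP could look like (class name and structure are mine, not the actual plan’s code): allreduce each gradient as it is produced, and rely on the compiler to recover bucketing and overlap.

```python
import torch
import torch.distributed as dist

class NaiveDDP(torch.nn.Module):
    """Unbucketed, pure-Python DDP sketch: one allreduce per parameter gradient."""

    def __init__(self, module: torch.nn.Module):
        super().__init__()
        self.module = module
        self.world_size = dist.get_world_size()
        for p in self.module.parameters():
            if p.requires_grad:
                p.register_hook(self._allreduce)  # fires with the grad for this parameter

    def _allreduce(self, grad: torch.Tensor) -> torch.Tensor:
        dist.all_reduce(grad)              # no bucketing: the compiler is expected to fuse/overlap
        return grad / self.world_size      # returned tensor replaces the gradient

    def forward(self, *args, **kwargs):
        return self.module(*args, **kwargs)
```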
Horace told us about developments in LLMs. One thing he wants is dequant primitives in PT2: a way to take int3/int4 packed values and unpack them into a larger tensor, with the idea that PT2 would compile away the memory traffic. In general he doesn’t think we should directly do this in PT, as there are so many quantization formats.
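A hedged sketch of the kind of dequant primitive being described, assuming two int4 values packed per uint8 byte and a per-tensor scale/zero-point (the layout and names are illustrative, not a real format):

```python
import torch

def dequant_int4(packed: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    lo = packed & 0x0F                    # low nibble of each byte
    hi = (packed >> 4) & 0x0F             # high nibble
    vals = torch.stack([lo, hi], dim=-1).flatten(start_dim=-2)  # restore int4 element order
    return (vals.float() - zero_point) * scale

packed = torch.randint(0, 256, (4, 8), dtype=torch.uint8)
w = dequant_int4(packed, scale=0.05, zero_point=8)  # the hope: PT2 fuses this into the consuming matmul
```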
Dynamic shapes
Last week I mentioned opcheck testing is usable, but Richard Zou is still evolving it based on user feedback. A recent change is to put the xfails into a JSON file so it can easily be automatically updated. However, there are still complaints from folks that it’s too hard to understand what goes wrong when a test crashes. Richard is going to investigate a two-stage process now, whereby we separate generation of test inputs from actually running the tests. To ensure generation of test inputs is kept up to date, we only need a single new test which runs all of the tests in the test file in one go and xrefs which tests are exercised against what we have recorded.
Horace wants a version of Tensor where some of the sizes are stored on device. This would allow you to perform a data-dependent operation without synchronizing; and you would still save on memory traffic because you would have kernels mask out memory loads when they go out of bounds of the dynamic shape. In some sense, this is a specialization of jagged tensor where everything in the jagged dimension has the same size.
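A small sketch of the idea (my own illustration): pad to a maximum size, keep the true length as a tensor on the same device, and mask out-of-bounds elements instead of synchronizing to learn the actual size.

```python
import torch

def masked_sum(padded: torch.Tensor, length: torch.Tensor) -> torch.Tensor:
    # `length` is a 0-d integer tensor living on the same device as `padded`
    idx = torch.arange(padded.shape[0], device=padded.device)
    mask = idx < length                  # comparison stays on device; no host sync
    return (padded * mask).sum()

x = torch.randn(1024, device="cuda")
n = (x > 0).sum()                        # some data-dependent "size", produced on device
print(masked_sum(x, n))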