State of symbolic shapes branch

State of symbolic shapes: Jul 4 edition

Previous update: State of symbolic shapes branch - #58 by ezyang

Executive summary

This is a little more than two weeks' worth of updates, covering PSC week, Edward's vacation, and the July 4th holiday.

CI skips. -3, -1, -1, -2 (no change).

Training dashboard (as of 7ae100628e). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 89%, 57/64 → 91%, 58/64 | 98%, 45/46 | 100%, 60/60 | 88%, 7/8 → 100%, 8/8 |
| Speedup | 1.11x → 1.08x | 1.59x → 1.58x | 1.19x → 1.21x | 1.30x |
| Comptime | 67s → 78s | 99s → 152s | 110s → 134s | 31s → 78s |
| Memory | 0.94x → 0.80x | 1.00x → 1.01x | 1.01x → 1.00x | 1.59x → 0.76x |

Inference dashboard (as of 7b3242d5f7). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 88%, 63/72 → 86%, 63/73 | 100%, 46/46 | 100%, 60/60 | 58%, 7/12 |
| Speedup | 1.52x → 1.53x | 1.64x | 1.72x → 1.73x | 1.92x → 1.96x |
| Comptime | 24s → 28s | 38s → 45s | 30s → 34s | 45s → 53s |
| Memory | 0.82x → 0.67x | 1.15x → 1.11x | 1.06x → 0.84x | 1.11x → 0.86x |


State of symbolic shapes: Jul 9 edition

Previous update: State of symbolic shapes branch - #60 by ezyang

Executive summary

CI skips. -3, -1, -1, -2 (no change).

Training dashboard (as of dd6c38cb59). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 91%, 58/64 → 89%, 57/64 | 98%, 45/46 | 100%, 60/60 → 97%, 58/60 | 100%, 8/8 → 88%, 7/8 |
| Speedup | 1.08x → 1.11x | 1.58x → 1.60x | 1.21x → 1.20x | 1.30x |
| Comptime | 78s → 97s | 152s → 124s | 134s → 178s | 78s → 40s |
| Memory | 0.80x | 1.01x → 0.97x | 1.00x | 0.76x → 0.73x |

  • vision_maskrcnn went back to failing, seems flaky. :person_shrugging:
  • eca_botnext26ts_256 and mobilevit_s timed out due to translation validation being enabled. #104654 fixed it (to be visible in the next perf run). The compilation time increase also appears to be due to TV.

Inference dashboard (as of dd6c38cb59). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 86%, 63/73 | 100%, 46/46 → 98%, 45/46 | 100%, 60/60 | 58%, 7/12 |
| Speedup | 1.52x | 1.65x → 1.64x | 1.73x | 1.92x → 1.96x |
| Comptime | 28s | 44s | 34s | 53s |
| Memory | 0.67x | 1.11x | 0.84x | 0.86x |

  • GPT2ForSequenceClassification is having some trouble across the board on all configurations; it’s currently failing accuracy.

What’s next?

  • Edward: Keep helping HF on their llama optimization; two level guards for backwards

State of symbolic shapes: Jul 15, 2023 edition

Previous update: State of symbolic shapes branch - #61 by ezyang

Executive summary

  • Dynamic shapes now support mode=“reduce-overhead” (CUDA graphs). Conventional wisdom was that dynamic shapes are incompatible with CUDA graphs, because any given CUDA graph recording can only work for a single static shape, and CUDA graphs' requirement of hard-coded memory addresses means that each CUDA graph takes up quite a lot of CUDA memory. However, this conventional wisdom is wrong: (1) multiple CUDA graphs can share the same memory pool, as long as you don’t have any live tensors from one pool to the next (this is precisely what CUDA graph trees by @eellison implements), and (2) recording a CUDA graph is much, much cheaper than running the entire PT2 compilation stack, so it is profitable to compile a dynamic program once and then CUDA graph it multiple times. Enable cuda graphs for dynamic shapes by ezyang · Pull Request #105064 · pytorch/pytorch · GitHub realizes these gains and switches our dynamic shapes benchmark configuration to use CUDA graphs, resulting in hefty performance gains with only a modest increase in compile time. Importantly, these benchmarks cover our _generate inference benchmarks, which actually make use of multiple sizes as sequence length varies. There’s more to be done here: our memory usage for this use case can be suboptimal, because the caching allocator doesn’t know that it’s OK to waste space for small allocations by fitting them inside larger allocations for a larger dynamic size. We also observed that folks using this CUDA graphs trick tend not to generate CUDA graphs for every size, but instead prefer to linearly sample sizes and pad; we should make it easier to do this (perhaps with a padding tensor subclass.) One cool result is a 6x performance improvement on cm3leon, a newly announced multi-modal model from Meta AI. The first sketch after this list shows what this configuration looks like from the user side.
  • New API: torch._dynamo.maybe_mark_dynamic. This API lets you suggest that we should try compiling a tensor dynamically, but it doesn’t raise an error if the tensor gets specialized anyway (unlike mark_dynamic, which does). See the second sketch after this list.
  • Infer valid input sizes from programs. Horace has wanted this for some time, and with Yukio’s recent Z3 translation validation work landed, it turned out to be pretty easy to write a PoC to exhaustively search the space of valid inputs, using guards to turn us away from portions of the space we’ve seen before. Check it out at dinfer.py · GitHub. If anyone is interested in productionizing this, it would be a neat little project to (1) put this code in PyTorch and put a nicer API on it (note that as written, you have to specify the input dimensions and dtypes of input tensors, so you’ll need to figure out a good way of specifying or inferring this info), (2) improve the solving code to minimize the generated sizes for an equivalence class, and (3) use it for something cool; e.g., you could use it to automatically generate sample inputs for OpInfo tests. Tag me (@ezyang) as reviewer if you send a PR!
  • Enabling automatic_dynamic_shapes in fbcode, for real this time. It turns out that I failed to actually turn things on in fbcode last time, so actually do it for real this time: Switch automatic_dynamic_shapes to True by default in fbcode. This got reverted once for breaking an internal model unit test (Incorrect ValueRanges analysis · Issue #105097 · pytorch/pytorch · GitHub, fixed by Perform value range analysis with rationals when possible by lezcano · Pull Request #105137 · pytorch/pytorch · GitHub, thanks @Lezcano for the speedy fix.) At time of writing, the PR has not actually hit fbcode yet.
  • lit_llama is finally landed in torchbench. https://github.com/pytorch/benchmark/pull/1730 At time of writing this model is in canary models because the weight download is a little flaky. This is the only 7B model in our benchmark suite and there’s a bit of pain associated with this; for example, we can’t run accuracy tests on this model, because accuracy tests are implemented by holding two copies of the model in memory, which we can’t do at 7B parameters.
  • Notable bug fixes.
  • Notable new bugs.
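For reference, here is a minimal sketch of the CUDA graphs + dynamic shapes configuration from the user side (the model and sizes below are made up for illustration; the benchmark suite wires this up differently):

```python
import torch

class TinyModel(torch.nn.Module):  # placeholder model, not from the benchmark suite
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().cuda()

# dynamic=True compiles a single dynamic-shape graph; mode="reduce-overhead" then
# records a (comparatively cheap) CUDA graph per distinct size, sharing one memory pool.
compiled = torch.compile(model, mode="reduce-overhead", dynamic=True)

for seq_len in (128, 256, 512):  # e.g. sequence length varying across iterations
    x = torch.randn(seq_len, 64, device="cuda")
    out = compiled(x)
torch.cuda.synchronize()
```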
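And a sketch of the new hint API next to the existing hard-constraint API (the tensor shapes and dimension index here are arbitrary examples):

```python
import torch
import torch._dynamo  # explicit import, to be safe

x = torch.randn(8, 32)
y = torch.randn(8, 32)

# Hard constraint: error out if dim 0 ends up specialized to a constant.
torch._dynamo.mark_dynamic(x, 0)

# Soft hint: try to compile dim 0 dynamically, but silently accept
# specialization if guards force it.
torch._dynamo.maybe_mark_dynamic(y, 0)

@torch.compile
def f(t):
    return t * 2

f(x)
f(y)
```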

CI skips. -3, -1, -1, -2 (no change).

Training dashboard (as of 7b4d080496). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 89%, 57/64 → 92%, 59/64 | 98%, 45/46 → 96%, 44/46 | 97%, 58/60 → 98%, 59/60 | 88%, 7/8 → 100%, 8/8 |
| Speedup | 1.11x → 1.52x | 1.60x → 1.66x | 1.20x → 1.27x | 1.30x → 1.93x |
| Comptime | 97s → 86s | 124s → 120s | 178s → 142s | 40s → 42s |
| Memory | 0.80x | 0.97x | 1.00x → 1.01x | 0.73x → 0.69x |

  • Now passing: hf_Longformer (this used to fail with ValueError: Cannot view a tensor with shape torch.Size([4, 12, 1024, 513]) and strides (6303744, 513, 6156, 1) as a tensor with shape (48, 4, 256, 513); the fix is thanks to Brian Hirsh finally landing his AOTAutograd longformer fix; a minimal illustration of this kind of view/stride failure follows this list), vision_maskrcnn (flaky), eca_botnext26ts_256 and mobilevit_s (these used to time out; maybe the speedup from CUDA graphs was enough to get them under the timeout again)
  • Now failing: DebertaV2ForQuestionAnswering (failing accuracy due to cudagraphs, failing on inductor_with_cudagraphs too), cait_m36_384 (OOMing on accuracy due to increased CUDA graph memory usage)
  • Speedups: The majority of our speedups are due to the enablement of CUDA graphs for dynamic shapes. Some notable models and their speedups: BERT_pytorch (1.7698 → 3.3071), hf_GPT2 (1.7728 → 2.0056), basic_gnn_gin (1.3151 → 2.4841). The improvements on HF and TIMM models are much more modest since these are not super overhead bound models. Note that these numbers are still behind inductor_with_cudagraphs, because we are still losing some optimizations from running the PT2 compiler stack without static shapes.
  • Slowdowns: dlrm (infra failure due to cudagraphs, failing on inductor_with_cudagraphs too), hf_T5 (2.0252 → 1.8939, oddly enough; could this be due to memory pressure? But even more weirdly, hf_T5_large improved perf)
  • Comptime/Memory: By and large compilation time did not increase, but for our training setup this is expected, as we only run at one batch size, so we are simply measuring the cost of a single CUDA graph recording. As expected, the memory compression ratio gets worse, due to the standing allocation from CUDA graphs.
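For the hf_Longformer item above, the shape and strides in the old error message are exactly what a transpose produces, so the failure mode can be reproduced in isolation (this is an illustration of the view/stride mismatch, not the Longformer code itself):

```python
import torch

# A contiguous (4, 1024, 12, 513) tensor transposed on dims 1 and 2 has
# shape (4, 12, 1024, 513) and strides (6303744, 513, 6156, 1), matching the error.
x = torch.randn(4, 1024, 12, 513).transpose(1, 2)

try:
    x.view(48, 4, 256, 513)      # fails: this layout cannot be expressed as a view
except RuntimeError as e:
    print("view failed:", e)

y = x.reshape(48, 4, 256, 513)   # reshape falls back to a copy and succeeds
```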

Inference dashboard (as of 7b4d080496). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 86%, 63/73 → 88%, 64/73 | 98%, 45/46 | 100%, 60/60 | 58%, 7/12 |
| Speedup | 1.52x → 1.50x | 1.64x → 1.76x | 1.73x → 1.62x | 1.96x → 2.94x |
| Comptime | 28s → 36s | 44s → 46s | 34s | 53s → 72s |
| Memory | 0.67x → 0.68x | 1.11x | 0.84x → 0.85x | 0.86x → 0.87x |

  • Now passing: hf_Longformer (see training above)
  • Speedups: torchbench numbers are actually a huge mixed bag. Here are some of the wins: BERT_pytorch (2.2317 → 2.4529), basic_gnn_edgecnn (1.7809 → 1.8732; note that for some reason many of the GNN variants are failing performance, but not accuracy, on inference), cm3leon_generate (1.3037 → 5.7822, WOW! This is consistent with some perf analysis Armen and I did months ago, where I concluded that cm3leon was hella overhead bound), hf_T5_generate (2.2619 → 8.2081), hf_T5_large (3.1690 → 5.1747)
  • Slowdowns: A lot more models did worse with CUDA graphs enabled, including LearningToPaint (1.9209 → 1.6812), resnet18 (1.7779 → 1.4028), shufflenet_v2_x1_0 (1.9882 → 1.6010), squeezenet1_1 (1.8625 → 1.0040), yolov3 (2.0997 → 1.8843). It’s not entirely clear what’s going on here, but we will note that there was a sizable dip in CUDA graphs performance without dynamic shapes on torchbench this week too. There is an across-the-board performance regression on TIMM models (and a slight regression on HuggingFace too.)
  • Comptime/Memory: Comptime generally got worse across the board, but not too much worse. Particularly notable are the generate models: hf_T5_generate (881 → 1032), cm3leon_generate (131 → 203). CUDA graphs is not free, but given that we’re running at much more than two sequence lengths, you can see the bulk of the compile cost is the PT2 stack. For the most part, memory usage stayed fairly stable, interestingly enough.

What’s next?

  • I think I want to investigate the memory planning situation with CUDA graphs a bit more; I also think it’s a good time to teach Inductor how to deal with data-dependent ops (without having to graph break on them.)

State of symbolic shapes: Jul 22, 2023 edition

Previous update: State of symbolic shapes branch - #62 by ezyang

Executive summary

CI skips. -3, -1, -1, -2 (no change).

Training dashboard (as of 0ad93a3d56). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 92%, 59/64 | 96%, 44/46 | 98%, 59/60 | 100%, 8/8 |
| Speedup | 1.52x → 1.54x | 1.66x → 1.69x | 1.27x → 1.28x | 1.93x → 1.97x |
| Comptime | 86s → 81s | 120s → 107s | 142s | 42s → 38s |
| Memory | 0.80x → 0.79x | 0.97x → 0.96x | 1.01x | 0.69x |

Not really much to say; the slight improvements appear to be within noise.

Inference dashboard (as of 0ad93a3d56). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 88%, 65/74 | 98%, 45/46 | 100%, 60/60 | 58%, 7/12 |
| Speedup | 1.50x → 1.55x | 1.76x → 1.78x | 1.62x → 1.79x | 2.94x → 3.03x |
| Comptime | 36s → 35s | 46s → 44s | 34s → 36s | 72s |
| Memory | 0.68x | 1.11x | 0.85x → 0.84x | 0.87x |

What’s next

  • CUDA graphs memory planning is lower priority for now (@eellison may take a look, but higher priority is actually being able to turn on CUDA graphs in prod situations; a big problem here is when we fail to compile the entire extent of the model, causing CUDA graphs to increase overall memory usage.) It looks like we definitely need data-dependent op support in inductor though, based on sparse arch investigation.

State of symbolic shapes: Jul 29, 2023 edition

Previous update: State of symbolic shapes branch - #63 by ezyang

Executive summary

CI skips. -3, -1, -1, -2 (no change).

Training dashboard (as of 1da4115702). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 92%, 59/64 | 96%, 44/46 | 98%, 59/60 | 100%, 8/8 |
| Speedup | 1.54x → 1.56x | 1.69x | 1.28x → 1.35x | 1.97x → 2.04x |
| Comptime | 81s | 107s → 108s | 142s | 38s → 39s |
| Memory | 0.79x | 0.96x | 1.01x | 0.69x |

Inference dashboard (as of 1da4115702). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 88%, 65/74 | 98%, 45/46 | 100%, 60/60 | 58%, 7/12 |
| Speedup | 1.55x → 1.54x | 1.78x → 1.77x | 1.79x → 1.80x | 3.03x → 3.08x |
| Comptime | 35s → 36s | 44s → 45s | 36s | 72s → 75s |
| Memory | 0.68x | 1.11x | 0.84x → 0.85x | 0.87x |

Looks all within noise.

What’s next

  • Rewriting export input/output spec flattening
  • Irrefutable guards
  • Generally more pushing on KJT stuff

State of symbolic shapes: Aug 6, 2023 edition

Previous update: State of symbolic shapes branch - #65 by ezyang

Executive summary

  • More on KJT/torchrec. I had a nice discussion with Dennis van der Staay about torchrec and work on sparse arch. Some new information: (1) this workstream is almost certainly going to involve distributed later, because applying PT2 to post-torchrec sharded models is going to involve tracing past communication primitives; this also implies I’m going to want to get FakePG working on torchrec, (2) working on unit tests should be a pretty good idea, but there’s still some basic infra work to do (laid out last week), and (3) we're not really expecting concrete performance improvements, as sparse arch is typically going to be communication bound, so this is mostly a “we think this is promising, and the investment is not too big, because we’ve already done so much with dynamic shapes so far.”
  • Pre-dispatch export. We’ve agreed to allow QAT to publish, in the short term, a new export interface that produces a pre-dispatch FX graph with ATen operators, suitable for graph transformations and training. The long-term goal is to have pre-dispatch functionalization, which is the invariant the export team wants, so that this can be worked into torch.export proper. Pre-dispatch will generate an ExportedModule so that the APIs match.
  • Fake export. Export now supports exporting entirely fake modules/inputs. This means that to export a model you don’t have to actually load its weights into memory; you can load it in a fake mode and still export it. As a result, we have some delicate code in Dynamo for dealing with two concurrent fake modes (but it’s not so bad: the outer fake mode is typically disabled while we do Dynamo analysis.) Only ONNX supports torch.load’ing models in fake mode at the moment.
  • Improved user stacks in Dynamo. torch._guards.TracingContext.extract_stack() now always accurately reports a user stack from anywhere in Dynamo, and we reliably use it for reporting real stacks for exceptions (previously, they used an entirely different mechanism.)
  • Improved error messages for non-local inputs in export. See Improve error message when export encounters non-local input for the details. This isn’t complete; the follow-up is to make this work for outputs as well, and to work a little harder on the pytree representation (probably this week.)
  • Dynamo change in attitude. Many folks are concerned that Dynamo is just “endless” bugs. I pitched Animesh and Voz on a new attitude to fixing Dynamo bugs, which is that we should imagine the platonic ideal implementation of Dynamo as a faithful reimplementation of CPython in Python. Then, fixing a bug should not just be moving code around to fix a particular problem, but instead improving the local vicinity of the code to bring it closer in line with this ideal. An example I used a lot when explaining this was dict.keys support (the quick bug fix is changing its type from tuple to set; the real fix is to accurately model dict views; see the sketch after this list.) To do this well, you need to regularly look at CPython code, and Dynamo may need to grow some new abstractions (perhaps a proper implementation of Python’s object model, or Python traceable polyfills).
  • Notable new bugs.
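To make the dict.keys example concrete: neither a tuple nor a set faithfully models a CPython dict view, because views are live, ordered, and support set operations. A quick sketch of the semantics Dynamo would have to model:

```python
d = {"a": 1, "b": 2}
keys = d.keys()            # a dict_keys view, not a materialized container

d["c"] = 3
print(list(keys))          # ['a', 'b', 'c']: the view reflects later mutation

print(keys & {"a", "c"})   # {'a', 'c'}: supports set operations (a tuple does not)
print(list(keys)[0])       # 'a': insertion-ordered (a set is not)
```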

Numbers

As we’re not really doing much on performance numbers recently, I am simplifying this section.

Training. 68cb854d73 Dashboard

Nothing much to report.

Inference. 68cb854d73 Dashboard

The big perf increase in torchbench is due to maml getting removed from the benchmark set (it slows down a lot under PT2 and was depressing the score). clip, hf_Whisper, llama_v2 are new models added thanks to @msaroufim !

What’s next?

There are a lot of things that need doing

  • Finish overhauling export input/output pytree matching (probably not dumping the pytree in/out spec, but I think if we tree_map into positional identifiers we can reliably detect KJT missing situations)
  • Make unbacked SymInts work in Inductor gist:1293a41299604c44310341b7540eabcb · GitHub (biggest problem is unbacked SymInt binding in wrapper codegen and the hinting logic)
  • Irrefutable guards
  • Write up the plan for sparse arch / KJT
  • Land pytree support for KJT/JT
  • 0/1 specialization suppression for list of int in KJT

Stuff that probably can wait until later?

  • Host side torch.cond
  • DynTensor

State of symbolic shapes: Aug 12, 2023 edition

Previous update: State of symbolic shapes branch - #66 by ezyang

Executive summary

I’m trying something a little different, expanding the update to cover a wider variety of topics beyond dynamic shapes, mostly centered around things that I personally have involvement in (this is a lot of things, so you should be getting pretty good coverage this way!)

Benchmarking

  • Inductor CI/perf is upgraded to CUDA 12 / gcc 9. This doesn’t seem to have any appreciable effect on perf, but we did it so we could do the next item.
  • torchrec_dlrm is back. They were disabled a few months ago because of fbgemm nightly related flakiness. The flakiness has been resolved by building fbgemm/torchrec from source in the Docker image. These are now installed as part of the general torchbench installs, and should help some of the work we are doing on jagged tensors (since many important operators are currently implemented in fbgemm).
  • Algorithmic efficiency. Frank Schneider posted about how PyTorch was slower than JAX in their upcoming algorithmic-efficiency benchmark suite. A bunch of us, spearheaded by @msaroufim, jumped in to take a look at what was going on. Status updates at https://docs.google.com/document/d/1okqKS32b0EhWQSFFoSV6IjGlYM4VhNYdxBPjdlFIw5w/edit (Meta-only). I personally have an interest in the dlrm side of things, since I’ve been working on sparse arch recently; after fixing some mild bugs, I was able to show parity on criteo1tb dlrm between PyTorch nightly and JAX on an A100x8 (PyTorch score: 7703.403180360794, JAX score: 7703.041719198227), although the number of evals varied, so I’m not sure if this is a threat to validity. Unfortunately, this does not necessarily help their problem, which was an OOM. To make further progress on this, we may need some tools to help us understand why torch.compile memory usage is higher.

Export

Distributed

  • Tracing FSDP. @voz wrote a post Redirecting... (Meta-only) about the state of tracing FSDP in Dynamo. The key info is that on a branch, he can trace everything through and get results identical to eager on a single forward-backward. There are a lot of fixes that need to land to main; from his post:
    1. The value of various small changes to FSDP to make this work vs adding fixes in dynamo (Pretty easy, preferring dynamo ofc but for some mostly no op shuffling, we do FSDP as well)
    2. TypedStorage - is it tensor-like/tensor-associated enough to go in the graph? Do we need to add some ops for doing tensor typed storage data ptr comparison / checking free, etc?
    3. Working through the cudastream story, in particular around wait_stream and such
    4. Lots of little bug fixes here and there
    5. Coverage for missing comparisons, bytecode ops, general coverage gaps like attr access on FSDP modules, setting data on a tensor, etc.
  • pytrees slow again for DTensor. Junjie and Rodrigo have been trying to improve DTensor’s eager perf, and we spent the first half of composability sync talking about it. Rodrigo had a hack to pre-compile pytree applications into Python code, but apparently this doesn’t help that much: gist:5427cabfab6421d4e104905345f94a50 · GitHub. Another suggestion from the meeting was that after Brian’s subclass support lands, maybe you could torch.compile each op individually with backend=“eager” (a rough sketch of this suggestion follows this list).
  • Data-dependent all2all. Will Feng got all2all collective working in inductor https://github.com/pytorch/pytorch/pull/106655/ This is notable because all2all collective has data-dependent output shape. It looks like unbacked symints worked here!
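As a rough illustration of the suggestion above to torch.compile each op individually with backend="eager" (purely a sketch; none of DTensor's dispatch plumbing is shown, and whether this actually helps depends on the subclass support landing):

```python
import torch

# backend="eager" runs Dynamo's capture + guard machinery but skips Inductor,
# which is the relevant part if the goal is just to cache away per-op Python
# overhead (e.g. pytree flattening) rather than to generate fused kernels.
@torch.compile(backend="eager")
def add_op(a, b):
    return torch.add(a, b)

x, y = torch.randn(4), torch.randn(4)
add_op(x, y)   # first call traces and installs guards
add_op(x, y)   # subsequent calls hit the compiled cache
```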

Custom ops

  • Custom ops. Richard tells me he is going to add a class-based API for custom ops, to make it easier to define everything all in one place. More on this soon I assume!
  • Custom op testing. https://github.com/pytorch/pytorch/pull/106903 is here to make it easier to retrofit pre-existing test suites to also test for important operator properties.

Nested/jagged tensor

Dynamo

  • Pivot on per-NN module caching. @anijain2305 is working on having a separate code cache per NN module, but on Friday, with the help of @voz, we realized that the problem is actually separable into two pieces: (1) an enhanced cache size limit policy that knows about NN modules [RFC][dynamo] Separate cache sizes for nn module guard specialization by anijain2305 · Pull Request #107077 · pytorch/pytorch · GitHub and (2) improvements to cache lookup when there are a lot of cache entries (guard trees).
  • Dynamo eager mode cond. Yidi Wu: to support cond in eager mode, we plan to torch.compile the entire cond operator, manufacturing fresh code objects to ensure that the caches don’t interfere with each other. https://docs.google.com/document/d/1esmHEa0fiktiSw1lvRsPmsbnTYxDSc0t3V9V5V0xK7I/edit#heading=h.pajqpbewbdg7 (Meta-only)
  • Time to get rid of functional VariableTracker? VariableTracker in Dynamo is an immutable data structure: when a mutation happens, you allocate a fresh VariableTracker and then replace old VariableTrackers with the new one. This is because we have checkpointing functionality that is used to rewind old VariableTrackers. However, this is a bit of a pain from the modeling side, as every Python data structure has to be reimplemented to have purely functional operations. An alternate design is to allow direct mutation of VariableTrackers. To do checkpoints, we simply restart Dynamo analysis to “go back in time” by stopping execution at the point where we would have checkpointed (a deepcopy could also work, though I’m not a fan.) Speculate subgraph would be implemented by simply denying all mutations or doing some crazy thermometer continuation thing. This would help make Dynamo more metacircular and reduce the work needed to support new container types, of which we often need to support a lot.

Dynamic shapes

  • expect_true irrefutable guards. I talked through this in the last 20min of composability sync. Check https://github.com/pytorch/pytorch/pull/106720 ; this is enough to make splits on unbacked SymInts work.
  • Boolean masking, at last. @yanboliang is looking into a pre-autograd FX transform that replaces boolean mask updates with torch.where calls (see the sketch after this list for the tensor-level equivalence). One annoying detail is how to deal with Dynamo tracing the boolean masks in the first place, given that Inductor can’t handle boolean masks it can’t eliminate. Our idea, in lieu of fixing Inductor to work with data-dependent shapes (which we are working on), is to attempt to eliminate all data-dependent ops in a pre-dispatch pass, and if that is not possible, restart Dynamo analysis saying “you need to graph break on this op next time.”
  • Notable fixes.
  • Notable new bugs.
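The boolean masking rewrite described above is, at the tensor level, the following equivalence (a hand-written sketch of what the pre-autograd pass would produce; the actual FX transform is still being designed):

```python
import torch

def masked_update_eager(x, mask, value):
    # Data-dependent form: x[mask] has a shape that depends on mask's contents.
    out = x.clone()
    out[mask] = value
    return out

def masked_update_where(x, mask, value):
    # Shape-static form: torch.where keeps the full shape, so the compiler
    # never sees a data-dependent intermediate.
    return torch.where(mask, torch.full_like(x, value), x)

x = torch.randn(8)
mask = x > 0
assert torch.equal(masked_update_eager(x, mask, 0.5),
                   masked_update_where(x, mask, 0.5))
```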

Numbers

Training. 03414081ff Dashboard

  • Some accuracy regressions. torchbench: hf_BigBird, vision_maskrcnn (flaky). It’s not clear what broke hf_BigBird; possibly the CUDA 12 upgrade. Need to investigate. AlbertForQuestionAnswering improved accuracy!
  • The huge perf improvement across the board is thanks to Peter Bell’s work https://github.com/pytorch/pytorch/pull/106747 optimizing split reductions. This is not full runtime split reductions: instead, Peter uses whatever the hint was at the time we compiled to plan the split reduction, and then we use it for all subsequent runs. This makes it more important to warm up Inductor with the “right” size hint to start; see also Padded tensor subclass · Issue #105325 · pytorch/pytorch · GitHub; another user also complained about other cases where we made suboptimal decisions when the first kernel we compiled with wasn’t representative.

Inference. Dashboard 03414081ff

  • A lot of change on the last day; some improvements and some regressions (but mostly regressions). Maybe CUDA 12 update related; need to check. hf_BigBird is also failing here. RobertaForQuestionAnswering is failing accuracy now.

State of PT2: Aug 20, 2023 edition

Previous update: State of symbolic shapes branch - #66 by ezyang

Executive summary

Public service announcements

  • Trying to understand how to read PT2 logs? Check out Logging docs - Google Docs for some quick orientation. (In other news, guards logging this week has been improved to show you which line of user code caused a guard to be added! Take a look and don’t be afraid to give feedback on it.)
  • Have you ever wanted to store lots of tracebacks for logging/debugging purposes, but were afraid it might be too expensive to do so by default? There is a new class in torch.utils._traceback called CapturedTraceback which makes it extremely fast to save a Python traceback (something like 20x faster than running a full traceback.extract_stack()), so it should change the calculation about whether or not you are willing to store tracebacks by default. We have already used this to keep fine-grained information about guard provenance, to start. Note that CapturedTraceback DOES hold references to code objects, and these references can cause cycles (because of co_extra), so good practice is to make sure you clear these traces once you know you no longer need them. A usage sketch follows this list.
  • I spent some time debugging reference cycles this week (due to CapturedTraceback), and Alban pointed me at objgraph for visualizing references/referents. It’s pretty cool, you should definitely check it out.
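A quick sketch of using CapturedTraceback (the extract()/format() names follow torch/utils/_traceback.py; treat the exact signatures as an assumption):

```python
from torch.utils._traceback import CapturedTraceback

events = []

def record_event(log):
    # Capturing is cheap: no string formatting happens at this point.
    log.append(CapturedTraceback.extract())

record_event(events)

# Formatting is deferred until you actually need the traceback text.
for tb in events:
    print("".join(tb.format()))

# CapturedTraceback holds references to code objects (and can form cycles via
# co_extra), so clear the traces once you no longer need them.
events.clear()
```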

Composability sync https://www.youtube.com/watch?v=LmkFkOBwhks

Inside baseball

Distributed

  • Handling backward hooks in Dynamo is kind of difficult. There is a discontinuity between hooks on inputs and hooks on intermediates; hooks on intermediates, in particular, have to somehow be reflected in whatever graph gets differentiated by autograd, but at the same time these hooks may have arbitrary Python bits that need handling by Dynamo. It seems the problem is made easier if we have combined forward-backward tracing in Dynamo, at which point Dynamo knows enough about the backward structure to bypass AOTAutograd entirely. It might also be possible to just do cuts prior to going to AOTAutograd, though that will impede optimization. It might be possible to bypass this problem for FSDP if hooks are only on parameters and outputs. Lots of difficulties…

Dynamic shapes

Numbers

Training. 68b9bf9671 Dashboard

Not too much action this week. However, there was a bit of flakiness on the inside; may need some follow-up. torchrec_dlrm, added last week, has shown up with dynamic shapes but is failing, so it still needs work. The early-week TIMM improvement is from [inductor] make thread order consistent with loop order by shunting314 · Pull Request #106827 · pytorch/pytorch · GitHub. Some of the later-week TIMM improvement may be from Make Nd tensors hit fused addmm pass by eellison · Pull Request #106911 · pytorch/pytorch · GitHub (but stats were not run on that PR.)

dlrm is now passing on dynamic shapes, which is cool. RobertaForQuestionAnswering was fixed (not clear what fixed it; it’s in the 35cca799ff42182a1b7f1ee4d0225ee879b7c924…384e0d104fd077d31efafc564129660e9b7a0f25 range). Some other wins (and some regressions, most importantly sam) came from Unfuse bias add before pointwise ops by eellison · Pull Request #106912 · pytorch/pytorch · GitHub, along with some other unexplained changes like convnext_base and jx_next_base in this same commit range (which sort of makes sense, as @eellison landed a bunch of perf-related changes).

State of PT2: Sep 8, 2023 edition

Previous update: State of symbolic shapes branch - #67 by ezyang

We were on break for two weeks because I went on vacation, and I didn’t have time to do a report before/after vacation lol.

Executive summary

  • PyTorch 2.1 branch cut. The cut was three weeks ago (right when I went on vacation lol) and we’re reaching the end of the cherry-pick window. Track ongoing cherry picks at: https://github.com/pytorch/pytorch/issues/108055
  • Blueberries offsite was this week! The blueberries workstream is focused on accelerating SOTA transformer models using PT2, quantization, sparsity and other techniques. Some highlights: MFU is coming to the benchmark suite, some direct improvements to important models, int8 dynamic quantization with tensor subclasses. Many of these are not published yet, keep your eyes peeled at PTC!
  • PyTorch conference registration filling up fast. If you want to go and haven’t registered yet, you should register at PyTorch Conference | Linux Foundation Events

Composability sync

  • Aug 24 https://www.youtube.com/watch?v=H6EUSsvDmbw - we spent time going over recent KJT progress (to be reduxed below), and Voz reported progress on tracing FSDP with hooks (also to be reduxed below)
  • Aug 31 - not livestreamed publicly, I wasn’t there, but apparently there was some discussion about streams for tracing FSDP (no minutes alas)

Distributed and PT2

  • Tracing FSDP. Voz is deep in the weeds on backwards hooks support. We are attempting to implement hooks in a way that doesn’t require consolidated forward-backwards. The general strategy is (1) have Dynamo emit graphs that have register_hook calls on intermediates (register_hook calls on inputs must not go in the graph; they have to happen as part of residuals), (2) write these register_hook calls in such a way that when AOTAutograd runs, the actual hook code (which is arbitrary Python code and is not safe to run in tracing) is not run, but instead we run a meta function (which performs any needed metadata mutation) and then insert a call to the original Python function (which will show up in backwards), and (3) have compiled backwards take care of compiling this call in the end.
  • Per parameter FSDP is looking pretty legit. Andrew Gu has been looking at the performance of per-parameter sharding (where parameters managed by FSDP aren’t shoved into a single flat buffer) and has found that we only really pay a penalty of 5% with per-parameter sharding but get better memory usage. Meta only: Redirecting...
  • DDP optimizer brittleness. We currently support pipelining DDP code with PT2 by manually splitting graphs into multiple AOTAutograd functions so that backwards isn’t run too soon. The code here is kind of janky: I ran into two separate bugs that only happened when optimize_ddp was on: [DDP PT2] TypeError: convert_frame_assert.<locals>._convert_frame_assert() missing 2 required positional arguments: 'hooks' and 'frame_state' · Issue #107637 · pytorch/pytorch · GitHub and [optimize_ddp] moco - NameError: name 's2' is not defined · Issue #108877 · pytorch/pytorch · GitHub. Pritam has also been complaining about the graph break strategy: torch.compile graph breaks should be independent of DDP buckets · Issue #108966 · pytorch/pytorch · GitHub. Will tells me that Chien-Chin is working on some new DDP strategy, but it appears to be centered around starting with a non-parallelized graph. Hopefully we can present it at composability this week. Note that DDP cannot be easily traced as it is implemented in C++.

Dynamic shapes

Inductor fun

  • Peter Bell is very close to landing inductor IR support for scan https://github.com/pytorch/pytorch/pull/106581 which allows for native cumsum/cumprod support. Now all we need is for someone to add a higher order op that feeds into this and we will have torch.scan!
  • Someone should add a “realize” operator to PT2, which would force materializing a tensor rather than allowing fusions across it. Christian Puhrsch would find this useful for ensuring epilogue fusion occurs on int8 mm (today, regular fusion causes the pointwise operation to get fused into a later reduction, instead of fusing the pointwise into the matmul)
  • ABI compatibility for AOT Inductor continues to proceed, slowly; one agreement is that we’re probably going to have only the ABI-compatible codegen for OSS as well.

Performance

  • Flash Attention 2 is close to landing: Flash Attention v2 by drisspg · Pull Request #105602 · pytorch/pytorch · GitHub but it is currently stuck because it takes a lot of memory to compile, causing CI problems.
  • In the PT2 weekly meeting, we discussed H100 benchmarking. There are a lot of interlocking parts to this: we need to upgrade Triton to get their H100 improvements, and not everyone on the PyTorch team has access to an H100. Still looking for someone to sign up for this.
  • CUDA graph updates are a thing now: 1. Introduction — CUDA C Programming Guide There may be some opportunities here. Elias says: “It mostly helps with eliding input copies. For the most part, removing input copies only really matters when you torch.compile only part of your model and leave the rest of the model in eager. This use case is pretty unlikely to train well anyway since you’ll still need to bifurcate the memory pool.” However, personally, I also think CUDA graph updates could be pretty useful for allowing you to deallocate the pool of memory needed by a CUDA graph, only reallocating it when it’s time to run the CUDA graph again.

Dynamo

  • There was a pretty notable pytree API BC breakage which caused some internal problems: Serialize pytree to json string by angelayi · Pull Request #106116 · pytorch/pytorch · GitHub
  • Some big refactors that are in progress: refactoring skipfiles / allowed functions (talk to Yanbo), refactoring guard trees (talk to Animesh)
  • A bunch of new contributors being onboarded to Dynamo: Quansight is working more on Dynamo issues, and Jack Cao from PyTorch XLA is looking to help us with consolidated forwards-backwards-optimizer support in Dynamo as it is essential for XLA Dynamo perf.

Numbers are on break this week because the A100 runners are down: apt-get install nvidia-docker2, Could not get lock /var/lib/dpkg/lock-frontend · Issue #108862 · pytorch/pytorch · GitHub


State of PT2: Sep 15, 2023 edition

Previous update: State of symbolic shapes branch - #69 by ezyang

Executive summary

Dynamo

Inductor

Composability sync hit a lot of topics this week: Composability meeting notes - Google Docs. Topics that weren’t otherwise covered in this doc:

  • Elias told us about how SDPA pattern matches (and others; both inference and training patterns supported) are now compiled ahead of time, making it a lot cheaper to do lots of patterns. We took advantage of that to add a lot more patterns to match other SDPA variants. Add Python serialization to Pattern Matcher patterns by eellison · Pull Request #108894 · pytorch/pytorch · GitHub
  • Chien-Chin told us about the new PT2 DDP plans. We cannot directly trace DDP because it is implemented in C++, and we cannot easily port it to Python because the implementation is complicated by bucketing. So the idea is to implement a Python non-bucketed DDP, and rely on compile to optimize it away (the first sketch after this list gives a toy illustration).
  • Horace told us about developments in LLMs. One thing he wants is dequant primitives in PT2: a way to take int3/int4 packed values and unpack them into a larger tensor, with the idea that PT2 would compile away the memory traffic (the second sketch after this list gives a toy illustration). In general he doesn’t think we should directly do this in PT, as there are so many quantization formats.
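To make the Python non-bucketed DDP idea concrete, here is a toy sketch (my own illustration, not Chien-Chin's actual design): one gradient all-reduce per parameter, with the hope that the compiler later buckets and overlaps the collectives.

```python
import torch
import torch.distributed as dist

def naive_python_ddp(module: torch.nn.Module) -> torch.nn.Module:
    """Attach one all-reduce hook per parameter; no bucketing at all."""
    world_size = dist.get_world_size()

    def make_hook(ws):
        def hook(grad):
            reduced = grad.clone()
            dist.all_reduce(reduced, op=dist.ReduceOp.SUM)  # one collective per parameter
            return reduced / ws
        return hook

    for p in module.parameters():
        if p.requires_grad:
            p.register_hook(make_hook(world_size))
    return module

# Usage (assumes dist.init_process_group(...) has already been called):
# model = naive_python_ddp(MyModel().cuda())
```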
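And a sketch of the dequant primitive Horace described: unpack two 4-bit values from each uint8 byte into a larger tensor, with the expectation that the compiler fuses the unpack into the consumer so only the packed form hits memory (the layout and scale handling here are my assumptions, not an actual PT2 primitive):

```python
import torch

def unpack_int4(packed: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """packed: uint8 tensor holding two unsigned 4-bit values per byte."""
    low = packed & 0x0F                                            # low nibble
    high = packed >> 4                                             # high nibble
    vals = torch.stack([low, high], dim=-1).flatten(start_dim=-2)  # interleave nibbles
    return vals.to(torch.float32) * scale                          # dequantize

packed = torch.randint(0, 256, (4, 8), dtype=torch.uint8)
weights = unpack_int4(packed)                                      # shape (4, 16)
```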

Dynamic shapes

  • Last week I mentioned opcheck testing is usable, but Richard Zou is still evolving it based on user feedback. A recent change is to put the xfails into a JSON file so it can easily be updated automatically. However, there are still complaints from folks that it’s too hard to understand what goes wrong when a test crashes. Richard is now going to investigate a two-stage process, whereby we separate generating the test inputs from actually running the tests. To ensure generation of test inputs is kept up to date, we only need a single new test which runs all of the tests in the test file in one go and cross-references which tests are exercised against what we have recorded.
  • Horace wants a version of Tensor where some of the sizes are stored on device. This would allow you to perform a data-dependent operation without synchronizing, and you would still save on memory traffic because kernels would mask out memory loads when they go out of bounds of the dynamic shape. In some sense, this is a specialization of jagged tensor where everything in the jagged dimension has the same size. (A toy Python-level version follows this list.)
  • Notable bug fixes:
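To illustrate the "sizes on device" idea (my own toy version at the Python level; the real payoff would come from masking loads inside the generated kernels rather than multiplying by a mask):

```python
import torch

def masked_sum(padded: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """padded: (B, T_max); lengths: (B,) integer tensor on the same device.
    No .item() or host sync: the dynamic 'size' only participates as device data."""
    t = torch.arange(padded.size(1), device=padded.device)
    mask = t.unsqueeze(0) < lengths.unsqueeze(1)   # (B, T_max) validity mask
    return (padded * mask).sum(dim=1)

x = torch.randn(3, 10)
lengths = torch.tensor([4, 10, 7])
out = masked_sum(x, lengths)
```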

Numbers

This is nearly a month’s worth of numbers!

Training. 34ddf08f27 dashboard

Inference. 34ddf08f27 dashboard

  • A lot of torchbench improvement: detectron2_fcos_r_50_fpn, doctr_reco_predictor, drq, llama, pyhpc_turbulent_kinetic_energy all now pass accuracy.
  • cudagraphs freezing accuracy improvement in timm models, likely from some major bugfixes for freezing
  • pytorch_stargan had huge perf improvement c2ac0da445cfe3d848342926f9cd4422bd35bfe2…781b7ebe912ec24cbd917cd548b748b1650ab6a2
  • HuggingFace regression due to pin update Problems hit when upgrading the version of HF used in CI · Issue #108145 · pytorch/pytorch · GitHub
  • Fairly large aot inductor regression due to ABI changes.