State of symbolic shapes branch

State of symbolic shapes: Jul 4 edition

Previous update: State of symbolic shapes branch - #58 by ezyang

Executive summary

This is a little more than two weeks’ worth of updates, covering PSC week, Edward being on vacation, and the July 4th holiday.

CI skips. -3, -1, -1, -2 (no change).

Training dashboard (as of 7ae100628e). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 89%, 57/64 → 91%, 58/64 | 98%, 45/46 | 100%, 60/60 | 88%, 7/8 → 100%, 8/8 |
| Speedup | 1.11x → 1.08x | 1.59x → 1.58x | 1.19x → 1.21x | 1.30x |
| Comptime | 67s → 78s | 99s → 152s | 110s → 134s | 31s → 78s |
| Memory | 0.94x → 0.80x | 1.00x → 1.01x | 1.01x → 1.00x | 1.59x → 0.76x |

Inference dashboard (as of 7b3242d5f7). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 88%, 63/72 → 86%, 63/73 | 100%, 46/46 | 100%, 60/60 | 58%, 7/12 |
| Speedup | 1.52x → 1.53x | 1.64x | 1.72x → 1.73x | 1.92x → 1.96x |
| Comptime | 24s → 28s | 38s → 45s | 30s → 34s | 45s → 53s |
| Memory | 0.82x → 0.67x | 1.15x → 1.11x | 1.06x → 0.84x | 1.11x → 0.86x |


State of symbolic shapes: Jul 9 edition

Previous update: State of symbolic shapes branch - #60 by ezyang

Executive summary

CI skips. -3, -1, -1, -2 (no change).

Training dashboard (as of dd6c38cb59). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 91%, 58/64 → 89%, 57/64 | 98%, 45/46 | 100%, 60/60 → 97%, 58/60 | 100%, 8/8 → 88%, 7/8 |
| Speedup | 1.08x → 1.11x | 1.58x → 1.60x | 1.21x → 1.20x | 1.30x |
| Comptime | 78s → 97s | 152s → 124s | 134s → 178s | 78s → 40s |
| Memory | 0.80x | 1.01x → 0.97x | 1.00x | 0.76x → 0.73x |

  • vision_maskrcnn went back to failing, seems flaky. :person_shrugging:
  • eca_botnext26ts_256 and mobilevit_s timed out due to translation validation (TV) being enabled. #104654 fixed it (to be visible in the next perf run). The compilation time increase also appears to be due to TV.

Inference dashboard (as of dd6c38cb59). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 86%, 63/73 | 100%, 46/46 → 98%, 45/46 | 100%, 60/60 | 58%, 7/12 |
| Speedup | 1.52x | 1.65x → 1.64x | 1.73x | 1.92x → 1.96x |
| Comptime | 28s | 44s | 34s | 53s |
| Memory | 0.67x | 1.11x | 0.84x | 0.86x |

  • GPT2ForSequenceClassification is having some trouble across the board on all configurations; it’s currently failing accuracy.

What’s next?

  • Edward: Keep helping HF on their llama optimization; two level guards for backwards

State of symbolic shapes: Jul 15, 2023 edition

Previous update: State of symbolic shapes branch - #61 by ezyang

Executive summary

  • Dynamic shapes now support mode=“reduce-overhead” (CUDA graphs). Conventional wisdom was that dynamic shapes are incompatible with CUDA graphs, because any given CUDA graph recording can only work for a single static shape, and CUDA graphs’ requirement of hard-coded memory addresses means that each CUDA graph takes up quite a lot of CUDA memory. However, this conventional wisdom is wrong: (1) multiple CUDA graphs can share the same memory pool, as long as you don’t have any live tensors from one pool to the next (this is precisely what CUDA graph trees by @eellison implements), and (2) recording a CUDA graph is much, much cheaper than running the entire PT2 compilation stack, so it is profitable to compile a dynamic program once and then CUDA graph it multiple times. Enable cuda graphs for dynamic shapes by ezyang · Pull Request #105064 · pytorch/pytorch · GitHub realizes these gains and switches our dynamic shapes benchmark configuration to use CUDA graphs, resulting in hefty performance gains with only a modest increase in compile time. Importantly, these benchmarks cover our _generate inference benchmarks, which actually make use of multiple sizes as sequence length varies. There’s more to be done here: our memory usage for this use case can be suboptimal, because the caching allocator doesn’t know that it’s OK to waste space for small allocations by fitting them inside larger allocations for a larger dynamic size. We also observed that folks using this CUDA graphs trick tend not to generate CUDA graphs for every size, but instead prefer to linearly sample sizes and pad; we should make it easier to do this (perhaps with a padding tensor subclass). One cool result is a 6x performance improvement on cm3leon, a newly announced multi-modal model from Meta AI.
  • New API: torch._dynamo.maybe_mark_dynamic. The new Add torch._dynamo.maybe_mark_dynamic PR lets you suggest that we should try compiling a tensor dynamically, but it doesn’t raise an error if the tensor gets specialized anyway (unlike mark_dynamic). (A minimal sketch of this API, together with the CUDA graphs mode above, follows this list.)
  • Infer valid input sizes from programs. Horace has wanted this for some time, and with Yukio’s recent Z3 translation validation work landed, it turned out to be pretty easy to write a PoC to exhaustively search the space of valid inputs, using guards to turn us away from portions of the space we’ve seen before. Check it out at dinfer.py · GitHub. If anyone is interested in productionizing this, it would be a neat little project to (1) put this code in PyTorch and put a nicer API on it (note that as written, you have to specify the input dimensions and dtypes of input tensors, so you’ll need to figure out a good way of specifying or inferring this info), (2) improve the solving code to minimize the generated sizes for an equivalence class, and (3) use it for something cool; e.g., you could use it to automatically generate sample inputs for OpInfo tests. Tag me (@ezyang) as reviewer if you send a PR!
  • Enabling automatic_dynamic_shapes in fbcode, for real this time. It turns out that I failed to actually turn things on in fbcode last time, so actually do it for real this time: Switch automatic_dynamic_shapes to True by default in fbcode. This got reverted once for breaking an internal model unit test (Incorrect ValueRanges analysis · Issue #105097 · pytorch/pytorch · GitHub, fixed by Perform value range analysis with rationals when possible by lezcano · Pull Request #105137 · pytorch/pytorch · GitHub, thanks @Lezcano for the speedy fix.) At time of writing, the PR has not actually hit fbcode yet.
  • lit_llama is finally landed in torchbench. https://github.com/pytorch/benchmark/pull/1730 At time of writing this model is in canary models because the weight download is a little flaky. This is the only 7B model in our benchmark suite and there’s a bit of pain associated with this; for example, we can’t run accuracy tests on this model, because accuracy tests are implemented by holding two copies of the model in memory, which we can’t do at 7B parameters.
  • Notable bug fixes.
  • Notable new bugs.
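A minimal sketch of the CUDA graphs and maybe_mark_dynamic items above (the model, sizes, and CUDA device are illustrative, and a GPU build is assumed since mode="reduce-overhead" is what enables CUDA graph trees):

```python
import torch
import torch._dynamo as dynamo

def step(x):
    return torch.nn.functional.gelu(x @ x.T)

# One dynamic-shape compile of the PT2 stack; mode="reduce-overhead" then
# records a cheap CUDA graph per distinct size actually seen at runtime.
compiled = torch.compile(step, mode="reduce-overhead")

for seq_len in (128, 192, 256):  # e.g. a varying sequence length
    x = torch.randn(seq_len, 64, device="cuda")
    # Soft hint that dim 0 should be compiled dynamically; unlike
    # dynamo.mark_dynamic, this does not error if it ends up specialized.
    dynamo.maybe_mark_dynamic(x, 0)
    compiled(x)
```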

CI skips. -3, -1, -1, -2 (no change).

Training dashboard (as of 7b4d080496). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 89%, 57/64 → 92%, 59/64 | 98%, 45/46 → 96%, 44/46 | 97%, 58/60 → 98%, 59/60 | 88%, 7/8 → 100%, 8/8 |
| Speedup | 1.11x → 1.52x | 1.60x → 1.66x | 1.20x → 1.27x | 1.30x → 1.93x |
| Comptime | 97s → 86s | 124s → 120s | 178s → 142s | 40s → 42s |
| Memory | 0.80x | 0.97x | 1.00x → 1.01x | 0.73x → 0.69x |

  • Now passing: hf_Longformer (this used to fail with ValueError: Cannot view a tensor with shape torch.Size([4, 12, 1024, 513]) and strides (6303744, 513, 6156, 1) as a tensor with shape (48, 4, 256, 513); this is thanks to Brian Hirsh finally landing his AOTAutograd longformer fix), vision_maskrcnn (flaky), eca_botnext26ts_256 and mobilevit_s (used to time out; maybe the speedup from CUDA graphs was enough to get them under the timeout again)
  • Now failing: DebertaV2ForQuestionAnswering (failing accuracy due to cudagraphs, failing on inductor_with_cudagraphs too), cait_m36_384 (OOMing on accuracy due to increased CUDA graph memory usage)
  • Speedups: The majority of our speedups are due to the enablement of CUDA graphs for dynamic shapes. Some notable models and their speedups: BERT_pytorch (1.7698 → 3.3071), hf_GPT2 (1.7728 → 2.0056), basic_gnn_gin (1.3151 → 2.4841). The improvements on HF and TIMM models are much more modest since these are not super overhead bound models. Note that these numbers are still behind inductor_with_cudagraphs, because we are still losing some optimizations from running the PT2 compiler stack without static shapes.
  • Slowdowns: dlrm (infra failure due to cudagraphs, failing on inductor_with_cudagraphs too), hf_T5 (2.0252 → 1.8939, oddly enough; could this be due to memory pressure? Even more weirdly, hf_T5_large improved perf)
  • Comptime/Memory: By and large compilation time did not increase, but for our training setup this is expected, as we only run at one batch size, so we are simply measuring the cost of a single CUDA graph recording. As expected, the memory compression ratio gets worse, due to the standing allocation from CUDA graphs.

Inference dashboard (as of 7b4d080496). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 86%, 63/73 → 88%, 64/73 | 98%, 45/46 | 100%, 60/60 | 58%, 7/12 |
| Speedup | 1.52x → 1.50x | 1.64x → 1.76x | 1.73x → 1.62x | 1.96x → 2.94x |
| Comptime | 28s → 36s | 44s → 46s | 34s | 53s → 72s |
| Memory | 0.67x → 0.68x | 1.11x | 0.84x → 0.85x | 0.86x → 0.87x |

  • Now passing: hf_Longformer (see training above)
  • Speedups: torchbench numbers are actually a huge mixed bag. Here are some of the wins: BERT_pytorch (2.2317 → 2.4529), basic_gnn_edgecnn (1.7809 → 1.8732; note that for some reason many of the GNN variants are failing performance on inference, but not accuracy), cm3leon_generate (1.3037 → 5.7822, WOW! This is consistent with some perf analysis Armen and I did months ago, where I concluded that cm3leon was hella overhead bound), hf_T5_generate (2.2619 → 8.2081), hf_T5_large (3.1690 → 5.1747).
  • Slowdowns: A lot more models did worse with CUDA graphs enabled, including LearningToPaint (1.9209 → 1.6812), resnet18 (1.7779 → 1.4028), shufflenet_v2_x1_0 (1.9882 → 1.6010), squeezenet1_1 (1.8625 → 1.0040), yolov3 (2.0997 → 1.8843). It’s not entirely clear what’s going on here, but we will note that there was a sizable dip in CUDA graphs performance without dynamic shapes too this week on torchbench. There is an across-the-board performance regression on TIMM models (and a slight regression on HuggingFace too).
  • Comptime/Memory: Comptime generally got worse across the board, but not too much worse. Particularly notable are the generate models: hf_T5_generate (881 → 1032), cm3leon_generate (131 → 203). CUDA graphs is not free, but given that we’re running at much more than two sequence lengths, you can see the bulk of the compile cost is the PT2 stack. For the most part, memory usage stayed fairly stable, interestingly enough.

What’s next?

  • I think I want to investigate the memory planning situation with CUDA graphs a bit more; I also think it’s a good time to teach Inductor how to deal with data-dependent ops (without having to graph break on them.)

State of symbolic shapes: Jul 22, 2023 edition

Previous update: State of symbolic shapes branch - #62 by ezyang

Executive summary

CI skips. -3, -1, -1, -2 (no change).

Training dashboard (as of 0ad93a3d56). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 92%, 59/64 | 96%, 44/46 | 98%, 59/60 | 100%, 8/8 |
| Speedup | 1.52x → 1.54x | 1.66x → 1.69x | 1.27x → 1.28x | 1.93x → 1.97x |
| Comptime | 86s → 81s | 120s → 107s | 142s | 42s → 38s |
| Memory | 0.80x → 0.79x | 0.97x → 0.96x | 1.01x | 0.69x |

Not really much to say; the slight improvements appear to be within noise.

Inference dashboard (as of 0ad93a3d56). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 88%, 65/74 | 98%, 45/46 | 100%, 60/60 | 58%, 7/12 |
| Speedup | 1.50x → 1.55x | 1.76x → 1.78x | 1.62x → 1.79x | 2.94x → 3.03x |
| Comptime | 36s → 35s | 46s → 44s | 34s → 36s | 72s |
| Memory | 0.68x | 1.11x | 0.85x → 0.84x | 0.87x |

What’s next

  • CUDA graphs memory planning is lower priority for now (@eellison may take a look, but higher priority is actually being able to turn on CUDA graphs in prod situations; a big problem here is when we fail to compile the entire extent of the model, causing CUDA graphs to increase overall memory usage.) It looks like we definitely need data-dependent op support in inductor though, based on sparse arch investigation.

State of symbolic shapes: Jul 29, 2023 edition

Previous update: State of symbolic shapes branch - #63 by ezyang

Executive summary

CI skips. -3, -1, -1, -2 (no change).

Training dashboard (as of 1da4115702). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 92%, 59/64 | 96%, 44/46 | 98%, 59/60 | 100%, 8/8 |
| Speedup | 1.54x → 1.56x | 1.69x | 1.28x → 1.35x | 1.97x → 2.04x |
| Comptime | 81s | 107s → 108s | 142s | 38s → 39s |
| Memory | 0.79x | 0.96x | 1.01x | 0.69x |

Inference dashboard (as of 1da4115702). This week on HUD

| Metric | Torchbench | Huggingface | TIMM models | Dynamic |
|---|---|---|---|---|
| Passrate | 88%, 65/74 | 98%, 45/46 | 100%, 60/60 | 58%, 7/12 |
| Speedup | 1.55x → 1.54x | 1.78x → 1.77x | 1.79x → 1.80x | 3.03x → 3.08x |
| Comptime | 35s → 36s | 44s → 45s | 36s | 72s → 75s |
| Memory | 0.68x | 1.11x | 0.84x → 0.85x | 0.87x |

Looks all within noise.

What’s next

  • Rewriting export input/output spec flattening
  • Irrefutable guards
  • Generally more pushing on KJT stuff

State of symbolic shapes: Aug 6, 2023 edition

Previous update: State of symbolic shapes branch - #65 by ezyang

Executive summary

  • More on KJT/torchrec. I had a nice discussion with Dennis van der Staay about torchrec and work on sparse arch. Some new information: (1) this workstream is almost certainly going to involve distributed later, because applying PT2 to post-torchrec sharded models is going to involve tracing past communication primitives; this also implies I’m going to want to get FakePG working on torchrec; (2) working on unit tests should be a pretty good idea, but there’s still some basic infra work to do (laid out last week); (3) we’re not really expecting concrete performance improvements, as sparse arch is typically going to be communication bound, so this is mostly a “we think this is promising, and the investment is not too big, because we’ve already done so much with dynamic shapes so far.”
  • Pre-dispatch export. We’ve agreed to allow QAT to short-term publish a new export interface that produces a pre-dispatch FX graph with ATen operators which is suitable for graph transformations and training. The long term goal is to have pre-dispatch functionalization, which is the invariant the export team wants in order to allow this to be worked into torch.export proper. Pre-dispatch will generate an ExportedModule so that the APIs match.
  • Fake export. Export now supports exporting entirely fake modules/inputs. This means to export a model you don’t have to actually load its weights into memory; you can load it in a fake mode and still export it. This means we have some delicate code in Dynamo for dealing with two concurrent fake modes (but it’s not so bad: the outer fake mode is typically disabled while we do Dynamo analysis.) Only ONNX supports torch.load’ing models in fake mode at the moment.
  • Improved user stacks in Dynamo. torch._guards.TracingContext.extract_stack() now always accurately reports a user stack from anywhere in Dynamo, and we reliably use it for reporting real stacks for exceptions (previously, they used an entirely different mechanism.)
  • Improved error messages for non-local inputs in export. See Improve error message when export encounters non-local input for the details. This isn’t complete; the follow-up is to also make this work for outputs, and to work a little harder with the pytree representation (probably this week).
  • Dynamo change in attitude. Many folks are concerned that Dynamo is just “endless” bugs. I pitched Animesh and Voz on a new attitude to fixing Dynamo bugs, which is that we should imagine the platonic ideal implementation of Dynamo as a faithful reimplementation of CPython in Python. Then, fixing a bug should not just be moving code around to fix a particular problem, but instead improving the local vicinity of the code to bring it closer in line with this ideal. An example I used a lot when explaining this was dict.keys support (the quick bug fix is changing its return type from tuple to set; the real fix is to accurately model dict views; a small illustration follows this list). To do this well, you need to regularly look at CPython code, and Dynamo may need to grow some new abstractions (perhaps a proper implementation of Python’s object model, or Python traceable polyfills).
  • Notable new bugs.
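To make the dict.keys example above concrete, here is the behavioral difference the “real fix” would have to model (plain Python, no Dynamo involved): a keys view is live, whereas a tuple snapshot is not.

```python
d = {"a": 1, "b": 2}
ks = d.keys()          # dict_keys: a live view over d
snapshot = tuple(d)    # what a "keys as tuple" model would give you

d["c"] = 3
print(len(ks))         # 3 -- the view sees the new key
print(len(snapshot))   # 2 -- the snapshot is stale
print("c" in ks)       # True; views also support set ops, e.g. ks & {"a"}
```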

Numbers

As we’re not really doing much on performance numbers recently, I am simplifying this section.

Training. 68cb854d73 Dashboard

Nothing much to report.

Inference. 68cb854d73 Dashboard

The big perf increase in torchbench is due to maml getting removed from the benchmark set (it slows down a lot under PT2 and was depressing the score). clip, hf_Whisper, llama_v2 are new models added thanks to @msaroufim!

What’s next?

There are a lot of things that need doing

  • Finish overhauling export input/output pytree matching (probably not dumping the pytree in/out spec, but I think if we tree_map into positional identifiers we can reliably detect KJT missing situations)
  • Make unbacked SymInts work in Inductor gist:1293a41299604c44310341b7540eabcb · GitHub (biggest problem is unbacked SymInt binding in wrapper codegen and the hinting logic)
  • Irrefutable guards
  • Write up the plan for sparse arch / KJT
  • Land pytree support for KJT/JT
  • 0/1 specialization suppression for list of int in KJT

Stuff that probably can wait until later?

  • Host side torch.cond
  • DynTensor

State of symbolic shapes: Aug 12, 2023 edition

Previous update: State of symbolic shapes branch - #66 by ezyang

Executive summary

I’m trying something a little different, expanding the update to cover a wider variety of topics beyond dynamic shapes, mostly centered around things that I personally have involvement in (this is a lot of things, so you should be getting pretty good coverage this way!)

Benchmarking

  • Inductor CI/perf is upgraded to CUDA 12 / gcc 9. This doesn’t seem to have any appreciable effect on perf, but we did it so we could do the next item.
  • torchrec_dlrm is back. They were disabled a few months ago because of fbgemm nightly related flakiness. The flakiness has been resolved by building fbgemm/torchrec from source in the Docker image. These are now installed as part of the general torchbench installs, and should help some of the work we are doing on jagged tensors (since many important operators are currently implemented in fbgemm).
  • Algorithmic efficiency. Frank Schneider posted about how PyTorch was slower than JAX in their upcoming algorithmic-efficiency benchmark suite. A bunch of us, spearheaded by @msaroufim, jumped in to take a look at what was going on. Status updates at https://docs.google.com/document/d/1okqKS32b0EhWQSFFoSV6IjGlYM4VhNYdxBPjdlFIw5w/edit (Meta-only). I personally have an interest in the dlrm side of things, since I’ve been working on sparse arch recently; after fixing some mild bugs, I was able to show parity on criteo1tb dlrm between PyTorch nightly and JAX on an A100x8 (PyTorch score: 7703.403180360794, JAX score: 7703.041719198227), although the number of evals varied, so I’m not sure if this is a threat to validity. Unfortunately, this does not necessarily help their problem, which was an OOM. To make further progress on this, we may need some tools to help us understand why torch.compile memory usage is higher.

Export

Distributed

  • Tracing FSDP. @voz wrote a post Redirecting... (Meta-only) about the state of tracing FSDP in Dynamo. The key info is that on a branch, he can trace everything through and get identical results on a single forward-backward to eager. There are a lot of fixes that need to land on main; from his post:
    1. The value of various small changes to FSDP to make this work vs adding fixes in dynamo (Pretty easy, preferring dynamo ofc but for some mostly no op shuffling, we do FSDP as well)
    2. TypedStorage - is it tensor-like/tensor-associated enough to go in the graph? Do we need to add some ops for doing tensor typed storage data ptr comparison / checking free, etc?
    3. Working through the cudastream story, in particular around wait_stream and such
    4. Lots of little bug fixes here and there
    5. Coverage for missing comparisons, bytecode ops, general coverage gaps like attr access on FSDP modules, setting data on a tensor, etc.
  • pytrees slow again for DTensor. Junjie and Rodrigo have been trying to improve DTensor’s eager perf, and we spent the first half of composability sync talking about it. Rodrigo had a hack to pre-compile pytree applications into Python code, but apparently this doesn’t help that much: gist:5427cabfab6421d4e104905345f94a50 · GitHub . Another suggestion from the meeting was that after Brian’s subclass support lands, maybe you could torch.compile each op individually with backend=“eager”.
  • Data-dependent all2all. Will Feng got all2all collective working in inductor https://github.com/pytorch/pytorch/pull/106655/ This is notable because all2all collective has data-dependent output shape. It looks like unbacked symints worked here!

Custom ops

  • Custom ops. Richard tells me he is going to add a class-based API for custom ops, to make it easier to define everything all in one place. More on this soon I assume!
  • Custom op testing. https://github.com/pytorch/pytorch/pull/106903 is here to make it easier to retrofit pre-existing test suites to also test for important operator properties.

Nested/jagged tensor

Dynamo

  • Pivot on per-NN module caching. @anijain2305 is working on having a separate code cache per NN module, but on Friday, with the help of @voz, we realized that the problem is actually separable into two pieces: (1) an enhanced cache size limit policy that knows about NN modules [RFC][dynamo] Separate cache sizes for nn module guard specialization by anijain2305 · Pull Request #107077 · pytorch/pytorch · GitHub and (2) improvements to cache lookup when there are a lot of cache entries (guard trees).
  • Dynamo eager mode cond. Yidi Wu: to support cond in eager mode, we plan to torch.compile the entire cond operator, manufacturing fresh code objects to ensure that the caches don’t interfere with each other. https://docs.google.com/document/d/1esmHEa0fiktiSw1lvRsPmsbnTYxDSc0t3V9V5V0xK7I/edit#heading=h.pajqpbewbdg7 (Meta-only)
  • Time to get rid of functional VariableTracker? VariableTracker in Dynamo is an immutable data structure: when a mutation happens, you allocate a fresh VariableTracker and then replace old VariableTrackers with the new one. This is because we have checkpointing functionality that is used to rewind to old VariableTrackers. However, this is a bit of a pain from the modeling side, as every Python data structure has to be reimplemented to have purely functional operations. An alternate design is to allow direct mutation of VariableTrackers. To do checkpoints, we simply restart Dynamo analysis to “go back in time” by stopping execution at the point where we would have checkpointed (a deepcopy could also work, though I’m not a fan). Speculate subgraph would be implemented by simply denying all mutations or doing some crazy thermometer continuation thing. This would help make Dynamo more metacircular and reduce the work needed to support new container types, of which we often need to support a lot.

Dynamic shapes

  • expect_true irrefutable guards. I talked through this in the last 20min of composability sync. Check https://github.com/pytorch/pytorch/pull/106720 ; this is enough to make splits on unbacked SymInts work.
  • Boolean masking, at last. @yanboliang is looking into a pre-autograd FX transform that replaces boolean mask updates with torch.where calls. One annoying detail is how Dynamo should deal with tracing the boolean masks in the first place, given that Inductor can’t handle boolean masks that can’t be eliminated. Our idea, in lieu of fixing Inductor to work with data-dependent shapes (which we are working on), is to attempt to eliminate all data-dependent ops in a pre-dispatch pass, and if that is not possible, restart Dynamo analysis saying “you need to graph break on this op next time.” (A small sketch of the rewrite follows this list.)
  • Notable fixes.
  • Notable new bugs.
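A minimal sketch of the kind of rewrite the pre-autograd pass above would perform (function and tensor names are illustrative): the masked assignment touches a data-dependent number of elements, while the torch.where form keeps all shapes static.

```python
import torch

def masked_zero_inplace(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    x[mask] = 0.0  # boolean mask update: data-dependent number of writes
    return x

def masked_zero_where(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # shape-preserving equivalent that the compiler can handle today
    return torch.where(mask, torch.zeros_like(x), x)

x = torch.randn(4, 4)
mask = x < 0
assert torch.equal(masked_zero_where(x.clone(), mask),
                   masked_zero_inplace(x.clone(), mask))
```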

Numbers

Training. 03414081ff Dashboard

  • Some accuracy regressions. torchbench: hf_BigBird, vision_maskrcnn (flaky). It’s not clear what broke hf_BigBird; possibly the CUDA 12 upgrade. Need to investigate. AlbertForQuestionAnswering improved accuracy!
  • The huge perf improvement across the board is thanks to Peter Bell’s work https://github.com/pytorch/pytorch/pull/106747 optimizing split reductions. This is not fully runtime split reductions: instead, Peter uses whatever the hint was at the time we compiled to plan the split reduction, and then we use that plan for all subsequent runs. This makes it more important to warm up Inductor with the “right” size hint to start; see also Padded tensor subclass · Issue #105325 · pytorch/pytorch · GitHub. There was also another user complaining about other cases where we made suboptimal decisions when the first kernel we compiled with wasn’t representative.

Inference. Dashboard 03414081ff

  • A lot of change on the last day; some improvements and some regressions (but mostly regressions). Maybe CUDA 12 update related; need to check. hf_BigBird is also failing here. RobertaForQuestionAnswering is now failing accuracy.

State of PT2: Aug 20, 2023 edition

Previous update: State of symbolic shapes branch - #66 by ezyang

Executive summary

Public service announcements

  • Trying to understand how to read PT2 logs? Check out Logging docs - Google Docs for some quick orientation. (In other news, guards logging this week has been improved to show you which line of user code caused a guard to be added! Take a look and don’t be afraid to give feedback on it.)
  • Have you ever wanted to store lots of tracebacks for logging/debugging purposes, but were afraid to do so by default because it might be too expensive? There is a new class in torch.utils._traceback called CapturedTraceback which makes it extremely fast to save a Python traceback (something like 20x faster than running a full traceback.extract_stack()), so it should change the calculus of whether you are willing to store tracebacks by default. We have already used this to keep fine-grained information about guard provenance, to start. Note that CapturedTraceback DOES hold references to code objects, and these references can cause cycles (because of co_extra), so good practice is to clear these traces once you know you no longer need them. (A minimal usage sketch follows this list.)
  • I spent some time debugging reference cycles this week (due to CapturedTraceback), and Alban pointed me at objgraph for visualizing references/referents. It’s pretty cool, you should definitely check it out.
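A minimal usage sketch of CapturedTraceback, assuming the extract()/format() methods it exposes today (this is a private utility, so the exact API may shift):

```python
from torch.utils._traceback import CapturedTraceback

def record_event():
    # Capturing is cheap; the expensive formatting is deferred until someone
    # actually asks to see the traceback.
    return CapturedTraceback.extract()

tb = record_event()
print("".join(tb.format()))  # render lazily, e.g. only on an error path
tb = None                    # drop the reference once it is no longer needed
                             # (it holds code objects, which can form cycles)
```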

Composability sync https://www.youtube.com/watch?v=LmkFkOBwhks

Inside baseball

Distributed

  • Handling backward hooks in Dynamo is kind of difficult. There is a discontinuity between hooks on inputs and hooks on intermediates; hooks on intermediates, in particular, have to somehow be reflected in whatever graph gets differentiated by autograd, but at the same time these hooks may have arbitrary Python bits that need handling by Dynamo. It seems the problem is made easier if we have combined forward-backward tracing in Dynamo, at which point Dynamo knows enough about the backward structure to bypass AOTAutograd entirely. It might also be possible to just do cuts prior to going to AOTAutograd, but this will impede optimization. It might be possible to bypass this problem for FSDP if hooks are only on parameters and outputs. Lots of difficulties…

Dynamic shapes

Numbers

Training. 68b9bf9671 Dashboard

Not too much action this week. However, there was a bit of flakiness on the inside; may need some follow up. torchrec_dlrm, added last week, has shown up with dynamic shapes but is failing, so it still needs work. The early-week TIMM improvement is from [inductor] make thread order consistent with loop order by shunting314 · Pull Request #106827 · pytorch/pytorch · GitHub . Some of the later-week TIMM improvement may be from Make Nd tensors hit fused addmm pass by eellison · Pull Request #106911 · pytorch/pytorch · GitHub (but stats were not run on that PR).

dlrm is now passing with dynamic shapes, which is cool. RobertaForQuestionAnswering was fixed (not clear what fixed this; it’s in the 35cca799ff42182a1b7f1ee4d0225ee879b7c924…384e0d104fd077d31efafc564129660e9b7a0f25 range). Some other wins (and some regressions, most importantly sam) came from Unfuse bias add before pointwise ops by eellison · Pull Request #106912 · pytorch/pytorch · GitHub; there are also some other unexplained changes like convnext_base and jx_next_base in this same commit range (which sort of makes sense, since @eellison landed a bunch of perf related changes).

State of PT2: Sep 8, 2023 edition

Previous update: State of symbolic shapes branch - #67 by ezyang

We were on break for two weeks because I went on vacation, and I didn’t have time to do a report before/after vacation lol.

Executive summary

  • PyTorch 2.1 branch cut. The cut was three weeks ago (right when I went on vacation lol) and we’re reaching the end of the cherry-pick window. Track ongoing cherry picks at: https://github.com/pytorch/pytorch/issues/108055
  • Blueberries offsite was this week! The blueberries workstream is focused on accelerating SOTA transformer models using PT2, quantization, sparsity and other techniques. Some highlights: MFU is coming to the benchmark suite, some direct improvements to important models, int8 dynamic quantization with tensor subclasses. Many of these are not published yet, keep your eyes peeled at PTC!
  • PyTorch conference registration filling up fast. If you want to go and haven’t registered yet, you should register at PyTorch Conference | Linux Foundation Events

Composability sync

  • Aug 24 https://www.youtube.com/watch?v=H6EUSsvDmbw - we spent time going over recent KJT progress (recapped below), and Voz reported progress on tracing FSDP with hooks (also recapped below)
  • Aug 31 - not livestreamed publicly, I wasn’t there, but apparently there was some discussion about streams for tracing FSDP (no minutes alas)

Distributed and PT2

  • Tracing FSDP: Voz is deep in the weeds on backwards hooks support. We are attempting to implement hooks in a way that doesn’t require consolidated forward-backwards. The general strategy is (1) have Dynamo emit graphs that have register_hook calls on intermediates (register_hook calls on inputs must not go in the graph; they have to happen as part of residuals), (2) write these register_hook calls in such a way that when AOTAutograd runs, the actual hook code (which is arbitrary Python code and is not safe to run in tracing) is not run; instead we run a meta function (which performs any needed metadata mutation) and then insert a call function to the original Python function (which will show up in backwards), and (3) have compiled backwards take care of compiling this call function in the end.
  • Per parameter FSDP is looking pretty legit. Andrew Gu has been looking at the performance of per-parameter sharding (where parameters managed by FSDP aren’t shoved into a single flat buffer) and has found that we only really pay a penalty of 5% with per-parameter sharding but get better memory usage. Meta only: Redirecting...
  • DDP optimizer brittleness. We currently support pipelining DDP code with PT2 by manually splitting graphs into multiple AOTAutograd functions so that backwards isn’t run too soon. The code here is kind of janky: I ran into two separate bugs that only happened when optimize_ddp was on: [DDP PT2] TypeError: convert_frame_assert.<locals>._convert_frame_assert() missing 2 required positional arguments: 'hooks' and 'frame_state' · Issue #107637 · pytorch/pytorch · GitHub and [optimize_ddp] moco - NameError: name 's2' is not defined · Issue #108877 · pytorch/pytorch · GitHub . Pritam has also been complaining about the graph break strategy: torch.compile graph breaks should be independent of DDP buckets · Issue #108966 · pytorch/pytorch · GitHub Will tells me that Chien-Chin is working on some new DDP strategy, but it appears to be centered around starting with a non-parallelized graph. Hopefully we can present it at composability this week. Note that DDP cannot be easily traced as it is implemented in C++.

Dynamic shapes

Inductor fun

  • Peter Bell is very close to landing inductor IR support for scan https://github.com/pytorch/pytorch/pull/106581 which allows for native cumsum/cumprod support. Now all we need is for someone to add a higher order op that feeds into this and we will have torch.scan!
  • Someone should add a “realize” operator to PT2, which would force materializing a tensor rather than allowing fusions across it. Christian Puhrsch would find this useful for ensuring epilogue fusion occurs on int8 mm (today, regular fusion causes the pointwise operation to get fused into a later reduction, instead of fusing the pointwise into the matmul)
  • ABI compatibility for AOT Inductor is continuing to proceed, slowly, but one agreement is that we’re probably going to have only the ABI-compatible codegen for OSS as well.

Performance

  • Flash Attention 2 is close to landing: Flash Attention v2 by drisspg · Pull Request #105602 · pytorch/pytorch · GitHub but it is currently stuck because it takes a lot of memory to compile, causing CI problems.
  • In the PT2 weekly meeting, we discussed H100 benchmarking. There are a lot of interlocking parts to this: we need to upgrade Triton to get their H100 improvements, and not everyone on the PyTorch team has access to an H100. Still looking for someone to sign up for this.
  • CUDA graph updates are a thing now: 1. Introduction — CUDA C Programming Guide There may be some opportunities here. Elias says: “It mostly helps with eliding input copies. For the most part, removing input copies only really matters when you torch.compile only part of your model and leave the rest of the model in eager. This use case is pretty unlikely to train well anyway since you’ll still need to bifurcate the memory pool.” However, personally, I also think CUDA graph updates could be pretty useful for allowing you to deallocate the pool of memory needed by a CUDA graph, only reallocating it when it’s time to run the CUDA graph again.

Dynamo

  • There was a pretty notable pytree API BC breakage which caused some internal problems: Serialize pytree to json string by angelayi · Pull Request #106116 · pytorch/pytorch · GitHub
  • Some big refactors that are in progress: refactoring skipfiles / allowed functions (talk to Yanbo), refactoring guard trees (talk to Animesh)
  • A bunch of new contributors being onboarded to Dynamo: Quansight is working more on Dynamo issues, and Jack Cao from PyTorch XLA is looking to help us with consolidated forwards-backwards-optimizer support in Dynamo as it is essential for XLA Dynamo perf.

Numbers are on break this week because the A100 runners are down: apt-get install nvidia-docker2, Could not get lock /var/lib/dpkg/lock-frontend · Issue #108862 · pytorch/pytorch · GitHub


State of PT2: Sep 15, 2023 edition

Previous update: State of symbolic shapes branch - #69 by ezyang

Executive summary

Dynamo

Inductor

Composability sync hit a lot of topics this week (Composability meeting notes - Google Docs). Topics that weren’t otherwise covered in this doc:

  • Elias told us about how SDPA pattern matches (and others; both inference and training patterns supported) are now compiled ahead of time, making it a lot cheaper to do lots of patterns. We took advantage of that to add a lot more patterns to match other SDPA variants. Add Python serialization to Pattern Matcher patterns by eellison · Pull Request #108894 · pytorch/pytorch · GitHub
  • Chien-Chin told us about the new PT2 DDP plans. We cannot directly trace DDP because it is implemented in C++, and we cannot easily port it to Python because the implementation is complicated by bucketing. So the idea is to implement a Python non-bucketed DDP, and rely on compile to optimize it away. (A hypothetical sketch of what a non-bucketed DDP might look like follows this list.)
  • Horace told us about developments in LLMs. One thing he wants is dequant primitives in PT2: a way to take int3/int4 packed values and unpack them into a larger tensor, with the idea that PT2 would compile away the memory traffic. In general he doesn’t think we should directly do this in PT, as there are so many quantization formats.
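To make the non-bucketed Python DDP idea concrete, here is a hypothetical sketch (naive_ddp is made up, an initialized process group is assumed, and it simply hangs one allreduce per parameter off gradient accumulation, leaving bucketing and overlap for the compiler to recover):

```python
import torch
import torch.distributed as dist

def naive_ddp(module: torch.nn.Module) -> torch.nn.Module:
    """Hypothetical unbucketed DDP: one allreduce per parameter gradient."""
    world_size = dist.get_world_size()

    def hook(param: torch.Tensor) -> None:
        # Fires after the grad for this parameter has been accumulated.
        dist.all_reduce(param.grad)
        param.grad.div_(world_size)

    for p in module.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)
    return module
```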

Dynamic shapes

  • Last week I mentioned opcheck testing is usable, but Richard Zou is still evolving it based on user feedback. A recent change is to put the xfails into a JSON file so it can easily be automatically updated. However, there are still complaints from folks that it’s too hard to understand what goes wrong when a test crashes. Richard is going to investigate a two-stage process now, whereby we separate generating test inputs from actually running the tests. To ensure generation of test inputs is kept up to date, we only need a single new test which runs all of the tests in the test file in one go and cross-references which tests are exercised against what we have recorded.
  • Horace wants a version of Tensor where some of the sizes are stored on device. This would allow you to perform a data-dependent operation without synchronizing; and you would still save on memory traffic because you would have kernels mask out memory loads when they go out of bounds of the dynamic shape. In some sense, this is a specialization of jagged tensor where everything in the jagged dimension has the same size.
  • Notable bug fixes:

Numbers

This is nearly a month worth of numbers!

Training. 34ddf08f27 dashboard

Inference. 34ddf08f27 dashboard

  • A lot of torchbench improvement: detectron2_fcos_r_50_fpn, doctr_reco_predictor, drq, llama, pyhpc_turbulent_kinetic_energy all now pass accuracy.
  • cudagraphs freezing accuracy improvement in timm models, likely from some major bugfixes for freezing
  • pytorch_stargan had a huge perf improvement in the commit range c2ac0da445cfe3d848342926f9cd4422bd35bfe2…781b7ebe912ec24cbd917cd548b748b1650ab6a2
  • HuggingFace regression due to pin update Problems hit when upgrading the version of HF used in CI · Issue #108145 · pytorch/pytorch · GitHub
  • Fairly large aot inductor regression due to ABI changes.

State of PT2: Sep 23, 2023 edition

Previous update: State of symbolic shapes branch - #69 by ezyang

Executive summary

Dynamo

  • Yanbo has been making good progress on understanding the state of our skipfiles/allowlist situation. Here is my attempt to record what he described to me in our 1:1.
    • First, what do these things do? For any given frame, we can make one of three decisions on it: inline - the default decision; skip - we never start Dynamo on this frame, and we induce a graph break instead of inlining into it (BUT, skipped functions may have overrides in Dynamo that allow us to avoid a graph break); allow in graph - we don’t inline into the function, but instead directly put it into the graph (and run it to do fake tensor propagation.) Skipfiles and allowlist control whether or not we do something different from the default decision.
    • Yanbo’s theory is that allowlist should be explicitly enumerated function-by-function. This makes sense; there’s a fixed set of operations we can actually put in the graph (coinciding with Torch IR; see composability sync), and they have to be audited to ensure they don’t do naughty stuff like mutate Python state.
    • Suppose that we didn’t care about compile time / Dynamo bugs at all. In theory, it shouldn’t be necessary to have a skip list at all, because you’d expect Dynamo to independently work out that something couldn’t be compiled and graph break. There is a big nuance here though: the torch module is skipped! Most of the time, this skip is bypassed for other reasons, e.g., a torch function is allowed in graph, or a submodule is explicitly allowed for inlining. But by default we won’t actually compile anything in torch (and this can be quite surprising for PyTorch devs!)
    • Chesterton’s fence rules everything around me. Sometimes we have manual implementations of functions (like nn.Module.parameters) which are unnecessary, because they were added back when Dynamo’s Python language support was not so good and now we could just inline into those functions; but some seemingly benign skip rules are load-bearing and cause problems. So many of Yanbo’s initial refactors will be oriented around preserving semantics as much as possible, while improving code organization.
  • Jason, Edward, Animesh and Voz got together to discuss some design questions about guard trees raised last week. The conclusion was that we are NOT going to do combined guard tries; Animesh’s plan proceeds as originally scoped. One interesting thing I learned from this discussion was that guards with source-based guard structure deal poorly with guards that mention two sources, but Jason proposed a way to deal with this case: instead of directly having a guard like x.size(0) == x.size(1), have assignment statements like let s0 = x.size(0) and let s1 = x.size(1), and then have an extra guard that only makes reference to this local scratch space: s0 == s1. These extra guards get run immediately once all of their free variables have been assigned. Jason’s argument is that size guards can be very fast to run if we compile them, so it doesn’t matter if they get run unnecessarily early. (A small sketch of this transformation follows this list.) Some very rough meeting notes: https://docs.google.com/document/d/1EbrR9o7Loi_fU1MHNJAxxCOItn0dv44hLF3NB2pZ1nE/edit#heading=h.o7t8ttlom4nx
  • Lazos suffered a bit from some bikeshedding about how he should write some new VariableTrackers, but hey, at least we got a doc out of it: Which VariableTracker should I use - Google Docs
  • PSA: when you print Dynamo logs, they come with a [0/0] marker that says what frame you are compiling, and which recompile of that frame you are on. [19/0] means you are compiling the 19th interesting frame for the first time, while [1/20] means you are recompiling the 1st frame for the 20th time (probably bad!) Occasionally, you will see [0/0_1]; this means that we restarted analysis on 0/0 for some reason, so this is the second (1 is zero-indexed) time around compiling it.
  • I mentioned to jansel that there are a number of dynamo refactors I’d kind of like to firm up: mutable variable trackers, a more accurate python object model, variable tracker cleanup (we have a big VariableTracker subclass hierarchy), more metacircularity (so that constant folding is a lot easier to implement.) Hopefully we can hit some of these during the offsite.
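A runnable sketch of the guard transformation described above (x is an illustrative input; in reality Dynamo would emit these as compiled guards, not plain Python):

```python
import torch

x = torch.randn(3, 3)  # the input the guards would be checked against

# A relational guard such as x.size(0) == x.size(1) mentions two sources at
# once, which source-based guard layout handles poorly. The proposal: bind
# each source to a scratch local as soon as it is visited, and attach the
# relational check to the scratch locals only; it fires once both are bound.
s0 = x.size(0)    # let s0 = x.size(0)
s1 = x.size(1)    # let s1 = x.size(1)
assert s0 == s1   # extra guard referencing only the local scratch space
```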

Composability

  • We had a very spicy session at composability sync this week on Torch IR: https://youtu.be/FSPNXppkcjg Composability meeting notes - Google Docs https://docs.google.com/document/d/17O1R57oOZp_fK4dRf83UiM4fH6h6nblxjizTrbHP8BY/edit The crux of the matter is what to do about “Torch IR”, which is conceptually a PyTorch program capture representation that is produced by fx.symbolic_trace: an unnormalized format that contains precisely the torch API operations that are part of PyTorch’s public API. It is a coherent concept that is used by folks today, but its denormalization makes it difficult to write sound analyses/passes on. Some argued that because it’s so difficult to use, we shouldn’t expose it, while others argued that the horse has already escaped from the barn. We were able to agree in the meeting on what Torch IR is and what guarantees you should expect from it, but it’s still an ongoing discussion how this should relate to export.
  • Zachary DeVito’s been working on single controller distributed paradigm for PyTorch, where we issue commands of entire Dynamo traced graphs for external nodes to run. This is being done via a tensor subclass, but it is a bit unusual in that it doesn’t match other tensor subclass applications, where we don’t actually want to trace into the subclass itself, we just want to trace tensor operations on it “as if it were a normal tensor.”
  • Apparently optree GitHub - metaopt/optree: OpTree: Optimized PyTree Utilities is slowly making progress toward becoming an optional dependency for PyTorch: if it is installed, the PyTorch pytree APIs will transparently make use of it instead, for hefty speedups. Pytrees are a big performance cost for PT2 compile time, so it will be nice for this to land; so nice that Richard Zou has been wondering whether we shouldn’t actually just make this a required dependency. (A tiny sketch of what the pytree APIs do follows this list.)
  • COW tensor is alive again, thanks to Kurt Mohler accepting a mission to finish off Mikey Dagitses’ work. Implement Copy-on-write (COW) tensors · Issue #109833 · pytorch/pytorch · GitHub
  • Discussions on internal reliability continue. It’s going to be some sort of multi pronged approach where we negotiate with PGs what testing we absolutely must do, while at the same time improving testing on poorly exercised parts of PT2 (e.g., PT2+FSDP) and working out periodic tests that are covered similarly to how we cover benchmark results in OSS.
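For context on the pytree item above, a tiny sketch of what the pytree APIs do (the nested structure is illustrative); optree would transparently back these same calls when installed:

```python
import torch
import torch.utils._pytree as pytree

# Arbitrarily nested containers of tensors and other leaves are "pytrees";
# PT2 flattens and rebuilds structures like this constantly at compile time.
inputs = {"ids": [1, 2, 3], "feats": (torch.randn(2), torch.randn(3))}

leaves, spec = pytree.tree_flatten(inputs)                 # 5 leaves + a structure spec
doubled = pytree.tree_map(lambda leaf: leaf * 2, inputs)   # apply a fn to every leaf
rebuilt = pytree.tree_unflatten(leaves, spec)              # invert the flattening
assert len(leaves) == 5 and isinstance(rebuilt, dict)
```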

Dynamic shapes


State of PT2: Oct 8, 2023 edition

Previous update: State of symbolic shapes branch - #70 by ezyang

We were on break last week as I was on vacation.

Executive summary

Compiler/distributed offsite was last week! PyTorch Conference talk slides are due to Linux Foundation end of this week!

Dynamo

  • Our initial take on mutable variable trackers was “well, it is probably technically feasible, but it’d be a lot of work and the ROI is not obviously there.” It came up again this week, though, for perf reasons: RFC / Discussion - Mutable Variable Trackers - Google Docs from Voz
  • We discussed an accurate Python object model: definitely something we should do for user defined objects; maybe Fidget-Spinner will work on it. We have had some Dynamo bugs recently relating to classes with nontrivial metaclasses (like abc) and multiple inheritance.
  • We have a proposal for guarding on Dynamo configuration, which should make it a lot easier to tweak config options: Dynamo guard on global configuration · Issue #110682 · pytorch/pytorch · GitHub One notable choice we make is that outer-most torch.compile config wins; if this would be annoying for you please comment on the issue.

Tracing FSDP

  • We talked about the relative importance of landing tracing FSDP quickly during the offsite. The general consensus was that, while this is an important capability, the more pressing problems are optimizing tensor parallel compute (as it’s harder to manually get optimal overlapping in this regime) and tracing DDP (which Chien-Chin is working on).
  • During the offsite, we came up with a full plan for Dynamo-level support for propagating hooks to backwards. The primary complication is that, in full generality, a backward hook installed in a Dynamo compiled region may be arbitrary Python code that varies from run to run, but we emphatically do not want to guard on it (nor can we, since we didn’t inline into the function). In the simple case, the function is constant from iteration to iteration and we can bake it into the backwards graph (this is what is currently implemented); in the complicated case, Dynamo must construct the residual function and then somehow pass it to the AOTAutograd-compiled function, so AOTAutograd knows that this particular function is what should be invoked when backward rolls around. This can be done but it’s all quite fiddly. For FSDP we don’t need it in full generality because the hook is a constant function.
  • More folks are collaborating on Voz’s experimental FSDP tracing branch: GitHub - pytorch/pytorch at voz/fsdp_autograd3 To run things on the branch just say torchrun --standalone --nproc_per_node=2 fsdp.py (will run with compiled forwards, but NOT compiled backwards). Current status is that compiled forwards works, compiled backwards does not. The problem is that compiled autograd has to do a pre-pass with fake tensors to construct the FX graph, but during this pre-pass it is unable to run hooks, and that means parameters aren’t the sizes it is expecting.
  • Not quite FSDP, but putting it here: on the subject of single controller, Haiping Zhao is also looking at this problem space, much more from the distributed side. He, Zach and Horace have been chatting.

Core

  • We spent a bit of time talking about optimizer in the offsite. @janeyx99 summarized the discussion at Meta only: Redirecting... and Meta only: https://docs.google.com/document/d/1JJhRCl8F51nH_Ke8Yd_BV3scAmv5V4eSVnle5D8__po/edit My brief summary: we’re going to make optimizer support taking parameters in arbitrary pytree structure, rather than forcing just a list of parameters (which gives you the awful integer indexed structure where you have to reverse engineer which parameter is what.) It’s not BC-breaking, but people who use this API will have a much easier to work with state dictionary.
  • Composability sync this week was all about quantization https://www.youtube.com/watch?v=7WhgpAIvxHU Composability meeting notes - Google Docs The resolutions:
    • uint2/uint3/uint4 support in core to be prototyped as Python subclass by torchrec folks (lead by Ivan Kobzarev)
    • Decent chance we are going to get dequantize operators that can show up in export IR
    • We will have a pattern matcher that will let you compile regular Python calls in PyTorch IR into appropriate ATen matchers, mirroring how Inductor’s pattern match infra works. This will support just returning fx.Nodes to you so you can do arbitrary transformations, instead of just doing a replacement.
  • Richard Zou has been trying to convince people to use the new operator registration API, but he has been noticing that people really like the old-fashioned autograd.Function API, because it doesn’t require them to do work for things they don’t care about (e.g., supporting other transforms). Since we need to support this anyway, we are going to make sure Dynamo’s support in this regime is good. (A minimal autograd.Function example follows this list.)
  • AOTDispatch subclass PR is approved, close to landing! AOTDispatch subclass by bdhirsh · Pull Request #104483 · pytorch/pytorch · GitHub
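For reference, this is the kind of autograd.Function definition people keep reaching for (ScaledClamp is a made-up op): you only write forward and backward, and everything else is optional.

```python
import torch

class ScaledClamp(torch.autograd.Function):
    """Made-up custom op: relu(x) * scale, written the old-fashioned way."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.save_for_backward(x)
        ctx.scale = scale
        return x.clamp(min=0) * scale

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Gradient flows only where the clamp was inactive; no grad for scale.
        return grad_out * ctx.scale * (x > 0), None

x = torch.randn(4, requires_grad=True)
ScaledClamp.apply(x, 2.0).sum().backward()
```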

Inductor

Export

  • Export input/output matching is causing problems again. This is the AssertionError: traced result #1 (<class 'torch.Tensor'>) is not among graph-captured output. Someone should rewrite this code.

Dynamic shapes

Numbers

I guess I’m doing these monthly now.

Training. 1b34238d67 dashboard

  • nanogpt, stable_diffusion_text_encoder, stable_diffusion_unet are newly added torchbench models
  • mobilevit_s now passing timm_models
  • Not sure what’s going on with blueberries readout lol.
  • Compile time does seem to have gotten worse. @Chillee has been complaining about compile time, although a lot of it is in Dynamo tracing. It is hard to see the effect of Dynamo in our current benchmark suite because it is heavily Inductor biased. Some improvement from guard hashing, but Horace says it’s only 1-2 seconds.
  • Unattributed speedup on timm_efficientdet

Inference. 1b34238d67 dashboard

  • Lots of enablement in aot inductor, I like to see that pass rate go up
  • Some speedups are attributable to nanogpt being added
  • 1% improvement in HF, Horace updated some loading logic
  • HF inference: better flash attention matching at low inference +19%, some of this is also FlashAttention v2
  • Some ups and downs with Yanbo’s for equiv invocation, letting us hit baddmm

State of PT2: Nov 3, 2023 edition

Previous update: State of symbolic shapes branch - #71 by ezyang

Sorry about the month’s delay! Between more vacation and PTC there wasn’t much time to do a writeup over the weekend.

Executive summary

Big tickets

Dynamo

Core libraries

  • Ying Liu has been working on a tensor subclass for async execution. We discussed it in composability sync. The idea is that you can trigger an operation (typically communication) on a side stream, as well as some follow-on operations, without having to literally move the follow-on operations to the point where a sync happens. This also means that code in torchrec that has to be manually written as a pair of custom autograd functions for req/wait can be written in an intuitive, autograd style. We have a version that does this manually with callbacks (only queueing kernels onto the stream at some known later point in time) and Ying is working on another version that uses streams only. One interesting thing we noticed is that when you schedule an allreduce first in forwards, backwards will naturally schedule it last, but you actually want the allreduce to happen ASAP! @albanD suggested we may be able to add an API to modify the priority order of autograd backwards, which could be useful.
  • There will be a new repo https://github.com/pytorch-labs/ao for some of the new quantization schemes we’re working on. We discussed this in composability sync.
  • I did a long overdue update to record_stream docs at Add a note about performant record_stream use. by ezyang · Pull Request #112526 · pytorch/pytorch · GitHub after having some more discussions about it with @eellison who was trying to get cuda graph trees to work with record stream.
  • We’ve been talking about this with Vincent for a while, but there is now a proposed PR to add TensorDict to PyTorch core, check it out: [RFC] Tensordict integration by vmoens · Pull Request #112441 · pytorch/pytorch · GitHub

Dynamic shapes

Numbers

Training. 64f326097b dashboard

  • TIMM improvement is from channels last optimization

Inference. 64f326097b dashboard

  • 3% HF improvement from concat codegen on inference

State of PT2: Jan 12, 2024 edition

We’re back from holiday break.


State of PT2: Jan 20, 2024 edition

  • I did some live streamed bug fix sessions, which you can watch on YouTube. Check it out!
  • We had an internal SEV review. One thing that stood out to me was that two of the SEVs were compilation slowness stemming from dynamic shapes accidentally being turned on when it shouldn’t be. Dynamic shapes trouble.
  • Jack Cao and I got a design for dealing with saved for backwards intermediates, which is that we’re going to generate a pre-dispatch ATen FX graph into Dynamo, rather than directly represent torch.*. We’re pretty sure this should work. Track Jack’s progress at [WIP] Dynamo single step graph by JackCaoG · Pull Request #112296 · pytorch/pytorch · GitHub
  • Will Feng is getting close to finished with his prototype for lazy scheduler. It is a bit complicated, but it seems like it will work. LazyScheduler for operator reordering - Google Docs
  • Richard Zou is back to designing a new, class-based custom ops API, analogous to autograd.Function. This is based on user feedback where the existing custom ops API is quite difficult to use for autograd. Meta only: https://docs.google.com/document/d/1TVV3sDUv1E8ou1Hk0MeL7e1C6QPNl5UQesiP2MWhDSQ/edit (hopefully public soon)
  • Richard Zou and co have been working on making the Dynamo CI tests less flaky. They’ve made a lot of progress. Right now, all of the tests in CI are reasonably hermetic (they reset before running) and they’ve clustered the failures. A lot of very simple stuff (e.g., error where Dynamo uses a variable that’s not defined) and then a huge long tail of niche failures. Not many accuracy failures. One big problem is many problems only repro in CI environment and not locally.
  • A number of new folks from PL&R are ramping up on Dynamo! The extra manpower is much appreciated.
  • Yanbo Liang is thinking about how to measure compile time in our prod workloads. The challenge is full E2E tests are quite difficult to run. But maybe simple components like Shampoo can be extracted out and tested on their own!
  • Simon Fan is working on improving testing of compiled autograd by turning it on in our torchbench suite. Meta only status: https://docs.google.com/spreadsheets/d/17aCEcAcif-1saHrdqALjfr-ybRMl_uEq7mqje5whPZw/edit?usp=sharing The summary is that some pass and some fail: _cudnn_rnn_backward needs meta support, and there are some size/stride mismatch issues. Fire up the minifier! He’s also running torchbench with DDP.
  • New einops style library einx from the community: Reddit - Dive into anything Seems pretty neat! This is my continued reminder that first class dimensions are also a thing in PyTorch too.
  • Landed stuff:
  • Up for review:

State of PT2: Jan 28, 2024 edition


State of PT2: Feb 4, 2024 edition


State of PT2: Feb 11, 2024 edition


State of PT2: Feb 16, 2024 edition

Long weekend, so you get the update early this week! And good thing too, this week was jam packed.

State of PT2: Feb 24, 2024 edition

  • A lot of progress on torchrec and PT2 this week.
  • ghstack 0.9.0 is out. This release has a big new feature: you can now specify a subset of commits in your stack. Simply say ghstack submit HEAD~ to push only HEAD~ (but not HEAD). Commit ranges are also supported. There are also a number of QoL improvements: we now no longer include the PR title inside our generated head/base commits, we no longer strip @ from email addresses in commit messages, and there’s a smoother GitHub auth token flow. Most open issues in our bug tracker were fixed. There’s also an experimental --direct feature which lets you generate PRs that merge directly into main, but it’s not tested to work with pytorchbot; this is mostly useful if you’re using ghstack on your own repository. Internal xref: Redirecting...
  • Structured logging:
  • Composability sync was pretty spicy: Jason Ansel reopened the question of whether or not we want to support a pre-dispatch serialization IR at all. Composability meeting notes - Google Docs.
    • The composability minutes don’t have the full story: after the meeting, there was a bunch of back-channeling with Horace and Jason, and we got back to “OK, I guess we need to support pre-dispatch IR”. One major thing that convinced Jason was that we don’t actually plan on exporting FSDP-ized models (instead, the model will be unflattened post-export and FSDP applied at that point). For Horace, solving the org problem of needing to move your model around to different transforms, rather than applying them all in one go, seemed insoluble without an export format. Horace is to enumerate all the cases where we keep side tables that you can’t export with pre-dispatch: checkpointing and user-defined Triton kernels.
    • On a side note, there was some late-night discussion about what to do about subclasses and pre-dispatch export. My explanation to Michael Suo was that if a subclass can only be desugared after autograd, then your pre-dispatch IR must include the subclass, and your target runtime must know how to deal with the subclass. In some sense, this is not surprising at all, because pre-dispatch IR hasn’t had any of PyTorch’s internal subsystems resolved ahead of time, so you are going to need, e.g., a full-on autograd engine to actually run it (in practice, these export IRs are going to target PyTorch again). Sometimes subclasses can be implemented before autograd, but you often give up quite a bit to do so. For example, for DTensor to be pre-autograd, it would give up the capability to have a different sharding pattern between forwards and backwards; for NestedTensor to be pre-autograd, every nested tensor operation would need a separate ATen operator with its own derivative formula (as many nested tensor operations cannot be implemented just by desugaring). In some sense, this is not surprising: people are using __torch_dispatch__ because there are things you can’t do unless you’re below autograd! (See the __torch_dispatch__ sketch after this list for what “below autograd” looks like in code.)
  • Per-parameter FSDP is very serious business. With some upcoming training use cases that I cannot describe here in public, per-parameter FSDP is in serious contention to be the distribution mechanism. Will Feng has been focused on torch.compile’ing per-parameter FSDP (internal xref: Redirecting...), and this is driving more AOTAutograd feature work that Brian has been helping consult on.
  • I had a chance to ask Brian what was going on with FP8 and DTensor composability. There is progress here (contrary to my impression), and it seems the current design problems revolve around cases where the “DTensor first, then FP8” ordering wants to be violated. In particular, in some cases DTensor wants to do something special when a conversion to FP8 happens. The current thinking is that to_fp8 will be a dedicated ATen operator that DTensor can override, and we just need to figure out how to dispatch this to the FP8 subclass (similar to backend select, we have a dispatch problem since no argument to the operator actually takes in an FP8 tensor) before we finally get to proxy tensor (since we don’t want to_fp8 to show up in the final graph, after subclasses have been desugared). Brian is consulting on this.
  • The state of NestedTensor has been on my mind.
    • Jeffrey Wan has a fairly major change to nested int out: [NJT] Store vec on nested ints by soulitzer · Pull Request #119976 · pytorch/pytorch · GitHub. The implementation in the PR is actually different from the design Christian advocated for in the meeting, where sequences are canonically CPU but can be cached on CUDA. The main problem with Christian’s design is that if you are given only a CUDA lengths tensor, it forces an immediate sync to establish the CPU source of truth. More discussion necessary.
    • Joel Schlosser has subclass view fake-ification out: Subclass view fake-ification via reified ViewFuncs by jbschlosser · Pull Request #118405 · pytorch/pytorch · GitHub. It seems pretty close, just needs some detail work.
    • Basil Hosmer’s old fold prototype was back in the news. The context is that I finally internalized something Michael Suo was telling me, which is that stock NJT cannot handle torchrec’s KJT, which requires two dimensions before the jagged dim. How exactly should this be modeled with the nested ints that Jeffrey is working on? If a nested int corresponds to a Seq dim in Basil’s formulation, fold tells us pretty directly how to model this: (feature, batch, jagged, embedding), where jagged is a Seq dim. It is a little unfortunate that the stride note was never written, but with only a single jagged dimension, intuitive multiplication of Seq with int does what you want. Not sure if anyone else is on board with this; I need to get more alignment. One thing fold doesn’t answer is how you should resolve the degree of freedom of whether Seq’s data is stored on CPU or solely on CUDA. torchrec makes one particular choice, but there are others! Ivan tells me, however, that typically you don’t want to store stuff on CPU if you can avoid it.
    • Michael Suo has been pushing us to think more about the long term state of NJT and torchrec. This is not in the plans right now, but in the long term, we would ideally have NJT be the backend representation for JT and KJT. But torchrec has pretty stringent eager performance considerations, so it is not at all clear how you are ever going to actually manage this. This is somewhat reminiscent of the situation with complex tensors, where we have a C++ implementation, but for PT2 a Python implementation would be much preferable (but we can’t get rid of the C++ implementation because eager perf would suffer.)
  • Oguz/Chip are thinking about super big jobs. When you have that many nodes, one of them will fail and you’re going to have to restart, and this is going to happen a lot. This means warm start matters a lot, and the model is not changing. We’ve already got a memcache thing going on for compiled Triton kernels: think bigger.
  • Richard has been thinking more about custom ops; he has a new custom op API proposal: [WIP] python custom op API by zou3519 · Pull Request #120345 · pytorch/pytorch · GitHub. One of the tender design questions is “why do people want to put plain PyTorch operations in their custom operator, the so-called traceable operator”? Reference this old document for some use cases: PT2 black box escape hatch - Google Docs. Another interesting idea that popped up: if someone puts a Triton kernel in their CUDA implementation of a custom op, how do we get it to Inductor in a way that Inductor can understand, so it can directly incorporate the Triton kernel without generating an actual call to the custom op? One idea is for operators to have a more “structural” implementation, e.g., TritonKernel or FXKernel, where the implementation is not an opaque Python callable and the compiler can poke at it to extract the important information. And you could automatically generate these by using Dynamo. Food for thought. (For contrast, a sketch of today’s function-based registration flow appears after this list.)
  • Yidi Wu is making progress on torchbind fake-ification at Add torch.library.impl_abstract_class to fakify torchBind class by ydwu4 · Pull Request #120045 · pytorch/pytorch · GitHub. There’s some API bikeshedding on what exactly the API for getting the fake version of a torchbind object and testing its guard should be.
  • James March was complaining to me that OSS C++ logging has no timestamps on NCCL logs like “watchdog timeout”. This probably is not hard to fix; someone should check it out.
  • Animesh Jain has been steadily working on C++ guards; it’s a big stack of PRs that is slowly landing. He is planning to move on to the accumulate grad question, which we had discussed in composability last week. We worked out some more implementation details: fixing .grad handling in Dynamo is probably the long pole in the tent, desugaring of accumulateGrad will happen in Dynamo via a polyfill, and we will never handle the refcount == 1 case.
  • Mengwei Liu has a proposal for letting you load inline ExecuTorch kernels into regular PyTorch: Google Colab
  • Chip’s been thinking about big training jobs! Some Meta only docs to look at: https://docs.google.com/document/d/1gN8UmuqBxTX0MROSFkApjHFY0jzeE_tMTbuRPoxgWHk/edit also Internal Login
  • There’s a Dynamo bug burn down coming up soon, organized by Jane Xu and Richard Zou. Meta only: Redirecting...
  • Xiaodong is worried that DTensor is too focused on parallelism in llama, and there are other contexts which it is not well adapted to.
  • lucidrains will “be available in San Francisco for contracting, private tutoring, or full-time hire in March 2024”, per his website.
  • I didn’t pay much attention to the weekly compile time meeting but it seems like stuff is happening. Minutes at: https://docs.google.com/document/d/199LbkPiZdjn2CExEeRl8XDqO4unIPqMEwzyXJljGlY0/edit?usp=sharing
  • Notable bugs everywhere:
  • Notable bugs in symbolic shapes:
  • Landed stuff from Edward
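For contrast with Richard’s class-based proposal above, here is roughly what the existing function-based custom op registration flow looks like. The library name and op (“mylib”, “twice”) are made up for this sketch, and impl_abstract is, to my knowledge, the current spelling of the fake/meta registration; take it as a sketch rather than a reference.

```python
import torch

# Illustrative names: "mylib" and "twice" exist only for this sketch.
lib = torch.library.Library("mylib", "DEF")
lib.define("twice(Tensor x) -> Tensor")

def twice_impl(x):
    # A real custom op would call into its own (e.g. Triton/CUDA) kernel here.
    return x * 2

# Register the same Python function for CPU and CUDA to keep things small.
lib.impl("twice", twice_impl, "CPU")
lib.impl("twice", twice_impl, "CUDA")

# Abstract (meta/fake) implementation: lets PT2 trace the op and compute
# output metadata without running the real kernel.
@torch.library.impl_abstract("mylib::twice")
def twice_abstract(x):
    return torch.empty_like(x)

print(torch.ops.mylib.twice(torch.randn(3)))
```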