State of symbolic shapes branch

Z3 has a decent Python API. So, in theory this can be integrated easily.
Z3 is a dependency for a lot of things these days, including clang’s static analyzer (optional, but a lot of distros ship it enabled).

Z3 can also simplify expressions. It has a bunch of tactics that can be applied to expressions to simplify them. It can also do quantifier elimination, which can be useful for expression simplification.

I understand the guards & bounds checks simplification problems. Those can be mapped into entailment checks as you say.
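
For concreteness, here is a rough sketch of what such an entailment check could look like with Z3’s Python API. The guard expression is borrowed from the resnet example later in this thread; the range assumed for s2 is made up purely for illustration.

import z3

s2 = z3.Int("s2")
# Eq((s2 - 1)//16 + 1, 1); Z3's integer division agrees with floor division
# here since the divisor is positive.
guard = (s2 - 1) / 16 + 1 == 1
premise = z3.And(s2 >= 1, s2 <= 16)   # hypothetical known range for s2

# Entailment check: premise |= guard iff (premise AND NOT guard) is unsat.
solver = z3.Solver()
solver.add(premise, z3.Not(guard))
print(solver.check())                  # unsat -> the guard is redundant

# Z3 can also simplify the expression directly.
print(z3.simplify((s2 - 1) / 16 + 1))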

What I don’t understand yet is why these shape inference expressions are so complicated. I was just looking at Torchy’s code that does shape+stride inference, and it doesn’t look that complicated (there are min, max, equalities, mul, add, reduce). I think the worst part is that in a lot of cases you need to know the number of dimensions. If that’s also symbolic, things become very non-trivial. Do you need to give an upper bound for the number of dimensions? Or do you generate predicates that are valid in the unbounded case?

I’m happy to help with this stuff (designing, optimizing, whatever) if needed. I’ve experience with symbolic reasoning.

Regarding crashes:
this one issues a warning:

import torch

@torch.compile
def fn(a):
    b = a * a
    return b

# [WARNING] Unsupported: meta converter nyi with fake tensor propagation
print(fn(torch.tensor([1.0, 2.0], device="meta")))

Assert fails:

@torch.compile
def fn(a):
    b = a * a
    return b

from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode() as mode:
    print(fn(torch.empty([2,3])))

#   File "_dynamo/variables/builder.py", line 611, in wrap_tensor
#    assert type(value) in (torch.Tensor, torch.nn.Parameter)

Assert fails; could give a nice error message:

# without allow_meta=True
with FakeTensorMode() as mode:
    print(fn(torch.empty([2,3], device="meta")))

# File "_subclasses/fake_tensor.py", line 587, in __init__
#    assert device.type != "meta"

Another crash:

import torch
import torch._dynamo
import torch._functorch.config

torch._dynamo.config.dynamic_shapes = True
torch._functorch.config.use_dynamic_shapes = True

@torch.compile
def fn(a):
    b = a * a
    return b

print(fn(torch.tensor([1.0, 2.0])))

#   File "sympy/core/cache.py", line 70, in wrapper
#    retval = cfunc(*args, **kwargs)
# TypeError: unhashable type: 'SymInt'
# RuntimeError: Trying to extract a concrete int out of a symbolic int

All I was trying was to print some complicated symbolic shape expressions so I could understand the problem a bit better. But then I hit all these crashes, so I must be doing something very wrong…

I definitely think Z3 as an optional dependency is not too hard a sell. The turning point would be if we could make our design substantially simpler if we could assume Z3 was always available; then the calculation would turn to whether or not PyTorch by default has Z3 as a dep, which, enhhhhh. It’s easier for a distro to do it probably.

Nope, everything is fixed dimensionality. So honestly the computations in the smt2 files are not that complicated, but they can be very repetitive because we are just tracing the real shape compute PyTorch does which was not necessarily written in a nice way for symbolic analysis.

A big gap I see from the operators you quote here is floor division (shows up a bit in pooling operations) and modulus (not sure where these come from actually.)

Definitely looking for collabs!

These are crashing for silly reasons haha. In order:

  1. Meta tensors intentionally don’t work with fake tensor (which is what PT2 will do.) In principle they actually should work fine but real world user code doesn’t actually need to optimize code computing on meta tensors, and when we were working on fake tensor it was usually a bug to try to fakeify a meta tensor, soooo yeah. @eellison we probably should fix this eventually
  2. You don’t need to explicitly fakeify before calling torch.compile; it does that for you. So what happened here is you said hey PT2 compile this with a tensor subclass input and this doesn’t work. Shouldn’t be an assert though; we should file a bug on this.
  3. This is (1) and yeah let’s beef up the message
  4. This one I think is because inductor with dynamic shapes is busted. So you need to make sure you don’t use inductor. Easiest is to use torch._dynamo.optimize("aot_eager") instead of torch.compile

For playing around with simple examples, your best bet is to look at the tests with Symbolic in their class name in test/test_proxy_tensor.py. In particular, most of these call make_fx with tracing mode symbolic. This will give you the smallest slice of the system that is doing something interesting with dynamic shapes.
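
A minimal example of that smallest slice (sizes show up as s0, s1, … in the printed graph); this is just a sketch of the usage pattern described above:

import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(a):
    return (a * a).sum()

# tracing_mode="symbolic" treats the input sizes as symbols instead of
# burning in 3 and 4.
gm = make_fx(f, tracing_mode="symbolic")(torch.randn(3, 4))
print(gm)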

Thank you!

Just a couple more crashes (I know, I’m a magnet for them…):

import torch
from torch.fx.experimental.proxy_tensor import make_fx

def fn(a):
    b = a * a
    if b.sum():
        return a
    return b

print(fn(torch.tensor([1.0, 2.0])))
traced_f = make_fx(fn, tracing_mode="symbolic")(torch.tensor([1.0, 2.0]))
print(traced_f)

# RuntimeError: tried to get Double out of SymFloat

If using this instead:

def fn(a):
    b = a * a
    if b.sum() >= 1:
        return a
    return b

# NotImplementedError: local_scalar_dense/item NYI for torch.bool

I’ve attempted to fix it by patching fake tensor’s local_scalar_dense:

    elif is_integer_dtype(arg.dtype) or is_boolean_dtype(arg.dtype):
        return fake_mode.shape_env.create_unbacked_symint()

But no luck; still crashes (tried to get Long out of SymInt).

Re the latest crashes, these are all due to attempting to do control flow on data dependent values (in this case, the float in the tensor). This is expected but we can make the error messages better.

State of symbolic shapes: Jan 11 edition

Previous update: State of symbolic shapes branch - #22 by ezyang

Commit ID at time of writing: f6c7cf1bf579cc42ea7e21bd557168625648a3e9

Executive summary

Welcome back from holiday and PSC folks! The holidays were sleepy, but not that sleepy: @jbschlosser landed some nice coverage wins (+2 working models on master) and @tugsbayasgalan has been working on a number of dynamic shape fixes that show up in PyTorch Edge export use cases. Since it’s the new year, you might be wondering what you ought to be doing in the new year. This status post is here to help you figure that out.

High priority things that need to be done (excluding export):

Things to be done sourced from PyTorch Edge export workstream (Meta only):

Things to be done sourced by generic export workstream (@SherlockNoMad)

Status report:

  • Model training status on master. See also Symbolic shapes work items tracker - Google Sheets
    • aot_eager: -15 (+2 MoM). These newly working models are thanks to work by @jbschlosser skip list
    • inductor: still really broken on master :rotating_light::rotating_light::rotating_light:
  • OpInfo tests on symbolic shapes.
    • pytest test/test_proxy_tensor.py -k test_make_fx_symbolic_exhaustive - 516 passed (+3 MoM), 522 skipped (no change), 224 xfailed (-3 MoM)
    • pytest test/functorch/test_aotdispatch.py -k test_aot_autograd_symbolic_exhaustive - 291 passed (+5 MoM), 143 skipped (+1 MoM), 197 xfailed (-5 MoM)

Guard simplification

As you can see from the chatter before this post, Nuno is working on guard simplification. Nuno looked at resnet, and found that range analysis and a quadratic solver would solve most of the nontrivial constraints we were generating, greatly simplifying the guards we generate. However, there were also suggestions that we could simplify the guards at generation time.

Example constraint: Eq((s2 - 1)//16 + 1, 1). This is produced by empty_strided calling compute_contiguous on the passed in strides. Constraint is specifically the size_d != 1 test when testing contiguity. Should avoid guarding here ala subclass_zoo/dynamic_strides.ipynb at main · albanD/subclass_zoo · GitHub cell 18 (but need SymBool for this!)

 T z = 1;
  // NB: make sure we do signed arithmetic
  for (int64_t d = int64_t(sizes.size()) - 1; d >= 0; d--) {
    const auto& size_d = sizes[d];
    if (size_d != 1) {
      if (strides[d] == z) {
        z *= size_d;
      } else {
        is_contiguous = false;
        break;
      }
    }
  }
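
A hedged sketch of the guardless formulation hinted at above (the shape of the idea, not the real c10 implementation): compute contiguity as one boolean expression over the sizes and strides instead of Python-branching on each comparison, so no guards get installed. With SymInt inputs each == produces a SymBool, and the conjunction stays symbolic.

import functools
import operator

def sym_is_contiguous(sizes, strides):
    # Same result as the eager loop above, but with no data-dependent
    # branches: a mismatch just contributes a False term to the conjunction.
    terms = []
    expected = 1
    for size_d, stride_d in reversed(list(zip(sizes, strides))):
        terms.append((size_d == 1) | (stride_d == expected))
        expected = expected * size_d  # multiplying by a size of 1 is a no-op
    if not terms:
        return True  # zero-dim tensors are contiguous
    return functools.reduce(operator.and_, terms)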

Example constraint: (s2 - 1)//2 + 1 < (s2 - 1)//2**2 + 2*(s2 - 1)//2 + 1. This comes from:

File "/Users/ezyang/Dev/pytorch-cpu/torch/_prims/__init__.py", line 348, in _elementwise_meta
  strides = utils.compute_elementwise_output_strides(*args_)
 File "/Users/ezyang/Dev/pytorch-cpu/torch/_prims_common/__init__.py", line 407, in compute_elementwise_output_strides
  comparison = should_swap(perm[dim0], perm[dim1])
 File "/Users/ezyang/Dev/pytorch-cpu/torch/_prims_common/__init__.py", line 387, in should_swap
  if stride_a < stride_b:

The easiest fix is probably to make sure we don’t run the sorting algorithm when we have contiguous inputs. But even better would be to introduce a contiguity guard (which tests that a tensor is contiguous), which should be able to eliminate these guards entirely.
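
A hedged sketch of the first fix (names here are illustrative, not the real _prims_common code): guard once on the boolean contiguity property and emit contiguous strides directly, so the symbolic should_swap comparisons never run in the common case.

import torch

def elementwise_output_strides(*tensors):
    if all(t.is_contiguous() for t in tensors):
        # One contiguity guard instead of O(ndim^2) symbolic stride
        # comparisons inside the sorting logic.
        strides, acc = [], 1
        for s in reversed(tensors[0].shape):
            strides.append(acc)
            acc *= s
        return tuple(reversed(strides))
    # Placeholder: fall back to the permutation/sort logic quoted above.
    raise NotImplementedError("non-contiguous inputs need the sort-based path")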

You can reproduce these experiments with the following code:

import torch
import torchvision
import torch._dynamo

model = torchvision.models.resnet18(weights="DEFAULT")
model = torch._dynamo.optimize("eager")(model)
model.eval()
model(torch.rand([64, 3, 7, 7]))

Something that is not so easy right now is finding out what produced guard expressions; e.g., I see x ** 2 // y blah blah, where did it come from? More detailed logging at the Dynamo per-op level would help.

What’s made it to master since last time?

ezyang

voz

jbschlosser

tugsbayasgalan

nkaretnikov

What’s coming next?

  • ezyang: catching up on code review
  • voz: changing dynamo backend api to take aot autograd directly
  • bdhirsh: on vacation until next week
  • Chillee: inductor integration

Our north star:

  • Dynamic shapes enabled by default for PT2 release
  • Fallback implementation for custom operators without symbolic shape propagation, inferred by running fallback on real operators :rotating_light::rotating_light::rotating_light:
  • All OpInfo tests passing
  • Dynamic shapes on by default for developers / users

Mini-update: Status of unbacked SymInts on Jan 16

I recently presented our progress on unbacked SymInts, our strategy for data-dependent output sizes, in the most recent composability meeting (meeting notes: Composability meeting notes - Google Docs , Meta only recording: Redirecting...). This status post will recap what I described in the meeting, and also explain what you should expect on unbacked symints in the near future.

tl;dr I (@ezyang) will be deprioritizing proper support for unbacked SymInts, because it looks like there are fundamental infrastructure improvements in Sympy reasoning and tracing performance we need to work on first. Also, unbacked SymInts are not launch blocking for dynamic shapes in PT2. Fortunately, we have identified a few short term unblocking steps that can help immediate users of unbacked SymInts make progress, albeit at the cost of some slight unsoundness (which we don’t expect to matter in practice.)

Background

In PyTorch’s tracing model, we ordinarily try to treat the shapes of input tensors symbolically. However, if we need to perform control flow on an expression involving one of these symbolic sizes, we peek at the true values to resolve the condition to true/false, and install a guard saying that our trace is only valid if the condition evaluates equivalently in the future. This is motivated by the fact that in a tracing framework, we cannot easily trace both sides of the conditional (you could use something like thermometer continuations to run the trace as if the condition evaluated true, and then rewind and rerun the trace as if the condition evaluated false, but you still have the problem of a combinatorial explosion of possible paths you could go down.)

Guarding works well for statically known sizes, but if you call an operator like torch.nonzero, it will produce a size that is only known at runtime. Our idea for how to handle this case is simple: we produce a symbolic size that has no underlying value (aka is “unbacked”), and instead error if we attempt to guard on this symbolic size.
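
In terms of the internal ShapeEnv API, a minimal sketch of the idea (these are internal interfaces, so names may drift):

from torch.fx.experimental.symbolic_shapes import ShapeEnv

env = ShapeEnv()
i0 = env.create_unbacked_symint()  # a SymInt with no example value behind it

j = i0 * 2 + 1                     # symbolic arithmetic is fine
# bool(j > 5) would need to guard on i0; since there is no backing value to
# peek at, it raises instead of silently specializing.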

Some further reading that you may find helpful: subclass_zoo/dynamic_strides.ipynb at main · albanD/subclass_zoo · GitHub (about dynamic shapes and strides) and subclass_zoo/dynamic_shapes.ipynb at main · albanD/subclass_zoo · GitHub (about dynamic shapes in general)

Current state

Here is the state of unbacked SymInts on master:

Where exactly are these guards coming from? Here is a mostly complete accounting:

  • When we construct a TensorImpl, we have to compute contiguous strides for it. The max between each size and 1 in this loop is where guards on sizes get installed (this is the “stride computation guard” mentioned later in this post):
    case MemoryFormat::Contiguous: {
      // dim_ is a virtual call, don't repeat it
      const auto dim_ = dim();
      extra_meta_->strides_.resize(dim_);
      if (dim_ > 0) {
        const auto last_idx = dim_ - 1;
        extra_meta_->strides_[last_idx] = c10::SymInt(1);
        for (auto i = last_idx - 1; i >= 0; --i) {
          extra_meta_->strides_[i] =
              extra_meta_->strides_[i + 1] * extra_meta_->sizes_[i + 1].max(1);
        }
      }
  • When we construct a TensorImpl, we have to compute whether or not it is contiguous. It turns out that we don’t short circuit this computation when torch.empty (whose output is guaranteed to be contiguous) is called. This computation branches on size and stride:
  for (int64_t d = int64_t(sizes.size()) - 1; d >= 0; d--) {
    const auto& size_d = sizes[d];
    if (size_d != 1) {
      if (strides[d] == z) {
        z *= size_d;
      } else {
        is_contiguous = false;
        break;
      }
    }
  }
  • Even if we were to shortcut contiguity call, we still need to compute whether or not the tensor is channels last contiguous. This is because some contiguous tensors are ALSO channels last contiguous, e.g., (1, 1, 1, 1). This computation branches on size and stride in a similar way:
      T expected = 1;
      for (auto& d : {1, 3, 2, 0}) {
        const auto& size_d = sizes[d];
        if (size_d != 1) {
          if (strides[d] != expected) {
            return bool_is_channels_last_contiguous(false);
          }
          expected *= size_d;
        }
      }
      return bool_is_channels_last_contiguous(true);
  • We also compute whether or not a tensor is non-overlapping and dense. The computation here is quite involved: we have to do a sort on the strides (sorting network, anyone?) But fortunately, it also can be short circuited in the common case (since a tensor is definitely non-overlapping and dense if it is contiguous.) However, it cannot always be short-circuited; for example, if you allocate a tensor with an unbacked SymInt size and then take a view on it, the view may not be contiguous, and so you have to do the full computation in that case. This is exactly what happens in the case of boolean indexing (in the meta implementation of index.Tensor). One extra problem is the call to nonzero here is in library code, so if you want to add asserts on the result of index, you can’t easily do this.
            if index.dtype in [torch.int8, torch.bool]:
                nonzero = index.nonzero()
                k = len(result)
                check(
                    k + index.ndim <= self.ndim,
                    lambda: f"too many indices for tensor of dimension {self.ndim}",
                    IndexError,
                )
                for j in range(index.ndim):
                    check(
                        index.shape[j] == self.shape[k + j],
                        lambda: f"The shape of the mask {index.shape} at index {i} "
                        f"does not match the shape of the indexed tensor {self.shape} at index {k + j}",
                        IndexError,
                    )
                    result.append(nonzero.select(1, j))

To support “proper” unbacked SymInts, we must modify our framework code to avoid guarding in all of these cases. We also need a mechanism by which users can insert extra runtime assertions to ensure that if a guard is unavoidable (e.g., size >= 0) but will always resolve one direction, we can manually guide tracing down one branch of the guard and have a runtime test to verify that we would have gone down that path at runtime as well.

Progress towards unbacked SymInts

The diff stack at Switch is_contiguous to use guardless implementation by ezyang · Pull Request #92234 · pytorch/pytorch · GitHub was my attempt to directly remove all of these guards. The infrastructure pieces (first three PRs) were not too bad (and I intend to land them), however, they do not actually remove the guards. The actual guard removal has run into at least two problems:

  • When I remove the stride computation guard (max between size and 1), it makes test_aot_autograd_symbolic_exhaustive_nn_functional_max_pool2d_cpu_float32 take a long time to run. This suggests that one of the guards from stride computation was essential for simplifying our size variables and prevented Sympy from doing a ton of unnecessary work.

  • When I remove the compute contiguous / non-overlapping guards, along with causing problems with max_pool2d, it also makes test_aot_autograd_symbolic_exhaustive_nn_functional_unfold_cpu_float32 take a long time to run.

While I could believe that the ultimate fixes for these two problems could be quite short in the end, the way we thread the needle with our Sympy processing is quite opaque to me, and I don’t feel confident I can discover the fixes in the time available. Combined with our deadline of getting dynamic shapes turned on by default for the PT2 release, as well as the fact that we need to make investments in speeding up Sympy simplification (e.g., Nuno’s suggestions), I think it makes sense for me to deprioritize making full unbacked SymInts work in the short term. If a brave soul wants to step up to try to fix these problems, I can help give advice.

The short term unblock

The Executorch folks are attempting to export models with symbolic shapes in them. If we don’t have true unbacked SymInts, what can you do? A lot, it turns out. Here are the workarounds I suggest trying, from least sophisticated (aka most unsound), to most sophisticated (but requiring more work).

  1. If you need to implement a meta function for an operator with dynamic output size, an extremely simple first step is to make the function return a fixed size (zero, or the maximum possible value, are good choices) instead of an unbacked SymInt; see the sketch after this list. This is unsound for the following reasons: (1) the fixed size may get burned into the graph (this is more likely to happen with zero; the max size is likely itself a symbolic size, so the burn-in in that case is less severe), (2) conditionals will be burned in according to this size, which means you may miss noticing a branch that actually needs to be replaced with a cond() operator to allow both branches to be exported. However, this is really good for “getting past” a missing output meta and seeing what needs to be done next.

  2. The next step up from (1) is to return an unbacked SymInt, but allow guards to be resolved with respect to an example size in certain contexts. The basic idea is to return an unbacked SymInt, but then “mock up” size_hint so that when we try to guard on the unbacked SymInt, we make use of an example size instead. This prevents some burn-in (as we are still passing around a fresh, unbacked SymInt for tracing purposes); we only burn-in conditional paths. A user could then use logging / manual inspection to see if any conditional paths need to be treated specially. You can also only mock in certain fragments of code (e.g., while executing a tensor factory) where the example value is known to generalize for all possible values. I can help implement this, though I’m not exactly sure when to do it.
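
As a concrete illustration of option (1), here is a hedged sketch of a meta kernel for a hypothetical nonzero-like operator that returns the maximum possible size instead of an unbacked SymInt. It is unsound for the reasons listed above, but it is enough to get past a missing output meta.

import torch

def nonzero_meta_max_size(self):
    # Upper bound: every element could be nonzero. numel() is itself
    # symbolic, so the bound is a symbolic size rather than a constant.
    return self.new_empty((self.numel(), self.dim()), dtype=torch.long)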

We can also in parallel implement the mechanism for adding runtime asserts, although without better guard simplification framework (e.g., Nuno’s range analysis) it is difficult to use these asserts to actually resolve guards on unbacked SymInts.
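
A hedged sketch of what such a runtime assert looks like from the tracing side, using the constrain_range helper that shows up later in this thread (the exact import path and signature may differ on your commit):

from torch.fx.experimental.symbolic_shapes import ShapeEnv, constrain_range

env = ShapeEnv()
nnz = env.create_unbacked_symint()
constrain_range(nnz, min=0)  # declare nnz >= 0; to be checked at runtime
# With the range recorded, a later guard like nnz >= 0 can be discharged
# statically instead of raising a data-dependent guard error.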

The long term plan

To be honest, I have not committed to a particular course of action for how to solve these problems. I can think of two plausible trajectories.

The conservative route. We land my PR stack as is, and try to hotfix the particular simplification problems case-by-case. Hopefully, they aren’t too hard to resolve and there aren’t too many of them.

The blow-it-up route. We rewrite ShapeEnv’s symbolic reasoning logic from scratch without Sympy or using Sympy in a much more limited fashion, so that algorithmically it is obvious that we always get good performance. This would also help resolve performance issues in tracing from spending too much time in Sympy (according to @voz , in Bert we spend 30% of our time in Sympy simplification.)

Open to comments from the peanut gallery. Honestly, this is going to depend a lot on who ends up doing this work.

How about learning from JAX and adding an optional static-sized variant for nonzero, so at least some power users could get past this problem? AFAIK Tensor indexing still uses nonzero, so we could unblock those use cases if users are willing to change their code.

I raised this 18 months ago: [JIT] Support JAX-style statically shaped nonzero to avoid host-device synchronization · Issue #62320 · pytorch/pytorch · GitHub. It still seems relevant to me.

Hi!

I would be curious how indexing like t[mask] = t[mask] * 2 can be solved by adding this new size argument? You still need to find a way to get the number of non-zero values internally before being able to call nonzero?

I guess in principle it could be rewritten as t = torch.where(mask, t * 2, t).
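
For the pointwise case the rewrite is mechanical; a small self-contained example:

import torch

t = torch.tensor([1.0, -2.0, 3.0])
mask = t > 0

# Data-dependent-shape version: t[mask] = t[mask] * 2
# Shape-preserving rewrite, no nonzero() needed:
t = torch.where(mask, t * 2, t)
print(t)  # values are doubled only where mask is True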

See Alternative to array-based boolean indexing for jax.jit · Issue #2765 · google/jax · GitHub

FWIW, I do think we should have a static-sized variant for nonzero. But in the boolean masking example, for instance, this wouldn’t actually help, as there is no nonzero call; instead, the nonzero call is implicit in the index by boolean mask. The torch.where rewrite works well for a pointwise operation on top, but for the flip() example you need to do something more complicated.

State of symbolic shapes: Jan 20 edition

Previous update: State of symbolic shapes branch - #31 by ezyang

Executive summary

Volume-wise, there wasn’t that much activity, but there were three landed PRs that had a disproportionate effect on our metrics. First, we landed Brian’s AOTAutograd fixes which fixed a large number of assert failures; second, Horace is finally back to dynamic shapes and landed a PR that fixes a few Inductor inference dynamic shapes problems (fixing inductor enough that we can start reporting master stats again); finally, I noticed an accounting problem for our stats, where many of the failures we were reporting actually had nothing to do with dynamic shapes. Overall, this pushed our delta for aot_eager to TWO :tada::tada::tada: (one coverage, one timeout). This is fantastic, and we are turning our attention to other areas of dynamic shapes support:

  • Brian is spearheading tracking the number of extra graph breaks caused by dynamic shapes (tracked on “extra graph breaks” sheet at Symbolic shapes work items tracker - Google Sheets ). For now, we are only looking at torchbench. We don’t have a consolidated statistic to track this week over week yet but we will soon.
  • Horace is grinding down inductor inference failures with dynamic shapes (tracked on “inductor eval” sheet at Symbolic shapes work items tracker - Google Sheets ; the horace sheet is with Horace’s WIP stack). We are in the process of transitioning regular CI coverage from testing aot_eager training to testing inductor inference, which will allow us to give comparable metrics to aot_eager on master (this week we will have a one-off metric here).
  • Voz is working on improving our tracing time, which is called out by both OSS and internal users as a problem, and is a big problem for dynamic shapes, which is ostensibly about improving compilation times. We are also in the process of preparing a consolidated statistic to track week over week.

We also need to start working on inductor training support, which has its own unique challenges. We’ve also been discussing nested tensor / jagged tensor compilation with inductor (e.g., PyTorch Composability Sync: Nested/Jagged Tensor compilation - YouTube ). We are deprioritizing work to characterize how dynamic/static our benchmark suite is, and instead intend to evaluate this ad hoc on use cases where users come to us and say “hey, I need this to be dynamic.” One example is this script from Timothee Lacroix: Redirecting... (Meta only). There is some discussion about needing a more fine-grained way to turn on dynamic shapes (e.g., instead of turning it on for ALL local tensors, only turning it on for tensor dimensions that are known to be dynamic.)

Status report:

What’s made it to master since last time?

ezyang

voz (nothing dynamic shapes related)

Chillee

jbschlosser (nothing; just got back from PTO)

bdhirsh

nkaretnikov

What’s coming next?

  • ezyang: CI stuff, then probably trying to get inductor training going on master
  • Chillee: hosing down inductor inference errors
  • bdhirsh: working on dynamo graph breaks; also working on AOTDispatch enhancements for torchquant and nested tensor
  • jbschlosser: not sure yet
  • nkaretnikov: enabling dynamic shapes testing on inductor

Our north star: Dynamic shapes at feature parity with static shapes for PT2 release (but NOT turned on by default)

There is now a manual for all things dynamic shapes related, check it out here: The dynamic shapes manual - Google Docs

State of symbolic shapes: Jan 29 edition

Previous update: State of symbolic shapes branch - #37 by ezyang

Executive summary

We are two weeks away from branch cut for PyTorch 2.0. Dynamic shapes has enough on master that we are non-blocking for the release: there is still a lot we want to get in before the release, but the most important stuff is landed. In particular, Horace landed more inference fixes and we also have enabled CI for Inductor inference on master. There is a PR in progress for training https://github.com/pytorch/pytorch/pull/93059 but our general thinking is that dynamic shapes is more important for inference (where you are more likely to want to vary sequence length) as opposed to training.

Horace’s order of operations is: (1) basic training support, (2) inference performance on autoregressive generation, (3) other stuff; Edward will just be working on general enablement here and there. Voz is still working on trace time performance (some improvements landed, and some very promising work on short circuiting meta computation at [WIP] [RFC] add shape_preserving notions to decomps for fake_tensor specific short circuiting by voznesenskym · Pull Request #93118 · pytorch/pytorch · GitHub could also lead to speed wins with static shapes too.) Brian and Joel have still been working on Dynamo graph breaks, although none of the PRs from this workstream have landed yet (still working out Dynamo code review.)

  • Models outside of the benchmark suite. We took some fun models out for a spin last week. wav2vec2 is successfully running inference under torch.compile with dynamic shapes. maskrcnn is not in as good a state, but a lot of its blockers are things we know about and have been working on.
  • Accuracy failures. Background_Matting and LearningToPaint are failing accuracy with inductor inference with dynamic shapes, but not without dynamic shapes. These are priority to fix.
  • Documentation. This got its own post, but in case you missed it: there is now a manual for dynamic shapes enablement: The dynamic shapes manual - Google Docs Let us know if there’s anything you’d like to see in it.
  • How dynamic is the benchmark suite? Edward ran an experiment where he printed out the number of unique symbolic variables after tracing. Interestingly, most models only have one unique symbolic variable (likely the batch dimension.)
  • Why is tracing so slow? Voz added a bunch of extra instrumentation to help better characterize what exactly we’re doing when tracing, and Horace ran some experiments. One of the more interesting results was that in hf_Bert inference, Dynamo produces a graph with 570 nodes, but after AOTAutograd this balloons to 1528 nodes. Making matters worse, fake tensor is invoked 47000 times (16k occurring before AOTAutograd, 31k after.) This is what pushed us in the direction of reducing fake tensor overhead with meta function short circuiting. Hacky experiments by Voz show we can get a 50-70% speedup this way. Also, pytree is slow, we are eagerly awaiting [WIP][POC][pytree] Use OpTree for PyTree manipulation by XuehaiPan · Pull Request #92679 · pytorch/pytorch · GitHub
  • Model training status on master. See also Symbolic shapes work items tracker - Google Sheets
    • aot_eager inference: -3 (NEW!). It turns out there are some models that are failing static shapes aot_eager training but not inference. These appear to be failing for straightforward coverage reasons and should be easily fixable.
    • aot_eager training: -1 (-1 WoW). The only remaining error is a timeout, which we hope will be resolved by trace time performance work.
    • inductor inference: -16 (NEW!). Doing a more direct comparison against Horace’s stack from last week, a manual sweep gives 143/160 (+43 WoW)
    • inductor training: with Horace’s patch, 49/129 (NEW!)
  • OpInfo tests on symbolic shapes.
    • pytest test/test_proxy_tensor.py -k test_make_fx_symbolic_exhaustive - 547 passed (+5 WoW), 523 skipped (no change), 196 xfailed (-5 WoW)
    • pytest test/functorch/test_aotdispatch.py -k test_aot_autograd_symbolic_exhaustive - 302 passed (+5 WoW), 143 skipped, 828 deselected, 188 xfailed (-5 WoW)

What’s made it to master since last time?

ezyang

Chillee

voz

jbschlosser

What’s coming next?

  • ezyang: inductor inference accuracy failures, popcorn enablement
  • Chillee: inductor training, autoregressive generation performance
  • bdhirsh: dynamo graph breaks, inference functionalization (this looks like we will still need to put copy_ in the graph)
  • jbschlosser: dynamo graph breaks
  • nkaretnikov: finally getting the floor div patch series in (it fixes real bugs!)

Our north star: Dynamic shapes at feature parity with static shapes (but NOT turned on by default.)

Mini-update: Dynamic shapes enablement for OpenNMT

vince62s has been interested in using dynamic shapes with OpenNMT. I haven’t had a chance to try the full model, but a small extracted example is (mostly) successfully compiling in a size generic way when run with dynamic shapes. Check out my summary comment for more details!

State of symbolic shapes: Feb 4 edition

Previous update: State of symbolic shapes branch - #38 by ezyang

Executive summary

We are one week away from branch cut for PyTorch 2.0. We are continuing the slow grind of last mile bugs: there are still no major features we want to get in before the branch cut. We have reached the point where there are no more “easy” bugs; every bug is nontrivial and has someone working on it. On the people front: Brian is focusing on inference functionalization + AOTAutograd related OOMs; Voz has passed off trace time improvements to Edward.

  • Expression hoisting. Natalia from Inductor has been pitching in to help us with some dynamic shapes related Inductor problems. She landed symbolic shape hoisting to host compute dynamic tensor shapes for indexing on the host by ngimel · Pull Request #93872 · pytorch/pytorch · GitHub which greatly improved the quality of generated code, and hopefully also fixes a decent number of floor/ceil problems. She’s been working on unet upsampling, which uses floating point to do indexing computations (oof).
  • Tracking dynamic shapes performance. @gchanan has asked us to start tracking the performance of models running with dynamic shapes. This should be integrated into the performance dashboard; hopefully it can happen in the next month; but perhaps it should wait until we move performance onto GCP A100 in CI, as the current dashboard run takes 16 hours.
  • Dynamic shapes minifier. The dynamic shapes minifier is less broken now; Edward spent some time this week working on it to extract repros for LearningToPaint and Background_Matting accuracy errors. However, it still does not work very well (AOT accuracy minification is fundamentally broken and needs a rewrite.)
  • More on tracing speed. OpTree is still shambling along, but at this rate unlikely to make branch cut. Voz passed off his hacky prototype to Edward, who has implemented a sound version of the optimization at https://github.com/pytorch/pytorch/pull/94047 ; it isn’t as fast as the hacky prototype, but in exhaustive testing appears to match behavior exactly.
  • Model status on master. See also Symbolic shapes work items tracker - Google Sheets
    • aot_eager inference: -6 (-3 WoW). The regression is due to the CUDA 11.7 upgrade, which introduced new accuracy failures. This upgrade also caused this exact set of models to start accuracy failing on training with static shapes, so it’s probably not dynamic shapes specific, but by our metric design this still counts as a regression. We haven’t had a chance to investigate yet.
    • aot_eager training: 0 (+1 WoW). :tada::tada::tada: I believe the timeout was fixed by some of Voz’s performance improvements.
    • inductor inference: -10 (+6 WoW). The breakdown is: two accuracy failures (Inductor miscompilation with dynamic shapes from LearningToPaint · Issue #93361 · pytorch/pytorch · GitHub Inductor miscompilation with dynamic shapes from Background_Matting · Issue #93864 · pytorch/pytorch · GitHub), two timeouts (hopefully we can book some more trace time improvements), and the rest are floor/ceiling not defined errors (Natalia is working on this.) Fixing this last problem is nontrivial; ordinarily, you would think to just define how to translate floor/ceiling into Triton, but actually there is no Triton builtin for these operations. Our current thinking is to hoist this computation out of the Triton kernel; should be good for performance too. This is out of date since Natalia landed her patch, need to test.
    • inductor training: no change, we are waiting on Horace to land his patch
  • Opinfo tests on symbolic shapes.
    • pytest test/test_proxy_tensor.py -k test_make_fx_symbolic_exhaustive - 550 passed (+3 WoW), 522 skipped (-1 WoW), 192 xfailed (-4 WoW)
    • pytest test/functorch/test_aotdispatch.py -k test_aot_autograd_symbolic_exhaustive - 304 passed (+2 WoW), 143 skipped (no change), 185 xfailed (-3 WoW)
  • Graph breaks on master. -19 (NEW): these models have more graph breaks with dynamic shapes than with static shapes (see sheet aot_eager static 2/4).

What’s made it to master since last time?

ezyang

Chillee

jbschlosser

nkaretnikov

ngimel

What’s coming next?

  • ezyang: landing tracing speed, model enablement fixes (Lacroix, MaskRCNN, OpenNMT, detectron2), unbacked SymInt?
  • Chillee: landing inductor training, autoregressive generation performance, inductor accuracy failures (and minifier fixes for it)
  • bdhirsh: inference functionalization
  • jbschlosser: more dynamo graph breaks?
  • voz: landing dynamo graph breaks. bug fixes? post branch cut: forbid specialization (previously known as guard inversion), test bankruptcy

Our north star: Dynamic shapes at feature parity with static shapes (but NOT turned on by default.) We might want to adjust this so that dynamic shapes is “on” by default, but we assume all inputs are static by default.

State of symbolic shapes: Feb 11 edition

Previous update: State of symbolic shapes branch - #40 by ezyang

Executive summary

It’s the eve of the branch cut for PyTorch 2.0. Training support is still not landed to master, though Horace says he’s made progress against some of the FSDP bugs plaguing his PR. Brian’s functionalization for inference is on the cusp of landing, after working through considerable flakiness in Inductor. The general mood is that we need more inductor developers. We had a lot of forward looking design discussion this week, check the bullets for details.

  • Tracing speed improved on master. Fast path binary ops in fake tensor by ezyang · Pull Request #94047 · pytorch/pytorch · GitHub has made it to master; on some models, it can cut dynamic shapes tracing time in half. There is some low hanging fruit for extending the approach here to more operators. In general, we’ve found improving trace time challenging as Python profiles tend not to identify problems directly (one thing we note is that we don’t get aggregation on a per-op basis because of all the metaprogramming we do; another hypothesis for our difficulty is the amount of Python-C++ crossings we tend to do). We don’t think Sympy is the main cause of slowdown, but Sympy can cause other problems when it gets into pathological cases (see below.) We need to start tracking this metric in our runs.
  • Dynamic shape induced graph breaks are way down. Voz and Joel have done a lot of good work hosing down extra graph breaks from dynamic shapes. The remaining holdout is avoiding graph breaks on negation; fixing the graph break is easy but the resulting graph appears to cause Inductor to infinite loop; Voz was unsuccessful at diagnosing the problem.
  • Dynamic by default. jansel has argued us back into goaling on dynamic by default. The general idea is that we should be leaning into PyTorch’s historic flexibility and support for dynamic shapes. Jason’s straw proposal is that the first time we assume it’s static, and upon recompile we compile with dynamic shapes. Elias cautions that if there aren’t too many dynamic shapes, it may be better to generate separate specialized kernels for each and retain use of CUDA graphs and autotuning. Most of us were generally positive on this idea, but concerned about the implementation: dynamic shapes trace time, greater bug surface and lower generated code quality. Still, we can probably keep moving PT2 in this direction, but more data is also necessary.
  • Fine-grained dynamic dimensions. We want an API for marking dimensions as being explicitly dynamic. There are two primary users for this: (1) export, where we want to reject specialization/guards on dimensions that are explicitly dynamic as these usually indicate tracing problems and (2) eager mode power users, who want fine grained control over what dimensions should be compiled dynamically–e.g., to maximize performance, avoid unnecessary recompilation, or just diagnose why a model is recompiling when it shouldn’t. The fine-grained API is not intended to be the initial starting point for regular users: normal users should be able to not annotate anything and get intelligent results (see bullet above.) We went through a lot of API variations, but our current plan is to ship an API that lets you mark tensors as having dynamic dimensions, which affects how torch.compile does compilation. In the long term, we intend for eager mode to propagate these annotations using symbolic meta formulas (which is important for making this work across graph breaks). These annotations will NOT do anything if inserted inside a model; they only work at the “top level” where you trigger compilation. Some minutes for this discussion at https://docs.google.com/document/d/1aoIyYE8_6cYpWqS25thzVoIiKsT5aaUEOiiPwbIXt8k/edit Sherlock has an out-of-date prototype of the API at Fine grain dynamic shape by SherlockNoMad · Pull Request #93813 · pytorch/pytorch · GitHub
  • torch.constrain. When you mark a dimension as dynamic, you reject any user code which would constrain the dimension. But what if the constraint is that the dimension != 0; you would like a way to say that this constraint is acceptable. torch.constrain is an API that can be called inside the model to indicate these constraints should be accepted. Originally, Voz and Horace were imagining that you could put arbitrary Python expressions in these constraints. We’ve negotiated to only allow a more limited set of constraints to start: min/max and multiple-of. We plan not to allow “relational” constraints that relate two separate symbolic variables: we simply assume that these are always OK. This means that export still needs some mechanism for communicating “implicit” guards (much in the same way we implicitly guard on the dtypes of all tensor inputs.) This API will likely also be used by unbacked SymInts. A big benefit of only allowing min/max constraints is they are easy to implement; you do not need to use Z3.
  • Unspecialize is adding too many float/int arguments. Horace noticed that forward graphs with dynamic shapes turned on have a lot of SymInt/SymFloat arguments. This is because we set specialize_int_float = False with dynamic shapes, but it turns out there are quite a few int/floats we SHOULD be specializing on. Horace is looking into this more.
  • SymInts/SymFloats in the backend compiler calling convention. In several conversations this week, we discussed whether or not Inductor should accept graphs that ONLY have Tensor inputs, or are int/floats valid inputs to the graphs. jansel and Natalia argued that it works well to turn int/floats into 0d cpu scalar tensors, and inductor supports this well. However, Horace pointed out that this does not work in general: if you have a backward graph for x.sum(), you need to do an x.expand(sym_int), where the SymInt isn’t necessarily derivable from any input to the backward graph. You can make it work, but the price is an FX graph that isn’t executable from Python anymore. So our current thinking is that we are doubling down on int/float inputs, unless jansel manages to change our minds.
  • Unbacked SymInts. Lots of progress: the stack at Get boolean masking to work with unbacked SymInts by ezyang · Pull Request #94523 · pytorch/pytorch · GitHub gets boolean masking working (and all but the last PR are passing CI). The general flavor of the work here is that you run into a lot of guards that fail on unbacked SymInts, but they all tend to be workaroundable one way or another. Edward next plans to tackle one of the internal models that needs this for mobile export, as well as getting range analysis going.
  • Model status on master. See also Symbolic shapes work items tracker - Google Sheets
    • aot_eager inference: -2 (+4 WoW). The CUDA 11.7 upgrade regression has been resolved. The remaining two errors are iadd related, which should be resolved by General in-place binary op support in dynamo by jbschlosser · Pull Request #94203 · pytorch/pytorch · GitHub
    • aot_eager training: 0 (unchanged). No regressions!
    • inductor inference: -8 (+2 WoW). The improvements are from hoisting from Natalia; there are also some new improvements from Natalia which should help with the remaining errors.
    • inductor training: still waiting on Horace to land his patch
  • Opinfo tests on symbolic shapes.
    • pytest test/test_proxy_tensor.py -k test_make_fx_symbolic_exhaustive - 553 passed (+3 WoW), 523 skipped (+1 WoW), 196 xfailed (+4 WoW). The increase in xfails are from some new RNG opinfos contributed by min-jean-cho.
    • pytest test/functorch/test_aotdispatch.py -k test_aot_autograd_symbolic_exhaustive - 305 passed (+1 WoW), 146 skipped (+2 WoW), 185 xfailed (no change)
  • Graph breaks on master. -2 (+17 WoW). We made a lot of progress! timm_efficientdet and XLNetLMHeadModel are the last stragglers.
  • Tracing cost of enabling dynamic shapes (NEW!) Mean: 28s (-5s WoW), Max: 660s (-145s WoW; swin_base_patch4_window7_224). TODO: explain

What’s made it to master since last time?

ezyang

voz

jbschlosser

nkaretnikov

ngimel

bdhirsh:

What’s coming next?

  • ezyang: Unbacked SymInt range analysis, and then probably moving into the calling convention area
  • Chillee: landing inductor training, ??? not really sure ???
  • bdhirsh: landing inference functionalization, into per-dispatch key mode stacks and then torchquant/DTensor PT2 support
  • jbschlosser: helping with the core/export “fake sprint” https://docs.google.com/document/d/1NT2eCCT13QRqet_iCZzPXcjkBOK9qiuQKbaQsd8o4Tk/edit#heading=h.uh5wfznvwt0c ; straggler failures that aren’t captured in CI (moco, hf_BigBird)
  • voz: graph breaks, into dynamic to static guard rejection. Some minor rambling around with Z3.

Our north star: Dynamic shapes at feature parity with static shapes. Actively in discussions about getting back to “turned on by default, for some definition of default” for PT 2.1.

State of symbolic shapes: Feb 19 edition

Previous update: State of symbolic shapes branch - #42 by ezyang

Executive summary

The branch cut for PyTorch 2.0 has come and gone. Training support was not landed to master in time for the branch cut, so it is unlikely to be working in the official PT2 release (on the bright side: inference functionalization made it for the branch cut! We think this fixed some inductor inference accuracy problems.) Stick to testing dynamic shapes on inference, sorry folks. A lot of the work this week was focused on inference export scenarios: there was quite a bit of progress on unbacked SymInts, the beginnings of a fine-grained dynamic shape API are in master and we also covered the 0/1 specialization problem in composability sync.

  • Unbacked SymInt progress. We made a lot of progress for unbacked SymInts support; most notably, value range analysis for unbacked SymInts has made it to trunk, so you can now use constrain_range on an unbacked SymInt to discharge guards (e.g., by marking an unbacked SymInt as >= 0). The entirety of CrystalDPR has successfully been traced end-to-end Redirecting... (Meta-only), and I’m currently in the process of landing the necessary changes / cleaning up hacks. The biggest new discovered item we have to handle is when user code tests if a mask has any True elements in it, and if so, performs a boolean mask (otherwise, it lets all the elements through); to faithfully export this, we must use torch.cond, and we also must be able to join divergent shape variables together (into an unbacked SymInt). We also identified that guard free implementations of PyTorch must be allowed to change stride behavior sometimes; we will justify this under stride agnostic PyTorch (which says changes in strides are not allowed to affect program semantics.)
  • Fine-grained dynamic shapes. Voz has landed an initial version of fine-grained dynamic shape control in https://github.com/pytorch/pytorch/pull/94787 . This is a compromise API, where you still have to mark dynamic=True and assume_static_by_default=False, pending some refactors to bypass this. The PR as landed still uses backed SymInts, and only tests if a dynamic dimension is overconstrained at the end of trace time; switching it to use unbacked SymInts for export and better diagnostics for why guards occurred is upcoming work.
  • 0/1 specialization. Export is moving to using unbacked SymInts for exporting dynamic dimensions, to ensure that the resulting programs are not 0/1 specialized. This is the outcome of discussing the following doc The 0/1 specialization problem in Pt2 Export - Google Docs in composability sync this week. If you’re curious what it all means, there’s a podcast giving basic background about this problem here: https://pytorch-dev-podcast.simplecast.com/episodes/zero-one-specialization
  • Model status on master.
    • aot_eager inference: -1 (+1 WoW). The last holdout is vision_maskrcnn, which is due to an FX printing problem involving constants and inplace addition.
    • aot_eager training: 0 (unchanged). No regressions!
    • inductor inference: -5 (+3 WoW). We no longer have accuracy failures with dynamic shapes (we never root caused this, but I bet it’s inference functionalization related); pytorch_unet was fixed by more improvements from Natalia
    • inductor training: still waiting on Horace to land his patch
  • OpInfo tests on symbolic shapes.
    • 557 passed (+4 WoW), 524 skipped (+1 WoW), 197 xfailed (+1 WoW). New OpInfo for _upsample_bilinear2d_aa among others.
    • 306 passed (+1 WoW), 148 skipped (+2 WoW), 185 xfailed (no change)

What’s made it to master since last time?

ezyang

voz

nkaretnikov

ngimel

bdhirsh:

What’s coming next?

  • ezyang: landing the rest of the CrystalDPR enablement fixes, presenting about how unbacked SymInt enablement works, then probably more on calling convention (but I also might want to work on some real model enablement for a bit? Depends on how much time I have)
  • Chillee: still landing inductor training
  • bdhirsh: per-dispatch key mode stacks and then torchquant/DTensor PT2 support
  • jbschlosser: not sure
  • voz: not sure

State of symbolic shapes: Feb 26 edition

Previous update: State of symbolic shapes branch - #43 by ezyang

Executive summary

Export summit was this week. nonzero export support has hit master; give it a shot with capture_dynamic_output_shape_ops. With this, the dynamic shapes work stream is deprioritizing export use cases for a bit. Training support has still not landed; this week we blame export summit for reducing coding time.
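
A hedged sketch of how to try it out (flag names as described in this update; Inductor does not support nonzero, so a non-Inductor backend like "eager" is assumed here):

import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True
torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile(backend="eager", dynamic=True)
def f(x):
    return torch.nonzero(x)  # data-dependent output size -> unbacked SymInt

print(f(torch.tensor([0.0, 1.0, 2.0])))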

  • Worse is better unbacked SymInt. Last week, we showed off a PR stack with all the tricks for avoiding guards in core PyTorch code. Well, people didn’t like having to change their code to carefully avoid guarding on 0/1. So we came up with a new plan: Steps forward for 0/1 specialization (worse is better) - Google Docs that involves doing a lot less work, as we will just unsoundly assume that unbacked SymInts cannot be 0/1 (the observation being that: if code works for any value of N>1, it usually works for N=1 too.) The rest of the non-controversial unbacked SymInt changes have landed to master (in particular, value ranges are in! Lots of opportunity here!), and @ezyang is moving back to non-export related dynamic shapes work. To experiment with item()/nonzero() support, you must set config flags capture_scalar_outputs and capture_dynamic_output_shape_ops to True. Inductor does NOT support nonzero, so this is mostly only useful for export.
  • Delta benchmarking. You can now easily run a sweep for static/dynamic and compare their frame time / graph breaks with Utility for running delta comparisons between two flag configs. Feel free to add any more comparisons to the script as necessary.
  • State of real world model enablement. I’d like to keep this section around from week-to-week. Not exactly sure how to set up the numeric metrics yet; stay posted. We don’t have a stack ranking of how important any given model is, but there are not so many so far so I think it is safe to say they are all important.
  • Model status on master.
  • Opinfo tests on symbolic shapes.
    • pytest test/test_proxy_tensor.py -k test_make_fx_symbolic_exhaustive - 562 passed (+5 WoW), 523 skipped (-1 WoW), 195 xfailed (-2 WoW)
    • pytest test/functorch/test_aotdispatch.py -k test_aot_autograd_symbolic_exhaustive - 318 passed (+12 WoW), 147 skipped (-1 WoW), 175 xfailed (-10 WoW)
  • Graph breaks on master. -3 (-1 WoW). The regression is hf_Longformer and AllenaiLongformerBase failing on master with non-function or method super: <slot wrapper '__setitem__' of 'collections.OrderedDict' objects>; @voz is investigating. To catch regressions, we’re going to add expected graph break counts to CI; @wconstab is going to do this.
  • Tracing cost of enabling dynamic shapes (aot_eager). Mean: 21s (-7s FoF), Max: 254s (-406s FoF). Two weeks ago I said I was going to explain what these stats are and forgot. Here’s the explanation I promised. What we would like to measure is how much more expensive tracing with symbolic shapes is, since if it is substantially more expensive in absolute terms our users won’t want to have it turned on by default. So what we do is run the aot_eager accuracy benchmark with and without dynamic shapes, and record how long we spend inside frame compilation. We then compare the difference and take the mean and max. Improvements from the last two weeks are likely from better constructor short circuiting (no longer decomposing empty.)

What’s made it to master since last time?

I’m retiring this section, but holler if you found it useful and I’ll start doing it again.

What’s coming next?

  • ezyang: no longer working on unbacked SymInts, back on model enablement. Will spend some time ramping on inductor (mood: no docs WRYYYY). Still planning to untangle the int-inductor calling convention problem
  • Chillee: still landing inductor training. Maybe will write another Inductor explainerdoc.
  • bdhirsh: aotautograd refactor for training export, as well as per-dispatch key mode stacks and then torchquant/DTensor PT2 support
  • jbschlosser: pivoting to sparse stuff, will be omitted from list here
  • voz: follow through on fine grained dynamic shapes, follow up on graph break regression, constraints

State of symbolic shapes: Mar 5 edition

Previous update: State of symbolic shapes branch - #43 by ezyang

Executive summary

The tl;dr:

  • Training support has landed in master, BUT you need to set torch._functorch.config.use_dynamic_shapes = True to use it (will be fixed soon)
  • specialize_int_float is renamed to specialize_int, and it now actually works. It is temporarily defaulted to True (but this will change soon)
  • Only int inputs will be allowed in inductor; floats must be passed as 0d tensors. More at Handling int/float inputs/intermediates in Inductor - Google Docs
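
Putting the tl;dr flags above into one snippet (names as given in this update; defaults were in flux at the time):

import torch
import torch._dynamo
import torch._functorch.config

torch._dynamo.config.dynamic_shapes = True
torch._dynamo.config.specialize_int = False        # formerly specialize_int_float
torch._functorch.config.use_dynamic_shapes = True  # needed for training support for now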

The details:

  • Training support has hit master… sort of. Horace’s patch to add training support passed CI and is landed, but it turns out our testing was insufficient and you need to manually turn on torch._functorch.config.use_dynamic_shapes = True for it to actually work. This is fixed in Fix training enablement in AOTAutograd by ezyang · Pull Request #95975 · pytorch/pytorch · GitHub which isn’t landed yet. A large number of models pass, but there are also many models which fail or have accuracy failures, we plan on burning these issues down shortly.
  • Handling int/float inputs/intermediates in Inductor. One of the major points of contention are how exactly non-Tensor ints/floats supposed to be passed to Inductor. Should they be passed as SymInt or 0d tensors? We’ve finally resolved the question at Handling int/float inputs/intermediates in Inductor - Google Docs The overall idea is that sizevar computation should be expressed as non-Tensor computation (or, more specifically, with sympy expressions), but everything else should just be Tensor compute, for ease of lowering. In practice, this means ints are passed as ints (as we must track their sympy expressions in case they are used in a sizevar compute), but we have made a policy decision that input floats can NEVER be used sizevar compute, and thus they can always be passed as 0d tensors.
  • Int unspecialization actually works now. Previously, there was a knob specialize_int_float, which, hypothetically, if set to False, allowed you to generate Dynamo graphs which didn’t specialize on int/float inputs. In practice, this knob didn’t actually do anything, as every integer between 0-17 was specialized anyway. This matters in real models; for example, overspecialization in torch._dynamo.exc.Unsupported: dynamic shapes: arange · Issue #93468 · pytorch/pytorch · GitHub was due to Dynamo deciding that a batch size of 10 was small enough to specialize on. Make int unspecialization actually work fixes that problem. However, in the process it uncovered a pile of bugs in Inductor regarding unspecialized ints. Right now, int unspecialization is not turned on by default but we intend to shift to it soon, allowing for regressions in CI.
  • We now allow implicit specialization via int conversion. Previously, if you ran int(symint), we would raise a warning, saying that this would cause a specialization, and if you really wanted it, you should explicitly guard_int. We have now relaxed this restriction: we will implicitly convert SymInts to ints and introduce guards as necessary. This switch is profitable because there are a number of APIs which cannot, even in principle, support dynamic shapes, and so allowing these implicit conversions makes these APIs work (as opposed to failing as they do today).
  • Guards depending on unbacked SymInts. @tugsbayasgalan took the new nonzero export support in master, and he noticed one major gap: in some cases, we would generate guards that depended on unbacked SymInts, which is a big no-no, because guards must be executed at the very beginning of a model, but an unbacked SymInt may only become known midway through execution. The fix for this Don't generate guards that refer to unbacked SymInts by ezyang · Pull Request #95732 · pytorch/pytorch · GitHub revealed that there are a number of guards with the odd property: (1) if you replace the backed shape variables (s0, s1, …) with their size hints, you can statically determine what the guard should evaluate to given the example inputs, but… (2) without this replacement, it’s not clear if the guard is true or not. For example, Ne(36*i0, 36) is trivially True when i0 > 1, but the real expression in context is Ne(i0*((s2//2 - 1)//8)**2 + 2*i0*((s2//2 - 1)//8) + i0, ((s2//2 - 1)//8)**2 + 2*((s2//2 - 1)//8) + 1) (which Tugsuu also thinks is True, but sympy can’t figure it out.) Another example is Eq(i0*s3**2, 9*i0), where this should result in a guard that s3 = 3 but sympy once again cannot figure it out. Our current hypothesis is that many of these shape variables are actually static at the end, but at the time the guard is issued we don’t know what they are; so either deferring the checks till later or encouraging users to assume_static_by_default = True should help. Tugsuu will validate this next week.
  • PSA: size_hint vs evaluate_expr. We found out this week that some internal teams are making use of ShapeEnv, and were misusing evaluate_expr. Contrary to what its name suggests, this not only evaluates an expression, but it ALSO installs a guard. If you want to peek at what the value of the expression is without guarding, you should use size_hint.
  • State of real world model enablement.
    • Mark Saroufim has tentatively volunteered to add LLaMa and InstructPix2Pix to torchbench, which will help us track whether they work or not with dynamic shapes.
    • OpenNMT’s arange minimal repo no longer overspecializes, but it fails in Inductor now with assert isinstance(numer, int) and isinstance( at
      torch/_inductor/utils.py:83. This failure also affects fastNLP_Bert, speech_transformer and yolov3 inductor inference
    • No updates: MaskRCNN, Detectron2, wav2vec2, fused boolean mask update

The numbers:

  • Model status on master.
    • aot_eager inference: 0 (+1 WoW), Joel’s PR made it in.
    • aot_eager training: 0 (unchanged). No regressions!
    • inductor inference: -4 (+1 WoW). swin fix made it in.
    • inductor inference unspec: -10 (NEW!). This number is tracking inductor inference with specialize_int = False now that unspecialization actually does something. We plan to subsume the old inductor inference number with this one, as unspecialization is important for avoiding overspecialization in graph breaks in practice. We should probably also switch the aot_eager stats to this as well.
    • inductor training: -42ish (NEW!). We don’t have official CI numbers, also an important bug fix hasn’t made it yet (Fix training enablement in AOTAutograd by ezyang · Pull Request #95975 · pytorch/pytorch · GitHub), the number here is based off of a sweep with this PR patched in.
  • Opinfo tests on symbolic shapes.
    • pytest test/test_proxy_tensor.py -k test_make_fx_symbolic_exhaustive - 562 passed (unchanged), 523 skipped (unchanged), 195 xfailed (unchanged)
    • pytest test/functorch/test_aotdispatch.py -k test_aot_autograd_symbolic_exhaustive - 320 passed (+2 WoW), 147 skipped (unchanged), 173 xfailed (-2 WoW)
  • Graph breaks on master. 0ish (+3 WoW). A sweep on 2/23 revealed extra graph breaks on hf_Longformer and AllenaiLongformerBase but Voz manually confirmed that the static model also graph breaks. @wconstab added ability for CI to record graph breaks Add dynamo graph break stats to CI by wconstab · Pull Request #95635 · pytorch/pytorch · GitHub so hopefully we can just start testing the number of graph breaks in CI and ensure we avoid regressions this way.
  • Tracing cost of enabling dynamic shapes (aot_eager). Mean: 20s (-1s WoW), Max: 240s (-14s WoW). This looks within noise.

Known problems

  • Inductor guards are silently discarded; this could cause accuracy problems
  • CI is not testing if we actually successfully generate dynamic code, so we could silently regress this (right now, we validate the generated code by manual inspection)
  • We are not testing performance of dynamic shapes code; it could be heinously slow (fixing this is blocked on GCP dashboard runs)
  • Minifier does not work with dynamic shapes
  • Split reductions in Inductor do not work with dynamic shapes
  • Python profiling gives misleading results for what is causing slowdowns

What’s coming next?

  • ezyang: probably a mix of unspec int / training burndown, and discovering more issues from our E2E models. The assert isinstance(numer, int) and isinstance( is a particular blocker for OpenNMT.
  • Chillee: returning to dynamic shapes, probably something symbolic reasoning related, also some training burndown
  • msaroufim: I asked Mark to add LLaMa and InstructPix2Pix to torchbench, we’ll see if he gets to it or not
  • bdhirsh: still aotautograd refactor for training export, as well as per-dispatch key mode stacks and then torchquant/DTensor PT2 support (was working on reference cycle issue this week)
  • voz: inline shape refinement with torch.constrain, make dynamo.export not run the real model, use original arg names on export Use original arg names if possible by voznesenskym · Pull Request #95898 · pytorch/pytorch · GitHub