Mini-update: Status of unbacked SymInts on Jan 16
I recently presented our progress on unbacked SymInts, our strategy for data-dependent output sizes, at the most recent composability meeting (meeting notes: Composability meeting notes - Google Docs; a Meta-only recording is available internally). This status post recaps what I described in the meeting, and also explains what you should expect for unbacked SymInts in the near future.
tl;dr I (@ezyang) will be deprioritizing proper support for unbacked SymInts, because it looks like there are fundamental infrastructure improvements in Sympy reasoning and tracing performance we need to work on first. Also, unbacked SymInts are not launch blocking for dynamic shapes in PT2. Fortunately, we have identified a few short-term unblocking steps that can help immediate users of unbacked SymInts make progress, albeit at the cost of some slight unsoundness (which we don’t expect to matter in practice).
Background
In PyTorch’s tracing model, we ordinarily try to treat the shapes of input tensors symbolically. However, if we need to perform control flow on an expression involving one of these symbolic sizes, we peek at the true values to resolve the condition to true/false, and install a guard saying that our trace is only valid if the condition evaluates equivalently in the future. This is motivated by the fact that in a tracing framework, we cannot easily trace both sides of the conditional (you could use something like thermometer continuations to run the trace as if the condition evaluated true, and then rewind and rerun the trace as if the condition evaluated false, but you still have the problem of a combinatorial explosion of possible paths you could go down).
Guarding works well for statically known sizes, but if you call an operator like torch.nonzero, it will produce a size that is only known at runtime. Our idea for how to handle this case is simple: we produce a symbolic size that has no underlying value (aka is “unbacked”), and instead error if we attempt to guard on this symbolic size.
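To make the backed/unbacked distinction concrete, here is a toy Python model of the idea (the class names and `ShapeEnv` structure here are illustrative stand-ins, not the real PyTorch internals): a backed symbol resolves guards against its hint and records them, while an unbacked symbol has no hint and must error.

```python
class GuardOnUnbackedError(RuntimeError):
    pass

class ToyShapeEnv:
    def __init__(self):
        self.guards = []  # conditions the trace has been specialized on

class ToySymInt:
    def __init__(self, env, name, hint=None):
        # hint=None means "unbacked": no concrete value to peek at
        self.env, self.name, self.hint = env, name, hint

    def __gt__(self, other):
        hint = None if self.hint is None else self.hint > other
        return ToySymBool(self.env, f"{self.name} > {other}", hint)

class ToySymBool:
    def __init__(self, env, expr, hint):
        self.env, self.expr, self.hint = env, expr, hint

    def __bool__(self):
        if self.hint is None:
            # Unbacked: guarding is an error, not a specialization.
            raise GuardOnUnbackedError(f"tried to guard on {self.expr}")
        # Backed: resolve using the hint and install a guard.
        self.env.guards.append((self.expr, self.hint))
        return self.hint

env = ToyShapeEnv()
s0 = ToySymInt(env, "s0", hint=5)    # backed: size of an input tensor
u0 = ToySymInt(env, "u0")            # unbacked: e.g. a nonzero() output size

assert bool(s0 > 3) and env.guards == [("s0 > 3", True)]
try:
    bool(u0 > 3)
except GuardOnUnbackedError:
    print("guard on unbacked symbol errored, as expected")
```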
Some further reading that you may find helpful: subclass_zoo/dynamic_strides.ipynb at main · albanD/subclass_zoo · GitHub (about dynamic shapes and strides) and subclass_zoo/dynamic_shapes.ipynb at main · albanD/subclass_zoo · GitHub (about dynamic shapes in general)
Current state
Here is the state of unbacked SymInts on master: tracing errors whenever framework code attempts to guard on an unbacked SymInt.
Where exactly are these guards coming from? Here is a mostly complete accounting:
case MemoryFormat::Contiguous: {
  // dim_ is a virtual call, don't repeat it
  const auto dim_ = dim();
  extra_meta_->strides_.resize(dim_);
  if (dim_ > 0) {
    const auto last_idx = dim_ - 1;
    extra_meta_->strides_[last_idx] = c10::SymInt(1);
    for (auto i = last_idx - 1; i >= 0; --i) {
      extra_meta_->strides_[i] =
          extra_meta_->strides_[i + 1] * extra_meta_->sizes_[i + 1].max(1);
    }
  }
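In Python, that stride computation can be sketched as follows (a simplified transcription; `contiguous_strides` is a name I made up for illustration). Note the `max(..., 1)` clamp, which is where a guard comes from when sizes are symbolic:

```python
def contiguous_strides(sizes):
    # Innermost stride is 1; each earlier stride is the next stride
    # times max(size, 1).  The max(..., 1) is the guarding comparison.
    dim = len(sizes)
    strides = [0] * dim
    if dim > 0:
        strides[-1] = 1
        for i in range(dim - 2, -1, -1):
            strides[i] = strides[i + 1] * max(sizes[i + 1], 1)
    return strides

assert contiguous_strides([2, 3, 4]) == [12, 4, 1]
assert contiguous_strides([2, 0, 4]) == [4, 4, 1]  # zero-size dim clamped to 1
```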
- When we construct a TensorImpl, we have to compute whether or not it is contiguous. It turns out that we don’t short-circuit this computation when torch.empty (whose output is guaranteed to be contiguous) is called. This computation branches on size and stride:
T z = 1;  // expected stride for a contiguous layout, accumulated right-to-left
for (int64_t d = int64_t(sizes.size()) - 1; d >= 0; d--) {
  const auto& size_d = sizes[d];
  if (size_d != 1) {
    if (strides[d] == z) {
      z *= size_d;
    } else {
      is_contiguous = false;
      break;
    }
  }
}
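A Python transcription of the contiguity check may make the branching clearer (the function name `is_contiguous` here is just for illustration; each size/stride comparison would become a guard under symbolic shapes):

```python
def is_contiguous(sizes, strides):
    z = 1  # expected stride, accumulated right-to-left
    for d in range(len(sizes) - 1, -1, -1):
        if sizes[d] != 1:          # size-1 dims may have arbitrary stride
            if strides[d] == z:
                z *= sizes[d]
            else:
                return False
    return True

assert is_contiguous([2, 3], [3, 1])
assert not is_contiguous([2, 3], [1, 2])      # transposed layout
assert is_contiguous([2, 1, 3], [3, 99, 1])   # size-1 stride ignored
```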
- Even if we were to short-circuit the contiguity computation, we would still need to compute whether or not the tensor is channels-last contiguous. This is because some contiguous tensors are ALSO channels-last contiguous, e.g., (1, 1, 1, 1). This computation branches on size and stride in a similar way:
T expected = 1;
for (auto& d : {1, 3, 2, 0}) {
  const auto& size_d = sizes[d];
  if (size_d != 1) {
    if (strides[d] != expected) {
      return bool_is_channels_last_contiguous(false);
    }
    expected *= size_d;
  }
}
return bool_is_channels_last_contiguous(true);
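A Python sketch of the same check (simplified to 4-D NCHW tensors; the function name is mine, not PyTorch's) shows why (1, 1, 1, 1) satisfies both layouts: every dimension has size 1, so every branch is skipped and the check trivially succeeds.

```python
def is_channels_last_contiguous(sizes, strides):
    # Channels-last for NCHW means strides increase in the order C, W, H, N.
    if len(sizes) != 4:
        return False
    expected = 1
    for d in (1, 3, 2, 0):  # C, W, H, N
        if sizes[d] != 1:
            if strides[d] != expected:
                return False
            expected *= sizes[d]
    return True

# All sizes are 1, so no branch fires: both contiguous and channels-last.
assert is_channels_last_contiguous([1, 1, 1, 1], [1, 1, 1, 1])
# A genuinely channels-last (2, 3, 4, 5) tensor: strides (60, 1, 15, 3).
assert is_channels_last_contiguous([2, 3, 4, 5], [60, 1, 15, 3])
```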
- We also compute whether or not a tensor is non-overlapping and dense. The computation here is quite involved: we have to do a sort on the strides (sorting network, anyone?). Fortunately, it can be short-circuited in the common case, since a tensor is definitely non-overlapping and dense if it is contiguous. However, it cannot always be short-circuited; for example, if you allocate a tensor with an unbacked SymInt size and then take a view on it, the view may not be contiguous, and so you have to do the full computation in that case. This is exactly what happens in the case of boolean indexing (in the meta implementation of index.Tensor). One extra problem is that the call to nonzero here is in library code, so if you want to add asserts on the result of index, you can’t easily do so.
if index.dtype in [torch.int8, torch.bool]:
    nonzero = index.nonzero()
    k = len(result)
    check(
        k + index.ndim <= self.ndim,
        lambda: f"too many indices for tensor of dimension {self.ndim}",
        IndexError,
    )
    for j in range(index.ndim):
        check(
            index.shape[j] == self.shape[k + j],
            lambda: f"The shape of the mask {index.shape} at index {i} "
            f"does not match the shape of the indexed tensor {self.shape} at index {k + j}",
            IndexError,
        )
        result.append(nonzero.select(1, j))
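The non-overlapping-and-dense computation described above can be sketched in Python as follows (a simplified version under my own naming; the real implementation handles more edge cases). Note that both the sort and every comparison would guard under symbolic shapes:

```python
def is_non_overlapping_and_dense(sizes, strides):
    # Sort dimensions by stride, then verify the strides form a dense
    # packing (every element has a unique address, no gaps).
    dims = sorted(range(len(sizes)), key=lambda d: strides[d])
    expected = 1
    for d in dims:
        if sizes[d] != 1:  # size-1 dims may have arbitrary stride
            if strides[d] != expected:
                return False
            expected *= sizes[d]
    return True

assert is_non_overlapping_and_dense([2, 3], [3, 1])       # contiguous
assert is_non_overlapping_and_dense([2, 3], [1, 2])       # transposed: still dense
assert not is_non_overlapping_and_dense([2, 3], [6, 2])   # gaps between elements
```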
To support “proper” unbacked SymInts, we must modify our framework code to avoid guarding in all of these cases. We also need a mechanism by which users can insert extra runtime assertions, so that when a guard is unavoidable (e.g., size >= 0) but will always resolve in one direction, we can manually guide tracing down one branch of the guard and emit a runtime test to verify that we would have gone down that path at runtime as well.
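The runtime-assert mechanism could look roughly like the following sketch (all names here, such as `expect_true` and `runtime_asserts`, are hypothetical, not an existing API): at trace time we assume the condition holds and record a check; at runtime we replay the recorded checks against the real values.

```python
runtime_asserts = []

def expect_true(expr, check):
    """Trace time: assume `expr` holds, and record `check` to run later."""
    runtime_asserts.append((expr, check))
    return True

# Tracing hits a guard like "u0 >= 0" on an unbacked symbol: instead of
# erroring, we guide tracing down the True branch and defer the check.
assert expect_true("u0 >= 0", lambda env: env["u0"] >= 0)

def run_asserts(env):
    # Runtime: verify that the branches taken at trace time were valid.
    for expr, check in runtime_asserts:
        if not check(env):
            raise AssertionError(f"runtime assert failed: {expr}")

run_asserts({"u0": 5})        # fine: the trace-time assumption held
try:
    run_asserts({"u0": -1})   # a wrong assumption is caught at runtime
except AssertionError as e:
    print(e)
```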
Progress towards unbacked SymInts
The diff stack at Switch is_contiguous to use guardless implementation by ezyang · Pull Request #92234 · pytorch/pytorch · GitHub was my attempt to directly remove all of these guards. The infrastructure pieces (first three PRs) were not too bad (and I intend to land them), however, they do not actually remove the guards. The actual guard removal has run into at least two problems:
- When I remove the stride computation guard (the max between size and 1), it makes test_aot_autograd_symbolic_exhaustive_nn_functional_max_pool2d_cpu_float32 take a long time to run. This suggests that one of the guards from stride computation was essential for simplifying our size variables, and prevented Sympy from doing a ton of unnecessary work.
- When I remove the compute contiguous / non-overlapping guards, in addition to causing problems with max_pool2d, it also makes test_aot_autograd_symbolic_exhaustive_nn_functional_unfold_cpu_float32 take a long time to run.
While I could believe that the ultimate fixes for these two problems could be quite short in the end, the way we thread the needle with our Sympy processing is quite opaque to me, and I am not confident I can discover the fixes in the time available. Combined with our deadline of getting dynamic shapes turned on by default for the PT2 release, as well as the fact that we need to make investments in speeding up Sympy simplification (e.g., Nuno’s suggestions), I think it makes sense for me to deprioritize making full unbacked SymInts work in the short term. If a brave soul wants to step up to try to fix these problems, I can help give advice.
The short term unblock
The Executorch folks are attempting to export models with symbolic shapes in them. If we don’t have true unbacked SymInts, what can you do? A lot, it turns out. Here are the workarounds I suggest trying, from least sophisticated (aka most unsound), to most sophisticated (but requiring more work).
- If you need to implement a meta function for an operator with dynamic output size, an extremely simple first step is to make the function return a fixed size (zero, or the maximum possible value, are good choices) instead of an unbacked SymInt. This is unsound for the following reasons: (1) the fixed size may get burned into the graph (this is more likely to happen with zero; the max size is likely itself a symbolic size, so the burn-in in that case is less severe), and (2) conditionals will be resolved according to this size, which means you may miss noticing a branch that actually needs to be replaced with a cond() operator to allow both branches to be exported. However, this is really good for “getting past” a missing output meta and seeing what needs to be done next.
- The next step up from (1) is to return an unbacked SymInt, but allow guards to be resolved against an example size in certain contexts. The basic idea is to return an unbacked SymInt, but then “mock up” size_hint so that when we try to guard on the unbacked SymInt, we make use of an example size instead. This prevents some burn-in (as we are still passing around a fresh, unbacked SymInt for tracing purposes); we only burn in conditional paths. A user could then use logging / manual inspection to see if any conditional paths need to be treated specially. You can also mock only in certain fragments of code (e.g., while executing a tensor factory) where the example value is known to generalize to all possible values. I can help implement this, though I’m not exactly sure when to do it.
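A toy illustration of workaround (2) (the class and attribute names below are made up for this sketch, not PyTorch internals): the symbol stays symbolic for tracing purposes, but guards fall back to the example value, and each burned-in branch is logged so a user can audit it.

```python
class MockedHintSymInt:
    def __init__(self, name, example):
        self.name = name              # stays symbolic in the traced graph
        self.example = example        # only consulted when a guard fires
        self.burned_in_branches = []  # log for manual inspection

    def __le__(self, other):
        # Guard: resolved against the example value rather than erroring,
        # and logged so the user can see which conditionals were burned in.
        result = self.example <= other
        self.burned_in_branches.append((f"{self.name} <= {other}", result))
        return result

u0 = MockedHintSymInt("u0", example=7)
if u0 <= 100:   # burned in: tracing follows the example value's branch
    pass
assert u0.burned_in_branches == [("u0 <= 100", True)]
```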
We can also, in parallel, implement the mechanism for adding runtime asserts, although without a better guard simplification framework (e.g., Nuno’s range analysis) it is difficult to use these asserts to actually resolve guards on unbacked SymInts.
The long term plan
To be honest, I have not committed to a particular course of action for how to solve these problems. I can think of two plausible trajectories.
The conservative route. We land my PR stack as is, and try to hotfix the particular simplification problems case-by-case. Hopefully, they aren’t too hard to resolve and there aren’t too many of them.
The blow-it-up route. We rewrite ShapeEnv’s symbolic reasoning logic from scratch, either without Sympy or using Sympy in a much more limited fashion, so that it is algorithmically obvious that we always get good performance. This would also help resolve performance issues in tracing caused by spending too much time in Sympy (according to @voz, in BERT we spend 30% of our time in Sympy simplification).
Open to comments from the peanut gallery. Honestly, this is going to depend a lot on who ends up doing this work.