Major updates to the post-AOT accuracy minifier!

The accuracy minifier lets you detect, extract, and minify accuracy problems that arise when you use torch.compile. The post-AOT accuracy minifier (the one that gets triggered when you run with TORCHDYNAMO_REPRO_AFTER="aot" TORCHDYNAMO_REPRO_LEVEL=4) has received some major improvements, which I want to tell you about.
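
As a minimal sketch of switching the post-AOT minifier on from a Python launcher: the two environment variable names and values are the ones quoted above, while the helper names and the train.py script are hypothetical and only there for illustration.

```python
import os
import sys


def minifier_env(base=None):
    """Return a copy of the environment with the post-AOT accuracy minifier enabled."""
    env = dict(os.environ if base is None else base)
    env["TORCHDYNAMO_REPRO_AFTER"] = "aot"
    env["TORCHDYNAMO_REPRO_LEVEL"] = "4"
    return env


def launch_command(script="train.py"):
    # "train.py" is a hypothetical script that calls torch.compile.
    return [sys.executable, script]


# Usage (not executed here):
#   import subprocess
#   subprocess.run(launch_command(), env=minifier_env(), check=False)
```

Setting the variables in the child's environment, rather than mutating os.environ in an already running process, ensures they are in place before the child starts importing anything.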

  • Real input tensors are now saved. Previously, we would unconditionally generate random data in the extracted repro, which often made the bug go away when you actually ran the repro. We now, by default, save tensors to disk in the checkpoints folder (look in storages/). Note: the repro scripts will still run even if this folder doesn’t exist!

  • We no longer pickle config. Instead, you get readable Python code, emitted only for the config settings that diverge from their defaults.
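
To illustrate the idea of emitting only the non-default settings as Python code, here is a small sketch. It is not the minifier's actual implementation: the function name, the "config" module name, and the option names are all made up for the example.

```python
def render_config_diff(defaults, current, module="config"):
    """Return Python assignment lines for every setting that differs from its default."""
    lines = []
    for key in sorted(current):
        if key not in defaults or current[key] != defaults[key]:
            # repr() keeps the emitted value valid Python for simple types.
            lines.append(f"{module}.{key} = {current[key]!r}")
    return "\n".join(lines)


# Hypothetical option names, for illustration only.
defaults = {"verbose": False, "cache_size": 64, "mode": "default"}
current = {"verbose": False, "cache_size": 128, "mode": "fast"}

print(render_config_diff(defaults, current))
# config.cache_size = 128
# config.mode = 'fast'
```

Compared with a pickle, output like this can be read, diffed, and edited by hand, and it does not bury the interesting settings under the ones that were never changed.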

  • You can analyze repros. For any repro script, you can now run it through Python with the analyze subcommand to get a readout.

    It does several things:

    1. It will run your model twice and check for nondeterminism in inductor/float64, even on intermediate values (our benchmarking nondeterminism test only checks for nondeterminism on the final output). This makes it easy to localize which operator is nondeterministic.
    2. It will run your compiled model side-by-side with eager and float64 variants, and then report when the results diverge too far, as measured by the RMSE delta from float64.

    Importantly, it does all this without requiring every intermediate to be held in memory (which would cause an OOM on large repros, such as the one I tested this on). Note that not every intermediate can be compared, because Inductor will fuse operations together and avoid materializing some intermediates altogether. In my experience, you will still get an intermediate every 10 or so ops.
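
The lockstep idea behind this can be sketched with plain Python generators. This is a toy illustration under my own assumptions, not the analyzer's real code: run_model stands in for a model that exposes its intermediates one at a time, and the op names are invented.

```python
import random


def run_model(x, noisy_op=None, rng=None):
    """Toy 'model' that yields (op_name, intermediate) pairs one at a time."""
    h = [v * 2.0 for v in x]
    yield "mul", h
    h = [v + 1.0 for v in h]
    if noisy_op == "add":
        # Simulate a nondeterministic operator (rng must be provided).
        h = [v + rng.random() * 1e-3 for v in h]
    yield "add", h
    h = [max(v, 0.0) for v in h]
    yield "relu", h


def first_divergence(run_a, run_b):
    """Step two runs in lockstep and return the first op whose outputs differ.

    Only one pair of intermediates is referenced by this loop at any time.
    """
    for (name_a, a), (name_b, b) in zip(run_a, run_b):
        assert name_a == name_b, "runs must produce the same sequence of ops"
        if a != b:
            return name_a
    return None


x = [1.0, -2.0, 3.0]
print(first_divergence(run_model(x), run_model(x)))  # None
rng = random.Random(0)
print(first_divergence(run_model(x, "add", rng), run_model(x, "add", rng)))  # add
```

Because the two runs advance together, the first mismatch points at the offending operator directly, even though every later intermediate would differ as well.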

  • Accuracy minification is more conservative by default. Even if you managed to extract a repro, the minifier would often go down a rabbit hole on some unrelated problem, giving you a useless minification in the end. To combat this, by default the accuracy minifier will only minify if it is able to reproduce an RMSE increase compared to an FP64 run of the model. It will refuse to minify if the FP64 computation fails, or if a bool/int tensor diverges. You can recover the old behavior by passing --strict-accuracy, but we think this default makes more sense.
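
A rough sketch of that default criterion, written over plain lists of numbers. The function names, the multiplier, and the tolerance are assumptions of mine for illustration; the real thresholds live in the PyTorch source and may differ.

```python
import math


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def should_minify(fp64_ref, eager, compiled, multiplier=1.1, tol=1e-8):
    """Decide whether a divergence counts as an accuracy failure worth minifying.

    fp64_ref is None when the float64 run failed. multiplier and tol are
    hypothetical knobs, not values taken from PyTorch.
    """
    if fp64_ref is None:
        # No trustworthy reference, so refuse to minify.
        return False
    if all(isinstance(v, (bool, int)) for v in fp64_ref):
        # bool/int outputs have no meaningful RMSE; do not chase them.
        return False
    eager_err = rmse(eager, fp64_ref)
    compiled_err = rmse(compiled, fp64_ref)
    # Only flag the case where compiled is clearly worse than eager.
    return compiled_err > multiplier * eager_err + tol


ref = [1.0, 2.0, 3.0]
print(should_minify(ref, [1.0, 2.0, 3.0001], [1.0, 2.0, 3.1]))     # True
print(should_minify(ref, [1.0, 2.0, 3.0001], [1.0, 2.0, 3.0001]))  # False
```

Measuring both eager and compiled against the float64 reference means ordinary low-precision noise, which eager also exhibits, does not send the minifier off after an unrelated problem.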

  • You can run the minifier in the same process. By default the minifier runs every query in a separate process to avoid corrupting state. By passing --no-isolate you can run everything in the same process. Even if the minifier crashes, you can likely restart minification from a checkpoint. Same-process minification is dramatically faster, as every subprocess has about a 10s startup latency.
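
The trade-off between the two modes can be sketched generically. This is not the minifier's code: run_isolated and run_in_process are hypothetical helpers that run a snippet of Python source either in a fresh interpreter or in the current one.

```python
import subprocess
import sys


def run_isolated(source, timeout=60):
    """Run a query in a fresh interpreter; a crash only kills the child."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode == 0


def run_in_process(source):
    """Run a query in the current interpreter; no startup cost, no isolation."""
    try:
        exec(source, {"__name__": "__query__"})
        return True
    except Exception:
        # Ordinary exceptions are survivable; a hard crash would not be.
        return False


print(run_isolated("import os; os._exit(3)"))        # False, parent survives
print(run_in_process("raise ValueError('boom')"))    # False
print(run_in_process("x = 1 + 1"))                   # True
```

The isolated variant pays interpreter startup on every query but survives hard crashes, while the in-process variant is fast but dies with the query, which is why being able to restart from a checkpoint matters.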

  • Coming soon: offload-to-disk support for the minifier. If your repro is extremely large, the minifier is likely to OOM on it. In pytorch/pytorch pull request #100546, "Add --offload-to-disk support to minifier" by ezyang, we add --offload-to-disk, which lets you offload some of the minifier's working memory to disk, reducing its overall memory usage. This fix was used to successfully minify the bug behind pytorch/pytorch pull request #100332, "[inductor] Prevent reusing aliased buffers if aliases still have uses" by ngimel.

Finally, some words about using the accuracy minifier. I think it is still early days for this codepath, and there are still plenty of ways for it to not work out of the box. In general, it is difficult for the minifier to work every time, because by definition every bug is different. However, if you are willing to roll up your sleeves, I have found that working on the minifier in tandem with diagnosing a bug is pretty fruitful. You can fix a bug in the minifier, and then rerun your extracted repro to try minifying again. To this end, our minifier tests (in test/inductor/) now run substantially faster, so it should be a lot easier for you to make one-off fixes and send them in!