TorchInductor Update 7: key optimizations with CPU backend in PyTorch 2.2 release

In our previous posts, TorchInductor Update 4, TorchInductor Update 5 and TorchInductor Update 6, we reported on progress and gave a technical deep dive into the optimization work for the Inductor C++/OpenMP backend. For the PyTorch 2.2 release, our focus is on enhancing PyTorch 2 Export Quantization with the Inductor CPU backend, adding quantization-aware training and C++ Wrapper support for the int8 data type.

Quantization-Aware Training

In the PyTorch 2.1 release, leveraging the PyTorch 2 Export Quantization flow, we introduced the X86InductorQuantizer specifically for applying post-training static quantization recipes tailored to the Inductor CPU backend. Building on this foundation, the PyTorch 2.2 release incorporates support for quantization-aware training with the same X86InductorQuantizer.
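The flow described above can be sketched roughly as follows. This is a minimal sketch, assuming the PyTorch 2.2-era prototype entry points (`capture_pre_autograd_graph`, `prepare_qat_pt2e`, `convert_pt2e`); these may differ in other releases. The function name `quantize_with_qat` and the `train_fn` callback are hypothetical names used only for illustration.

```python
def quantize_with_qat(model, example_inputs, train_fn):
    """Sketch of the PT2 Export QAT flow targeting the Inductor CPU backend.

    `train_fn` is a hypothetical user-supplied callable that fine-tunes the
    prepared model in place.
    """
    # Imports are kept local so the sketch can be read without triggering
    # them; the module paths are the PyTorch 2.2-era prototype locations.
    import torch
    from torch._export import capture_pre_autograd_graph
    from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_qat_pt2e
    import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq

    # 1. Capture the eager model as an ATen-level graph.
    exported = capture_pre_autograd_graph(model, example_inputs)

    # 2. Annotate the graph with the x86 Inductor QAT recipe and insert
    #    fake-quantize modules.
    quantizer = xiq.X86InductorQuantizer()
    quantizer.set_global(
        xiq.get_default_x86_inductor_quantization_config(is_qat=True)
    )
    prepared = prepare_qat_pt2e(exported, quantizer)

    # 3. Fine-tune with fake quantization in the loop (user-supplied).
    train_fn(prepared)

    # 4. Convert to a graph with quantize/dequantize ops and switch to eval.
    converted = convert_pt2e(prepared)
    torch.ao.quantization.move_exported_model_to_eval(converted)

    # 5. Lower into Inductor; compilation happens on the first call.
    return torch.compile(converted)
```

The returned module would typically be called under `torch.no_grad()` for inference.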

The backend optimization for quantization-aware training in Inductor closely mirrors that of post-training static quantization. We identify quantization patterns, prepack weights, and apply post-op fusion using the oneDNN library for Conv/GEMM operations. For non-Conv/GEMM element-wise and reduction operations, we improve performance by enabling explicit vectorization with uint8 data types in the C++ backend code generation.
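For readers less familiar with what a quantize/dequantize pair computes, here is a minimal sketch of textbook affine uint8 quantization in plain Python. It is not the code Inductor generates; the function names and the `scale` and `zero_point` values are assumptions chosen for illustration.

```python
def quantize_uint8(x, scale, zero_point):
    """Map a float to a uint8 code: round, shift by zero_point, clamp."""
    q = round(x / scale) + zero_point
    return max(0, min(255, q))


def dequantize_uint8(q, scale, zero_point):
    """Map a uint8 code back to an approximate float value."""
    return (q - zero_point) * scale


def fake_quantize(x, scale, zero_point):
    """Quantize then dequantize, so the value carries quantization error.

    Quantization-aware training simulates this round trip during training.
    """
    return dequantize_uint8(quantize_uint8(x, scale, zero_point), scale, zero_point)
```

With `scale=0.25` and `zero_point=128`, values inside the representable range snap to the nearest multiple of 0.25, while values outside it saturate at the ends of the uint8 range.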

Testing on all CNN models from the TorchBench test suite shows the benefit of this approach compared with the Inductor FP32 inference path (see the table below). Note that two models, mobilenet_v3_large and densenet121, show a relative accuracy loss exceeding 10% compared with FP32 inference. This is tracked in pytorch/pytorch issue #114859, "PT2 QAT flow fails to get reasonable accuracy with mobilenet_v3_large".


+----------+-----------------+--------------------------------+
| Compiler | Geomean Speedup | Geomean Relative Accuracy Loss |
+----------+-----------------+--------------------------------+
| inductor | 3.19x, 12/12    | 1.01%, 10/12                   |
+----------+-----------------+--------------------------------+

The data above were measured on an AWS c6i.16xlarge instance (ICX).
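As a reminder of how such summary figures are usually derived, the sketch below computes a geometric-mean speedup and a per-model relative accuracy loss. The post does not spell out the benchmark harness's exact formulas, so the standard definitions are assumed here, and the function names are hypothetical.

```python
import math


def geomean(values):
    """Geometric mean of positive values, e.g. per-model speedups."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def relative_accuracy_loss(fp32_acc, int8_acc):
    """Accuracy drop of the int8 model as a fraction of FP32 accuracy."""
    return (fp32_acc - int8_acc) / fp32_acc
```

For example, two models with speedups of 2x and 8x have a geomean speedup of 4x, and a drop from 80.0 to 76.0 in accuracy is a 5% relative loss.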

C++ Wrapper with int8 data type

In PyTorch 2.1, the C++ Wrapper was introduced in TorchInductor as a new prototype feature. It uses PyTorch C++ APIs to generate pure C++ code that combines the kernels Inductor generates with external kernels. The key advantage is that each captured Dynamo graph executes in pure C++, which minimizes Python overhead within the graph. The PyTorch 2.2 release extends the C++ Wrapper with support for the int8 data type.

As with the other data types, users enable this feature by setting the following config option:


import torch._inductor.config as config
config.cpp_wrapper = True

Using the C++ Wrapper with the int8 data type in PyTorch 2.2 speeds up models by removing the Python overhead of the default Inductor wrapper. Testing on all CNN models from TorchBench shows a geomean speedup of about 1.02x over the default Python wrapper, with no loss of accuracy.


+----------+-----------------+--------------------------------+
| Compiler | Geomean Speedup | Geomean Relative Accuracy Loss |
+----------+-----------------+--------------------------------+
| inductor | 1.025x, 12/12   | NaN, 12/12                     |
+----------+-----------------+--------------------------------+

The data above were measured on an AWS c6i.16xlarge instance (ICX).
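To show where the `cpp_wrapper` flag sits relative to compilation, here is a minimal sketch that sets it and then calls `torch.compile`. The helper name `compile_with_cpp_wrapper` is hypothetical, and the sketch assumes the flag is set before the compiled module is first called.

```python
def compile_with_cpp_wrapper(model):
    """Return `model` compiled by Inductor with the C++ wrapper enabled."""
    # Local imports keep the sketch importable without side effects.
    import torch
    import torch._inductor.config as config

    # The flag is process-wide; it affects graphs compiled after this point.
    config.cpp_wrapper = True

    # Compilation itself is deferred until the first call of the result.
    return torch.compile(model)
```

For an int8 model, `model` would be the converted module produced by the PT2 Export Quantization flow, and the returned module would typically be called under `torch.no_grad()`.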

Summary

This blog post from the Intel PyTorch team gives an update on the latest features and performance optimizations in the Inductor C++/OpenMP backend for the PyTorch 2.2 release. The focus is on quantization-aware training and the extension of the C++ Wrapper to the int8 data type. Looking ahead, the team's roadmap includes maturing the C++ Wrapper so it can be turned on by default, supporting AOTInductor, adding float16 data type support, and introducing Post-Training Dynamic Quantization support.

Many thanks to @jansel, @desertfire, @eellison, @jerryzh168 and @Chillee for their invaluable contributions and unwavering support during the development.
