TorchDynamo Update 7: Inference with FX2TRT

Overview

This note gives the current development status of our experimental project TorchDynamo with FX2TRT as a backend. TorchDynamo hooks into the frame evaluation API in CPython to dynamically modify Python bytecode before it is executed. It rewrites the bytecode by extracting sequences of PyTorch operations into an FX graph. FX2TRT is a tool that targets both usability and speed-of-light performance for model inference. It takes an FX graph and lowers it into a TensorRT graph, which takes advantage of TensorRT's optimizations for inference on GPU.

So far, FX2TRT can successfully lower 27 of the 48 models in TorchBench to TensorRT, with a geomean speedup of 2.69x (across those 27 models, with fp16 enabled) over eager mode inference on GPU. The speedup is especially large on vision models, which are compute-heavy and benefit from TensorRT's efficient use of the GPU's tensor cores.
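As a reminder of how a geomean speedup is computed, here is a minimal sketch. It assumes the per-model speedups are already available as a list of ratios (eager latency divided by TensorRT latency). The numbers in the example are made up for illustration and are not measurements from this note.

```python
import math


def geomean(speedups):
    """Geometric mean of a list of positive speedup ratios."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    # exp(mean(log(x))) avoids overflow from multiplying many ratios together.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Illustrative values only, not results from TorchBench.
example_speedups = [1.0, 2.0, 4.0]
print(geomean(example_speedups))  # approximately 2.0
```

The geometric mean is the usual choice for averaging ratios because a single very large speedup does not dominate the result the way it would in an arithmetic mean.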

TorchDynamo

If you are new to TorchDynamo, the links below will help you catch up on this new exploration. TorchDynamo generates FX graphs from Python bytecode, and various backends are integrated with it to run inference or training of the model. In the future, with the help of a cost model, TorchDynamo could automatically select the best backend for each subgraph to achieve optimal performance.

Why TorchDynamo + FX2TRT

FX2TRT was developed to bring the best performance of TensorRT to users. Combined with TorchDynamo, it has several merits:

  • It is based on the torch.fx graph, so users can take advantage of all the features FX offers, including easy inspection of the model architecture and operation replacement.
  • TensorRT alone has a steep learning curve and usability challenges. With FX2TRT, users do not need to know how TensorRT works, which saves them time and effort when exporting models to TensorRT.
  • TensorRT comes with various performance optimization tactics, including kernel fusion, kernel selection and graph optimizations, to get the best performance out of the GPU.
  • TorchDynamo generates the FX graph differently from the original FX symbolic tracing method. It can handle control flow and other aspects of Python, and therefore increases model coverage.

We use TorchBench as the test bed. TorchBench consists of 52 benchmarks, 4 of which do not qualify as inference models: they are either quantized models or unsupervised learning models. After removing them, our testing is based on 48 models. The test results come from an A100 machine with TensorRT 8.2.1.8.

  • At the time of the initial test (Update 3: GPU Inference Edition), only 9 models could run successfully on TorchDynamo + FX2TRT.
  • Since then, our model coverage has tripled: 27 models now run on TorchDynamo with FX2TRT as the backend.
  • To achieve this, we added or improved support for around 20 operations. Most of these changes have landed, with a few more landing shortly.

Challenges in FX2TRT

In TorchDynamo’s backend integration of FX2TRT, we implement a splitter that automatically splits the FX graph based on which operations are supported. An operation that FX2TRT does not support stays in eager mode; a supported operation is lowered to TRT. As a result, if all operations in an FX graph are supported, they are lowered to a single TRT graph. Otherwise, the graph is transformed into a mix of TRT and FX subgraphs. From a performance perspective, we want to avoid the latter case, because TensorRT pre- and post-processing overhead is significant: it is not negligible when execution switches frequently between eager mode subgraphs and TensorRT subgraphs.
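The splitting idea can be illustrated with a toy sketch. This is not the FX2TRT splitter: it assumes the graph has been flattened to a linear sequence of op names, and the supported-op set, the op names, and both helper functions are hypothetical. It only shows how a single unsupported op in the middle of a graph produces extra subgraphs and extra switches.

```python
from itertools import groupby

# Hypothetical set of op names that the backend is able to lower.
SUPPORTED_OPS = {"conv2d", "relu", "linear", "add"}


def split_by_support(ops, supported=SUPPORTED_OPS):
    """Group consecutive ops into ("trt" | "eager", [op, ...]) segments."""
    segments = []
    for is_supported, group in groupby(ops, key=lambda op: op in supported):
        segments.append(("trt" if is_supported else "eager", list(group)))
    return segments


def count_switches(segments):
    """Number of transitions between eager and TRT segments."""
    return max(len(segments) - 1, 0)


fully_supported = ["conv2d", "relu", "linear"]
partially_supported = ["conv2d", "relu", "custom_op", "linear", "add"]

print(split_by_support(fully_supported))      # one TRT segment, no switches
print(split_by_support(partially_supported))  # TRT, eager, TRT
print(count_switches(split_by_support(partially_supported)))
```

In this toy model, one unsupported op turns one TRT graph into three segments with two switches, each of which would pay the pre- and post-processing overhead described above.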

Results

We ran the experiment on an A100 machine with fp16 as the target precision in TRT, because the A100 has strong fp16 compute on its Tensor Cores: it can reach 312 TFLOPS with fp16 Tensor Cores, versus 19.5 TFLOPS in single precision (FP32). Because fp16 loses some precision, we measure output quality as the cosine similarity between the outputs of the TensorRT run and the eager run (1.0 is the best). We compare against the model run with fp16 in eager mode, not against the original fp32 result. It is worth noting that:

  • TorchDynamo can handle some models that FX tracing cannot. For example, Hugging Face models cannot be traced directly by FX and need a customized FX tracer. So far, 4 Hugging Face models are supported.
  • TensorRT provides very good performance for vision models from both torchvision and timm. Part of the benefit comes from the A100 card and TRT optimizations. Another reason is that the default batch size in TorchBench is relatively small, and eager mode is less efficient at small batch sizes.
  • The models highlighted in green have 100% op support, while those in blue have a few ops not supported in TRT.
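The cosine-similarity accuracy check described in this section can be sketched in plain Python. This is an illustration of the metric only, assuming each model output has been flattened to a list of floats; the function name and the sample vectors are hypothetical, and the actual benchmark operates on tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up outputs standing in for the eager fp16 and TensorRT fp16 runs.
eager_out = [0.10, 0.70, 0.20]
trt_out = [0.11, 0.69, 0.20]
print(cosine_similarity(eager_out, trt_out))  # close to, but below, 1.0
```

A score of 1.0 means the two outputs point in exactly the same direction. Note that cosine similarity ignores overall scale, so it would not detect an output that is uniformly scaled by a constant.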

Outstanding issues in failing models

  • Our lowering process still needs to trace the FX graph to decompose module functions, and there are some corner cases that the FX tracer cannot handle.
  • Even though unsupported subgraphs fall back to eager mode, performance is badly hurt by the subgraph switches.

Next steps

  • The next step is to continue working on the currently failing models to improve coverage. GitHub issues have been created, and contributions are welcome.

Testing

Coming soon. FX2TRT is being refactored and will be made public some time in May.


Update on open-sourcing FX2TRT:
fx2trt has now been merged into Torch-TensorRT. Please follow the instructions to compile the library and install it.
To test any TorchBench model, run the command below:
python torchbench.py -dcuda --speedup-fx2trt-fp16 --only model_name
