Initial perf results from compiling NestedTensor Python Subclass via desugaring

– with @ezyang, @bdhirsh, @jbschlosser, @cpuhrsch

Recently, NestedTensor became compilable (stay tuned for a separate post about this and our plans here). To get a sense of what the numbers should look like, I did some measuring:

Intro

NestedTensor allows one to handle ragged data efficiently in native PyTorch. Historically, NestedTensor (like most subclasses) has been written in C++ to ensure fast execution. More recently, however, writing tensor subclasses in Python has become more appealing as the PT2 stack gains more capabilities to potentially remove the associated Python overhead.

In particular, the approach featured here today is possible thanks to @bdhirsh’s work on subclass compilation via desugaring, i.e. converting the subclass into its underlying dense tensors before it’s processed by the compiler backend. (Note that this is not the only way we plan to support compilation for NestedTensor; see Future work.)
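To make “desugaring” concrete, here is a deliberately tiny, hypothetical wrapper-subclass sketch. It is not the real NestedTensor, and the exact traceable-subclass hook signatures have changed across PyTorch versions; the point is only that the wrapper holds dense inner tensors, and the flatten/unflatten hooks tell the PT2 stack how to decompose it into those dense tensors so the backend only ever sees graphs over them.

```python
import torch

# Toy, illustrative jagged wrapper (NOT the real NestedTensor implementation).
# Hook names/signatures are approximate and may differ between PyTorch versions.
class ToyJagged(torch.Tensor):
    @staticmethod
    def __new__(cls, values, offsets):
        # Wrapper subclass: no storage of its own, just references to dense tensors.
        return torch.Tensor._make_wrapper_subclass(
            cls, values.shape, dtype=values.dtype, device=values.device
        )

    def __init__(self, values, offsets):
        self._values = values    # packed dense data, e.g. shape (sum(lengths), D)
        self._offsets = offsets  # ragged-dim boundaries, shape (B + 1,)

    def __tensor_flatten__(self):
        # Names of the dense inner tensors, plus optional extra context.
        return ["_values", "_offsets"], None

    @staticmethod
    def __tensor_unflatten__(inner_tensors, ctx, outer_size=None, outer_stride=None):
        return ToyJagged(inner_tensors["_values"], inner_tensors["_offsets"])

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Minimal eager rule: apply pointwise ops to the packed values and re-wrap.
        kwargs = kwargs or {}
        unwrapped = [a._values if isinstance(a, ToyJagged) else a for a in args]
        offsets = next(a._offsets for a in args if isinstance(a, ToyJagged))
        return ToyJagged(func(*unwrapped, **kwargs), offsets)
```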

In this post we compare the performance of (1) the original NestedTensor in C++, (2) a new NestedTensor written as a Python subclass, and (3) the new Python subclass compiled with torch.compile via desugaring. Results were obtained on the PyTorch main branch with a single A100 GPU. We use variations of a dummy program consisting of sin, cos, and linear ops to demonstrate the effect of input size, model size (number of ops), operator fusion, etc. on runtime.
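As a rough illustration of the setup (not the exact benchmark script; the program composition, weight shape, and timing harness below are my own stand-ins), the measurements compare eager and torch.compile runs of a small sin/cos/linear function over a jagged-layout Python NestedTensor. The layout=torch.jagged constructor and compile support require a recent PyTorch build:

```python
import torch
import torch.nn.functional as F

device = "cuda"
D = 2 ** 14  # last-dim size, matching the tables below
weight = torch.randn(D, D, device=device)

def f(nt):
    # Stand-in for the "dummy program": pointwise ops plus a linear.
    return torch.cos(torch.sin(F.linear(nt, weight)))

# Jagged-layout Python NestedTensor with component lengths [20, 30, 40].
nt = torch.nested.nested_tensor(
    [torch.randn(L, D, device=device) for L in (20, 30, 40)],
    layout=torch.jagged,
)

compiled_f = torch.compile(f)

def time_fn(fn, *args, iters=10):
    # Simple CUDA event timing; warm up first so compilation isn't measured.
    fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per iteration

print("eager:   ", time_fn(f, nt), "ms")
print("compiled:", time_fn(compiled_f, nt), "ms")
```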

Summary

Overall, the results are directionally as expected, with the degree of eager-mode Python overhead being the biggest surprise.

  • Varying input size: Compiled Python NT outperforms eager cpp across all input sizes. In the overhead-bound regime this is due to reduced overhead; in the compute-bound regime it is due to operator fusion. The non-compiled Python subclass matches eager cpp at large input sizes, but underperforms on smaller inputs due to Python subclass overhead.
  • Varying program size (number of ops): Speedups from compilation over eager cpp are more pronounced for larger programs (more ops), since guard execution and other per-call overhead is amortized away (as long as the larger program doesn’t induce proportionally more guards).

Future work:

  • Overhead in eager mode should be investigated.
  • We plan to preserve the nested structure all the way to inductor so that we can rely on generic codegen rather than custom kernels. (@jbschlosser is working on this.)
  • There is also interest in exposing codegened inductor kernels to eager mode to ensure fast eager performance.

Results

How does the size of the input tensor affect things?

Example 1: A moderate-sized program (~60 ops in forward) with fusion opportunities

The compiled NestedTensor subclass is measured to be faster than eager cpp at small input sizes due to reduced overhead, and faster at large input sizes due to operator fusion speeding up bandwidth-bound operations. The Python subclass without compile matches cpp at large input sizes, since both are compute bound and launch the same kernels, but is significantly slower at small input sizes due to overhead.

Input size           | cpp w/o compile | python w/o compile | python w/ compile
[[20, 30, 40], 2^14] | 8.812ms**       | 37.774ms**         | 4.27ms**
[[20, 30, 40], 2^16] | 11.955ms        | 38.983ms**         | 10.559ms
[[20, 30, 40], 2^18] | 43.27ms         | 43.861ms           | 37.73ms
[[20, 30, 40], 2^20] | 168.7ms         | 167.665ms          | 145.7ms

Table 1: ** indicates overhead bound

Example 2: A slightly smaller program (~40 ops) with fewer fusion opportunities

When there are fewer fusion opportunities the gap between compile and eager cpp narrows at large input sizes.

The compiled NestedTensor subclass is measured to be faster than the C++ NT in the overhead-bound regime, though the gap is narrower than in Example 1 because there are fewer ops in forward/backward and hence less overhead to remove; see row 1 of Table 2 below. On the same row, we see that without torch.compile, the Python NT subclass is much slower than C++. All three columns have near-identical runtime in the compute-bound regime, as expected, since their implementations do not differ significantly (if at all) in terms of the underlying kernels launched.

Input size           | cpp w/o compile | python w/o compile | python w/ compile
[[20, 30, 40], 2^14] | 5.93ms**        | 28.9ms**           | 4.27ms**
[[20, 30, 40], 2^16] | 10.5ms          | 29.6ms**           | 10.5ms
[[20, 30, 40], 2^18] | 37.8ms          | 38.3ms             | 37.4ms
[[20, 30, 40], 2^20] | 148.8ms         | 147.6ms            | 145.2ms

Table 2: ** indicates overhead bound

How does the size of the program (number of ops) affect things?

Result: For smaller programs, the overhead of guard execution is more visible, so larger programs see larger speedup multiples from compilation.

The program used for measurement in this section is the same as in Example 1 of the previous section, except that we now vary the number of times we execute the unit block, and test both the overhead-bound and compute-bound regimes. Each execution of unit corresponds to 6 aten ops.
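The post does not show the exact contents of unit, so the following is only a hypothetical sketch of the structure of this experiment (the real unit dispatches 6 aten ops per execution; the body here is illustrative):

```python
import torch
import torch.nn.functional as F

def unit(x, weight):
    # Illustrative stand-in for the repeated block of sin/cos/linear ops.
    return torch.cos(torch.sin(F.linear(x, weight)))

def model(x, weight, N):
    # Program size scales with N: each iteration adds one unit's worth of ops.
    for _ in range(N):
        x = unit(x, weight)
    return x
```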

Example 1: Overhead-bound (smaller tensor): [[20, 30, 40], 2^14]

range(N) | cpp w/o compile | python w/o compile | python w/ compile
2        | 2.5ms           | 7.9ms              | 2.0ms
5        | 4.7ms           | 18.34ms            | 2.9ms
10       | 8.5ms           | 38ms               | 4.5ms
20       | 18ms            | 73ms               | 7ms

Table 3

Example 2: Compute-bound (larger tensor): [[20, 30, 40], 2^20]

Python w/o compile matches cpp w/o compile at all graph sizes as long as input size is large. Compile is faster due to operator fusion. (See appendix for some profiler traces showing this.)

range(N) | cpp w/o compile | python w/o compile | python w/ compile
2        | 34ms            | 34ms               | 30ms
5        | 84ms            | 85ms               | 75ms
10       | 169ms           | 170ms              | 148ms
20       | 334ms           | 334ms              | 329ms

Table 4

Future work

Reduce eager mode overhead

As the results above show, Python subclasses seem to carry a significant amount of eager-mode overhead. Eager performance of tensor subclasses also matters for users who cannot compile their model for various reasons.
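A hedged sketch of how such per-op overhead can be inspected (the two captions below refer to profiler traces of a single aten::sin call; the exact profiler settings used for the post are not stated, and layout=torch.jagged is the Python NT subclass while the default strided nested layout is the C++ NT):

```python
import torch
from torch.profiler import profile, ProfilerActivity

components = [torch.randn(L, 128, device="cuda") for L in (20, 30, 40)]
python_nt = torch.nested.nested_tensor(components, layout=torch.jagged)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.sin(python_nt)      # inspect CPU-side time around the aten::sin dispatch
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```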

aten::sin call for Python NT without compile

aten::sin call for C++ NT without compile

Compile support without desugaring (Inductor codegen)

With the current torch.compile desugaring solution, we still must rely on custom kernels to implement operations that reduce or broadcast over the ragged dimension, and these custom kernels can take significant effort to write. Preserving the nested tensor structure all the way to inductor and using inductor codegen would be a way to cleanly support these operations without custom kernels.
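As a hypothetical illustration of why such operations want ragged-aware kernels, consider summing over the ragged dimension of a packed (values, offsets) representation: a generic unbind-style fallback has to slice and reduce each component separately, one kernel launch per component, whereas a generated (or custom) kernel could do it in a single pass. The variable names and shapes below are made up for this sketch:

```python
import torch

D = 128
values = torch.randn(20 + 30 + 40, D)    # packed ragged data for components [20, 30, 40]
offsets = torch.tensor([0, 20, 50, 90])  # component boundaries along the ragged dim

# Unbind-style fallback: reduce each component over its ragged dimension.
per_component_sum = torch.stack([
    values[offsets[i]:offsets[i + 1]].sum(dim=0)
    for i in range(offsets.numel() - 1)
])  # shape (3, D); a fused ragged-aware kernel would do this in one launch
```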

Eager mode performance for inductor generated kernels

Following up on the previous item: if we rely on inductor-generated kernels during compile but fall back to an unbind-based implementation in eager, eager subclasses (C++ or Python) would see poor performance even in a compute-bound regime. Hence, it may be necessary to have some way of running those inductor-generated kernels in eager mode.

Appendix

Code to reproduce

Profiler traces: operator fusion speedup

Unfused backward for sin/cos

Fused backward for sin/cos

Profiler traces: compute bound vs overhead bound execution

Compute-bound ([[20, 30, 40], 2^18] case)

Python NT no compile (38.3ms)

C++ NT no compile (37.8ms)

Python NT compile (37.4ms)

Overhead-bound ([[20, 30, 40], 2^14] case)

Python NT no compile (29.0ms)

C++ NT no compile (5.93ms)

Python NT compile (4.27ms)
