Initial perf results from compiling NestedTensor Python Subclass via desugaring

– with @ezyang, @bdhirsh, @jbschlosser, @cpuhrsch

Recently, NestedTensor became compilable (stay tuned for a separate post about this and our plans here). To get a sense of what the numbers should look like, I did some measuring:

Intro

NestedTensor allows one to handle ragged data efficiently in native PyTorch. Historically, NestedTensor (like most subclasses) has been written in C++ to ensure fast execution. More recently, however, writing tensor subclasses in Python has become more appealing as the PT2 stack gains more capabilities to potentially remove the associated Python overhead.

In particular, the approach featured here today is possible thanks to @bdhirsh’s work on subclass compilation via desugaring, i.e. converting the subclass into its underlying dense tensors before it’s processed by the compiler backend. (Note that this is not the only way we plan to support compilation for NestedTensor; see Future work.)
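To make “desugaring” concrete, here is a deliberately tiny, hypothetical wrapper-subclass sketch. It is not the real NestedTensor, and the exact traceable-subclass hook signatures have changed across PyTorch versions; the point is only that the wrapper holds dense inner tensors, and the flatten/unflatten hooks tell the PT2 stack how to decompose it into those dense tensors so the backend only ever sees graphs over them.

```python
import torch

# Toy, illustrative jagged wrapper (NOT the real NestedTensor implementation).
# Hook names/signatures are approximate and may differ between PyTorch versions.
class ToyJagged(torch.Tensor):
    @staticmethod
    def __new__(cls, values, offsets):
        # Wrapper subclass: no storage of its own, just references to dense tensors.
        return torch.Tensor._make_wrapper_subclass(
            cls, values.shape, dtype=values.dtype, device=values.device
        )

    def __init__(self, values, offsets):
        self._values = values    # packed dense data, e.g. shape (sum(lengths), D)
        self._offsets = offsets  # ragged-dim boundaries, shape (B + 1,)

    def __tensor_flatten__(self):
        # Names of the dense inner tensors, plus optional extra context.
        return ["_values", "_offsets"], None

    @staticmethod
    def __tensor_unflatten__(inner_tensors, ctx, outer_size=None, outer_stride=None):
        return ToyJagged(inner_tensors["_values"], inner_tensors["_offsets"])

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Minimal eager rule: apply pointwise ops to the packed values and re-wrap.
        kwargs = kwargs or {}
        unwrapped = [a._values if isinstance(a, ToyJagged) else a for a in args]
        offsets = next(a._offsets for a in args if isinstance(a, ToyJagged))
        return ToyJagged(func(*unwrapped, **kwargs), offsets)
```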

In this post we compare the performance of (1) the original NestedTensor in C++, (2) a new NestedTensor written as a Python subclass, and (3) the new Python subclass compiled with torch.compile via desugaring. Results were obtained on the PyTorch main branch with a single A100 GPU. We use variations of a dummy program consisting of sin, cos, and linear ops to demonstrate the effect of input size, model size (number of ops), operator fusion, etc. on runtime.
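As a rough illustration of the setup (not the exact benchmark script; the program composition, weight shape, and timing harness below are my own stand-ins), the measurements compare eager and torch.compile runs of a small sin/cos/linear function over a jagged-layout Python NestedTensor. The layout=torch.jagged constructor and compile support require a recent PyTorch build:

```python
import torch
import torch.nn.functional as F

device = "cuda"
D = 2 ** 14  # last-dim size, matching the tables below
weight = torch.randn(D, D, device=device)

def f(nt):
    # Stand-in for the "dummy program": pointwise ops plus a linear.
    return torch.cos(torch.sin(F.linear(nt, weight)))

# Jagged-layout Python NestedTensor with component lengths [20, 30, 40].
nt = torch.nested.nested_tensor(
    [torch.randn(L, D, device=device) for L in (20, 30, 40)],
    layout=torch.jagged,
)

compiled_f = torch.compile(f)

def time_fn(fn, *args, iters=10):
    # Simple CUDA event timing; warm up first so compilation isn't measured.
    fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per iteration

print("eager:   ", time_fn(f, nt), "ms")
print("compiled:", time_fn(compiled_f, nt), "ms")
```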

Summary

Overall, the results are directionally as expected, with the degree of eager-mode Python overhead being the biggest surprise.

  • Varying input size: Compiled Python NT outperforms eager cpp across all input sizes. In the overhead-bound regime this is due to reduced overhead; in the compute-bound regime it is due to operator fusion. The non-compiled Python subclass matches eager cpp at large input sizes, but underperforms on smaller inputs due to Python subclass overhead.
  • Varying program size (number of ops): Speedups from compilation over eager cpp are more pronounced for larger programs (more ops), since guard execution and other per-call overhead is amortized away (as long as the larger program doesn’t induce proportionally more guards).

Future work:

  • Overhead in eager mode should be investigated.
  • We plan to preserve the nested structure all the way to inductor so that we can rely on generic codegen rather than custom kernels. (@jbschlosser is working on this.)
  • There is also interest in exposing codegened inductor kernels to eager mode to ensure fast eager performance.

Results

How does the size of the input tensor affect things?

Example 1: A moderate-sized program (~60 ops in forward) with fusion opportunities

The compiled NestedTensor subclass is measured to be faster than eager cpp at small input sizes due to reduced overhead, and faster at large input sizes due to operator fusion speeding up bandwidth-bound operations. The Python subclass without compile matches cpp at large input sizes, since both are compute bound and launch the same kernels, but is significantly slower at small input sizes due to overhead.

Input size           | cpp w/o compile | python w/o compile | python w/ compile
[[20, 30, 40], 2^14] | 8.812ms**       | 37.774ms**         | 4.27ms**
[[20, 30, 40], 2^16] | 11.955ms        | 38.983ms**         | 10.559ms
[[20, 30, 40], 2^18] | 43.27ms         | 43.861ms           | 37.73ms
[[20, 30, 40], 2^20] | 168.7ms         | 167.665ms          | 145.7ms

Table 1: ** indicates overhead bound

Example 2: A slightly smaller program (~40 ops) with fewer fusion opportunities

When there are fewer fusion opportunities the gap between compile and eager cpp narrows at large input sizes.

The compiled NestedTensor subclass is measured to be faster than the C++ NT in the overhead-bound regime, though the gap is narrower than in Example 1 because there are fewer ops in forward/backward and hence less overhead to remove; see row 1 of Table 2 below. On the same row, we see that without torch.compile, the Python NT subclass is much slower than C++. All three columns have near-identical runtime in the compute-bound regime, as expected, since their implementations do not differ significantly (if at all) in terms of the underlying kernels launched.

Input size           | cpp w/o compile | python w/o compile | python w/ compile
[[20, 30, 40], 2^14] | 5.93ms**        | 28.9ms**           | 4.27ms**
[[20, 30, 40], 2^16] | 10.5ms          | 29.6ms**           | 10.5ms
[[20, 30, 40], 2^18] | 37.8ms          | 38.3ms             | 37.4ms
[[20, 30, 40], 2^20] | 148.8ms         | 147.6ms            | 145.2ms

Table 2: ** indicates overhead bound

How does the size of the program (number of ops) affect things?

Result: For smaller programs, the overhead of guard execution is more visible, so larger programs see larger speedup multiples from compilation.

The program used for measurement in this section is the same as in Example 1 of the previous section, except that we now vary the number of times we execute the unit block, and test both the overhead-bound and compute-bound regimes. Each execution of unit corresponds to 6 aten ops.
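The post does not show the exact contents of unit, so the following is only a hypothetical sketch of the structure of this experiment (the real unit dispatches 6 aten ops per execution; the body here is illustrative):

```python
import torch
import torch.nn.functional as F

def unit(x, weight):
    # Illustrative stand-in for the repeated block of sin/cos/linear ops.
    return torch.cos(torch.sin(F.linear(x, weight)))

def model(x, weight, N):
    # Program size scales with N: each iteration adds one unit's worth of ops.
    for _ in range(N):
        x = unit(x, weight)
    return x
```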

Example 1: Overhead-bound (smaller tensor): [[20, 30, 40], 2^14]

range(N) | cpp w/o compile | python w/o compile | python w/ compile
2        | 2.5ms           | 7.9ms              | 2.0ms
5        | 4.7ms           | 18.34ms            | 2.9ms
10       | 8.5ms           | 38ms               | 4.5ms
20       | 18ms            | 73ms               | 7ms

Table 3

Example 2: Compute-bound (larger tensor): [[20, 30, 40], 2^20]

Python w/o compile matches cpp w/o compile at all graph sizes as long as input size is large. Compile is faster due to operator fusion. (See appendix for some profiler traces showing this.)

range(N) | cpp w/o compile | python w/o compile | python w/ compile
2        | 34ms            | 34ms               | 30ms
5        | 84ms            | 85ms               | 75ms
10       | 169ms           | 170ms              | 148ms
20       | 334ms           | 334ms              | 329ms

Table 4

Future work

Reduce eager mode overhead

As the results above show, Python subclasses seem to carry a significant amount of eager-mode overhead. Eager performance of tensor subclasses also matters for users who cannot compile their model for various reasons.
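A hedged sketch of how such per-op overhead can be inspected (the two captions below refer to profiler traces of a single aten::sin call; the exact profiler settings used for the post are not stated, and layout=torch.jagged is the Python NT subclass while the default strided nested layout is the C++ NT):

```python
import torch
from torch.profiler import profile, ProfilerActivity

components = [torch.randn(L, 128, device="cuda") for L in (20, 30, 40)]
python_nt = torch.nested.nested_tensor(components, layout=torch.jagged)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.sin(python_nt)      # inspect CPU-side time around the aten::sin dispatch
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```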

aten::sin call for Python NT without compile

aten::sin call for C++ NT without compile

Compile support without desugaring (Inductor codegen)

With the current torch.compile desugaring solution, we still must rely on custom kernels to implement operations that reduce or broadcast over the ragged dimension, and these custom kernels can take significant effort to write. Preserving the nested tensor structure all the way to inductor and using inductor codegen would be a way to cleanly support these operations without custom kernels.
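As a hypothetical illustration of why such operations want ragged-aware kernels, consider summing over the ragged dimension of a packed (values, offsets) representation: a generic unbind-style fallback has to slice and reduce each component separately, one kernel launch per component, whereas a generated (or custom) kernel could do it in a single pass. The variable names and shapes below are made up for this sketch:

```python
import torch

D = 128
values = torch.randn(20 + 30 + 40, D)    # packed ragged data for components [20, 30, 40]
offsets = torch.tensor([0, 20, 50, 90])  # component boundaries along the ragged dim

# Unbind-style fallback: reduce each component over its ragged dimension.
per_component_sum = torch.stack([
    values[offsets[i]:offsets[i + 1]].sum(dim=0)
    for i in range(offsets.numel() - 1)
])  # shape (3, D); a fused ragged-aware kernel would do this in one launch
```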

Eager mode performance for inductor generated kernels

Following up on the previous item: if we rely on inductor-generated kernels during compile but fall back to an unbind-based implementation in eager, eager subclasses (C++ or Python) would see poor performance even in a compute-bound regime. Hence, it may be necessary to have some way of running those inductor-generated kernels in eager mode.

Appendix

Code to reproduce

Profiler traces: operator fusion speedup

Unfused backward for sin/cos

Fused backward for sin/cos

Profiler traces: compute bound vs overhead bound execution

Compute-bound ([[20, 30, 40], 2^18] case)

Python NT no compile (38.3ms)

C++ NT no compile (37.8ms)

Python NT compile (37.4ms)

Overhead-bound ([[20, 30, 40], 2^14] case)

Python NT no compile (29.0ms)

C++ NT no compile (5.93ms)

Python NT compile (4.27ms)
