– with @ezyang, @bdhirsh, @jbschlosser, @cpuhrsch
Recently NestedTensor became compilable (stay tuned for a different post about this and our plans here), and just to get a sense of what the numbers should look like, I did some measuring:
Intro
NestedTensor allows one to handle raggedness in data efficiently in native PyTorch. Historically, NestedTensor (and most subclasses in general) has been written in C++ to ensure fast execution. More recently, however, writing tensor subclasses in Python has become more appealing as the PT2 stack gains more capabilities to potentially remove the overhead of staying in Python.
In particular, the approach featured here today is possible thanks to @bdhirsh’s work on subclass compilation via desugaring, i.e. converting the subclass into its underlying dense tensors before it is processed by the compiler backend. (Note that this is not the only way we plan to support compilation for NestedTensor; see future work.)
In this post we compare the performance of (1) the original NestedTensor in C++, (2) a new NestedTensor written as a Python subclass, and (3) the new Python subclass compiled with torch.compile via desugaring. Results were obtained on the PyTorch main branch with a single A100 GPU. We use variations of a dummy program consisting of sin, cos, and linear ops to demonstrate the effect of input size, model size (number of ops), operator fusion, etc. on runtime.
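For orientation, here is a minimal sketch of what the measured setup looks like in spirit; the function `toy_fn` and the exact shapes are illustrative assumptions, not the benchmark's actual code.

```python
import torch

# "[[20, 30, 40], 2^14]" in the tables below denotes a ragged batch of three
# components of lengths 20/30/40, each with a feature dimension of 2**14.
D = 2 ** 14
components = [torch.randn(n, D, device="cuda") for n in (20, 30, 40)]

# Python tensor-subclass NestedTensor (jagged layout); the original C++
# NestedTensor is constructed with the default strided layout instead.
nt = torch.nested.nested_tensor(components, layout=torch.jagged)

linear = torch.nn.Linear(D, D, device="cuda")

def toy_fn(x):
    # sin/cos are bandwidth-bound pointwise ops that the compiler can fuse;
    # linear provides the compute-bound matmul in between.
    return torch.cos(torch.sin(linear(x)))

# Compiling desugars the subclass into its underlying dense tensors before
# the backend (Inductor) sees the graph.
compiled_fn = torch.compile(toy_fn)
out = compiled_fn(nt)
```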
Summary
Overall, the results are directionally as expected, with the degree of eager python overhead being the biggest surprise.
- Varying input size: Compiled Python NT outperforms eager cpp across all input sizes. In the overhead-bound regime this is due to reduced overhead. In the compute-bound regime this is due to operator fusion. Non-compiled Python subclass matches eager cpp on large input sizes, but underperforms on smaller inputs due to Python subclass overhead.
- Varying program size (num operators): Speedups from compilation over eager cpp are more pronounced for larger programs (with more ops) as guard execution overhead etc is amortized away (as long as the larger program doesn’t induce proportionally more guards).
Future work:
- Overhead in the eager mode should be investigated.
- We plan to preserve the nested structure all the way to Inductor so that we can rely on generic codegen rather than custom kernels. (@jbschlosser is working on this.)
- There is also interest in exposing codegened Inductor kernels to eager mode to ensure fast eager performance.
Results
How does the size of the input tensor affect things?
Example 1: A moderate sized program (~60 ops in forward) with fusion opportunities
The compiled NestedTensor subclass is measured to be faster than eager cpp at small input sizes due to reduced overhead, and faster at large input sizes due to operator fusion speeding up bandwidth-bound operations. The Python subclass without compile matches cpp at large input sizes, since both are compute bound and launch the same kernels, but is significantly slower at small input sizes due to overhead.
Input size | cpp w/o compile | python w/o compile | python w/ compile |
---|---|---|---|
[[20, 30, 40], 2^14] | 8.812ms** | 37.774ms** | 4.27ms** |
[[20, 30, 40], 2^16] | 11.955ms | 38.983ms** | 10.559ms |
[[20, 30, 40], 2^18] | 43.27ms | 43.861ms | 37.73ms |
[[20, 30, 40], 2^20] | 168.7ms | 167.665ms | 145.7ms |
Table 1: ** indicates overhead bound
Example 2: A slightly smaller program (~40 ops) with fewer fusion opportunities
When there are fewer fusion opportunities the gap between compile and eager cpp narrows at large input sizes.
The compiled NestedTensor subclass is measured to be faster than the C++ NT in the overhead-bound regime, though the gap is narrower than in Example 1 because there are fewer ops in the forward/backward and hence less overhead to remove; see row 1 of Table 2 below. On the same row, we see that without torch.compile the Python NT subclass is much slower than C++. All three columns have nearly identical runtimes in the compute-bound regime, as expected, since their implementations do not differ significantly (if at all) in the underlying kernels called.
Input size | cpp w/o compile | python w/o compile | python w/ compile |
---|---|---|---|
[[20, 30, 40], 2^14] | 5.93ms** | 28.9ms** | 4.27ms** |
[[20, 30, 40], 2^16] | 10.5ms | 29.6ms** | 10.5ms |
[[20, 30, 40], 2^18] | 37.8ms | 38.3ms | 37.4ms |
[[20, 30, 40], 2^20] | 148.8ms | 147.6ms | 145.2ms |
Table 2: ** indicates overhead bound
How does the size of the program (number of ops) affect things?
Result: Smaller programs mean that the overhead of guard execution is more visible, so larger programs achieve higher speedup multiples.
The program used for measurement in this section is the same as in Example 1 of the previous section, except that we now vary the number of times we execute unit, and we test both the overhead-bound and compute-bound regimes. Each execution of unit corresponds to 6 aten ops.
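The exact definition of unit is not reproduced here; a plausible sketch consistent with the sin/cos/linear description above (the composition below is an assumption and may not match the real benchmark op-for-op) is:

```python
import torch

def unit(x, w1, w2):
    # Hypothetical reconstruction: roughly 6 aten ops per forward call
    # (linear, sin, cos, linear, sin, cos); the real `unit` may differ.
    x = torch.cos(torch.sin(torch.nn.functional.linear(x, w1)))
    x = torch.cos(torch.sin(torch.nn.functional.linear(x, w2)))
    return x

def program(x, w1, w2, N):
    # Varying N scales the program size; N=10 gives ~60 forward ops,
    # matching Example 1 in the previous section.
    for _ in range(N):
        x = unit(x, w1, w2)
    return x
```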
Example 1: Overhead-bound (smaller tensor): [[20, 30, 40], 2^14]
N (times unit is executed) | cpp w/o compile | python w/o compile | python w/ compile |
---|---|---|---|
2 | 2.5ms | 7.9ms | 2.0ms |
5 | 4.7ms | 18.34ms | 2.9ms |
10 | 8.5ms | 38ms | 4.5ms |
20 | 18ms | 73ms | 7ms |
Table 3
Example 2: Compute-bound (larger tensor): [[20, 30, 40], 2^20]
Python w/o compile matches cpp w/o compile at all graph sizes as long as input size is large. Compile is faster due to operator fusion. (See appendix for some profiler traces showing this.)
N (times unit is executed) | cpp w/o compile | python w/o compile | python w/ compile |
---|---|---|---|
2 | 34ms | 34ms | 30ms |
5 | 84ms | 85ms | 75ms |
10 | 169ms | 170ms | 148ms |
20 | 334ms | 334ms | 329ms |
Table 4
Future work
Reduce eager mode overhead
As pointed out in the results above, Python subclasses seem to carry a significant amount of eager-mode overhead. Eager performance of tensor subclasses also matters for users who cannot compile their model for various reasons.
aten::sin call for Python NT without compile
aten::sin call for C++ NT without compile
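One way to dig into this (a sketch of a possible methodology, not necessarily how the traces above were produced) is to profile a short eager loop and compare the CPU time attributed to each aten::sin dispatch for the Python and C++ NestedTensors:

```python
import torch
from torch.profiler import profile, ProfilerActivity

D = 2 ** 14
components = [torch.randn(n, D, device="cuda") for n in (20, 30, 40)]
nt = torch.nested.nested_tensor(components, layout=torch.jagged)  # Python subclass NT

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        torch.sin(nt)
    torch.cuda.synchronize()

# CPU total for aten::sin includes the Python-side subclass dispatch overhead;
# repeat with a strided-layout (C++) nested tensor to compare.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```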
Compile support without desugaring (Inductor codegen)
With the current torch.compile desugaring solution, we still must rely on custom kernels to implement operations that need to reduce or broadcast over the ragged dimension, and these custom kernels can take significant effort to write. Preserving the nested tensor structure all the way to Inductor and using Inductor codegen would be a way to cleanly support these operations without the need for custom kernels.
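For concreteness, the sketch below shows the kind of operation meant here, a reduction over the ragged dimension; whether this is served by a hand-written kernel, a fallback, or not supported at all depends on the PyTorch version, so treat it purely as an illustration.

```python
import torch

components = [torch.randn(n, 128) for n in (20, 30, 40)]
nt = torch.nested.nested_tensor(components, layout=torch.jagged)

# Reducing over the ragged dimension (dim=1 here) cannot be expressed as a
# single dense kernel over the packed values without segment information,
# which is why such ops need custom kernels under the desugaring path.
per_sequence_sum = nt.sum(dim=1)  # one row per component, e.g. shape [3, 128]
```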
Eager mode performance for inductor generated kernels
Following up on the previous item: if we rely on Inductor-generated kernels during compile but still fall back to an unbind-based implementation in eager, eager subclasses (C++ or Python) will see poor performance even in a compute-bound regime. Hence, it may be necessary to have some way of running those Inductor-generated kernels in eager.
Appendix
Code to reproduce
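The original listing is not included in this snapshot of the post; below is a minimal sketch of a comparable measurement harness (warmup plus CUDA-event timing of an illustrative sin/cos/linear program), where all names and shapes are assumptions rather than the benchmark's actual code.

```python
import torch

def toy_program(x, w1, w2, N=10):
    # Illustrative stand-in for the benchmarked program: N repetitions of a
    # linear/sin/cos "unit" (not the original code).
    for _ in range(N):
        x = torch.cos(torch.sin(torch.nn.functional.linear(x, w1)))
        x = torch.cos(torch.sin(torch.nn.functional.linear(x, w2)))
    return x

def bench(fn, *args, warmup=5, iters=20):
    # Warm up (the first compiled call also pays compilation cost here),
    # then time with CUDA events and average over iterations.
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per iteration

D = 2 ** 14
nt = torch.nested.nested_tensor(
    [torch.randn(n, D, device="cuda") for n in (20, 30, 40)],
    layout=torch.jagged,
)
w1 = torch.randn(D, D, device="cuda")
w2 = torch.randn(D, D, device="cuda")

print("python eager   :", bench(toy_program, nt, w1, w2), "ms")
print("python compiled:", bench(torch.compile(toy_program), nt, w1, w2), "ms")
```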
Profiler traces: Operator fusion speed up
Unfused backward for sin/cos
Fused backward for sin/cos
Profiler traces: compute bound vs overhead bound execution
Compute-bound ([[20, 30, 40], 2^18])
Python NT no compile (38.3ms)
C++ NT no compile (37.8ms)
Python NT compile (37.4ms)
Overhead-bound ([[20, 30, 40], 2^14] case)
Python NT no compile (29.0ms)
C++ NT no compile (5.93ms)
Python NT compile (4.27ms)