We’ve been working on DanLing NestedTensor, a NestedTensor implementation for variable-length tensors with broad PyTorch op coverage, and wanted to share some benchmark results.
To avoid confusion up front: this is about DanLing NestedTensor, not torch.nested.
DanLing registers the necessary dispatch and collation hooks so variable-length batches can work naturally with PyTorch:
- `PNTensor` + `register_pn_tensor_collate()` for DataLoader collation into `NestedTensor`
- `__torch_function__`/`__torch_dispatch__`-based handlers for PyTorch ops
- native support for built-in PyTorch transformer and vision models without padding or masks
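For readers unfamiliar with the general idea, here is a minimal illustration of what a jagged (variable-length) batch buys you over padding. It uses PyTorch's built-in `torch.nested` purely for illustration; DanLing's `NestedTensor` is a separate implementation with its own collation path via `PNTensor`:

```python
import torch

# Two variable-length "token embedding" sequences (lengths 3 and 5, dim 4).
a = torch.randn(3, 4)
b = torch.randn(5, 4)

# A padded batch allocates batch * max_len * dim elements regardless of
# how long each sequence actually is.
padded = torch.nn.utils.rnn.pad_sequence([a, b], batch_first=True)
print(padded.shape)  # torch.Size([2, 5, 4]) -> 40 slots, only 32 real

# A nested tensor stores only the real elements; no padding, no mask.
nt = torch.nested.nested_tensor([a, b])
print(nt.is_nested)  # True
```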
We currently have two kinds of benchmarks:
1. Real workload: IMDB training
A real-workload benchmark using a BERT-large-shaped `torch.nn.TransformerEncoder` on IMDB with long variable-length sequences.
Config:
bert-large-uncased, 2 epochs, batch size 32, max length 8192, d_model=1024, nhead=16, num_layers=24
Environment:
- single NVIDIA B200 180GB GPU
- PyTorch 2.11
- bfloat16
Result:

- Training step compute (forward + backward, all epochs)
  - NestedTensor: 154819.4 ms
  - Padded: 306926.7 ms
  - Result: 1.98x faster
- Peak extra CUDA memory per training step
  - NestedTensor: 12.68 GiB
  - Padded: 74.67 GiB
  - Result: 83% lower
This run measured nearly 2x faster model compute and an 83% reduction in peak extra CUDA memory for the NestedTensor path.
Note: the timing here is model forward+backward compute, not full end-to-end wall clock including tokenization, data loading, or validation.
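As a sanity check on the arithmetic, the headline figures follow directly from the raw measurements above:

```python
# Raw measurements from the IMDB run above.
nested_ms, padded_ms = 154819.4, 306926.7
nested_gib, padded_gib = 12.68, 74.67

speedup = padded_ms / nested_ms          # compute-time ratio
mem_reduction = 1 - nested_gib / padded_gib  # peak extra CUDA memory saved

print(f"{speedup:.2f}x faster")      # 1.98x faster
print(f"{mem_reduction:.0%} lower")  # 83% lower
```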
2. Model and operator benchmarks
We also maintain benchmark suites for broader coverage.
Models
Synthetic model benchmarks cover:
- `TransformerEncoder`
- `TransformerDecoder`
- `Transformer`
- ResNet-50
These are run across varying occupancy levels on a single NVIDIA B200 180GB GPU.
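"Occupancy" here is the fraction of the padded batch that holds real tokens. A quick sketch of how a target occupancy constrains the sequence lengths in a batch (the specific length distribution below is a made-up illustration, not the one used in the suite):

```python
# Occupancy = real tokens / (batch_size * max_len) in a padded layout.
batch_size, max_len = 8, 128

# Hypothetical lengths: one full-length sequence, the rest much shorter.
lengths = [128, 11, 11, 11, 11, 11, 11, 11]

occupancy = sum(lengths) / (batch_size * max_len)
print(f"occupancy = {occupancy:.0%}")  # occupancy = 20%
```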
A few examples from the README:
- `TransformerEncoder`, Train, 20% occupancy
  - Padded (eager): 26.95 ms
  - DanLing (eager): 8.60 ms
  - 3.13x faster
- `TransformerDecoder`, Infer, 20% occupancy
  - Padded (eager): 47.38 ms
  - DanLing (eager): 9.27 ms
  - 5.11x faster
- `Transformer`, Train, 21% occupancy
  - Padded (eager): 68.79 ms
  - DanLing (eager): 24.53 ms
  - 2.80x faster
Operators
Synthetic operator benchmarks cover common transformer-hot ops and tensor primitives across:
- padded tensors
- DanLing `NestedTensor`
- `torch.nested`
Examples at 35% occupancy:
- `F.layer_norm`
  - Padded (eager): 0.17 ms
  - DanLing (eager): 0.10 ms
  - `torch.nested` (eager): 0.20 ms
- `F.softmax`
  - Padded (eager): 0.12 ms
  - DanLing (eager): 0.11 ms
  - `torch.nested` (eager): 0.15 ms
- `torch.add`
  - Padded (eager): 0.05 ms
  - DanLing (eager): 0.12 ms
  - `torch.nested` (eager): 0.11 ms
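One way to read the `torch.add` row: an elementwise op on a padded batch runs as a single dense kernel, so even though it touches every padding slot, at low occupancy the extra arithmetic can still be cheaper than a jagged layout's bookkeeping. A rough back-of-envelope (pure arithmetic, not a measurement):

```python
# At 35% occupancy, a padded elementwise op processes 1/0.35 ~ 2.9x more
# elements than strictly necessary, but does so in one contiguous kernel.
occupancy = 0.35
work_ratio = 1 / occupancy
print(f"padded does {work_ratio:.1f}x the elementwise work")  # 2.9x
```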
The goal of the synthetic benchmarks is to isolate behavior across occupancy levels, models, and operators, while the IMDB benchmark gives a more realistic workload result.
Source Code: DanLing/danling/tensors at master · ZhiyuanChen/DanLing · GitHub
Would especially appreciate feedback on:
- additional real workloads that would be useful to benchmark
- PyTorch-native model paths we should test next
- operator/model cases where comparison against `torch.nested` would be most useful