DanLing NestedTensor on variable-length inputs: IMDB, model, and operator benchmarks

We’ve been working on DanLing NestedTensor, an implementation of nested (variable-length) tensors with broad PyTorch operator coverage, and wanted to share some benchmark results.

To avoid confusion up front: this post is about DanLing NestedTensor, not PyTorch's built-in torch.nested.

DanLing registers the necessary dispatch and collation hooks so variable-length batches can work naturally with PyTorch:

  • PNTensor + register_pn_tensor_collate() for DataLoader collation into NestedTensor
  • __torch_function__ / __torch_dispatch__-based handlers for PyTorch ops
  • native support for built-in PyTorch transformer and vision models without padding or masks
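For reference, the padded baseline that these benchmarks compare against can be written in plain PyTorch. This is generic padding-collate code, not DanLing code; per the list above, the DanLing path would instead collate into a NestedTensor via PNTensor + register_pn_tensor_collate():

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def padded_collate(batch):
    """Generic padded baseline: pad variable-length sequences to the
    batch max length and build the key_padding_mask a transformer needs."""
    lengths = torch.tensor([t.size(0) for t in batch])
    padded = pad_sequence(batch, batch_first=True)  # (B, max_len, *)
    # True marks padding positions (the nn.TransformerEncoder convention)
    mask = torch.arange(padded.size(1))[None, :] >= lengths[:, None]
    return padded, mask

seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(7, 8)]
padded, mask = padded_collate(seqs)
print(padded.shape)       # torch.Size([3, 7, 8])
print(mask.sum().item())  # 6 padding positions: (7-5) + (7-3) + (7-7)
```

Every downstream op then runs over the full (B, max_len) buffer, which is where the padded path's extra compute and memory come from.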

We currently have two kinds of benchmarks:

1. Real workload: IMDB training

This benchmark trains a BERT-large-shaped torch.nn.TransformerEncoder on IMDB, which has long, highly variable-length sequences.

Config:
bert-large-uncased, 2 epochs, batch size 32, max length 8192, d_model=1024, nhead=16, num_layers=24

Environment:

  • single NVIDIA B200 180GB GPU
  • PyTorch 2.11
  • bfloat16

Result:

  • Training step compute (forward + backward, all epochs)

    • NestedTensor: 154819.4 ms
    • Padded: 306926.7 ms
    • Result: 1.98x faster
  • Peak extra CUDA memory per training step

    • NestedTensor: 12.68 GiB
    • Padded: 74.67 GiB
    • Result: 83% lower

In this run, the NestedTensor path delivered nearly 2x faster model compute and an 83% reduction in peak extra CUDA memory.

Note: the timing here is model forward+backward compute, not full end-to-end wall clock including tokenization, data loading, or validation.
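The memory gap is largely a consequence of element counts: a padded batch materializes batch_size × max_len positions, while a packed/nested layout stores only sum(lengths). A back-of-envelope with hypothetical lengths (not the actual IMDB distribution):

```python
# Hypothetical length distribution; illustrates the padding overhead only.
lengths = [512, 700, 1200, 8192, 300, 450, 2048, 640]
batch_size = len(lengths)
max_len = max(lengths)

padded_elems = batch_size * max_len  # what a padded batch allocates
packed_elems = sum(lengths)          # what a nested layout stores
ratio = padded_elems / packed_elems

print(f"padded: {padded_elems}, packed: {packed_elems}")
print(f"padding stores {ratio:.1f}x more elements")
```

One long outlier (here 8192) forces every sequence in the batch up to max_len, so the gap widens as the length distribution gets more skewed, and activation memory in a transformer scales accordingly.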

2. Model and operator benchmarks

We also maintain benchmark suites for broader coverage.

Models

Synthetic model benchmarks cover:

  • TransformerEncoder
  • TransformerDecoder
  • Transformer
  • ResNet-50

These are run across varying occupancy levels on a single NVIDIA B200 180GB GPU.
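Occupancy here presumably means the fraction of the padded (batch, max_len) buffer filled by real tokens; that's an assumption about the benchmark's terminology, but it can be stated in a few lines:

```python
def occupancy(lengths: list[int]) -> float:
    """Fraction of a padded (batch, max_len) buffer filled by real tokens.
    (Assumed definition; check the benchmark suite for the exact one.)"""
    return sum(lengths) / (len(lengths) * max(lengths))

# One long outlier drags occupancy down even if most sequences are short.
print(f"{occupancy([100, 120, 80, 500]):.0%}")  # 40%
```

Lower occupancy means more of the padded batch is wasted work, which is why the speedups below grow as occupancy drops.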

A few examples from the README:

  • TransformerEncoder, Train, 20% occupancy

    • Padded (eager): 26.95 ms
    • DanLing (eager): 8.60 ms
    • 3.13x faster
  • TransformerDecoder, Infer, 20% occupancy

    • Padded (eager): 47.38 ms
    • DanLing (eager): 9.27 ms
    • 5.11x faster
  • Transformer, Train, 21% occupancy

    • Padded (eager): 68.79 ms
    • DanLing (eager): 24.53 ms
    • 2.80x faster

Operators

Synthetic operator benchmarks cover ops that dominate transformer hot paths, plus tensor primitives, across three layouts:

  • padded tensors
  • DanLing NestedTensor
  • torch.nested
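Of the three layouts, the padded and torch.nested ones can be built from the same variable-length tensors with stock PyTorch (DanLing's own construction is omitted here rather than guessed at):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# The same variable-length sequences under two of the compared layouts.
seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(7, 8)]

padded = pad_sequence(seqs, batch_first=True)  # dense (3, 7, 8) buffer
nt = torch.nested.nested_tensor(seqs)          # irregular first dim, no padding

print(padded.shape)                        # torch.Size([3, 7, 8])
print(nt.is_nested)                        # True
print([t.shape[0] for t in nt.unbind()])   # [5, 3, 7]
```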

Examples at 35% occupancy:

  • F.layer_norm

    • Padded (eager): 0.17 ms
    • DanLing (eager): 0.10 ms
    • torch.nested (eager): 0.20 ms
  • F.softmax

    • Padded (eager): 0.12 ms
    • DanLing (eager): 0.11 ms
    • torch.nested (eager): 0.15 ms
  • torch.add

    • Padded (eager): 0.05 ms
    • DanLing (eager): 0.12 ms
    • torch.nested (eager): 0.11 ms

The goal of the synthetic benchmarks is to isolate behavior across occupancy levels, models, and operators, while the IMDB benchmark gives a more realistic workload result.
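For anyone reproducing the per-op numbers, here is a minimal timing harness sketch. It uses CPU perf_counter for portability; the figures above are CUDA, where torch.cuda.Event(enable_timing=True) with synchronization is the usual approach:

```python
import time
import torch
import torch.nn.functional as F

def bench(fn, *args, warmup=5, iters=50):
    """Median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):  # warm up caches / lazy init
        fn(*args)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        times.append((time.perf_counter() - t0) * 1e3)
    return sorted(times)[len(times) // 2]

# Hypothetical shapes, not the benchmark suite's configuration.
x = torch.randn(32, 128, 256)
ms = bench(F.softmax, x, -1)
print(f"F.softmax: {ms:.3f} ms")
```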

Source code: danling/tensors in the ZhiyuanChen/DanLing repository on GitHub

Would especially appreciate feedback on:

  • additional real workloads that would be useful to benchmark
  • PyTorch-native model paths we should test next
  • operator/model cases where comparison against torch.nested would be most useful