DanLing NestedTensor on variable-length inputs: IMDB, model, and operator benchmarks

We’ve been working on DanLing NestedTensor, an implementation of nested (variable-length) tensors with broad PyTorch operator coverage, and wanted to share some benchmark results.

To avoid confusion up front: this post is about DanLing NestedTensor, not PyTorch's built-in torch.nested.

DanLing registers the necessary dispatch and collation hooks so variable-length batches can work naturally with PyTorch:

  • PNTensor + register_pn_tensor_collate() for DataLoader collation into NestedTensor
  • __torch_function__ / __torch_dispatch__-based handlers for PyTorch ops
  • native support for built-in PyTorch transformer and vision models without padding or masks
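For reference, the padded baseline that these benchmarks compare against can be written in plain PyTorch. This is generic padding-collate code, not DanLing code; per the list above, the DanLing path would instead collate into a NestedTensor via PNTensor + register_pn_tensor_collate():

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def padded_collate(batch):
    """Generic padded baseline: pad variable-length sequences to the
    batch max length and build the key_padding_mask a transformer needs."""
    lengths = torch.tensor([t.size(0) for t in batch])
    padded = pad_sequence(batch, batch_first=True)  # (B, max_len, *)
    # True marks padding positions (the nn.TransformerEncoder convention)
    mask = torch.arange(padded.size(1))[None, :] >= lengths[:, None]
    return padded, mask

seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(7, 8)]
padded, mask = padded_collate(seqs)
print(padded.shape)       # torch.Size([3, 7, 8])
print(mask.sum().item())  # 6 padding positions: (7-5) + (7-3) + (7-7)
```

Every downstream op then runs over the full (B, max_len) buffer, which is where the padded path's extra compute and memory come from.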

We currently have two kinds of benchmarks:

1. Real workload: IMDB training

This benchmark trains a BERT-large-shaped torch.nn.TransformerEncoder on IMDB, which has long, highly variable-length sequences.

Config:
bert-large-uncased, 2 epochs, batch size 32, max length 8192, d_model=1024, nhead=16, num_layers=24

Environment:

  • single NVIDIA B200 180GB GPU
  • PyTorch 2.11
  • bfloat16

Result:

  • Training step compute (forward + backward, all epochs)

    • NestedTensor: 154819.4 ms
    • Padded: 306926.7 ms
    • Result: 1.98x faster
  • Peak extra CUDA memory per training step

    • NestedTensor: 12.68 GiB
    • Padded: 74.67 GiB
    • Result: 83% lower

In this run, the NestedTensor path delivered nearly 2x faster model compute and an 83% reduction in peak extra CUDA memory.

Note: the timing here is model forward+backward compute, not full end-to-end wall clock including tokenization, data loading, or validation.
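The memory gap is largely a consequence of element counts: a padded batch materializes batch_size × max_len positions, while a packed/nested layout stores only sum(lengths). A back-of-envelope with hypothetical lengths (not the actual IMDB distribution):

```python
# Hypothetical length distribution; illustrates the padding overhead only.
lengths = [512, 700, 1200, 8192, 300, 450, 2048, 640]
batch_size = len(lengths)
max_len = max(lengths)

padded_elems = batch_size * max_len  # what a padded batch allocates
packed_elems = sum(lengths)          # what a nested layout stores
ratio = padded_elems / packed_elems

print(f"padded: {padded_elems}, packed: {packed_elems}")
print(f"padding stores {ratio:.1f}x more elements")
```

One long outlier (here 8192) forces every sequence in the batch up to max_len, so the gap widens as the length distribution gets more skewed, and activation memory in a transformer scales accordingly.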

2. Model and operator benchmarks

We also maintain benchmark suites for broader coverage.

Models

Synthetic model benchmarks cover:

  • TransformerEncoder
  • TransformerDecoder
  • Transformer
  • ResNet-50

These are run across varying occupancy levels on a single NVIDIA B200 180GB GPU.
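Occupancy here presumably means the fraction of the padded (batch, max_len) buffer filled by real tokens; that's an assumption about the benchmark's terminology, but it can be stated in a few lines:

```python
def occupancy(lengths: list[int]) -> float:
    """Fraction of a padded (batch, max_len) buffer filled by real tokens.
    (Assumed definition; check the benchmark suite for the exact one.)"""
    return sum(lengths) / (len(lengths) * max(lengths))

# One long outlier drags occupancy down even if most sequences are short.
print(f"{occupancy([100, 120, 80, 500]):.0%}")  # 40%
```

Lower occupancy means more of the padded batch is wasted work, which is why the speedups below grow as occupancy drops.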

A few examples from the README:

  • TransformerEncoder, Train, 20% occupancy

    • Padded (eager): 26.95 ms
    • DanLing (eager): 8.60 ms
    • 3.13x faster
  • TransformerDecoder, Infer, 20% occupancy

    • Padded (eager): 47.38 ms
    • DanLing (eager): 9.27 ms
    • 5.11x faster
  • Transformer, Train, 21% occupancy

    • Padded (eager): 68.79 ms
    • DanLing (eager): 24.53 ms
    • 2.80x faster

Operators

Synthetic operator benchmarks cover ops that dominate transformer hot paths, plus tensor primitives, across three layouts:

  • padded tensors
  • DanLing NestedTensor
  • torch.nested
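Of the three layouts, the padded and torch.nested ones can be built from the same variable-length tensors with stock PyTorch (DanLing's own construction is omitted here rather than guessed at):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# The same variable-length sequences under two of the compared layouts.
seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(7, 8)]

padded = pad_sequence(seqs, batch_first=True)  # dense (3, 7, 8) buffer
nt = torch.nested.nested_tensor(seqs)          # irregular first dim, no padding

print(padded.shape)                        # torch.Size([3, 7, 8])
print(nt.is_nested)                        # True
print([t.shape[0] for t in nt.unbind()])   # [5, 3, 7]
```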

Examples at 35% occupancy:

  • F.layer_norm

    • Padded (eager): 0.17 ms
    • DanLing (eager): 0.10 ms
    • torch.nested (eager): 0.20 ms
  • F.softmax

    • Padded (eager): 0.12 ms
    • DanLing (eager): 0.11 ms
    • torch.nested (eager): 0.15 ms
  • torch.add

    • Padded (eager): 0.05 ms
    • DanLing (eager): 0.12 ms
    • torch.nested (eager): 0.11 ms

The goal of the synthetic benchmarks is to isolate behavior across occupancy levels, models, and operators, while the IMDB benchmark gives a more realistic workload result.
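For anyone reproducing the per-op numbers, here is a minimal timing harness sketch. It uses CPU perf_counter for portability; the figures above are CUDA, where torch.cuda.Event(enable_timing=True) with synchronization is the usual approach:

```python
import time
import torch
import torch.nn.functional as F

def bench(fn, *args, warmup=5, iters=50):
    """Median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):  # warm up caches / lazy init
        fn(*args)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        times.append((time.perf_counter() - t0) * 1e3)
    return sorted(times)[len(times) // 2]

# Hypothetical shapes, not the benchmark suite's configuration.
x = torch.randn(32, 128, 256)
ms = bench(F.softmax, x, -1)
print(f"F.softmax: {ms:.3f} ms")
```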

Source code: danling/tensors in the ZhiyuanChen/DanLing repository on GitHub

Would especially appreciate feedback on:

  • additional real workloads that would be useful to benchmark
  • PyTorch-native model paths we should test next
  • operator/model cases where comparison against torch.nested would be most useful