Summary
PyTorch added uint16, uint32, and uint64 to the type system (c10::ScalarType) and defined dispatch macros (AT_BAREBONES_UNSIGNED_TYPES, AT_INTEGRAL_TYPES_V2), but operator coverage is highly incomplete. Basic operations like torch.add, torch.sub, and torch.lt raise NotImplementedError for these types. This RFC proposes completing that coverage at the ATen level.
```python
import torch

a = torch.tensor([10, 20], dtype=torch.uint16)
b = torch.tensor([5, 10], dtype=torch.uint16)
a + b  # NotImplementedError: "add_stub" not implemented for 'UInt16'
a < b  # NotImplementedError: Could not run 'aten::lt.Tensor_out'
```
34 operators fail across 3 dtypes × 2 devices (CPU, CUDA): roughly 183 failing op/dtype/device combinations out of a possible 204 (34 × 3 × 2; some ops fail on only one device), as of PyTorch 2.12.0a0.
Why It Matters
The unsigned integer spectrum:
| Width | Max Value | Use Case |
|---|---|---|
| 8-bit | 255 | Pixel values (JPEG/PNG), boolean masks (already fully supported) |
| 16-bit | 65,535 | DICOM medical images, 16-bit imaging, GIS satellite rasters, sensor data, medium-sized token vocabularies (~50K) |
| 32-bit | 4,294,967,295 | Large token vocabularies (>65k tokens), GIS rasters, modular arithmetic (Z_{2^32}) |
| 64-bit | 1.8446744×10¹⁹ | Cryptographic hash spaces (Z_{2^64}), unique identifiers, modular arithmetic |
and the workloads that depend on it:
- Higher fidelity images (DICOM, GIS, HDR): Medical imaging (DICOM) stores pixel intensities as uint16 natively: a CT scan slice is a 512×512 grid of 16-bit unsigned pixel values (0–65,535), from which Hounsfield units are derived via rescale slope/intercept. Truncating to uint8 destroys diagnostic information; widening to int32 doubles GPU memory on already memory-bound datasets. The same applies to GIS satellite rasters (Landsat, Sentinel-2, loaded via GDAL → NumPy → PyTorch) and HDR imaging pipelines that need add, sub, mul, and clamp on 16-bit unsigned channels.
- Medium-sized token vocabularies (~50k tokens) — GPT-2, GPT-3, SmolLM2, OLMoE, GPT-NeoX, RoBERTa — fit in uint16 (max 65,535) but not int16 (max 32,767), at half the memory of int32. Larger tokenizers (GPT-4: 100k, Llama 3: 128k, Gemma: 256k) need uint32 or int32; the uint32 advantage there is semantic correctness for non-negative IDs and zero-cost interop with external uint32 data (NumPy, Arrow, Parquet) rather than memory savings over int32. But today neither uint16 nor uint32 tensors can even be added.
- Cryptography (Z_{2^64} and Z_{2^32}): Hash functions (SipHash, xxHash), PRNGs (xorshift64), and stream ciphers (ChaCha20) operate in unsigned integer rings requiring well-defined overflow (wrap modulo 2^N). This is guaranteed by C++ for unsigned types but undefined for signed types; int64 workarounds produce incorrect results.
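A pure-Python sketch of why wrap-on-overflow matters for the cryptographic workloads above: an xorshift64 PRNG step is only correct if every left shift wraps modulo 2^64. Python integers are arbitrary-precision, so the mask makes the wrap explicit; C++ `uint64_t` guarantees it for free, while signed `int64` overflow is undefined behavior.

```python
MASK64 = (1 << 64) - 1  # wrap modulo 2**64, as C++ guarantees for uint64_t

def xorshift64(x: int) -> int:
    # Classic xorshift64 step (Marsaglia); the left shifts must wrap,
    # otherwise the state grows without bound and the output is wrong.
    x ^= (x << 13) & MASK64
    x ^= x >> 7
    x ^= (x << 17) & MASK64
    return x & MASK64

state = xorshift64(0x0123456789ABCDEF)
assert 0 < state <= MASK64  # result stays inside the 64-bit ring
```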
Why ATen
- ATen is the operator foundation. Every PyTorch operator in eager mode, torch.compile, and autograd dispatches through ATen. Adding support here propagates everywhere.
- The infrastructure already exists. c10::ScalarType::UInt16/UInt32/UInt64, dispatch macros, and isIntegralType() all recognize these types. The gap is that individual kernels don’t include them in their dispatch tables.
- PT2 / Triton builds on top. ATen provides the eager-mode baseline and reference implementation. Once types work in eager mode, torch.compile can lower them through Inductor/Triton for GPU codegen. Triton makes it straightforward to JIT compile for new dtypes, so full ATen kernel coverage isn’t needed for every op, but the eager kernels must come first.
Current State (Runtime-Validated)
Works: copy_, fill_, clone, eq, ne, mul (CPU), div (float promotion), sum, prod, cumsum, cumprod, unique, sort (CPU), item, factories, NumPy/DLPack interop, bitwise_and/or/xor (CPU), index_select/gather (CUDA).
Broken (34 ops, 183 failures): add, sub, mul (CUDA), floor_div, remainder, lt/le/gt/ge, maximum/minimum, abs, neg, sign, bitwise_not, bitwise_and/or/xor (CUDA), lshift/rshift, clamp, min/max/argmin/argmax, index_select/gather (CPU), index_put, masked_fill, where, arange, tril/triu, sort (CUDA).
The root cause for add/sub is torchgen/model.py — the codegen ScalarType enum and DTYPE_CLASSES don’t include UInt16/32/64, so ufunc-driven ops never get kernels generated. The rest just need their dispatch macros updated to include AT_BAREBONES_UNSIGNED_TYPES.
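To make the codegen gap concrete, here is a hedged, self-contained sketch (not the actual torchgen/model.py, whose layout differs): the fix amounts to adding the three unsigned members to the codegen-side ScalarType enum so ufunc-driven codegen can emit kernels for them.

```python
from enum import Enum

# Illustrative stand-in for torchgen's codegen-side dtype enum; member
# names mirror c10::ScalarType. Today UInt16/UInt32/UInt64 are absent
# there, so ufunc-generated ops (add, sub, ...) get no kernels for them.
class ScalarType(Enum):
    Byte = "uint8"     # already covered
    UInt16 = "uint16"  # to be added
    UInt32 = "uint32"  # to be added
    UInt64 = "uint64"  # to be added

unsigned = [t for t in ScalarType if t.name.startswith("UInt")]
print([t.value for t in unsigned])  # ['uint16', 'uint32', 'uint64']
```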
Reference Issues
- #58734: “Support for uint16, uint32, and uint64” (open since May 2021, 50+ thumbs-up)
- #176298: Binary ops not implemented for uint16/32/64
- #105264: Inductor generates incorrect CPU code for uint8 operations (related uint codegen issue)
Open questions
We’re looking for feedback from the PyTorch community on the best way to close this gap. We’re ready to contribute the implementation — the work touches codegen (torchgen/model.py) and per-op dispatch macros across ATen — and would appreciate guidance on scope, phasing, and any design constraints we should be aware of.
Specific input we would like from maintainers and users:
- Are there operators beyond the 34 identified that are critical for your uint16/32/64 workloads?
- Overflow on subtraction: should `torch.sub(a, b)` where `b > a` wrap (C++ unsigned semantics, same as NumPy) or error? Current proposal: wrap.
- `torch.neg` for unsigned types: raise an error (preferred) or wrap via unsigned overflow? NumPy wraps with a RuntimeWarning.
- This targets CPU and CUDA. MPS currently silently produces wrong results for uint16/32/64 binary ops (#176296). Should MPS/XPU be in scope or tracked separately?
For questions 2 and 3, we plan to follow the same semantics as uint8 unless advised otherwise.
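The wrap behavior proposed for subtraction matches what NumPy arrays already do; a quick illustration of the semantics in question:

```python
import numpy as np

# Unsigned subtraction on NumPy arrays wraps modulo 2**16:
# 5 - 10 in uint16 yields 65531, with no error or warning.
a = np.array([5, 10], dtype=np.uint16)
b = np.array([10, 10], dtype=np.uint16)
wrapped = a - b
print(wrapped)  # [65531     0]
assert int(wrapped[0]) == (5 - 10) % 2**16 == 65531
```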
@jgroenen@redhat.com @cleonard530