Planning for Release 2.6.1

atalman · February 19, 2025, 3:33pm

Hi Team,

Following the process for patch release at pytorch/RELEASE.md at main · pytorch/pytorch · GitHub we have been looking at the issues related to 2.6.0. Currently we have some open items. You can see the whole list here: 2.6.1 Milestone · GitHub

There are only 2 critical items, and they are not resolved yet:

github.com/pytorch/pytorch

`torch.device(0)` makes CUDA init fail in subprocess since `2.5.0`

opened 04:01PM - 03 Jan 25 UTC

cbensimon

high priority module: cuda triaged module: regression module: accelerator

### 🐛 Describe the bug ```python from multiprocessing import Process impo…rt torch torch.device(0) # Note that torch.device('cuda') or torch.device('cuda:0') do not trigger the issue def cuda_init(): torch.Tensor([0]).cuda() p = Process(target=cuda_init) p.start() p.join() assert p.exitcode == 0 ``` This code snippet succeeds on PyTorch `2.4.1` and fails on `2.5.0`: ``` RuntimeError: CUDA error: initialization error Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. ``` Indeed, since `2.5.0`, `torch.device(0)` calls `at::getAccelerator`, which ends up calling `cudaGetDeviceCount` and thus initializing CUDA and preventing forks It seems to be directly linked with: - https://github.com/pytorch/pytorch/pull/131811 (especially the change in `torch/csrc/utils/python_arg_parser.h`) ### Versions Collecting environment information... PyTorch version: 2.5.1+cu124 Is debug build: False CUDA used to build PyTorch: 12.4 ROCM used to build PyTorch: N/A OS: Ubuntu 20.04.5 LTS (x86_64) GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 Clang version: Could not collect CMake version: version 3.26.3 Libc version: glibc-2.31 Python version: 3.10.9 (main, Feb 3 2023, 11:29:04) [GCC 9.4.0] (64-bit runtime) Python platform: Linux-5.15.0-1028-aws-x86_64-with-glibc2.31 Is CUDA available: True CUDA runtime version: 11.0.221 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA A10G Nvidia driver version: 535.183.01 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 48 bits physical, 48 bits virtual CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 23 Model: 49 Model name: AMD EPYC 7R32 Stepping: 0 CPU MHz: 2820.130 BogoMIPS: 5599.99 Hypervisor vendor: KVM Virtualization type: full L1d cache: 128 KiB L1i cache: 128 KiB L2 cache: 2 MiB L3 cache: 16 MiB NUMA node0 CPU(s): 0-7 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid Versions of relevant libraries: [pip3] mypy-extensions==1.0.0 [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu11==11.10.3.66 [pip3] nvidia-cublas-cu12==12.4.5.8 [pip3] nvidia-cuda-cupti-cu11==11.7.101 [pip3] nvidia-cuda-cupti-cu12==12.4.127 [pip3] nvidia-cuda-nvrtc-cu11==11.7.99 [pip3] nvidia-cuda-nvrtc-cu12==12.4.127 [pip3] nvidia-cuda-runtime-cu11==11.7.99 [pip3] nvidia-cuda-runtime-cu12==12.4.127 [pip3] nvidia-cudnn-cu11==8.5.0.96 [pip3] nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu11==10.9.0.58 [pip3] nvidia-cufft-cu12==11.2.1.3 [pip3] nvidia-curand-cu11==10.2.10.91 [pip3] nvidia-curand-cu12==10.3.5.147 [pip3] nvidia-cusolver-cu11==11.4.0.1 [pip3] nvidia-cusolver-cu12==11.6.1.9 [pip3] nvidia-cusparse-cu11==11.7.4.91 [pip3] nvidia-cusparse-cu12==12.3.1.170 [pip3] nvidia-nccl-cu11==2.14.3 [pip3] nvidia-nccl-cu12==2.21.5 [pip3] nvidia-nvjitlink-cu12==12.4.127 [pip3] nvidia-nvtx-cu11==11.7.91 [pip3] nvidia-nvtx-cu12==12.4.127 [pip3] torch==2.5.1 [pip3] torchaudio==2.5.1 [pip3] torchsde==0.2.6 [pip3] torchvision==0.20.1 [pip3] triton==3.1.0 [conda] Could not collect cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @ptrblck @eqy @albanD @guangyey @EikanWang

github.com/pytorch/pytorch

`Illegal instruction (core dumped)` on Raspberry Pi 4 when exporting ONNX with `torch 2.6.0`

opened 12:05PM - 09 Feb 25 UTC

Chizkiyahu

high priority triage review module: crash module: regression module: arm

### 🐛 Describe the bug #### **Description** On Raspberry Pi 4, `torch.onnx.expo…rt` fails with `Illegal instruction (core dumped)` in `torch 2.6.0`. The same code works fine on `torch 2.5.1`. The issue occurs when using `x.expand(x.shape[0], -1, -1)` inside a `torch.nn.Module`. The crash happens **only during ONNX export**, not during regular inference. #### **Code to Reproduce** ```python import torch class Module(torch.nn.Module): def forward(self, x): return x.expand(x.shape[0], -1, -1) # Crashes here during ONNX export model = Module() dummy_inputs = tuple(torch.randn(1, 1, 192)) # Running the model works fine res = model(*dummy_inputs) # Exporting to ONNX causes core dump torch.onnx.export(model, opset_version=20, f="./m.onnx", args=dummy_inputs) ``` #### **Error Output** ``` Illegal instruction (core dumped) ``` #### **Device and Environment Details** | Device | PyTorch Version | Execution Type | Status | |----------------------------|----------------|----------------|---------| | MacBook Pro M4 (native) | 2.6.0 | Native | ✅ Works | | MacBook Pro M4 (Docker) | 2.6.0 | Docker | ✅ Works | | Raspberry Pi 4 (native) | 2.5.1 | Native | ✅ Works | | Raspberry Pi 4 (Docker) | 2.5.1 | Docker | ✅ Works | | Raspberry Pi 4 (native) | 2.6.0 | Native | ❌ **Fails** | | Raspberry Pi 4 (Docker) | 2.6.0 | Docker | ❌ **Fails** | | Raspberry Pi 5 (native) | 2.6.0 | Native | ✅ Works | # raspi 4 vs 5 cpu Features running `cat /proc/cpuinfo | grep 'Fe' | uniq` ## raspi 4 ```bash Features : fp asimd evtstrm crc32 cpuid ``` ## raspi 5 ```bash Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ``` ### Versions Collecting environment information... PyTorch version: 2.6.0+cpu Is debug build: False CUDA used to build PyTorch: None ROCM used to build PyTorch: N/A OS: Debian GNU/Linux 12 (bookworm) (aarch64) GCC version: Could not collect Clang version: Could not collect CMake version: Could not collect Libc version: glibc-2.36 Python version: 3.11.11 (main, Feb 4 2025, 13:44:55) [GCC 12.2.0] (64-bit runtime) Python platform: Linux-5.15.32-v8+-aarch64-with-glibc2.36 Is CUDA available: False CUDA runtime version: No CUDA CUDA_MODULE_LOADING set to: N/A GPU models and configuration: No CUDA Nvidia driver version: No CUDA cuDNN version: No CUDA HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: aarch64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Vendor ID: ARM Model name: Cortex-A72 Model: 3 Thread(s) per core: 1 Core(s) per cluster: 4 Socket(s): - Cluster(s): 1 Stepping: r0p3 CPU(s) scaling MHz: 100% CPU max MHz: 1800.0000 CPU min MHz: 600.0000 BogoMIPS: 108.00 Flags: fp asimd evtstrm crc32 cpuid Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Vulnerable Vulnerability Spectre v1: Mitigation; __user pointer sanitization Vulnerability Spectre v2: Vulnerable Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Versions of relevant libraries: [pip3] numpy==1.26.4 [pip3] onnx==1.16.1 [pip3] onnxruntime==1.19.2 [pip3] onnxruntime_extensions==0.13.0 [pip3] torch==2.6.0 [pip3] torchvision==0.21.0 [pip3] uni_pytorch==0.0.0 [conda] Could not collect cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @malfet @snadampal @milpuz01

We will be having Release 2.6.1 Go/NoGo Meeting on Tuesday Feb 25. Giving the current issues resolution is trending towards a no go decision.

If you have any questions or concerns regarding Release 2.6.1, please feel free to leave a comment under this post.

Cheers,
Team PyTorch

jeffdaily · February 20, 2025, 8:09pm

This one is critical for ROCm.

github.com/pytorch/pytorch

[ROCm] scaled_dot_product_attention using mem-efficient backend (aotriton) produces wrong outputs with custom attn_mask on torch 2.6.0+rocm6.2.4

opened 03:53PM - 19 Feb 25 UTC

fxmarty-amd

module: rocm

### 🐛 Describe the bug Hi, As discussed on slack and on https://github.com/hug…gingface/transformers/issues/30056#issuecomment-2657390613, SDPA with custom attn_mask using mem-efficient backend (aotriton 0.8.0) produces wrong outputs on torch 2.6.0 stable ROCm release. This is fixed on torch nightly that uses aotriton 0.8.2. Reproduction: ```python import torch from torch.nn.attention import SDPBackend, sdpa_kernel from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask_for_sdpa batch_size = 2 num_heads = 32 head_dim = 128 num_tokens_q = 7 num_tokens_kv = num_tokens_q device= "cuda" dtype = torch.float16 num_pad_tokens = 3 query = torch.rand(batch_size, num_heads, num_tokens_q, head_dim, dtype=dtype, device=device) - 0.5 key = torch.rand(batch_size, num_heads, num_tokens_q, head_dim, dtype=dtype, device=device) - 0.5 value = torch.rand(batch_size, num_heads, num_tokens_q, head_dim, dtype=dtype, device=device) - 0.5 attn_mask_2d = torch.ones(batch_size, num_tokens_q, dtype=torch.int32, device=device) attn_mask_2d[1][:num_pad_tokens] = 0 # simulate padding attn_mask_4d = _prepare_4d_causal_attention_mask_for_sdpa( attn_mask_2d, input_shape=(batch_size, num_tokens_q), inputs_embeds=query, # this is only used to retrieve device, dtype. past_key_values_length=0, ) print("attn_mask_4d", attn_mask_4d) with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION): sdpa_out_efficient = torch.nn.functional.scaled_dot_product_attention( query, key, value, attn_mask=attn_mask_4d ) with sdpa_kernel(SDPBackend.MATH): sdpa_out_math = torch.nn.functional.scaled_dot_product_attention( query, key, value, attn_mask=attn_mask_4d ) with sdpa_kernel(SDPBackend.MATH): sdpa_out_math_cpu = torch.nn.functional.scaled_dot_product_attention( query.cpu(), key.cpu(), value.cpu(), attn_mask=attn_mask_4d.cpu() ) print("[rocm math vs rocm mem-efficient] Median abs diff, non padded sequence:", (sdpa_out_efficient[0] - sdpa_out_math[0]).abs().median()) print("[rocm math vs rocm mem-efficient] Max abs diff, non padded sequence:", (sdpa_out_efficient[0] - sdpa_out_math[0]).abs().max()) print("[rocm math vs rocm mem-efficient] Median abs diff, padded sequence:", (sdpa_out_efficient[1, :, num_pad_tokens:] - sdpa_out_math[1, :, num_pad_tokens:]).abs().median()) print("[rocm math vs rocm mem-efficient] Max abs diff, padded sequence:", (sdpa_out_efficient[1, :, num_pad_tokens:] - sdpa_out_math[1, :, num_pad_tokens:]).abs().max()) sdpa_out_efficient = sdpa_out_efficient.cpu() print("\n[cpu math vs rocm mem-efficient] Median abs diff, non padded sequence:", (sdpa_out_math_cpu[0] - sdpa_out_efficient[0]).abs().median()) print("[cpu math vs rocm mem-efficient] Max abs diff, non padded sequence:", (sdpa_out_math_cpu[0] - sdpa_out_efficient[0]).abs().max()) print("[cpu math vs rocm mem-efficient] Median abs diff, padded sequence:", (sdpa_out_math_cpu[1, :, num_pad_tokens:] - sdpa_out_efficient[1, :, num_pad_tokens:]).abs().median()) print("[cpu math vs rocm mem-efficient] Max abs diff, padded sequence:", (sdpa_out_math_cpu[1, :, num_pad_tokens:] - sdpa_out_efficient[1, :, num_pad_tokens:]).abs().max()) sdpa_out_math = sdpa_out_math.cpu() print("\n[cpu math vs rocm math] Median abs diff, non padded sequence:", (sdpa_out_math_cpu[0] - sdpa_out_math[0]).abs().median()) print("[cpu math vs rocm math] Max abs diff, non padded sequence:", (sdpa_out_math_cpu[0] - sdpa_out_math[0]).abs().max()) print("[cpu math vs rocm math] Median abs diff, padded sequence:", (sdpa_out_math_cpu[1, :, num_pad_tokens:] - sdpa_out_math[1, :, num_pad_tokens:]).abs().median()) print("[cpu math vs rocm math] Max abs diff, padded sequence:", (sdpa_out_math_cpu[1, :, num_pad_tokens:] - sdpa_out_math[1, :, num_pad_tokens:]).abs().max()) ``` which gives ``` attn_mask_4d tensor([[[[ 0., -65504., -65504., -65504., -65504., -65504., -65504.], [ 0., 0., -65504., -65504., -65504., -65504., -65504.], [ 0., 0., 0., -65504., -65504., -65504., -65504.], [ 0., 0., 0., 0., -65504., -65504., -65504.], [ 0., 0., 0., 0., 0., -65504., -65504.], [ 0., 0., 0., 0., 0., 0., -65504.], [ 0., 0., 0., 0., 0., 0., 0.]]], [[[ -0., -0., -0., -0., -0., -0., -0.], [ -0., -0., -0., -0., -0., -0., -0.], [ -0., -0., -0., -0., -0., -0., -0.], [-65504., -65504., -65504., 0., -65504., -65504., -65504.], [-65504., -65504., -65504., 0., 0., -65504., -65504.], [-65504., -65504., -65504., 0., 0., 0., -65504.], [-65504., -65504., -65504., 0., 0., 0., 0.]]]], device='cuda:0', dtype=torch.float16) [rocm math vs rocm mem-efficient] Median abs diff, non padded sequence: tensor(0., device='cuda:0', dtype=torch.float16) [rocm math vs rocm mem-efficient] Max abs diff, non padded sequence: tensor(0.0002, device='cuda:0', dtype=torch.float16) [rocm math vs rocm mem-efficient] Median abs diff, padded sequence: tensor(0.0991, device='cuda:0', dtype=torch.float16) [rocm math vs rocm mem-efficient] Max abs diff, padded sequence: tensor(0.6846, device='cuda:0', dtype=torch.float16) [cpu math vs rocm mem-efficient] Median abs diff, non padded sequence: tensor(0., dtype=torch.float16) [cpu math vs rocm mem-efficient] Max abs diff, non padded sequence: tensor(0.0002, dtype=torch.float16) [cpu math vs rocm mem-efficient] Median abs diff, padded sequence: tensor(0.0991, dtype=torch.float16) [cpu math vs rocm mem-efficient] Max abs diff, padded sequence: tensor(0.6846, dtype=torch.float16) [cpu math vs rocm math] Median abs diff, non padded sequence: tensor(0., dtype=torch.float16) [cpu math vs rocm math] Max abs diff, non padded sequence: tensor(6.1035e-05, dtype=torch.float16) [cpu math vs rocm math] Median abs diff, padded sequence: tensor(0., dtype=torch.float16) [cpu math vs rocm math] Max abs diff, padded sequence: tensor(6.1035e-05, dtype=torch.float16) ``` As we can see, SDPA on ROCm with mem-efficient attention gives wrong outputs. This causes issues in batched generation in Transformers: https://github.com/huggingface/transformers/issues/30056#issuecomment-2657390613 The root cause is a bug in aotriton 0.8.0 that is shipped with PyTorch 2.6.0+rocm6.2.4. Using aotrition 0.8.2 (https://github.com/ROCm/aotriton/releases/tag/0.8.2b) fixes this issue, specifically grabbing the asset from https://github.com/ROCm/aotriton/releases/tag/0.8.2b and replacing `torch/lib/aotriton.images/` by the 0.8.2 release `aotriton.images/`. Diff between math and mem-efficient is then much more reasonable: ``` [rocm math vs rocm mem-efficient] Median abs diff, non padded sequence: tensor(0., device='cuda:0', dtype=torch.float16) [rocm math vs rocm mem-efficient] Max abs diff, non padded sequence: tensor(0.0002, device='cuda:0', dtype=torch.float16) [rocm math vs rocm mem-efficient] Median abs diff, padded sequence: tensor(0., device='cuda:0', dtype=torch.float16) [rocm math vs rocm mem-efficient] Max abs diff, padded sequence: tensor(0.0002, device='cuda:0', dtype=torch.float16) [cpu math vs rocm mem-efficient] Median abs diff, non padded sequence: tensor(0., dtype=torch.float16) [cpu math vs rocm mem-efficient] Max abs diff, non padded sequence: tensor(0.0002, dtype=torch.float16) [cpu math vs rocm mem-efficient] Median abs diff, padded sequence: tensor(0., dtype=torch.float16) [cpu math vs rocm mem-efficient] Max abs diff, padded sequence: tensor(0.0002, dtype=torch.float16) [cpu math vs rocm math] Median abs diff, non padded sequence: tensor(0., dtype=torch.float16) [cpu math vs rocm math] Max abs diff, non padded sequence: tensor(0.0001, dtype=torch.float16) [cpu math vs rocm math] Median abs diff, padded sequence: tensor(0., dtype=torch.float16) [cpu math vs rocm math] Max abs diff, padded sequence: tensor(4.7684e-07, dtype=torch.float16) ``` Using aotriton 0.8.2 in a torch patch release might be considered? cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @xinyazhang @atalman Thank you! ### Versions ``` PyTorch version: 2.6.0+rocm6.2.4 Is debug build: False CUDA used to build PyTorch: N/A ROCM used to build PyTorch: 6.2.41134-65d174c3e OS: Ubuntu 24.04 LTS (x86_64) GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Clang version: Could not collect CMake version: version 3.28.3 Libc version: glibc-2.39 Python version: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39 Is CUDA available: True CUDA runtime version: Could not collect CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: AMD Instinct MI250X/MI250 (gfx90a:sramecc+:xnack-) Nvidia driver version: Could not collect cuDNN version: Could not collect HIP runtime version: 6.2.41134 MIOpen runtime version: 3.2.0 Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: AuthenticAMD Model name: AMD EPYC 73F3 16-Core Processor CPU family: 25 Model: 1 Thread(s) per core: 1 Core(s) per socket: 16 Socket(s): 2 Stepping: 1 Frequency boost: enabled CPU(s) scaling MHz: 61% CPU max MHz: 4036.6211 CPU min MHz: 1500.0000 BogoMIPS: 6987.05 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap Virtualization: AMD-V L1d cache: 1 MiB (32 instances) L1i cache: 1 MiB (32 instances) L2 cache: 16 MiB (32 instances) L3 cache: 512 MiB (16 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0-15 NUMA node1 CPU(s): 16-31 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Mitigation; Safe RET Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Versions of relevant libraries: [pip3] numpy==2.2.2 [pip3] onnx==1.17.0 [pip3] onnxruntime==1.20.1 [pip3] onnxruntime_extensions==0.13.0 [pip3] onnxruntime-genai==0.5.2 [pip3] onnxsim==0.4.36 [pip3] pytorch-triton-rocm==3.2.0 [pip3] torch==2.6.0+rocm6.2.4 [pip3] torchaudio==2.6.0+rocm6.2.4 [pip3] torchvision==0.21.0+rocm6.2.4 [conda] Could not collect ```

atalman · February 25, 2025, 1:17am

Sounds good added it to 2.6.1 milestone issue

Topic		Replies	Views
PyTorch Release 2.0.1 Planning Release Announcements	2	1278	April 13, 2023
PyTorch 2.5.0 General Availability Release Announcements	7	1098	December 9, 2024
PyTorch Release 2.6.0 - Final RC is available Release Announcements	3	4353	March 20, 2025
Dropping Support for CUDA 11.6 and Python 3.7 from PyTorch 2.0 Release Release Announcements	0	2361	February 1, 2023
PyTorch Release 2.0.1 Important Information Release Announcements	1	4220	April 13, 2023

Planning for Release 2.6.1

Related topics