Found something risky in a unit test which may cause spurious case failures

Take the case 'TestCuda.test_autocast_methods_fp32' as an example.

It first triggers the initialization of 'AutocastTestLists', where many ops with different args are built.
See Pytorch/torch/testing/_internal/, where we can find many calls like 'torch.ones', 'torch.randn', etc. These operations are performed by launching CUDA kernels, and each kernel launch is bound to a CUDA stream. In this case, the stream used to initialize the data is the CUDA NULL stream.

Then the test switches to a user-created stream to run. (Pytorch creates streams with the flag CU_STREAM_NON_BLOCKING, so they do not synchronize with the NULL stream.) The test then consumes the data it expects to be initialized, which introduces a problem: the data initialization on the NULL stream may not have finished yet, so the test result is not reliable.
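The race above can be sketched without a GPU. Below is a minimal host-side analogy (my own illustration, not PyTorch code): a slow asynchronous "initialization kernel" runs on one thread, while a consumer reads the data without synchronizing first, then again after an explicit wait (the analog of 'torch.cuda.synchronize()').

```python
import threading
import time

buffer = {"value": 0}          # stands in for GPU memory being initialized
init_done = threading.Event()  # stands in for completion of the NULL stream

def init_kernel():
    # Simulate a slow initialization kernel enqueued on the "NULL stream".
    time.sleep(0.3)
    buffer["value"] = 1
    init_done.set()

worker = threading.Thread(target=init_kernel)
worker.start()

# The "new stream" reads immediately, without synchronizing: stale data.
unsynchronized_read = buffer["value"]

# The fix: wait for initialization to finish before consuming the data.
init_done.wait()
synchronized_read = buffer["value"]
worker.join()

print(unsynchronized_read, synchronized_read)
```

The unsynchronized read observes the stale value, while the read after the wait sees the initialized one; this is exactly the hazard between the NULL stream and a non-blocking test stream.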

The code below (excerpted from the test harness) is where the stream is changed to run the test:
class CudaNonDefaultStream():
    def __enter__(self):
        # Before starting CUDA test save currently active streams on all
        # CUDA devices and set new non default streams to all CUDA devices
        # to ensure CUDA tests do not use default stream by mistake.
        beforeDevice = torch.cuda.current_device()
        self.beforeStreams = []
        for d in range(torch.cuda.device_count()):
            deviceStream = torch.cuda.Stream(device=d)
            ...

It is not easy to trigger the failure in practice since the GPU runs fast. We can write a simple case to simulate it.

import torch

print("****** do something busy ********")
# (enqueue enough work here to keep the NULL stream busy; elided)

oriStream = torch.cuda.current_stream()

t = torch.ones(1, device='cuda:0', dtype=torch.float16)

newStream = torch.cuda.Stream(device='cuda:0')
print(newStream)

with torch.cuda.stream(newStream):
    output_to_compare = getattr(t, 'pow')(2)
print('task on ori stream is finished ? ', oriStream.query())
print(output_to_compare)

The printed results:
****** do something busy ********
<torch.cuda.Stream device=cuda:0 cuda_stream=0x0>
task on ori stream is finished ?  False
tensor([0.], device='cuda:0', dtype=torch.float16)

It is better to call 'torch.cuda.synchronize()' after the data initialization to guarantee that the initialization has finished before the test switches streams.
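A minimal sketch of the proposed fix (the helper name 'build_autocast_test_inputs' is hypothetical, standing in for the AutocastTestLists-style setup; this is an illustration, not the actual patch):

```python
import torch

def build_autocast_test_inputs(device="cuda:0"):
    # Hypothetical stand-in for the AutocastTestLists setup: data is
    # initialized by kernels enqueued on the current (NULL) stream.
    t = torch.ones(8, device=device, dtype=torch.float16)
    # Guarantee the initialization kernels have finished before the
    # caller switches to a non-blocking stream.
    torch.cuda.synchronize()
    return t

if torch.cuda.is_available():
    t = build_autocast_test_inputs()
    with torch.cuda.stream(torch.cuda.Stream(device="cuda:0")):
        out = t.pow(2)  # now guaranteed to see initialized data
```

If only the test stream needs ordering, an alternative is 'newStream.wait_stream(torch.cuda.current_stream())', which makes the new stream wait for the NULL stream's pending work without blocking the host.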