Found something risky in the unit tests which may cause test case failures

Take `python test_cuda.py TestCuda.test_autocast_methods_fp32` as an example.

It first triggers the initialization of `AutocastTestLists`, where many ops with different args are built.
See `torch/testing/_internal/autocast_test_lists.py` in the PyTorch repo: it contains many calls like `torch.ones` and `torch.randn`. These operations launch CUDA kernels, and every kernel is bound to a CUDA stream. In this case, the stream used to initialize the data is the CUDA NULL (default) stream.
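
As a quick illustration (a minimal sketch, not from the test code), kernels are enqueued on whatever stream is current:

```python
import torch

s = torch.cuda.Stream()                    # a user-created stream
with torch.cuda.stream(s):                 # make s the current stream
    x = torch.randn(1024, device='cuda')   # this kernel is enqueued on s
# Outside the block we are back on the default (NULL) stream.
print(torch.cuda.current_stream() == s)    # False
```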

Then, the harness switches to a user-created stream to run the test. (PyTorch creates streams with the CU_STREAM_NON_BLOCKING flag, so they do not implicitly synchronize with the NULL stream.) The test consumes the data initialized on the NULL stream, but because the two streams never synchronize, the initialization may not have finished by the time the test kernels run, so the test result is not reliable.
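
For reference, the standard way to order work across streams in PyTorch is `Stream.wait_stream`; a minimal sketch (not part of the test harness):

```python
import torch

new_stream = torch.cuda.Stream()  # user-created, non-blocking w.r.t. the NULL stream
# Make new_stream wait for everything already queued on the NULL stream,
# so kernels submitted to it afterwards see the initialized data.
new_stream.wait_stream(torch.cuda.default_stream())
```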

The code below (in `common_utils.py`) switches the stream before running the test:

```python
class CudaNonDefaultStream():
    def __enter__(self):
        # Before starting CUDA test save currently active streams on all
        # CUDA devices and set new non default streams to all CUDA devices
        # to ensure CUDA tests do not use default stream by mistake.
        beforeDevice = torch.cuda.current_device()
        self.beforeStreams = []
        for d in range(torch.cuda.device_count()):
            self.beforeStreams.append(torch.cuda.current_stream(d))
            deviceStream = torch.cuda.Stream(device=d)
            torch._C._cuda_setStream(deviceStream._cdata)
        torch._C._cuda_setDevice(beforeDevice)
```
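
One possible fix inside `__enter__` (a hypothetical sketch, not the actual PyTorch code) would be to make each new stream wait on the stream it replaces before installing it:

```python
for d in range(torch.cuda.device_count()):
    prev = torch.cuda.current_stream(d)
    self.beforeStreams.append(prev)
    deviceStream = torch.cuda.Stream(device=d)
    # Hypothetical fix: order the new stream after any work already queued
    # on the stream it replaces (e.g. data-initialization kernels).
    deviceStream.wait_stream(prev)
    torch._C._cuda_setStream(deviceStream._cdata)
```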

It is not easy to trigger this failure in practice since the GPU runs fast, but we can build a simple case to simulate it.

```python
import torch

print("****** do something busy ********")

oriStream = torch.cuda.current_stream()
print(oriStream)
torch.cuda._sleep(4000000)  # keep the NULL stream busy for a while

t = torch.ones(1, device='cuda:0', dtype=torch.float16)  # enqueued behind the sleep

newStream = torch.cuda.Stream(device='cuda:0')
torch._C._cuda_setStream(newStream._cdata)

output_to_compare = getattr(t, 'pow')(2)  # runs on newStream without waiting for t
print('task on ori stream is finished ? ', oriStream.query())
print(output_to_compare)
```

The printed output:

```
****** do something busy ********
<torch.cuda.Stream device=cuda:0 cuda_stream=0x0>
task on ori stream is finished ? False
tensor([0.], device='cuda:0', dtype=torch.float16)
```

The tensor should be `[1.]` (that is, `1 ** 2`), but the `pow` kernel on the new stream read `t` before the `torch.ones` kernel on the NULL stream had written it.

It would be better to call `torch.cuda.synchronize()` after the data initialization to guarantee that the initialization has finished before the test switches streams.
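
Applied to the simulation above, the call goes between the data initialization and the stream switch (a sketch of the suggested fix):

```python
t = torch.ones(1, device='cuda:0', dtype=torch.float16)
torch.cuda.synchronize()  # suggested fix: wait for initialization to finish

newStream = torch.cuda.Stream(device='cuda:0')
torch._C._cuda_setStream(newStream._cdata)

output_to_compare = t.pow(2)  # now guaranteed to read the initialized value
```

With this change `oriStream.query()` should return `True` and the printed tensor should be `[1.]`.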