Take the case `python test_cuda.py TestCuda.test_autocast_methods_fp32` as an example.
It first triggers the initialization of `AutocastTestLists`, where many ops with different arguments are built.
See `torch/testing/_internal/autocast_test_lists.py`: it contains many calls such as `torch.ones` and `torch.randn`. These operations are performed by launching CUDA kernels, and each kernel launch is bound to a CUDA stream. In this case, the stream used to initialize the data is the CUDA NULL (default) stream.
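A quick check (a minimal sketch, assuming a single CUDA device is available): in a fresh process the current stream is the NULL stream, whose raw handle prints as `cuda_stream=0x0`.

```python
import torch

# In a fresh process the current stream is the default (NULL) stream;
# its handle prints as cuda_stream=0x0.
print(torch.cuda.current_stream())

# Kernels launched without an explicit stream context are enqueued on
# that NULL stream, including data-initialization ops like this one.
t = torch.ones(4, device='cuda:0')
```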
The test harness then switches to a user-created stream to run the test. (PyTorch creates streams with the `CU_STREAM_NON_BLOCKING` flag, so they do not implicitly synchronize with the NULL stream.) The test then consumes data it expects to be fully initialized, but the initialization kernels on the NULL stream may not have finished yet, so the test result is not reliable.
The code below (from `torch/testing/_internal/common_utils.py`) switches streams before running the test:
```python
class CudaNonDefaultStream():
    def __enter__(self):
        # Before starting CUDA test save currently active streams on all
        # CUDA devices and set new non default streams to all CUDA devices
        # to ensure CUDA tests do not use default stream by mistake.
        beforeDevice = torch.cuda.current_device()
        self.beforeStreams = []
        for d in range(torch.cuda.device_count()):
            self.beforeStreams.append(torch.cuda.current_stream(d))
            deviceStream = torch.cuda.Stream(device=d)
            torch._C._cuda_setStream(deviceStream._cdata)
        torch._C._cuda_setDevice(beforeDevice)
```
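For context, a hypothetical usage sketch (`run_test_body` is a placeholder; the real class also defines an `__exit__` that restores the saved streams, and the test harness wires this into `TestCase` internally):

```python
def run_test_body():
    # placeholder for the actual test code
    pass

# __enter__ switches every CUDA device to a freshly created stream,
# so the test body no longer runs on the NULL stream.
with CudaNonDefaultStream():
    run_test_body()
```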
It is not easy to trigger this failure in practice because the GPU runs fast, but we can construct a simple case to simulate it:
```python
import torch

print("****** do something busy ********")
oriStream = torch.cuda.current_stream()
print(oriStream)                         # the NULL stream (cuda_stream=0x0)
torch.cuda._sleep(4000000)               # keep the NULL stream busy for a while
t = torch.ones(1, device='cuda:0', dtype=torch.float16)

# Switch to a freshly created (non-blocking) stream, as the test harness does.
newStream = torch.cuda.Stream(device='cuda:0')
torch._C._cuda_setStream(newStream._cdata)

# Run the op on the new stream; the initialization of t may still be
# pending on the NULL stream. (getattr mirrors how the autocast test
# dispatches tensor methods by name.)
output_to_compare = getattr(t, 'pow')(2)
print('task on ori stream is finished ? ', oriStream.query())
print(output_to_compare)
```
The printed results:

```
****** do something busy ********
<torch.cuda.Stream device=cuda:0 cuda_stream=0x0>
task on ori stream is finished ?  False
tensor([0.], device='cuda:0', dtype=torch.float16)
```

The NULL stream has not finished when `pow` runs, so it reads `t` before `torch.ones` has written it, and the result is 0 instead of the expected 1.
It is better to call `torch.cuda.synchronize()` after the data initialization to guarantee that the initialization has finished before the test switches streams.
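Applied to the simulation above, the fix looks like this (a sketch; `torch.cuda.synchronize()` blocks the host until all queued work on the device has finished):

```python
import torch

print("****** do something busy ********")
oriStream = torch.cuda.current_stream()
torch.cuda._sleep(4000000)               # keep the NULL stream busy
t = torch.ones(1, device='cuda:0', dtype=torch.float16)

# The fix: wait for the initialization on the NULL stream to complete
# before any other stream consumes t.
torch.cuda.synchronize()

newStream = torch.cuda.Stream(device='cuda:0')
torch._C._cuda_setStream(newStream._cdata)
output_to_compare = getattr(t, 'pow')(2)
print('task on ori stream is finished ? ', oriStream.query())  # now True
print(output_to_compare)                 # tensor([1.], ...)
```

A finer-grained alternative is `newStream.wait_stream(oriStream)`, which makes the new stream wait for the work already queued on the original stream without blocking the host.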