Changes in Total Test Time

As PyTorch gains contributors and commit frequency increases, TTS (time to signal) and job durations become more important to developers who need to iterate on and merge their changes. However, as PyTorch grows, so does the number of tests and the time it takes to run them. We would therefore like to monitor how test times change in order to identify trends for what may happen in the future, along with possible causes and ways to mitigate the negative effects that increasing test times bring.
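
The numbers behind the charts below come from aggregating CI job durations per test config over time. As a rough illustration of the kind of aggregation involved, here is a minimal sketch in Python, assuming a hypothetical CSV export with job_name, date, and duration_seconds columns (not the actual schema of our test-infra data):

```python
# Minimal sketch: total test time per test config per day.
# The CSV file and its columns (job_name, date, duration_seconds) are
# hypothetical, not the actual PyTorch test-infra schema.
import pandas as pd

jobs = pd.read_csv("ci_job_durations.csv", parse_dates=["date"])

# Sum job durations for each (config, day) pair and convert seconds to hours.
total_per_config = (
    jobs.groupby(["job_name", jobs["date"].dt.date])["duration_seconds"]
        .sum()
        .div(3600)
        .rename("total_test_hours")
        .reset_index()
)

print(total_per_config.head())
```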

TL;DR

  • Over the past 4 months, many of the longer-running (>1 hr) test configs have slowly increased in total test time by about 50%.
  • Decreases in total test time are generally dramatic and can be attributed to things like changing versions or no longer running a group of tests.
  • Increases in total test time are generally due to adding new tests, whether gradually (see cpu) or all at once (see multigpu).

How to read the chart:

  • Red = increase in time. The blue bar is the initial time from 4 months ago and it increased by the amount of the red bar.
  • Green = decrease in time. The test time started at the top of the green bar and decreased to the height of the blue bar. (An illustrative sketch of this encoding follows below.)
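
As a purely illustrative sketch of this encoding with made-up numbers (not the real data), the blue bar is the smaller of the two totals and the red or green segment is stacked on top of it to show the change:

```python
# Minimal sketch of the bar encoding described above, with made-up numbers.
# Blue is the smaller of the two totals; a red (increase) or green (decrease)
# segment is stacked on top of it to show the change over the last 4 months.
import matplotlib.pyplot as plt

configs = ["config A", "config B"]
initial_hours = [2.0, 3.5]   # total test time 4 months ago (hypothetical)
current_hours = [3.0, 2.5]   # total test time now (hypothetical)

fig, ax = plt.subplots()
for i, (start, end) in enumerate(zip(initial_hours, current_hours)):
    base = min(start, end)
    delta = abs(end - start)
    color = "red" if end > start else "green"
    ax.bar(i, base, color="tab:blue")
    ax.bar(i, delta, bottom=base, color=color)

ax.set_xticks(range(len(configs)))
ax.set_xticklabels(configs)
ax.set_ylabel("total test time (hours)")
plt.show()
```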

Problems

  • Data from before March is missing, likely because we did not collect webhook data before then.

default cuda + debug

CUDA 11.6 is doing much better than CUDA 11.3.

CUDA 11.3 had an issue with jiterator and nvfuser.

We discovered one regression occurring on Jun 3, 2022, which was then fixed on Jun 14, 2022 (Skip extremely long chebyshev/legendre tests introduced in #78304 by janeyx99 · Pull Request #79529 · pytorch/pytorch · GitHub).
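
The fix was to skip the offending tests. As a minimal sketch of that general approach in a unittest-based suite (the class and test names here are hypothetical, not the exact change from #79529):

```python
# Minimal sketch: skipping a known long-running test in a unittest-based suite.
# The class and test names are hypothetical, not the exact change from #79529.
import unittest

class TestSpecialFunctions(unittest.TestCase):
    @unittest.skip("extremely slow; see pytorch/pytorch#79529 for context")
    def test_chebyshev_polynomial_large_inputs(self):
        ...  # the long-running reference comparison would go here

    def test_fast_case(self):
        self.assertEqual(1 + 1, 2)

if __name__ == "__main__":
    unittest.main()
```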

onnx

These test times have been very stable outside of major changes, unlike many of our other test configs, which generally grow gradually. The config has since been replaced by a focal, clang10 version, which does not seem to have changed anything.

multigpu

Like onnx, this test config does not show the usual gradual increase but remains steady outside of major test additions. It was recently moved to periodic due to lack of capacity and the increase in test time.

cuda slow

This test config generally shows sudden changes in test time.

macos

The test time previously oscillated with a period of about 2 days; however, this periodicity seems to have stopped recently. One possible explanation for this change is that we are running fewer jobs on these machines, as we moved some mac jobs to periodic around that time (ci: Move ios-12-5-1-x86-64-coreml to periodic (#80455) · pytorch/pytorch@c8943f8 · GitHub).

XLA

Like onnx, it shows sudden jumps in test time but no gradual increase between them. The first jump is due to enabling the xla tests (Enable xla test by JackCaoG · Pull Request #76565 · pytorch/pytorch · GitHub). The most recent increase was likely due to disabling the build cache, and the decrease after that was due to re-enabling it.

Dynamo

The first jump (~1.5 hrs) comes from turning on the tests; the second jump (~1.5 hrs) is probably due to adding more models from torchvision, whose enumeration in test_fx requires more time ([CI] Install vision without pep517 by malfet · Pull Request #81074 · pytorch/pytorch · GitHub).

Distributed

cuda

There has been a slow increase over time due to added test cases. The time recently decreased due to a refactor of the testing technique to cut down time in FSDP tests ([BE][FSDP] Subtest prefetching in `test_mixed_precision_e2e_full_shard()` by awgu · Pull Request #80915 · pytorch/pytorch · GitHub).
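
The general idea behind that refactor is to fold several parameterizations into subtests of a single test method so that expensive per-test setup is paid once. A minimal sketch of the technique using unittest's subTest (the setup and parameter names are hypothetical, and the real FSDP tests use their own distributed test helpers rather than this exact pattern):

```python
# Minimal sketch: folding several parameterizations into one test method with
# subTest so that expensive setup runs once instead of once per parameter.
# The setup and parameter names are hypothetical, not the actual FSDP test code.
import unittest

class TestMixedPrecision(unittest.TestCase):
    def setUp(self):
        # Imagine this is expensive, e.g. spawning processes or building a model.
        self.model = object()

    def test_e2e_full_shard(self):
        for prefetch in (None, "pre", "post"):
            with self.subTest(prefetch=prefetch):
                # The real test would run forward/backward with the given
                # prefetching strategy and check the results.
                self.assertIsNotNone(self.model)

if __name__ == "__main__":
    unittest.main()
```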

rocm

Highly variable, but generally increasing, likely due to enabling more tests. There also seem to be a few dramatic jumps.

cpu

Slow increase over time.

Windows

win cuda default

CUDA 11.3 seems to show similar behavior to its Linux counterpart, such as the regression on Jun 3, 2022 and the fix around Jun 14, 2022, as well as the decrease in test time due to the change to CUDA 11.6.

The drop in May is due to removing cpu tests ([ci] don't run cpu tests on win-cuda builds by suo · Pull Request #77240 · pytorch/pytorch · GitHub).

CUDA 11.6 was moved from periodic to trunk in late June (green → blue) and its test time increased. As with the mac tests, one possible explanation is that we are now running more jobs on those machines.

win cuda force_on_cpu

There was a dramatic decrease due to no longer running cpu tests ([ci] don't run cpu tests on win-cuda builds by suo · Pull Request #77240 · pytorch/pytorch · GitHub), but no further changes since then.
