Making PyTorch Test Suite usable from out-of-tree braindump

The purpose of this post is to brain-dump the context I have on making the PyTorch test suite device-generic. Unfortunately, I am moving to a team outside of PyTorch due to some reshuffling and will not be able to help drive this anymore :pensive_face:

Most of this information is not new; the goal is to consolidate all the context in one place.

This is what I think should have happened:

  1. Provide (a) a sample test runner in openreg and (b) documentation that classifies tests by feature, based on the Device-Generic Refactoring Test Class Tracker, and makes explicit which tests are legacy and which are necessary to run to support a given feature.

    • Tests will be discovered via instantiate_device_type_tests. This implies that device-generic test registration needs to migrate to it; note that many distributed test classes are not currently instantiated this way.
    • From my initial exploration, I thought that run_test.py is too tightly coupled to PyTorch’s own CI infrastructure (e.g. it downloads historical test times from S3 for in-tree target determination that is no longer well supported).
      • I did not think that PyTorch necessarily had to provide a dedicated test runner for out-of-tree backends, as they might want to handle things like test sharding themselves. So I think a sample in openreg of how tests should be discovered and run is sufficient. Some very drafty work on this is here.
  2. Provide a blessed mechanism for applying skips and decorators to tests from out of tree, the same way in-tree backends can. This applies at different granularities (e.g. OpInfo tests decorated with @ops vs. regular test methods).

    • The Ascend folks had a proposal for this and are working on the implementation. The work items are in this project board.
    • This skip mechanism [issue 1, issue 2, issue 3] relies on accelerator tests being instantiated via instantiate_device_type_tests, as in point (1). There is some discussion here and here about distributed tests needing to be instantiated this way in order to use the mechanism.
    • Note that torch-spyre from IBM has already done this entirely from out-of-tree by inheriting from and patching PrivateUse1TestBase. See the RFC and implementation.
      • They also have a config mechanism. PyTorch itself will not be opinionated about the config system a downstream repository uses, but this is a good example of how it can be achieved.
  3. Refactor relevant in-tree tests to run on any device (see the Device-Generic Refactoring Test Class Tracker). That spreadsheet lists which test classes need to be refactored and also attempts some prioritization in the “generic” tab. Priority-wise I would have gone with generic → distributed → others.

    • This entails changing ‘cuda’ → device, moving test classes to be instantiated via instantiate_device_type_tests, and changing onlyCUDA → onlyAccelerator where onlyCUDA really meant "not CPU".
    • Several folks are and have been working on this (e.g. Intel, AWS Neuron, RH, etc.).
    • The biggest bottleneck here is PR review from PyTorch maintainers.
  4. Ensure that the system persists

    • Linter to prevent new tests from hardcoding a device unless the tests are explicitly device-specific: https://github.com/pytorch/pytorch/issues/177970 (there is already a TEST_DEVICE_BIAS linter that enforces this for a small subset of tests; it should be extended).
    • Linter that tells people to add any new test classes to the categorization in (1).
    • Run a small subset of the test suite against openreg in CI.
      • There will be some noisy failures here that are due to openreg’s implementation (e.g. mprotect semantics).

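The "no hardcoded device" linter from item 4 could look roughly like the following sketch, which walks a test file's AST and flags string literals that name a specific device. This is not the existing TEST_DEVICE_BIAS linter; the device list, the `# device-specific` opt-out marker, and the function name `find_hardcoded_devices` are all assumptions made for illustration.

```python
import ast

# Hypothetical set of device names that generic tests should not hardcode.
HARDCODED = {"cuda", "xpu", "mps"}
# Hypothetical opt-out marker for tests that are explicitly device-specific.
ALLOW_MARKER = "# device-specific"


def find_hardcoded_devices(source, filename="<test>"):
    """Return sorted (line, literal) pairs for device strings like 'cuda:0'."""
    lines = source.splitlines()
    findings = []
    for node in ast.walk(ast.parse(source, filename)):
        if not (isinstance(node, ast.Constant) and isinstance(node.value, str)):
            continue
        base = node.value.split(":", 1)[0]  # "cuda:1" -> "cuda"
        if base not in HARDCODED:
            continue
        if ALLOW_MARKER in lines[node.lineno - 1]:
            continue  # the author marked this line as intentionally specific
        findings.append((node.lineno, node.value))
    return sorted(findings)


SAMPLE = '''
def test_add(self):
    x = make_tensor(3, device="cuda")
    y = make_tensor(3, device="cuda:1")  # device-specific
    z = make_tensor(3, device=device)
'''

for lineno, literal in find_hardcoded_devices(SAMPLE):
    print(f"line {lineno}: hardcoded device {literal!r}")
```

A real linter would also need to decide how to treat whole classes that are device-specific by design, which this sketch leaves out.
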
Once all of the above is done, we could optionally tackle:

  1. Provide a registration mechanism for device capabilities that tests consume, and annotate tests as appropriate (e.g. supported_dtypes, supports_memory_format, etc.), as in https://github.com/pytorch/pytorch/issues/146898

    • This is a “nice-to-have” and not necessary, as skips can theoretically express this, but it could be explored for better UX.

Hey @mikaylagawarecki , thank you for summarizing those great things, best wishes for your new role! :slight_smile: