The purpose of this post is to brain-dump any context I have on making the PyTorch test suite device-generic. Unfortunately, I am moving teams out of PyTorch due to some reshuffling and will not be able to help drive this anymore.
Most of this information will not be new, but the goal is to consolidate all the context in one place.
This is what I think should have happened:
1. Provide (a) a sample test runner in openreg and (b) documentation that classifies tests by feature, based on the Device-Generic Refactoring Test Class Tracker, and makes explicit which tests are legacy vs. necessary to run to support a given feature.
    - Tests will be discovered via `instantiate_device_type_tests` (implying that we need to migrate device-generic test registration to it; note that many distributed test classes are not currently instantiated this way).
    - From my initial exploration, I thought that `run_test.py` is too tightly coupled to PyTorch’s own CI infrastructure (e.g. it downloads historical test times from S3 for in-tree target determination that is no longer well supported).
    - I did not necessarily think that PyTorch had to provide a dedicated test runner for out-of-tree backends, as they might want to handle things like sharding of tests themselves. So I think a sample in openreg of how tests should be discovered/run is sufficient. Some very, very drafty work on this is here.
2. Provide a blessed mechanism for applying skips and decorators to tests from out of tree, the way in-tree backends would. This applies at different granularities (e.g. OpInfo tests decorated with `@ops` vs. regular test methods).
    - The Ascend folks had a proposal for this and are working on the implementation. The work items are in this project board.
    - This skip mechanism [issue 1, issue 2, issue 3] relies on accelerator tests being instantiated via `instantiate_device_type_tests` as in point (1). Some discussion about distributed tests needing to be moved to be instantiated this way in order to use this mechanism is here and here.
    - Note that torch-spyre from IBM has already done this entirely from out-of-tree by inheriting from/patching the `PrivateUse1TestBase`. See the RFC and implementation.
    - They also have a config mechanism. PyTorch itself will not be opinionated about the config system the downstream repository uses, but this is a good example of how it can be achieved.
3. Refactor relevant in-tree tests to run on `device`; see the Device-Generic Refactoring Test Class Tracker. That spreadsheet shows which test classes need to be refactored and also attempts some priorities for the “generic” tab. Priority-wise, I would have gone with generic → distributed → others.
    - This entails changing `'cuda'` → `device`, moving test classes to be instantiated via `instantiate_device_type_tests`, and changing `onlyCUDA` → `onlyAccelerator` if the `onlyCUDA` really meant "not CPU".
    - Several folks are and have been working on this (e.g. Intel, AWS Neuron, RH).
    - The biggest bottleneck here is PR review from PyTorch maintainers.
4. Ensure that the system persists.
    - Add a linter to prevent new tests from hardcoding the device, unless the tests are explicitly device-specific: https://github.com/pytorch/pytorch/issues/177970 (there is already a TEST_DEVICE_BIAS linter that enforces this for a small subset of tests; it should be extended).
    - Add a linter that tells people to add any new test classes to the categorization in (1).
    - Run a small subset of the test suite against openreg in CI.
        - There will be some noisy failures here that are due to openreg’s implementation (e.g. mprotect semantics).
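
For the hardcoded-device linter, the core check can be sketched with the standard `ast` module. This is only an illustration of the idea and is not the existing TEST_DEVICE_BIAS linter. The function name, the list of device prefixes, and the sample source are assumptions made for this example.

```python
import ast

# Device names treated as "hardcoded"; this list is an assumption of the sketch.
HARDCODED_PREFIXES = ("cuda", "xpu", "mps")


def find_hardcoded_devices(source, filename="<test>"):
    """Return sorted (line, value) pairs for string literals that name a device."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            value = node.value
            # Match "cuda" as well as indexed forms such as "cuda:0".
            if any(value == p or value.startswith(p + ":") for p in HARDCODED_PREFIXES):
                hits.append((node.lineno, value))
    return sorted(hits)


# Sample test source; it is only parsed, never executed, so torch is not needed.
SAMPLE = '''
def test_add(self):
    x = torch.ones(2, device="cuda")
    y = torch.ones(2, device="cuda:0")

def test_add_generic(self, device):
    x = torch.ones(2, device=device)
'''

if __name__ == "__main__":
    for line, value in find_hardcoded_devices(SAMPLE):
        print(f"line {line}: hardcoded device {value!r}")
```

The sketch flags every matching literal. An opt-out for tests that are explicitly device-specific (for example an allowlist of files or classes) is left out and would need to be designed separately.
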
After all of the above is tackled, we could optionally tackle:

5. Provide a registration mechanism for device capabilities that is consumed by tests, and annotate tests as appropriate (e.g. `supported_dtypes`, `supports_memory_format`), like https://github.com/pytorch/pytorch/issues/146898
    - This is a “nice-to-have” and not necessary, as skips can theoretically express this, but it could be explored for better UX.
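
As a rough illustration of what such a capability registry might look like, here is a stdlib-only sketch. Nothing in it is an existing PyTorch API: `DeviceCapabilities`, `register_capabilities`, and `requires_dtype` are hypothetical, and dtypes are plain strings so the example runs without torch. The field names `supported_dtypes` and `supports_memory_format` are taken from the examples above.

```python
import functools
import unittest
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceCapabilities:
    """Hypothetical capability record declared by a backend."""
    supported_dtypes: frozenset = frozenset()
    supports_memory_format: bool = False


# Hypothetical registry: device type string -> declared capabilities.
_CAPABILITIES = {}


def register_capabilities(device_type, caps):
    _CAPABILITIES[device_type] = caps


def requires_dtype(dtype):
    """Skip the test when the device under test does not declare `dtype`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, device, *args, **kwargs):
            # Unregistered devices fall back to an empty capability set.
            caps = _CAPABILITIES.get(device, DeviceCapabilities())
            if dtype not in caps.supported_dtypes:
                raise unittest.SkipTest(f"{device} does not support {dtype}")
            return fn(self, device, *args, **kwargs)
        return wrapper
    return decorator


# Example: a made-up backend that only declares float32 support.
register_capabilities(
    "my_device", DeviceCapabilities(supported_dtypes=frozenset({"float32"}))
)


# Test bodies take (self, device), like device-generic test methods do.
@requires_dtype("float32")
def check_float32(self, device):
    return "ran"


@requires_dtype("float64")
def check_float64(self, device):
    return "ran"
```

Here `check_float32(None, "my_device")` runs the test body, while `check_float64(None, "my_device")` raises `unittest.SkipTest`. Compared with hand-written skips, the test states what it needs and the backend states what it has, which is the UX improvement this item is after.
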