About PyTorch CI

I would like to ask whether PyTorch’s CI has a rerun mechanism — for example, if a workflow fails unexpectedly, does the developer need to rerun it manually?
Is there also an exemption mechanism? For example, if a workflow fails but there is no time to wait, or it fails for unrelated reasons, can the PR be merged directly?

PyTorch uses GitHub Actions for CI, so the rerun mechanism is whatever GHA provides: Re-running workflows and jobs - GitHub Docs
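For reference, a rerun can also be triggered through the GitHub REST API instead of the web UI. This is only a minimal sketch, assuming you have a token with write access to the repository and know the ID of the failed run (the token and run ID below are hypothetical placeholders):

```python
# Minimal sketch: re-run only the failed jobs of a workflow run via the
# GitHub REST API endpoint "rerun-failed-jobs".
# RUN_ID and TOKEN are hypothetical placeholders.
import requests

OWNER, REPO = "pytorch", "pytorch"
RUN_ID = 1234567890        # hypothetical: ID of the failed workflow run
TOKEN = "ghp_..."          # hypothetical: token with write access to Actions

resp = requests.post(
    f"https://api.github.com/repos/{OWNER}/{REPO}/actions/runs/{RUN_ID}/rerun-failed-jobs",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    },
)
# 201 means the rerun was queued; 403 usually means missing permissions.
print(resp.status_code)
```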

Our pytorchbot has rules for certain exceptions when a failure looks unrelated, and that information is usually displayed in the pytorchbot comment on every PR, e.g., POC for mixed prec optim frontend by janeyx99 · Pull Request #146640 · pytorch/pytorch · GitHub.

We’d advise erring on the side of caution when landing PRs, so we recommend rebasing to get all-green CI before merging rather than any sort of forced merge.


Hello

  1. The rerun method you mentioned is the one provided by GHA, which requires write permission to the repository, i.e., internal personnel. If a developer outside the company has their CI fail unexpectedly, can they rerun it themselves? Does PyTorch offer a way to do that?
  2. Regarding the override method you mentioned for merging without waiting for all CI: I saw that you can mention the bot and then use merge -f. Is that permission limited to internal developers? How does the bot skip these failed CI jobs, and how are the failed CI runs set to be cancelled after the merge?
  3. I saw that some PRs triggered trunk or inductor CI based on push events. Shouldn’t a PR only trigger CI for pull_request events? [reland][cutlass backend] Do not change dtype of GEMM template for cutlass 3x by henrylhtsang · Pull Request #147434 · pytorch/pytorch · GitHub

Ah, we triage every open-source pull request and assign it a maintainer who has write permissions to the repository. To be clear, PyTorch maintainers are not limited to any one company. If you need some CI rerun, you can ask the maintainer to rerun it for you.

That said, our bot is smart enough to ignore pre-existing failures, so very often you can merge your PR without doing anything extra. For example, the bot is designed to track which failures already existed on the base commit of the PR on main and to report/act on that accordingly.
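To illustrate the idea (this is not the bot’s actual code, just a sketch of the comparison), a failure only blocks the merge if it is new relative to what was already failing on the base commit; the job names below are made up:

```python
# Illustrative sketch only, not pytorchbot's real implementation:
# a failure counts against the PR only if it is new relative to the base commit.
def new_failures(pr_head_failures: set[str], base_commit_failures: set[str]) -> set[str]:
    """Return the failing job names that did not already fail on the base commit."""
    return pr_head_failures - base_commit_failures

# Hypothetical example: "trunk / linux-build" was already broken on main,
# so only "pull / lint" counts as a new, PR-introduced failure.
print(new_failures({"pull / lint", "trunk / linux-build"}, {"trunk / linux-build"}))
```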

We also have other logic that automatically triggers additional CI based on the files a PR touches, to ensure the change is tested sufficiently before merge; a rough sketch of that idea follows.
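As a rough sketch of what “triggering CI based on the files touched” means (the path patterns and workflow names here are invented for illustration, not PyTorch’s actual configuration):

```python
# Illustrative sketch: map the files changed by a PR onto extra CI workflows.
# The patterns and workflow names are hypothetical, not PyTorch's real rules.
from fnmatch import fnmatch

EXTRA_CI_RULES = {
    "torch/_inductor/*": "inductor",       # hypothetical mapping
    "torch/distributed/*": "distributed",  # hypothetical mapping
    "*.cu": "cuda-build",                  # hypothetical mapping
}

def extra_workflows(changed_files: list[str]) -> set[str]:
    """Return the extra workflows to trigger for this set of changed files."""
    return {
        workflow
        for path in changed_files
        for pattern, workflow in EXTRA_CI_RULES.items()
        if fnmatch(path, pattern)
    }

print(extra_workflows(["torch/_inductor/codegen/common.py", "README.md"]))
# -> {'inductor'}
```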

Is there a PR you’re currently blocked on? Our Dev Infra team tries hard to ensure these issues don’t come up often!

Sorry, I don’t quite understand what you mean about the bot being smart enough to identify errors. I’m curious what method or mechanism is used when CI needs to be ignored so the code can be merged directly, and what permissions that requires.
When you say “files touched”, do you mean that modifying certain files triggers the CI in trunk? Its triggering event seems to be push, and my understanding is that the bot can trigger CI by adding labels to a PR. Does the bot detect which files a PR modified in order to trigger CI other than pull and lint?

Ah, we do build infra on top of GHA to support the bot and to rationalize when certain errors can be skipped vs. not (among other things): GitHub - pytorch/test-infra: This repository hosts code that supports the testing infrastructure for the PyTorch organization. For example, this repo hosts the logic to track disabled tests and slow tests, as well as our continuous integration jobs and the HUD/dashboard.