Core ATen opset backward & forward compatibility policy

Context

The core ATen operator set (definition, list) is a new API surface introduced in PyTorch 2.0 and currently used by ExecuTorch. In a nutshell, a model being delivered to edge devices goes through torch.export(), followed by a step called “core ATen decomposition” that ensures the operators used in the exported model artifact are only core ATen ops (with the exception of custom ops). At runtime, the ExecuTorch inference engine executes the core ATen ops with the input arguments from the model, so the schemas of these ops must be the same across the inference engine and the model. Here we define the BC/FC policy for the core ATen opset, as well as the workflow PyTorch developers should follow to comply with it.

Policy

Backward Compatibility

SLA (service level agreement): ExecuTorch models should continue to run for 180 days after their deployment, regardless of updates to the ExecuTorch inference engine.

To comply with this SLA, we will disallow BC-breaking changes to core ATen operator (native function) schemas. Changes can be made if they are not BC breaking, including (updated from this definition):

  • bug fixes (where PyTorch’s behavior changes to better reflect its documentation)
  • small numerics changes
  • numerical accuracy improvements
  • changes to implementation (result unchanged)
  • adding a keyword argument with default value (result unchanged if given default value)
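As an illustration of the last bullet, here is a minimal sketch of a check that a schema change only appends keyword arguments with default values. The helper names and the string-splitting logic are invented for this sketch; PyTorch's actual BC check uses `torch._C.parse_schema` and schema comparison in its forward/backward compatibility test.

```python
# Hypothetical sketch: decide whether a new op schema only appends
# arguments that carry default values, which the policy above treats as
# a non-BC-breaking change. Not PyTorch's real schema parser.

def args_of(schema: str) -> list[str]:
    """Split 'op(arg1, arg2, ...) -> ret' into its argument strings."""
    inner = schema.split("(", 1)[1].rsplit(")", 1)[0]
    return [a.strip() for a in inner.split(",") if a.strip()]

def only_adds_defaulted_args(old: str, new: str) -> bool:
    old_args, new_args = args_of(old), args_of(new)
    if new_args[: len(old_args)] != old_args:
        return False  # existing arguments changed or were reordered
    # every appended argument must have a default value
    return all("=" in a for a in new_args[len(old_args):])

old = "aten::gelu(Tensor self) -> Tensor"
ok  = "aten::gelu(Tensor self, str approximate='none') -> Tensor"
bad = "aten::gelu(Tensor self, str approximate) -> Tensor"
print(only_adds_defaulted_args(old, ok))   # True: defaulted kwarg appended
print(only_adds_defaulted_args(old, bad))  # False: new arg has no default
```

A real implementation would also need to handle defaults containing commas and positional-vs-keyword distinctions; this sketch only conveys the shape of the rule.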

For the BC policy on the torch frontend or other ATen operators, this wiki still applies.

Forward Compatibility

ExecuTorch does not have a forward compatibility SLA. Since we disallow BC breaking changes to core ATen operator schemas, the only types of FC breaking changes are:

  1. A new operator is added to the core ATen opset, and a new model containing that operator is deployed without the ExecuTorch inference engine being updated (so the engine is missing the new operator).
  2. A new default argument is added to a core ATen op, and a new model that passes a non-default value for that argument is deployed.
    • If the new model contains the changed operator but only uses the default value of the new argument, this is not an FC breakage, because the argument can be omitted during serialization.

The ExecuTorch inference engine will provide an API that lets users check whether the inference engine is compatible with a new model.
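A minimal sketch of the kind of compatibility query described above: the runtime exposes which ops (and how many arguments) it understands, and a model's serialized call sites are checked against that before deployment. All names here are invented for illustration; this is not the actual ExecuTorch API.

```python
# Hypothetical runtime op registry: op name -> number of arguments this
# runtime build understands. A newer runtime would list more ops or
# higher argument counts.
RUNTIME_OPS = {
    "aten::add.Tensor": 3,
    "aten::mul.Tensor": 2,
}

def is_model_compatible(model_calls: list[tuple[str, int]]) -> bool:
    """model_calls: (op name, number of serialized args) per call site.
    Arguments equal to their default value are assumed to be omitted
    during serialization, which is why adding a defaulted argument does
    not break old runtimes as long as models stick to the default."""
    return all(
        name in RUNTIME_OPS and nargs <= RUNTIME_OPS[name]
        for name, nargs in model_calls
    )

print(is_model_compatible([("aten::add.Tensor", 3)]))  # True
print(is_model_compatible([("aten::new_op", 1)]))      # False: unknown op
print(is_model_compatible([("aten::mul.Tensor", 3)]))  # False: extra arg
```

The two `False` cases correspond exactly to the two FC-breaking changes enumerated above: a missing operator, and a non-default value for an argument the runtime does not know about.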

Developers Workflow

This section describes the developer workflow for adding, deprecating, or modifying a core ATen native function. We divide the native functions in native_functions.yaml into 3 categories:

  1. Core ATen ops
  2. Ops that are guaranteed to be decomposable to core ATen ops
  3. Custom / extended ops that will fail if lowered to core ATen ops and require special handling

To keep the schemas in category 1 intact, we ask developers who want to make a BC-breaking change to a category 1 op to instead add a new native function in category 3. The developer is encouraged to add a decomposition rule right after introducing the new native function.

Core ATen opset maintainers will regularly review new native functions being added to category 3 and make a decision on whether they need to be moved to category 1 or category 2.

Adding a non-core ATen native function

Developers can add a new native function to native_functions.yaml and reroute the frontend to it. We encourage the developer to add a decomposition rule, but it’s not required. After the PR is merged, core ATen opset maintainers will decide whether the newly added native function needs to be decomposed or added to the core ATen opset, before the first time it is used by ExecuTorch users. If a decomposition rule is needed, the new native function will be decomposed (through either a CompositeImplicitAutograd kernel or a decomposition rule) into core ATen ops, so that a model containing it can be lowered to PyTorch Edge/ExecuTorch. To keep track of these new native functions, we will add a CI job that checks whether a new native function can be decomposed into core ATen ops; if it cannot, the developer adding it will have to explicitly override the CI failure.
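The CI check described above can be sketched as a reachability test over a decomposition table: an op passes if every op its decomposition emits is (transitively) a core ATen op. The op names and table structure below are illustrative, not PyTorch's actual decomposition registry.

```python
# Hypothetical sketch of the "can this op be decomposed to core ATen
# ops?" CI check. Real PyTorch registers decompositions as kernels
# (e.g. CompositeImplicitAutograd) rather than as a plain dict.

CORE_OPS = {"aten::add.Tensor", "aten::mul.Tensor", "aten::sigmoid"}

# non-core op -> the ops its decomposition emits
DECOMP_TABLE = {
    "aten::silu": ["aten::sigmoid", "aten::mul.Tensor"],
    "aten::addcmul": ["aten::mul.Tensor", "aten::add.Tensor"],
}

def decomposes_to_core(op: str, seen=None) -> bool:
    seen = seen or set()
    if op in CORE_OPS:
        return True
    if op in seen or op not in DECOMP_TABLE:
        return False  # cycle, or no decomposition registered: CI flags it
    seen.add(op)
    return all(decomposes_to_core(child, seen) for child in DECOMP_TABLE[op])

print(decomposes_to_core("aten::silu"))        # True
print(decomposes_to_core("aten::mystery_op"))  # False: developer must override
```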

Adding a core ATen native function

Since adding a core ATen native function requires downstream ExecuTorch changes, adding such an operator has to go through a review process; see details in this wiki. Due to this complexity, we don’t expect developers to go through this process unless they are working on the PyTorch Edge / ExecuTorch stack directly; instead, it will likely be done by a core ATen opset maintainer. A CI job will be added to ensure that PRs adding core ATen native functions are tagged properly and that review is requested from core ATen opset maintainers. No decomposition rule is needed, since the new op belongs to the core ATen opset.

Deprecating a native function

To deprecate a native function (here we generalize the workflow to the whole set of native functions), we propose these general steps:

  1. Add a “deprecating” tag to the native function in native_functions.yaml. As a source of truth, this tag enables PyTorch subsystems to give clear warning messages and be able to nudge users to migrate.

    • For example, torchgen can generate warning messages automatically for “deprecating” native functions.
  2. Make sure no users are using the deprecated native function. This is the hardest step: for ops that are widely used by eager-mode models, it is much harder to ensure no one is still using them than when deprecating an internal-only op, where we have more control.

  3. Change the tag to “deprecated” no sooner than 2 weeks after the “deprecating” tag was added.

  4. Delete the entry in native_functions.yaml 180 days after step 3 is finished.

On top of these general steps we will build infrastructure on torch.export() to enforce the deprecation in exported artifacts:

  1. Give torch.export() users a warning if any native function is tagged “deprecating”.
  2. Give clear instructions along with the warning, asking the user to make sure all usages of the deprecated native function are addressed. Specifically, change torch.export() to land on another op instead of the deprecated one (for example, by changing the decomposition rule).
  3. As soon as torch.export() sees a “deprecated” tag, it gives an error message and fails the export.
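The enforcement steps above can be sketched as a small gate that inspects op tags at export time: warn on “deprecating”, fail on “deprecated”. The tag store, op names, and function name are invented for illustration; the real mechanism would read tags from native_functions.yaml inside torch.export().

```python
# Hypothetical sketch of torch.export()-time deprecation enforcement.
import warnings

OP_TAGS = {
    "aten::old_op": {"deprecating"},   # step 1: still allowed, warns
    "aten::dead_op": {"deprecated"},   # step 3: export fails
}

def check_exported_ops(ops: list[str]) -> None:
    for op in ops:
        tags = OP_TAGS.get(op, set())
        if "deprecated" in tags:
            raise RuntimeError(f"{op} is deprecated; export aborted")
        if "deprecating" in tags:
            warnings.warn(
                f"{op} is being deprecated; please migrate off it",
                FutureWarning,
            )

check_exported_ops(["aten::old_op"])   # emits a FutureWarning
try:
    check_exported_ops(["aten::dead_op"])
except RuntimeError as e:
    print(e)  # aten::dead_op is deprecated; export aborted
```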

Changing Non-core ATen native function in a BC breaking way

Developers can still make BC-breaking changes to non-core ATen operators by following the policy for the torch frontend and TorchScript.

Developers also need to make sure the changed native function can still be decomposed into core ATen ops (enforced by CI; see the “Adding a non-core ATen native function” section).

Changing Core ATen native function in a BC breaking way

BC-breaking changes are not allowed on core ATen operators. Instead, the workflow is:

  • Add a new native function to native_functions.yaml
  • Determine if it needs to show up in core ATen operator set
    • If yes, refer to the “Adding a core ATen native function” section.
    • If no, refer to the “Adding a non-core ATen native function” section.
  • Optionally deprecate the old operator following the “Deprecating a native function” section.

See the diagram below for how we are going to leverage CI jobs to facilitate the developer flow:

Next Steps

  1. We will migrate this content to the wiki so that it becomes official policy.
  2. We plan to build enforcement mechanisms via CI jobs shown in the flow diagram:
    • In addition to the existing BC/FC test, we want to communicate our policy clearly in its error messages.
    • If a PR is adding a new native function, we want to make sure it has a decomposition rule for edge use cases.
    • If a PR changes the existing core ATen decomposition rule to introduce a non-core ATen op, we want to block it from merging.

Please comment if you have any questions!


Thank you for the detailed explanation @larryliu0820!

I did not quite understand what the process is to rewrite a core op in terms of simpler ops. At Add aten._masked_index by isuruf · Pull Request #116491 · pytorch/pytorch · GitHub, we are working on a primitive that allows plenty of complex ops (pooling and their backwards) to be decomposed in terms of it. What would be the policy and the necessary steps to do so?

Similarly, _unsafe_index and _unsafe_index_put are not marked as core, but they do not have decompositions. Is this a bug? Wouldn’t a better way to test which ops are core be either:

  1. Comparing against the ops that have a lowering in inductor
  2. Looking at the expect list from test_aten_core_operators?

I don’t see a test that tracks the difference between the ATen ops marked as core and these two measures. Is this difference expected?


Thanks @Lezcano for bringing these issues up! We are in the process of ramping up automation tools to promote “non-core ATen ops” to “core ATen ops”; right now the process is very manual. I think @SS-JIA and other folks are forming a committee to review the native functions in question, to see whether they should have a core ATen decomposition rule or should be added to the core ATen opset. Thanks for bringing up these operators:

  • _masked_index
  • _unsafe_index
  • _unsafe_index_put
  1. Comparing against the ops that have a lowering in inductor

I would imagine the core ATen opset has already started to diverge from inductor decompositions/lowerings, but the decompositions there can serve as a very good reference for core ATen ops.

  1. Looking at the expect list from test_aten_core_operators?

The source of truth for core ATen ops is the core tag in native_functions.yaml. The list in test_has_decomposition.expect (the name is confusing, but this is really another “core ATen ops” list) should be in sync with the core tag. Right now they are out of sync, which creates a lot of confusion.
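A consistency check between the two lists mentioned above could be as simple as two set differences. The sample data below is invented; a real check would parse the `core` tags out of native_functions.yaml and read the actual test_has_decomposition.expect file.

```python
# Hypothetical sketch of a sync check between the two "core ATen ops"
# lists. Sample contents are made up for illustration.
core_tagged = {"aten::add.Tensor", "aten::convolution"}    # core tag in yaml
expect_list = {"aten::add.Tensor", "aten::_embedding_bag"}  # expect file

missing_tag = expect_list - core_tagged    # in expect file, not tagged core
missing_entry = core_tagged - expect_list  # tagged core, not in expect file

print(sorted(missing_tag))    # ['aten::_embedding_bag']
print(sorted(missing_entry))  # ['aten::convolution']
```

Running such a check in CI would keep the two lists from drifting apart, which is exactly the confusion described above.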

In summary, I or other core ATen opset maintainers will build some infra to:

  1. Encourage people to add a core ATen decomposition when they add a new native function, as #116491 is doing.
  2. Keep track of these new native functions and review them in the core ATen opset committee.

Hope that answers your question.

So what is the process to add a function (like _masked_index) that allows many core ops to be implemented in terms of it? Can we just do it, or is there any sort of fw-bw compat issue?


My suggestion is to send a PR that adds the core tag to the function, tag a few core ATen opset maintainers (@SS-JIA @SherlockNoMad), and submit it for review. I don’t think there’s a backward compatibility issue, but there may be downstream work on the ExecuTorch side after the PR is merged.

Exactly as @larryliu0820 mentioned: when adding operators to the ATen library that can potentially impact the core ATen opset, the core ATen opset should not block additions to the ATen library. In my view, the core ATen opset is a separate system that “tracks” the ATen library (or, more specifically, the operators used in PyTorch programs), so it is the responsibility of the core ATen opset maintainers to respond to such changes in the ATen library.