Context
Core ATen operator set (definition, list) is a new API surface that was introduced by PyTorch 2.0 and is currently used by ExecuTorch. In a nutshell, models being delivered to edge devices go through torch.export(), followed by a step called “core ATen decomposition” that ensures the operators used in the exported model artifacts are only core ATen ops (with the exception of custom ops). At runtime, the ExecuTorch inference engine executes the core ATen ops with the input arguments from the model, which requires the schemas of these ops to be the same across the inference engine and the model. Here we define the BC/FC policy for the core ATen opset, as well as the workflow PyTorch developers should follow to comply with the policy.
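As a rough sketch of that pipeline (a minimal example using only torch.export; the subsequent ExecuTorch lowering and serialization steps are omitted, and exact decomposition behavior depends on the PyTorch version):

```python
import torch
from torch.export import export

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.nn.functional.relu(self.linear(x))

# Step 1: capture the model into an ExportedProgram.
ep = export(TinyModel(), (torch.randn(2, 4),))

# Step 2: "core ATen decomposition" -- lower composite ops in the graph to
# core ATen ops. In recent PyTorch releases the default table used by
# run_decompositions() targets the core ATen opset; older releases may require
# passing a decomposition table explicitly.
core_ep = ep.run_decompositions()

# The remaining ops should all be core ATen ops (plus any custom ops).
print(core_ep.graph_module.code)
```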
Policy
Backward Compatibility
SLA (service level agreement): ExecuTorch models should continue to run for 180 days after their deployment, regardless of updates to the ExecuTorch inference engine.
To comply with this SLA, we will disallow BC breaking changes to core ATen operator (native function) schemas. Changes can still be made if they are not BC breaking, including (adapted from this definition):
- bug fixes (where PyTorch’s behavior changes to better reflect its documentation)
- small numerics changes
- numerical accuracy improvements
- changes to implementation (result unchanged)
- adding a keyword argument with default value (result unchanged if given default value)
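For instance, the last point can be illustrated with plain Python functions standing in for an op's old and new schema (the op name and argument are made up for illustration):

```python
import torch

# Old schema (hypothetical op, names made up for illustration):
#   my_op(Tensor self) -> Tensor
def my_op_v1(self: torch.Tensor) -> torch.Tensor:
    return self * 2.0

# New schema adds a keyword argument with a default value:
#   my_op(Tensor self, *, float scale=2.0) -> Tensor
# Existing call sites keep working and get the same result as before.
def my_op_v2(self: torch.Tensor, *, scale: float = 2.0) -> torch.Tensor:
    return self * scale

x = torch.randn(3)
assert torch.equal(my_op_v1(x), my_op_v2(x))  # result unchanged at the default value
```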
For the BC policy on the torch frontend and other ATen operators, this wiki still applies.
Forward Compatibility
ExecuTorch does not have a forward compatibility SLA. Since we disallow BC breaking changes to core ATen operator schemas, the only types of FC breaking changes are:
- A new operator is added to the core ATen op set, and a new model containing that operator is deployed without the ExecuTorch inference engine being updated (so the engine is missing the new operator).
- A new default argument is added to a core ATen op, and a new model is deployed that passes a non-default value for that argument.
  - If the new model contains the changed operator but only uses the default value of the new argument, this is not an FC breakage, because that argument can be omitted during serialization.
The ExecuTorch inference engine will provide an API that lets users check whether the inference engine is compatible with a new model.
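To make that concrete, a compatibility check along these lines could be built on top of the exported graph (the op list and helper below are purely illustrative, not the actual ExecuTorch API):

```python
import torch
from torch.export import export

# Hypothetical set of op names a deployed inference engine can execute.
SUPPORTED_OPS = {"aten.relu.default", "aten.add.Tensor"}

def unsupported_ops(ep) -> set:
    """Return ops in a decomposed ExportedProgram that the engine does not know."""
    found = set()
    for node in ep.graph.nodes:
        if node.op == "call_function" and isinstance(node.target, torch._ops.OpOverload):
            found.add(str(node.target))  # e.g. "aten.add.Tensor"
    return found - SUPPORTED_OPS

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

ep = export(M(), (torch.randn(3),)).run_decompositions()
missing = unsupported_ops(ep)
if missing:
    print("new model is not compatible with this engine; missing ops:", missing)
```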
Developer Workflow
This section describes developer workflows for adding, deprecating, or modifying a core ATen native function. We divide the native functions in native_functions.yaml into 3 categories:
- Core ATen ops (category 1)
- ops that are guaranteed to be decomposable to Core ATen ops (category 2)
- custom / extended ops that will fail if lowered to Core ATen and require some special handling (category 3)
In order to keep the schemas of category 1 intact, we ask developers aiming to make a BC breaking change to category 1 to add a new native function into category 3 instead. The developer is encouraged to add a decomposition rule right after introducing the new native function (see the sketch below).
Core ATen opset maintainers will regularly review new native functions being added to category 3 and make a decision on whether they need to be moved to category 1 or category 2.
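Concretely, such a decomposition rule can take the form of a CompositeImplicitAutograd kernel that expresses the new op in terms of existing ATen ops, so export can trace through it. A minimal sketch, using a made-up namespace and op in place of a real native_functions.yaml entry:

```python
import torch
from torch.library import Library

# Stand-in for a newly added native function; a real one would be declared in
# native_functions.yaml. The "mylib::fused_gelu_mul" name is made up.
lib = Library("mylib", "DEF")
lib.define("fused_gelu_mul(Tensor x, Tensor y) -> Tensor")

# A CompositeImplicitAutograd kernel doubles as the decomposition rule: it is
# written purely in terms of existing ATen ops, so tracing through it leaves
# only those ops in the exported graph.
def fused_gelu_mul(x, y):
    return torch.nn.functional.gelu(x) * y

lib.impl("fused_gelu_mul", fused_gelu_mul, "CompositeImplicitAutograd")

# The op is callable like any other; a model using it can be exported and
# decomposed so that only the underlying ATen ops remain.
out = torch.ops.mylib.fused_gelu_mul(torch.randn(3), torch.randn(3))
```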
Adding a non-core ATen native function
Developers can add a new native function to native_functions.yaml and reroute the frontend to the new native function. We encourage the developer to add a decomposition rule, but it’s not required. After the PR is merged, core ATen opset maintainers will decide whether the newly added native function needs to be decomposed or added into the core ATen opset, before the first time it is used by ExecuTorch users. If a decomposition rule is needed, the new native function will be decomposed (through either a CompositeImplicitAutograd kernel or a decomposition rule) into core ATen ops, so that models containing it can be lowered to PyTorch Edge/ExecuTorch. In order to keep track of these new native functions, we will add a CI job that checks whether a new native function can be decomposed into core ATen ops; if it cannot, the developer adding it will have to override the CI job failure.
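A rough sketch of what such a check could verify, relying on the `core` tag that core ATen ops carry in native_functions.yaml (this is an approximation for illustration, not the actual CI job):

```python
import torch
from torch.export import export

def check_decomposes_to_core_aten(module: torch.nn.Module, example_inputs) -> None:
    """Export and decompose a module exercising the new op, then flag any ATen
    ops left in the graph that are not tagged as core."""
    ep = export(module, example_inputs).run_decompositions()
    offenders = []
    for node in ep.graph.nodes:
        if node.op != "call_function" or not isinstance(node.target, torch._ops.OpOverload):
            continue
        if node.target.namespace == "aten" and torch.Tag.core not in node.target.tags:
            offenders.append(str(node.target))
    assert not offenders, f"ops not decomposed to core ATen: {offenders}"
```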
Adding a core ATen native function
Since adding a core ATen native function requires downstream ExecuTorch changes, adding such an operator has to go through a review process; see details in this wiki. Due to this complexity, we don’t expect developers to go through this process unless they are working on the PyTorch Edge / ExecuTorch stack directly; instead, it will likely be done by a core ATen opset maintainer. A CI job will be added to ensure that PRs adding core ATen native functions are tagged properly and request review from core ATen opset maintainers. No decomposition rule is needed, since the new op belongs to the core ATen op set.
Deprecating a native function
In order to deprecate a native function (here we generalize the workflow to apply to the whole set of native functions), we propose these general steps:
1. Add a “deprecating” tag to the native function in native_functions.yaml. As a source of truth, this tag enables PyTorch subsystems to give clear warning messages and nudge users to migrate. For example, torchgen can generate warning messages automatically for “deprecating” native functions.
2. Make sure no users are using the deprecated native function. This is the hardest step: for ops that are widely used by eager mode models, it is much harder to ensure no user is using them than when deprecating an internal-only op, where we have more control.
3. Change the tag to “deprecated” no sooner than 2 weeks after the “deprecating” tag is added.
4. Delete the entry in native_functions.yaml 180 days after step 3 is finished.
On top of these general steps, we will build infrastructure in torch.export() to enforce the deprecation in exported artifacts:
- Give torch.export() users a warning if any native function is tagged with “deprecating”.
  - Give clear instructions along with the warning, asking the user to make sure all usages of the deprecating native function are addressed. Specifically, change torch.export() to land on another op instead of the deprecated op (for example, by changing a decomposition rule).
- As soon as torch.export() sees a “deprecated” tag, it gives an error message and fails the export process.
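For illustration, an export-side check implementing this policy might walk the graph and inspect op tags roughly as follows, assuming the proposed “deprecating”/“deprecated” tags existed and were surfaced on each OpOverload’s tags (neither tag exists today, and this helper is not a real torch.export hook):

```python
import warnings
import torch

def enforce_deprecation_policy(ep) -> None:
    """Warn on "deprecating" ops and fail export on "deprecated" ops."""
    for node in ep.graph.nodes:
        if node.op != "call_function" or not isinstance(node.target, torch._ops.OpOverload):
            continue
        tag_names = {t.name for t in node.target.tags}
        if "deprecated" in tag_names:   # hypothetical tag proposed above
            raise RuntimeError(f"{node.target} is deprecated; failing export.")
        if "deprecating" in tag_names:  # hypothetical tag proposed above
            warnings.warn(
                f"{node.target} is being deprecated; please migrate all usages "
                f"(e.g. by updating the decomposition rule) before it is removed."
            )
```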
Changing a non-core ATen native function in a BC breaking way
Developers can still make BC breaking changes to non-core ATen operators by following the policy for the torch frontend and TorchScript.
Developers also need to make sure the new native function can still be decomposed into core ATen ops (enforced by CI; see the “Adding a non-core ATen native function” section).
Changing a core ATen native function in a BC breaking way
BC breaking changes are not allowed on core ATen operators; instead, the workflow will be:
- Add a new native function to native_functions.yaml
- Determine whether it needs to show up in the core ATen operator set
  - If yes, refer to the “Adding a core ATen native function” section.
  - If no, refer to the “Adding a non-core ATen native function” section.
- Optionally deprecate the old operator following the “Deprecating a native function” section.
See the diagram below for how we are going to leverage CI jobs to facilitate the developer flow:
Next Steps
- We will move the content here into this wiki so that it becomes an official policy
- We plan to build enforcement mechanisms via the CI jobs shown in the flow diagram:
  - In addition to the existing BC/FC test, we want to communicate our policy clearly in its error messages.
  - If a PR adds a new native function, we want to make sure it has a decomposition rule for edge use cases.
  - If a PR changes an existing core ATen decomposition rule to introduce a non-core ATen op, we want to block it from merging.
Please comment if you have any questions!