Hi! Nice to see this after engaging with you in recent discussions like Optimizers' `differentiable` flag doesn't work · Issue #141832 · pytorch/pytorch · GitHub! Since you're interested in contributing more than just piecemeal fixes, I'd be excited to work with you on a more holistic, medium-sized project around differentiable optimizers if you're interested.
You've already gotten some context on action items, so this is meant to frame the bigger picture with a clearer end goal. Ultimately, we want differentiable optimizers to have better support, test coverage, and documentation than they do today.
What would an ideal end state look like?
(1) Better support: people can run differentiable optimizers with lr, betas, and weight_decay as Tensors that require grad, meaning people can train their optimizer hyperparameters (see the sketch after this list).
(2) Better documentation: we have a tutorial in pytorch/tutorials showing a real use case for differentiable optimizers, and our pytorch/pytorch documentation has a simpler code example. We also raise proper errors/warnings within the code linking to these resources.
(3) Fuller test coverage: our differentiable tests were excluded from the general test infrastructure migration to OptimizerInfo, but ideally we'd use OptimizerInfos for these tests as well. A good example of the shape our differentiable tests could take is test_foreach_large_tensor in pytorch/test/test_optim.py. We'd want the OptimizerInfo infra to encompass all the new tests we want to add, like lr as a Tensor, etc.
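To make (1) concrete, here's roughly the kind of thing I'd want someone to be able to write once we're done. This is a sketch of the target usage, not something guaranteed to work today (that's the point of the project), and it borrows the clone-the-param trick from our existing differentiable tests so the in-place step doesn't hit the leaf tensor:

```python
import torch

# A parameter and a learnable learning rate (the hyperparameter we want to train).
weight = torch.rand(4, dtype=torch.float64, requires_grad=True)
lr = torch.tensor(0.01, dtype=torch.float64, requires_grad=True)

# Clone so the optimizer's in-place update lands on a non-leaf tensor and
# stays on the autograd graph.
param = weight.clone()
param.grad = torch.rand_like(param)

opt = torch.optim.SGD([param], lr=lr, differentiable=True)
opt.step()

# The updated parameter is now a function of lr, so a downstream "meta" loss
# can backprop into the hyperparameter.
meta_loss = param.sum()
meta_loss.backward()
print(lr.grad)  # the whole goal: a real gradient for the learning rate
```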
Like most destinations, this end state can be achieved from several directions. Here's one sample path, taken from what we already delineated in the linked issue above:
Step 0 (could be done in parallel or first, based on preference): Migrate the current differentiable tests in pytorch/test/optim/test_optim.py to use OptimizerInfos + expand test coverage. A rough sketch of what that could look like is below.
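Here's roughly the shape an OptimizerInfo-based differentiable test could take. The class and test names are placeholders, and the details would need to follow whatever test_optim.py already does (e.g. restricting to optimizers that actually accept a differentiable flag, and skipping kwarg combos like fused that don't make sense with differentiable):

```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_optimizers import optim_db, optims
from torch.testing._internal.common_utils import TestCase, run_tests


class TestDifferentiableOptimizers(TestCase):
    # Parametrize over every optimizer in optim_db instead of hand-rolling
    # one test per optimizer class.
    @optims(optim_db, dtypes=[torch.float64])
    def test_step_stays_on_graph(self, device, dtype, optim_info):
        for optim_input in optim_info.optim_inputs_func(device=device):
            kwargs = dict(optim_input.kwargs, differentiable=True)
            # Clone a leaf so the in-place step happens on a non-leaf tensor.
            param = torch.rand(4, device=device, dtype=dtype, requires_grad=True).clone()
            param.grad = torch.rand_like(param)
            optim = optim_info.optim_cls([param], **kwargs)
            optim.step()
            # If the step really went through autograd, the updated param has history.
            self.assertIsNotNone(param.grad_fn)


instantiate_device_type_tests(TestDifferentiableOptimizers, globals())

if __name__ == "__main__":
    run_tests()
```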
Step 1: support tensor LR when differentiable is True for SGD. Add a test case and docs in the code.
Step 2: now, what if the tensor LR requires grad? Make sure this works and add a test case and docs in the code (see the sketch below covering Steps 1 and 2).
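A rough sketch of what Steps 1 and 2 are aiming for, modeled on the gradcheck-based pattern our current differentiable tests use. The helper name is made up, and this is the target behavior rather than something that necessarily passes today:

```python
import torch
from torch.autograd import gradcheck


def _sgd_step_with_tensor_lr(p, grad, lr):
    # Made-up helper mirroring the existing differentiable tests: clone the
    # parameter, run one step, and return the updated value so gradcheck can
    # probe d(updated_param) / d(lr).
    p = p.clone()
    p.grad = grad
    opt = torch.optim.SGD([p], lr=lr, differentiable=True)
    opt.step()
    return p


p = torch.rand(8, dtype=torch.float64, requires_grad=True)
grad = torch.rand(8, dtype=torch.float64, requires_grad=True)

# Step 1: a plain Tensor lr should just work when differentiable=True.
_sgd_step_with_tensor_lr(p, grad, torch.tensor(0.9, dtype=torch.float64))

# Step 2: an lr that requires grad should get correct gradients flowing into it.
lr = torch.tensor(0.9, dtype=torch.float64, requires_grad=True)
gradcheck(_sgd_step_with_tensor_lr, (p, grad, lr))
```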
Step 3: Expand the above to other optimizers: Adam, AdamW, Adagrad, etc. Of course, add test cases and corresponding docs. This might be when it'd be good to consider using OptimizerInfos if you haven't yet.
Step 4: Add error messaging.
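By "error messaging" I mean something along these lines inside the optimizer code; the helper name, condition, error type, wording, and link here are all placeholders just to illustrate the idea:

```python
def _check_differentiable_supported(differentiable: bool, foreach: bool) -> None:
    # Purely illustrative (the name and condition are made up); the real checks
    # would live inside the relevant optimizer and link to whatever docs and
    # tutorial we end up writing in Step 5.
    if differentiable and foreach:
        raise RuntimeError(
            "foreach=True is not supported with differentiable=True. See the "
            "differentiable optimizer notes in the torch.optim docs for what is "
            "currently supported: https://pytorch.org/docs/stable/optim.html"
        )
```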
Step 5: Add overarching docs on how to use differentiable optimizers and what’s supported. I could also see this being step 1, with gradual improvements as steps 1-3 are completed.
Let me know what you think!