Thanks a lot for sharing this! I’m excited to see the data that shows the dispatcher’s role in the trace overhead.
We would appreciate any suggestions, feedback, etc. Also, would there be interest in merging a feature-complete version of this patch at some point?
I would consider two points before trying to merge this:
- Is this going to be useful / add additional value on top of using torchdynamo + lazytensor? The current proposal (though it is early) is to support a mode where dynamo associates a lazily traced program with its own guards, making it safe to skip lazy tracing on iteration 2 and jump directly to the compiled computation as long as dynamo’s guards pass (a rough sketch of this flow is below the list). With this approach we can skip not only the dispatcher overhead but all of the trace overhead, and even some of the overhead originating in Python.
- What would it take to make this 100% safe/consistent with eager PyTorch behavior? Mainly, there are probably cases where there is non-linear code (i.e., more than a straight pass-through) between the THPVariable_foo binding and the underlying foo operator. In those cases, jumping directly to the lazy trace could make the lazy tensor behave differently from eager (a toy illustration is below the list). Can we avoid this and keep things consistent?
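To make the first point concrete, here is a minimal sketch of the guard-then-replay flow, assuming a cache keyed by the traced function. The names (`GuardedComputation`, `run`, `step`) are hypothetical, and `torch.jit.trace` is only a stand-in for the lazy-tensor backend's trace/compile step; this is not the actual torchdynamo or LTC API.

```python
import torch

class GuardedComputation:
    def __init__(self, guards, compiled_fn):
        self.guards = guards            # predicates over the inputs (shapes, dtypes, ...)
        self.compiled_fn = compiled_fn

    def guards_pass(self, args):
        return all(g(args) for g in self.guards)

_cache = {}

def run(fn, *args):
    entry = _cache.get(fn)
    if entry is not None and entry.guards_pass(args):
        # iteration >= 2 and the guards still hold: jump straight to the cached
        # compiled computation, skipping both lazy tracing and the dispatcher
        return entry.compiled_fn(*args)

    # first iteration (or a guard failed): trace/compile and record the guards
    # under which that trace is valid
    guards = [
        lambda a, shapes=tuple(x.shape for x in args): tuple(x.shape for x in a) == shapes,
        lambda a, dtypes=tuple(x.dtype for x in args): tuple(x.dtype for x in a) == dtypes,
    ]
    compiled = torch.jit.trace(fn, args)    # stand-in for the lazy-tensor compile
    _cache[fn] = GuardedComputation(guards, compiled)
    return compiled(*args)

# usage: tracing only happens on iteration 1; later iterations replay the cache
def step(x, w):
    return torch.relu(x @ w)

x, w = torch.randn(4, 8), torch.randn(8, 2)
for _ in range(3):
    out = run(step, x, w)
```

The point is just that once the guards (here, shapes and dtypes) are recorded on iteration 1, iterations 2+ can bypass tracing entirely whenever those guards still hold.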
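And a toy illustration of the second point, with the binding-level logic faked in Python (in reality it lives in C++ between THPVariable_foo and the kernel); the function name and the particular check are made up:

```python
import torch

def eager_foo(x):
    # stand-in for logic that sits between the THPVariable_foo binding and the
    # dispatched kernel, e.g. an input-dependent check that never enters the trace
    if x.numel() == 0:
        raise RuntimeError("foo: expected a non-empty tensor")
    return torch.relu(x)   # only this part ends up in the lazy trace

# Eager behavior: eager_foo(torch.empty(0)) raises.
# A cached trace that captured only torch.relu would instead return an empty
# tensor silently; that divergence is the consistency question above.
```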