The "Ideal" PyTorch FLOP Counter (with __torch_dispatch__)

@Chillee Related question: is there a way for me to extract the post-fusion FX graph, or better yet the Inductor scheduled graph, programmatically? (I know there's TORCH_COMPILE_DEBUG=1, but I was looking for a programmatic way.)

I'm interested in extending the FLOP counter to byte counting while accounting for as much of the fusion as I can.
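
For reference, a minimal sketch of what the unfused (eager) version of byte counting could look like with `__torch_dispatch__`. The class name `ByteCounterMode` and the accounting rule (sum the sizes of every tensor input and output of each op) are assumptions for illustration, not part of the FLOP counter. The sketch ignores fusion entirely and also charges view ops that move no data, so it is at best a rough upper bound on memory traffic.

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode
from torch.utils._pytree import tree_flatten


class ByteCounterMode(TorchDispatchMode):
    """Hypothetical eager-mode byte counter (no fusion awareness)."""

    def __init__(self):
        super().__init__()
        self.bytes_by_op = {}  # maps each aten op overload to bytes touched

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        out = func(*args, **kwargs)
        # Flatten nested args/outputs so every tensor is visited once per op.
        leaves_in, _ = tree_flatten((args, kwargs))
        leaves_out, _ = tree_flatten(out)
        touched = sum(
            t.numel() * t.element_size()
            for t in leaves_in + leaves_out
            if isinstance(t, torch.Tensor)
        )
        self.bytes_by_op[func] = self.bytes_by_op.get(func, 0) + touched
        return out


x = torch.randn(4, 8)
y = torch.randn(8, 2)
with ByteCounterMode() as counter:
    z = torch.mm(x, y)

total_bytes = sum(counter.bytes_by_op.values())
```

Here every intermediate is charged as a full read and write, which is exactly what a fused kernel avoids, hence the interest in the post-fusion graph.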