Hi @youkaichao, I'm not sure whether you have already addressed this. I reused part of your test workload and found that PyTorch deliberately prohibits retrying the allocation when a mempool is in use.
I raised the issue "OOM when using use_mem_pool due to restriction on retry" (pytorch/pytorch Issue #159674 on GitHub), which provides more information and works out a way to solve this.