PyTorch benchmarks: issues with general usability and with individual benchmarks

I am adding a Google Doc to keep a list of the items below in case this post is not editable.

Benchmarks Repository

This is a set of suggestions, based on observations, to make the benchmarks more usable and to improve individual benchmarks so that they highlight PyTorch improvements.

Suggestions for making the benchmarks more usable for an external user:

  • Instructions on how to install dependencies when running top-of-tree (TOT) PyTorch from source, since benchmarking code under development is a common use case. The torchvision and torchtext packages have to be installed from source, as they are otherwise tied to a specific PyTorch version.
    • List of dependencies
      • pip install -r requirements.txt
      • apt-get update && apt-get install libgl1-mesa-glx
      • git clone && cd vision && FORCE_CUDA=1 pip install --no-deps --no-cache-dir -v .
      • git clone --recursive text && cd text && git submodule update --init --recursive && pip install --no-cache-dir -v .
  • Instructions on how to run individual timed benchmarks
    • It would be helpful to show how to specify filters for individual benchmarks and how to specify training and evaluation
      • Example: pytest -k "test_train[BERT_pytorch-cuda-jit]" --ignore_machine_config --fuser=eager
    • When evaluating fusions it is helpful to see stdout. Currently, stdout appears to be printed only on error, via pytest's --show-capture flag (pytest's -s / --capture=no flag disables output capturing entirely).
  • Reporting the configuration of the benchmark
    • Currently the user can only determine the configuration by reading the benchmark code. It would be better to report the relevant parameters of each benchmark, like batch size, sequence length, image size, etc., and device usage like multi-GPU.
  • Setting an expectation of performance
    • Often setting up a benchmark is not successful, and it is helpful to have some expectation of performance on a common device. Information like that seen in this link is helpful.
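On the configuration-reporting suggestion above, one low-effort form would be a metadata dict printed at the start of each run. A minimal sketch (the keys and the helper name are my own, not part of the benchmark suite's API):

```python
def format_config(cfg: dict) -> str:
    """Render a benchmark configuration as a one-item-per-line report."""
    width = max(len(k) for k in cfg)
    return "\n".join(f"{k:<{width}}  {v}" for k, v in cfg.items())

# Hypothetical values matching the BERT_pytorch example in this post.
config = {
    "benchmark": "BERT_pytorch",
    "batch_size": 32,
    "seq_len": 128,
    "device": "cuda",
    "multi_gpu": False,
    "fuser": "eager",
}
print(format_config(config))
```

Printing this once per run would let a user confirm batch size, sequence length, and device usage without reading the benchmark code.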

Issues with specific benchmarks when evaluating Pytorch improvements:

I ran all benchmarks in the following manner on a DGX A100 machine:
pytest -k "test_train[NAME-cuda-jit]" --ignore_machine_config --fuser=eager


  • Batch Size and Sequence Length: A batch size (number of sequences) of 1 with a sequence length of 20 seems inappropriate for training; these look like inference parameters, and the benchmark becomes dominated by allreduce communication, the optimizer, and CPU latency. I would suggest a minimum batch size of 32 with a sequence length of 128 (4096 total tokens). Larger sequence lengths also expose improvements to multi-head attention.
  • Multi-GPU usage: Is this benchmark intended for multi-GPU use? If so, the use of DataParallel is not advised, since a single process services all GPUs, which exposes CPU overhead. DistributedDataParallel should be used instead.
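For reference, the batch-size suggestion above works out as follows (a trivial sketch; the function name is mine, not from the benchmark suite):

```python
def tokens_per_step(batch_size: int, seq_len: int) -> int:
    """Total tokens processed per training step."""
    return batch_size * seq_len

# Current configuration: far too little work to saturate a GPU.
assert tokens_per_step(1, 20) == 20
# Suggested minimum: 32 sequences of 128 tokens = 4096 tokens per step.
assert tokens_per_step(32, 128) == 4096
```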


  • Model Design: This benchmark runs only the classifier portion of text classification (the fine-tuning task on BERT, or so the code suggests). The entire network therefore consists of a Linear, a Dropout, and a LogSoftmax. It is also a very small amount of work: it occupies only one block on the entire GPU, and one training iteration runs in about 1 ms.


  • Batch Size: The maximum token count of 4096 is not that small, yet this benchmark spends a large percentage of its time, 38%, in the optimizer. This is mostly because the optimizer is not built as a single fused kernel. Also, since the optimizer is not jitted, it will not speed up with JIT improvements even if the rest of the network does.
  • Device Syncs: This implementation also has device synchronizations at the data loader, the loss function, and stat collection. These prevent the CPU from running ahead of the GPU and limit the potential for improvements to be exposed.
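The per-parameter optimizer overhead mentioned above can be illustrated without PyTorch. In this pure-Python sketch, the "launches" counter stands in for kernel launches, and all names are hypothetical:

```python
def sgd_per_param(params, grads, lr=0.1):
    """Unfused update: one 'launch' per parameter group."""
    launches = 0
    for i in range(len(params)):
        params[i] = [p - lr * g for p, g in zip(params[i], grads[i])]
        launches += 1
    return params, launches

def sgd_fused(params, grads, lr=0.1):
    """Fused update: flatten all parameters and update in one 'launch'."""
    flat_p = [p for group in params for p in group]
    flat_g = [g for group in grads for g in group]
    updated = [p - lr * g for p, g in zip(flat_p, flat_g)]
    # Unflatten back to the original grouping.
    out, i = [], 0
    for group in params:
        out.append(updated[i:i + len(group)])
        i += len(group)
    return out, 1

params = [[1.0, 2.0], [3.0]]
grads = [[0.5, 0.5], [1.0]]
# Same numerical result, but one launch instead of one per group.
assert sgd_per_param([g[:] for g in params], grads)[0] == sgd_fused(params, grads)[0]
```

With hundreds of parameter tensors in a real model, the unfused version pays hundreds of launch overheads per step, which is why the 38% optimizer share is plausible here.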
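Likewise, the cost of eager stat collection can be sketched without a GPU. Here a class-level counter stands in for a device synchronization; FakeLoss and its .item() are stand-ins I made up, mirroring the real tensor method that forces a host-device sync:

```python
class FakeLoss:
    """Stand-in for a loss tensor; .item() models a device sync."""
    syncs = 0
    def __init__(self, value):
        self.value = value
    def item(self):
        FakeLoss.syncs += 1  # each .item() makes the CPU wait on the GPU
        return self.value

losses = [FakeLoss(v) for v in (0.9, 0.7, 0.5, 0.3)]

# Eager stat collection: one sync per training step.
FakeLoss.syncs = 0
running = sum(l.item() for l in losses)
assert FakeLoss.syncs == 4

# Deferred collection: accumulate on "device", read back once at the end.
FakeLoss.syncs = 0
acc = FakeLoss(0.0)
for l in losses:
    acc = FakeLoss(acc.value + l.value)  # models a tensor add: no sync
total = acc.item()
assert FakeLoss.syncs == 1
```

Deferring the read-back lets the CPU queue work ahead of the GPU, which is exactly what the syncs in the current implementation prevent.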

Hey Kevin, thanks a lot for pointing all of those out! I am interested in getting most of these fixed ASAP and for a few of them, discussing further. Please chime in at this PR for README updates.

Long story short, we didn't design around multi-GPU, so the fact that the model code contains some of that is just an artifact of how we copied it from the original repo. In our runs it should be configured as single-GPU. Not that this isn't something we should benchmark; it just hasn't been contemplated yet.

I’d also encourage you (or others) to post github issues to pytorch/benchmark and add to the project board for better tracking and visibility.

I filed issues for BERT training config (pytorch/benchmark/issues/317), and fastNLP misconfiguration (#316, new users can only post 2 links in a post!) just now.

I’m not sure whether the ‘attention’ model is “misconfigured” or just isn’t ideal for GPU perf currently. Maybe you could open a PR just to comment on which lines of code you want to change there, so it’s clear.

RE expectation of performance: we should share the latest performance numbers that we collect on our CI machines, but we haven’t gotten around to setting up that infrastructure externally yet.

Any suggestion on how to prepare an automated report from each model that concisely shows its configuration (layers, etc.) and hyperparams? That sounds like a nice feature to me.
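One possible shape for such a report, as a pure-Python sketch: the layer list here is hard-coded to keep it self-contained, but in PyTorch it could come from model.named_modules(). Every name below is hypothetical, not part of any existing API:

```python
def model_report(layers, hyperparams):
    """Concise report: layer list plus hyperparameters."""
    lines = ["== layers =="]
    lines += [f"  {name}: {kind}" for name, kind in layers]
    lines.append("== hyperparams ==")
    lines += [f"  {k} = {v}" for k, v in sorted(hyperparams.items())]
    return "\n".join(lines)

# Hard-coded stand-in for model.named_modules() output.
layers = [("classifier", "Linear"), ("drop", "Dropout"), ("out", "LogSoftmax")]
report = model_report(layers, {"lr": 1e-3, "batch_size": 32})
print(report)
```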


Yes, thank you for this feedback! We’re taking a look at this as part of the core team; please use issues where you can, and we’ll take a look at them. Thanks. cc @bitfort, @harryskim