I am adding a Google Doc to keep a list of the items below in case this post is not editable.
This is a set of suggestions, based on observations, to make the benchmarks more usable and to improve individual benchmarks so that they highlight PyTorch improvements.
Suggestions for making the benchmarks more usable for an external user:
- Instructions on how to install dependencies when running top-of-tree (TOT) PyTorch from source, since benchmarking code under development is a common use case. The torchvision and torchtext packages have to be installed from source because they are otherwise tied to a specific PyTorch version.
- List of dependencies:
pip install -r requirements.txt
apt-get update && apt-get install libgl1-mesa-glx
git clone https://github.com/pytorch/vision && cd vision && FORCE_CUDA=1 pip install --no-deps --no-cache-dir -v .
git clone --recursive https://github.com/pytorch/text text && cd text && git submodule update --init --recursive && pip install --no-cache-dir -v .
python install.py
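For example, a quick sanity check after installing from source (assuming the packages above built successfully) could be:
python -c "import torch, torchvision, torchtext; print(torch.__version__, torchvision.__version__, torchtext.__version__, torch.cuda.is_available())"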
- Instructions on how to run individual timed benchmarks
- It would be helpful to show how to specify filters for individual benchmarks and how to select between training and evaluation.
- Example:
pytest test_bench.py -k "test_train[BERT_pytorch-cuda-jit]" --ignore_machine_config --fuser=eager
- When evaluating fusions it is helpful to see stdout. Currently, stdout only appears to be printed on error, via the pytest --show-capture flag (see the example below).
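For illustration, assuming test_bench.py exposes test_eval entries alongside test_train (as the test IDs above suggest), selecting evaluation and getting live stdout might look like:
pytest test_bench.py -k "test_eval[BERT_pytorch-cuda-jit]" --ignore_machine_config --fuser=eager
pytest test_bench.py -k "test_train[BERT_pytorch-cuda-jit]" --ignore_machine_config --fuser=eager -s
Here -s is standard pytest shorthand for --capture=no, which disables output capture so fuser output is printed as the test runs.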
- Reporting the configuration of the benchmark
- Currently, the user can only determine the configuration by reading the benchmark code. It would be better to report the relevant parameters of the benchmark, such as batch size, sequence length, image size, etc., as well as device usage such as multi-GPU (a sketch is shown below).
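As a rough sketch of what such reporting could look like (the field names and values below are hypothetical, not taken from any existing benchmark):
import json
import torch

# Hypothetical per-benchmark summary, printed alongside the timing results.
config = {
    "model": "BERT_pytorch",
    "batch_size": 32,                  # illustrative values only
    "sequence_length": 128,
    "precision": "fp32",
    "device": "cuda",
    "num_gpus": torch.cuda.device_count(),
}
print(json.dumps(config, indent=2))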
- Setting an expectation of performance
- Setting up a benchmark is often not successful, and it is helpful to have some expectation of performance on a common device. Information like that shown in this link is helpful.
Issues with specific benchmarks when evaluating PyTorch improvements:
I ran all benchmarks in the following manner on a DGX A100 machine:
pytest test_bench.py -k "test_train[NAME-cuda-jit]" --ignore_machine_config --fuser=eager
BERT_pytorch:
- Batch Size and Sequence Length: The batch size (number of sequences) of 1 seems inappropriate for training, along with a sequence length of 20. These look like inference parameters, and the benchmark is dominated by allreduce communication, the optimizer, and CPU latency. I would suggest a minimum batch size of 32 with a sequence length of 128 (4096 total tokens). Larger sequence lengths also expose improvements to multi-head attention.
- Multi-GPU usage: Is this benchmark intended for multi-GPU use? If it is, the use of DataParallel is not advised, since all GPUs are serviced by a single process, which exposes CPU overhead. DistributedDataParallel should be used instead (see the sketch below).
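For reference, a minimal sketch of the DistributedDataParallel pattern; the model, optimizer, and shapes are placeholders rather than the benchmark's own code, and it assumes launch via torchrun with one process per GPU:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets MASTER_ADDR, MASTER_PORT, and LOCAL_RANK for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = torch.nn.Linear(1024, 1024).to(device)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(32, 1024, device=device)
    loss = model(x).sum()
    loss.backward()      # gradients are all-reduced across processes here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
Launched with torchrun --nproc_per_node=<num_gpus> <script>.py, each GPU gets its own process, so the CPU work of a single process no longer gates all devices.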
fastNLP:
- Model Design: This benchmark only runs the classifier portion of text classification, the fine-tuning task on BERT, or so the code suggests. Therefore, the entire network consists of a Linear, a Dropout, and a LogSoftmax. It is also a very small amount of work, as it occupies only one block on the entire GPU and one step iteration runs in 1 ms (see the sketch below).
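For illustration, a head of that shape is roughly the following (the layer sizes are assumptions, not taken from the benchmark):
import torch
import torch.nn as nn

# A classifier head of this size is a single small GEMM plus elementwise work,
# which is far too little to occupy a modern GPU.
head = nn.Sequential(
    nn.Linear(768, 2),        # e.g. BERT hidden size -> 2 classes (assumed sizes)
    nn.Dropout(0.1),
    nn.LogSoftmax(dim=-1),
).cuda()

x = torch.randn(8, 768, device="cuda")
out = head(x)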
attention_is_all_you_need:
- Batch Size: The maximum token count of 4096 is not that small, but this benchmark spends a large percentage of its time, 38%, in the optimizer. This is mostly because the optimizer is not built as a single kernel. Also, since the optimizer is not jitted, it will not speed up even if the rest of the network benefits from JIT improvements.
- Device Syncs: This implementation also has device synchronizations in the data loader, the loss function, and the stat collection. These prevent the CPU from running ahead and limit the potential for improvements to be exposed (see the sketch below).
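To illustrate the stat-collection case (this is a generic sketch, not the benchmark's code): calling .item() on the loss every step forces the CPU to wait for the GPU, while accumulating on the device and synchronizing once keeps the CPU running ahead.
import torch

model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
data = [torch.randn(64, 1024, device="cuda") for _ in range(100)]

# Sync-heavy pattern: .item() blocks the CPU until the GPU catches up every iteration.
total = 0.0
for x in data:
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    total += loss.item()              # implicit device synchronization per step

# Deferred pattern: accumulate on the GPU and synchronize once after the loop.
total_gpu = torch.zeros((), device="cuda")
for x in data:
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    total_gpu += loss.detach()        # stays on the device, no sync
print(total_gpu.item())               # single synchronization at the end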