Benchmark GPU - PyTorch, ResNet50

ResNet50 is an image classification model. The benchmark number is the training speed of ResNet50 on the ImageNet dataset. Training speed is measured in images/second, not counting data loading.

Download the ImageNet dataset from https://image-net.org/challenges/LSVRC/2012/2012-downloads.php (an account is required). The download takes a long time, possibly around 24 hours. You need these 2 files:

ILSVRC2012_img_train.tar (138G)
ILSVRC2012_img_val.tar (6.3G)

I ran the benchmark on Rocky Linux 9.2 with Python 3.11 and CUDA 12.3.

Prepare data and directories:

mkdir -p /localscratch/results
mkdir -p /localscratch/train
cd /localscratch/train
tar xf /cluster/shared/databases/ILSVRC2012/ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
mkdir /localscratch/val
cd /localscratch/val
tar xf /cluster/shared/databases/ILSVRC2012/ILSVRC2012_img_val.tar
curl -O https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
bash ./valprep.sh
cd /root
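
Before running anything on the real data, it can help to check that the extraction produced the expected layout. The sketch below is my own addition, not part of the DALI example. It assumes the /localscratch/train and /localscratch/val paths used above and counts the class subdirectories and the files inside them. For ILSVRC2012 each split should contain 1000 class directories, with 1,281,167 training images and 50,000 validation images.

```python
import os


def count_class_dirs(split_dir):
    """Return (number of class subdirectories, number of files inside them)."""
    n_classes = 0
    n_files = 0
    for entry in os.scandir(split_dir):
        if entry.is_dir():
            n_classes += 1
            # Count only regular files directly inside the class directory.
            n_files += sum(1 for f in os.scandir(entry.path) if f.is_file())
    return n_classes, n_files


if __name__ == "__main__":
    # Paths taken from the preparation steps above.
    for split in ("train", "val"):
        path = os.path.join("/localscratch", split)
        if os.path.isdir(path):
            classes, files = count_class_dirs(path)
            print(f"{split}: {classes} class directories, {files} files")
        else:
            print(f"{split}: {path} not found")
```

If the val split shows 0 class directories, valprep.sh most likely did not run in /localscratch/val.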

Install

For my tests I used Python 3.11, available via EasyBuild modules. The benchmark should also work with other Python releases. Activate Python 3.11:

module load Python/3.11.5-GCCcore-13.2.0

Install Python modules:

pip install --user nvidia-dali-cuda120
pip install --user git+https://github.com/NVIDIA/dllogger#egg=dllogger
pip install --user pynvml

Install PyTorch. I have CUDA 12.3 installed, so the guide at https://pytorch.org/get-started/locally/ gives this command:

pip install --user torch torchvision torchaudio
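
After the installation, a quick check shows whether PyTorch can see the GPUs. This is a minimal sketch I added, not part of the original procedure. The function name describe_torch_cuda is hypothetical, and it only uses standard torch.cuda calls.

```python
def describe_torch_cuda():
    """Collect basic facts about the PyTorch install and visible GPUs."""
    try:
        import torch
    except ImportError:
        # PyTorch is not importable in this Python environment.
        return {"torch_installed": False}

    gpu_count = torch.cuda.device_count()
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": gpu_count,
        "gpu_names": [torch.cuda.get_device_name(i) for i in range(gpu_count)],
    }


if __name__ == "__main__":
    for key, value in describe_torch_cuda().items():
        print(f"{key}: {value}")
```

If cuda_available is False or gpu_count is 0, the benchmark commands below will not run on the GPUs.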

Clone the NVIDIA Data Loading Library (DALI) repository and change to the example directory:

git clone https://github.com/NVIDIA/DALI.git
cd DALI/docs/examples/use_cases/pytorch/efficientnet

Benchmark

Synthetic benchmark on 1 GPU (NVIDIA A100):

python multiproc.py --nproc_per_node 1 ./main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch

Synthetic benchmark on 4 GPUs (NVIDIA A100):

python multiproc.py --nproc_per_node 4 ./main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch
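
The two commands differ only in --nproc_per_node, so runs with other GPU counts can be generated from one template. The sketch below is my own helper (build_benchmark_command is a hypothetical name) and uses only the flags shown above. One assumption: it writes a separate report file per GPU count, so that later runs do not overwrite earlier reports.

```python
import shlex


def build_benchmark_command(num_gpus, batch_size=128,
                            workspace="/localscratch/results",
                            data_dir="/localscratch"):
    """Return the synthetic benchmark command as an argument list."""
    return [
        "python", "multiproc.py", "--nproc_per_node", str(num_gpus),
        "./main.py", "--amp", "--static-loss-scale", "128",
        "--batch-size", str(batch_size), "--epochs", "1", "--prof", "1000",
        "--no-checkpoints", "--training-only", "--data-backend", "synthetic",
        "--workspace", workspace,
        # Assumption: one report file per GPU count (not in the original command).
        "--report-file", f"bench_report_synthetic_{num_gpus}gpu.json",
        data_dir,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for n in (1, 2, 4, 8):
        print(shlex.join(build_benchmark_command(n)))
```

The printed lines can be pasted into a shell in the efficientnet example directory, or passed to subprocess.run as lists.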

The last lines of the output contain the result:

DLL 2024-04-11 18:36:11.513059 - Summary: train.loss : -39.24695 None train.compute_ips : 2389.14 images/s train.total_ips : 2376.65 images/s train.lr : 0.005  train.grad_scale : 128.00000 None val.top1 : None % val.top5 : None % val.loss : None None val.compute_ips : None images/s val.total_ips : None images/s val.compute_latency : None s
DLL 2024-04-11 18:36:11.513112 - Summary: train.data_time : 0.00032 s train.compute_time : 0.06057 s val.data_time : None s val.compute_latency_at100 : None s val.compute_latency_at99 : None s val.compute_latency_at95 : None s
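
To collect results from several runs, the summary lines can be parsed with a small script. This is a sketch that assumes the "name : value unit" layout seen in the two lines above. The function parse_summary_line is my own and not part of dllogger.

```python
import re

# A metric looks like "train.compute_ips : 2389.14 images/s":
# a dotted name, " : ", then the value (the unit that follows is ignored).
METRIC_RE = re.compile(r"([\w.]+) : (\S+)")


def parse_summary_line(line):
    """Return a dict mapping metric names to float values (or None)."""
    metrics = {}
    for name, raw in METRIC_RE.findall(line):
        if raw == "None":
            metrics[name] = None
            continue
        try:
            metrics[name] = float(raw)
        except ValueError:
            metrics[name] = raw  # keep unexpected values as text
    return metrics


if __name__ == "__main__":
    sample = ("DLL 2024-04-11 18:36:11.513059 - Summary: train.loss : -39.24695 None "
              "train.compute_ips : 2389.14 images/s train.total_ips : 2376.65 images/s")
    print(parse_summary_line(sample))
```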

Metrics gathered through training:

train.loss - training loss
train.total_ips - training speed measured in images/second
train.compute_ips - training speed measured in images/second, not counting data loading
train.data_time - time spent waiting for data
train.compute_time - time spent in forward/backward pass

Benchmark results on NVIDIA H100 single server

 1 x GPU:  4110.35 images/s
 2 x GPU:  7752.51 images/s
 4 x GPU: 15495.70 images/s
 8 x GPU: 30897.06 images/s

Benchmark results on NVIDIA A100 single server

 1 x GPU:  2389.14 images/s
 2 x GPU:  4472.56 images/s
 4 x GPU:  8582.74 images/s
 8 x GPU: 17181.05 images/s 
10 x GPU: 21361.83 images/s

Benchmark results on NVIDIA V100 single server

 1 x GPU:  1528.06 images/s
 2 x GPU:  2785.35 images/s

Benchmark results on NVIDIA P100 single server

 1 x GPU:   475.16 images/s
 2 x GPU:   929.20 images/s
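
The tables above can be turned into a scaling efficiency: the measured speed on N GPUs divided by N times the single-GPU speed. The sketch below is my own calculation on the numbers listed above. The name scaling_efficiency is hypothetical.

```python
# Throughput in images/s, copied from the result tables above.
RESULTS = {
    "H100": {1: 4110.35, 2: 7752.51, 4: 15495.70, 8: 30897.06},
    "A100": {1: 2389.14, 2: 4472.56, 4: 8582.74, 8: 17181.05, 10: 21361.83},
    "V100": {1: 1528.06, 2: 2785.35},
    "P100": {1: 475.16, 2: 929.20},
}


def scaling_efficiency(throughput_by_gpus):
    """Return {gpu_count: measured / (gpu_count * single-GPU throughput)}."""
    single = throughput_by_gpus[1]
    return {n: ips / (n * single) for n, ips in sorted(throughput_by_gpus.items())}


if __name__ == "__main__":
    for gpu, table in RESULTS.items():
        for n, eff in scaling_efficiency(table).items():
            print(f"{gpu} {n:2d} x GPU: {eff:6.1%}")
```

For example, 8 x H100 comes out at roughly 94% of perfect linear scaling, and 10 x A100 at roughly 89%.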

GPU memory allocation

To use more GPU memory, increase --batch-size. Start with 128 and raise the value step by step. I was able to use 600 on a 40GB A100 card without a memory allocation error.
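
The manual search for the largest batch size can be automated. The sketch below is only an illustration of the idea: it doubles the batch size until a run fails, then does a binary search between the last working and the first failing value. The fits callback is hypothetical and must be supplied by you, for example a function that starts a short benchmark run with the given --batch-size and returns True if it finishes without a memory allocation error. It also assumes that memory use grows with batch size.

```python
def largest_working_batch_size(fits, start=128, limit=4096):
    """Return the largest batch size in [start, limit] for which fits() is True.

    fits(batch_size) -> bool is a user-supplied probe (hypothetical here).
    Returns None if even the starting value does not fit.
    """
    if not fits(start):
        return None

    low = start        # largest value known to work
    high = start * 2   # next candidate
    # Doubling phase: grow until a run fails or the limit is passed.
    while high <= limit and fits(high):
        low = high
        high *= 2
    # Treat high as an exclusive upper bound that never exceeds limit + 1.
    high = min(high, limit + 1)

    # Binary search between the working and the failing value.
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low


if __name__ == "__main__":
    # Stand-in probe that mimics a card where 600 is the largest working value.
    print(largest_working_batch_size(lambda batch_size: batch_size <= 600))
```

Each probe is a full benchmark start, so the search takes a while. A short --prof value keeps the individual runs brief.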

Useful links:

https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/use_cases/pytorch/efficientnet/readme.html
https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Classification/ConvNets/resnet50v1.5/README.md