Benchmark GPU - PyTorch, ResNet50

Posted on Sat 13 April 2024 by Pavlo Khmel

ResNet50 is an image classification model. The benchmark number is the training speed of ResNet50 on the ImageNet dataset. Training speed is measured in images/second, not counting data loading.

The ImageNet dataset can be downloaded here: https://image-net.org/challenges/LSVRC/2012/2012-downloads.php You need to create an account, and the download takes a long time (maybe 24 hours). You need these 2 files:

ILSVRC2012_img_train.tar (138G)
ILSVRC2012_img_val.tar (6.3G)

I ran the benchmark on Rocky Linux 9.2 with Python 3.11 and CUDA 12.3.

Prepare data and directories:

mkdir /localscratch/
mkdir /localscratch/results
mkdir /localscratch/train
cd /localscratch/train
tar xf /cluster/shared/databases/ILSVRC2012/ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
mkdir /localscratch/val
cd /localscratch/val
tar xf /cluster/shared/databases/ILSVRC2012/ILSVRC2012_img_val.tar
curl -O https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
bash ./valprep.sh
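
Before starting a long run it can be worth checking that the extraction produced the expected layout. The script below is only a sketch: the function name summarize_split is my own, and the expected totals are the commonly published ILSVRC2012 figures (1000 classes, 1,281,167 training images, 50,000 validation images), which are not taken from this benchmark.

```python
from pathlib import Path

# Commonly published ILSVRC2012 totals: (class directories, images).
# These are assumptions for the check, not values reported by the benchmark.
EXPECTED = {"train": (1000, 1281167), "val": (1000, 50000)}


def summarize_split(root):
    """Return (number of class directories, number of .JPEG files) under root."""
    root = Path(root)
    class_dirs = [p for p in root.iterdir() if p.is_dir()]
    images = sum(
        1
        for class_dir in class_dirs
        for f in class_dir.iterdir()
        if f.suffix.lower() == ".jpeg"
    )
    return len(class_dirs), images


if __name__ == "__main__":
    for split, expected in EXPECTED.items():
        path = Path("/localscratch") / split
        if not path.is_dir():
            print(f"{split}: {path} not found")
            continue
        found = summarize_split(path)
        status = "OK" if found == expected else f"expected {expected}"
        print(f"{split}: {found[0]} classes, {found[1]} images ({status})")
```

Counting more than a million files takes a while on a slow file system, so run it once after extraction rather than before every benchmark.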

Install

For my tests I used Python 3.11, available via EasyBuild modules. This benchmark should work with other Python releases. Activating Python 3.11:

module load Python/3.11.5-GCCcore-13.2.0

Install Python modules:

pip install --user nvidia-dali-cuda120
pip install --user git+https://github.com/NVIDIA/dllogger#egg=dllogger
pip install --user pynvml

Install PyTorch. I have CUDA 12.3 installed, so according to the guide at https://pytorch.org/get-started/locally/ I need to run this command:

pip install --user torch torchvision torchaudio
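
A quick way to confirm that the installed PyTorch build can see the GPUs is a short check like the one below. This is a minimal sketch, not part of the benchmark; the function name describe_cuda_setup is made up for illustration, and it only uses standard torch.cuda calls.

```python
import importlib.util


def describe_cuda_setup():
    """Collect basic facts about the PyTorch install and the visible GPUs."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not importable in this Python environment.
        return {"torch_installed": False}

    import torch

    gpu_count = torch.cuda.device_count()
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": gpu_count,
        "gpu_names": [torch.cuda.get_device_name(i) for i in range(gpu_count)],
    }


if __name__ == "__main__":
    for key, value in describe_cuda_setup().items():
        print(f"{key}: {value}")
```

If cuda_available is False or gpu_count is 0, the benchmark commands further down will not be able to use the GPUs.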

Download the NVIDIA Data Loading Library (DALI) and change to the example directory:

git clone https://github.com/NVIDIA/DALI.git
cd DALI/docs/examples/use_cases/pytorch/efficientnet

Benchmark

Synthetic benchmark on 1 GPU (NVIDIA A100):

python multiproc.py --nproc_per_node 1 ./main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch

Synthetic benchmark on 4 GPUs (NVIDIA A100):

python multiproc.py --nproc_per_node 4 ./main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch

The last lines of the output contain the result:

DLL 2024-04-11 18:36:11.513059 - Summary: train.loss : -39.24695 None train.compute_ips : 2389.14 images/s train.total_ips : 2376.65 images/s train.lr : 0.005  train.grad_scale : 128.00000 None val.top1 : None % val.top5 : None % val.loss : None None val.compute_ips : None images/s val.total_ips : None images/s val.compute_latency : None s
DLL 2024-04-11 18:36:11.513112 - Summary: train.data_time : 0.00032 s train.compute_time : 0.06057 s val.data_time : None s val.compute_latency_at100 : None s val.compute_latency_at99 : None s val.compute_latency_at95 : None s

Metrics gathered through training:

  • train.loss - training loss
  • train.total_ips - training speed measured in images/second
  • train.compute_ips - training speed measured in images/second, not counting data loading
  • train.data_time - time spent waiting for data
  • train.compute_time - time spent in the forward/backward pass
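
When collecting results from many runs, it is handy to pull these metrics out of the summary lines automatically. The parser below is a sketch that assumes the "name : value unit" layout seen in the output above; parse_summary is a hypothetical helper, not a DLLogger function.

```python
import re
import sys

# Matches pairs such as "train.compute_ips : 2389.14" and captures name and value.
_PAIR = re.compile(r"([A-Za-z_]\w*(?:\.\w+)+) : (\S+)")


def parse_summary(line):
    """Return a dict of metric name -> float, or None when no value was logged."""
    metrics = {}
    for name, raw in _PAIR.findall(line):
        try:
            metrics[name] = float(raw)
        except ValueError:
            # The log prints "None" for metrics that were not measured.
            metrics[name] = None
    return metrics


if __name__ == "__main__":
    # Usage (file name is an example): python parse_summary.py < benchmark_output.log
    for line in sys.stdin:
        if "Summary:" in line:
            for name, value in parse_summary(line).items():
                if value is not None:
                    print(f"{name} = {value}")
```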

Benchmark results on NVIDIA H100 single server

 1 x GPU:  4110.35 images/s
 2 x GPU:  7752.51 images/s
 4 x GPU: 15495.70 images/s
 8 x GPU: 30897.06 images/s 

Benchmark results on NVIDIA A100 single server

 1 x GPU:  2389.14 images/s
 2 x GPU:  4472.56 images/s
 4 x GPU:  8582.74 images/s
 8 x GPU: 17181.05 images/s 
10 x GPU: 21361.83 images/s

Benchmark results on NVIDIA V100 single server

 1 x GPU:  1528.06 images/s
 2 x GPU:  2785.35 images/s

Benchmark results on NVIDIA P100 single server

 1 x GPU:   475.16 images/s
 2 x GPU:   929.20 images/s
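
To compare how well the multi-GPU runs scale, the throughput can be divided by the number of GPUs times the single-GPU throughput. The sketch below does this for the A100 numbers listed above; scaling_efficiency is an illustrative helper of mine, not part of the benchmark scripts.

```python
# A100 results from the table above: number of GPUs -> images/s.
A100 = {1: 2389.14, 2: 4472.56, 4: 8582.74, 8: 17181.05, 10: 21361.83}


def scaling_efficiency(results):
    """Return GPUs -> measured throughput / (GPUs * single-GPU throughput)."""
    base = results[1]
    return {n: ips / (n * base) for n, ips in sorted(results.items())}


if __name__ == "__main__":
    for n, eff in scaling_efficiency(A100).items():
        print(f"{n:2d} x GPU: {eff:.1%} of ideal linear scaling")
```

For the A100 server this gives roughly 90% efficiency at 8 GPUs, meaning about a tenth of the ideal linear speedup is lost to multi-GPU overhead.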

GPU memory allocation

To use more GPU memory, increase --batch-size. Start with 128 and raise the number step by step. I was able to use 600 on a 40GB A100 card without a memory allocation error.
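
Finding the largest working batch size by hand takes a few tries, and the search can be automated. The sketch below doubles the batch size until it stops fitting and then bisects. It assumes that if a batch size fits, every smaller one fits too, and it takes a hypothetical callable fits(batch_size) that you would implement yourself, for example by running the benchmark command with that --batch-size and checking whether it exits without a memory allocation error.

```python
def find_max_batch_size(fits, start=128, limit=4096):
    """Return the largest batch size in [start, limit] for which fits() is True.

    fits is a user-supplied (hypothetical) callable: fits(batch_size) -> bool.
    """
    if not fits(start):
        raise ValueError(f"starting batch size {start} does not fit")

    # Phase 1: double until a size fails or the limit is exceeded.
    lo, hi = start, start * 2
    while hi <= limit and fits(hi):
        lo, hi = hi, hi * 2
    # hi is now either a failing size or just past the allowed limit.
    hi = min(hi, limit + 1)

    # Phase 2: bisect between the last good size (lo) and the bad bound (hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Stand-in for a real test run: pretend everything up to 600 fits.
    print(find_max_batch_size(lambda batch_size: batch_size <= 600))
```

Each call to fits() means one short benchmark run, so the search costs around a dozen runs for typical batch sizes.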

Useful links:

  • https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/use_cases/pytorch/efficientnet/readme.html
  • https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Classification/ConvNets/resnet50v1.5/README.md