ResNet50 is an image classification model. The benchmark number is the training speed of ResNet50 on the ImageNet dataset. Training speed is measured in images/second, not counting data loading.
Download the ImageNet dataset here: You need to create an account, and the download takes a long time (possibly 24 hours). You need these two files:
ILSVRC2012_img_train.tar (138G)
ILSVRC2012_img_val.tar (6.3G)
I ran the benchmark on Rocky Linux 9.2 with Python 3.11 and CUDA 12.3.
Prepare data and directories:
mkdir /localscratch/
mkdir /localscratch/results
mkdir /localscratch/train
cd /localscratch/train
tar xf /cluster/shared/databases/ILSVRC2012/ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
mkdir /localscratch/val
cd /localscratch/val
tar xf /cluster/shared/databases/ILSVRC2012/ILSVRC2012_img_val.tar
curl -O
bash ./
cd /root
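Before starting a long run, it can help to check that the training archive was unpacked completely. The sketch below counts class directories and JPEG files under /localscratch/train. The expected numbers are the published ILSVRC2012 sizes (1000 classes, 1,281,167 training images); the function name count_dataset is my own, not part of any tool used here.

```python
from pathlib import Path

# Published ILSVRC2012 training-set sizes, used only for comparison.
EXPECTED_CLASSES = 1000
EXPECTED_TRAIN_IMAGES = 1_281_167


def count_dataset(root):
    """Return (number of class directories, number of JPEG files) under root."""
    root = Path(root)
    class_dirs = [d for d in root.iterdir() if d.is_dir()]
    n_images = sum(
        1
        for d in class_dirs
        for f in d.iterdir()
        if f.suffix.lower() in (".jpeg", ".jpg")
    )
    return len(class_dirs), n_images


if __name__ == "__main__":
    train_root = Path("/localscratch/train")
    if train_root.is_dir():
        classes, images = count_dataset(train_root)
        print(f"train: {classes} classes (expected {EXPECTED_CLASSES}), "
              f"{images} images (expected {EXPECTED_TRAIN_IMAGES})")
    else:
        print(f"{train_root} does not exist yet")
```

The same function should work for /localscratch/val, provided the validation images have been sorted into one subdirectory per class.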
For my tests I use Python 3.11, available via EasyBuild modules. This benchmark should also work with other Python releases. To activate Python 3.11:
module load Python/3.11.5-GCCcore-13.2.0
Install Python modules:
pip install --user nvidia-dali-cuda120
pip install --user git+
pip install --user pynvml
Install PyTorch. I have CUDA 12.3 installed, so I run this command, following this guide:
pip install --user torch torchvision torchaudio
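After the installation, a quick check that PyTorch was built with CUDA and can see the GPUs saves time later. This is a minimal sketch using standard torch.cuda calls; the helper name describe_gpus is made up for illustration, and it returns a small dict even when torch is not importable.

```python
def describe_gpus():
    """Summarize what PyTorch can see; also works when torch is missing."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpus": [],
    }
    if info["cuda_available"]:
        info["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    for key, value in describe_gpus().items():
        print(f"{key}: {value}")
```

If cuda_available is False on a GPU node, the installed wheel is probably a CPU-only build or the driver is not visible to the process.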
Clone the NVIDIA Data Loading Library (DALI) repository and change into the example directory:
git clone
cd DALI/docs/examples/use_cases/pytorch/efficientnet
Synthetic benchmark on 1 GPU (NVIDIA A100):
python --nproc_per_node 1 ./ --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch
Synthetic benchmark on 4 GPUs (NVIDIA A100):
python --nproc_per_node 4 ./ --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch
The last lines of the output contain the result:
DLL 2024-04-11 18:36:11.513059 - Summary: train.loss : -39.24695 None train.compute_ips : 2389.14 images/s train.total_ips : 2376.65 images/s : 0.005 train.grad_scale : 128.00000 None val.top1 : None % val.top5 : None % val.loss : None None val.compute_ips : None images/s val.total_ips : None images/s val.compute_latency : None s
DLL 2024-04-11 18:36:11.513112 - Summary: train.data_time : 0.00032 s train.compute_time : 0.06057 s val.data_time : None s val.compute_latency_at100 : None s val.compute_latency_at99 : None s val.compute_latency_at95 : None s
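When collecting results from many runs, the summary lines can be parsed with a small script instead of being read by eye. The sketch below assumes the log text looks like the two lines above, with "name : value" pairs whose names start with train. or val.; entries whose value is None are skipped. The function name parse_summary is my own.

```python
import re

# Matches pairs such as "train.compute_ips : 2389.14"; "None" values do not match.
METRIC_RE = re.compile(r"((?:train|val)\.\w+)\s*:\s*(-?\d+(?:\.\d+)?)")


def parse_summary(line):
    """Return a dict of numeric metrics found in one 'Summary:' log line."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


if __name__ == "__main__":
    # Excerpt of the first summary line shown above.
    line = (
        "DLL 2024-04-11 18:36:11.513059 - Summary: train.loss : -39.24695 None "
        "train.compute_ips : 2389.14 images/s train.total_ips : 2376.65 images/s "
        ": 0.005 train.grad_scale : 128.00000 None val.top1 : None %"
    )
    for name, value in parse_summary(line).items():
        print(f"{name} = {value}")
```

Restricting names to the train. and val. prefixes keeps the timestamp (18:36:11) from being picked up as a metric.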
Metrics gathered through training:
- train.loss - training loss
- train.total_ips - training speed measured in images/second
- train.compute_ips - training speed measured in images/second, not counting data loading
- train.data_time - time spent waiting for data
- train.compute_time - time spent in forward/backward pass
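The gap between train.compute_ips and train.total_ips shows how much throughput is lost to data loading. As a rough sketch, the fraction can be computed from either pair of metrics; the second function assumes that one training step takes data_time plus compute_time, which is my reading of the metric names and not something the log states. Both function names are my own.

```python
def overhead_from_ips(compute_ips, total_ips):
    """Fraction of throughput lost to data loading, from the two ips metrics."""
    return 1.0 - total_ips / compute_ips


def overhead_from_times(data_time, compute_time):
    """Share of a step spent waiting for data (assumes step = data + compute)."""
    return data_time / (data_time + compute_time)


if __name__ == "__main__":
    # Values from the 1 x A100 synthetic run shown above.
    print(f"from ips:   {overhead_from_ips(2389.14, 2376.65):.2%}")
    print(f"from times: {overhead_from_times(0.00032, 0.06057):.2%}")
```

For the synthetic backend both estimates come out at roughly 0.5%, which is expected because no real images are read or decoded.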
Benchmark results on NVIDIA H100 single server
1 x GPU: 4110.35 images/s
2 x GPU: 7752.51 images/s
4 x GPU: 15495.70 images/s
8 x GPU: 30897.06 images/s
Benchmark results on NVIDIA A100 single server
1 x GPU: 2389.14 images/s
2 x GPU: 4472.56 images/s
4 x GPU: 8582.74 images/s
8 x GPU: 17181.05 images/s
10 x GPU: 21361.83 images/s
Benchmark results on NVIDIA V100 single server
1 x GPU: 1528.06 images/s
2 x GPU: 2785.35 images/s
Benchmark results on NVIDIA P100 single server
1 x GPU: 475.16 images/s
2 x GPU: 929.20 images/s
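To compare how well each server scales, the results above can be turned into a scaling efficiency: throughput on N GPUs divided by N times the single-GPU throughput. The sketch below uses only the numbers listed above; the dictionary layout and function name are my own choices.

```python
# images/s by GPU count, copied from the results above.
RESULTS = {
    "H100": {1: 4110.35, 2: 7752.51, 4: 15495.70, 8: 30897.06},
    "A100": {1: 2389.14, 2: 4472.56, 4: 8582.74, 8: 17181.05, 10: 21361.83},
    "V100": {1: 1528.06, 2: 2785.35},
    "P100": {1: 475.16, 2: 929.20},
}


def scaling_efficiency(results):
    """Return {gpu: {n: ips_n / (n * ips_1)}} for every listed GPU count."""
    return {
        gpu: {n: ips / (n * runs[1]) for n, ips in runs.items()}
        for gpu, runs in results.items()
    }


if __name__ == "__main__":
    for gpu, effs in scaling_efficiency(RESULTS).items():
        row = ", ".join(f"{n} x GPU: {e:.1%}" for n, e in effs.items())
        print(f"{gpu}: {row}")
```

An efficiency of 100% would mean perfectly linear scaling; with these numbers, 8 GPUs reach about 94% on the H100 server and about 90% on the A100 server.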
GPU memory allocation
To use more GPU memory, increase --batch-size. Start with 128 and raise it step by step. I was able to use 600 on a 40 GB A100 card without a memory allocation error.
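To see how much headroom is left before raising the batch size, GPU memory can be read while the benchmark is running. This is a minimal sketch using the pynvml module installed earlier; the helper name gpu_memory_usage is made up, and it returns an empty list when pynvml or the NVIDIA driver is not available.

```python
def gpu_memory_usage():
    """Return [(index, used_GiB, total_GiB), ...], or [] if NVML is unavailable."""
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception:
        # pynvml is not installed, or no NVIDIA driver is visible.
        return []
    try:
        usage = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # values in bytes
            usage.append((i, mem.used / 2**30, mem.total / 2**30))
        return usage
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    for index, used, total in gpu_memory_usage():
        print(f"GPU {index}: {used:.1f} / {total:.1f} GiB used")
```

Run it in a second terminal during training. If used memory stays well below the total, a larger --batch-size will probably still fit.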
Useful links: