
Running GPU Jobs

1. Running GPU Batch Jobs:

On MASSIVE, there are 244 GPU cards:

a) 76 NVIDIA K20 GPU

b) 20 NVIDIA M2070Q (Vis node)

c) 148 NVIDIA M2070

When requesting a GPU, you can let the system decide which type of GPU you are given:

#SBATCH --gres=gpu:1

You can also explicitly select the GPU type you would like:

#SBATCH --gres=gpu:m2070:1

or

#SBATCH --gres=gpu:k20m:1
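
The two forms above differ only in whether a GPU type is placed between `gpu` and the count. As a minimal sketch, the string could be assembled like this; `gres_spec` is a hypothetical helper written for illustration, not a MASSIVE or Slurm command, and the type names are the ones shown above.

```shell
# Build a --gres value for a GPU request.
# Usage: gres_spec [count] [type]
# gres_spec is a hypothetical helper; count defaults to 1, type is optional.
gres_spec() (
    count="${1:-1}"
    type="${2:-}"
    if [ -n "$type" ]; then
        # Explicit GPU type, e.g. gpu:k20m:1
        printf 'gpu:%s:%s\n' "$type" "$count"
    else
        # Let the system choose the GPU type, e.g. gpu:1
        printf 'gpu:%s\n' "$count"
    fi
)

# Example: print the sbatch directive for one K20 card.
echo "#SBATCH --gres=$(gres_spec 1 k20m)"
```

The same string can also be passed on the sbatch command line instead of being written into the script.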


To submit a job that needs 1 node with 2 CPU cores and 2 GPUs, the Slurm submission script should look like:

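#!/bin/bash
#SBATCH --job-name=MyJob
#SBATCH --account=monash001
#SBATCH --time=01:00:00
# Keep both tasks on a single node
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
# --gres is counted per node: request 2 GPUs on that node
#SBATCH --gres=gpu:2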

 

If you need 6 nodes with 4 CPU cores and 2 GPUs on each node, the Slurm submission script should look like:

#!/bin/bash
#SBATCH --job-name=MyJob
#SBATCH --account=monash001
#SBATCH --time=01:00:00
#SBATCH --ntasks=24
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:2
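
The numbers in the multi-node script follow from simple products, because `--gres` is counted per node while `--ntasks` is counted for the whole job. A small sketch of that arithmetic, using the values from the example above (the variable names are illustrative only):

```shell
# Per-node requirements taken from the 6-node example.
nodes=6
tasks_per_node=4
gpus_per_node=2

# --ntasks covers the whole job; --gres applies to each node.
ntasks=$((nodes * tasks_per_node))      # value for --ntasks
total_gpus=$((nodes * gpus_per_node))   # GPUs allocated across the job

echo "--ntasks=$ntasks"
echo "total GPUs=$total_gpus"
```

With these values the script prints `--ntasks=24` and `total GPUs=12`.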

Sample Slurm submission scripts have been prepared on MASSIVE and can be found here:

/usr/local/training/samples/slurm/

 

2. Compile your own CUDA or OpenCL codes and run on MASSIVE

The MASSIVE cluster is configured so that CUDA (or OpenCL) applications can be compiled on the Login node (device-independent code ONLY, since no GPUs are installed there) and then executed on a Compute node (with GPUs).


Login node: can compile some CUDA (or OpenCL) source code (device-independent code ONLY) but cannot run it.

Compute node: can compile all CUDA (or OpenCL) source code as well as execute it.

We strongly suggest you compile your code on a compute node. To do that, use an sinteractive session to get onto a compute node:

sinteractive --account=monash001 --gres=gpu:1

To load the CUDA module:

module load cuda

To check the GPU device information:

nvidia-smi
deviceQuery

You should then be able to compile your GPU code. Once the compilation is done, you can run it on the same compute node.
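
The cards on MASSIVE belong to different GPU generations, so it can help to tell nvcc which one to target. The sketch below maps the gres type names used earlier to an `-arch` flag. It assumes that the k20m cards are compute capability 3.5, that the m2070 cards are compute capability 2.0, and that the loaded CUDA module still supports both. `nvcc_arch` and the source file `hello.cu` are hypothetical names used for illustration.

```shell
# Map a gres GPU type to an nvcc -arch flag.
# Assumption: k20m is compute capability 3.5 and m2070 is 2.0.
# nvcc_arch is a hypothetical helper, not part of CUDA or MASSIVE.
nvcc_arch() (
    case "$1" in
        k20m)  echo "-arch=sm_35" ;;
        m2070) echo "-arch=sm_20" ;;
        *)     echo "unknown GPU type: $1" >&2; exit 1 ;;
    esac
)

# Print (do not run) a compile command for a hypothetical source file.
echo "nvcc $(nvcc_arch k20m) -o hello hello.cu"
```

Printing the command rather than running it keeps the sketch usable on a machine without nvcc. On a compute node the printed line could be run directly.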

Attention:

If you attempt to run any CUDA (or OpenCL) application (compiled executable) on the Login node, an error such as ‘no CUDA device found’ may be reported. This is because no CUDA-enabled GPU is installed on the Login node. You have to run such applications on a compute node instead.
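
One way to avoid this error is to check for a GPU before launching an executable. The following is a minimal sketch, assuming that `nvidia-smi -L` prints one line beginning with `GPU` for each visible card. `gpu_available` is a hypothetical helper, and its optional argument exists only so that a different query tool can be substituted.

```shell
# Return success only if at least one NVIDIA GPU is visible on this node.
# The optional argument names the query tool (default: nvidia-smi).
# gpu_available is a hypothetical helper written for illustration.
gpu_available() (
    smi="${1:-nvidia-smi}"
    # No query tool on the PATH: treat this node as having no GPU.
    command -v "$smi" >/dev/null 2>&1 || exit 1
    # Each visible card is listed on a line starting with "GPU ".
    "$smi" -L 2>/dev/null | grep -q '^GPU '
)

if gpu_available; then
    echo "GPU detected: CUDA executables can be run here"
else
    echo "No GPU detected: run on a compute node instead"
fi
```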

Copyright © 2016 MASSIVE. All Rights Reserved.