
Running Low-QoS Jobs

Users can run compute jobs with the Low-QoS flag, for example when a project's credit is close to exhausted. The main purpose is to let a user who has run out of credit, but still needs to use the cluster, run small, short jobs for free when the cluster is idle. A Low-QoS job has the following features:

1) It is not checked against credit allocations before being scheduled to run.

2) It can be preempted and requeued by a normal compute job or a vis job.

 

To mark a job as a Low-QoS job, add the following statement to your Slurm submission script:

#SBATCH --qos="low_qos_m2"
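
For context, a minimal sketch of a complete submission script using this flag might look like the following. Only the --qos line comes from this page; the job name, core count, wall time and the placeholder workload are assumptions for illustration. The --requeue option is a standard sbatch flag, but it may be unnecessary if the site configuration already allows requeuing.

```shell
#!/bin/bash
#SBATCH --job-name=lowqos-demo    # hypothetical job name
#SBATCH --qos="low_qos_m2"        # Low-QoS flag, as given above
#SBATCH --ntasks=1                # assumption: one-core serial job
#SBATCH --time=04:00:00           # assumption: 4-hour wall-time limit
#SBATCH --requeue                 # allow requeue after preemption (may already be the site default)

# SLURM_JOB_ID is set by Slurm inside a job; fall back to "unknown" elsewhere.
job_id="${SLURM_JOB_ID:-unknown}"
echo "Low-QoS job ${job_id} starting"

# Placeholder for the real workload (hypothetical).
echo "work goes here"
```

Saved as, say, lowqos.sh (a hypothetical file name), the script would be submitted with: sbatch lowqos.sh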

 

A Concrete Example:

David runs out of project credits. Around midnight, when MASSIVE becomes idle, he submits a Low-QoS GPU job, Job A, which occupies a GPU node. The next morning, when more users are back, the cluster becomes busy. James then submits a normal-QoS GPU job, Job B. When Job B arrives, Slurm first scans the cluster for a free GPU node. Unfortunately, all GPU nodes are busy. Slurm then finds that Job A, which is occupying one GPU node, is a Low-QoS job. Job B preempts Job A: Job A is killed and put into the 'Requeue' status, meaning it will automatically restart from the beginning when a GPU node becomes free later.
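
A job in David's position can tell from inside its own script whether it has been requeued. The sketch below assumes the standard Slurm environment variable SLURM_RESTART_COUNT, which Slurm sets for a job that has been restarted and leaves unset on the first run. The helper name restart_count and the messages are hypothetical.

```shell
#!/bin/bash
# Print how many times Slurm has restarted this job (0 on the first run).
# SLURM_RESTART_COUNT is unset until the job has been requeued at least once.
restart_count() {
  echo "${SLURM_RESTART_COUNT:-0}"
}

if [ "$(restart_count)" -gt 0 ]; then
  echo "Restarted $(restart_count) time(s) after requeue; running again from the beginning."
else
  echo "First run of this job."
fi
```

Logging this at the top of a job script makes it easier to see from the output file how often a Low-QoS job was preempted.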

 

Principles of Low-QoS Submission

1) Small Size

A one-core serial job is the best fit. A one-node (12-core or 16-core) job is acceptable. Be cautious about running multi-node MPI jobs. The more resources a Low-QoS job occupies, the more likely it is to be preempted.

2) Short Running Time

Less than 4 hours is best; 24 hours is acceptable. If the job needs 1-2 weeks, be careful: any preemption causes the job to be requeued and re-run from the beginning. You would not want a 14-day job to run for 13 days, be killed, and then re-run from scratch.

3) Not Too Many

Job preemption, requeuing, rescheduling and re-running significantly increase the workload of the Slurm controller. Users should therefore limit the number of Low-QoS jobs they submit.
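
Because a preempted job is re-run from the beginning (principle 2 above), one way to limit lost work is for the job script itself to record its progress and skip finished steps after a requeue. The following is a minimal sketch of that idea, not a MASSIVE-provided mechanism. The checkpoint file name, the step count and the do_step placeholder are all hypothetical, and it assumes the work can be split into independent, numbered steps.

```shell
#!/bin/bash
#SBATCH --qos="low_qos_m2"
#SBATCH --ntasks=1
#SBATCH --time=04:00:00

# Hypothetical checkpoint file; keep it on storage that survives a requeue.
CHECKPOINT="${CHECKPOINT:-lowqos_checkpoint.txt}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"   # hypothetical number of work units

# Print the last completed step, or 0 if no checkpoint exists yet.
last_done() {
  if [ -f "$CHECKPOINT" ]; then
    cat "$CHECKPOINT"
  else
    echo 0
  fi
}

# Placeholder for one unit of real work (hypothetical).
do_step() {
  echo "running step $1"
}

# Resume from the step after the last one recorded.
start=$(( $(last_done) + 1 ))
for (( step = start; step <= TOTAL_STEPS; step++ )); do
  do_step "$step"
  # Write to a temporary file and rename it, so a kill in mid-write
  # does not leave a half-written checkpoint behind.
  echo "$step" > "${CHECKPOINT}.tmp"
  mv "${CHECKPOINT}.tmp" "$CHECKPOINT"
done
```

If the job is killed after step 3, the requeued run reads 3 from the checkpoint file and continues at step 4 instead of starting over. Remove the checkpoint file before submitting a fresh run.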

 

Enabling Low-QoS Jobs

The MASSIVE SysAdmins reserve the right to cancel any Low-QoS jobs that are considered unreasonable or that affect the running of normal jobs. If you need further help running Low-QoS jobs, please contact the MASSIVE team: help@massive.org.au

 

Copyright © 2016 MASSIVE. All Rights Reserved.