Running Low-QoS Jobs
Users have the option to run compute jobs with the Low-QoS flag, for example when a project's credit is close to being exhausted. The main purpose is to let a user who has run out of credit but still needs the cluster run small, short jobs for free while the cluster is idle. A Low-QoS job has the following features:
1) It is not checked against credit allocations before being scheduled to run
2) It can be preempted and requeued by a normal compute job or a vis job
To mark a job as a Low-QoS job, add the following statement to your Slurm submission script:
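The statement itself is not reproduced on this page. As a minimal sketch, assuming the Low-QoS level is selected through the standard sbatch `--qos` option, a submission script might look like the one below. `LOW_QOS_NAME` is a placeholder rather than the real QoS name used on MASSIVE, and the job name, task count and time limit are illustrative values only.

```shell
#!/bin/bash
# Sketch only: LOW_QOS_NAME is a placeholder, not the actual MASSIVE QoS name.
#SBATCH --job-name=lowqos-demo
#SBATCH --qos=LOW_QOS_NAME
#SBATCH --ntasks=1
#SBATCH --time=02:00:00

# SLURM_JOB_ID is set by Slurm inside a job; fall back to a label when run outside one.
msg="Low-QoS job ${SLURM_JOB_ID:-not-in-slurm} starting"
echo "$msg"
```

Ask the MASSIVE team for the QoS name to use in place of the placeholder.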
A Concrete Example:
David runs out of project credits. Around midnight, when MASSIVE is idle, he submits a Low-QoS GPU job, Job A, which occupies a GPU node. The next morning, as more users return, the cluster becomes busy. James then submits a normal-QoS GPU job, Job B. When Job B arrives, Slurm first scans the cluster for a free GPU node. Unfortunately, all GPU nodes are busy. Slurm then finds that Job A, a Low-QoS job, is occupying one GPU node. Job A is killed to make room for Job B and placed in the 'Requeue' state, meaning that Job A will automatically restart from the beginning once a GPU node becomes free.
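A job script can report whether it is running for the first time or after a requeue such as the one in this example. The sketch below assumes the Slurm environment variable `SLURM_RESTART_COUNT`, which the sbatch documentation describes as the number of times a job has been restarted, and treats an unset value as a first run. The `describe_run` helper is a hypothetical name introduced for illustration.

```shell
#!/bin/bash
# Sketch: detect whether this run follows a preemption and requeue.
# Assumption: SLURM_RESTART_COUNT is unset on the first run of a job.
restart_count="${SLURM_RESTART_COUNT:-0}"

describe_run() {
    # $1: number of times the job has been restarted
    if [ "$1" -gt 0 ]; then
        echo "restart #$1: job was requeued and is starting from the beginning"
    else
        echo "first run"
    fi
}

describe_run "$restart_count"
```

Logging this line at the top of a job's output makes it easier to see afterwards how often a Low-QoS job was preempted.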
Principles of Low-QoS Submission
1) Small Size
A one-core serial job is the best fit. A one-node (12-core or 16-core) job is acceptable. Be cautious about running multi-node MPI jobs. The more resources a Low-QoS job occupies, the more likely it is to be preempted.
2) Short Running Time
Less than 4 hours is best. 24 hours is acceptable. If the job needs 1-2 weeks, be careful. Any preemption causes the job to be requeued and re-run from the beginning. You do not want a 14-day job to run for 13 days, be killed, and then start again from scratch.
3) Not Too Many
Job preemption, requeuing, rescheduling and re-running significantly increase the workload of the Slurm controller. Users should therefore limit the number of Low-QoS jobs they submit.
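One way to keep the number of Low-QoS jobs under control is to check how many are already queued or running before submitting another. The sketch below uses the standard `squeue` options `--noheader`, `--user` and `--qos`. `LOW_QOS_NAME` is a placeholder for the real QoS name, and the limit of 10 is an arbitrary illustrative value, not a MASSIVE policy. The helper names `count_low_qos_jobs` and `may_submit` are hypothetical.

```shell
#!/bin/bash
# Sketch: refuse to submit when too many Low-QoS jobs are already queued or running.
# LOW_QOS_NAME and MAX_LOW_QOS_JOBS are placeholders, not MASSIVE settings.
LOW_QOS_NAME="${LOW_QOS_NAME:-LOW_QOS_NAME}"
MAX_LOW_QOS_JOBS="${MAX_LOW_QOS_JOBS:-10}"

count_low_qos_jobs() {
    # Count the lines squeue reports for the current user in the given QoS.
    squeue --noheader --user="$USER" --qos="$LOW_QOS_NAME" | wc -l
}

may_submit() {
    # $1: current job count, $2: self-imposed limit
    [ "$1" -lt "$2" ]
}

# Usage on a login node (job.sh is a hypothetical submission script):
# if may_submit "$(count_low_qos_jobs)" "$MAX_LOW_QOS_JOBS"; then sbatch job.sh; fi
```

Keeping the comparison in its own function means the limit can be checked without contacting the Slurm controller more often than necessary.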
Enabling Low-QoS Job
The MASSIVE SysAdmins reserve the right to cancel any Low-QoS job that is considered unreasonable or that affects the running of normal jobs. If you need further help running Low-QoS jobs, please contact the MASSIVE team: firstname.lastname@example.org