
Diagnosing Problems with Jobs

Job Pending

A job can be in 'Pending' status for many reasons. The following user scripts report the reason in about 90% of cases:

show_job

and

show_job [JOBID]

and the following script to show the cluster status:

show_cluster

 

For example, for a running job, 'show_job' displays:

[Screenshot: show_job output for a running job]

 

For a pending job, it shows a red indicator:

[Screenshot: show_job output for a job pending on resources]

The last column gives the reason: No Avail Resource. This is self-explanatory: the MASSIVE cluster is busy and no resources are currently available to match your request. You need to wait until other users' jobs complete and free up resources. Other reasons can also leave a job pending.

 

User Limit

MASSIVE applies per-user limits to keep access to the shared cluster fair. Each user has a limit on the number of CPU cores, the amount of memory, and the number of desktop jobs. For instance, on Massive 2, each user can consume at most 300 CPU cores simultaneously; once this limit is reached, further jobs stay pending. This looks like:

[Screenshot: show_job output for a job pending on the user CPU core limit]

If you see this, you are most likely already using close to your limit of CPU cores. The job will leave the pending state automatically once some of your jobs complete and your CPU core usage drops.
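
The effect of the core limit can be illustrated with a small sketch. It assumes that a job stays pending whenever the cores you already occupy plus the cores it requests would exceed the limit; the function names are hypothetical and this is not the scheduler's actual implementation. Only the figure of 300 cores comes from the text above.

```python
# Hypothetical model of a per-user CPU core limit.
# Assumption: a job pends when cores in use + cores requested exceeds the limit.
CORE_LIMIT = 300  # per-user core limit on Massive 2, as stated above


def will_pend(cores_in_use: int, cores_requested: int, limit: int = CORE_LIMIT) -> bool:
    """Return True if starting the job would push the user over the core limit."""
    return cores_in_use + cores_requested > limit


def cores_to_free(cores_in_use: int, cores_requested: int, limit: int = CORE_LIMIT) -> int:
    """Number of occupied cores that must be released before the job fits."""
    return max(0, cores_in_use + cores_requested - limit)
```

For example, under this assumption a user occupying 290 cores who submits a 16-core job would see it pend until at least 6 of their cores are released.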

The 'show_job' command can also show a summary report of how many CPU cores you are consuming:

[Screenshot: show_job summary report of CPU cores in use]

 

 

Reject Job Submission

On MASSIVE, Slurm is configured to check the project's account balance at job submission time. Slurm will reject a job submission if:

 

1) The project's available credit is insufficient to run the job

CASE: A project has 1000 CPU hours available, but the user submits a job requesting 2000 CPU hours. Slurm rejects the job and displays an error message:

------------------------------------------------------------------------
                 Insufficient Credit!           
            Job Reqires 2000 (CPU mins)
   Project 'xxx' Only has 1000 (mins) Available
   Please Contact MASSIVE Helpdesk: help@massive.org.au
------------------------------------------------------------------------

 

2) The project has enough available credit, but running jobs from the same project are consuming credit and will eventually use it all up, so the usable credit is insufficient for a new job.

CASE: A project has 20k CPU hours available. The user submits 2000 jobs, each requesting 100 CPU hours, for a total request of 200k CPU hours.

Slurm checks all jobs that use this project account (Running or Pending) and reserves the CPU hours requested by each job. The usable credit of the account is calculated as:
Usable_Credit = Project_Available_Credit - Reserved_Credit

In the above case:
Submit 1 job   ->  Usable_Credit=19900
Submit 2 jobs ->  Usable_Credit=19800
......
Submit 199 jobs ->  Usable_Credit=100
Submit 200 jobs ->  Usable_Credit=0
Submit 201 jobs ->  Reject
Submit 202 jobs ->  Reject

From job #201 onward, the user is notified with:

------------------------------------------------------------------
                 Insufficient Credit!           
------------------------------------------------------------------
This Job Reqires 200 (CPU mins)    
Project 'xxx' has 20000 (mins) Available
Existing Jobs are Consuming 20000 (mins)  
Account has 0 (mins) for New Job Submission    
Please Contact MASSIVE Helpdesk: help@massive.org.au
------------------------------------------------------------------
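
The reservation rule above can be sketched as a toy model. It assumes a job is accepted only if its request does not exceed the usable credit, and that an accepted job's request is reserved immediately. The class and function names are illustrative only and are not part of Slurm or of MASSIVE's configuration.

```python
from dataclasses import dataclass


@dataclass
class ProjectAccount:
    """Toy model of a project account; names are illustrative, not Slurm's."""
    available_credit: int      # CPU hours the project has available
    reserved_credit: int = 0   # CPU hours reserved by running or pending jobs

    @property
    def usable_credit(self) -> int:
        # Usable_Credit = Project_Available_Credit - Reserved_Credit
        return self.available_credit - self.reserved_credit

    def submit(self, requested: int) -> bool:
        """Accept the job and reserve its credit, or reject it."""
        if requested > self.usable_credit:
            return False  # insufficient usable credit: submission rejected
        self.reserved_credit += requested
        return True


def count_accepted(available: int, per_job: int, n_jobs: int) -> int:
    """Submit n_jobs identical jobs and count how many are accepted."""
    account = ProjectAccount(available_credit=available)
    return sum(account.submit(per_job) for _ in range(n_jobs))


# Case 2 above: 20000 CPU hours available, 2000 jobs of 100 CPU hours each.
print(count_accepted(20000, 100, 2000))  # 200
```

The same model covers case 1: an account with 1000 CPU hours available rejects a single job requesting 2000 CPU hours, because the request exceeds the usable credit before anything has been reserved.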

If you need more information, email help@massive.org.au for assistance.

Copyright © 2016 MASSIVE. All Rights Reserved.