Stage1 and Stage2 Software
MASSIVE was purchased in two stages.
Stage1 consisted of Tesla M2070 GPUs and Intel Xeon X5650 (Westmere) CPUs. All Desktop nodes are Stage1.
Stage2, purchased a few years later, consisted of Tesla K20M GPUs and Intel Xeon E5-2670 (Sandy Bridge) CPUs.
The newer CPUs support additional built-in instructions (“opcodes”) that allow some operations to run much faster. Some code compiled on these nodes detects and uses the new opcodes (e.g. openblas/0.2.13). This is good for speed, but code compiled on Stage2 nodes may not run on Stage1 nodes: when a Stage1 CPU tries to execute one of the new opcodes, the program crashes with “Illegal Instruction” or, worse, with no message at all.
Python and a few other libraries were found to have been compiled for Stage2 nodes and show this behaviour.
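To see which kind of CPU a node has, one option is to inspect the CPU flags that Linux reports. The sketch below assumes that AVX is one of the new instruction sets involved (Sandy Bridge CPUs support AVX, Westmere CPUs do not); the function name cpu_stage and the stage labels it prints are illustrative, not part of any MASSIVE tooling.

```shell
#!/bin/bash
# Classify a node as "stage1" or "stage2" from its CPU flags.
# ASSUMPTION: the presence of the "avx" flag separates the two stages
# (Sandy Bridge has AVX, Westmere does not). cpu_stage is a hypothetical helper.
cpu_stage() {
    # Optional argument: path to a cpuinfo-style file (defaults to /proc/cpuinfo)
    local cpuinfo="${1:-/proc/cpuinfo}"
    # -w matches "avx" as a whole word, so "avx2" alone does not count
    if grep -q -w avx "$cpuinfo" 2>/dev/null; then
        echo "stage2"
    else
        # Also reached when the file cannot be read
        echo "stage1"
    fi
}

echo "This node looks like: $(cpu_stage)"
```

Running this on the node where a program crashed with “Illegal Instruction” gives a quick hint as to whether a Stage2 build was executed on a Stage1 CPU.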
To resolve these issues we will use the following approach:
Modules with a known issue will be left unchanged, but their description will be tagged with “(stage2 compiled)”:
module show openblas/0.2.13
module-whatis an optimized BLAS library based on GotoBLAS2 1.13 BSD version (stage2 compiled) (v0.2.13)
For each of the above modules, a new Stage1 build will be compiled and will have stage1 in the module name (e.g. openblas/0.2.13stage1).
Note: The exception to the above is Python, where the default was changed to a new version compiled to run on all nodes.
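The naming scheme above can be used from a script. The following is a minimal sketch, assuming the “(stage2 compiled)” tag and the stage1 name suffix are applied exactly as described; the helper names is_stage2_compiled and pick_module are hypothetical, and the stage argument is supplied by the caller.

```shell
#!/bin/bash
# Succeeds if the module description read from stdin carries the
# "(stage2 compiled)" tag. Hypothetical helper.
is_stage2_compiled() {
    grep -q -F "(stage2 compiled)"
}

# Print the module name to load.
#   $1 = module name, e.g. openblas/0.2.13
#   $2 = stage of the node the code will run on: "stage1" or "stage2"
#   stdin = output of "module show" for that module
pick_module() {
    local name="$1" stage="$2"
    if [ "$stage" = "stage1" ] && is_stage2_compiled; then
        # Stage2-only build requested on a Stage1 node: use the stage1 variant
        echo "${name}stage1"
    else
        echo "$name"
    fi
}

# Example on a login node ("module show" typically writes to stderr, hence 2>&1):
#   module load "$(module show openblas/0.2.13 2>&1 | pick_module openblas/0.2.13 stage1)"
```

The sketch only decides on a name; it does not check that the stage1 variant of a given module has actually been installed.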
Selecting Stage 1 or Stage 2 Nodes in Slurm
Users can ensure that code runs on specific nodes in Slurm jobs by using the --constraint option of sbatch or srun.
This is useful if you wish to isolate a problem to a specific architecture, or if you have code that should take advantage of the new opcodes.
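A batch script using the constraint option might look like the sketch below. The feature name stage2 is a placeholder, not a confirmed MASSIVE feature name; the features actually defined on the cluster can be listed with sinfo -o "%N %f". The AVX check is likewise an assumption about which new opcodes distinguish the two stages.

```shell
#!/bin/bash
#SBATCH --job-name=stage-check
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
# PLACEHOLDER feature name: replace "stage2" with a feature reported by
#   sinfo -o "%N %f"
#SBATCH --constraint=stage2

# Record which node ran the job and whether its CPU advertises AVX
node_name=$(uname -n)
if grep -q -w avx /proc/cpuinfo 2>/dev/null; then
    echo "${node_name}: AVX available"
else
    echo "${node_name}: AVX not available"
fi
```

Submitting the same script twice, once per feature name, is a simple way to confirm that a crash only occurs on one architecture.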