Leo5: Known Problems and Configuration Changes
Known Problems
Unstable Nodes
As of 14 Feb 2023, Leo5 worker nodes occasionally become unresponsive and need to be rebooted. Problem analysis is underway. Preliminary results (17 Feb 2023) indicate that a bug in the kernel (?) of our systems is triggered by the way we had configured Slurm to gracefully handle (minor) memory over-consumption. As a workaround, we disallow this completely for now (see below).
Intel OneAPI Compilers
The intel-oneapi-compilers modules contain two generations of compilers: Intel Classic (the traditional compilers, declared deprecated by Intel) and Intel OneAPI (based on an LLVM front end, in active development and currently supported by Intel).
In Leo5 acceptance tests - in particular in connection with OpenMPI and possibly Fortran (Netlib HPL) - we observed signs of unexpected results with the OneAPI compilers. As far as we are informed, Intel are looking into this issue.
Problems were also reported by users. Some standard open source packages, such as Kerberos and OpenSSH, do not build with Spack using the OneAPI toolchain.
For the time being, we have removed all packages built with OneAPI from our default Spack-leo5-20230116 instance (spack/v0.19-leo5-20230116-release). For users interested in exploring Intel OneAPI, we are deploying these packages, using the latest available Intel compilers, in the spack/v0.20-leo5-20230124-develop instance.
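If you want to experiment with the OneAPI builds, switch to the develop instance. A minimal sketch, assuming the instance is activated by loading its module (as with the default release instance):
module purge
module load spack/v0.20-leo5-20230124-develop
module avail    # list the packages provided by this instance, including the OneAPI builds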
Mpich and Intel MPI
The introduction of Slurm has made it much easier to support multiple MPI implementations, in particular those that often come with third-party software. Our OpenMPI integration with Slurm works as it should and can be used without technical limitations.
However, we currently have an issue with the Slurm/Cgroups
integration of Mpich and Intel MPI, which causes all remote processes
to be limited to CPU#0 when using the mpirun/mpiexec command.
We are looking into this problem - for the time
being, jobs using Mpich or IntelMPI should be run only in single-node
configurations.
Mpich and Intel MPI work fine if you place your processes with Slurm's srun --mpi=pmi2 command, so this is what we recommend (see the sketch below). The --mpi=pmi2 option is necessary; without it, all tasks start as rank 0.
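A minimal batch-script sketch for a multi-node Mpich or Intel MPI job launched with srun (the module and program names are placeholders):
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=01:00:00
module load intel-oneapi-mpi    # placeholder: load the MPI module your program was built with
# srun with PMI2 assigns each task its correct rank and CPU binding;
# do not use mpirun/mpiexec here until the cgroups issue is resolved
srun --mpi=pmi2 ./myprogram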
RPATH Works Only With OS Supplied Compilers
In LEO5 Intro - RPATH we describe that executables using libraries built with Spack will get a list of all requisite libraries in the RPATH attribute, so there is no need to module load the libraries at runtime. This effect is achieved by having module load set the $LD_RUN_PATH environment variable to a list of the directories containing these libraries at link time.
This mechanism currently works only when you use the OS-supplied compilers (gcc 8.5.0). When any of the compilers installed by Spack are used, the mechanism is effectively broken (overridden by an undocumented change to Spack since version 0.17).
As a temporary workaround, we recommend doing either of the following (pending detailed verification):
- Either:
- When building your software with one of the Spack-supplied compilers, make note of the environment modules needed.
- Before running your programs, first load the modules as noted in step 1, then do
export LD_LIBRARY_PATH=$LD_RUN_PATH
Do this if you do not want to re-build your software.
- Or (recommended for every new build of your software):
- Add the option
-Wl,-rpath=$LD_RUN_PATH
to the commands by which your programs are linked, e.g. by defining
ADD_RPATH = -Wl,-rpath=$(LD_RUN_PATH)
in your Makefile and making sure that the link step in your Makefile contains
$(CC) .... $(ADD_RPATH)
or similar. This will add the contents of LD_RUN_PATH to the RPATH attribute of your executable, and there will be no need to set LD_LIBRARY_PATH at runtime.
This should fix the problem for the time being (see the example below). The root cause of the problem is a deliberate change of behaviour by the Spack developers. Unfortunately, at the moment, there appears to be no simple way to restore the previous behaviour (which was consistent with the documented behaviour of compilers) without such user intervention.
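As a concrete sketch, assuming a program hello.c that uses a library provided by a hypothetical Spack module mylib (the module, compiler module, and file names are placeholders), the two workarounds look roughly like this:
# Workaround 1 (no rebuild): make the libraries findable at run time
module load mylib                         # placeholder: the Spack-provided library module(s) your program needs
export LD_LIBRARY_PATH=$LD_RUN_PATH
./hello
# Workaround 2 (rebuild, recommended): embed the library directories at link time
module load gcc mylib                     # placeholders: a Spack-supplied compiler and the library module
gcc -o hello hello.c -lmylib -Wl,-rpath=$LD_RUN_PATH
./hello                                   # no LD_LIBRARY_PATH needed at run time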
Hints
Hyperthreading
The Slurm option --threads-per-core as originally documented yields incorrect CPU affinity for single-threaded tasks. Use --hint=multithread instead, e.g. as in the sketch below.
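For example, to use both hardware threads of each physical core (a sketch; the task count is arbitrary):
#SBATCH --ntasks=8
#SBATCH --hint=multithread    # run tasks on hyperthreads instead of one task per physical core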
MPI and Python
Currently, the following Python modules have been tested for MPI (mpi4py):
- Anaconda3/2023.03/python-3.10.10-anaconda+mpi-conda-2023.03   obsolete (*)
- Anaconda3/2023.03/python-3.10.10-anaconda-mpi-conda-2023.03
- Anaconda3/2023.03/python-3.11.0-numpy+mpi-conda-2023.03
These modules use the MPICH libraries. To correctly map MPI ranks to individual processes in batch jobs, you need to use Slurm's srun command with the following options:
srun --mpi=pmi2 --export=ALL ./myprogram.py
Omitting the --mpi option will cause all processes to be run independently and have rank 0.
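Putting this together, a minimal batch-job sketch (assuming the "-" variant of the numpy module listed above; myprogram.py is a placeholder for your mpi4py script):
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
module load Anaconda3/2023.03/python-3.11.0-numpy-mpi-conda-2023.03
# PMI2 is required so that each Python process is assigned its own MPI rank
srun --mpi=pmi2 --export=ALL ./myprogram.py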
(*) Note: Module names containing a "+" character no longer work with newer versions of the Environment Modules software (e.g. as installed on LEO5). All affected module names have been duplicated with "+" replaced by "-".
Configuration Changes
2023-02-14: Default Time Limit for Batch Jobs
To improve scheduling flexibility (esp. with backfilling) and encourage users to specify expected execution time for jobs, the default time limit has been lowered from 3 days to 24 hours. Use
--time=[[days-]hours:]minutes[:seconds]
to set a different time limit. The maximum is still 10 days.
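For example, to request two and a half days (a sketch):
#SBATCH --time=2-12:00:00
or, equivalently, pass --time=2-12:00:00 on the sbatch command line.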
2023-02-14 and 2023-02-17: Terminating Jobs Overusing Memory Allocation
In our transition from SGE to Slurm, memory allocations no longer refer to virtual memory but to resident memory. This allows programs that malloc more memory than they actually need to run unmodified and with realistic Slurm memory allocations. When, however, programs access more memory than has been allocated by Slurm, they will begin to page in and out, leading to thrashing. This can severely impact the performance of your job as well as of the entire node, and it appears to trigger a bug in our systems.
To discover and correct this situation, we now have Slurm terminate jobs that exceed their memory allocation by more than 10%. This still allows jobs that over-malloc memory and occasionally cause page faults, but should help prevent thrashing conditions.
If your job was terminated for this type of overuse, you will find the following error messages:
- in your output file: error: Detected 1 oom-kill event(s)[...]. Some of your processes may have been killed by the cgroup out-of-memory handler., and
- in your job summary: The job ran out of memory. Please re-submit with increased memory requirements.
We might further adjust this limit based on future experience.
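If your job was terminated this way, re-submit it with a larger memory request, for example (a sketch; the value is a placeholder):
#SBATCH --mem-per-cpu=4G    # per-CPU memory request; use --mem for a per-node limit instead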
2023-03-15: Links to Anaconda3 Modules
As part of the deployment of the newest Anaconda version, the directory structure of the Anaconda modules has been adjusted to conform to the other LEO systems.
2023-06-21 / 2023-07-18: Module names containing the "+" character
As a result of upgrading the Environment Modules software from version 4.5 to 5.0, loading any module whose name contains the "+" character will fail, because "+" now invokes new functionality.
All affected module names have been
duplicated with "+" replaced by "-". Please replace
module names in your jobs accordingly, e.g. replace
module load Anaconda3/2023.03/python-3.11.0-numpy+mpi-conda-2023
by
module load Anaconda3/2023.03/python-3.11.0-numpy-mpi-conda-2023