Leo5: Known Problems and Configuration Changes
Known Problems
Unstable Nodes
As of 14 Feb 2023, Leo5 worker nodes occasionally become unresponsive and need to be rebooted. Problem analysis is underway. Preliminary results (17 Feb 2023) indicate that a bug in the kernel (?) of our systems is triggered by the way we had configured Slurm to gracefully handle (minor) memory over-consumption. As a workaround, we disallow this completely for now (see below).
Intel OneAPI Compilers
The intel-oneapi-compilers modules contain two generations of compilers: Intel Classic (the traditional compilers, declared deprecated by Intel) and Intel OneAPI (based on an LLVM front end, in active development and currently supported by Intel).
In Leo5 acceptance tests - in particular in connection with OpenMPI and possibly Fortran (Netlib HPL) - we observed signs of unexpected results with the OneAPI compilers. As far as we are informed, Intel are looking into this issue.
Problems were also reported by users. Some standard open source packages, such as Kerberos and OpenSSH, do not build with Spack using the OneAPI toolchain.
For the time being, we have removed all packages built with OneAPI from our default Spack-leo5-20230116 instance (spack/v0.19-leo5-20230116-release). For users interested in exploring Intel OneAPI, we are deploying these packages, using the latest available Intel compilers, in the spack/v0.20-leo5-20230124-develop instance.
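If you want to experiment with the OneAPI builds, switch to the develop instance. A minimal sketch, assuming the instance is activated by loading its module (as with the default release instance):
module purge
module load spack/v0.20-leo5-20230124-develop
module avail    # list the packages provided by this instance, including the OneAPI builds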
Mpich and Intel MPI
The introduction of Slurm has made it much easier to support multiple MPI implementations, in particular those that often come with third-party software. Our OpenMPI integration with Slurm works as it should and can be used without technical limitations.
However, we currently have an issue with the Slurm/Cgroups
integration of Mpich and Intel MPI, which causes all remote processes
to be limited to CPU#0 when using the mpirun/mpiexec command.
We are looking into this problem - for the time
being, jobs using Mpich or IntelMPI should be run only in single-node
configurations.
Mpich and Intel MPI work fine if you place your processes with Slurm's srun --mpi=pmi2 command, so this is what we recommend (see the sketch below). The --mpi=pmi2 option is necessary; without it, all tasks start as rank 0.
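A minimal batch-script sketch for a multi-node Mpich or Intel MPI job launched with srun (the module and program names are placeholders):
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=01:00:00
module load intel-oneapi-mpi    # placeholder: load the MPI module your program was built with
# srun with PMI2 assigns each task its correct rank and CPU binding;
# do not use mpirun/mpiexec here until the cgroups issue is resolved
srun --mpi=pmi2 ./myprogram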
RPATH Works Only With OS Supplied Compilers
In LEO5 Intro - RPATH we describe that executables using libraries built with Spack will get a list of all requisite libraries in the RPATH attribute, so there is no need to module load the libraries at runtime. This effect is achieved by having module load set the $LD_RUN_PATH environment variable to a list of the directories containing these libraries at link time.
This mechanism currently works only when you use the OS-supplied compilers (gcc 8.5.0). When any of the compilers installed by Spack are used, the mechanism is effectively broken (overridden by an undocumented change to Spack since version 0.17).
As a temporary workaround, we recommend doing either of the following (pending detailed verification):
- Either:
- When building your software with one of the Spack-supplied compilers, make note of the environment modules needed.
- Before running your programs, first load the modules as noted in step 1, then do
export LD_LIBRARY_PATH=$LD_RUN_PATH
Do this if you do not want to re-build your software.
- Or (recommended for every new build of your software):
- Add the option
-Wl,-rpath=$LD_RUN_PATH
to the commands by which your programs are linked, e.g. by defining
ADD_RPATH = -Wl,-rpath=$(LD_RUN_PATH)
in your Makefile and making sure that the link step in your Makefile contains
$(CC) .... $(ADD_RPATH)
or similar. This will add the contents of LD_RUN_PATH to the RPATH attribute of your executable, and there will be no need to set LD_LIBRARY_PATH at runtime.
This should fix the problem for the time being (see the example below). The root cause of the problem is a deliberate change of behaviour by the Spack developers. Unfortunately, at the moment, there appears to be no simple way to restore the previous behaviour (which was consistent with the documented behaviour of compilers) without such user intervention.
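As a concrete sketch, assuming a program hello.c that uses a library provided by a hypothetical Spack module mylib (the module, compiler module, and file names are placeholders), the two workarounds look roughly like this:
# Workaround 1 (no rebuild): make the libraries findable at run time
module load mylib                         # placeholder: the Spack-provided library module(s) your program needs
export LD_LIBRARY_PATH=$LD_RUN_PATH
./hello
# Workaround 2 (rebuild, recommended): embed the library directories at link time
module load gcc mylib                     # placeholders: a Spack-supplied compiler and the library module
gcc -o hello hello.c -lmylib -Wl,-rpath=$LD_RUN_PATH
./hello                                   # no LD_LIBRARY_PATH needed at run time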
Hints
Hyperthreading
The Slurm option --threads-per-core as originally documented yields incorrect CPU affinity for single-threaded tasks. Use --hint=multithread instead, e.g. as in the sketch below.
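For example, to use both hardware threads of each physical core (a sketch; the task count is arbitrary):
#SBATCH --ntasks=8
#SBATCH --hint=multithread    # run tasks on hyperthreads instead of one task per physical core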
MPI and Python
Currently, the following Python modules have been tested for MPI (mpi4py):
- Anaconda3/2023.03/python-3.10.10-anaconda+mpi-conda-2023.03   obsolete (*)
- Anaconda3/2023.03/python-3.10.10-anaconda-mpi-conda-2023.03
- Anaconda3/2023.03/python-3.11.0-numpy+mpi-conda-2023.03
These modules use the MPICH libraries. To correctly map MPI ranks to individual processes in batch jobs, you need to use Slurm's srun command with the following options:
srun --mpi=pmi2 --export=ALL ./myprogram.py
Omitting the --mpi option will cause all processes to be run independently and have rank 0.
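Putting this together, a minimal batch-job sketch (assuming the "-" variant of the numpy module listed above; myprogram.py is a placeholder for your mpi4py script):
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
module load Anaconda3/2023.03/python-3.11.0-numpy-mpi-conda-2023.03
# PMI2 is required so that each Python process is assigned its own MPI rank
srun --mpi=pmi2 --export=ALL ./myprogram.py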
(*) Note: Module names containing a "+" character no longer work with newer versions of the Environment Modules software (e.g. as installed on LEO5). All affected module names have been duplicated with "+" replaced by "-".
Configuration Changes
2023-02-14: Default Time Limit for Batch Jobs
To improve scheduling flexibility (esp. with backfilling) and encourage users to specify expected execution time for jobs, the default time limit has been lowered from 3 days to 24 hours. Use
--time=[[days-]hours:]minutes[:seconds]
to set a different time limit. The maximum is still 10 days.
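For example, to request two and a half days (a sketch):
#SBATCH --time=2-12:00:00
or, equivalently, pass --time=2-12:00:00 on the sbatch command line.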
2023-02-14 and 2023-02-17: Terminating Jobs Overusing Memory Allocation
In our transition from SGE to Slurm, memory allocations no longer refer to virtual memory but to resident memory. This allows programs that malloc more memory than they actually need to run unmodified and with realistic Slurm memory allocations. When, however, programs access more memory than has been allocated by Slurm, they will begin to page in and out, leading to thrashing. This can severely impact the performance of your job as well as of the entire node, and it appears to trigger a bug in our systems.
To discover and correct this situation, we now have Slurm terminate jobs that exceed their memory allocation by more than 10%. This still allows jobs that over-malloc memory and occasionally cause page faults, but should help prevent thrashing conditions.
If your job was terminated for this type of overuse, you will find the following error messages:
- in your output file: error: Detected 1 oom-kill event(s)[...]. Some of your processes may have been killed by the cgroup out-of-memory handler., and
- in your job summary: The job ran out of memory. Please re-submit with increased memory requirements.
We might further adjust this limit based on future experience.
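If your job was terminated this way, re-submit it with a larger memory request, for example (a sketch; the value is a placeholder):
#SBATCH --mem-per-cpu=4G    # per-CPU memory request; use --mem for a per-node limit instead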
2023-03-15: Links to Anaconda3 Modules
As part of the deployment of the newest Anaconda version, the directory structure of the Anaconda modules has been adjusted to conform to the other LEO systems.
2023-06-21 / 2023-07-18: Module names containing the "+" character
As a result of upgrading the Environment Modules software from version 4.5 to 5.0, loading any module whose name contains the "+" character will fail, because "+" now invokes new functionality.
All affected module names have been
duplicated with "+" replaced by "-". Please replace
module names in your jobs accordingly, e.g. replace
module load Anaconda3/2023.03/python-3.11.0-numpy+mpi-conda-2023
by
module load Anaconda3/2023.03/python-3.11.0-numpy-mpi-conda-2023