Supercomputer LEO4 of the ZID (IT-Center)

Overview

LEO4 is a high performance compute cluster of the ZID (IT-Center) at the University of Innsbruck, operated in close cooperation with the Research Area Scientific Computing.

The system consists of 50 nodes with Intel Xeon (Broadwell/Skylake) processors.

Node types in detail:

Node Type	Nodes	Cores/Node	Memory/Node	GPUs
Standard	44	28 × Broadwell	64 GB	-
Big Memory	4	28 × Broadwell	512 GB	-
GPU	1	28 × Skylake	384 GB	4 × Nvidia Tesla V100
Fat Memory	1	80 × Skylake	3000 GB	-
Total	50	1452	8.4 TB	4 × V100
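The node and core totals in the table can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the per-type figures are copied from the table above, while the names NODE_TYPES and totals are made up for this example.

```python
# Node inventory copied from the table above: (node type, node count, cores per node).
NODE_TYPES = [
    ("Standard", 44, 28),
    ("Big Memory", 4, 28),
    ("GPU", 1, 28),
    ("Fat Memory", 1, 80),
]


def totals(node_types):
    """Return (total nodes, total cores) for a list of node types."""
    nodes = sum(count for _, count, _ in node_types)
    cores = sum(count * cores_per_node for _, count, cores_per_node in node_types)
    return nodes, cores


total_nodes, total_cores = totals(NODE_TYPES)
print(total_nodes, total_cores)  # 50 1452
```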

The system has a high-performance, low-latency InfiniBand interconnect, which carries both MPI communication between nodes and GPFS file system traffic. The GPFS ($SCRATCH) file system has a usable capacity of 157 TB (up to 26 TB flash).

Applying for User Accounts

Note: all active Leo3 and Leo3e accounts are automatically activated for Leo4.

Please proceed with the following steps if you intend to get an account for this cluster:

Because the computer center operates several other high-performance computing machines, we recommend that you first consult the system administrators (zid-cluster-admin@uibk.ac.at) to evaluate whether this is the right system for your needs.
Note: In case you need more scratch space please contact the HPC staff directly.
Please download and fill in this application form. You need a user account name corresponding to one of the University's institutes (no student accounts). By default, all new accounts will be created as power-user accounts with the corresponding service parameters of the system. If you are unsure about how to fill in the form, the ZID HPC staff will gladly assist you.
a. If there is a representative of the Research Area Scientific Computing within your field of research, ask this person to confirm the feasibility of your project by signature. If no appropriate representative is available, proceed directly to b.
b. Each application needs to be confirmed by signature by the Head of the Research Area Scientific Computing.
Once the application form has been filled in and signed by the Head of the Research Area, please contact the ZID HPC staff to arrange an appointment for a usage briefing of about half an hour, in which we provide basic usage instructions and information about the HPC system.
The application form will be sent to the ZID HPC Team (Technikerstrasse 23, A-6020 Innsbruck) by the Research Area. Alternatively, you may take it with you to the arranged appointment.
After all the preceding steps have been performed, it usually takes one business day to set up your account with the ZID User Services (ZID Benutzerservice).

Acknowledging Cluster Usage

Users are required to acknowledge their use of the UIBK HPC systems by assigning all resulting publications to the Research Area Scientific Computing within the Forschungsleistungsdokumentation (FLD, https://www.uibk.ac.at/fld/) of the University of Innsbruck, and by adding the following statement to each publication's acknowledgment:

The computational results presented here have been achieved (in part) using the LEO HPC infrastructure of the University of Innsbruck.

Using the Cluster

All HPC clusters at the University of Innsbruck hosted at the ZID comply with a common set of usage regulations, which are summarized in the following sub-sections. Please do take the time to read the guidelines carefully, in order to make optimal use of our HPC systems.

First Time Instructions

See this quick start tutorial to clear the first hurdles after your account has been activated:

Log in to the cluster
Change your password
Copy files from and to the cluster

Setting up the Software (Modules) Environment

There are a variety of application and software development packages available on our cluster systems. In order to utilize these efficiently and to avoid inter-package conflicts, we employ the Environment Modules package on all of our cluster systems.

See the modules environment tutorial to learn how to customize your personal software configuration.

Submitting Jobs to the Cluster

On most of our systems, the distribution of jobs is handled by the Son of Grid Engine (SGE) batch scheduler, the successor to the former Sun Grid Engine.

See the SGE usage tutorial to find out how to appropriately submit your jobs to the batch scheduler, i.e. the queuing system.
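As a rough illustration of what a submission looks like, the sketch below assembles a qsub command line without running it. The options -N, -cwd and -l h_rt are standard Grid Engine options, but whether LEO4 expects exactly these (or additional ones, such as a parallel environment) is an assumption here; the SGE usage tutorial is authoritative. The function name build_qsub_command, the script name run_simulation.sh and the job name are hypothetical.

```python
import shlex


def build_qsub_command(script, job_name, runtime="01:00:00"):
    """Assemble a qsub argument list without executing it."""
    return [
        "qsub",
        "-N", job_name,           # job name shown in the queue listing
        "-cwd",                   # run the job in the submission directory
        "-l", f"h_rt={runtime}",  # requested wall-clock limit (hh:mm:ss)
        script,                   # hypothetical job script
    ]


cmd = build_qsub_command("run_simulation.sh", "my_first_job")
# Print the command for inspection; on a login node the list could be
# handed to subprocess.run(cmd, check=True) instead.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems if a job name or path contains spaces.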

Submitting Jobs to the GPU Node

One node (n049.leo4) has 4 NVIDIA V100 GPUs (32 GB of RAM each) and 384 GB of RAM. To use this node, select the gpustd.q queue. The linked GPU documentation describes the hardware and the GPU performance (SP, DP, ML/DL), the available software (CUDA, PyTorch, Anaconda, TensorFlow, ...), and how to submit GPU jobs.
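A minimal sketch of a job script that selects this queue might look as follows. Only the queue name gpustd.q comes from the text above; the function name gpu_job_script, the job name and the command python train.py are hypothetical, and any site-specific GPU resource request (for example, the number of GPUs) is deliberately left out because it is not documented here.

```python
def gpu_job_script(command, job_name="gpu_job"):
    """Return the text of an SGE job script that targets the gpustd.q queue."""
    lines = [
        "#!/bin/bash",
        f"#$ -N {job_name}",  # job name
        "#$ -q gpustd.q",     # queue named in the text above
        "#$ -cwd",            # start in the submission directory
        "",
        command,              # hypothetical payload command
    ]
    return "\n".join(lines) + "\n"


script_text = gpu_job_script("python train.py")
print(script_text)
```

The resulting text can be written to a file and submitted with qsub; consult the GPU documentation for the options actually required on LEO4.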

Status Information and Resource Limitations

To ensure efficient cluster utilization, a balanced workload and, most importantly, a fair share of resources for all cluster users, the queuing system imposes several limits that you need to consider when submitting a job.

See the resource requirements and limitations document to learn how to handle these limitations efficiently.

Checkpointing and Restart Techniques

As High Performance Computing (HPC) systems are by design not high-availability systems, it is highly recommended to integrate some sort of checkpointing facility into your application in order to avoid job failure and loss of results.

See the checkpointing and restart tutorial for guidance on how to integrate this checkpoint procedure with the SGE batch scheduler.
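To make the idea concrete, here is a minimal application-level checkpointing sketch. It is not the procedure from the tutorial and does not show the SGE integration; it only illustrates the general pattern of saving state periodically and resuming from the last saved state. All names (save_checkpoint, load_checkpoint, run, the JSON state layout) are assumptions made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path, default):
    """Return the saved state, or the default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path) as handle:
        return json.load(handle)


def run(path, total_steps, checkpoint_every=10, stop_after=None):
    """Sum the integers 0..total_steps-1, resuming from a checkpoint if present."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            return None  # simulate the job being killed mid-run
    save_checkpoint(path, state)
    return state["total"]
```

Calling run a second time with the same checkpoint path continues from the last saved step instead of starting over, so at most checkpoint_every steps of work are lost when a job is interrupted.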

Storing Your Data

Every time you log in to the cluster, all storage areas available to you (i.e. the corresponding directories) and their used percentages are listed before the current important messages. In general, the first two list items are of major importance:

Your home directory (also available via the environment variable $HOME) provides you with a small but highly secure storage area, which is backed up every day. This is the place to store your important data, such as source code, valuable input files, etc.
The second listed storage area is the cluster's scratch space. It is also accessible via the environment variable $SCRATCH. This area provides enough space for large data sets and is designed for the cluster's high-speed I/O. Use this storage for writing the output of your calculations and for large input data.
Please note that the scratch space is designed for size and speed and is therefore not high-availability storage. Make sure to secure important files regularly, as total data loss - though improbable - cannot be excluded.
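The division of labor between the two areas can be sketched as follows: work in $SCRATCH, then copy the few files that matter into $HOME, which is backed up. This is only an illustration under the assumption that both environment variables are set as described above; the function names scratch_workdir and secure_to_home and the results subdirectory are hypothetical.

```python
import os
import shutil
from pathlib import Path


def scratch_workdir(job_name):
    """Create and return a per-job working directory below $SCRATCH."""
    workdir = Path(os.environ["SCRATCH"]) / job_name
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir


def secure_to_home(result_file, subdir="results"):
    """Copy an important file from scratch into the backed-up home directory."""
    target_dir = Path(os.environ["HOME"]) / subdir
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves timestamps; it returns the path of the new copy
    return Path(shutil.copy2(result_file, target_dir))
```

A job would typically write its output below scratch_workdir("my_job") and call secure_to_home only for small, valuable files such as final summaries, since the home directory is small.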

Further listed storage areas are mostly for data exchange purposes. Please contact the ZID cluster administration if you are unsure about storage usage.

Available Software Packages

On each of our clusters we provide a broad variety of software packages, such as compilers, parallel environments, numerical libraries, scientific applications, etc.

Hardware Components

The Leo4 cluster system consists of 50 compute nodes (with a total of 1452 cores), 2 redundant file servers and one login node. Two nodes serve special purposes: one node with 28 cores, 4 Nvidia V100 GPUs and 384 GB memory, and one node with 80 cores and 3 TB memory. All nodes are connected by a 100 Gb/s InfiniBand high-speed interconnect (consolidated MPI and storage network). The storage system offers 157 TB (up to 26 TB flash) of usable storage.
The cluster was purchased from EDV-Design. It is an IBM/Lenovo Nextscale system.

Cluster status and history

March 2023 Replacement of the IBM flash system FS900 with FS5200 (+10 TB more capacity)

Q2/Q3 2020 Start of Test Operation and Regular User Operation of GPU node and 3TB fat-memory node

November 2018 Start of Regular User Operation

August 2018 Start of Friendly User Test Operation

January 2018 Delivery, installation

