HPC Systems of the ZID - Statement of Service Levels for Systems
Stable IT services require adequate protection against outages and loss of data. To ensure continued operation and to minimize the risk of data loss, essential components of our HPC clusters are designed in a redundant manner, ensuring that certain defects (such as an individual disk failure) will not affect availability of data or services. For a period of several years after acquisition of each system, service contracts allow timely replacement of failed components, keeping the entire system in a trouble-free state of operation.
To maximize the benefit from its investments, the ZID will try to continue operation of existing systems for some time even after their maintenance contracts have expired. If you use our systems, you should understand the implications and risks of this strategy, so that you can plan your research projects accordingly.
In the following paragraphs, we inform you about the current and planned maintenance status of each of our systems, and what it means to use systems that are no longer under maintenance.
Risks of using systems not under maintenance
Modern hardware is relatively stable. However, the failure probability of an individual component, multiplied by the number of machines in a cluster, results in a significant overall failure rate, at least for certain components.
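The scaling effect can be illustrated with a short calculation. The following is a minimal sketch, assuming that components fail independently of each other; the 2% annual failure probability and the component counts are hypothetical figures chosen for illustration and are not ZID statistics.

```python
def prob_at_least_one_failure(p_single: float, n_components: int) -> float:
    """Probability that at least one of n components fails.

    Assumes independent failures with the same probability p_single
    per component (a simplifying assumption, not a measured model).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_components < 0:
        raise ValueError("n_components must not be negative")
    # Complement of "no component fails at all".
    return 1.0 - (1.0 - p_single) ** n_components


# Hypothetical annual failure probability of a single disk.
annual_disk_failure = 0.02

for n in (1, 100, 1000):
    p = prob_at_least_one_failure(annual_disk_failure, n)
    print(f"{n:5d} disks: probability of at least one failure = {p:.3f}")
```

Even a small per-component probability approaches certainty once several hundred components are involved, which is why at least some failures must be expected over the lifetime of a cluster.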
An individual failure may, depending on its nature, result in one or more of the following consequences:
- Unplanned malfunction or termination of individual jobs,
- Reduction of processing capacity (loss of individual nodes),
- Downgraded communication bandwidth,
- Temporary or permanent loss of access to data or system functionality,
- Complete termination of system operation, possibly including the permanent loss of certain data.
Depending on the failed component and its replacement cost, the ZID may or may not decide to repair a system. If a repair is undertaken, the time to repair may be significantly longer than for a system that is under maintenance, resulting in outages that may last for days or even several weeks. We estimate this risk to be relatively low, but we cannot rule it out.
Recommendations and precautions
- Regularly back up important data, particularly data in temporary file systems such as SCRATCH. On the LEO systems only, data in HOME directories are backed up by the ZID and are thus at significantly lower risk. As with all systems, you may still decide to keep a backup of your own.
- For the systems operated by external partners (MACH2 and VSC), there is no backup of user data at all. Please safeguard important data by regularly copying it to your own media, e.g. using rsync.
- All machines are operated on a best effort basis. Try not to depend on continuous system availability to meet important deadlines or research goals.
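As an illustration of the rsync recommendation above, the following minimal sketch assembles an rsync command line that copies a remote directory to local media. The host name, user name, and paths are hypothetical placeholders, and the helper functions are not part of any ZID tooling. Only the standard rsync options -a (archive mode), --partial (keep partially transferred files), and --dry-run are used.

```python
import shlex
import subprocess
from typing import List


def build_rsync_command(remote: str, source_dir: str, dest_dir: str,
                        dry_run: bool = False) -> List[str]:
    """Build an rsync command that pulls source_dir from remote into dest_dir."""
    cmd = ["rsync", "-a", "--partial"]
    if dry_run:
        # Only report what would be transferred, without copying anything.
        cmd.append("--dry-run")
    # A trailing slash makes rsync copy the directory contents
    # rather than creating an extra directory level in dest_dir.
    cmd.append(f"{remote}:{source_dir.rstrip('/')}/")
    cmd.append(dest_dir)
    return cmd


def run_backup(cmd: List[str]) -> None:
    """Run the command; raises CalledProcessError if rsync reports an error."""
    subprocess.run(cmd, check=True)


# Hypothetical account, host, and paths; replace them with your own.
cmd = build_rsync_command("user@cluster.example.org",
                          "/scratch/user/project",
                          "/media/backup/project",
                          dry_run=True)
print(shlex.join(cmd))
```

Calling `run_backup(cmd)` on your own workstation would execute the transfer. Running it first with `dry_run=True` is a cautious way to check what would be copied.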
Maintenance status of individual systems
Existing LEOx user accounts are valid on all LEO clusters.
LEO3
Decommissioned May 24, 2022. Will be replaced by the planned LEO5 system.
LEO3e
Central components under maintenance until March 2024. Failed compute nodes may or may not be repaired. EOL: TBD
LEO4
Central components under maintenance until November 2024. Failed compute nodes may or may not be repaired. EOL: TBD
LEO5
Fully operative and covered by warranty. Planned start of "friendly users" test operation: Feb 1, 2023.
MACH2
Successor system to MACH.
Warranty expired as of end of 2020. System is being operated on a best effort basis and may become degraded or go out of operation at any time. No successor system is currently planned.
NOTE: in contrast to other systems, there is no data backup of any user directories, including HOME. Users are responsible for safeguarding their own data.
Planned End Of Service 31 March, 2023. Please start retrieving your data as soon as possible.
VSC
Please visit the VSC Systems Overview website.