UNIverse - Public Research Portal

MODA: Monitoring and Operational Data Analytics for HPC Systems

Research Project | 01.08.2020 - 31.12.2025
The goal of this project is to improve HPC operations and research with respect to system performance, resilience, and efficiency. The performance aspect targets optimal resource allocation and job scheduling. The resilience aspect strives to ensure orderly operations in the face of anomalies or misuse; this includes security mechanisms against malicious applications. The efficiency aspect concerns resource management and the energy efficiency of HPC systems. To this end, appropriate techniques are employed to (a) monitor the system and collect data, such as sensor data, system logs, and job resource usage, (b) analyze system data through statistical and machine learning methods, and (c) make control and tuning decisions that optimize the system and avoid waste and misuse of computing power. The operational ideals that this project follows are (a) to gain a data-driven understanding of the system instead of operating it like a black box, (b) to continuously monitor all system states and application behavior, (c) to holistically consider the interaction between system states and application behavior, and (d) to develop solutions that can detect and resolve performance issues autonomously.
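The monitor, analyze, and control steps described above can be illustrated with a minimal sketch. It is not the project's implementation: the function names (flag_anomalous_jobs, decide_action), the job IDs, the CPU-utilization values, and the choice of a simple z-score test are all assumptions made for illustration only.

```python
from statistics import mean, stdev


def flag_anomalous_jobs(cpu_usage_by_job, threshold=2.0):
    """Analyze step (hypothetical): return the sorted IDs of jobs whose CPU
    usage lies more than `threshold` sample standard deviations from the mean."""
    values = list(cpu_usage_by_job.values())
    if len(values) < 2:
        return []  # a sample standard deviation needs at least two points
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all jobs behave identically, nothing stands out
    return sorted(
        job for job, usage in cpu_usage_by_job.items()
        if abs(usage - mu) / sigma > threshold
    )


def decide_action(job_id, anomalous_jobs):
    """Control step (hypothetical): pick an action label for a job."""
    return "inspect" if job_id in anomalous_jobs else "continue"


# Monitor step (assumed data): average CPU utilization per job, in [0, 1].
samples = {
    "job-1": 0.90, "job-2": 0.88, "job-3": 0.92,
    "job-4": 0.91, "job-5": 0.89, "job-6": 0.05,
}

anomalous = flag_anomalous_jobs(samples, threshold=1.5)
actions = {job: decide_action(job, anomalous) for job in samples}
```

With only six samples, a z-score cannot exceed roughly 2.04, so the example uses a threshold of 1.5; a real deployment would work with far more data and likely a more robust method.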
Publications
Jakobsche, Thomas et al. (2021) ‘An Execution Fingerprint Dictionary for HPC Application Recognition’, in IEEE International Conference on Cluster Computing. IEEE Computer Society. Available at: https://doi.org/10.1109/cluster48925.2021.00092.

Members (2)
Florina M. Ciorba
Principal Investigator
Thomas Jakobsche
Project Member