UNIverse - Public Research Portal
MODA at sciCORE: Monitoring and Operational Data Analytics at sciCORE

Research Project | 01.08.2020 - 31.12.2025

The goal of this project is to improve high-performance computing (HPC) operations and research with respect to system performance, resilience, and efficiency. The performance aspect targets optimal resource allocation and job scheduling. The resilience aspect strives to ensure orderly operations in the face of anomalies or misuse; this includes security mechanisms against malicious applications. The efficiency aspect concerns resource management and the energy efficiency of HPC systems. To this end, appropriate techniques are employed to (a) monitor the system and collect data, such as sensor data, system logs, and job resource usage, (b) analyze system data through statistical and machine learning methods, and (c) make control and tuning decisions that optimize the system and avoid waste and misuse of computing power. The operational ideals that this project follows are (a) to gain a data-driven understanding of the system instead of operating it like a black box, (b) to continuously monitor all system states and application behavior, (c) to holistically consider the interaction between system states and application behavior, and (d) to develop solutions that can detect and resolve performance issues autonomously.

Publications

Ciorba, Florina M. et al. (2019) ‘Data Analysis for Improving High-Performance Computing Operations and Research. An Eucor Seed Money Project’, in Janczyk, Michael; von Suchodoletz, Dirk; Wiebelt, Bernd (eds.). Tübingen Library Publishing. Available at: https://doi.org/10.15496/publikation-29062.

Ghiasvand, Siavash and Ciorba, Florina M. (2019) ‘Anonymization of System Logs for Preserving Privacy and Reducing Storage’, in Arai, Kohei; Kapoor, Supriya; Bhatia, Rahul (eds.). Springer. Available at: https://doi.org/10.1007/978-3-030-03405-4_11.

Ghiasvand, Siavash and Ciorba, Florina M. (2019) ‘Anomaly Detection in High Performance Computers: A Vicinity Perspective’. IEEE. Available at: https://doi.org/10.1109/ispdc.2019.00024.

Ghiasvand, Siavash and Ciorba, Florina M. (2018) ‘Assessing Data Usefulness for Failure Analysis in Anonymized System Logs’. IEEE. Available at: https://doi.org/10.1109/ispdc2018.2018.00031.

Ciorba, Florina M. (2017) ‘The importance and need for system monitoring and analysis in HPC operations and research’, in Richling, Sabine; Baumann, Martin; Heuveline, Vincent (eds.). heiBOOKS. Available at: https://doi.org/10.11588/heibooks.308.418.

Members (3)

Florina M. Ciorba

Principal Investigator

Thierry Sengstag

Co-Investigator

Thomas Jakobsche

Project Member