Evaluation of the Quality of AI Systems

Evaluation is a crucial aspect of the development and deployment of artificial intelligence (AI) systems. It involves systematically assessing how well an AI model performs its intended tasks, measuring aspects such as accuracy, robustness, fairness (and bias), and calibration. To yield meaningful results, the evaluation process must use relevant datasets and metrics.
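As a rough illustration of two of the metric families named above, here is a minimal NumPy sketch of classification accuracy and a binned expected calibration error (ECE). This is a generic sketch, not the specific evaluation protocol used in the work described on this page; the function names and toy inputs are illustrative.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: |mean confidence - empirical accuracy| per bin,
    weighted by the fraction of samples in that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to an equal-width confidence bin in [0, 1].
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return float(ece)
```

A perfectly calibrated model (e.g. 75% confidence and 75% of those predictions correct) yields an ECE of zero, while confident but frequently wrong predictions drive the ECE up.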

The Fraunhofer Heinrich Hertz Institute (HHI), together with the TÜV Association and the Federal Office for Information Security (BSI), has published two jointly developed whitepapers. The first, published in 2021 and entitled "Towards Auditable AI Systems: Current Status and Future Directions" [1], outlines a roadmap for examining artificial intelligence (AI) models throughout their entire lifecycle. The second, published in 2022 and entitled "Towards Auditable AI Systems: From Principles to Practice" [2], presents the initial concept of a newly developed "Certification Readiness Matrix" (CRM) and proposes its use in audits.

As part of a project with the BSI (P540, "Use of Artificial Intelligence in Medical Diagnosis and Prognosis Systems"), we developed prototype test criteria and testing procedures to ensure the reliability and safety of AI systems in medicine. This work was published by the BSI [3].

In addition, our work extends to the systematic assessment of fundamental model characteristics, with a particular (but not exclusive) focus on robustness and uncertainty. These aspects are examined across diverse application domains, for instance robustness in ECG and histopathological analyses [4,5] and uncertainty quantification in chest X-ray interpretation [6].
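One common recipe in uncertainty quantification is to split the predictive entropy of an ensemble (or of multiple MC-dropout passes) into an aleatoric and an epistemic part. The sketch below shows this generic decomposition; it is an illustration of the technique, not necessarily the method used in [6], and the function names are our own.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def uncertainty_decomposition(sample_probs):
    """Decompose predictive entropy for a single input.

    sample_probs: array of shape (n_samples, n_classes) holding class
    probabilities from an ensemble or repeated stochastic forward passes.

    total (predictive) entropy = H(mean_s p_s)
    aleatoric part             = mean_s H(p_s)
    epistemic part (mutual information) = total - aleatoric
    """
    sample_probs = np.asarray(sample_probs, dtype=float)
    total = entropy(sample_probs.mean(axis=0))
    aleatoric = float(np.mean([entropy(p) for p in sample_probs]))
    return total, aleatoric, total - aleatoric
```

If all ensemble members agree, the epistemic term vanishes and any remaining uncertainty is aleatoric; if individually confident members disagree, the epistemic term dominates.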

You can find a detailed list of our work in the references below. If you have questions or would like to learn about opportunities for collaboration such as research projects or student theses, please get in touch. 

Publications

[1] Berghoff, C., Biggio, B., Brummel, E., Danos, V., Doms, T., Ehrich, H., ... & Wiegand, T. (2021). Towards Auditable AI Systems: Current Status and Future Directions.

[2] Berghoff, C., Böddinghaus, J., Danos, V., Davelaar, G., Doms, T., Ehrich, H., ... & Wiegand, T. (2022). Towards Auditable AI Systems: From Principles to Practice.

[3] BSI (2024). Einsatz von Künstlicher Intelligenz in medizinischen Diagnose- und Prognosesystemen. 

[4] Strodthoff, N., Wagner, P., Schaeffter, T., & Samek, W. (2020). Deep learning for ECG analysis: Benchmarks and insights from PTB-XL. IEEE Journal of Biomedical and Health Informatics, 25(5), 1519-1528. 

[5] Springenberg, M., Frommholz, A., Wenzel, M., Weicken, E., Ma, J., & Strodthoff, N. (2023). From modern CNNs to vision transformers: Assessing the performance, robustness, and classification strategies of deep learning models in histopathology. Medical Image Analysis, 87, 102809.

[6] Baur, S., Samek, W., Ma, J. (2026). Benchmarking Uncertainty and Its Disentanglement in Multi-label Chest X-Ray Classification. In: Sudre, C.H., et al. Uncertainty for Safe Utilization of Machine Learning in Medical Imaging. UNSURE 2025. Lecture Notes in Computer Science, vol 16166. Springer, Cham. 

[7] Ma, J., Weicken, E., Pahde, F., Weitz, K., Lapuschkin, S., Samek, W., & Wiegand, T. (2025). Künstliche Intelligenz auf dem Prüfstand: Anforderungen, Qualitätskriterien und Prüfwerkzeuge für medizinische Anwendungen. Bundesgesundheitsblatt-Gesundheitsforschung-Gesundheitsschutz, 1-9.