Evaluation of the quality of AI systems
Evaluation of AI methods is a crucial aspect of the development and deployment of artificial intelligence systems. It involves systematically assessing how well an AI model performs its intended tasks, measuring aspects such as accuracy, robustness, fairness (or bias), and calibration. To yield meaningful results, the evaluation process must use relevant datasets and metrics.
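To make two of these metrics concrete, the sketch below computes classification accuracy and the expected calibration error (ECE), which measures the gap between a model's stated confidence and its actual accuracy. This is a minimal illustrative example; the function names and the equal-width binning scheme are our own choices and are not taken from any of the publications cited on this page.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between mean confidence and
    empirical accuracy within equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Bin weight = fraction of samples in this bin.
            ece += mask.mean() * abs(correct[mask].mean()
                                     - confidences[mask].mean())
    return ece
```

A well-calibrated model (e.g. one that is correct 80% of the time when it reports 80% confidence) has an ECE near zero; overconfident models score higher.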
The Fraunhofer Heinrich Hertz Institute (HHI), together with the TÜV Association and the Federal Office for Information Security (BSI), has published two jointly developed whitepapers. The first whitepaper, published in 2021, is entitled "Towards Auditable AI Systems: Current status and future directions" [1] and outlines a roadmap for examining artificial intelligence (AI) models throughout their entire lifecycle. The second whitepaper, published in 2022, is entitled "Towards Auditable AI Systems: From Principles to Practice" [2] and introduces the newly developed "Certification Readiness Matrix" (CRM), presenting its initial concept.
As part of a project with the BSI (P540, "Use of Artificial Intelligence in Medical Diagnosis and Prognosis Systems"), we developed prototype test criteria and testing procedures to ensure the reliability and safety of AI systems in medicine. This work was published by the BSI [3].
In addition, our work extends to a systematic assessment of fundamental model characteristics, with a particular (but not exclusive) focus on robustness and uncertainty. These aspects are examined across diverse application domains, for instance robustness in ECG and histopathological analyses [4,5] and uncertainty quantification in chest X-ray interpretation [6].
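As a concrete illustration of these two notions, the sketch below shows one common way to quantify them: predictive entropy of a model's output distribution as an uncertainty score, and the accuracy drop under input noise as a simple robustness probe. This is a generic example under our own assumptions; `model`, `robustness_drop`, and the chosen noise level are hypothetical and do not reflect the specific methods used in [4]–[6].

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of a predictive distribution; higher means
    more uncertain (a uniform distribution is maximally uncertain)."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)

def robustness_drop(model, x, y, noise_std=0.1, seed=0):
    """Accuracy drop when Gaussian noise of the given standard
    deviation is added to the inputs (0.0 = no degradation)."""
    rng = np.random.default_rng(seed)
    clean_acc = np.mean(model(x) == y)
    noisy_acc = np.mean(model(x + rng.normal(0.0, noise_std, x.shape)) == y)
    return clean_acc - noisy_acc
```

For example, a confident prediction such as [0.99, 0.01] yields a much lower entropy than an uncertain one such as [0.5, 0.5], and a classifier whose decisions sit far from its decision boundary shows little or no accuracy drop under small perturbations.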
You can find a detailed list of our work in the references below. If you have questions or would like to learn about opportunities for collaboration, such as research projects or student theses, please get in touch.