Schwarz, Markus (ORCID: 0009-0003-0457-0289); Hinske, Ludwig Christian (ORCID: 0000-0001-7273-5899); Mansmann, Ulrich (ORCID: 0000-0002-9955-8906); Albashiti, Fady (ORCID: 0000-0002-0671-152X) (2025):
ML Auditing and Reproducibility: Applying a Core Criteria Catalog to an Early Sepsis Onset Detection System.
IEEE Access, 13, pp. 104899-104915. ISSN 2169-3536
Published publication:
ML_Auditing_and_Reproducibility_Applying_a_Core_Criteria_Catalog_to_an_Early_Sepsis_Onset_Detection_System.pdf

Abstract
Background: Working toward a commonly agreed framework for auditing ML algorithms, we proposed a 30-question core criteria catalog in a previous paper. In this paper, we apply that catalog to an early sepsis onset detection system use case.

Methods: The ML algorithm behind the sepsis prediction system is assessed in the manner of an external audit. We apply the questions of our catalog, with their described context, to the publicly available sepsis project resources. For the audit process we followed the three steps proposed by the Supreme Audit Institutions of Finland et al. and utilized inter-rater reliability techniques. We also conducted an extensive reproduction study, as encouraged by our catalog, including data perturbation experiments.

Results: We successfully applied our 30-question catalog to the sepsis ML algorithm development project. Based on the first auditor, 37% of the questions were rated as fully addressed, 33% as partially addressed, and 30% as not addressed. The weighted Cohen’s kappa agreement coefficient is κ = 0.51. The sepsis project focuses on algorithm design, data properties, and assessment metrics. In our reproduction study, using externally validated pooled prediction with the self-attention deep learning model, we achieved an AUC of 0.717 (95% CI, 0.693-0.740) and a PPV of 28.3 (95% CI, 24.5-32.0) at 80% TPR and 18.8% sepsis-case prevalence harmonization. For the lead time to sepsis onset, we could not reproduce meaningful values. In the perturbation experiment, when trained on the AUMC dataset and validated externally, the model showed an AUC of 0.799 (95% CI, 0.756-0.843) with modified input data, in contrast to an AUC of 0.788 (95% CI, 0.743-0.833) with original input data.
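The AUC confidence intervals quoted above are of the kind typically obtained by bootstrap resampling of the evaluation set; the authors' exact procedure is not restated here. A minimal percentile-bootstrap sketch for AUC confidence intervals, using purely illustrative data (not the study's):

```python
import random

def auc(labels, scores):
    # Rank-based AUC: probability that a random positive is scored
    # above a random negative (ties count half).
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample cases with replacement,
    # recompute the AUC, and take the empirical alpha/2 quantiles.
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # resample must contain both classes
            continue
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi
```

This is a sketch under stated assumptions, not the authors' implementation; in practice the resampling scheme (e.g., stratified by class or by patient) affects the resulting interval.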
Discussion: The catalog application results are visualized in a radar diagram, allowing an auditor to quickly assess and compare the strengths and weaknesses of ML algorithm development or implementation projects. Overall, we were able to reproduce the magnitude of the sepsis project’s reported performance metrics. However, certain steps of the reproduction study proved challenging due to necessary code changes and dependencies on package versions and the runtime environment. The deviations in the result metrics were −5.83% for the AUC and −11.03% for the PPV, presumably explained by our omission of tuning. The AUC change of 1.45% indicates resilience of the self-attention deep learning model to input data manipulation. An algorithmic error is most likely responsible for the missing lead time to sepsis onset metric. Although the acquired weighted Cohen’s kappa coefficient is interpreted as “fair to good” agreement between the two auditors, some subjectivity remains, leaving room for improvement. This could be mitigated if more groups (multiple auditors) applied our catalog to existing ML development and implementation projects; a subsequent “catalog application guideline” could be established this way. Our activities might also help development or implementation teams prepare for future, legally required audits of their newly created ML algorithms/AI products.
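The inter-rater agreement above is a linearly weighted Cohen's kappa over three ordinal rating categories (not addressed / partially addressed / fully addressed). A minimal pure-Python sketch of the computation, using hypothetical ratings since the per-question ratings are not reproduced here:

```python
from collections import Counter

def weighted_kappa(r1, r2, n_cats):
    """Linearly weighted Cohen's kappa for two raters on ordinal categories 0..n_cats-1."""
    n = len(r1)
    # Observed disagreement: mean linear weight |i - j| / (n_cats - 1).
    obs = sum(abs(a - b) for a, b in zip(r1, r2)) / (n_cats - 1) / n
    # Expected disagreement under independence, from the marginal distributions.
    c1, c2 = Counter(r1), Counter(r2)
    exp = sum(
        (c1[i] / n) * (c2[j] / n) * abs(i - j) / (n_cats - 1)
        for i in range(n_cats)
        for j in range(n_cats)
    )
    return 1 - obs / exp

# Hypothetical ratings for 10 questions by two auditors
# (0 = not addressed, 1 = partially addressed, 2 = fully addressed):
auditor1 = [2, 2, 1, 0, 2, 1, 0, 1, 2, 0]
auditor2 = [2, 1, 1, 0, 2, 2, 0, 1, 1, 0]
print(weighted_kappa(auditor1, auditor2, 3))
```

The linear weighting penalizes a fully-vs-not-addressed disagreement twice as heavily as an adjacent-category one, which is why it suits ordinal catalog ratings better than unweighted kappa.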
| Document type: | Article (Klinikum der LMU) |
|---|---|
| Organizational unit (faculties): | 07 Medicine > Klinikum der LMU München |
| DFG subject classification of research areas: | Life Sciences |
| Date published: | 07 Aug 2025 06:50 |
| Last modified: | 07 Aug 2025 06:50 |
| URI: | https://oa-fund.ub.uni-muenchen.de/id/eprint/1920 |