Challenges and Considerations of Developing and Implementing Machine Learning Tools in Clinical Labs

nacccaus

Updated: Nov 6, 2022

He Sarina Yang, PhD, DABCC, Assistant Professor of Pathology and Laboratory Medicine, Weill Cornell Medicine, Cornell University.



Laboratory medicine is data-rich because of the enormous volume of test results produced by the different sections of the clinical laboratory. It is estimated that up to 70% of the data in the electronic health record (EHR) is derived from the clinical laboratory. Most of these data are test results reported as individual numerical or categorical values in a structured format. Patient laboratory test profiles are high-dimensional datasets: each patient usually has multiple test results generated from a single physician visit, as well as longitudinal results used to monitor “wellness” status or to follow one or more disease processes. The sheer volume of data, including the number of tests and the interdependent, multidimensional relationships among the results, is difficult for humans to interpret without computational assistance. Thus, machine learning (ML) has emerged as a powerful tool for analyzing and interpreting massive quantities of laboratory test results and for integrating clinical findings with laboratory data. In recent years, there has been a surge of interest in applying ML to a variety of tasks in clinical laboratories. However, although laboratory professionals are familiar with traditional data analysis approaches, many are not familiar with the workflow of machine learning analysis, which leaves a knowledge gap in the development, understanding, and use of ML models. This gap creates a risk of generating biased or unrepresentative models, which can lead to misleading clinical conclusions or overestimation of model performance.

In a recent review published in Archives of Pathology and Laboratory Medicine, the core clinical medical journal published by the College of American Pathologists, Dr. Yang, Dr. Wang, and colleagues discussed the four major components of creating ML models: data collection, data preprocessing, model development, and model evaluation [1]. They also highlighted many challenges and pitfalls in developing accurate ML models, which can result in misleading clinical impressions or inaccurate estimates of model performance, and provided suggestions and guidance on how to overcome them. In particular, the review addressed how to collect sufficiently large, high-quality datasets, properly report dataset characteristics, and combine data from multiple institutions with proper normalization; how to handle missing data and decide whether to include or exclude outliers; and how to evaluate the completeness of a dataset. The authors also discussed selecting a suitable ML model for a specific clinical question and evaluating model performance against objective criteria. They strongly recommended using multiple criteria, rather than a single criterion, to evaluate model performance, and preferred evaluation with external datasets and/or prospectively collected data as a way to better understand model generalizability. In addition, they illustrated the causes of model overfitting and under-specification in clinical scenarios.
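The recommendation to use multiple evaluation criteria can be illustrated with a minimal Python sketch. It is not taken from the review: the function name `evaluate_binary`, the `BinaryMetrics` container, and the 100-patient dataset are all hypothetical and made up for illustration. The example assumes a binary classifier and a disease prevalence of 5%.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    """Complementary performance measures for a binary classifier."""
    accuracy: float
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def evaluate_binary(y_true: list[int], y_pred: list[int]) -> BinaryMetrics:
    """Compute several metrics from 0/1 labels and 0/1 predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return BinaryMetrics(
        accuracy=_ratio(tp + tn, len(pairs)),
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
    )


# Made-up data: 5 diseased patients among 100. The hypothetical model detects
# only one of them and raises no false alarms.
y_true = [1] * 5 + [0] * 95
y_pred = [1] + [0] * 4 + [0] * 95

metrics = evaluate_binary(y_true, y_pred)
print(f"accuracy={metrics.accuracy:.2f}  sensitivity={metrics.sensitivity:.2f}  "
      f"specificity={metrics.specificity:.2f}")
```

On this imbalanced toy dataset the accuracy is 0.96 while the sensitivity is only 0.20, so accuracy alone would hide the fact that four of the five diseased patients were missed. Reporting sensitivity, specificity, and the predictive values side by side makes that weakness visible.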

The role of laboratorians is not just to provide data but also to apply their clinical knowledge to the data in order to guide model development, interpret the model correctly, and evaluate its performance in the patient care setting. The future of personalized and generalized medicine requires interdisciplinary collaboration between laboratory medicine and data science experts to create innovative, accurate ML models, which will advance the medical field, provide needed support in periods of health care crisis, and better treat individual patients.


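Challenges and Considerations of Developing and Applying Machine Learning Models in Clinical Laboratories


Clinical laboratory medicine generates massive amounts of patient test data in practice and is therefore uniquely well placed to apply advanced big-data and machine learning algorithms. It is estimated that about 70% of the data in a patient's electronic health record comes from the clinical laboratory, and the vast majority of these test results are already digitized, with a standardized structure and categories. These results are usually stored as high-dimensional structured data, because each patient visit produces results for multiple tests, and certain tests are repeated periodically to monitor disease progression and health status.

A data volume this large, together with the complex relationships among the individual test results, is very hard to parse and understand through manual analysis. In recent years, machine learning algorithms have therefore developed into a powerful tool for analyzing and interpreting complex laboratory results and for supporting clinical diagnosis, and have become a fast-growing research focus in clinical medicine. A growing number of laboratory medicine professionals are exploring how to apply machine learning to every aspect of clinical pathology. Risk and opportunity coexist, however: although laboratory professionals are well versed in traditional data analysis methods, many are unfamiliar with the workflow and methods of machine learning analysis and have blind spots in developing, understanding, and using machine learning models. If machine learning algorithms are applied inappropriately, researchers may train biased models that do not reflect the true distribution of the data, leading to erroneous conclusions or overestimated model performance.

In a review recently published in Archives of Pathology and Laboratory Medicine, the official journal of the College of American Pathologists, Dr. He S. Yang, Dr. Fei Wang, and colleagues discussed in depth the four main steps in creating a machine learning model: data collection, data preprocessing, model development, and model evaluation. They gave a concise analysis of the technical problems and common oversights that can arise at each step, and offered concrete, practical suggestions and guidance. The review discussed how to collect large amounts of high-quality data and correctly assess the distributional characteristics of the data; how to standardize and integrate data from multiple hospitals; how to properly handle missing information and identify the rare outliers so that they do not distort the analysis; and how to evaluate the completeness of a dataset on the basis of clinical knowledge. The article went on to discuss choosing a machine learning model suited to the amount of data available in order to address a specific clinical question, and fairly evaluating model accuracy and performance against objective criteria. The review strongly recommended evaluating model performance with multiple criteria rather than a single one, and noted that evaluation with external or prospective data better supports broader adoption of a model. In addition, the article explored the causes of model overfitting and of validated models failing to generalize to clinical cases.

The contribution of the clinical laboratory is not only to provide test data and diagnostic and treatment results; more importantly, it is to use its clinical knowledge to guide the construction of machine learning models, interpret the models correctly, and evaluate how the models perform in the patient care system. The individualized, tailored medical practice of the future requires interdisciplinary collaboration between clinical laboratories and data analysis experts to creatively build accurate machine learning models that analyze test data and predict diagnostic and treatment outcomes. This will greatly advance medicine, provide more support during health crises, and so help diagnose and treat every patient better.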



References:

  1. He S. Yang, Daniel D. Rhoads, Jorge Sepulveda, Chengxi Zang, Amy Chadburn, Fei Wang. Building the Model: Challenges and Considerations of Developing and Implementing Machine Learning Tools for Clinical Laboratory Medicine Practice. Archives of Pathology and Laboratory Medicine, 2022.

 
 
 
