By: Catherine Li, Contributing Writer
Many healthcare workers are afraid of being replaced by AI in the workforce–but as far as we know, there’s no reason for medical practitioners to worry.
AI’m Here to Stay
Artificial Intelligence (AI) models have rapidly gained traction over the past few decades, prompting discussion about their integration into industries that affect our daily lives and the threat they might pose to careers. In their 2020 Future of Jobs Report, the World Economic Forum predicted that “by 2025, 85 million jobs may be displaced by a shift in the division of labour from humans to machines” (1). Estimates also suggest that jobs with high AI exposure typically require higher education and involve more cognition-related tasks (2). Fields like medicine, for instance, will inevitably change as a result of the emergence and improvement of AI models.
Currently, deep learning models, which detect patterns in medical images, have emerged as a candidate diagnostic method in medicine. Chatbots have also appeared across the healthcare industry and the internet as diagnostic and administrative aids. However, applying AI to healthcare raises concerns about the risk of using algorithms, often trained on biased data, to diagnose patients. Evidently, there exists a strong and complex relationship between AI and medicine—the concern lies in its nature.
AI’ll Take Care of You
Many of us may have consulted Google at some point to nervously search up any symptoms we might be experiencing and try to pinpoint the underlying condition. Amidst a crippling shortage of medical professionals and an epidemic of long wait times in clinics and emergency rooms, the internet has turned into a medical practitioner of sorts, working to efficiently address medical doubts.
Patients aren’t only turning to Google for advice and diagnostics. Chatbots like OpenAI’s ChatGPT and Google’s Med-PaLM are large language models (LLMs) that can easily be used to seek medical advice. Other chatbots, in the form of commercial chatbot-based symptom checkers (CSCs), are accessible as mobile applications and can assess medical symptoms and provide users with prompt diagnoses. In 2021, popular diagnosis apps were estimated to have been downloaded over 1 million times (3).
Currently, LLMs are regarded as “black box” models: the process by which they turn input into output cannot readily be inspected or explained. If asked how a specific diagnosis was reached for a given input, an LLM would struggle to explain its reasoning, which makes diagnostic errors much harder to catch and fix.
Beyond producing unexplainable outputs, LLM-based chatbots are also known to “hallucinate”, or generate false information. For example, when asked for literature to support a generated answer, ChatGPT produced a list of Digital Object Identifiers (DOIs) that led to unrelated publications, and when questioned about previous studies, some of its answers were “patently incorrect”, reporting completely false results (4). Another study found that of 115 references provided in ChatGPT-produced medical articles, 47% were fabricated, 46% were authentic but inaccurate, and a meager 7% were authentic and accurate (5). These “hallucinations” make it difficult to trust the credibility of these chatbots, especially in a high-stakes field like diagnostics.
LLMs are also susceptible to bias. To begin this discussion, it’s important to draw a distinction between symptoms and signs. Symptoms are experiences described by the patient (“I feel pain in my chest and coughed through my entire class, I think I got frosh flu”), while signs are objective evidence of disease (a high fever and opacities on a CT scan may reveal that our subject has pneumonia). When consulting CSCs or LLMs for medical advice, most of what users provide are symptoms. However, word choice and demographics are critical to a model’s output.
Imagine a middle-aged man and a middle-aged woman both tell the same chatbot they have intense abdominal pain and nausea and provide it with their demographic backgrounds. One of them may be informed that they have food poisoning, gallstones, or appendicitis, while the other may be told they have endometriosis or are experiencing perimenopausal symptoms. The difference in output can be attributed to past diagnoses made by practitioners, who assessed symptoms with patient sex in mind. As a result, records show associations between certain demographics and specific–and sometimes misdiagnosed–conditions. The historical biases embedded in past diagnoses, which are then reproduced by LLMs, have rendered the performance of these models relatively poor (6).
Racial biases in medical diagnostics are also apparent; Black patients with depression, for instance, are much more likely to also receive a schizophrenia diagnosis (7), and minority groups were found to be less likely than their Caucasian counterparts to receive a timely dementia diagnosis (8). Medical disparities and social biases, reflected in past diagnoses and data, pave the way for present misdiagnoses that can lead to disability or death. Concerningly, past databases containing inaccurate diagnoses of certain demographics may be used to train LLMs, perpetuating biases that are only now beginning to be addressed in medical schools.
I’m Not Human, but AI Sure Am Helpful
Fortunately, other deep learning models, such as convolutional neural networks (CNNs), are trained on medical images and can enhance diagnostic accuracy and efficiency without the susceptibility to bias that generative AI models face. The predictive ability of deep learning models can be incredibly useful, especially when early detection is critical. Furthermore, CNNs–which are trained on signs, rather than symptoms, of disease–are remarkably accurate (9,10).
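To make that idea concrete, here is a minimal sketch, written in PyTorch, of the kind of image classifier being described. The architecture, the 128x128 greyscale input, and the two-class output are illustrative assumptions for this example only, not the models evaluated in the cited studies.

```python
# Illustrative sketch only: a toy CNN for greyscale medical images in PyTorch.
# The layer sizes, class count, and 128x128 input are assumptions made for this
# example, not the architectures used in the studies cited in the article.
import torch
import torch.nn as nn

class TinyScanClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Convolutional layers learn local visual patterns (edges, textures, lesions)
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # one-channel greyscale scan
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # A fully connected head maps the pooled features to class scores
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 32 * 32, num_classes),  # assumes 128x128 inputs
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# A batch of four fake 128x128 scans produces four pairs of class scores
model = TinyScanClassifier()
print(model(torch.randn(4, 1, 128, 128)).shape)  # torch.Size([4, 2])
```

In practice such a network would be trained on labelled scans; the point here is simply that its input is the image itself (a sign), not a patient’s description of how they feel.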
When given various MRI images, CNNs have been able to detect endometrial cancers with accuracy and sensitivity equal to, or even greater than, that of radiologists. Here, interobserver agreement between the CNNs and radiologists was generally lower than between the radiologists themselves, perhaps suggesting that the CNNs approached the images using a different framework (11). Given CNNs’ diverse perspective and enhanced accuracy, we should consider using them in the healthcare industry–for example, in detecting cases with more subtle signs earlier on or providing additional confirmation of a previous diagnosis.
Studies also suggest that AI-practitioner collaboration may enhance diagnostic performance overall. One study investigating CNN diagnostics for acute respiratory distress syndrome using chest radiographs found that the highest diagnostic accuracy was achieved when the CNN first attempted a diagnosis and deferred to physicians when faced with uncertainty (12). Because deep learning models trained on images do not carry the biases of chatbots, collaborating with diagnosticians in this way may further increase diagnostic accuracy.
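As a rough illustration of that workflow, here is a minimal “defer when uncertain” sketch. The 0.85 confidence threshold, the labels, and the stand-in prediction function are hypothetical; they are not the decision rule used in the study cited in (12).

```python
# Illustrative sketch only: route low-confidence model predictions to a physician.
# The threshold, labels, and model_prediction() are hypothetical stand-ins,
# not details taken from the ARDS study cited above.
import random

CONFIDENCE_THRESHOLD = 0.85  # below this, the case is deferred to a physician

def model_prediction(scan_id: str) -> tuple[str, float]:
    """Stand-in for a trained CNN that returns a (label, confidence) pair."""
    random.seed(scan_id)  # deterministic fake output, for the example only
    return random.choice(["ARDS", "no ARDS"]), random.random()

def triage(scan_id: str) -> str:
    label, confidence = model_prediction(scan_id)
    if confidence >= CONFIDENCE_THRESHOLD:
        # Confident cases: the model's reading is reported
        return f"{scan_id}: model suggests '{label}' (confidence {confidence:.2f})"
    # Uncertain cases: no automated label is attached; a physician decides
    return f"{scan_id}: deferred to physician (confidence {confidence:.2f})"

for scan in ["scan-001", "scan-002", "scan-003"]:
    print(triage(scan))
```

The design mirrors the collaboration described above: the model handles the cases it is confident about, and the harder, ambiguous cases go to a human.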
AI’m Just Here to Help
Right now, an “AI takeover” in medicine seems unlikely, given the unreliable nature of commercial and internet-accessible chatbots and the synergistic performance of practitioners and deep learning models. Using AI as a tool, rather than a replacement for practitioners, may improve overall efficiency and performance by partially alleviating the heavy workload that frequently results in practitioner fatigue and burnout (13). While the World Economic Forum forecast 85 million careers being displaced, it also estimated that “97 million new roles may emerge that are more adapted to the new division of labour between humans, machines and algorithms” (1). Perhaps instead of being displaced by machines, medical practitioners will readjust their roles to co-exist—and even collaborate—with them.
Integrating deep learning models into healthcare institutions and adjusting educational curricula to teach responsible AI use will be another challenge to overcome. But models must first be honed further before they are incorporated into the medical field on a wide scale, which leaves plenty of time to tackle that challenge. And if we still don’t know how, we can always ask our helpful friend ChatGPT.
References
- The Future of Jobs Report 2020. World Economic Forum. (2020, October 20). https://www.weforum.org/publications/the-future-of-jobs-report-2020/in-full/chapter-2-forecasts-for-labour-market-evolution-in-2020-2025/#2-2-emerging-and-declining-jobs
- Mehdi, T., & Morissette, R. (2024, September 3). Experimental Estimates of Potential Artificial Intelligence Occupational Exposure in Canada. Statistics Canada. https://www150.statcan.gc.ca/n1/pub/11f0019m/11f0019m2024005-eng.htm
- You, Y., & Gui, X. (2021). Self-Diagnosis through AI-enabled Chatbot-based Symptom Checkers: User Experiences and Design Considerations. AMIA Annual Symposium Proceedings, 2020, 1354–1363.
- Emsley, R. ChatGPT: these are not hallucinations – they’re fabrications and falsifications. Schizophr 9, 52 (2023). https://doi.org/10.1038/s41537-023-00379-4
- Bhattacharyya, M., Miller, V. M., Bhattacharyya, D., & Miller, L. E. (2023). High rates of fabricated and inaccurate references in ChatGPT-generated medical content. Cureus. https://doi.org/10.7759/cureus.39238
- Ceney, A., Tolond, S., Glowinski, A., Marks, B., Swift, S., & Palser, T. (2021). Accuracy of online symptom checkers and the potential impact on service utilisation. PloS one, 16(7), e0254088. https://doi.org/10.1371/journal.pone.0254088
- Gara, M. A., Minsky, S., Silverstein, S. M., Miskimen, T., & Strakowski, S. M. (2019). A naturalistic study of racial disparities in diagnoses at an outpatient behavioral health clinic. Psychiatric Services, 70(2), 130–134. https://doi.org/10.1176/appi.ps.201800223
- Tsoy, E., Kiekhofer, R. E., Guterman, E. L., et al. (2021). Assessment of Racial/Ethnic Disparities in Timeliness and Comprehensiveness of Dementia Diagnosis in California. JAMA Neurology, 78(6), 657–665. https://doi.org/10.1001/jamaneurol.2021.0399
- Kugunavar, S., & Prabhakar, C. J. (2021). Convolutional neural networks for the diagnosis and prognosis of the coronavirus disease pandemic. Visual Computing for Industry, Biomedicine, and Art, 4(1). https://doi.org/10.1186/s42492-021-00078-w
- Byrne, M. F., Chapados, N., Soudan, F., Oertel, C., Linares Pérez, M., Kelly, R., Iqbal, N., Chandelier, F., & Rex, D. K. (2017). Real-time differentiation of adenomatous and hyperplastic diminutive colorectal polyps during analysis of unaltered videos of standard colonoscopy using a deep learning model. Gut, 68(1), 94–100. https://doi.org/10.1136/gutjnl-2017-314547
- Urushibara, A., Saida, T., Mori, K., Ishiguro, T., Inoue, K., Masumoto, T., Satoh, T., & Nakajima, T. (2022). The efficacy of deep learning models in the diagnosis of endometrial cancer using MRI: a comparison with radiologists. BMC medical imaging, 22(1), 80. https://doi.org/10.1186/s12880-022-00808-3
- Sjoding, M. W., Taylor, D., Motyka, J., Lee, E., Co, I., Claar, D., McSparron, J. I., Ansari, S., Kerlin, M. P., Reilly, J. P., Shashaty, M. G., Anderson, B. J., Jones, T. K., Drebin, H. M., Ittner, C. A., Meyer, N. J., Iwashyna, T. J., Ward, K. R., & Gillies, C. E. (2021). Deep learning to detect acute respiratory distress syndrome on chest radiographs: A retrospective study with external validation. The Lancet Digital Health, 3(6). https://doi.org/10.1016/S2589-7500(21)00056-X
- Singh R, Volner K, Marlowe D. Provider Burnout. [Updated 2023 Jun 12]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2025 Jan-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK538330/
Image source: Wyatt, Martha (2021, October 13). Barriers to AI in Healthcare. Greenbook. https://www.greenbook.org/insights/barriers-to-ai-in-healthcare.
