Abstract
Clinicians dedicate significant time to clinical documentation, incurring opportunity cost. Artificial intelligence (AI) tools promise to improve documentation quality and efficiency. This systematic review surveys peer-reviewed AI tools to understand how AI may reduce this opportunity cost. PubMed, Embase, Scopus, and Web of Science databases were queried for original, English-language research studies published during or before July 2024 that report the novel development, application, and validation of an AI tool for improving clinical documentation. Of 673 candidate studies, 129 were included. AI tools improve documentation by structuring data, annotating notes, evaluating quality, identifying trends, and detecting errors. Other AI-enabled tools assist clinicians in real time during office visits, but moderate accuracy precludes broad implementation. While a highly accurate end-to-end AI documentation assistant has not been reported in the peer-reviewed literature, existing techniques such as structuring data offer targeted improvements to clinical documentation workflows.
Keywords: artificial intelligence; documentation; automation; clinical guidelines; electronic health records; informatics
Introduction
Robust clinical documentation is critical for efficiency and quality of care and for diagnosis related group (DRG) coding and reimbursement, and is required for compliance with the Joint Commission on Accreditation of Healthcare Organizations (JCAHO).1–3 Physicians spend 34 percent to 55 percent of their workday creating and reviewing clinical documentation in electronic health records (EHRs), translating to an opportunity cost of $90 to $140 billion annually in the United States — money spent on documentation time that could otherwise be spent on patient care.1,3,4 This clerical burden reduces time spent with patients, decreasing quality of care and contributing to clinician dissatisfaction and burnout.3,5 Clinical documentation improvement (CDI) initiatives have sought to reduce this burden and improve documentation quality.6
Background
Despite the need for increased documentation efficiency and quality, CDI initiatives are not always successful.7 Artificial intelligence (AI) tools have been proposed as a means of improving the efficiency and quality of clinical documentation,8,9 and could reduce opportunity cost while producing JCAHO-compliant documentation and assisting coding and billing ventures.10 This study seeks to summarize available literature and describe how AI tools could be implemented more broadly to improve documentation efficiency, reduce documentation burden, increase reimbursement, and improve quality of care.
Methods
Best practices established in the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines were used to search, organize, and review papers (Figure 1). As no patient information was involved, Institutional Review Board review was not required. The review was registered on the Open Science Framework registry. The PubMed, Embase, Scopus, and Web of Science databases were queried using the search strategies found in Figure 1. Articles were screened by three independent graders for the following inclusion criteria: full-text research article, written in English, describing novel development and application of AI tools to improve clinical documentation, and published between the earliest searchable year of each database and July 2024. Covidence software was used to organize the review (Veritas Innovation Ltd, Melbourne, Australia). The search results were last accessed on August 1, 2024. Exclusion criteria included studies which did not involve a new method or application of a tool, those which did not use an AI technique, and those which proposed methodology but did not validate an applicable tool. Disagreement between graders was resolved by discussion. Data extracted from studies included clinical data types, AI methods, tasks, reported effectiveness, and publication dates. Funding was not received for this review.
Results
Six hundred and seventy studies were retrieved by querying PubMed, Embase, Scopus, and Web of Science, and three additional studies were found in the references of related literature. After screening articles for relevance and eligibility according to inclusion and exclusion criteria, 129 studies were included in the narrative systematic review. A complete overview of studies may be found in Table 1. Twenty-three studies were excluded for reporting a non-novel tool or application,11–33 ten did not use an AI approach,34–43 and two proposed but did not evaluate new methodology.44,45
The earliest included study was published in 2005, with the number of studies increasing from 2005 to 2022 (Figure 2). Notably, while 25 studies (an average of 2.08 per month) were published in 2022, only 18 studies (an average of 0.95 per month) were published from January 2023 to July 2024 (Figure 2). This 54 percent decrease in peer-reviewed studies per month coincided with the release of ChatGPT on November 30, 2022. Current AI tools improved clinical documentation by aiding clinicians or CDI initiatives in six domains, expanded on below: tools aided clinicians by structuring data, annotating notes, detecting errors, or serving as AI documentation assistants, and aided CDI initiatives by evaluating documentation quality or identifying trends. Seventy-seven percent of studies aided clinicians, while 23 percent aided CDI initiatives (Figure 3). Most studies concerned data structuring algorithms (68 percent), followed by evaluating quality (18 percent), identifying trends (5 percent), AI-enabled assistants (5 percent), detecting errors (3 percent), and annotating notes (1 percent) (Figure 3). While the prevalence of studies in each domain varies, each has the potential to improve clinical documentation as discussed below.
Structuring Free-Text Data
Once the standard in documentation, free-text notes are flexible and allow clinicians to dictate or type. In contrast, structured data consists of pre-populated fields offering a less flexible but organized, searchable, and easily analyzed note format.46,47 AI tools have the potential to bridge this gap, saving clinicians time by organizing text into paragraphs, presenting only the most relevant options in picklists, and automatically placing important information in structured data fields.
By necessity, clinic notes contain headings and an inherent organization to which an AI system can be applied. Rule-based approaches have been effective for various data structuring tasks, including classifying race with F-score = 0.911-0.984 (higher F-score indicating better performance),48,49 identifying confidential content in adolescent clinical notes,50 extracting social needs,51 and identifying stroke.52 Wang et al. developed an AI-guided dynamic picklist that displays the most probable allergic reactions once an allergen is entered, resulting in 82.2 percent of the top 15 picklist choices being relevant to a given note.53 Beyond rule-based methods, Gao et al. developed an adaptive unsupervised algorithm that automatically summarizes patients’ main problems from daily progress notes, with significant performance gains over alternative rule-based systems.54 To further enable data structuring, natural language processing (NLP) models were built by Ozonoff et al. and Allen et al. to extract patient safety events and social factors from EHRs, with accuracy > 0.9 and positive predictive value of 0.95-0.97, respectively.55,56 Yoshida et al. improved the accuracy of automated gout flare ascertainment by adding least absolute shrinkage and selection operator (LASSO) methods to a Medicare claims-only model.57
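To make the rule-based paradigm concrete, the sketch below flags stroke mentions with keyword patterns and computes the F-score used to report performance in the studies above. The patterns and notes are invented for illustration; published rule sets are far richer and handle negation, which this toy deliberately does not.

```python
import re

# Hypothetical keyword patterns for identifying stroke mentions
# (invented for illustration; published rule sets are far richer).
STROKE_PATTERNS = [
    r"\bischemic stroke\b",
    r"\bcerebrovascular accident\b",
    r"\bCVA\b",
]

def rule_based_label(note: str) -> bool:
    """True if any stroke pattern matches the note text."""
    return any(re.search(p, note, re.IGNORECASE) for p in STROKE_PATTERNS)

def f_score(tp: int, fp: int, fn: int) -> float:
    """F1: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# (note text, true label) pairs -- invented examples
notes = [
    ("Admitted with ischemic stroke, started on aspirin.", True),
    ("History of CVA in 2019; no deficits today.", True),
    ("Acute CVA ruled out after imaging.", False),   # rule misfires on negation
    ("Annual wellness visit; no acute complaints.", False),
]

tp = sum(1 for text, truth in notes if rule_based_label(text) and truth)
fp = sum(1 for text, truth in notes if rule_based_label(text) and not truth)
fn = sum(1 for text, truth in notes if not rule_based_label(text) and truth)
print(round(f_score(tp, fp, fn), 3))  # 2 TP, 1 FP, 0 FN -> 0.8
```

The misfire on the "ruled out" note illustrates why negation handling is a standard component of clinical NLP rule sets.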
In recent years, neural networks have been increasingly used for AI CDI. Moen et al. used a neural network to organize free-text sentences from nursing notes into paragraphs and assign subject headings with 69 percent coherence.58 Deep learning and generative models have been applied to extract social determinants of health,59 classify acute renal failure with AUC = 0.84 (AUC = 1.0 indicates perfect classification),60 extract headache frequency,61 and identify autism spectrum disorders.62 Hua et al. identified psychosis episodes in psychiatric admission notes, showing that decision-tree and deep-learning methods outperformed rule-based classification.63
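The AUC figures cited above summarize how well a model's scores rank notes with a condition above notes without it; AUC = 0.5 is chance, 1.0 is perfect separation. A minimal sketch of the computation (the Mann-Whitney estimate), using invented scores:

```python
def auc(scores_pos, scores_neg):
    """Probability that a random positive outranks a random negative
    (ties count half) -- the Mann-Whitney U estimate of AUC."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical classifier scores for notes with / without acute renal failure
positives = [0.9, 0.8, 0.65, 0.4]
negatives = [0.7, 0.5, 0.3, 0.2, 0.1]
print(round(auc(positives, negatives), 2))  # 17 of 20 pairs ranked correctly -> 0.85
```

Because AUC depends only on rank order, it is insensitive to the threshold chosen to convert scores into labels, which is why it is a common headline metric in the studies reviewed here.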
Algorithms were also developed to structure data for interdepartmental communication and even communication between institutions. For example, Visser et al.’s random forest model detected actionable findings and nonroutine communication in radiology reports with an AUC of 0.876.64 Kiser et al. developed models to group EHR transfer documents, improving transfers between institutions with AUC difference-in-difference ranging from 0.005 to 0.248.65 Other studies have structured a wide variety of data, often with high accuracy; in total, 88 studies in the domain of structuring free-text data were identified (Table 1). While promising, the above methods were not compared against the accuracy and efficiency of unassisted physicians, limiting external applicability.
Increasing Patient Understanding
As patient access to clinical documentation is increasingly mediated through online portals, medical terminology remains difficult for patients to understand.66,67 Chen et al. developed a system with rule-based and hybrid methods to link medical terms in clinical notes to lay definitions, which improved note comprehension among laypeople.68,21 Toward the same goal, Moramarco et al. used an ontology-based algorithm to convert sentences in medical language into simplified sentences in lay terms.69 In the future, these two methods of increasing patient understanding could increase patient adherence to treatment and decrease costs associated with nonadherence (Figure 1).21,68,70
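The term-linking idea can be sketched with a hypothetical dictionary mapping medical terms to lay definitions, appended in parentheses after each occurrence. The cited systems draw on curated medical ontologies rather than a hand-written table like this one.

```python
import re

# Hypothetical term-to-lay-definition table (invented for illustration;
# real systems link terms via curated medical ontologies).
LAY_DEFINITIONS = {
    "hypertension": "high blood pressure",
    "myocardial infarction": "heart attack",
    "edema": "swelling caused by fluid buildup",
}

def annotate_lay_terms(note: str) -> str:
    """Append a lay definition in parentheses after each known medical term."""
    # Substitute longer terms first so multi-word terms are matched whole.
    for term in sorted(LAY_DEFINITIONS, key=len, reverse=True):
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        note = pattern.sub(lambda m: f"{m.group(0)} ({LAY_DEFINITIONS[term]})", note)
    return note

print(annotate_lay_terms("Assessment: hypertension, mild edema of the ankles."))
# Assessment: hypertension (high blood pressure), mild edema (swelling caused by fluid buildup) of the ankles.
```

A dictionary substitution like this preserves the original term, so the annotated note stays usable for clinicians while becoming more readable for patients.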
Speech Recognition and Error Detection
AI-based speech recognition (AI-SR) programs raise the possibility of a “digital scribe” to decrease documentation burden. Programs evaluated in the peer-reviewed literature are limited by increased error rates, while newer commercial programs have not been well studied.71 Five studies reported a 19.0 to 92.0 percent decrease in mean documentation time with AI-SR, four studies reported increases of 13.4 to 50.0 percent, and three studies reported no significant difference.72–83 The ability of NLP tools to distinguish grammatically correct from incorrect language shows promise for improving SR error rates by detecting errors in SR-generated notes.84–86 Lybarger et al. detected 67.0 percent of sentence-level edits and 45.0 percent of word-level edits in clinical notes generated with SR,84 while Voll et al. detected errors with 83.0 percent recall and 26.0 percent precision.86 Lee et al. developed a model able to detect missing vitrectomy codes with an AUC of 0.87 while detecting 66.5 percent of missed codes.87 Since high error rates could be rectified in the future, clinicians already appear interested in adopting AI-SR: a survey of 1731 clinicians at two academic medical centers reported high levels of satisfaction with AI-SR and the belief that SR improves efficiency.88 In total, four studies were identified which addressed the domain of speech recognition and error detection (Table 1).
Integrative Documentation Assistance
AI has also been proposed as a real-time assistant to improve documentation during patient encounters by recording encounter audio, supporting physician decisions, calculating risk scores, and suggesting clinical encounter codes.8,89 Wang et al. developed an AI-enabled, tablet-based program that transcribes conversations from patient encounter audio, generates text from handwritten notes, automatically suggests clinical phenotypes and diagnoses, and incorporates desired images, photographs, and drawings into clinical notes; the SR component had a 53.0 percent word error rate, and the phenotype recognizer achieved 83.0 percent precision and 51.0 percent recall.90 Mairittha et al. developed a dialogue-based system that increased average documentation speed by 15 percent with 96 percent accuracy.91 Kaufman et al. developed an AI tool to transcribe data from speech and convert the resulting text into a structured format, decreasing documentation time but causing a slight decrease in documentation quality.92 Xia et al. developed a speech-recognition-based EHR with accuracy of 0.97 that reduced documentation time by 56 percent.93 Owens et al. reported that use of an ambient voice documentation assistant was associated with significantly decreased documentation time and provider disengagement, but not total provider burnout.94 Hartman et al. developed an integrated documentation assistant for automated generation of summary hospital course texts for neurology patients, 62 percent of which were judged by board-certified physicians to meet the standard of care.95 The increased efficiency reported in these studies is promising, but clinical implementation may be precluded by time spent correcting errors resulting from decreased documentation quality (Table 1).
Assessing Clinical Note Quality
A component of CDI initiatives is often manual chart review to assess clinical notes for timeliness, completeness, precision, and clarity.6,96 AI tools can assist toward that end by recognizing the presence or absence of knowledge domains, social determinants of health, performance status, and topic discussion, prompting clinicians to make additional notes relating to a domain when needed (Table 1).97–101 In addition to these domains, unclear and redundant information comprise major problems in clinical documentation.102 Deng et al. used an NLP system to evaluate the quality (classified as high, intermediate, or low) of contrast allergy records based on ambiguous or specific concepts, finding that 69.1 percent were of low quality.103 Zhang et al. developed a method combining conditional random fields and long short-term memory (LSTM) networks to classify information in clinical notes as new or redundant at the sentence level, achieving 83.0 percent recall and 74.0 percent precision.33,104 Similarly, Gabriel et al. developed an algorithm to classify pairs of notes as similar, exact copies, or common output (automatically generated notes such as electrocardiogram reports, laboratory results, etc.).105 Seinen et al. both detected and improved note quality, using semi-supervised models to refine unspecific condition codes in EHR notes.106 Zuo et al. also improved quality by standardizing clinical note format using a transformer-based approach.107 Regarding specific content standards, Barcelona et al. used natural language processing to identify stigmatizing language in labor and birth clinical notes.108
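The sentence-level redundancy task can be illustrated with a simple word-overlap heuristic. Note that the cited work used conditional random fields and LSTM networks; this toy sketch, with invented notes and an arbitrary threshold, only approximates the idea.

```python
def jaccard(a: set, b: set) -> float:
    """Word-set overlap between two sentences (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def flag_redundant(new_note: str, prior_note: str, threshold: float = 0.6):
    """Label each sentence of new_note 'redundant' if it closely overlaps
    any sentence already present in prior_note, else 'new'."""
    prior = [set(s.lower().split()) for s in prior_note.split(". ") if s]
    labels = []
    for sentence in new_note.split(". "):
        words = set(sentence.lower().split())
        redundant = any(jaccard(words, p) >= threshold for p in prior)
        labels.append((sentence, "redundant" if redundant else "new"))
    return labels

prior = "Patient remains afebrile. Wound healing well with no drainage"
today = "Patient remains afebrile. Plan: discharge tomorrow with follow-up"
for sentence, label in flag_redundant(today, prior):
    print(f"{label}: {sentence}")
# redundant: Patient remains afebrile
# new: Plan: discharge tomorrow with follow-up
```

A learned model improves on this heuristic by recognizing paraphrased repetition, which word overlap alone misses.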
Assessments of content domains, note clarity, and redundancy do not account for changes in these domains over time. Temporal metadata can be harnessed to improve clinical documentation, as demonstrated by Bozkurt et al., who used an NLP pipeline to evaluate documented digital rectal examinations (DRE) by insurance provider and classify them as current, historical, hypothetical, deferred, or refused.15,109 Other studies identified time-sensitive documentation concerns including goals-of-care discussions at the end of life, patient priorities language, and adherence to care pathways in heart failure.110–112 Another model, developed by Marshall et al., detected diagnostic uncertainty in EHR notes using rule-based NLP with a sensitivity of 0.894 and specificity of 0.967.113 In total, 20 studies were identified which addressed the domain of clinical note quality (Table 1). Such algorithms could be used to prompt clinicians if protocols and procedures are not correctly documented within a given time after diagnosis.
Identifying documentation trends
While recognizing and tracking metadata trends in documentation may improve documentation itself, it also has a role in intelligent modification of EHR systems and documentation policies. Young-Wolff et al. used an iterative rule-based NLP algorithm to demonstrate that electronic nicotine delivery system (ENDS) documentation increased over nine years; the team recommended that an ENDS structured field be added to the EHR.114 Since clinicians vary in their documentation styles, Gong et al. extracted a “gold standard” style by evaluating note-level and user-level production patterns from clinical note metadata with unsupervised learning. Their results implied that uninterrupted morning charting could improve efficiency.115
Besides individual styles, documentation is complicated by health system factors such as heterogeneous medical forms and many compartmentalized specialties.20,116 AI may consolidate these trends, leading to intelligently standardized forms and a more efficient system. Dugas et al. automatically compared and visualized trends in medical form similarity using semantic enrichment, rule-based comparison, grid images, and dendrograms.116 Modre-Osprian et al. analyzed topic trends in notes from a collaborative health information network, yielding insights about wireless device usage that improved network functioning.117 Further studies used AI to find trends in EHR audit logs and utilization patterns of notes, allowing efficiency trends to be identified.118,119 In total, six studies were identified which addressed the domain of identifying documentation trends (Table 1). By overviewing metadata trends in both individual clinical documentation patterns and rapidly changing health systems, AI tools could aid system optimization as medical infrastructure changes and care is delivered in new, increasingly specialized ways.
Discussion
As reviewed above, current AI tools improved clinical documentation by structuring data, annotating notes, and providing real-time assistance. Other features of AI CDI tools include assessing clinical note quality based on concept domains and providing insight into hospital systems and provider practices. A truly practical, comprehensive clinical AI assistant has not yet been reported in the peer-reviewed literature to our knowledge, but current AI tools could confer specific improvements to documentation workflows.
To overcome limitations in generalizability, future work should involve larger datasets and broader availability of training data. Processing such large amounts of data requires substantial computational power, which may become increasingly feasible as hardware continues to improve.9,120 This necessitates carefully regulated and secure computing systems, which must also account for documentation variations between geographic regions, institutions, and EHRs.121 AI-based systems that promote documentation interoperability could help overcome these challenges by creating larger unified training datasets.65,107 While a widely generalizable AI system could in principle be trained, the necessary data are often proprietary and not readily shared. Transfer learning techniques, which apply previously learned information to a new situation or context, may bridge this gap, enabled by collaboration and data sharing between health systems.122 Lexical variations can be overcome either by semantic similarity in rule-based NLP or by implementing machine learning techniques.121
Legal and ethical concerns relating to encounter recording and AI processing must also be addressed simultaneously with these changes for these systems to be successful long-term.123 Patients may have privacy concerns regarding the automatic collection, storage, and processing of encounter data, and the liability implications of AI-assisted clinical documentation, such as where blame falls when a documentation error occurs, are currently unclear. Ethical concerns raised in the literature include the nature of informed consent, algorithmic fairness and biases, data privacy, safety, and transparency.124 Legal challenges include safety, effectiveness, liability, data protection, privacy, cybersecurity, and intellectual property.124
For AI CDI systems to be implemented clinically, they must increase efficiency without sacrificing accuracy.71 In some cases, time spent fixing errors produced by AI outweighs the time saved by using the AI tool.71 The accuracy of AI-assisted versus clinician-generated notes has not been widely compared,84 and there is also a lack of studies investigating clinical outcomes and patient care, which must be assessed before widespread AI CDI implementation.90–95
While further studies of AI CDI tools are needed, this systematic review is the first to our knowledge to highlight a decrease in peer-reviewed AI CDI studies published following the release of ChatGPT on November 30, 2022.125 Reasons for this trend are not entirely clear, but may be due to researchers publishing on preprint servers amidst rapidly advancing techniques or developing proprietary models without publishing. The advent of large transformer language models shows promise, but rigorous peer-reviewed evaluation of proprietary models for improving clinical documentation is lacking.126,127
Limitations and Future Studies
Strengths of this narrative systematic review include that it presents AI tools for clinical documentation improvement in the context of medical practice and health systems, and that it is the first study to do so comprehensively. This study is subject to several limitations: the relevance of studies was determined by the authors, the efficacy of the tools was not objectively compared, commercial programs not studied in the peer-reviewed literature could not be evaluated, and studies may exist outside of the queried databases. Concerns regarding the cost of AI CDI tools, physician and hospital system acceptance, and potential job loss are not negligible; however, these lie beyond the scope of this review, which set out to report solely on methods and formats of improving documentation.
Conclusion
While current AI tools offer targeted improvements to clinical documentation processes, moderately high error rates preclude the broad use of a comprehensive AI documentation assistant. While large language models have the potential to greatly reduce error rates, many of these models are proprietary and not well-studied in the peer-reviewed literature. In the future, this hurdle may be overcome with further rigorous tool evaluation and development in direct consultation with physicians, as well as robust discussion of the legal and ethical ramifications of AI CDI tools.
References
1. Arndt BG, Beasley JW, Watkinson MD, et al. Tethered to the EHR: Primary Care Physician Workload Assessment Using EHR Event Log Data and Time-Motion Observations. Ann Fam Med. 2017;15(5):419-426. doi:10.1370/afm.2121
2. Blanes-Selva V, Tortajada S, Vilar R, Valdivieso B, Garcia-Gomez J. Machine Learning-Based Identification of Obesity from Positive and Unlabelled Electronic Health Records. In: Vol 270.; 2020:864-868. doi:10.3233/SHTI200284
3. Sinsky C, Colligan L, Li L, et al. Allocation of Physician Time in Ambulatory Practice: A Time and Motion Study in 4 Specialties. Annals of Internal Medicine. Published online September 6, 2016. Accessed January 1, 2022. https://www.acpjournals.org/doi/abs/10.7326/M16-0961
4. Tai-Seale M, Olson CW, Li J, et al. Electronic Health Record Logs Indicate That Physicians Split Time Evenly Between Seeing Patients And Desktop Medicine. Health Aff (Millwood). 2017;36(4):655-662. doi:10.1377/hlthaff.2016.0811
5. Shanafelt TD, Dyrbye LN, Sinsky C, et al. Relationship Between Clerical Burden and Characteristics of the Electronic Environment With Physician Burnout and Professional Satisfaction. Mayo Clinic Proceedings. 2016;91(7):836-848. doi:10.1016/j.mayocp.2016.05.007
6. Towers AL. Clinical Documentation Improvement—A Physician Perspective: Insider Tips for getting Physician Participation in CDI Programs. Journal of AHIMA. 2013;84(7):34-41.
7. Dehghan M, Dehghan D, Sheikhrabori A, Sadeghi M, Jalalian M. Quality improvement in clinical documentation: does clinical governance work? J Multidiscip Healthc. 2013;6:441-450. doi:10.2147/JMDH.S53252
8. Lin SY, Shanafelt TD, Asch SM. Reimagining Clinical Documentation With Artificial Intelligence. Mayo Clinic Proceedings. 2018;93(5):563-565. doi:10.1016/j.mayocp.2018.02.016
9. Luh JY, Thompson RF, Lin S. Clinical Documentation and Patient Care Using Artificial Intelligence in Radiation Oncology. Journal of the American College of Radiology. 2019;16(9):1343-1346. doi:10.1016/j.jacr.2019.05.044
10. Campbell S, Giadresco K. Computer-assisted clinical coding: A narrative review of the literature on its benefits, limitations, implementation and impact on clinical coding professionals. HIM J. 2020;49(1):5-18. doi:10.1177/1833358319851305
11. Agaronnik ND, Lindvall C, El-Jawahri A, He W, Iezzoni LI. Challenges of Developing a Natural Language Processing Method With Electronic Health Records to Identify Persons With Chronic Mobility Disability. Archives of Physical Medicine and Rehabilitation. 2020;101(10):1739-1746. doi:10.1016/j.apmr.2020.04.024
12. Barrett N, Weber-Jahnke JH. Applying Natural Language Processing Toolkits to Electronic Health Records – An Experience Report. In: Advances in Information Technology and Communication in Health. Vol 143. Studies in Health Technology and Informatics. IOS Press; 2009:441-446.
13. Blackley SV, Schubert VD, Goss FR, Al Assad W, Garabedian PM, Zhou L. Physician use of speech recognition versus typing in clinical documentation: A controlled observational study. International Journal of Medical Informatics. 2020;141:104178. doi:10.1016/j.ijmedinf.2020.104178
14. Blanco A, Perez A, Casillas A. Exploiting ICD Hierarchy for Classification of EHRs in Spanish Through Multi-Task Transformers. IEEE Journal of Biomedical and Health Informatics. 2022;26(3):1374-1383. doi:10.1109/JBHI.2021.3112130
15. Bozkurt S, Kan KM, Ferrari MK, et al. Is it possible to automatically assess pretreatment digital rectal examination documentation using natural language processing? A single-centre retrospective study. BMJ Open. 2019;9(7):e027182. doi:10.1136/bmjopen-2018-027182
16. Friedman C. Discovering Novel Adverse Drug Events Using Natural Language Processing and Mining of the Electronic Health Record. In: Combi C, Shahar Y, Abu-Hanna A, eds. Artificial Intelligence in Medicine. Vol 5651. Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2009:1-5. doi:10.1007/978-3-642-02976-9_1
17. Guo Y, Al-Garadi MA, Book WM, et al. Supervised Text Classification System Detects Fontan Patients in Electronic Records With Higher Accuracy Than ICD Codes. J Am Heart Assoc. 2023;12(13):e030046. doi:10.1161/JAHA.123.030046
18. He J, Mark L, Hilton C, et al. A Comparison Of Structured Data Query Methods Versus Natural Language Processing To Identify Metastatic Melanoma Cases From Electronic Health Records. Int J Computational Medicine and Healthcare. 2019;1(1):101-111.
19.