Abstract
Healthcare fraud is an expensive, white-collar crime in the United States, and it is not a victimless crime. Costs associated with fraud are passed on to the population in the form of increased premiums or serious harm to beneficiaries. There is an intense need for digital healthcare fraud detection systems to evolve in combating this societal threat. Due to the complex, heterogeneous data systems and varied health models across the US, implementing digital advancements in healthcare is difficult. The end goal of healthcare fraud detection is to provide leads to investigators that can then be inspected more closely with the possibility of recoupments, recoveries, or referrals to the appropriate authorities or agencies. In this article, healthcare fraud detection systems and methods found in the literature are described and summarized. A tabulated list of peer-reviewed articles in this research domain listing the main objectives, conclusions, and data characteristics is provided. The potential gaps identified in the application of such systems to real-world healthcare data are discussed. The authors propose several research topics to fill these gaps for future researchers in this domain.
Keywords: Medicaid, fraud detection, class imbalance, machine learning, health insurance claims
Healthcare Fraud Introduction
Background and Significance
Caring for health has become more expensive, making both private and public administrators more cost conscious in recent years. Therefore, health decision-makers are actively looking for ways to reduce costs. One such avenue of saving potentially billions of dollars is to avoid and detect healthcare fraud. The National Health Care Anti-Fraud Association1 conservatively estimates that about 3 percent of our healthcare spending is lost to fraud ($300 billion approximately) yearly. Fraud is a complex and difficult problem. It is important to acknowledge that fraud schemes constantly evolve, and fraudsters adapt their methods accordingly. The earliest account2 of “fraud” in the healthcare literature is from the 1860s when railway collisions were a frequent occurrence, leading to a controversial condition called “railway spine,” which later became a leading cause of personal injury compensation in rail accidents. These accidental events were made profitable by means of insurance settlements in-court or out-of-court by opportunistic claimants, and these events laid the groundwork for fraud definitions and fraud management in the insurance industry.
Healthcare fraud has evolved in the 21st century and has a varied set of profiles ranging from simple fraud schemes to complex networks. The twin objectives of fraud management have always been fraud prevention and fraud detection3 (see the definitions section below). The consequence of submitting a fraudulent claim remains the same: the fraudster faces sanctions and prosecution in a court of law. However, the methods used in both prevention and detection have evolved since the 1800s, and so have the methods of detecting fraudulent claimants. With the advances in computing, and the more rapid availability of aggregated datasets in the healthcare domain, there are several opportunities for potential advancements in healthcare fraud management. Despite these advancements, it is very difficult to quantify the number of undetected fraudulent cases that do not get prosecuted. The identified limitations4 in achieving these advancements are manifold, including the use of legacy systems in claims processing; processing systems that are siloed due to the involvement of multiple entities (e.g., enrollment, approvals, authorizations, claims adjudications); sensitivity related to healthcare data privacy (e.g., sensitive healthcare domains such as family planning and mental health); and difficulty in proving intent of fraud in litigation settings.
The objectives of this review article are to summarize the methods and approaches used in healthcare fraud detection and to discuss the implementation gaps between the academic literature and real-world use in industry settings. Fraud detection in the literature encompasses data mining (rule-based to advanced statistical methods), over-sampling, and extrapolation techniques. The literature on overpayment and sampling estimation, both important steps in the business workflow of fraud detection, is addressed by Ekin et al. (2018).5
Definitions
There are many definitions in the literature and social media regarding what constitutes a healthcare fraud incident. Healthcare fraud is defined as an individual, a group of people, or a company knowingly misrepresenting or misstating something about the type, scope, or nature of the medical service provided, which, in turn, results in unauthorized6,7 payments.
There is a vast amount of literature8,9 available on fraud management techniques and models in different industries, such as healthcare, telecommunications, credit card services, insurance, and finance. Fraud management,10 in theory, is divided into two goals: fraud prevention and fraud detection. Fraud prevention in healthcare can be defined as any action or policy that is in place to prevent system abuse. For example, there is a Medicaid policy in the state of Texas11 for outpatient mental health services where certain types of providers, such as psychologists and licensed professional counselors, are limited to billing a combined maximum of 12 hours per day, regardless of the number of patients seen. This policy requirement is in effect to prevent fraud (by means of overbilling in this case) before it occurs. Fraud detection, on the other hand, is defined as identifying fraud as quickly as possible once a fraudulent scheme has already been perpetrated.
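The Texas billing cap described above can be read as a simple prevention check. The following is a minimal sketch of that idea, assuming hypothetical claim records with a provider ID, a service date, and billed hours; the field names and sample values are illustrative and are not taken from any Medicaid system.

```python
from collections import defaultdict

# Cap taken from the Texas outpatient mental health policy described above.
MAX_HOURS_PER_DAY = 12.0


def providers_over_daily_cap(claims, cap=MAX_HOURS_PER_DAY):
    """Return {(provider_id, service_date): total_hours} for days over the cap.

    `claims` is an iterable of dicts with the hypothetical keys
    'provider_id', 'service_date', and 'hours'.
    """
    totals = defaultdict(float)
    for claim in claims:
        # Hours are combined across all patients seen by the provider that day.
        totals[(claim["provider_id"], claim["service_date"])] += claim["hours"]
    return {key: hours for key, hours in totals.items() if hours > cap}


# Illustrative records only.
claims = [
    {"provider_id": "P1", "service_date": "2021-03-01", "hours": 7.0},
    {"provider_id": "P1", "service_date": "2021-03-01", "hours": 6.5},
    {"provider_id": "P2", "service_date": "2021-03-01", "hours": 8.0},
]
over_cap = providers_over_daily_cap(claims)
print(over_cap)  # {('P1', '2021-03-01'): 13.5}
```

In a prevention setting, a check of this kind would run before payment, so the excess hours are never paid rather than recovered later.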
Fraud Actors, Types, and Facts
Healthcare fraud takes many forms. Some of the more prevalent forms12,13 are traditional fraud schemes implemented by shell vendors, ghost employees who obtained access to bill payers, and employees who continue billing with expired licensures. Some of the main actors committing or involved in fraud are providers (those who are authorized to provide services to beneficiaries), beneficiaries (those who receive medical or associated services), medical equipment manufacturers, drug manufacturers, and agencies authorized to provide special services, such as home healthcare.
Some of the healthcare fraud schemes commonly discussed in literature and used often to develop fraud detection algorithms or analytics within regulatory entities such as the Office of Inspector General (OIG), the Department of Justice (DOJ), and the Centers for Medicare and Medicaid Services (CMS) are as follows:
- Diagnosis Related Groups (DRG) creep – when actors manipulate diagnostic and procedural codes to increase reimbursement amounts in an institutional setting
- Unbundling and fragmentation of procedures – billing individual service codes versus group service codes
- Up-coding of services – billing for a higher level of service than provided
- Phantom billing – billing for services not rendered to clients
- Excess number of services – billing unnecessary services that could lead to client harm
- Kickback schemes – actors might improperly pay for or waive the client’s out-of-pocket expense to make up for that cost in additional business
- Billing for mutually exclusive procedures
- Duplicate claims
- Billing errors
Figure 1 illustrates the percentages of improper payments in the United States Health & Human Services (HHS) government programs from 2012 to 2019. Such improper payments include any kind of underpayment, overpayment, fraud, and any unknown payments. The government healthcare programs that were included from the original data source14 are the following HHS agency programs: Children’s Health Insurance Program (CHIP), Medicaid, Medicare Fee-For-Service (FFS), Medicare Part C, and Medicare Part D. As seen in Figure 1, the Medicaid and CHIP programs have generally shown a steady increase in the percentage of improper payments.
Figure 2 reports the recoveries from the False Claims Act15 in years 1985 to 2020. In 2020 alone, $2.2 billion was recovered by the government, out of which $1.8 billion was from the healthcare industry. The recoveries are estimated to be significantly higher for 2021-2022 considering the ongoing difficulties in litigations in closed-court settings due to COVID-19.
Scope and Objectives
The scope of this article is twofold: to provide a comprehensive review of current healthcare-related fraud detection methods and to provide a discussion on implementation gaps in the application of such methods to real-world settings in the US. The Related Work section presents a comparative evaluation of review studies in the literature. This is followed by the Review of Study Methods section, which details selected fraud detection methods with discussions around gaps in applying these methods to real-world data. The next section focuses on implementation gaps, followed by the conclusions and future research section, which summarizes the main points and future research directions for healthcare fraud detection. Table 1 includes an extensive (not exhaustive) tabulated summary of healthcare fraud literature for prospective researchers in this area.
The literature reviewed here does not incorporate articles that included holistic healthcare as an objective, such as those of disease prediction, readmission, or length of stay, in which fraud identification is not necessarily the primary objective. In addition, only articles pertaining to healthcare fraud in the US were considered. In contrast to prior review articles,16-19 this article discusses the literature from a business workflow perspective starting from a data-driven lead to the end point of litigation/recoupment, and provides recommendations to address the research gaps in existent methods.
Related Work
The value of this review is not only for comparative purposes on the methods employed in the literature but, more importantly, to start a discussion of how relevant current academic healthcare fraud detection methods are to the downstream process of proving intent of fraud by investigators in an industry setting. An understanding of the implementation gaps and overall fraud detection process (i.e., starting from data leads provided by a model to a conviction phase in a legal setting) will help leverage the already available collective knowledge to help improve practical fraud detection methods.
Several articles discussed healthcare fraud data-mining methods in the literature with similar goals but from different perspectives. Li et al. (2008)20 categorized the three different actors in healthcare fraud (namely, providers, patients, and payers) and focused on the provider fraud literature. They further highlighted the scarcity of data pre-processing methods (from raw claims datasets to flattened datasets) and commented on the importance of this step in identifying healthcare fraud using supervised and unsupervised methods. They also highlighted the two main categories of classifier performance metrics: (1) error-based methods and (2) cost-based methods, with error-based metrics being more common in the healthcare fraud literature. An article by Bauder et al. (2017)21 focused specifically on up-coding fraud in several healthcare domains using medical claims data. They highlighted the lack of literature pertaining specifically to using supervised techniques in up-coding fraud detection.
Ekin et al. (2018)22 provided a comprehensive discussion of statistical methods in healthcare fraud, including sampling, over-payment estimation methods, and data-mining methods such as supervised, unsupervised, and outlier detection methods from the literature. The authors focused on describing unsupervised methods in more detail, such as using concentration functions and Bayesian co-clustering. Both Ekin et al. (2018)23 and Li et al. (2008)24 highlighted the lack of literature in identifying the potential drivers of fraud.
The most recent review by Ai et al. (2021)25 discussed medical fraud detection methods in the literature using qualitative methods. They conducted a methodological literature search following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, reporting the methods and number of peer-reviewed articles along with a qualitative analysis of statistical methods and model performance, using evaluation metrics when available, for the healthcare domain. Their research is quite comprehensive, with a focus on assessing the strength of model performance and the accuracy of existing fraud detection methods in the literature. They concluded that the evidence to provide a consolidated best method to identify healthcare fraud was inadequate, considering that the models in the literature were applicable to different domains within healthcare and therefore not directly comparable. They also highlighted that there was no literature available to estimate the cost of investigations in order to estimate potential cost savings using a fraud detection model.
Healthcare administration and payments have changed in the last two decades, especially from data quality and data integration perspectives. Although the standard forms used for data collection in payment processing, such as the CMS-1500 or the UB-04, have not changed significantly over time (apart from the increase in the volume of electronic submissions in the past two decades), there is still a significant gap in the application of literature methods to real-world settings. Other published review articles in this domain focused on the overall state of healthcare fraud literature and methods. This review extends the available literature by focusing on the applicability of these methods to real-world claims data and highlights the research gaps in the practical implementation of these methods.
Policy Statutes Overview
A range of civil and criminal laws and penalties apply to healthcare fraud.26,27 Government agencies such as the US Department of Justice (DOJ) and the HHS Office of Inspector General (OIG) are the enforcers of such laws and penalties. A quick overview of these laws aids in understanding the end goals of the fraud detection business workflow in real-world cases.
The business workflow starts from converting data-based fraud leads to a civil or criminal case indictment, depending on the path an investigation takes, followed by legal proceedings on a case-by-case basis. Data-driven fraud detection tools are only one piece of the complete fraud puzzle; nevertheless, they are an important piece because they provide a targeted, methodical means of finding fraud leads. A simplified business workflow showing how a fraudulent case progresses through the normal course of an investigation/audit is shown in Figure 3. The pictograph identifies the most relevant and helpful analytical methods used to identify fraud, waste, or abuse among providers, clients, or payers.
The common statutes under which fraudulent cases are prosecuted include both civil statutes (the False Claims Act and the Physician Self-Referral Law) and criminal statutes (Anti-Kickback Statute and Criminal Healthcare Fraud Statute).
False Claims Act28,29 – Many of the fraud cases are lawsuits filed under the False Claims Act (FCA). This is a federal statute originally enacted in the 1800s, and penalties could include recovery of up to three times the damages sustained by the government, in addition to financial penalties for each falsely submitted claim. Most fraudulent recoupments reported by DOJ are claimed under this act.30
Physician Self-Referral Law or Stark Law31 – Under this law, a physician is prohibited from referring patients to receive “designated health services” to an entity in which the physician or immediate family member of the physician has an investment.
Anti-Kickback Statute32 – Under this law, a medical provider is prohibited from soliciting or receiving any remuneration or rewards directly or indirectly for patient referrals or business generation from anyone.
Criminal Healthcare Fraud Statute33 – Under this law, any service provider is prohibited from executing a scheme in connection with delivery of health care benefits or services to defraud a health care program.
Data Sources
Healthcare data, in general, are broadly categorized as practitioners’ data, administrative claims data, and clinical data.34 The three sources of data together form a near-complete picture of the fraud data puzzle. However, it is extremely difficult for a single entity to be in possession of all three data sources. Moreover, even if data are available from all three sources, integrating them can be extremely challenging in real-world practice due to the varied systems and identifiers involved in the data collection and ETL (extract, transform, and load) process. For purposes of fraud detection, the most commonly used data source in the literature is administrative claims.
The administrative claims data collected by insurers do not differ much in their basic structure because of the standard template used in electronic claims processing. For example, the CMS 1500 form is used in the adjudication process of all professional claims. However, not all collected data are utilized for purposes of adjudication; hence, some data/field values can be considered informational. The collection and utilization of such informational column values also depend on the payer (e.g., fee-for-service versus managed care organization in different state and federal programs). In the next section, the current state-of-the-art fraud detection and prevention methods are briefly described.
Most fraud detection/prevention models discussed in literature are based on either synthetic data or data collected in a de-identified manner and made available as open-source or agency-specific data, such as Veterans Affairs TRICARE, Health and Human Services, or Texas Department of State Health Services. For example, aggregated Medicare/Medicaid data are now made available through the CMS.gov35 website. The Medicaid Analytic eXtract contains data collected by CMS from all states on a quarterly basis. Such data are available for researchers to study utilization patterns such as healthcare resource utilization or disease-based utilization. The fraud detection models developed using such aggregated data extracts are difficult for relevant parties to adopt due to the many logistical issues involved, such as the difficulty in linking results tied to the identified provider back to specific claim-line level data.
Rule-Based Fraud Detection
One of the most common approaches to identify fraud is to use domain or expert knowledge to identify anomalies in billing practices. Expert knowledge is often used and is very effective in keeping common fraud schemes in check.
Some common healthcare fraudulent claims as seen in the literature fall into the categories mentioned earlier. Simple to moderately complex rules are developed to identify billing errors or duplicate claims, as well as fraud categories such as DRG creep or up-coding.36 These are not to be confused with edits and audits in a claims processing system, as these rules are developed based on schemes rather than policy. These rules can be developed at a transaction level or actor level. This is a straightforward and effective approach, even though it is static in nature.
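As an illustration of a transaction-level rule, the sketch below flags duplicate claims. It assumes, hypothetically, that two claims are duplicates when they share the same provider, beneficiary, service date, and procedure code; real duplicate logic varies by payer, and the field names and placeholder codes here are not drawn from any claims system.

```python
from collections import Counter

# Hypothetical fields that together identify "the same service billed twice".
KEY_FIELDS = ("provider_id", "beneficiary_id", "service_date", "procedure_code")


def find_duplicate_claims(claims, key_fields=KEY_FIELDS):
    """Return the sorted list of claim keys that appear more than once."""
    counts = Counter(tuple(claim[f] for f in key_fields) for claim in claims)
    return sorted(key for key, n in counts.items() if n > 1)


# Illustrative records only; 'PROC_A' and 'PROC_B' are placeholder codes.
claims = [
    {"provider_id": "P1", "beneficiary_id": "B7", "service_date": "2021-03-01", "procedure_code": "PROC_A"},
    {"provider_id": "P1", "beneficiary_id": "B7", "service_date": "2021-03-01", "procedure_code": "PROC_A"},
    {"provider_id": "P1", "beneficiary_id": "B7", "service_date": "2021-03-01", "procedure_code": "PROC_B"},
    {"provider_id": "P2", "beneficiary_id": "B9", "service_date": "2021-03-02", "procedure_code": "PROC_A"},
]
duplicates = find_duplicate_claims(claims)
print(duplicates)
```

An actor-level version of the same rule would aggregate the flagged keys by provider and rank providers by how often duplicates occur.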
The inherent limitation with such rule-based detection is that once the fraudster becomes aware of the rules—either due to unpaid/rejected/held out claims, or due to a retrospective inspection or audit of adjudicated claims—their fraudulent patterns could change, and these rule-based detection programs cannot quickly adapt to the fraud pattern modifications. Other limitations to having a rule-based detection system are that these engines are very expensive to build, as they require constant inputs from fraud experts and are quite difficult to maintain and manage in the fast-changing healthcare landscape. It is thus very difficult to keep a rule-based system lean and up to date.
Data-Driven Fraud Detection
Data-driven fraud detection is becoming increasingly popular in all domains, and the healthcare domain is no exception. Implementing data-driven fraud detection methods offers higher fraud detection power along with operational and cost efficiencies. The fraud literature regarding the applications of advanced statistical techniques in various healthcare domains (medical, dental, etc.) covers three main aspects of the business process: fraud detection, statistical sampling, and oversampling estimation methods. Fraud detection methods37-42 all have one common motivation, which is to mine data to assess patterns.
Data-driven methods can be categorized broadly as supervised, unsupervised, and hybrid learning methods. These techniques can be summarized from a fraud perspective as below:
- Supervised learning methods employ samples of previously known fraudulent and legitimate transactions or providers.
- Unsupervised learning methods do not require a prior knowledge of fraudulent transactions or providers. They focus more on anomalies based on distributions of a provider’s billing behavior. They also use descriptive statistics to help learn such patterns in some cases.
- Hybrid learning is where a mix of both supervised and unsupervised techniques are used.
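To make the unsupervised case concrete, the sketch below flags providers whose billed totals sit far from their peers, with no fraud labels involved. It uses a robust z-score based on the median and the median absolute deviation (MAD), which is one of many possible anomaly measures and is not attributed to any study reviewed here. The cutoff of 3.5 is a commonly used convention adopted as an assumption, and the provider totals are illustrative.

```python
from statistics import median


def flag_outlier_providers(billed_by_provider, threshold=3.5):
    """Return sorted provider IDs whose robust z-score exceeds `threshold`.

    `billed_by_provider` maps a provider ID to a billed total (hypothetical).
    """
    values = list(billed_by_provider.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Every deviation is zero (or most are), so the score is undefined.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data are roughly normal.
    return sorted(
        provider
        for provider, value in billed_by_provider.items()
        if 0.6745 * abs(value - med) / mad > threshold
    )


# Illustrative totals only.
billed = {"P1": 100.0, "P2": 110.0, "P3": 95.0, "P4": 105.0, "P5": 400.0}
flagged = flag_outlier_providers(billed)
print(flagged)  # ['P5']
```

A flagged provider is only a lead for review. An unusual billing distribution can have legitimate explanations, such as a specialty with few peers.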
It is also worth mentioning that these data-mining methods depend on a well-defined problem statement and the acquisition of relevant, adequate, and clean data. The modeling process flow (irrespective of the learning methods used) involves a sequence of steps as it relates to fraud and is described in Figure 4. The different levels of complexity involved in data-driven fraud models from the literature are discussed in the next section.
Review of Study Methods in Healthcare Fraud
This section presents selected study methods and discusses practical implementation gaps of these methods. The studies were screened from a structured database search using search terms such as “fraud,” “healthcare,” “secondary data,” “prescriptions,” “Medicaid management information system,” “Medicaid,” “Medicare,” and any possible combinations of these search terms. From this, the studies were further narrowed down by focusing on the data, methods, and implementation of fraud algorithms. A subset of such studies is discussed in this section, as they attempt to address some implementation gaps such as class imbalance in real-world data, missing fraud labels, and data pre-processing techniques before applying algorithmic models to data.
Supervised Learning
A supervised learning task is to learn a function that maps inputs to response variables based on the available labeled response data. Researchers using supervised learning methods in fraud detection have the following in common: a labeled dataset (i.e., fraudulent: yes or no), a domain-specific justification to choose one algorithm versus another, and a performance metric of choice to determine the best algorithm. The general concept that stands out in the development of such supervised models is the identification of features that can discriminate a fraudulent provider from legitimate providers. The methods of identifying such features vary between researchers and mostly operate at the provider level rather than the transaction level.
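As a minimal illustration of learning such a mapping from labeled data, the sketch below fits a one-feature threshold rule (a decision stump) and scores it by training errors. The feature (a hypothetical ratio of a provider's billed amount to the specialty average), the sample values, and the function names are assumptions for illustration; the studies reviewed in this section use richer algorithms and many features.

```python
def fit_threshold_classifier(feature_values, labels):
    """Pick the threshold on one feature that minimizes training errors.

    A provider is predicted fraudulent (1) when its value >= threshold.
    Returns (best_threshold, training_errors).
    """
    best_threshold, best_errors = None, len(labels) + 1
    for candidate in sorted(set(feature_values)):
        errors = sum(
            (x >= candidate) != bool(y) for x, y in zip(feature_values, labels)
        )
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold, best_errors


def predict(feature_values, threshold):
    """Apply the learned rule to new providers."""
    return [1 if x >= threshold else 0 for x in feature_values]


# Illustrative data: billing ratio per provider and a fraud label (1 = fraud).
ratios = [1.0, 1.2, 0.9, 3.1, 2.8, 1.1]
labels = [0, 0, 0, 1, 1, 0]
threshold, errors = fit_threshold_classifier(ratios, labels)
print(threshold, errors)            # 2.8 0
print(predict([0.8, 3.0], threshold))  # [0, 1]
```

The three common ingredients named above all appear here in miniature: the labeled dataset, the choice of algorithm, and the metric (training error count) used to select the best rule.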
Considerations in Defining Ground Truth
It is important to acknowledge that any supervised technique application is inherently dependent on the validity of the labeled dataset used to categorize the data to their corresponding classes. Supervised learning algorithms thus require confidence in the correct classification/labeling of the providers. The fraud labels for the reviewed providers are classified to one of two categories: fraudulent or not fraudulent (legitimate). But it is not known if providers who were never reviewed did or did not commit fraud. Some published studies43-45 address this uncertainty partially by having a varied range as an estimate for class distribution of the “never reviewed” providers. Thus, there will always be cases where fraud is mislabeled as non-fraud. Binary classification of providers as fraudulent or legitimate does not allow for uncertainty to remain after providers are investigated. In contrast, the confidence that a provider committed fraud (“fraud” confidence) could be used for supervised learning in lieu of a binary ground truth.
The labeled fraud dataset is skewed in nature, irrespective of methods used for label associations in a dataset. The skewness arises from the practical fact that only a small number of the reviewed providers are categorized as fraudulent while the majority of the reviewed providers are legitimate. This nature of skewness in a categorical label assignment is called “class imbalance” and has its own literature46 stemming from computer science and its applications to real-world problems.
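One common response to class imbalance, named here as an illustration rather than as a method the text above prescribes, is random undersampling of the majority (legitimate) class until the minority (fraud) class reaches a chosen share of the training data. The sketch below assumes hypothetical provider records with a binary `fraud` field; the record layout, the target fraction, and the seed are illustrative choices.

```python
import random


def random_undersample(records, label_key="fraud", minority_fraction=0.5, seed=0):
    """Keep all minority records and a random subset of majority records.

    The majority subset is sized so that the minority class makes up
    roughly `minority_fraction` of the returned data.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    minority = [r for r in records if r[label_key] == 1]
    majority = [r for r in records if r[label_key] == 0]
    n_majority = round(len(minority) * (1 - minority_fraction) / minority_fraction)
    n_majority = min(n_majority, len(majority))
    return minority + rng.sample(majority, n_majority)


# Illustrative data: 5 fraudulent and 95 legitimate providers.
records = [{"provider_id": i, "fraud": 1 if i < 5 else 0} for i in range(100)]
balanced = random_undersample(records, minority_fraction=0.2)
n_fraud = sum(r["fraud"] for r in balanced)
print(len(balanced), n_fraud)  # 25 5
```

Undersampling discards legitimate records, so it trades information for balance; oversampling the minority class is the usual alternative when data are scarce.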
Review of Supervised Learning in Healthcare Fraud Detection
Bauder et al. (2018 and 2018, May)47,48 compared different supervised learning techniques (Random Forest, C4.5 decision tree, support vector machine, and logistic regression) to find the effect of class imbalance in fraud detection. The authors used publicly available claims data (Medicare Provider Utilization and Payment Data: Physician and Other Supplier) from CMS. The labels for known fraudulent medical providers across all specialties and provider types were obtained in 2017 from the OIG’s publicly available List of Excluded Individuals/Entities (LEIE) database. The final merged Medicare dataset (claims and labeled fraud data) was highly imbalanced (about nine out of every 100,000 providers were marked fraudulent). The performance metrics used were area under curve (AUC); false positive rate (FPR, the ratio of non-fraud cases incorrectly categorized as fraudulent to the total number of non-fraudulent cases); and false negative rate (FNR, the ratio of fraud cases incorrectly categorized as non-fraud to the total number of fraudulent cases). Two main conclusions were:
- The C4.5 (decision tree) algorithm had the best performance on the AUC metric (0.883).
- As the minority class distribution was varied from 20 percent to 50 percent, the learners became worse on their performance metrics.
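The three metrics defined for this study can be computed directly from labels and model outputs. The sketch below follows the FPR and FNR definitions given above and assumes that AUC refers to the area under the ROC curve, computed here through its equivalent pairwise-ranking form. The labels, scores, and 0.5 decision threshold are illustrative and are not results from any reviewed study.

```python
def false_positive_rate(y_true, y_pred):
    """Non-fraud cases flagged as fraud, divided by all non-fraud cases."""
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return false_pos / sum(1 for t in y_true if t == 0)


def false_negative_rate(y_true, y_pred):
    """Fraud cases labeled non-fraud, divided by all fraud cases."""
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return false_neg / sum(1 for t in y_true if t == 1)


def roc_auc(y_true, scores):
    """Chance that a random fraud case outscores a random non-fraud case.

    Ties count as half a win; this equals the area under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative labels (1 = fraud) and model scores.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(false_positive_rate(y_true, y_pred))  # 0.25
print(false_negative_rate(y_true, y_pred))  # 0.5
print(roc_auc(y_true, scores))              # 0.875
```

FPR and FNR depend on the chosen decision threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason it is often preferred for imbalanced data.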
Herland et al. (2018 and 2019)49,50 also investigated the effects of class imbalance on supervised learning for fraud detection using the same publicly available datasets (claims and fraud labels) as Bauder et al. (2018, May). The authors concluded that a logistic regression model followed by gradient tree boosting performed well based on the AUC metric (0.828) evaluation.
Fan et al. (2019)51 focused on physician fraud detection combining the different open datasets on claims (CMS data), social media ratings on physicians (Healthgrades.com), and ground truth fraud datasets such as LEIE and Board Actions. The different classifiers that were trained included logistic regression, naïve Bayes, and a decision tree classifier. The board action dataset features did not prove to be beneficial to their classification model, although it is not clear which features from the dataset were included in the modeling process. Some feature engineering was performed to determine the final set of features resulting in a best classifier. The authors concluded that their classification performance was highest using a decision tree with features (based on rating) from social media, open payment, and prescriber (CMS) datasets.
Ekin et al. (2021)52 provided an overview of pros and cons in addressing three steps of the statistical fraud detection modeling process. In their experimental design, they manipulated the claims data to address the variance in the model performance from:
- Correlated features – e.g., principal component analysis (PCA) on the features to address multicollinearity
- Classifier type – nine supervised classification algorithms such as random forest, naïve Bayes, and neural networks.
- Class imbalance – this effect was addressed by using four sampling techniques (e.g., random walk oversampling (RWO))
They utilized a wide range of evaluation metrics to assess the different models’ performance with the aggregated public datasets (CMS’s Part B, CMS’s zipcode to carrier locality file, and CMS’s Geographic Variation Public Use File). To simulate an adjustment to the well-known method of considering LEIE data as the only source of ground truth for fraud labels, they performed an experiment with a range of possible fraud proportions (0.06 percent to 45.76 percent). The combination of these data manipulations led to a total of 40