Healthcare Fraud Detection: An open data case study


I recently attended a data science event in Boston hosted by QuantUniversity. The topic was anomaly detection using machine learning. I was hoping to take away a few pearls of wisdom to test drive on our data. I have read about anomaly detection for claims data, which struck me as a fascinating challenge. HealthyHive has amassed a sea of healthcare claims data over the years, so the topic should make an interesting educational piece for data and transparency aficionados.

My interest was piqued when the professor noted he starts most anomaly studies with data visualization. According to the professor, anomaly detection models employing machine learning are tricky to train. Some initial data visualization work can help in deciding how to define observations as anomalous. Furthermore, the topic was more approachable given my limited ggplot2 (R package for data visualization) skills. As it turns out, a simple box plot is one way to at least start a fraud detection inquiry.

The Biggest Fraud You’ve Never Read About

Healthcare fraud is a multi-billion dollar problem. There are many different flavors of fraud in healthcare. This article is a good reference.

I’ve recently brainstormed a methodology to identify potential ‘upcoding’ in our data.  As the above article notes, ‘upcoding’ takes place when a provider over-bills for his or her services.


First, background on the jargon.

  • The first of the two healthcare interactions I am covering is a 15-minute office visit, CPT code 99213: ‘Established patient office or other outpatient visit, typically 15 minutes‘.
  • The second code is 99214: ‘Established patient office or other outpatient visit, typically 25 minutes‘. These two medical codes are among the most common in all of healthcare. The majority of office visits are coded with the 99213 CPT code.
  • The 99214 code, as the description notes, is 25 minutes, and is intended for a more thorough examination.  As we will discuss, that extra ten minutes can lead to a huge percentage change in the reimbursement rate.
  • ‘CPT’ stands for Current Procedural Terminology. The intellectual property rights to the codes are actually owned by the American Medical Association. That’s a story for another post.


I started with over 400,000 office visit claim records (‘99214’ and ‘99213’ combined) from 2015. I ended up using 75% of the initial 400,000 records after some data cleansing.

As noted, the vast majority of the visits were 15 minutes in duration:

While the volume of visits clearly favors the 15-minute code, it’s almost a push when we examine the cost breakdown between the two codes:

Incentive to Upcode

The average cost for the shorter (15-min) visit was $117, while the 25-min visit averaged $174. Adding an extra 10 minutes to a patient’s office visit equates to a 49% boost in revenue for the doctor.
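For readers who want to check the arithmetic, here is a minimal Python sketch of the percentage calculation. The two averages come straight from the paragraph above; the function and variable names are my own, chosen for illustration only.

```python
# Average cost per visit, as quoted in the text above.
avg_cost_99213 = 117.0  # 15-minute visit
avg_cost_99214 = 174.0  # 25-minute visit


def percent_increase(base: float, new: float) -> float:
    """Relative change from base to new, expressed in percent."""
    return (new - base) / base * 100.0


uplift = percent_increase(avg_cost_99213, avg_cost_99214)
extra_dollars = avg_cost_99214 - avg_cost_99213

# (174 - 117) / 117 is about 0.487, which rounds to 49%.
print(f"Extra revenue per visit: ${extra_dollars:.0f} ({uplift:.0f}%)")
```

The extra $57 per visit is what makes the split between the two codes worth examining provider by provider.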

Let’s see what the average cost range looks like across all providers for the two codes:

The upper range of the 25-minute visits was over $200 after trimming the top and bottom 5% of observations. Notice too that the inter-quartile range (IQR, or observations between the 25th & 75th percentile values) for the 15-min visits is a lot tighter than that of the 25-min visits.
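The trimming and IQR comparison described above can be sketched in a few lines of Python. This is only an illustration: the cost lists are made-up toy values, not the claims data, and the percentile method (the "inclusive" option of `statistics.quantiles`) is an assumption, since the post does not say how the percentiles were computed.

```python
import statistics


def trim(values, lower_pct=5, upper_pct=95):
    """Keep only values between the lower and upper percentiles (inclusive)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v for v in values if lo <= v <= hi]


def iqr(values):
    """Inter-quartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical per-provider average costs, for illustration only.
costs_15 = [105, 110, 112, 115, 117, 118, 120, 122, 125, 300]
costs_25 = [90, 130, 150, 165, 174, 185, 200, 215, 230, 400]

for label, costs in (("15-min", costs_15), ("25-min", costs_25)):
    kept = trim(costs)
    print(label, "kept:", len(kept), "max:", max(kept), "IQR:", iqr(kept))
```

In the toy data, trimming drops the single extreme value at each end, and the 15-min costs end up with a much narrower IQR than the 25-min costs, which is the same pattern the box plot shows.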

Provider Coding Distribution

Next, we wondered ‘Is the split between 15 and 25-minute visits consistent across providers?’

To start, we transformed the data to create a new variable: share of visits that are 25-min in duration. (Note, for this we filtered out providers with fewer than 10 visits in total and fewer than five 15-min and 25-min visits each. We ended up with 265,000 observations, where the average percentage of 25-min visits is 43.5% with a standard deviation of 21.5 percentage points.)
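Here is a rough Python sketch of that transformation. The claim tuples, provider IDs, and function name are hypothetical. The filter reflects one reading of the note above: a provider is kept only with at least 10 visits in total and at least five visits under each code.

```python
from collections import Counter


def share_of_long_visits(claims, min_total=10, min_per_code=5):
    """Return {provider_id: share of visits billed as 99214}.

    `claims` is an iterable of (provider_id, cpt_code) tuples.
    Providers below the minimum counts are dropped (assumed reading
    of the filter described in the text).
    """
    counts = Counter(claims)  # keys are (provider_id, cpt_code)
    providers = {provider for provider, _ in counts}
    shares = {}
    for provider in providers:
        short = counts[(provider, "99213")]
        long_ = counts[(provider, "99214")]
        total = short + long_
        if total < min_total or short < min_per_code or long_ < min_per_code:
            continue
        shares[provider] = long_ / total
    return shares


# Toy claims for three made-up providers.
claims = (
    [("A", "99213")] * 12 + [("A", "99214")] * 8
    + [("B", "99213")] * 5 + [("B", "99214")] * 15
    + [("C", "99213")] * 3 + [("C", "99214")] * 4  # too few visits, dropped
)
shares = share_of_long_visits(claims)
print(sorted(shares.items()))
```

Provider C falls out because it has fewer than five visits under each code, which keeps a handful of claims from producing a misleadingly extreme share.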

For the sake of this post, let’s isolate those providers where longer visits as a percentage of total visits is at least two standard deviations above average, which works out to a share of 86.5% or more (43.5% + 2 × 21.5%). By doing so, we isolate roughly the 2.5% of providers with the highest percentage of 25-min claims.
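The two-standard-deviation cutoff can be sketched as follows. The mean and standard deviation used for `article_threshold` are the figures quoted above; the provider shares are invented toy values, and the use of the sample standard deviation is an assumption. Note that the "top 2.5%" figure is the usual rule of thumb for a bell-shaped distribution, so the actual fraction flagged depends on the shape of the data.

```python
import statistics


def flag_high_share(shares, n_sd=2.0):
    """Return (flagged provider IDs, threshold) for shares >= mean + n_sd * SD."""
    values = list(shares.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD; an assumption
    threshold = mean + n_sd * sd
    flagged = sorted(p for p, s in shares.items() if s >= threshold)
    return flagged, threshold


# Threshold implied by the figures in the text: 43.5% + 2 * 21.5% = 86.5%.
article_threshold = 43.5 + 2 * 21.5

# Toy shares: twenty ordinary providers plus one extreme one (all hypothetical).
base = [0.30, 0.35, 0.40, 0.45, 0.50] * 4
toy_shares = {f"P{i:02d}": s for i, s in enumerate(base)}
toy_shares["P_high"] = 0.98

flagged, threshold = flag_high_share(toy_shares)
print("article threshold:", article_threshold)
print("flagged:", flagged)
```

Being flagged this way is a prompt for a closer look at a provider, not evidence of upcoding on its own.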

Right away the violin plot is curious.  The top tip is suggestive as it is not symmetrical with the bottom. Are these the bad apples who tend to upcode way more than average because it’s easy?  Or does something else explain these 2.5% of providers?  Does gender have a role?  Do men have a higher percentage of 25-min visits or vice versa?  It does not appear so:

Critical Analysis

How about looking at the various diagnoses?  Perhaps some diagnoses command a longer office visit given the relative complexity of care:

Scoping the Prey

There appears to be a good amount of uniformity across diagnoses, suggesting that office visit duration isn’t based on diagnosis.

So what does the upcoding profile look like for the top 5 providers?  How many 25-min visits versus 15-min visits?  First, a reminder of the total sampling distribution:

Keep in mind that this only looks at the most extreme 2.5% of providers. The level of fraud is likely much higher than 2.5% of providers. We conservatively estimate that this one quick survey of potential provider upcoding cost consumers and employers at least $40,000. One small sample in one county in one state.

From here we can transform the data into new variables and standardize certain features of the dataset before feeding them to random forest models for our machine learning applications.

We hope you found this informative.  Constructive feedback is always welcomed!


Please contact me (Carl Hall) with inquiries as to how we can help your company build a data-driven financial wellness framework.

Data source:  New Hampshire Comprehensive Health Information System (CHIS)
