TL;DR: This is a dataset of 100 million prices that Blue Cross of Washington (Premera Blue Cross) has negotiated with every in-network provider for healthcare procedures.
On July 1st of this year, American health insurers published their negotiated (previously non-public) prices for every covered medical procedure, as required by the Transparency in Coverage Final Rule. The datasets published by each insurer are massive: they run into the tens or hundreds of terabytes. Thankfully, most insurers have adhered to a schema dictated by the Centers for Medicare & Medicaid Services (CMS), which is documented quite well on their GitHub.
As part of their compliance, Premera published their data, which you can access for yourself, in its raw form, starting on this page.
Working with the raw data is inconvenient, to put it mildly. Abiding by the CMS schema (well-intentioned as it was) meant that the files published by Premera ranged from 100 bytes to hundreds of gigabytes. The resulting raw data ran into the terabytes, spread over nearly half a million files that each needed to be downloaded individually.
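If you're curious what fetching half a million files actually involves, here's a minimal sketch of a parallel download loop. The URL list file and output directory are placeholders, not Premera's real endpoints; in practice you'd extract the file URLs from the insurer's table-of-contents index.

```python
import concurrent.futures
import pathlib
import urllib.request

# Hypothetical inputs: a text file listing one file URL per line
# (in practice, scraped from the insurer's index).
URL_LIST = "premera_file_urls.txt"  # placeholder path
OUT_DIR = pathlib.Path("raw")       # placeholder output directory

def download(url: str) -> str:
    """Fetch a single file and save it under OUT_DIR, keeping its name."""
    name = url.rsplit("/", 1)[-1]
    dest = OUT_DIR / name
    if not dest.exists():  # skip files we already have
        urllib.request.urlretrieve(url, str(dest))
    return name

OUT_DIR.mkdir(exist_ok=True)
urls = pathlib.Path(URL_LIST).read_text().split()

# With ~half a million small-to-huge files, a thread pool keeps
# many downloads in flight at once.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    for name in pool.map(download, urls):
        print("done:", name)
```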
To make the data easier to work with, I downloaded and parsed all of it, flattening the JSON files into CSVs, which at times required 1,000 machines running on Google Cloud and a Spark cluster with a terabyte of RAM.
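To give a flavor of what that flattening involves, here's a minimal PySpark sketch. It assumes the CMS in-network schema (a top-level `in_network` array whose items contain a `negotiated_rates` array, which in turn contains `negotiated_prices`), and it rounds prices as described below; treat it as an illustration, not the production pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, round as sround

spark = SparkSession.builder.appName("flatten-in-network").getOrCreate()

# Each in-network file is a single large JSON document, so read it whole.
raw = spark.read.option("multiLine", True).json("in_network/*.json")

flat = (
    raw
    # One row per procedure (billing code).
    .select(explode("in_network").alias("item"))
    # One row per (procedure, negotiated-rate group).
    .select("item.billing_code", "item.billing_code_type",
            explode("item.negotiated_rates").alias("rates"))
    # One row per individual negotiated price.
    .select("billing_code", "billing_code_type",
            explode("rates.negotiated_prices").alias("price"))
    # Round prices to the nearest integer, per the note below.
    .select("billing_code", "billing_code_type",
            sround(col("price.negotiated_rate")).alias("negotiated_rate"),
            col("price.negotiated_type").alias("negotiated_type"))
)

flat.write.option("header", True).csv("flattened_csv")
```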
I have tried to be faithful to the underlying data: save for any unintentional parsing errors, the only “loss” of data is that prices have been rounded to the nearest integer, i.e. a price of $101.79 is stored simply as $102. However, I do not guarantee the accuracy of this data in any way.
If you don’t want to download terabytes of data yourself, we’re making the cleaned data available as a demo dataset on Shape. Once you sign in, simply select “Connect Demo Dataset” and then “Connect Transparency Database.”
Here’s a quick demo of how you can use Shape on the dataset to calculate a range of in-network prices for a specific procedure:
https://www.youtube.com/watch?v=RrgELSj5e24&vq=hd1080
<aside> 💡 If you’re accessing this soon after this post is published, please forgive any slow queries: we only have a single small Snowflake instance backing tables with billions of rows, so expect some congestion as everyone writes queries. On the other hand, if you work at Snowflake or Firebolt and can get us some free credits to host this data, we’d be very grateful 🙂
</aside>
We will publish the raw CSV data, along with the code used to transform it, and update this post later this week!
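In the meantime, once the CSVs are up, the price-range calculation from the video above might look something like this in pandas. The file name and the `billing_code`/`negotiated_rate` columns are assumptions based on the CMS schema, and CPT 99213 (a routine office visit) is used purely as an example code.

```python
import pandas as pd

# Hypothetical file and column names, following the CMS schema fields.
# billing_code is read as a string so codes like "99213" compare cleanly.
df = pd.read_csv("negotiated_rates.csv", dtype={"billing_code": str})

# Filter to a single procedure: CPT 99213, an established-patient office visit.
visit = df[df["billing_code"] == "99213"]

print(visit["negotiated_rate"].describe())           # count, mean, quartiles
print(visit["negotiated_rate"].agg(["min", "max"]))  # the in-network range
```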
This dataset is important, and complex, so I’m spinning up a Slack for anyone who wants to discuss it, stay in the loop on its evolution, or share findings!
Join us on Slack here.
All the data was converted into the eight tables described below.