By: Alice Chen, Michael Richards, and Kosali Simon
We provide an overview of some commonly used data sources for research using health care encounter data. We introduce these data sources, identify their relative accessibility (which may differ across institutions and networks), and provide download links. As in prior columns on hospital financial data and data on medical providers, the resources we document here are not comprehensive; we encourage researchers to continue the conversation on Twitter with the hashtag #EncounterResearchData and share more details about these or other resources (as well as their own publications that use them, so we may learn from each other’s research).
Publicly Available Data (Colleen Carey and Ian McCarthy)
Several surveys and publicly-available administrative datasets supply information on health care encounters that can be simply downloaded for free. Some measures of patient-level health care encounters are publicly available from survey data such as the Medical Expenditure Panel Survey or the National Health Interview Survey. The National Ambulatory Medical Care Survey samples office visits and can be a particularly good resource for primary care. If you need data on hospital utilization or spending, the National Hospital Ambulatory Medical Care Survey is also publicly available.
However, these data are collected as small- to medium-scale surveys via voluntary participation, so they may not capture a sufficient number of cases for a given research question.
It is also important to note that geographic information is commonly excluded from the previously listed datasets for privacy reasons. To access geographic identifiers, a data-use agreement and modest fees are often required. For example, the National Inpatient Sample (NIS), the State Inpatient Databases, and the State Emergency Department Databases are administrative datasets that all enable state-level estimates of hospital-based activities (NIS only in older years) after a data application and payment. Additionally, only some states participate, and each year of a state’s data must be purchased separately. More information on state discharge data can be found in the column on medical providers.
Medicare (Colleen Carey and Ian McCarthy)
Medicare encounter datasets are commonly used in medical and health economics studies because of their large sample size and level of detail. Data are available on the universe of Medicare beneficiaries or subsamples, either random or researcher-specified, as far back as 1992.
Medicare claims are organized into files based essentially on who pays the claim and how it is paid. For example, Medicare pays providers directly for fee-for-service beneficiaries, but private insurers pay claims for Medicare Advantage beneficiaries. Thus, claims for these two groups are in different files, with the Medicare Advantage data only available starting in 2015. Files loosely correspond to sites of care, but seemingly similar services can be in different files if Medicare handles the claims for them differently. For example, an emergency department visit that results in a hospital admission is recorded in a different file (Inpatient) from those that do not (Outpatient). Prescription drugs taken at home are recorded in the Part D Event file, while those administered in a doctor’s office are recorded in the Carrier file. The Research Data Assistance Center (ResDAC) facilitates all Medicare data requests and can help you understand what files you need for your research.
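The examples in the paragraph above can be restated as a small lookup. This is only an illustrative sketch for fee-for-service claims: the `Service` class, its field names, and the `kind`/`setting` labels are hypothetical, and the function covers only the four cases named in the text, not the full set of Medicare files. ResDAC remains the authority on which files a project needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """A hypothetical description of a fee-for-service encounter."""
    kind: str               # "ed_visit" or "drug" (labels invented for this sketch)
    admitted: bool = False  # for ED visits: did the visit end in a hospital admission?
    setting: str = ""       # for drugs: "home" or "office"


def ffs_file_for(service: Service) -> str:
    """Return the claims file named in the text for the given service."""
    if service.kind == "ed_visit":
        # ED visits that lead to an admission are recorded with the inpatient stay.
        return "Inpatient" if service.admitted else "Outpatient"
    if service.kind == "drug":
        if service.setting == "home":
            return "Part D Event"
        if service.setting == "office":
            return "Carrier"
    # Anything else is outside the examples given in the text.
    raise ValueError(f"No file mapping in this sketch for {service!r}")


if __name__ == "__main__":
    print(ffs_file_for(Service("ed_visit", admitted=True)))
    print(ffs_file_for(Service("drug", setting="office")))
```

The point of the sketch is that the file is determined by how the claim is paid rather than by the clinical service alone, so the same kind of service can appear in two different files.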
Access to micro-level Medicare claims can be expensive. Medicare offers two pricing schemes: up-front purchase of particular files, or an annual fee to access all files via the Virtual Research Data Center (VRDC). The price for purchasing particular files varies with the size of the subsample. If other researchers at your institution have the files you need, you can instead request to “reuse” their datasets for your own research question at a reduced price.
While the up-front costs for data access through the VRDC are lower, you must pay an annual fee to keep your VRDC “seat” as well as an annual fee for storage space. A project that runs 10 years may be better off buying the data directly at a higher up-front cost, but a project that runs just a few years may be cheaper through the VRDC. Researchers are also limited to SAS, Stata, and perhaps a few other programs in the VRDC (i.e., no R, no Python, no Julia).
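The trade-off between the two pricing schemes comes down to a break-even project length. The sketch below shows the arithmetic only. All dollar figures are made-up placeholders, not actual CMS or VRDC prices, and the function and parameter names are hypothetical; current fees should be confirmed with ResDAC.

```python
def vrdc_cost(years: float, setup_fee: float,
              annual_seat_fee: float, annual_storage_fee: float) -> float:
    """Total VRDC cost: a lower up-front fee plus recurring seat and storage fees."""
    return setup_fee + years * (annual_seat_fee + annual_storage_fee)


def break_even_years(purchase_price: float, setup_fee: float,
                     annual_seat_fee: float, annual_storage_fee: float) -> float:
    """Project length at which VRDC access costs the same as buying the files."""
    annual = annual_seat_fee + annual_storage_fee
    if annual <= 0:
        raise ValueError("annual VRDC fees must be positive")
    # Below this length the VRDC is cheaper; above it, direct purchase is cheaper.
    return max(0.0, (purchase_price - setup_fee) / annual)


if __name__ == "__main__":
    # Placeholder numbers chosen only to illustrate the calculation.
    purchase, setup, seat, storage = 100_000, 10_000, 15_000, 5_000
    print("break-even (years):", break_even_years(purchase, setup, seat, storage))
    for years in (3, 10):
        print(years, "years in VRDC:", vrdc_cost(years, setup, seat, storage),
              "vs purchase:", purchase)
```

With these placeholder inputs, a three-year project is cheaper in the VRDC and a ten-year project is cheaper with direct purchase, matching the pattern described above. Reuse discounts and software constraints are not modeled.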
For recent years, Medicare aggregates physician services and prescription drug claims into a physician-service-year public use file that is available for free. This file contains no patient-level information and is subject to some privacy-related censoring, but is a great resource for studies of physician behavior.
Because Medicare represents a significant portion of the total health care system, researchers commonly use Medicare data to answer questions about the health care system as a whole. Medicare claims, however, only contain information needed for claims processing. Basic demographics such as age, sex, and race are available, but there is no information on beneficiary socioeconomic status or household composition.
Recent publication examples are listed as references 1 and 2.
Marketscan Data (Zirui Song)
The Marketscan Data are one of the longest-standing data resources based on commercial claims activity and remain commonly used today. The data are currently owned by IBM, which acquired the data assets from Truven several years ago. IBM will sell the data to researchers and other entities directly.
An immediately attractive feature of Marketscan Data is the relatively large sample sizes from commercially insured and Medicare populations. Note that the Medicare population is limited to beneficiaries with supplemental coverage through their employers. Health economists can observe detailed price and cost-sharing information at the individual claim level across the inpatient, outpatient, and prescription drug domains. The data also tend to be well organized and fairly clean, which isn’t always guaranteed when working with transaction-based data.
While the data do include information on provider specialization as well as site of care, they lack physician, physician group, or hospital identifiers. The relative importance of this will depend on the research question being pursued. Similarly, the health economist can observe the plan type (e.g., HMO, PPO, etc.) and a de-identified plan ID associated with a given claim, but greater details, such as the name of the insurer, are masked. It is also important to bear in mind that the Marketscan Data are a non-random, convenience sample from large data contributors (generally employers), which will consequently have uneven geographic representation.
Recent publication examples are listed in references 3 and 4.
Optum (Sumedha Gupta)
An often-used source of longitudinal information on medical and pharmacy claims, together with lab results and administrative data, is Optum’s de-identified Clinformatics® Data Mart system, a comprehensive commercial and Medicare Advantage claims database. The data are available through a negotiated contract with Optum; inquiries can be sent to connected@optum.com. The database currently includes claims from January 2007 through 2020 (with a lag of approximately 6 months), covering over 65 million unique individuals. Variables included in the data are listed here; of note, they include de-identified patient IDs for longitudinal tracking, patient age, patient location, and gender. Provider details include a de-identified provider ID and specialty category. Detailed socioeconomic information on patients (race, poverty status, education level, marital status, and household income) and, separately, date-of-death data can be obtained without detailed geography.
Overall, Optum provides a unique opportunity for detailed analysis over an exceptionally long observation window. However, Optum does not provide physician NPI or DEA numbers, so researchers cannot merge the data with other physician data, and patient socioeconomic and demographic variables are not available at sub-state geographies.
A recent publication example is reference #5.
HCCI Data (Erin Trish and Erin Duffy)
Perhaps one of the more recent and certainly one of the largest data troves being incorporated into health economics research comes from the Health Care Cost Institute (HCCI). Historically, the HCCI data available to researchers have included commercial and Medicare Advantage claims from three large, national insurers – UnitedHealthcare, Aetna, and Humana. However, HCCI has recently announced that new data sources – such as its new partnership with Blue Health Intelligence – will be available later this year. It’s also worth noting that HCCI houses Kaiser Permanente claims data and is a qualified entity for providing access to and/or analysis of Medicare fee-for-service claims. But here, we focus on the commercial and Medicare Advantage (MA) claims data attributes belonging to HCCI.
At this time, HCCI is not accepting applications for access to its commercial claims dataset due to the data evolutions described above, but the HCCI administrators expect to invite new applications later this year. To stay looped in, health economics researchers can visit the HCCI website as well as sign up to receive email updates via this link.
The appeal of the HCCI claims data repository for provider-focused studies is fairly clear. First, the data are drawn from multiple large insurers, which captures a wider slice of the commercial and MA markets overall and therefore improves the generalizability of findings relative to claims data from a single insurer. Second, the claims include rich detail on provider network status as well as allowed amounts paid, which facilitates research questions tied to price negotiations and out-of-network phenomena for professional and facility services. Lastly, the claims can be used to construct episodes of care and longitudinally track patients (via an encrypted patient identifier) to generate a comprehensive view of medical services received and the associated payer and consumer spending.
That said, there are some limitations to even these very extensive and detailed data. Researchers cannot identify which insurer has provided a given claim, so variation across insurers cannot be accounted for in analysis. Intuitively, the geographic representation of the claims will reflect the market presence and penetration of the contributing insurers, leaving some state and local markets underrepresented. Physician and hospital identities are encrypted, and relatedly, researchers are prohibited from identifying providers by name, which can be limiting for certain areas of research or specific research questions.
Recent publication examples are listed in references 6-9. In reference 6, we use these data to evaluate the prevalence of potential surprise out-of-network bills for patients treated at ambulatory surgery centers. One important contribution of this work is that we identified the frequency with which different types of health plans (e.g., self-insured vs. fully-insured) paid out-of-network claims in full by applying the network status, allowed amount, charge, and health plan attributes included in HCCI claims data. This enabled us both to improve on estimates of patients potentially liable for surprise bills and to evaluate the impact that these insurer payments for out-of-network care have on premiums.
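To make the idea of combining network status, allowed amount, charge, and plan attributes concrete, here is a minimal sketch. It is not the method used in reference 6: the `Claim` fields, the plan labels, and the rule that a claim counts as paid in full when the allowed amount is at least the billed charge (within a small tolerance) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record; field names are invented for this sketch."""
    in_network: bool
    charge: float          # provider's billed charge
    allowed_amount: float  # amount the plan allows for the service
    plan_funding: str      # e.g., "self-insured" or "fully-insured"


def paid_in_full(claim: Claim, tolerance: float = 0.01) -> bool:
    """Assumed rule: out-of-network and allowed amount covers the full charge."""
    return (not claim.in_network) and claim.allowed_amount >= claim.charge - tolerance


def share_paid_in_full(claims: Iterable[Claim]) -> Dict[str, float]:
    """Share of out-of-network claims paid in full, by plan funding type."""
    totals: Dict[str, int] = {}
    full: Dict[str, int] = {}
    for claim in claims:
        if claim.in_network:
            continue  # only out-of-network claims enter the denominator
        totals[claim.plan_funding] = totals.get(claim.plan_funding, 0) + 1
        if paid_in_full(claim):
            full[claim.plan_funding] = full.get(claim.plan_funding, 0) + 1
    return {plan: full.get(plan, 0) / n for plan, n in totals.items()}


if __name__ == "__main__":
    toy_claims = [
        Claim(False, 1000.0, 1000.0, "self-insured"),
        Claim(False, 1000.0, 400.0, "self-insured"),
        Claim(False, 800.0, 300.0, "fully-insured"),
        Claim(True, 500.0, 250.0, "fully-insured"),
    ]
    print(share_paid_in_full(toy_claims))
```

Out-of-network claims that are not paid in full are the ones that could leave a patient exposed to a balance bill, which is why the split by plan type matters for estimating potential surprise bills.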
Electronic Health Record Data (Hannah Neprash and Engy Ziedan)
As electronic health record (EHR) adoption has become nearly ubiquitous, information derived from these platforms offers health economists a rich new source of data. Some resources to date include data from athenahealth, Inc. (a nationwide health information technology company with >150,000 health care providers on a cloud-based EHR) and data from individual health systems (e.g., Fairview Health Systems, a large integrated delivery system in Minneapolis, MN). A recent resource for national EHR data on approximately 40 million individuals is HealthJump, a data interoperability platform that homogenizes patient records across EMR vendors. These data are currently provided for free through the COVID-19 Research Database, a pro-bono cross-industry collaborative.
A clear strength of many EHR data sources is the ability to observe all (recorded) clinical activities performed by health care providers, across multiple payer types. This facilitates research questions related to provider behavior and decision-making. Conversely, EHR data may be less appropriate for research questions requiring a complete record of care received by patients, since any care provided by providers not using the EHR (e.g., an ER visit to a hospital using a different EHR) will likely not be observed.
While EHR data include many elements frequently present in claims data (e.g., diagnoses, procedure codes, patient demographic information), considerably more detail is frequently available, including orders placed by clinicians for follow-up care (e.g., medications prescribed, lab and imaging studies, referrals), test results, some health status measures (e.g., blood pressure readings), and appointment scheduling detail (e.g., scheduled visit start time and duration). Additionally, EHR platforms frequently record time-stamped metadata, enabling time measurements, including how long providers spend with their patients and how much time is devoted to visit documentation. Many EHR data sources are also available in almost real-time, whereas claims databases require several months at least to be complete and used for research.
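As an illustration of how time-stamped metadata can be turned into time measurements, the sketch below computes face-to-face time and documentation time for a single visit. The field names (`exam_start`, `exam_end`, `note_opened`, `note_signed`) are hypothetical and do not correspond to any particular vendor's schema. Real audit-log data typically need more cleaning (overlapping sessions, idle time) than is shown here.

```python
from datetime import datetime
from typing import Dict


def minutes_between(start_iso: str, end_iso: str) -> float:
    """Elapsed minutes between two ISO 8601 timestamps."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    if end < start:
        raise ValueError("end timestamp precedes start timestamp")
    return (end - start).total_seconds() / 60


def visit_time_measures(visit: Dict[str, str]) -> Dict[str, float]:
    """Derive two simple durations from a visit's (hypothetical) timestamps."""
    return {
        "face_time_min": minutes_between(visit["exam_start"], visit["exam_end"]),
        "documentation_min": minutes_between(visit["note_opened"], visit["note_signed"]),
    }


if __name__ == "__main__":
    toy_visit = {
        "exam_start": "2020-03-02T09:00:00",
        "exam_end": "2020-03-02T09:18:30",
        "note_opened": "2020-03-02T17:05:00",
        "note_signed": "2020-03-02T17:12:00",
    }
    print(visit_time_measures(toy_visit))
```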
With the richness of EHR data comes concern about generalizability, since access to EHR data is typically obtained at the level of an individual vendor and/or health system. Attempts to work across multiple sources of EHR data will likely encounter non-standardized data output and formatting. Finally, the relative recency of EHR adoption means a fairly short time series of EHR data, with compositional changes over time.
Recent publication examples using EHR data are listed in references 10-16. Reference 16 is an example of how data from HealthJump were used to study the effects of the Covid-19 pandemic on non-Covid-19 care. The authors combined data on patients’ demographics (e.g., age, race/ethnicity, and 3-digit zip code), medication history, visits, laboratory exams, and procedures to study changes in healthcare utilization by demographics and comorbidities. A neat feature of these HealthJump data is that they are designed to reach beyond the health organization that originally collects and compiles the information. The information moves with the patient: if an outpatient facility is signed up to the HealthJump database but one of its patients is hospitalized elsewhere, that inpatient record is reflected in the data. Another positive feature is that the data are updated weekly, with only a one-week lag between the visit and reporting. However, a disadvantage of these data is the fairly short time series (the pro-bono panel data start in 2019).
Nursing Home Encounter Data (Martin Hackmann)
A segment of the health care sector that is often hard for health economists to study is Medicaid, especially when the research questions of interest pertain to provider behavior within the long-term care market. One prominent exception to this commonly encountered problem is the nursing home industry. Nursing homes have large Medicaid exposure due to the known features of long-term care (LTC) financing in the US, and they are often well-tracked in various detailed datasets. For example, LTCFocus provides data on nursing homes in the US by combining a variety of underlying data sources, including resident-level data on patients’ diagnoses, treatments, medications, and activities of daily living. The publicly available dataset is aggregated at the facility level for years 2000 to 2017, and it can be downloaded here.
There are also state surveys from California and Pennsylvania, which offer comparable information at the nursing-home-by-year level. An additional feature relevant for health economics work is detailed information on service prices (e.g., commercial, Medicaid, and Medicare reimbursement rates) and revenues; these attributes are not found within LTCFocus. California’s Long-Term Care Facility Financial data can be requested through the state’s Office of Statewide Health Planning and Development (OSHPD), and Pennsylvania’s nursing home reports are provided by Pennsylvania’s Department of Health.
Within these datasets, researchers can benefit from standardized information on important staffing, facility, and residents’ characteristics. Moreover, the provider identifier allows the researcher to merge this information with Medicaid and Medicare claims data. That said, the revenue information is only available in select states (as described above), and the data are aggregated.
Recent publication examples are listed in references 17 and 18. In reference 17, I combine Pennsylvania’s survey data with complementary data sources to quantify the effects of policies that either raise regulated Medicaid reimbursement rates or increase local competition via directed entry on the quality of care. In reference 18, we combine the survey data from LTCFocus, California, and Pennsylvania with complementary data sources to quantify the effect of patient and provider incentives on the length of nursing home stays.
Table 1: Download (or Inquiry) URLs for Selected Data Sources
| Data Sources | Download URL |
|---|---|
| Publicly Available Data | MEPS; NAMCS and NHAMCS; NHIS; NIS |
| Medicare Insured | CMS Research Identifiable Files |
| Commercially Insured Claims | Marketscan; Optum (direct contact here); HCCI |
| Electronic Health Records | Athenahealth; Fairview; HealthJump |
| Nursing Home Encounters | LTCFocus; CA Long-Term Care Facility Financials; Pennsylvania Nursing Home Reports |
References
1. Buchmueller, Thomas and Colleen Carey. 2018. “The Effect of Prescription Drug Monitoring Programs on Opioid Utilization in Medicare.” American Economic Journal: Economic Policy. 10(1).
2. Carey, Colleen, Ethan M.J. Lieber, and Sarah Miller. 2020. “Drug Firms’ Payments and Physicians’ Prescribing Behavior in Medicare Part D.” NBER Working Paper 26751.
3. Song, Zirui, Ji Yunan, Safran Dana, Chernew Michael E. Health Care Spending, Utilization, and Quality 8 Years into Global Payment. New England Journal of Medicine (2019);381:252-63.
4. Song, Zirui. The Pricing of Care Under Medicare for All: Implications and Policy Choices. JAMA (2019);322(5):395-7.
5. Gupta S, Nguyen TD, Freeman PR, Simon KI. Competitive Effects of Federal and State Opioid Restrictions: Evidence from the Controlled Substance Laws. National Bureau of Economic Research Working Paper 27520; 2020. DOI: 10.3386/w27520
6. Duffy, Erin, Loren Adler, Paul B. Ginsburg, and Erin Trish. Prevalence and Characteristics of Out-Of-Network Bills From Professionals In Ambulatory Surgery Centers. Health Affairs 39(5):
7. Baker, Laurence, M. Kate Bundorf, Aileen Devlin, and Daniel P. Kessler. Why Don’t Commercial Health Plans Use Prospective Payment? American Journal of Health Economics 2019;5(4):465-80.
8. Pelech, Daria and Tamar Hayford. Medicare Advantage and Commercial Prices for Mental Health Services. Health Affairs 2019;38(2):262-7.
9. Cooper, Zach, Stuart V Craig, Martin Gaynor, and John Van Reenen. The Price Ain’t Right? Hospital Prices and Health Spending on the Privately Insured. The Quarterly Journal of Economics 2018;134(1):51-107.
10. Smith, Laura Barrie, Ezra Golberstein, Kelly Anderson, Tori Christiaansen, Nicole Paterson, Sonja Short and Hannah T Neprash. The Association of EHR Drug Safety Alerts and Co-prescribing of Opioids and Benzodiazepines. Journal of General Internal Medicine 2019;34(8):1403-05.
11. Neprash, Hannah T and Michael L. Barnett. Association of Primary Care Clinic Appointment Time With Opioid Prescribing. JAMA Network Open 2019;2:e1910373-e
12. Neprash, Hannah T, Anna Zink, Joshua Gray, Katherine Hempstead. Physicians’ Participation In Medicaid Increased Only Slightly Following Expansion. Health Affairs 2018;37:1087-91.
13. Ganguli Ina, Bethany Sheridan, Joshua Gray, Michael Chernew, Meredith B Rosenthal, and Hannah T Neprash. Work Effort and the Physician Gender Pay Gap: Evidence from Primary Care. Forthcoming at the New England Journal of Medicine 2020.
14. Neprash, Hannah T, Laura Barrie Smith, Bethany Sheridan, Katherine Hempstead, Katy Kozhimannil. Practice Patterns of Physicians and Nurse Practitioners in Primary Care. Forthcoming at Medical Care 2020.
15. Neprash Hannah T, Smith LB, Sheridan B, Moscovice I, Prasad S, Kozhimannil K. Nurse Practitioner Autonomy and Complexity of Care in Rural Primary Care. Forthcoming at Medical Care Research & Review 2020.
16. Ziedan, Engy, Kosali Ilayperuma Simon, and Coady Wing. “Effects of state COVID-19 closure policy on non-COVID-19 health care utilization.” NBER Working Paper w27621 (2020).
17. Hackmann (2019) “Incentivizing Better Quality of Care: The Role of Medicaid and Competition in the Nursing Home Industry”, American Economic Review, May 2019, Vol. 109, No. 5: 1684-1716
18. Hackmann, Pohl (2018) “Patient vs. Provider Incentives in Long Term Care”. NBER Working Paper 25178
Alice Chen is an Associate Professor of Public Policy at the University of Southern California.
Michael Richards is an Associate Professor of Economics at Baylor University.
Kosali Simon is the Associate Vice Provost for Health Sciences and the Herman B Wells Endowed Professor of Public and Environmental Affairs at Indiana University, and a Co-Editor of the ASHEcon Newsletter.
Colleen Carey is an Assistant Professor of Policy Analysis and Management at Cornell University.
Ian McCarthy is an Associate Professor of Economics at Emory University.
Zirui Song is an Assistant Professor of Health Care Policy and an Assistant Professor of Medicine at Harvard University.
Sumedha Gupta is an Associate Professor of Economics at Indiana University.
Erin Trish is Associate Director of the USC Schaeffer Center and an Assistant Professor of Pharmaceutical and Health Economics at the University of Southern California.
Erin Duffy is a postdoctoral research fellow at the Leonard D. Schaeffer Center for Health Policy and Economics at the University of Southern California.
Hannah Neprash is an Assistant Professor of Health Policy and Management at the University of Minnesota.
Engy Ziedan is an Assistant Professor of Economics at Tulane University.
Martin Hackmann is an Assistant Professor of Economics at the University of California, Los Angeles.