Categories: News, Newsletter Issue 2024:3


Data Corner Episode 0: A Brief Overview

By Barton Willage

New health economics researchers quickly discover a vast landscape of data, both common sources and those that are less often used. Data are key to all of our research, but the topic often falls by the wayside during training as we focus on learning models and theory. With experience, new researchers can gain a deeper understanding of just how much data exists. Here I give a brief overview of types of data, suggestions on learning how to access data, and warnings about pitfalls I have experienced.

TYPES OF DATA

Data can be categorized into a few broad classes, some of which can be combined. The first distinction is between primary and secondary data. Primary data is data collected by you, the researcher, for your own research; this is quite rare but not unheard of in health economics. Examples include data from a survey you run yourself or from natural or field experiments. I personally have limited experience with this, but it is more common in development economics. Secondary data is data that already existed before you arrived at your research idea, and it is much more common in health economics.

Another dimension that has two broad categories is survey data and administrative data. Survey data is collected by research organizations and government agencies, and it is meant for research purposes. The most widely known example is the US Decennial Census. Conversely, administrative data is created “in the order of doing business”; the intention is not to use the data for research. Common examples include students’ grades, patients’ hospital records, and tax information. While survey data dominated 10 to 20 years ago, administrative data has grown increasingly common with digitization.

There are other types of data that are less commonly used in health economics research, and I will not go into much detail about them. One example is web-scraped data: the information exists online, but not in a format that is ready for analysis (see here for a brief overview).

HOW TO LEARN ABOUT DATA

For those new to quantitative health economics research, two excellent starting points are this Journal of Economic Perspectives article and this Primer by Sebastian Tello-Trillo. Note that the JEP article is celebrating its 24th birthday, but it is still full of information on data commonly used in research. The ASHEcon newsletter has also published a few articles on specific data sets, including Medicaid data, health care encounter data, medical provider data, hospital financial characteristics, a state health practice database, and provider market structure data.

In my opinion, there are a couple of very valuable resources that one should keep in mind when trying to learn about data. First: other researchers you interact with, such as advisors, other graduate students, and seminar speakers. Discuss your research ideas with them, especially when you’re struggling to find suitable data. You never know who might have already stumbled across what you need; an undergraduate student mentioned a data source that I did not know about and that I immediately incorporated into an ongoing research project. In this case, it was publicly available state-level Medicaid drug data that I am using in co-authored work about Hepatitis C.

Second: reading. Papers that are interesting to us might use data that we could use as well, with the major caveat that the data may not be accessible to everyone. Reading is most useful in the early stages of a research career, when you are likely doing a lot of it and building up your knowledge of data. Keep track of data sets that might be useful in the future and which features are of particular interest. If possible, keep track of which data have geocodes (such as state or county identifiers), because a lot of health economics research uses area-level policy variation. One instance of this for me was this paper that used organ transplant data, which again was useful for my Hepatitis C project.
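One lightweight way to keep track of data sets as you read is a small structured log. The sketch below is only an illustration: the `DataSource` class, its fields, and the `with_geocode` helper are hypothetical names, and the entries simply restate sources mentioned in this article. A spreadsheet would work just as well.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """One entry in a personal log of data sets encountered while reading."""
    name: str
    access: str           # e.g. "public", "restricted", "unknown"
    geocodes: tuple = ()  # geographic identifiers known to be available
    notes: str = ""


# Hypothetical log; entries restate what this article says about each source.
catalog = [
    DataSource("Current Population Survey (IPUMS)", "public", ("state",),
               "can be downloaded immediately"),
    DataSource("State-level Medicaid drug data", "public", ("state",),
               "suggested by an undergraduate student"),
    DataSource("Organ transplant data", "unknown", (),
               "seen in a paper; geocodes not yet checked"),
]


def with_geocode(sources, level):
    """Return the names of sources recorded as having the given geocode level."""
    return [s.name for s in sources if level in s.geocodes]


state_level = with_geocode(catalog, "state")
print(state_level)
```

Filtering on the geocode field makes it quick to see which logged sources could support a design based on state-level policy variation.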

THE SOURCES OF HASSLE

Fortunately, more and more data have become available; however, not all data are equally accessible. Some data are publicly available, with very few restrictions. For instance, the Current Population Survey (CPS) can be downloaded immediately, and even includes state geocodes. I would suggest using the IPUMS version, a good source of easier-to-use data. Usually, downloading these data requires users to agree to not identify individuals in the data, and sometimes there is a requirement to cite the data.

Some data are mostly public but require approval to use certain variables. This is commonly the case if some aspect of the data would increase the probability of being able to identify a respondent. For instance, recent waves of the National Vital Statistics System require an application and security plan for geocoded data, and Add Health has similar requirements for things like genetic data. Many of these datasets have low costs, both in terms of money and application time.

However, other data sources have high costs in terms of both time and money. A project of mine used the National Health Interview Survey (NHIS) linked to Social Security records. Many of the restricted datasets from the National Center for Health Statistics require thorough data applications, a non-negligible fee, and access to a Research Data Center. Some data can cost several thousand dollars, such as Nielsen data or HCUP. Check whether your university has an institutional license, and keep in mind that potential coauthors might also have access. One risk with restricted data is potential surprises once you do get your hands on the data; for instance, in my project using NHIS data, I initially assumed that many more individuals would be matched between the NHIS and Social Security records than actually were. Often, we must go through a great deal of hassle before we know what is actually feasible.
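When working with linked data, one of the first things worth computing once access is granted is the share of records that actually matched. The sketch below is a toy illustration only: the `match_rate` function and the identifiers are made up, and it does not reflect how the NHIS linkage itself is constructed.

```python
def match_rate(survey_ids, admin_ids):
    """Share of distinct survey respondents that also appear in the administrative file."""
    survey = set(survey_ids)
    if not survey:
        raise ValueError("survey_ids is empty")
    return len(survey & set(admin_ids)) / len(survey)


# Made-up identifiers purely for illustration.
survey_ids = ["a01", "a02", "a03", "a04", "a05"]
admin_ids = ["a02", "a05", "b17"]

rate = match_rate(survey_ids, admin_ids)
print(f"{rate:.0%} of survey respondents matched")
```

If the data provider publishes linkage or match-rate documentation, checking it before applying can give a rough sense of the usable sample size.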

Some data might already be used in research by others but not be broadly available. For instance, many countries have systems for using government data in research, but access is limited to people in that country or subject to other restrictions. Or the data might be available mostly to people who work for a specific government agency; it appears much easier for Office of Tax Analysis staff to use tax records than for non-government researchers. Similarly, researchers at banks or other institutions might have exclusive access to that organization’s data.

Relatedly, I want to mention data that might not be set up for research purposes…yet. For instance, a state agency or your university likely has a great deal of data that no one has used for academic research. I personally have had mixed success requesting data from these sources. For instance, in this project, we successfully gained access to data on how often students use the gym, but we were not able to get data on students’ academic performance. It never hurts to ask, but do not count your chickens until everything is signed and the data is in hand.

If you think a data source might be useful, do not wait to inquire about it. Some data will come within a month, but other data can take many months or years. Consult with other researchers who have used that data before, contact the data provider, and work with your university research office. The last point is particularly important, as individual researchers typically cannot sign the paperwork to access data; someone from the university’s upper administration must do so. The workers in the research office appreciate being consulted early in the process, as they are excited to help with your research but are very busy.