Today is Open Data Day. Open Data enthusiasts, activists, developers, hackers, scientists, and other entrepreneurial data geeks are gathering around the world to demand more open data, work in hackathons or code-a-thons, and engage in data discussions. Just google "open data day events" to see the scope. It's very encouraging.
The benefits of releasing open data are manifold. Take as an example open government data: It increases trust in the government by providing more transparency and accountability. It helps improve public services. It can stimulate economic activity and generate jobs. It helps governments improve the use of their own data. It helps increase the exchange of information among different departments and ministries (which are often siloed) and improve collaboration. And as an additional perk, open data will also lead to savings by reducing work on specific data requests.
There are plenty of examples where this works really well. The release of weather data has led to great weather apps, insurance products, and other services; GPS data is used in fabulous apps and services on almost every mobile device now; and public traffic data are used to make commuting and traveling easier. The situation for health data is slightly different.
Releasing health data as open data requires the consent of the subjects and careful attention to privacy. There are specific regulations aimed at data collected in the delivery of health care (e.g. HIPAA in the US), and research involving human subjects is overseen by Institutional Review Boards (IRBs). At the same time, the stakes in health are higher than in many other fields. The sharing of health data enables data users to provide evidence for policy and decision making, track performance, evaluate and improve quality, identify effective interventions, optimize healthcare pathways, and improve the health of individuals and populations. In short, sharing health data saves lives.
There are four different degrees of openness for health data sharing. Given the potential impact, the goal for every organization holding health data should be to publish as much data as possible at the highest level of detail possible, while protecting subjects and complying with regulations.
Tier 1: Open Indicators
Aggregated or tabulated data should always be shared as open data through as many channels as possible, including organizations' websites, data visualization sites, data aggregators, and open data portals. Examples can be found on healthdata.gov or the health section of data.gov.uk, as well as at WHO, World Bank, and the recent data release of the Global Burden of Disease (GBD 2010 regional results; my employer, IHME, is the coordinating organization of the study).
Tier 2: Open Microdata
Detailed or micro-data at the respondent or individual level can often be carefully de-identified and shared as open data. Sample surveys, mortality data, and even hospital discharge data are often shared openly, e.g. CDC's Reproductive Health Survey series on IHME's Global Health Data Exchange (yes, that's the platform that I manage) or the public use datasets for US mortality from NCHS. If access to funding or other considerations require registration, it should be fast (ideally instantaneous) and free, as is the case for microdata for the Demographic & Health Surveys from MeasureDHS.
Tier 3: Data Use Agreements
When data cannot be shared without restrictions, there should be a clearly defined process for data users to request access to more detailed or partially identified data (if consent from individuals to share the data was obtained). These processes need to balance the proposed purpose of using the data with the risk of identification of individuals, and provide proper oversight and safeguards that protect subjects' privacy. For example, US mortality data with county identifiers are only available under a Data Use Agreement.
Tier 4: Fully controlled data access
If data are too sensitive to hand out at all, data owners can offer options to access and analyze data on their own premises, and allow data users to only take the results of their analyses with them. The US Census Bureau operates Census Research Data Centers (RDCs), where researchers can access the full detail of data on controlled premises; no microdata can be taken out and research results are carefully vetted before being released to the researchers. Short of implementing full-fledged programs, data owners can also collaborate with researchers to provide this kind of access.
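For readers who think in code, the four tiers can be read as a simple decision rule: pick the most open tier a dataset qualifies for. The following is only a rough sketch of that idea, not a policy tool. The `Dataset` fields, the `Tier` names, and the `most_open_tier` function are all hypothetical, and the three yes/no questions are my simplification of what in practice is a judgment made together with legal and ethics review.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """The four degrees of openness described above (names are illustrative)."""
    OPEN_INDICATORS = 1      # aggregated or tabulated data, shared openly
    OPEN_MICRODATA = 2       # de-identified individual-level data, shared openly
    DATA_USE_AGREEMENT = 3   # detailed data released on request, under agreement
    CONTROLLED_ACCESS = 4    # data analyzed only on the owner's premises


@dataclass
class Dataset:
    """Hypothetical description of a dataset; the fields are assumptions."""
    name: str
    is_aggregated: bool        # only tabulated results, no individual records
    can_be_deidentified: bool  # individual records can be safely de-identified
    consent_to_share: bool     # subjects consented to sharing with data users


def most_open_tier(ds: Dataset) -> Tier:
    """Return the most open tier the dataset qualifies for (simplified rule)."""
    if ds.is_aggregated:
        return Tier.OPEN_INDICATORS
    if ds.can_be_deidentified:
        return Tier.OPEN_MICRODATA
    if ds.consent_to_share:
        return Tier.DATA_USE_AGREEMENT
    # Too sensitive to hand out at all: offer on-premises analysis instead.
    return Tier.CONTROLLED_ACCESS


if __name__ == "__main__":
    examples = [
        Dataset("regional indicator tables", True, False, False),
        Dataset("de-identified survey records", False, True, True),
        Dataset("records with county identifiers", False, False, True),
        Dataset("fully identified records", False, False, False),
    ]
    for ds in examples:
        print(f"{ds.name}: Tier {int(most_open_tier(ds))}")
```

The order of the checks is the point: a data owner starts at the most open tier and only moves down when a real constraint forces it.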
Last but not least, sharing information about data collected is a minimum requirement. Over the past few years, my team at the Institute for Health Metrics and Evaluation (IHME) has cataloged and published information for over 8,000 health-related datasets in the GHDx, and we are adding more daily. We are cataloging data from 200 countries around the world, and it is often incredibly hard to even identify what data have been collected and whom to contact for access. Websites are in different languages and structures, are constantly in flux, and can be down for periods of time; data available one day may be gone the next. Data and information about them are often only available in reports, statistical yearbooks, or published literature. Data owners should make an effort to add information about their data, and at least aggregated results, to open data platforms and catalogs to make them easier to find, and subsequently try to release as much data as possible in each of the four tiers.
Happy Open Data Day!