Visualizing Global Burden of Disease: behind the scenes

on Mon, 03/04/2013 - 16:50

Today, the Institute for Health Metrics and Evaluation (IHME, my employer) is launching 8 new interactive data visualizations that bring to life the results of the 5-year Global Burden of Disease (GBD) study at the country level. The GBD study compiled all available data on health outcomes for 187 countries in the world for 1990 and 2010, and provides estimates for the burden caused by different diseases and risk factors that are comparable across countries and over time. Regional results were published in a dedicated triple issue of the Lancet in December 2012 (see my related post here). As manager of the Data Team at IHME, I have been lucky enough to support the project over the past 4 years by finding and managing data, as well as by overseeing the creation of these visualizations.

The data visualizations play a key role in the GBD project for several reasons. It started with IHME’s need to review the results of GBD. Tables and static graphs just don’t provide the flexibility to properly assess results and identify patterns and trends.

GBD uses four key metrics: number of deaths, years of life lost (YLL), years lived with disability (YLD), and disability-adjusted life years (DALY). The results datasets are massive, broken down by several dimensions:

  • 291 causes of disease and injuries at the most granular end of a 5-level cause hierarchy
  • 66 risk factors
  • 1100 cause-risk factor attributions (i.e. burden caused by a given risk factor via a particular disease or injury)
  • 187 countries, 21 GBD regions, global
  • 27 age groups: early neonatal, late neonatal, post neonatal, 1-4 years, 5-9, 10-14 and so on until 75-79, 80+, as well as under 5, 5-14, 15-49, 50-69, 70+, all ages, and age-standardized
  • Male, female, both
  • 3 years: 1990, 2005, 2010
  • Estimates expressed as total number, rate, and %, as well as ranked by country
  • 95% uncertainty intervals: lower bound, mean, and upper bound (not strictly a dimension, but adds to the size of the database)

In total, about 1 billion (!) results were calculated for the project, and then there are aggregations by cause, age, and geography. A nightmare to review, but a gold mine for visualizations. The results datasets are fully imputed across all dimensions, i.e. there are no gaps in the datasets. And the consistent use of methods ensures comparability of results across all dimensions.
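To get a feel for the size, here is a back-of-envelope sketch in JavaScript. The per-dimension counts are taken from the list above, and this covers cause results only; risk-factor and cause-risk attribution results add several hundred million more cells, which is how the total reaches roughly a billion:

```javascript
// Back-of-envelope count of the GBD cause-results space.
const dims = {
  causes: 291,                    // most granular level of the cause hierarchy
  geographies: 187 + 21 + 1,      // countries + GBD regions + global
  ageGroups: 27,
  sexes: 3,                       // male, female, both
  years: 3,                       // 1990, 2005, 2010
  metrics: 4,                     // deaths, YLL, YLD, DALY
  units: 3,                       // number, rate, percent
  bounds: 3                       // lower bound, mean, upper bound
};

const totalResults = Object.values(dims).reduce((a, b) => a * b, 1);
console.log(totalResults.toLocaleString('en-US')); // roughly half a billion cells
```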

Initially, we tried off-the-shelf visualization tools, but they didn’t give us the flexibility to dive into all the dimensions and properly explore patterns and trends in the data. Then we discovered D3.js (Data-Driven Documents). D3 is a JavaScript library for manipulating documents based on data; it allows developers to build powerful visualizations very efficiently (but you be the judge of how powerful our resulting visualizations really are). And we did what was recommended in a blog post today: iterate early, iterate often.
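D3's core idea is to bind an array of data to document elements and compute which elements need to be created, updated, or removed. Stripped of the DOM, the pattern looks roughly like this (the cause names and numbers are purely illustrative, not GBD results):

```javascript
// A plain-JavaScript sketch of D3's data join: given the keys of elements
// currently on screen and a new data array, compute which elements to
// create ("enter"), refresh ("update"), or remove ("exit").
function dataJoin(existingKeys, data, key) {
  const dataKeys = new Set(data.map(key));
  const existing = new Set(existingKeys);
  return {
    enter: data.filter(d => !existing.has(key(d))),   // new data, no element yet
    update: data.filter(d => existing.has(key(d))),   // element exists, refresh it
    exit: existingKeys.filter(k => !dataKeys.has(k))  // element exists, data gone
  };
}

const join = dataJoin(
  ['malaria', 'hiv'],                                  // elements already drawn
  [{ cause: 'hiv', dalys: 81.5 }, { cause: 'stroke', dalys: 102.2 }],
  d => d.cause
);
// join.enter → [{cause:'stroke',…}], join.update → [{cause:'hiv',…}], join.exit → ['malaria']
```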

We improved the tools as we reviewed our results, then started using the tools to show the results to collaborators and country experts to obtain feedback, review our estimates, and discuss what data were used for analysis (and what data may be available to further inform and improve the estimates). Realizing how powerful these tools are for different audiences to explore the results of GBD, we decided to make them publicly available. In December 2012, we launched 5 visualization tools with the regional results of GBD (available here) with the publication of the GBD papers in The Lancet.

Updates for these tools are now available with country-level results. In addition, we created three new tools that allow users to review and explore the data from completely new angles. Here is a quick overview of the country-level visualizations:

  • GBD Compare is a powerful platform that visualizes the data in treemaps, maps, time plots, age plots and stacked bar charts. The most powerful feature is the 2-panel view that allows users to review any two of these charts simultaneously to compare and review trends across causes, risks, countries, ages etc. The panels are interactive, e.g. the map can be used to select countries in the other panel and quickly explore countries around the world. It’s a powerful tool, but requires a bit of commitment to make use of all the features. My video tutorial for GBD Compare can be found here.
  • GBD Cause Patterns provides results for 21 cause groups in stacked column charts. It allows quick exploration of trends across geographies, ages, gender and time (see options at the bottom of the screen).
  • GBD Arrow Diagram shows very concisely the rank of causes and risks for a given country or region in 1990 and 2010, along with the related growth trend. The connecting arrows quickly show how fast causes and risks have grown or decreased between 1990 and 2010. A version of the GBD Arrow Diagram is embedded below.
  • GBD Heatmap ranks causes and risks by burden within a country, but then allows comparisons of those ranks across countries and/or regions (you can compare the ranks within a country with the ranks for a given region or the world).
  • GBD Uncertainty Visualization allows users to compare uncertainty bounds across causes and risks for all dimensions. Countries or causes/risks where the data were more sparse or inconsistent will have wide uncertainty intervals.
  • HALE/LE Visualizations show the relationship between total life expectancy and healthy life expectancy, i.e. the number of years people can expect to spend in good health over their lifetime.
  • Mortality Visualization provides an interesting addition to the results: users can look at all-cause mortality estimates and uncertainty bounds in the context of the underlying input data points. The hovers provide detailed metadata about the source of the data point.
  • COD Visualization shows the input data points for cause of death data by country, cause, and sex, also with detailed metadata.
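The logic behind the arrow diagram is simple to sketch: rank causes by burden in each year, then pair up the ranks to see what moved. The causes and numbers below are invented for illustration:

```javascript
// Rank causes by burden within each year, then compute the rank change
// that an arrow-diagram line represents.
function rankByBurden(burdens) {           // burdens: {cause: value}
  return Object.entries(burdens)
    .sort((a, b) => b[1] - a[1])           // largest burden first
    .map(([cause], i) => [cause, i + 1]);  // 1-based rank
}

function rankChanges(y1990, y2010) {
  const r1990 = new Map(rankByBurden(y1990));
  const r2010 = new Map(rankByBurden(y2010));
  return [...r2010].map(([cause, rank]) => ({
    cause, rank2010: rank, rank1990: r1990.get(cause),
    moved: r1990.get(cause) - rank         // positive = climbed the ranking
  }));
}

// Illustrative numbers only:
const changes = rankChanges(
  { 'lower respiratory infections': 120, 'ischemic heart disease': 100, 'stroke': 90 },
  { 'lower respiratory infections': 95, 'ischemic heart disease': 130, 'stroke': 110 }
);
```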

All visualizations also feature “share” functionality that creates a unique URL for the chosen settings that can be shared via email, Twitter, Facebook or other social media. This should be useful to bring up the tools in online conversations about the health situation in different countries, disease patterns and international comparisons.
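The idea behind the "share" functionality can be sketched in a few lines of JavaScript: serialize the chart settings into a query string, and parse them back out when the shared link is opened. The parameter names and URL here are invented for illustration, not the tools' actual URL scheme:

```javascript
// Serialize chart settings into a shareable URL, and restore them from one.
function settingsToUrl(base, settings) {
  const params = new URLSearchParams(settings);
  return `${base}?${params.toString()}`;
}

function urlToSettings(url) {
  return Object.fromEntries(new URL(url).searchParams);
}

const url = settingsToUrl('https://example.org/gbd-compare',
  { country: 'KEN', year: '2010', metric: 'DALY', sex: 'both' });
// url → 'https://example.org/gbd-compare?country=KEN&year=2010&metric=DALY&sex=both'
```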

These tools will be used extensively in policy and country consultations, and many of these conversations will be conducted in locations that have less than reliable internet connections. To facilitate use, we created offline versions of these tools as well. The sheer size of the data provided a substantial challenge, but the tools are now performing well offline.

If you are interested in building additional visualizations with the GBD results, you should start with the regional results of GBD, all available for download on the GHDx here. The country-level results will be made available via the GHDx in September 2013.

I would love to get your feedback on your experience with using the visualizations. Are they intuitive? Are there features that you like or don’t like? Are there things you would like to see or do with the data that aren’t possible yet? Leave suggestions in the comments, and I will make sure to include them in our discussions for future development.


Example: GBD Arrow Diagram

Open data and the four tiers of health data sharing

on Sat, 02/23/2013 - 06:44

Today is Open Data Day. Open Data enthusiasts, activists, developers, hackers, scientists, and other entrepreneurial data geeks are gathering around the world to demand more open data, work in hackathons or code-a-thons, and engage in data discussions. Just google "open data day events" to see the scope. It's very encouraging.

The benefits of releasing open data are manifold. Take as an example open government data: It increases trust in the government by providing more transparency and accountability. It helps improve public services. It can stimulate economic activity and generate jobs. It helps governments improve the use of their own data. It helps increase the exchange of information among different departments and ministries (which are often siloed) and improve collaboration. And as an additional perk, open data will also lead to savings by reducing work on specific data requests.

There are plenty of examples where this works really well. The release of weather data has led to great weather apps, insurance products, and other services; GPS data is used in fabulous apps and services on almost any mobile device now; public traffic data are used to make commuting and traveling easier; and so on. The situation for health data is slightly different.

Releasing health data as open data requires consent of the subject and privacy considerations, and there are specific regulations aimed at data collected in the delivery of health care (e.g. HIPAA in the US) and oversight by Institutional Review Boards (IRB). At the same time, the stakes in health are higher than in many other fields. The sharing of health data enables data users to provide evidence for policy and decision making, track performance, evaluate and improve quality, identify effective interventions, optimize healthcare pathways, and improve health of individuals and populations. In short, sharing health data saves lives.

There are four different degrees of openness for health data sharing. Given the potential impact, the goal for every organization holding health data should be to publish as much data as possible at the highest level of detail possible, while protecting subjects and complying with regulations.

Tier 1: Open Indicators

Aggregated or tabulated data should always be shared as open data through as many channels as possible, including organizations' websites, data visualization sites, data aggregators, and open data portals. Examples can be found at WHO and the World Bank, as well as in the recent data release of the Global Burden of Disease (GBD 2010 regional results; my employer, IHME, is the coordinating organization of the study).

Tier 2: Open Microdata

Detailed or micro-data at the respondent or individual level can often be carefully de-identified and shared as open data. Sample surveys, mortality data, and even hospital discharge data are often shared openly, e.g. CDC's Reproductive Health Survey series on IHME's Global Health Data Exchange (yes, that's the platform that I manage) or the public use datasets for US mortality from NCHS. If funding or other considerations require registration, it should be fast (ideally instantaneous) and free, as is the case for the microdata of the Demographic & Health Surveys from MeasureDHS.
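To illustrate what de-identification means in the simplest case, here is a sketch that drops direct identifiers and coarsens age into 5-year bands. Real de-identification (e.g. under HIPAA's Safe Harbor provision) involves many more rules; the record fields below are invented:

```javascript
// Minimal de-identification sketch: drop direct identifiers entirely and
// coarsen a quasi-identifier (age) into a 5-year band before release.
function deidentify(record) {
  const { name, address, phone, age, ...rest } = record; // strip direct identifiers
  const lo = Math.floor(age / 5) * 5;                    // coarsen age
  return { ...rest, ageGroup: `${lo}-${lo + 4}` };
}

const released = deidentify({
  name: 'Jane Doe', address: '1 Main St', phone: '555-0100',
  age: 37, sex: 'female', diagnosis: 'J10'  // an ICD-10 diagnosis code
});
// released → { sex: 'female', diagnosis: 'J10', ageGroup: '35-39' }
```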

Tier 3: Data Use Agreements

When data cannot be shared without restrictions, there should be a clearly defined process for data users to request access to more detailed or partially identified data (if consent from individuals to share the data was obtained). These processes need to balance the proposed purpose of using the data with the risk of identification of individuals, and provide proper oversight and safeguards that protect subjects' privacy. US mortality data with county identifiers are only available under Data Use Agreement.

Tier 4: Fully controlled data access

If data are too sensitive to hand out at all, data owners can offer options to access and analyze data on their own premises, and allow data users to only take the results of their analyses with them. The US Census Bureau operates Census Research Data Centers (RDCs), where researchers can access the full detail of data on controlled premises; no microdata can be taken out and research results are carefully vetted before being released to the researchers. Short of implementing full-fledged programs, data owners can also collaborate with researchers to provide this kind of access.

Last but not least, sharing information about collected data is a minimum requirement. Over the past few years, my team at the Institute for Health Metrics and Evaluation (IHME) has cataloged and published information for over 8000 health-related datasets in the GHDx, and we are adding more daily. We are cataloguing data from 200 countries around the world, and it is often incredibly hard to even identify what data have been collected and whom to contact for access. Websites are in different languages, use different structures, are constantly in flux, and can be down for periods of time; data available one day may be gone the next. Data, and information about them, are often only available in reports, statistical yearbooks, or published literature. Data owners should make an effort to add information about their data, and at least aggregated results, to open data platforms and catalogues to make them easier to find. And subsequently try to release as much data as possible in each of the four tiers.

Happy Open Data Day!


Launch of the Global Burden of Disease Study 2010 results

on Fri, 12/21/2012 - 05:23

On Thursday, 12/13/2012, The Lancet published seven papers with the results of the Global Burden of Diseases, Injuries and Risk Factors Study 2010. The epic, 5-year study involved hundreds of collaborators to compile and analyze all available data on health outcomes globally. My role in the project focused on finding and obtaining input data, managing data at IHME, and creating visualizations. This is the first in a series of blog posts in which I’ll discuss the sources of data used in the different components of the study, the availability of health outcomes data in general, and the metrics that were generated, and share some stories from the trenches. This post provides an introduction to the study. Follow me on Twitter or subscribe to my RSS feed to find out about future installments on mortality, causes of death, non-fatal health outcomes, and covariates.

The Global Burden of Diseases, Injuries and Risk factors Study 2010 (GBD 2010) is arguably the most comprehensive assessment on human health ever conducted. Richard Horton, Editor of The Lancet, and Peter Piot, Director of the London School of Hygiene and Tropical Medicine, compared the GBD 2010 to the Human Genome Project in terms of scope and importance. The results were published by The Lancet in seven papers that took up an entire triple issue of the journal. It's the first time in the 189-year history of the Lancet that an entire issue was dedicated to one study (the Lancet was founded in 1823). The results were officially presented at a launch event at the Royal Society in London last week.

GBD 2010 was coordinated by the Institute for Health Metrics and Evaluation (IHME) – my employer – in collaboration with 6 other organizations: the University of Queensland, Harvard School of Public Health, Johns Hopkins Bloomberg School of Public Health, the University of Tokyo, Imperial College London, and the World Health Organization (WHO). Professors Christopher Murray, director of IHME, and Alan Lopez, head of the School of Population Health at the University of Queensland, developed the approach and methodology for global burden of disease analysis in the 1990s, and oversaw this iteration with a complete revision of all steps of the analytic process.

GBD 2010 mapped all known diseases and injuries to 291 causes that were then analyzed for the burden they caused through fatal and non-fatal outcomes. This required compiling and analyzing all available published and unpublished data and evidence on health outcomes (notice that available is in italics? More on that later). Data sources include censuses, surveys, vital statistics, disease registries, hospital records, and many more. Especially for non-fatal outcomes and risk factors, systematic literature reviews were a key source of data. Hundreds of researchers provided data and expertise, and the seven published papers included 486 authors from over 50 countries. The analysis encompassed 18 different components that are highly interconnected (see the overview paper for details).

Compiling these data was a monumental task, but analyzing the overall global burden provided a key advantage compared to studies that focus on one or a few diseases or injuries. There are 235 causes that can lead to death, and in GBD 2010, deaths from these causes always sum to all-cause mortality in each age-sex-region group, i.e. every death is counted only once or, in scientific terms, all-cause mortality estimates constrain the cause-specific estimates. Studies that estimate mortality for only one or a few causes don’t have this constraint and will often report higher numbers of deaths.
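The constraint can be sketched as a simple proportional rescaling: if cause-specific estimates sum to more than the all-cause envelope, scale them down until they fit. GBD's actual correction works on thousands of draws and is considerably more involved; the causes and numbers below are invented for illustration:

```javascript
// Rescale cause-specific deaths so they sum exactly to all-cause mortality
// for one age-sex-region group.
function rescaleToEnvelope(causeDeaths, allCauseDeaths) {
  const total = Object.values(causeDeaths).reduce((a, b) => a + b, 0);
  const scale = allCauseDeaths / total;
  return Object.fromEntries(
    Object.entries(causeDeaths).map(([cause, d]) => [cause, d * scale])
  );
}

// Single-cause studies claim 600 + 500 + 400 = 1500 deaths, but only 1200
// deaths occurred in this group, so every cause is scaled by 0.8:
const corrected = rescaleToEnvelope(
  { malaria: 600, 'lower respiratory infections': 500, diarrhea: 400 }, 1200);
```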

A key challenge for all parts of this study was the availability of input data (this is the main reason for me to write this series). For many developed countries, data to estimate mortality and non-fatal health outcomes by cause and risk factor are readily available. However, in many of the 187 countries that make up the 21 GBD regions, these data are not being collected, are incomplete, of poor quality, insufficiently documented, only available on paper, or stuck on obsolete storage media. In addition, data are often simply not shared, for political or other reasons, even for this kind of research that provides a global public good. A fundamental take-away from this study is that we need to improve the collection and distribution of health-related data, in developing but also in developed countries. For mortality data, this means improving civil registration and vital statistics systems to make sure that we track every death and its cause everywhere in the world. Survey, census, and other health-related data are just as important for governments to share. With regard to understanding non-fatal health outcomes, access to health records, disease registries, and other detailed health data is essential, of course with proper attention to privacy, confidentiality, and consent of the individuals. All of these obstacles can be overcome, but that requires commitment and political will from the data owners.

GBD 2010 compiled all data on health outcomes available to us. For the first time ever, we now have estimates on mortality and non-fatal health outcomes by cause and risk factor that were developed with a consistent methodology for several points in time (1990, 2005, and 2010). To provide information about the availability and consistency of input data, 95% uncertainty intervals were calculated at each step, propagated throughout the analytic process, and are available for all results. GBD 2010 provides estimates even for causes, risk factors, regions, or age groups where data was limited or no data was available; these were imputed using different statistical methods and covariates like GDP, education, and many others to inform the estimates. We believe that estimates based on limited data are better for policy and decision making than no evidence at all. The result is a gigantic database with structured results by age and sex that are comparable across geographies and time, all publicly available. The data will allow global health practitioners, policy makers, donors, media, researchers and others to explore patterns and trends in health over time.
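To illustrate how a 95% uncertainty interval is read off a set of draws, here is a minimal sketch: sort the draws and take the 2.5th and 97.5th percentiles. The draws below are fake, and GBD's machinery for generating and propagating draws through the analytic chain is far more sophisticated:

```javascript
// Read a 95% uncertainty interval off a set of draws: sort them and take
// the 2.5th and 97.5th percentiles as lower and upper bounds.
function uncertaintyInterval(draws, level = 0.95) {
  const sorted = [...draws].sort((a, b) => a - b);
  const lo = sorted[Math.floor(((1 - level) / 2) * sorted.length)];
  const hi = sorted[Math.ceil((1 - (1 - level) / 2) * sorted.length) - 1];
  const mean = draws.reduce((a, b) => a + b, 0) / draws.length;
  return { lower: lo, mean, upper: hi };
}

// 1000 fake draws spread evenly from 1 to 1000:
const draws = Array.from({ length: 1000 }, (_, i) => i + 1);
const ui = uncertaintyInterval(draws);
// ui → { lower: 26, mean: 500.5, upper: 975 }
```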

Here are your key resources to explore further:

Global Development Data Jam at the White House: 10 take-aways

on Tue, 12/11/2012 - 13:41

Yesterday, I was honored to participate in the Global Development Data Jam at the Eisenhower Executive Office Building at the White House. It was a great crowd of sharp, motivated data geeks with a passion for development. A series of insightful and inspiring presentations on data, open data, data collection, and more was followed by working sessions to come up with concrete project ideas that can be implemented over the next 90 days. Here are my 10 take-aways for data for development, in order of appearance during the day:
  1. We need to focus on the big opportunities. Todd Park kicked the day off with his usual display of boundless energy and can-do attitude, posing a fundamental question: "What is the next GPS of development?" What are the vital datasets that should be broadly available to enable innovative solutions (examples for GPS include mapping, directions, location-enabled services, etc.)? Great question. A list of suggestions will follow in a separate blog post.
  2. Data are essential infrastructure for development. Making data broadly available will speed up an evidence-based process of planning, implementing, measuring, and adjusting.
  3. Engaging the crowd to clean or digitize datasets, map infrastructure, and do other related tasks can be very successful, and all the tools needed are available. Examples: USAID cleaned 10,000 records in 16 hours with 300 volunteers at 85% accuracy; Ushahidi's SwiftRiver enables users to let the crowd filter and verify data and organize and present the results.
  4. Existing (and free) social media and mobile phone usage data can be mined for early detection, real-time feedback (disaster assessment), and  prediction of trends (flu trends, food prices). UN Global Pulse's Robert Kirkpatrick showed a number of great examples, including many from developing countries. Did you know that there are 100 million mobile users in Nigeria, 100,000 new Facebook users per month in Senegal, Jakarta is one of the world's "tweetiest" cities, and that 24% of residents in Mogadishu check into Facebook at least once a month? Me neither.
  5. Where these data don't suffice, companies like Jana and Mobile Accord can help roll out short mobile phone surveys in any country in a matter of days.
  6. More organizations and platforms are providing comprehensive access to their data, including UNDP (last month), the Millennium Challenge Corporation, and others.
  7. Open data can be done anywhere. Literally: Development Seed's Eric Gundersen featured an open data platform for Election Data in Afghanistan.  
  8. Development funding needs more coordination. That's not exactly a new insight, but a problem we can tackle from two sides: AidData and partners geocoded all 550 current development projects in Malawi, with a combined volume of $5.6bn. The World Bank is mapping and sharing data for their project portfolio (Mapping for Results). More countries and donor agencies should do the same.
  9. Without data scientists, you can "share data until the cows come home" without results. True words from DataKind founder Jake Porway. It's a key issue in development that is only partly mitigated by organizations like DataKind. We need more training in statistics for civil society groups, journalists, and others.
  10. In the White House, even paper cups and napkins feature the Seal of the President of the United States.

USAID Administrator Rajiv Shah summed the topic up nicely: The single biggest thing we can do to eradicate poverty? Open data! Data are turning into essential infrastructure for development. And events like the Global Development Data Jam help connect people, organizations, and fields within the development arena. Thanks to the White House Office of Public Engagement, the Office of Science and Technology Policy, and the U.S. Agency for International Development (USAID) for hosting a very inspiring event.

Things to watch at Strata Rx: 5 underlying challenges for sharing health data

on Tue, 10/16/2012 - 06:37

This week brings us the first Strata Rx conference, which explores the role of data and data science in health care. Very timely, because health care is at a crossroads. In many more developed countries, rising costs combined with stagnating outcomes and aging populations make health care systems unsustainable. In less developed countries, a dual or triple disease burden and stagnating development assistance for health hamper progress. Tim O'Reilly said in a recent conversation on health care (worth watching!) that "change happens when the pain of not changing is greater than the pain of changing". Health care is there, ready to be disrupted, and data is key to driving that disruption. It's one of our biggest challenges of the 21st century.

Changes in technology have revolutionized the possibilities for collecting and analyzing health and health-related data (sorry about the buzzword bingo): patient data are captured in electronic health records, smartphones capture and transmit volumes of personal data, social media capture health self-assessments, wearable sensors enable uninterrupted data collection and transmission, genome sequencing is now almost affordable, and cloud computing, open source software, machine learning, and big data management enable sophisticated analysis of all these data. With all these opportunities, leveraging health data to fix health care is not only one of the biggest, but also one of the coolest challenges of the 21st century.

However, there are 5 underlying challenges for leveraging data to fix health care, which center on transparency and accessibility.

  1. Privacy: sharing data about individuals requires protecting their privacy. However, there is a trade-off: the more identifying detail is removed from a dataset, the less useful it becomes for analysis. In addition, linking data from different sources enables much more powerful analysis but also increases privacy risks. When sharing useful health data, there always remains a (often very low) risk of identification. Therefore, we need strong de-identification techniques as well as powerful legal deterrents against using data to identify individuals. And we need to build individuals' trust that their data are handled responsibly.
  2. Consent: individuals need to agree that their data are being shared with others. They should be able to decide exactly what their data can be used for, and be able to remove that consent if they wish. Currently, there is limited transparency and very little control for patients over how their data are shared.
  3. Data Use Agreements: fully de-identified data (i.e. data with a very low risk of identification of individuals) should be shared as open data. Data with identifiers can be shared as limited use data for appropriate uses and with data use agreements. However, there are currently no standards around these kinds of agreements and their stipulations, making it often difficult to negotiate and implement them.
  4. Research ethics: research that involves collecting data from individuals or using data with direct identifiers often requires ethics oversight, e.g. by Institutional Review Boards. Regulations like the United States' HIPAA detail what can be shared and how. While this oversight is necessary, it often hampers progress by being too strict and difficult to implement. Regulations and their interpretations need to keep pace with the current rapid developments in data collection and analysis, the globalization of research, and individuals' attitudes towards data sharing, e.g. in social media.
  5. Incentives for sharing: there are powerful arguments for sharing. Open data can create entire ecosystems. Sharing unlocks external creativity and analysis, and most of the world's smartest people don't work for you. Most importantly, sharing and using health data can save lives, so sharing data becomes a moral imperative. However, many reasons beyond privacy and consent keep data owners from sharing data: competition, fear of misuse, reluctance to share the power of information, political agendas, academic publication plans, etc. The fragmentation of  health systems compounds the number of different players that have a plethora of different motivations for not sharing health data. We need better incentives and frameworks to encourage and facilitate data sharing. Patients can take a lead role here by sharing their own data and requesting providers and others to share their data responsibly.
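One common way to reason about the re-identification risk mentioned above is k-anonymity: a dataset is k-anonymous if every combination of quasi-identifier values is shared by at least k records. Here is a toy check; it is one risk measure among many, not a regulatory standard, and the records are invented:

```javascript
// Toy k-anonymity check: count how many records share each combination of
// quasi-identifier values; k is the size of the smallest such group.
function kAnonymity(records, quasiIdentifiers) {
  const counts = new Map();
  for (const r of records) {
    const key = quasiIdentifiers.map(q => r[q]).join('|');
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return Math.min(...counts.values()); // smallest equivalence class = k
}

const records = [
  { ageGroup: '35-39', zip3: '981', sex: 'F' },
  { ageGroup: '35-39', zip3: '981', sex: 'F' },
  { ageGroup: '40-44', zip3: '981', sex: 'M' },
];
const k = kAnonymity(records, ['ageGroup', 'zip3', 'sex']);
// k === 1: the last record is unique on these quasi-identifiers,
// so it is the weak point for re-identification
```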

The next two days will touch heavily on these areas, and I'm looking forward to connecting with other health data innovation enthusiasts. Follow me on Twitter for instant updates, and stay tuned for follow-up posts.