Encouraging the use of data from a brand-new tobacco study

on Wed, 01/08/2014 - 22:30

Yesterday, our team at the Institute for Health Metrics and Evaluation (IHME) launched the results of a comprehensive analysis of data on smoking prevalence and cigarette consumption around the world. The results are quite staggering. While smoking prevalence decreased in men and women between 1980 and 2012, the number of daily smokers increased by 41% in men and 7% in women, due to population growth. 
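
To see how falling prevalence and a rising number of smokers can coexist, here is a minimal arithmetic sketch in Python. The population and prevalence figures are made up for illustration and are not taken from the study; only the relationship (smokers = population × prevalence) matters.

```python
# Hypothetical numbers for illustration only; they are NOT results from the study.
def count_smokers(population, prevalence_pct):
    """Number of smokers given a population and a prevalence in whole percent."""
    return population * prevalence_pct // 100

# Prevalence falls from 40% to 30%, while the population grows by 60%.
smokers_1980 = count_smokers(1_000_000, 40)
smokers_2012 = count_smokers(1_600_000, 30)

# Percent change in the number of smokers between the two years.
change_pct = 100 * (smokers_2012 - smokers_1980) / smokers_1980

print(f"Smokers: {smokers_1980} -> {smokers_2012} ({change_pct:+.1f}%)")
```

In this toy case the share of people who smoke drops by a quarter, yet the head count of smokers still rises because the population grew faster than prevalence fell.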

In 2012, people around the world smoked a total of over 6 trillion (!) cigarettes. The results are available by age, sex, country, and year with several metrics (smoking prevalence, cigarettes consumed, number of smokers). Quite a comprehensive dataset.

To encourage the use of these numbers, we published the data in different formats for different purposes and audiences:

  1. Peer-reviewed, scientific paper in the Journal of the American Medical Association (JAMA)
    The paper provides deep insight into the data and analytic methods used, and is mainly targeted at academic researchers and other analysts. In addition, the peer review provides assurance to all audiences that the results are scientifically sound and based on valid data.
  2. Comprehensive dataset for download on IHME's Global Health Data Exchange (GHDx)
    Provided as a comprehensive CSV file with all metrics and all dimensions, this dataset is useful for anyone who wants to use the results for further analysis or modeling, mostly academic researchers, modelers and analysts.
  3. Interactive data visualization on the IHME website
    The visualization (pictured above) explores all aspects of the paper. It provides a global perspective with international comparisons on a map, in a sunburst diagram, and via line charts. It also enables country deep-dives and a closer look at all the input datasets used for the analysis. Click through to the visualization and see for yourself. Hopefully, policy and decision makers, funders, and many others will find the functionality useful to explore trends and patterns of smoking and cigarette consumption around the world, and devise ways to further decrease the prevalence of smoking to reduce the loss of health and lives to smoking.
  4. Infographic providing key insights
    A tobacco infographic focuses on global trends, rankings of countries with highest increase/decrease in smokers, and some specific examples. This graphic should capture the attention of just about anyone. After all, who knew that 6 trillion cigarettes can be smoked in one year?
  5. Encouraging the use of data
    The Roux Prize, a $100,000 award that I recently announced at TEDx Rainier, was created to encourage the use of burden of disease data to improve the health of populations. You can read about how Australia used disease burden evidence to try to control tobacco there. Using these data to curb smoking around the world would certainly be a worthy cause.

In sum, hopefully everyone interested in smoking trends, reducing cigarette consumption, and its impact on health in general will find access to data that are useful to them.

What about you? Did you find a format of the tobacco data that appealed to you? Do you have ideas for other ways to share the data? Or suggestions for improvement? Please leave a comment or send me a note via email or Twitter. I'd be happy to hear your ideas.

Global Burden of Disease data by country for download

on Wed, 09/04/2013 - 07:37

Our team at the Institute for Health Metrics and Evaluation (IHME) is now able to share the data from the Global Burden of Disease (GBD) 2010 study for download at the country level. First presented in a dedicated triple issue of the Lancet last December and made available in innovative data visualizations on the IHME website, the data can now be downloaded freely in three easily accessible places:

  1. GBD Compare - the flagship visualization for GBD results now has a "download" button that provides CSVs for any chart that is being viewed in the tool
  2. IHME's Global Health Data Exchange (GHDx) provides datasets for the cause and risk factor results for each of the 187 countries covered by GBD
  3. A new query-based data tool allows you to type a disease, injury, risk factor, country, age group, year, metric or other keyword to create a table and simple visualizations with the results
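
For readers who download a CSV and want to slice it themselves, here is a minimal sketch using only Python's standard library. The column names (`country`, `year`, `cause`, `metric`, `value`) and the sample rows are assumptions made up for illustration; the actual files may be laid out differently, so check the header of the file you download.

```python
import csv
import io

# Placeholder data with hypothetical column names; NOT actual GBD results.
SAMPLE_CSV = """country,year,cause,metric,value
Country A,1990,Cause X,deaths,100
Country A,2010,Cause X,deaths,80
Country B,2010,Cause X,deaths,50
"""

def filter_rows(csv_text, **criteria):
    """Return the rows whose columns match every given column=value pair."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if all(row[column] == wanted for column, wanted in criteria.items())
    ]

# All values are read as strings, so the year is compared as "2010".
rows_2010 = filter_rows(SAMPLE_CSV, year="2010", cause="Cause X")
for row in rows_2010:
    print(row["country"], row["metric"], row["value"])
```

For a real download, replace `io.StringIO(csv_text)` with an open file handle.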

Please let us know if these tools are useful for you, and how we can further improve them. We are always looking for feedback and ideas for improvements.

Tell your global health data story in videos

on Fri, 03/08/2013 - 06:51

Data visualizations provide fabulous opportunities to make large amounts of data accessible. Sophisticated controls can allow users to work their way from high-level views into great degrees of detail. I wrote a few days ago about the role of visualizations in making the country-level results of the Global Burden of Disease (GBD) 2010 study accessible and useful. The GBD visualization tools make over one billion results accessible: users can choose any combination of cause of disease or injury, risk factor, country, age, gender or year, and explore various metrics.

Visualization tools can also be used very effectively to tell stories. I would guess that the GBD data contain millions of stories worth telling. Videos can be a very effective way to share these stories. A recent article by Robert Kosara (@eagereyes) provides several examples of great data storytelling in video. Videos from public presentations can be very powerful, but simple screen grabs with tools like SnagIt let you tell your stories from the privacy of your own home (or office) and provide a much clearer view of the visualizations themselves.

Below, I'm adding four videos. The first shows Christopher Murray, Director of IHME and inventor of the concept of Global Burden of Disease, use the new visualization tools to explain GBD and show key findings from GBD 2010. In the second video, Bill Gates talks about the value of visualizations and provides great feedback on the GBD visualizations. The third is my first attempt to explain the functionality of the GBD flagship visualization, GBD Compare, in a video tutorial. And the fourth is a quick video that Tom Paulson from Humanosphere took when we talked about GBD and visualizations (see the resulting article here).

Have a look. And if you have been playing with visualizations, why don't you record your own stories, e.g. with the GBD visualization tools? Let me know about them, and I'll feature the best ones on this blog.

Christopher Murray using the GBD visualization tools to share findings


Bill Gates on GBD and the visualization tools


Tutorial for GBD Compare, the GBD flagship visualization


Quick intro to GBD Arrow Diagram

Visualizing Global Burden of Disease: behind the scenes

on Mon, 03/04/2013 - 16:50

Today, the Institute for Health Metrics and Evaluation (IHME, my employer) is launching 8 new interactive data visualizations that bring to life the results of the 5-year Global Burden of Disease (GBD) study at the country level. The GBD study compiled all available data on health outcomes for 187 countries in the world for 1990 and 2010, and provides estimates for the burden caused by different diseases and risk factors that are comparable across countries and over time. Regional results were published in a dedicated triple-issue of the Lancet in December 2012 (see my related post here). Managing the Data Team at IHME, I have been lucky enough to support the project with finding and managing data over the past 4 years, as well as overseeing the creation of these visualizations.

The data visualizations play a key role in the GBD project for several reasons. It started with IHME’s need to review the results of GBD. Tables and static graphs just don’t provide the flexibility to properly assess results and identify patterns and trends.

GBD uses four key metrics: number of deaths, years of life lost (YLL), years lived with disability (YLD), and disability-adjusted life years (DALY). The results datasets are massive, broken down by several dimensions:

  • 291 causes of disease and injuries at the most granular end of a 5-level cause hierarchy
  • 66 risk factors
  • 1100 cause-risk factor attributions (i.e. burden caused by a given risk factor via a particular disease or injury)
  • 187 countries, 21 GBD regions, global
  • 27 age groups: early neonatal, late neonatal, post neonatal, 1-4 years, 5-9, 10-14 and so on until 75-79, 80+, as well as under 5, 5-14, 15-49, 50-69, 70+, all ages, and age-standardized
  • Male, female, both
  • 3 years: 1990, 2005, 2010
  • Estimates expressed as total number, rate, and %, as well as ranked by country
  • 95% uncertainty intervals: lower bound, mean, and upper bound (not strictly a dimension, but it adds to the size of the database)

In total, about 1 billion (!) results were calculated for the project, and then there are aggregations by cause, age, and geography. A nightmare to review, but a gold mine for visualizations. The results datasets are fully imputed for all dimensions, i.e. there are no gaps in the datasets. And consistent use of methods ensures comparability of results across all dimensions.
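
To get a feel for how quickly these dimensions multiply, here is a back-of-the-envelope sketch in Python. It is a naive cross-product of the counts listed above, under my own simplifying assumptions: every combination is treated as if it exists, causes, risks, and cause-risk attributions are simply added together, and uncertainty bounds are left out. It is not how the actual number of results was derived; it only shows that the order of magnitude is plausible.

```python
from math import prod

# Counts taken from the list of dimensions above.
dimensions = {
    # causes + risk factors + cause-risk attributions (added together: an assumption)
    "cause_or_risk": 291 + 66 + 1100,
    # countries + GBD regions + global
    "location": 187 + 21 + 1,
    "age_group": 27,
    "sex": 3,     # male, female, both
    "year": 3,    # 1990, 2005, 2010
    "metric": 4,  # deaths, YLL, YLD, DALY
    "unit": 3,    # total number, rate, %
}

# Naive upper-level estimate: assumes every combination is present.
naive_total = prod(dimensions.values())

print(f"Naive cross-product: about {naive_total / 1e9:.1f} billion values")
```

Adding the lower and upper uncertainty bounds would triple this figure, which is why they add to the size of the database even though they are not a dimension in the strict sense.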

Initially, we tried off-the-shelf visualization tools, but they didn’t give us the flexibility to dive into all the dimensions and properly explore patterns and trends in the data. Then we discovered D3.js (Data-Driven Documents). D3 is a JavaScript library for manipulating documents based on data; it allows developers to build powerful visualizations very efficiently (but you be the judge about how powerful our resulting visualizations really are). And we did what was recommended in a blog post today: iterate early, iterate often.

We improved the tools as we reviewed our results, then started using the tools to show the results to collaborators and country experts to obtain feedback, review our estimates, and discuss what data were used for analysis (and what data may be available to further inform and improve the estimates). Realizing how powerful these tools are for different audiences to explore the results of GBD, we decided to make them publicly available. In December 2012, we launched 5 visualization tools with the regional results of GBD (available here) with the publication of the GBD papers in The Lancet.

Updates for these tools are now available with country-level results. In addition, we created three new tools that allow users to review and explore the data from completely new angles. Here is a quick overview of the country-level visualizations:

  • GBD Compare is a powerful platform that visualizes the data in treemaps, maps, time plots, age plots and stacked bar charts. The most powerful feature is the 2-panel view that allows users to review any two of these charts simultaneously to compare and review trends across causes, risks, countries, ages etc. The panels are interactive, e.g. the map can be used to select countries in the other panel and quickly explore countries around the world. It’s a powerful tool, but requires a bit of commitment to make use of all the features. My video tutorial for GBD Compare can be found here.
  • GBD Cause Patterns provides results for 21 cause groups in stacked column charts. It allows quick exploration of trends across geographies, ages, gender and time (see options at the bottom of the screen).
  • GBD Arrow Diagram shows very concisely the rank of causes and risks for a given country or region in 1990 and 2010, along with the related growth trend. The connecting arrows quickly show how fast causes and risks have grown or decreased between 1990 and 2010. A version of the GBD Arrow Diagram is embedded below.
  • GBD Heatmap ranks causes and risks by burden within a country, but then allows comparisons of those ranks across countries and/or regions (you can compare the ranks within a country with the ranks for a given region or the world).
  • GBD Uncertainty Visualization allows users to compare uncertainty bounds across causes and risks for all dimensions. Countries or causes/risks where the data were more sparse or inconsistent will have wide uncertainty intervals.
  • HALE/LE Visualization shows the relationship between total life expectancy and healthy life expectancy, i.e. the number of years people can expect to spend in good health over their lifetime.
  • Mortality Visualization provides an interesting addition to the results: users can look at all-cause mortality estimates and uncertainty bounds in the context of the underlying input data points. The hovers provide detailed metadata about the source of the data point.
  • COD Visualization shows the input data points for cause of death data by country, cause, and sex, also with detailed metadata.

All visualizations also feature “share” functionality that creates a unique URL for the chosen settings that can be shared via email, Twitter, Facebook or other social media. This should be useful to bring up the tools in online conversations about the health situation in different countries, disease patterns and international comparisons.
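
The idea behind such a share link can be sketched in a few lines of Python: the chosen settings are serialized into the query string of a URL, and the same settings can be read back when the link is opened. This is only an illustration of the general technique. The base URL and the setting names (`chart`, `location`, `year`, `metric`) are hypothetical and are not the actual parameters used by the GBD tools.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical address; not the real URL of any GBD tool.
BASE_URL = "https://example.org/gbd-compare"

def build_share_url(settings):
    """Encode the settings as a query string; sorting keeps the URL stable."""
    query = urlencode(sorted(settings.items()))
    return f"{BASE_URL}?{query}"

def read_share_url(url):
    """Recover the settings dictionary from a share URL."""
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}

# Hypothetical setting names and values.
settings = {
    "chart": "treemap",
    "location": "Country A",
    "year": "2010",
    "metric": "daly",
}

url = build_share_url(settings)
restored = read_share_url(url)
print(url)
```

Sorting the settings before encoding means that the same selection always produces the same URL, regardless of the order in which the options were chosen.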

These tools will be used extensively in policy and country consultations, and many of these conversations will be conducted in locations that have less than reliable internet connections. To facilitate use, we created offline versions of these tools as well. The sheer size of the data provided a substantial challenge, but the tools are now performing well offline.

If you are interested in building additional visualizations with the GBD results, you should start with the regional results of GBD, all available for download on the GHDx here. The country-level results will be made available via the GHDx in September 2013.

I would love to get your feedback on your experience with using the visualizations. Are they intuitive? Are there features that you like or don’t like? Are there things you would like to see or do with the data that aren’t possible yet? Leave suggestions in the comments, and I will make sure to include them in our discussions for future development.


Example: GBD Arrow Diagram

Open data and the four tiers of health data sharing

on Sat, 02/23/2013 - 06:44

Today is Open Data Day. Open Data enthusiasts, activists, developers, hackers, scientists, and other entrepreneurial data geeks are gathering around the world to demand more open data, work in hackathons or code-a-thons, and engage in data discussions. Just google "open data day events" to see the scope. It's very encouraging.

The benefits of releasing open data are manifold. Take as an example open government data: It increases trust in the government by providing more transparency and accountability. It helps improve public services. It can stimulate economic activity and generate jobs. It helps governments improve the use of their own data. It helps increase the exchange of information among different departments and ministries (which are often siloed) and improve collaboration. And as an additional perk, open data will also lead to savings by reducing work on specific data requests.

There are plenty of examples where this works really well. The release of weather data has led to great weather apps, insurance products, and other services. GPS data are now used in fabulous apps and services on almost every mobile device. Public traffic data are used to make commuting and traveling easier, and so on. The situation for health data is slightly different.

Releasing health data as open data requires consent of the subject and privacy considerations, and there are specific regulations aimed at data collected in the delivery of health care (e.g. HIPAA in the US) and oversight by Institutional Review Boards (IRB). At the same time, the stakes in health are higher than in many other fields. The sharing of health data enables data users to provide evidence for policy and decision making, track performance, evaluate and improve quality, identify effective interventions, optimize healthcare pathways, and improve health of individuals and populations. In short, sharing health data saves lives.

There are four different degrees of openness for health data sharing. Given the potential impact, the goal for every organization holding health data should be to publish as much data as possible at the highest level of detail possible, while protecting subjects and complying with regulations.

Tier 1: Open Indicators

Aggregated or tabulated data should always be shared as open data through as many channels as possible, including organizations' websites, data visualization sites, data aggregators, and open data portals. Examples can be found on or the health section of, as well as at WHO, the World Bank, and the recent data release of the Global Burden of Disease (GBD 2010 regional results; my employer, IHME, is the coordinating organization of the study).

Tier 2: Open Microdata

Detailed or micro-data at the respondent or individual level can often be carefully de-identified and shared as open data. Sample surveys, mortality data, and even hospital discharge data are often shared openly, e.g. CDC's Reproductive Health Survey series on IHME's Global Health Data Exchange (yes, that's the platform that I manage) or the public use datasets for US mortality from NCHS. If access to funding or other considerations require registration, it should be fast (ideally instantaneous) and free, as is the case for microdata for the Demographic & Health Surveys from MeasureDHS.

Tier 3: Data Use Agreements

When data cannot be shared without restrictions, there should be a clearly defined process for data users to request access to more detailed or partially identified data (if consent from individuals to share the data was obtained). These processes need to balance the proposed purpose of using the data with the risk of identification of individuals, and provide proper oversight and safeguards that protect subjects' privacy. US mortality data with county identifiers are only available under Data Use Agreement.

Tier 4: Fully controlled data access

If data are too sensitive to hand out at all, data owners can offer options to access and analyze data on their own premises, and allow data users to only take the results of their analyses with them. The US Census Bureau operates Census Research Data Centers (RDCs), where researchers can access the full detail of data on controlled premises; no microdata can be taken out and research results are carefully vetted before being released to the researchers. Short of implementing full-fledged programs, data owners can also collaborate with researchers to provide this kind of access.

Last but not least, sharing information about data collected is a minimum requirement. Over the past few years, my team at the Institute for Health Metrics and Evaluation (IHME) has cataloged and published information for over 8000 health-related datasets in the GHDx, and we are adding more daily. We are cataloguing data from 200 countries around the world, and it is often incredibly hard to even identify what data have been collected, and who to contact for access. Websites are in different languages and structures, are constantly in flux, and can be down for periods of time; data available one day may be gone the next. Data and information about them are often only available in reports, statistical yearbooks, or published literature. Data owners should make an effort to add information about their data and at least aggregated results to open data platforms and catalogues to make them easier to find. And subsequently try to release as much data as possible in each of the four tiers.

Happy Open Data Day!