Data science is crucial to making sense of all the health data collected in the health care system, by patients (and healthy individuals), by governments and others. "The study of the generalizable extraction of knowledge from data" (Wikipedia) is a vital step in making data useful for strategy, planning, and policy and decision making. Data scientists are in high demand, command large salaries, and are generally predicted to have a bright future. Rightly so. But to really create an impact with data science, there are six distinct steps that each require different types of expertise. Sure, they can all be done by one individual. But for larger projects, and to maximize impact, you'll need a team of experts with different specialties. In reality, data science is all about teamwork.
Step 1: objective & approach
Before touching any data, the objective of the data science exercise has to be defined. With the final audience(s) of the results in mind, stakeholders need to identify the key questions that need answers. With more data being available, the question much too often becomes "what can we do with these data?" Instead, focus on the question. In some cases, the answer may not even require a large project. Once the goal is clear, the team needs to identify relevant data, the (likely) analytic approach, as well as the key metrics and dimensions that will be needed in the results. This first step should involve experts from all the following steps.
Step 2: data seeking and collection
Based on the information from Step 1, data experts like librarians, information scientists, or domain experts need to identify relevant existing data via literature reviews, web searches, or personal networking. Sometimes, some or all of the data will need to be gathered via primary data collection. Even if useful data exist, it's often still hard to identify relevant datasets, find the data provider, and get access to the data. Barriers like unwillingness to share data, lack of documentation, insufficient capacity or expertise to share data, language, or data formats can make obtaining the data rather difficult.
Step 3: data preparation
Steps 3 and 4 are at the heart of the data science project, and are often combined into one. Once the data have been obtained, data analysts or scientists need to prepare them for analysis. Lots of data are still stuck on paper or in PDFs and need to be digitized. Unstructured data need to be turned into structured data. Microdata need to be aggregated, linked, and analyzed. Data from different sources often require cross-walks, e.g. between different versions of the International Classification of Diseases (ICD). At this stage, correction of data quality issues can be applied, e.g. correcting for "garbage codes" in cause of death certification. The ideal end result is a coherent dataset that can be used for analysis.
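As a rough illustration of the cross-walk and aggregation steps, the Python sketch below maps a few death records coded in ICD-9 to ICD-10, counts deaths per code, and reports the share that falls on a garbage code. The mapping table, the record layout, the garbage code list, and the function names are all assumptions made up for this example. A real project would use an official crosswalk and a far more careful redistribution of garbage codes.

```python
from collections import Counter

# Toy ICD-9 -> ICD-10 crosswalk. Illustrative only, NOT an official mapping;
# real crosswalks are often one-to-many and need expert review.
ICD9_TO_ICD10 = {
    "410": "I21",    # acute myocardial infarction
    "486": "J18.9",  # pneumonia, organism unspecified
    "799.9": "R99",  # ill-defined or unknown cause
}

# Codes treated as "garbage codes" in this sketch (an assumption for the example).
GARBAGE_CODES = {"R99"}


def harmonize(records):
    """Map every record to an ICD-10 code and count deaths per code.

    Records whose ICD-9 code is missing from the crosswalk are returned
    separately so they can be reviewed instead of being silently dropped.
    """
    counts = Counter()
    unmapped = []
    for rec in records:
        code = rec["code"]
        if rec["version"] == "ICD-9":
            if code not in ICD9_TO_ICD10:
                unmapped.append(rec)
                continue
            code = ICD9_TO_ICD10[code]
        counts[code] += 1
    return counts, unmapped


def garbage_share(counts):
    """Fraction of counted deaths assigned to a garbage code."""
    total = sum(counts.values())
    garbage = sum(n for code, n in counts.items() if code in GARBAGE_CODES)
    return garbage / total if total else 0.0


# Hypothetical microdata from two sources that use different ICD versions.
records = [
    {"version": "ICD-9", "code": "410"},
    {"version": "ICD-10", "code": "I21"},
    {"version": "ICD-9", "code": "486"},
    {"version": "ICD-9", "code": "799.9"},
    {"version": "ICD-9", "code": "XXX"},  # not in the toy crosswalk
]

counts, unmapped = harmonize(records)
print(dict(counts))
print("unmapped records:", len(unmapped))
print("garbage code share:", garbage_share(counts))
```

Keeping the unmapped records in a separate list is a deliberate choice here: a gap in the crosswalk is a data quality finding in its own right, and the team should see it.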
Step 4: data analysis
Once the data are prepared, the rubber hits the road as scientists apply mathematics, statistics and computer science to the data. Different models can be applied to the data, from simple regression models to machine learning. Predictive validity testing can help identify the best model, e.g. for analyzing causes of death. With more data, more powerful computation and more sophisticated methods, analytic projects can quickly turn into veritable software development projects. These projects require a very systematic approach to coding up the analysis, ideally with the involvement of software engineers. In addition, interactive visualizations can be extremely useful to review the results of the analysis, requiring yet another set of database and coding skills. Typically, scientists are experts in just one or a few of these areas, requiring teamwork on this step alone (creating a team can also be the reaction to the current data scientist shortage).
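One common way to do predictive validity testing is to hold back part of the data, fit each candidate model on the rest, and compare how well the models predict the held-back values. The Python sketch below does this for two very simple models. The data are invented, and the model choices, the even/odd split, and the function names are assumptions for illustration rather than a method prescribed here.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Simple linear regression fitted by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def rmse(model, xs, ys):
    """Root mean squared error of a model's predictions on (xs, ys)."""
    squared_errors = [(model(x) - y) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared_errors) / len(squared_errors)) ** 0.5


# Invented toy data with a roughly linear trend.
xs = list(range(10))
ys = [1.2, 2.9, 5.1, 7.0, 8.8, 11.1, 13.0, 14.9, 17.2, 19.0]

# Hold out every second observation for testing (an arbitrary split for the sketch).
train_x, train_y = xs[::2], ys[::2]
test_x, test_y = xs[1::2], ys[1::2]

candidates = {"mean": fit_mean, "linear": fit_linear}

# Fit on the training data only, score on the held-out data only.
scores = {
    name: rmse(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)

print(scores)
print("best model on held-out data:", best)
```

The same loop works for any number of candidate models, as long as each one is fitted only on the training portion and scored only on the held-out portion.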
Step 5: data and code sharing
Much too often, the results of significant analytic tasks are used to answer the question, but are not shared for broad re-use. Of course, there are often political, competitive, legal, resource and other considerations that make data and code sharing impractical or impossible. However, whenever possible, code and results data should be shared in as much detail as possible. In addition, full citation lists and links to the data sources should be provided, ideally along with the actual input data to enable others to reproduce the results, or build on the analysis. In practice, though, data use agreements, copyright and other legal constraints much too often make sharing the actual input data difficult or impossible.
Step 6: data translation
The results of a data science project are often of interest to very different audiences, ranging from academic and other researchers to analysts, domain experts, policy and decision makers, journalists, bloggers, activists, and many others. The team needs to provide results in appropriate formats or products for these audiences, e.g. via peer-reviewed publications, books, policy reports, press releases, infographics, or interactive visualizations. Creating these products requires a good understanding of the relevant audiences and a good command of the subject matter. In addition, the data science team can offer further advice and insight to the relevant audiences, or engage in collaborations to use and build on the research.
All these steps are crucially important for the success of a project. Are the results credible if relevant input data were not used? And is the analysis worth it if the results are accessible to other researchers in a published paper but not used by decision makers? The steps are also highly interrelated, e.g. the type of data available for analysis will impact what methods can be applied for analysis. While it's possible for an individual to go through these steps alone, doing this as a team will create a much better chance of success and make the work much more productive. And fun.