
Useful tools to review, refine, clean, analyze, visualize and publish data

on Fri, 03/02/2012 - 16:47
Over the last few days, O'Reilly's Alex Howard (aka @digiphile) has published a series of very informative interviews with data journalists. As journalists get more and more sophisticated in collecting, collating, analyzing and visualizing data, the lessons they have learned are useful for anyone working with data. The interviews contain plenty of insight, practical information, and links to more resources and examples, and I encourage you to read them in their entirety (see links below).

However, most interesting to me are the tools that the interviewees mention and which Alex calls the "Newsroom Stack". Any number of those tools may be used in sequence to get from your set of data to useful insights. I used the additional comments from the journalists to add to my own list of useful data tools; some key ones below, the rest on the Health Data Innovation Tools page. Let me know what other tools you think I should add.

Data tools: conversion, exploration, analysis

  • Microsoft Excel - still the standard for many as the easy first stop to review data
  • Data Science Toolkit - collection of useful tools to extract and convert text, GIS and other data (my overview here)
  • ScraperWiki - provides software and instructions to extract data and information from web sites
  • Google Refine - clean, organize, refine (duh!) and explore your new datasets
  • Overview - clean, visualize and interactively explore large document sets and data sets (started by AP)
  • The PANDA Project - the new newsroom data appliance
  • Stat/Transfer - converts data between the formats of statistical analysis packages
  • Ruby on Rails - powerful open source web framework (written in Ruby) for budding programmers, with helpful add-ons like Remote Table (mapping); Django is the comparable framework for Python
  • Python - programming language, very useful for data analysis and visualization
  • JavaScript - prototype based scripting language
  • R - open source software environment for statistical computing and graphics
  • Git - to track versions of code and share with others

Data visualization and GIS packages

  • Protovis/D3 - JavaScript-based libraries for very slick visualizations
  • MetaLayer - discover and share insights from data via infographics
  • WEAVE - Web-based Analysis and Visualization Environment
  • PostGIS - spatially enabled PostgreSQL server
  • TileMill - design studio to create maps, powered by MapBox
  • Leaflet - JavaScript library to create interactive maps


Databases

  • MySQL - open source relational database system
  • PostgreSQL - open source object-relational database system
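  • SQLite - self-contained, file-based SQL database engine that allows SQL queries without setting up a full database server (the SQLite Manager Firefox extension provides a graphical front end)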
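
To illustrate the SQLite entry above, here is a minimal sketch of running SQL queries without setting up a database server, using the sqlite3 module from Python's standard library. The table name, column names and figures are made up purely for illustration; they do not come from any of the interviews.

```python
import sqlite3

# Hypothetical sample data, invented for illustration only:
# (state, year, number of visits)
SAMPLE_ROWS = [
    ("Alabama", 2010, 120),
    ("Alabama", 2011, 135),
    ("Alaska", 2010, 15),
    ("Alaska", 2011, 18),
]


def total_visits_by_state(records):
    """Load records into an in-memory SQLite database and sum visits per state."""
    conn = sqlite3.connect(":memory:")  # no server and no file on disk
    try:
        conn.execute(
            "CREATE TABLE clinic_visits (state TEXT, year INTEGER, visits INTEGER)"
        )
        # Parameterized inserts: one row per tuple in `records`
        conn.executemany("INSERT INTO clinic_visits VALUES (?, ?, ?)", records)
        cursor = conn.execute(
            "SELECT state, SUM(visits) FROM clinic_visits "
            "GROUP BY state ORDER BY state"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for state, total in total_visits_by_state(SAMPLE_ROWS):
        print(state, total)
```

Replacing ":memory:" with a file name would keep the data in a single file on disk, which is handy for sharing a small dataset with a colleague.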

Here are the articles; check back on the O'Reilly Radar data page for more:

Interview 1: Liliana Bounegru (@bb_liliana), project coordinator of SYNC3
Interview 2: Dan Nguyen (@dancow), news app developer at ProPublica
Interview 3: Derek Willis (@derekwillis), news developer at New York Times
Interview 4: Ben Welsh (@palewire), Web developer at Los Angeles Times
Interview 5: Michelle Minkoff (@MichelleMinkoff), investigative developer/journalist at AP

This should put you in the right mood to have a look at the "Effective Data Visualization" presentation by Hjalmar Gislason (aka @datamarket) at Strata this week. It's a great account of the considerations necessary for anyone who wants to create visualizations. Very useful: if you download the PDF from SlideShare, the slides contain links to more information online.

