In this interview, Edouard Mathieu, Head of Data at Our World in Data, answers questions from Katie Mika about openly publishing vital data that is often inaccessible and difficult to evaluate. Edouard discusses Our World in Data's heavily used COVID-19 Data Explorer; the importance of publishing transparent data products in multiple formats, together with processing and analysis code, under a permissive license; and the value of lobbying international organizations to open their data, both for scientific research and for public accountability.
My name is Edouard Mathieu, and I'm the head of data at Our World in Data (OWID). I lead the team that builds and maintains our open-access database of tens of thousands of metrics. I work on the whole chain of collection, transformation, documentation, and dissemination of our data.
OWID is an online scientific publication launched in 2013 by the researcher Max Roser. The goal of our work is to make knowledge about the world's largest problems accessible and understandable. Under this description, we include problems such as climate change, extreme poverty, global health, and many others.
Over the last couple of years, we have focused a lot of our attention on helping the world to understand the COVID-19 pandemic. In particular, we created the COVID-19 Data Explorer, one of the tools most widely used by journalists, policy-makers, and researchers to understand how the pandemic is evolving. And we have maintained the global datasets on COVID-19 testing and vaccination that are now used by most international institutions and media organizations around the world.
On the data side of our operations, we aim for the highest possible transparency and openness in data processing and publishing. This means that all our code is open-sourced on GitHub, and that we make our data available in many open formats (most notably our data on the pandemic).
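As a loose illustration of what publishing the same table in several open formats can look like (this is a hypothetical sketch, not OWID's actual pipeline, and the records are made-up values), Python's standard library alone is enough:

```python
import csv
import io
import json

# A tiny, made-up table in the spirit of OWID's country-level metrics.
# (Illustrative values only -- not real data.)
records = [
    {"location": "Exampleland", "date": "2021-06-01", "new_cases": 120},
    {"location": "Exampleland", "date": "2021-06-02", "new_cases": 95},
]

def to_csv(rows):
    """Serialize rows to CSV, a simple open format readable by any tool."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_json(rows):
    """Serialize the same rows to JSON for programmatic users."""
    return json.dumps(rows, indent=2)

csv_text = to_csv(records)
json_text = to_json(records)
print(csv_text.splitlines()[0])  # → location,date,new_cases
```

The point of offering multiple formats is that the same underlying data serves spreadsheet users (CSV) and developers (JSON) without either group having to convert it themselves.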
It also means that all the charts, maps, data, and text produced by OWID are free for our users to take and use, without requiring our permission. They only need to provide credit to Our World in Data and our underlying sources. We license our work under a very permissive Creative Commons license (CC BY).
We also try to 'raise the bar' when it comes to data access and free licensing, by calling on international institutions to open their datasets. Recently, we asked the International Energy Agency to lift the paywall on its global energy data: the paywall keeps this very important data out of public discourse and prevents many researchers from accessing it.
Efforts towards open data have the potential to improve the world in many ways. Open data means that researchers can use data to describe and understand the world much better. It also means that governments can then reuse the data made available by these researchers. And in turn, it means that journalists and citizens have greater oversight of whether the decisions taken by policy-makers are actually grounded in evidence.
In more practical terms, we see open data as covering three different aspects:
The data itself: that's the most well-known aspect of this problem, and the one where we've seen the most progress. This is about freeing the data from the databases where it's been locked away, sometimes for decades. Hans Rosling described it as the "database hugging disorder": a tendency of international institutions, governments, and companies to keep their data close and protect it like a golden treasure.
The license that protects the data: this is an aspect that is too often forgotten. Once the data is available, what are secondary users allowed to do with it? Can they reuse it, improve it, and redistribute it to help the world make progress against important problems? Or are they merely free to download a file to their computers and inspect it?
The code and methods that generated the data: this is where we've seen the least progress. Many institutions and governments, even those that have now opened access to almost all their data, keep a tight lid on how they generated it in the first place. What methods have been used to transform raw data into useful variables? What are the necessary limitations of the final dataset? What compromises were made between accuracy and availability? Has the code that generated the data been thoroughly tested, and how could it be improved?
Even though some actors have made great progress in the last few years, many others are far from ready. For most of them, it's because there are still many internal obstacles that prevent them from committing to opening their data. But for many as well, the problem lies in a lack of technical expertise. Many governments and large international institutions do not have data teams to take care of this. Or when they do, these teams often exist mainly to signal a willingness to improve things, but with so few financial and human resources that it's virtually impossible to truly make progress.
Another complexity that I mentioned earlier is this idea that merely publishing files means we've achieved 'Transparency' with a capital T. By only requesting access to files, we are underestimating the importance of the code and the methods behind them. Trade-offs and difficult decisions made by data providers are crucial intermediary steps that should be publicly available, and publicly discussed. If we have no idea how the data that we use came to be, then there is still a thick layer of blind trust separating us from true openness.
As the pandemic very slowly recedes from the headlines, our team is finding more time to work on the other important problems that the world is facing. And there are many. In the months to come, we want to publish a lot of new content on climate change, inequalities, mental health, democracy, existential risks, and many others.
As for the area I'm responsible for, we are working on opening our data architecture even further, by making many more of our datasets available in a data-science-ready format. Each day, thousands of people around the world waste time finding, understanding, and cleaning data that was already processed extensively by others before them. We want to help these people save months of work by publishing ready-to-use datasets that contain most of the variables they need. This is something we've already done not only for our COVID-19 dataset, but also for our CO2 and Energy datasets. We want to reach a stage of openness, maintenance, and documentation of our data that will let millions of people understand the world better.
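To give a rough sense of what "data-science-ready" means in practice (a hypothetical sketch with invented numbers, not actual OWID data or code), a tidy long-format table lets an analyst pull out exactly the series they need with no scraping or re-cleaning:

```python
# Hypothetical rows such as a "data-science-ready" file might contain,
# one observation per (location, year) pair. Values are illustrative only.
rows = [
    {"location": "A", "year": 2019, "co2": 10.0},
    {"location": "A", "year": 2020, "co2": 9.5},
    {"location": "B", "year": 2019, "co2": 4.2},
    {"location": "B", "year": 2020, "co2": 4.0},
]

def series_for(rows, location, metric):
    """Extract one location's time series straight from the tidy table --
    the cleaning work has already been done by the data publisher."""
    return {r["year"]: r[metric] for r in rows if r["location"] == location}

print(series_for(rows, "A", "co2"))  # → {2019: 10.0, 2020: 9.5}
```

This is the kind of saving the interview describes: when a dataset arrives already tidy and documented, analysis starts at the question, not at the cleanup.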
Text: © 2021 the President and Fellows of Harvard College and Edouard Mathieu, and licensed under a Creative Commons Attribution (CC BY 4.0) license