In this interview, Colleen Cressman talks with Katie Mika, Data Services Librarian at Harvard Library and the Institute for Quantitative Social Science. Katie discusses her role and how she works with researchers and colleagues on all things data, and shares her views on open access, which is, in her words, “as much about the process and the values behind openly sharing as it is about the materials and content that are shared.” Katie also offers strategies for searching in Dataverse and, in keeping with this year’s theme for International Open Access Week, shares datasets related to climate justice.
I’m a Data Services Librarian with both Harvard Library’s Open Scholarship and Research Data Services and the Institute for Quantitative Social Science.
Formally, I work with researchers and library colleagues across disciplines to deliver scalable Dataverse repository data-curation services, consultations, and trainings that support data sharing, reuse, collections development, and stewardship.
Informally, I collaborate with researchers and colleagues from any field who are interested in data sharing. Sometimes this means helping people use Dataverse, sometimes it means supporting data management practices that make it easier to share and publish research data, and sometimes it means figuring out how to make research software and information systems work together to curate datasets that the research community may want to reuse.
I work with both the Dataverse Project and Harvard Dataverse. The Dataverse Project is the open source software used to build a data repository, and Harvard Dataverse is Harvard’s data repository, a specific installation of the Dataverse software. I work mostly with the Data Management and Curation Team at Harvard Dataverse to help researchers use our repository and share data, and I contribute to the Dataverse Project by submitting and working on issues in the GitHub repository where the software is developed, participating in working groups and community discussions about adding and revising software features, and generally participating in the repository professional community as a Dataverse representative.
“Can I put this in Harvard Dataverse?” — My response: Is it data? Can you share it publicly? Then yes!
“Why should I use a CC0 license for my data?” — Open licenses make it easier for researchers to reuse your data. And they know you won’t sue them when they do.
“Is Harvard Dataverse good for sensitive data?” — Generally, no.
“Is Harvard Dataverse a good repository for NIH-funded research?” — Yes! If you aren’t satisfied with a disciplinary repository, Harvard Dataverse is an NIH-recommended generalist repository.
While ‘open access’ may have a narrower technical definition, I like to think of it as the general concept of sharing or publishing outputs from a research project, using open licenses and open platforms. This certainly includes papers and other traditional publications, but increasingly researchers are encouraged or required to share the data, documentation, and analysis workflow or code used to produce results and draw conclusions in a paper or project.
Crucially, I think open access is as much about the process and the values behind openly sharing as it is about the materials and content that are shared. Research thrives when we share our process and methods openly in order to promote equity and diversity and generate knowledge that is most broadly useful for society, not just for elite institutions and capitalist interests.
As an academic librarian, I think there are three primary reasons to openly share data.
Reuse and reproducibility. Access to data enables a host of core research activities, such as verification, replication, and reuse. Opening your data increases the trust others have in your research because having the data makes it possible to verify results and conclusions. Open access to data facilitates downstream research such as meta-analyses, reuse, and other investigations that include results from multiple studies. It also maximizes the potential for new applications by making it possible for future researchers to reuse your data in coordination with other datasets or to answer different questions.
Accelerates the pace of discovery and innovation. One major benefit is that open access to data accelerates the pace of research. When data are openly shared from the beginning of a research project, researchers can work together or in parallel on similar problems. These collaborative opportunities allow for building on the work of others rather than redoing it.
Amplifies your scholarly and public impact. Access to and reuse of open data lead to increased citations, clear recognition of contributions, and broader scholarly impact. Sharing open data and presenting the findings in a clear and accessible way increases public understanding, creates opportunities for public participation, and bolsters public support of research initiatives. Indeed, open practices are becoming much more widely recognized in the funding process as granting agencies seek to increase the impact of their funded projects by requiring data to be shared as openly as possible.
For these and many more reasons, libraries and universities are building capacity to support researchers dedicated to sharing their knowledge openly.
Our selections are highlighted in a featured Dataverse collection. We chose to include collections of datasets curated by research organizations like The Alliance of Bioversity International and CIAT (International Center for Tropical Agriculture). We also included individual datasets, such as “Climate Change Tweets Ids” from George Washington University Library and “Replication data for: Climate Change, Inequality, and Human Migration,” which contains data related to a publication.
Given that climate justice is a very broad topic and lots of different types of data could be used to study related effects and phenomena, we chose to focus on providing a selection of high quality datasets and research collections that are diverse in researchers’ affiliations, locations represented or studied, and disciplines and subject matter. This Dataverse collection is an opportunity to showcase research outputs that directly address the inequities that determine the impacts of climate change and society’s response to them.
We also selected data that we, as data curators, consider to be of relatively high quality. The FAIR Principles (Findability, Accessibility, Interoperability, and Reusability), while not formal measures of quality, are the general guidelines we use to determine the overall quality of a shared dataset. We want to see that the data are sufficiently described to enable reuse, link to related publications and other relevant research outputs that help contextualize the data, and, crucially, are shared with open, permissive licenses.
I like to think of searching as kind of a funnel shape, where you start by throwing all possible search terms at the system and see what comes up before narrowing into more specific requirements. This is especially useful when you are still developing a research question and want to get a general sense of how data in that subject are structured, described, and used. At this point I think it’s helpful to use a variety of databases and search engines to learn about the data types, formats, and jargon commonly used in the relevant fields and sub-disciplines, which can help in developing a list of search terms and keywords.
Once we have a broad understanding of what may be relevant, mining descriptions, abstracts, and variable names or other metadata can be helpful to learn about even more relevant keywords that can help with further searching. It can also be helpful to narrow a topic by using general search terms and using filters in a Dataverse repository to discover more specific content. For example, searching for “Climate change” gives a huge number of results. But if we filter by subject (on the left of the page) and look only at “Medicine, health, and life sciences” data, then our results change pretty significantly.
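The same broad-then-filtered search can be sketched against the Dataverse Search API. This is a minimal sketch rather than an official client: the helper names are hypothetical, and the subject filter assumes that the facet field is called `subject_ss` and that the subject label is written exactly as it appears in the repository's left-hand filter.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Harvard Dataverse serves the standard Search API at /api/search.
BASE_URL = "https://dataverse.harvard.edu"


def build_search_url(query, subject=None, per_page=10, base_url=BASE_URL):
    """Build a Search API URL for datasets, optionally narrowed by subject."""
    params = [("q", query), ("type", "dataset"), ("per_page", per_page)]
    if subject is not None:
        # Assumption: "subject_ss" is the facet field behind the subject filter.
        params.append(("fq", f'subject_ss:"{subject}"'))
    return f"{base_url}/api/search?{urlencode(params)}"


def summarize(response):
    """Return (total_count, dataset titles) from a decoded Search API reply."""
    data = response.get("data", {})
    titles = [item.get("name", "") for item in data.get("items", [])]
    return data.get("total_count", 0), titles


def search(query, subject=None):
    """Run the search over the network and summarize the first page."""
    with urlopen(build_search_url(query, subject)) as reply:
        return summarize(json.load(reply))
```

Calling `search("climate change")` and then `search("climate change", subject="Medicine, Health and Life Sciences")` should show how much the subject filter narrows the total, mirroring the funnel approach described above.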
Searching is only half the task, however; we also have to evaluate the results to determine whether a dataset will be useful for our purposes. If we’re studying home-buying trends in different regions as they correlate with air pollution, we have to make sure that we compare housing market data and air quality data along the same location measurements and dimensions. We can either look for data that use the same measurements, say county level, or we will need to do more work to normalize the data so we can appropriately compare data points.
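As an illustration of that normalization step, the sketch below averages station-level air quality readings up to the county level and then joins them to county-level housing records. The field names (`county`, `pm25`, `median_price`) and all values are made up for the example; real datasets would need their own matching keys and a deliberate choice of aggregation method.

```python
from collections import defaultdict
from statistics import mean


def county_means(readings):
    """Average station-level PM2.5 readings for each county."""
    by_county = defaultdict(list)
    for row in readings:
        by_county[row["county"]].append(row["pm25"])
    return {county: mean(values) for county, values in by_county.items()}


def join_on_county(housing, air):
    """Attach the county mean to each housing record that has a match."""
    return [
        {**row, "mean_pm25": air[row["county"]]}
        for row in housing
        if row["county"] in air
    ]


# Illustrative values only, not real measurements.
readings = [
    {"station": "S1", "county": "Middlesex", "pm25": 8.0},
    {"station": "S2", "county": "Middlesex", "pm25": 12.0},
    {"station": "S3", "county": "Suffolk", "pm25": 9.0},
]
housing = [
    {"county": "Middlesex", "median_price": 650000},
    {"county": "Worcester", "median_price": 400000},
]

combined = join_on_county(housing, county_means(readings))
```

Counties that appear in only one dataset are dropped from the join, so it is worth checking how many records survive before drawing comparisons.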
Open access is one of the reasons I became a librarian! Since data sharing and data publishing are relatively new compared to more traditional publication and knowledge dissemination methods, librarians have a great opportunity to help shape the future of open data. Just as we’ve seen publishers lock the work of researchers behind paywalls, we see some evidence that similar corporations are aiming to profit off of public research by making data sharing expensive, inaccessible, and complicated. Librarians are stewards, teachers, developers, and publishers and can be effective partners for institutions and research groups that value open access to the knowledge they create. We can help educate new researchers and the public about data literacy and the importance of openly sharing the complete products of the research process. As information stewards, libraries are ideal organizations for developing open infrastructure to manage the long-term preservation and accessibility of research data. Librarians have become critical to opening the publishing process and advocating for open, equitable access to information, which includes both publications and data.
Text: © 2022 the President and Fellows of Harvard College and licensed under a Creative Commons Attribution (CC BY 4.0) license