Public data for the public good

January 05, 2023

Photo by Elaine Casap

TLDR
Upon completion, the Human Genome Project delivered several discoveries and opened a new era of research. More importantly, novel technologies and analysis methods emerged during the project period. The resulting reduction in cost allowed many more labs to generate high-throughput datasets. The project also served as a model for other extensive collaborations that generated large datasets. These datasets were made public and continue to accumulate in repositories. As a result, the scientific community should consider how to use these resources effectively for research and the public good. A dataset can be reanalyzed, curated, or integrated with other forms of data to enhance its utility. This brief perspective highlights three important areas for achieving that goal.

The Human Genome Project galvanized the scientific community around an ambitious goal. It produced several crucial discoveries, and a new era of research began. For the first time, it became possible to estimate the number of human genes and compile a comprehensive list of their coding sequences. These developments shifted the study of life away from single-gene models and kickstarted the discipline of systems biology. More significantly, scientists developed new technologies and analytic methods during the project. The resulting reduction in cost allowed many more labs to generate high-throughput datasets. The project also served as a model for other extensive collaborations that generated even larger datasets. These included efforts to sequence large numbers of genomes from different populations across the globe. Others concentrated on specific diseases and disease models, such as cancer.

Individual labs typically generate small, focused datasets and deposit them in public repositories. As these datasets accumulated, the scientific community began considering how to use them for research and the public good. Efforts went into fostering best practices for documenting and sharing data, along with sound policies on accessibility. Recognizing reuse as a legitimate form of research, at both junior and senior levels, has become more accepted and encouraged. Initiatives are also underway to develop cloud environments that store, manage, and analyze data in practical, scalable, and secure ways.

Here, I highlight three critical areas to increase the utility of publicly available data. A dataset can be reanalyzed, curated, or integrated with other forms of data to enhance its value.

Reanalyzing primary data

High-throughput experiments simultaneously measure a large portion, if not all, of the genome. It has become standard practice for researchers to share the raw data and documentation of how they generated it. The obvious case for reuse is to mine the dataset for insights not reported in the initially published studies. Investigators could focus on a particular subset of the data and analyze it in depth. Others might verify or refute the original hypotheses by examining them independently. Both reuse cases yield additional value and benefit the wider community.

Any given study can analyze a dataset in only a few ways. Existing and newly developed methods can generate new insights from the same datasets. Often, statistically sophisticated approaches can extract more information from the same data points. These are examples of research pursuits that are only possible because of publicly available data, or that enhance its utility.

Curating data from different sources

Researchers interested in a particular topic often use similar models and experimental designs. Even when separate groups generate their datasets with different protocols, combining them could help fill gaps in the design and increase the statistical power of the analysis. The reverse is also possible: curating and annotating a subset of a larger dataset to address a specific aspect of the model or to focus on one data type.
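
To make the point about statistical power concrete, here is a minimal sketch of one common way to combine results from separate groups: fixed-effect, inverse-variance pooling of per-study effect estimates. This is a swapped-in textbook technique, not a method described in this post, and the function name and numbers are hypothetical. It also assumes the studies measure the same underlying effect, which datasets generated with different protocols may not satisfy.

```python
import math


def pool_fixed_effect(estimates):
    """Pool (effect, standard_error) pairs by inverse-variance weighting.

    Assumption: every study estimates the same underlying effect
    (a fixed-effect model). Between-study differences are ignored.
    """
    # More precise studies (smaller standard error) get larger weights.
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled_effect = sum(
        w * effect for w, (effect, _) in zip(weights, estimates)
    ) / total_weight
    # The pooled standard error shrinks as studies are added.
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_effect, pooled_se


# Hypothetical effect estimates from two independent groups.
studies = [(0.8, 0.5), (1.2, 0.5)]
effect, se = pool_fixed_effect(studies)
print(f"pooled effect = {effect:.2f}, pooled SE = {se:.3f}")
```

The pooled standard error is smaller than that of either study alone, which is the sense in which combining datasets increases power. A random-effects model would be a more cautious choice when protocols differ substantially.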

One added benefit is that curators have to homogenize data from different sources and use unified terminologies. Furthermore, curators can pre-process and quality-assess large files of raw data, making the data available in more accessible formats. This highlights yet another advantage of curation: it exposes the data to the scientific community, making it known and easy to use for lab biologists.
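
As a rough illustration of what homogenizing terminology can involve, the sketch below maps free-text sample annotations from different sources onto a single vocabulary. The synonym table, field names, and records are all hypothetical; a real curation effort would more likely rely on an established ontology than a hand-written dictionary.

```python
# Hypothetical synonym table: field name -> {variant spelling: unified term}.
SYNONYMS = {
    "tissue": {
        "fat": "adipose tissue",
        "adipose": "adipose tissue",
        "hepatic": "liver",
        "liver": "liver",
    },
}


def harmonize(record, synonyms=SYNONYMS):
    """Return a copy of `record` with unified field names and terms."""
    clean = {}
    for field, value in record.items():
        key = field.strip().lower()  # unify field names
        text = str(value).strip()    # drop stray whitespace
        mapping = synonyms.get(key)
        # Only fields with a synonym table are mapped; others pass through.
        clean[key] = mapping.get(text.lower(), text) if mapping else text
    return clean


# Hypothetical metadata as submitted by two different labs.
records = [
    {"Tissue": "Fat ", "sample": "S1"},
    {"tissue": "adipose", "sample": "S2"},
]
harmonized = [harmonize(r) for r in records]
print(harmonized)
```

After harmonization both records carry the same tissue label, so they can be grouped or filtered together.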

Integrating multiple types of data

Different high-throughput technologies generate data types that describe different layers of biology. Integrating data types can help verify or complement observations based on a single data type. For example, the binding of a transcription factor to a specific region of DNA does not by itself say anything about the function of that transcription factor. However, the likelihood that the binding is functional increases if the expression of the nearest gene changes when that transcription factor is perturbed. New methods capitalize on this idea of combining data from different sources. Existing biological knowledge can also help in modeling and interpreting experimental data.
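
The transcription factor example can be sketched in a few lines: keep only the genes that are both bound by the factor and change in expression when the factor is perturbed. This is a simplified illustration, not one of the published integration methods. The gene names, values, and cutoffs are hypothetical, and the code assumes binding sites have already been assigned to their nearest genes.

```python
def functional_targets(bound_genes, expression_changes,
                       min_abs_log2fc=1.0, max_padj=0.05):
    """Return bound genes whose expression changes under perturbation.

    bound_genes: genes nearest to a binding site of the factor.
    expression_changes: gene -> (log2 fold change, adjusted p-value)
        from a perturbation experiment of the same factor.
    The two cutoffs are illustrative assumptions, not recommendations.
    """
    targets = []
    for gene in sorted(bound_genes):
        change = expression_changes.get(gene)
        if change is None:
            continue  # gene was not measured in the expression data
        log2fc, padj = change
        if abs(log2fc) >= min_abs_log2fc and padj <= max_padj:
            targets.append(gene)
    return targets


# Hypothetical inputs from a binding assay and a perturbation experiment.
bound = {"GENE_A", "GENE_B", "GENE_C"}
changes = {
    "GENE_A": (-1.8, 0.001),  # bound and changed
    "GENE_B": (0.2, 0.6),     # bound but unchanged
    "GENE_D": (2.5, 0.0001),  # changed but not bound
}
targets = functional_targets(bound, changes)
print(targets)
```

Only the gene supported by both data types is retained. Genes that are bound but unchanged, or changed but not bound, are left out of the candidate list.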

Successful reuse places demands on the broader community regarding how data is documented and shared. The examples above show how sharing and reusing data extract more value from available resources. For this to become standard practice, however, data must be documented and communicated in transparent, reproducible ways. In addition, resources should be available for curating, annotating, developing tools, and reanalyzing data.

Generating data with potential reuse in mind, whether by the primary authors or by others, would be a net benefit to the community. The conditions that encourage reproducible open science are the same ones that foster and promote reuse. Data sharing and reuse would benefit researchers in low-resource labs and developing nations. Finally, easily accessible data would lower the entry barrier for non-computational researchers to use the extensive knowledge made possible by large datasets.

It is also necessary to acknowledge the potential risks associated with reusing public data. User-submitted data may be of questionable quality and may require substantial work to locate and obtain before analysis can begin. A researcher's efforts may be wasted if data quality issues surface only later in the process. Furthermore, reusing public data may produce duplicate records of the same dataset. Finally, several ethical problems arise in reuse cases. Not crediting the original authors may disincentivize others from sharing their data and code in the future. In addition, funding for generating new datasets may stall because similar datasets already exist or because others could be repurposed.