During my graduate school training, I focused on using publicly available high-throughput data, both to reduce opportunity costs and to increase the utility of existing datasets. My strategy for leveraging public data to answer research questions consisted of three main components: curating datasets from similar experimental designs, reanalyzing data with alternative analysis methods, and integrating data of different types. The advantage of this strategy is that it allows asking questions that are not tractable using individual datasets. Of course, each of these components comes with limitations and challenges to grapple with. Beyond the research strategy itself, I found it very helpful to think about how to integrate my work with that of others in the wet lab and how to communicate the output of this work beyond the standard academic paper format.
1. Research Strategy
1.1. Curating datasets from similar experimental designs
It is often the case that researchers interested in a certain topic use similar models and design their experiments in similar ways. Although the protocols differ, combining the datasets generated by different groups can increase statistical power and fill gaps in the experimental design. I am interested in the role of autophagy during adipocyte differentiation. The 3T3-L1 cell line has been used as a model for years, and time-course experiments in this model are standard. Several gene expression datasets, both microarray and RNA-seq, have been generated in it. Combining these datasets produced a dataset with many samples that covered more time points over the course of differentiation. Similarly, each ChIP-seq experiment targets a particular DNA-binding protein or histone modification, so each dataset is limited to the antibody used to generate it. Curating all the existing datasets into one large collection enabled me to study many DNA-binding proteins at once. A similar problem arises with gene expression data generated under drug or genetic perturbations, where combining more than one dataset means including more than a few perturbations in the study.
Two main challenges arise when combining data from different studies: curating the metadata and dealing with differences between data points that do not come from biological sources. First, carefully curating information on the samples, the data generation protocols, and the platforms is key. This allowed me to describe the samples in a unified vocabulary, which made the next steps easier. The data should then be filtered to include only samples with minimal differences. Starting from the raw data and redoing the preprocessing of the sequencing reads helped minimize the differences between studies. Finally, I had to correct for batch effects and perform the final analysis with an eye on effects that could arise solely from how the data were generated.
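To make the batch-effect problem concrete, here is a minimal sketch in Python. All matrices, sample names, and the artificial batch shift are made up; the within-batch mean-centering is a deliberately simplified stand-in for dedicated methods such as ComBat or limma's removeBatchEffect.

```python
import numpy as np
import pandas as pd

# Toy expression matrix: genes x samples, with samples drawn from two
# studies (batches). All names and values here are hypothetical.
rng = np.random.default_rng(0)
expr = pd.DataFrame(
    rng.normal(size=(4, 6)),
    index=["gene1", "gene2", "gene3", "gene4"],
    columns=["s1", "s2", "s3", "s4", "s5", "s6"],
)
batch = pd.Series(["A", "A", "A", "B", "B", "B"], index=expr.columns)

# Add an artificial study-specific shift to batch B, mimicking a
# technical difference between the two data sources.
expr.loc[:, batch == "B"] += 2.0

# Remove it by mean-centering each gene within each batch (a crude
# simplification of proper batch-correction methods).
corrected = expr.sub(expr.T.groupby(batch).transform("mean").T)

# After correction, each gene's mean is zero within every batch.
print(corrected.T.groupby(batch).mean().round(6))
```

In a real analysis, known biological covariates (e.g., the differentiation time point) must be protected from removal, which is exactly why dedicated batch-correction tools take a design matrix as input.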
1.2. Using alternative analysis methods for reanalysis
There is no doubt that the researchers who generate a dataset can make the most out of it, since they are familiar with how it was generated and what it represents. That said, any given dataset has typically been analyzed in only a few ways. Existing and newly developed tools can therefore be applied to these datasets to generate new insights. Differential expression and gene set enrichment analyses have been applied to most, if not all, high-throughput data. Methods that rely on more involved statistical approaches are applied far less often, despite their potential to extract even more information from the same data points. I used co-expression analysis and unsupervised learning methods to study the interactions between gene products and to deconvolute the mixture of differentiating adipocytes into subpopulations.
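The co-expression idea can be sketched in a few lines: correlate gene expression profiles, turn the correlations into distances, and cluster genes into modules without supervision. The data below are synthetic, and this is only an illustration of the principle, not the actual pipeline used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Synthetic expression matrix: 6 genes x 10 samples. The first three
# genes follow a shared profile (co-expressed); the rest are random.
rng = np.random.default_rng(1)
base = rng.normal(size=10)
expr = np.vstack([
    base + rng.normal(scale=0.1, size=10),
    base + rng.normal(scale=0.1, size=10),
    base + rng.normal(scale=0.1, size=10),
    rng.normal(size=(3, 10)),
])

# Co-expression: correlation between gene profiles, converted to a
# distance so that highly correlated genes are close together.
corr = np.corrcoef(expr)
dist = 1 - corr
np.fill_diagonal(dist, 0.0)

# Unsupervised hierarchical clustering of genes into modules.
clusters = fcluster(
    linkage(squareform(dist, checks=False), method="average"),
    t=2, criterion="maxclust",
)
print(clusters)  # the first three genes should fall in one module
```

The same distance-then-cluster logic, applied to samples instead of genes, is one simple way to separate a mixed population into subpopulations.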
1.3. Integrating data of different types
Different high-throughput technologies generate data types that describe different layers of biology. Integrating these data types can be very useful for verifying or supplementing observations. Moreover, certain claims cannot be made when relying on a single type of data. For example, the binding of a protein to the DNA of a specific region does not by itself establish the function of that protein. However, the likelihood that the binding is functional increases if the expression of the bound gene changes when that protein is perturbed. In my work, I used both binding data and gene expression data to study the interactions between adipogenic transcription factors and the autophagy genes of interest. I also developed a targeted method that predicts whether two DNA-binding proteins function in cooperative or competitive ways to induce or repress a shared target.
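The binding-plus-expression logic amounts to intersecting two evidence tables. The sketch below uses hypothetical gene names and made-up statistics; it only illustrates the kind of join involved, not the actual analysis code.

```python
import pandas as pd

# Hypothetical inputs: genes bound by a transcription factor (from
# ChIP-seq peak annotation) and differential expression of genes after
# perturbing that factor (e.g., a knockdown). Values are invented.
bound = pd.DataFrame({"gene": ["Atg5", "Atg7", "Pparg", "Gapdh"]})
de = pd.DataFrame({
    "gene":   ["Atg5", "Pparg", "Lc3b", "Gapdh"],
    "log2fc": [-1.8,   -2.5,    0.2,    0.05],
    "padj":   [0.001,  0.0001,  0.6,    0.9],
})

# A binding event is more likely functional when the bound gene also
# responds to the perturbation: intersect the two evidence types and
# keep genes that are both bound and significantly changed.
merged = bound.merge(de, on="gene", how="inner")
functional = merged[(merged["padj"] < 0.05) & (merged["log2fc"].abs() > 1)]
print(functional["gene"].tolist())  # -> ['Atg5', 'Pparg']
```

Genes that are bound but unresponsive (here, Gapdh) drop out, which is exactly the filtering step that a single data type cannot provide.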
Another way of integrating data is to use existing knowledge for modeling and interpretation. The existing knowledge about a specific pathway can be encoded in a network where the nodes are biological entities and the edges are the known interactions between them. The Biological Expression Language (BEL) is one way to represent these graphs in a standard, computable format. Methods such as network perturbation amplitudes take advantage of such graphs to predict the activity of biological entities from the changes in gene expression caused by drug treatment or genetic perturbations. I used this approach to screen for potential antimetastatic drugs in breast cancer.
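The idea behind such knowledge-graph scoring can be sketched with a toy signed graph. The node names, fold-changes, and the crude signed-average score below are all invented for illustration; real network perturbation amplitude methods use considerably more elaborate statistics.

```python
import networkx as nx

# Toy causal graph in the spirit of a BEL-encoded pathway: edges carry
# a sign (+1 activation, -1 repression). All names are hypothetical.
g = nx.DiGraph()
g.add_edge("DrugTargetX", "geneA", sign=+1)
g.add_edge("DrugTargetX", "geneB", sign=-1)
g.add_edge("DrugTargetX", "geneC", sign=+1)

# Observed log2 fold-changes of the downstream genes after treatment.
log2fc = {"geneA": 1.5, "geneB": -0.8, "geneC": 0.4}

# A crude perturbation score: signed average of downstream changes.
# When observed changes agree with the edge signs, the score is high,
# supporting increased activity of the upstream node.
edges = g.out_edges("DrugTargetX", data=True)
score = sum(d["sign"] * log2fc[t] for _, t, d in edges) / g.out_degree("DrugTargetX")
print(round(score, 3))  # -> 0.9
```

Here all three downstream genes moved in the direction the graph predicts, so the score is positive; contradictory measurements would cancel out and pull it toward zero.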
2. Supporting wet lab
My role in the lab is primarily to explore topics of interest and investigate research questions. To integrate my work with that of others, I focused on performing exploratory data analysis and generating new hypotheses that could, where possible, be tested in the lab. My lab is interested in the role of a metastasis suppressor gene called RKIP and how it relates to autophagy in the context of cancer. To that end, I used gene expression data from prostate cancer at different stages to predict interactions with that protein. The results were then confirmed using a small-scale experiment in a relevant cell line.
This workflow was adapted to support existing projects in the lab. The starting point was to establish a link between MTDH and RKIP, both of which are relevant to cancer progression and metastasis. I therefore used ChIP-seq and RNA-seq data to identify binding sites of MTDH, which was reported to work as a transcriptional co-factor, and to estimate the effect of knocking down this gene in several cell lines. The results of this analysis were likewise confirmed in the corresponding cell lines.
Small-scale experiments generate small amounts of data that are relatively easy to analyze. However, this kind of data can also benefit from standardized workflows that increase its utility and the reproducibility of the analysis. I developed two open-source R packages, one to analyze RT-qPCR data and one for fluorescence microscopy images. Using these packages in the lab expanded what could be done with the data and improved the reproducibility of the analyses.
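As an example of the kind of calculation such a workflow standardizes, here is the textbook 2^-ddCt (Livak) method for relative quantification of RT-qPCR data, written in Python for illustration. The Ct values are made up, and this is not the code of the R package itself.

```python
# Relative quantification of RT-qPCR data with the standard 2^-ddCt
# (Livak) method; all Ct values below are invented examples.
def relative_expression(ct_target, ct_ref, ct_target_ctrl, ct_ref_ctrl):
    """Fold change of a target gene relative to a control condition,
    normalized to a reference gene."""
    dct_treated = ct_target - ct_ref            # normalize treated sample
    dct_control = ct_target_ctrl - ct_ref_ctrl  # normalize control sample
    ddct = dct_treated - dct_control
    return 2 ** (-ddct)

# Target gene Ct drops by 2 cycles relative to control while the
# reference gene is unchanged -> a 4-fold induction.
fold = relative_expression(ct_target=24.0, ct_ref=18.0,
                           ct_target_ctrl=26.0, ct_ref_ctrl=18.0)
print(fold)  # -> 4.0
```

Encapsulating even a calculation this small in a tested function is what makes the analysis reproducible across experiments and analysts.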
3. Communicating my work
In addition to communicating research findings in the standard academic paper format, I believe several other data products can increase the impact of any given project. I therefore spent a good amount of time thinking about how to communicate my research findings and methodology in non-standard formats. These included open-source packages and their documentation. I also found that packaging analyses as workflows allowed me to expand on the data analysis decisions that are essential to the work but not necessarily best described in a paper. Finally, the output of any given analysis is often larger than what is shared in a research paper. Putting some effort into making these outputs shareable as a database, or providing an interface for them, makes it easier for myself and others to explore the findings.