Part III: Meet the Computational Biology Core
- November 18, 2020
For this entry in our ongoing series about the MDI Biological Laboratory Computational Biology Core, I want to step back and, rather than discuss the specifics of the science, talk about who we are, what we do, and why it is so critical.
I hope that I’ve already covered at least some of why our work is critical in the previous blog posts, but I will expand upon that a bit further here.
1) Who are the members of the MDIBL Computational Biology/Bioinformatics Core?
As is not unusual for a researcher of my age working in this field, my formal education is not in biotechnology or even biology—instead, I have B.S. degrees in Physics and Computer Science (1987), both obtained from Michigan Technological University, and a Ph.D. from Cornell University in Experimental Accelerator Physics (1993). Exactly how I made the transition from Accelerator Physics to Computational Molecular Biology is a topic for another day. I received most of my training in Computational Biology in a postdoctoral position with Professors Charles Cantor and Temple Smith at Boston University. I moved to The Jackson Laboratory (JAX) in 2002, and after fourteen and a half years, I moved to my current position at MDIBL in May of 2017.
In taking my position at MDIBL, I made the explicit choice that, rather than carrying out an independent research program, I wanted to use my knowledge and experience in analyzing and managing genome-scale data sets to support and enhance the ongoing research efforts for all of the MDIBL faculty. My interest in carrying out this work was heavily influenced by my experiences at JAX, where, among other projects, I led efforts in processing, quality control, and interpretation of the genomic data from the patient-derived xenograft (PDX) program. That work required that I become intimately familiar with the complexities of genomic data, and more importantly, with the experimental procedures from the perspective of identifying and minimizing sources of error and ambiguity. Based on this experience, when I came to MDIBL, I initiated a major revision of all computational efforts, including data storage, external resource updates, and workflow generation, with a specific focus on making all analyses transparent, robust, and reproducible. I’ll discuss these efforts a bit further below.
Nathaniel holds a bachelor’s degree in Bioinformatics (2018) from Michigan Technological University. Prior to starting at MDIBL in 2018, he worked as a Research Assistant in Dr. Thomas Werner’s Molecular Genetics lab, studying the role of toolkit genes in the evolution of complex color patterns in Drosophila guttifera (vinegar fly). At MDIBL, his role ranges from generating educational materials and content to developing stable, reproducible, and efficient Bioinformatics analysis tools and pipelines written in the Common Workflow Language. He is highly proficient in working with both Windows and Unix-based systems and is familiar with an array of data-science-related programming languages, including Python, R, and Perl.
Chris joined the MDIBL Computational Biology Core in June of 2020. Prior to his role at MDIBL, Chris served as the Data Architect for the Advanced Computing Group at the University of Maine. Chris holds a technical degree in diesel and hydraulic technologies, a BA in Mathematics (2007), and an MA in Pure Mathematics (2009) from the University of Maine. He is also a Ph.D. candidate in the Computer Science Department at UMaine with a concentration in Artificial Intelligence. Chris has extensive experience in teaching undergraduate mathematics courses, specifically in the Algebra and Calculus series. In addition to his academic and research activities, Chris is an experienced full-stack Software Engineer and has worked with several technology startups in the Bay Area.
2) What do we do in the MDIBL Computational Biology/Bioinformatics Core?
The efforts of the Computational Biology Core are focused in a few fundamental areas:
- Analysis, management, and interpretation of experimental data, primarily for MDIBL research groups, but also for the partner institutions in the Maine INBRE network
- Education, both in the form of instruction and development of teaching materials
- Development and maintenance of efficient and robust computational systems that facilitate the research and education efforts
It is easily arguable that the most critical part of the Core’s job is to work with our research groups to ensure that genomic data is properly analyzed and managed. Carrying out this work requires that we in the Core:
- Are knowledgeable about the current “best-in-field” approaches to analyzing genome-scale data,
- Are proficient in acquiring, implementing, and using the necessary programs and data resources to carry out the analysis, and
- Most critically, understand BOTH the computational tools AND the research question well enough to ensure that the two are properly matched.
This last point is really the key. For any type of genomic data, there are many valid approaches and tools for analysis. However, not all of them will match the question of interest for our research teams equally well. It’s our job in the Core to be the bridge between the computational approaches and the researcher. Ultimately, this means that the most critical aspect of our job is communication. I meet with our research staff on a regular basis, discussing their current and upcoming plans, and making sure that I understand the questions that their experiments are designed to ask so that the Core can better serve their needs.
The Core also contributes to many aspects of MDIBL’s educational program. This can involve efforts as simple as delivering an introductory lecture on the basics of bioinformatics and computational biology as part of a larger course, or it can be as complex as organizing and running an event such as the MINOTA (the Maine INBRE Non-model Organism Transcriptome Analysis) workshop. In addition to our work within formal courses and workshops, the Core has also mentored or co-mentored over fifteen summer research interns over the last four years, including both high school and undergraduate students. In an upcoming blog post, I’ll share with you how the Core adapted and expanded our efforts in the summer of 2020 in response to the COVID-19 pandemic.
With a Core team of only three, it is particularly important that we are as efficient as possible with our efforts. As such, a significant portion of our efforts goes into the continued development of our computing environment, so that we can reproducibly and efficiently carry out analyses as needed. We are continually developing, maintaining, and enhancing our computational systems, so that we automatically maintain well-organized and current repositories of genomic analysis programs and external data (more on this below). We are also in the midst of a multi-year migration of our computing efforts from locally housed and maintained computing clusters to more cost-effective and adaptable cloud-based systems.
3) Why are the Core efforts so critical?
As I have emphasized in the previous two blog posts, modern biology is very data rich: typical experiments frequently comprise multiple samples, each measured at tens of millions of data points. The analysis of these data sets requires the use of multiple programs, multiple external reference data sets, and high-capacity computing systems.
Complicating these issues is the simple and unavoidable fact that not only is modern biology very data rich, but it is also very rapidly changing. For example:
- External reference data resources (e.g. the ENSEMBL genomic database – a centralized resource for geneticists, molecular biologists and other researchers) are regularly updated.
- Analyses of current experimental data sets should ideally be done with the most current versions of these resources. However, integrated analysis across multiple experiments then requires that all prior experiments also be re-analyzed with the updated reference set.
- Similarly, the computer programs that are used to carry out the analysis can also be subject to updates, making the same sort of periodic updates of analysis necessary.
- New approaches and programs for the analysis of existing data are regularly developed and released.
- New data types are continually being generated. For example, fifteen years ago, there was no RNA-sequencing data; ten years ago, there was no single-cell sequencing data; and five years ago, there was very limited high-throughput proteomics data.
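To make the version-tracking problem above concrete, here is a minimal sketch, in Python, of the kind of bookkeeping involved: recording which reference releases and tool versions an analysis was run against, so that when a resource such as ENSEMBL is updated, prior experiments can be flagged for re-analysis. The function names, file name, and release numbers are hypothetical illustrations, not the Core’s actual system.

```python
import json
from datetime import date

def write_analysis_manifest(path, references, tools):
    """Record the exact reference releases and tool versions used for
    one analysis run, so the run can be reproduced later or flagged
    for re-analysis when a reference resource is updated."""
    manifest = {
        "date": date.today().isoformat(),
        "references": references,  # e.g. {"ensembl": "release-101"}
        "tools": tools,            # e.g. {"salmon": "1.3.0"}
    }
    with open(path, "w") as fh:
        json.dump(manifest, fh, indent=2)
    return manifest

def needs_reanalysis(manifest, current_references):
    """An experiment needs re-analysis if any reference it was run
    against has since moved to a newer release."""
    return any(
        manifest["references"].get(name) != release
        for name, release in current_references.items()
    )
```

In practice this kind of record-keeping is usually built into the workflow system itself (for example, as metadata emitted by each pipeline run), but the principle is the same: every result is tied to the exact versions of the data and software that produced it.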
The skill sets involved in maintaining these resources, using these resources, and ultimately interpreting the resulting output tables and figures are separate from, and additional to, the deep and detailed knowledge of the biology studied by the world-class research groups at MDIBL. As already noted, this is where the Computational Biology Core comes in. While it is possible that each research group could hire, or internally develop, the skills needed to carry out this work, the Laboratory’s senior leadership recognized that it is more efficient and beneficial to establish a dedicated Core of professional informatics/computational personnel who can distribute their efforts across all of the individual research groups at MDIBL.