Part I: Computational Biology is vital to modern research
- July 10, 2020
But what exactly is it and why does science need it?
Modern biology is hugely data intensive. In a typical experiment, the data collected for just a single sample often reaches thirty million or more individual pieces of information. If a person, or even a team of people, spent just one second looking at each individual piece of data, it would take over 8,333 hours (that’s 347 days!) of non-stop work to do the analysis. Obviously, that’s not efficient or feasible for many reasons, even with a large team of researchers.
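As a quick sanity check on that arithmetic:

```python
# Quick check of the numbers above: thirty million data points,
# at one second each, really is about 8,333 hours of non-stop work.
data_points = 30_000_000          # measurements in a single sample
seconds_per_point = 1
hours = data_points * seconds_per_point / 3600
days = hours / 24
print(round(hours), round(days))  # -> 8333 347
```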
Before delving into the computational analysis itself, let’s take a quick step back to discuss exactly what the measurements capture and what we want to learn from this data.

Nearly all biology and biotechnology starts with what we call the “Central Dogma of Molecular Biology,” which is most often represented graphically in a manner like this:
The typical interpretation of this figure is that the arrows represent the “flow of information”: that DNA is capable of self-replication (represented by the circular arrow); that the information stored in DNA is activated by first being transcribed to RNA (resulting in an RNA that initially is an exact copy of one strand of the double-stranded DNA); then finally, the information in the RNA is translated into a protein, which is the class of molecules that carries out many of the necessary functions required in a cell.
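That flow of information can be sketched in a few lines of toy code. The codon table below is a tiny, made-up subset chosen only to cover this example; the real genetic code has 64 codons.

```python
# A toy illustration of the Central Dogma's information flow.
# The codon table is a small subset, just enough for this example.
CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "STOP"}

def transcribe(dna_coding_strand):
    """Transcription: the RNA is a copy of the coding strand, with U for T."""
    return dna_coding_strand.replace("T", "U")

def translate(rna):
    """Translation: read the RNA three bases (one codon) at a time."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return "-".join(protein)

rna = transcribe("ATGTTTGGCTAA")   # -> "AUGUUUGGCUAA"
print(translate(rna))              # -> "Met-Phe-Gly"
```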
This relatively simple model was first established in the 1950s, not long after the characterization of DNA, and while we have added many additional details and occasional exceptions to the model, it still is a reasonable model to show how genetic information stored in DNA is actively used in a living organism.
So then, why do we need Computational Biology?
Let’s start with a few numbers. Studies on the genomes of a large range of different animals and plants have established that a typical eukaryotic organism has about 20,000 protein-coding genes. Within a single cell at any given time, there will typically be a few tens of thousands of RNA transcripts present. Each of these RNA transcripts in their final form will be between a few hundred and a few thousand bases (or nucleotides) in length (there are a few exceptional genes whose transcripts can extend to tens of thousands of bases).
The most common goal that modern experiments have is creating a transcriptome profile, which is also occasionally referred to as a transcriptome expression profile or just an expression profile. So, what does that mean? Let’s take it apart one piece at a time. The “transcriptome” is defined as the complete set of all of the RNA transcripts that can be made from the genome of an organism. “Expression” is the activation of a given gene by copying it into RNA and then translating it to protein. Finally, “profile” is really the critical part of this term because it means the simultaneous measurement of the expression of all genes at the same time.
Every cell in your body contains (pretty much) exactly the same DNA, yet a skin cell, a muscle cell, a bone cell, and any other distinct cell type have very different functions, structures, and even appearances. How does that happen? The secret lies in the fact that not every gene is expressed in every cell type, and it is that specific selection of which genes to turn on and which genes to turn off that determines the type and the functions of each cell. A transcriptome expression profile is our attempt to measure any experimental sample not only for which genes are being expressed, but also in what amount.
This “heat map” is the graphical output of the computational biology process. It shows a time-course examination of zebrafish hearts after damage. The columns represent the samples, and the rows represent genes; about 400 genes are included in this plot. The colors in the dendrogram (tree graph) on the far right mark groups of genes that have similar patterns of expression across the nine samples. The nine samples are clustered at the bottom of the map: three replicates from day 0 (right after damage, labeled d00_r1, d00_r2, d00_r3), three replicates from day 3 (d03_r1, d03_r2, d03_r3), and three replicates from day 14 (d14_r1, d14_r2, and d14_r3). What is interesting here is that day 0 and day 14 are more similar to each other than to day 3, suggesting that there is a significant response to injury early on, but that by day 14 the samples are much closer to the original pattern again.
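The grouping behind such a plot can be sketched in a few lines. This is a minimal illustration using made-up numbers, assuming SciPy is available; genes whose expression rises at the middle time point and falls back (like the day 0 / day 3 / day 14 pattern described above) get grouped apart from genes that stay flat.

```python
# A minimal sketch of how a clustered heat map is built: genes (rows)
# are grouped by the similarity of their expression patterns across
# samples. The data here are invented purely for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy expression matrix: 6 genes x 9 samples (3 replicates x 3 days).
up_then_down = np.array([1, 1, 1, 5, 5, 5, 1, 1, 1], dtype=float)
flat = np.ones(9)
genes = np.vstack(
    [up_then_down + rng.normal(0, 0.1, 9) for _ in range(3)]
    + [flat + rng.normal(0, 0.1, 9) for _ in range(3)]
)

# Average-linkage hierarchical clustering on the gene-gene distances;
# this is the computation behind the dendrogram drawn beside a heat map.
tree = linkage(genes, method="average")
groups = fcluster(tree, t=2, criterion="maxclust")
print(groups)  # genes with similar patterns land in the same group
```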
One more point that is key to understanding why these measurements matter so much to scientific research: the expression profile in any given sample is not determined solely by the type of cell. All of the cells in any organism can modify their expression profile (turning some genes on and other genes off) in response to nearly any change you could think of: day vs. night, sickness vs. health, young vs. old, and so on.
Measuring the transcriptome profiles in an experiment is the means by which we determine the differences among samples at the most basic molecular level. For example, in the Coffman Lab, transcriptome profiles have been used to determine how exposure to stress can alter the embryological development of the zebrafish. These findings were facilitated by comparing the expression profiles of two groups of samples, one of which was treated to increase a stress response while the other was an untreated control. (MDIBL’s James Coffman, Ph.D. has just had a paper on this subject published, which you can read here.)
Okay, let’s get back to the question of those thirty million pieces of information for each sample. What are they? They are the output of an experimental measurement procedure called “RNA sequencing.” Since the advent of the Human Genome Project (and other associated advances), our ability to sequence DNA and RNA has expanded enormously over the last 30 years. Modern sequencing machines take advantage of a natural property of DNA: given a single strand, they rebuild the matching complementary strand in such a way that each step can be detected and recorded as the addition of an A, C, G, or T. A modern sequencing machine can perform these reactions on 200 to 300 million sequences at the same time.
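The base-by-base rebuilding that the machines rely on can be sketched like this (a toy illustration of the idea, not how instrument software actually works):

```python
# A rough sketch of the idea behind sequencing-by-synthesis: the machine
# rebuilds the complementary strand one base at a time, and records which
# base (A, C, G, or T) was added at each step.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def sequence_by_synthesis(template):
    recorded = []
    for base in template:          # one chemistry cycle per template base
        added = COMPLEMENT[base]   # the base that pairs with the template
        recorded.append(added)     # ...is detected and recorded
    return "".join(recorded)

print(sequence_by_synthesis("ACGT"))  # -> "TGCA"
```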
One complication, or limitation of these machines, however, is that while many such sequences can be generated, they are relatively short — typically between 50 and 150 bases in length. Remember how I mentioned above that typical RNA sequences are a few hundred to a few thousand bases long? This means that, as helpful as RNA sequencing is, each fragment gained from the process is only a very small portion of the total sequence.
THIS, finally, is where some of the computational biology comes fully into play! Many analysis programs have been written (and the team of four, including me, in MDIBL’s Computational Core is trained in using these programs and interpreting their results) that will take the set of tens of millions of short fragments of RNA and match them to the reference genome and its associated transcriptome.
This process of aligning the short sequences from the sequencing machine to our known genome and transcriptome results in the transcriptome expression profile for each sample. Comparison of the transcriptome profiles between treated and control samples then allows us to identify the genes whose expression is affected (increased or decreased) by, for example, the stress response in zebrafish mentioned earlier.
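The essence of that align-and-count step can be shown with a deliberately naive sketch. Real aligners and quantifiers (e.g. STAR or salmon) are vastly more sophisticated, and these gene names and sequences are invented; the point is only that assigning each short read to its source transcript yields per-gene counts, and those counts are the expression profile.

```python
# A deliberately naive sketch of what alignment-and-counting produces:
# each short read is matched to the transcript it came from, and the
# per-gene read counts form the sample's expression profile.
# Gene names and sequences here are made up for illustration.
from collections import Counter

transcripts = {
    "geneA": "ATGGCGTACGATCGATTACG",
    "geneB": "ATGTTTCCGGAATCGGCATA",
}

def expression_profile(reads):
    counts = Counter()
    for read in reads:
        for gene, seq in transcripts.items():
            if read in seq:        # naive exact-substring "alignment"
                counts[gene] += 1
                break
    return counts

reads = ["GCGTACGA", "TACGATCG", "TTTCCGGA"]
print(expression_profile(reads))   # -> Counter({'geneA': 2, 'geneB': 1})
```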
In my next post, I’ll discuss taking the next step: moving from lists of genes that are differentially expressed to determination of the underlying biological pathway and process changes.
If you’ve enjoyed this Blog, you can keep reading Part II where I write about genome-scale measurements of genome expression.