In the last post, we discussed RNA-seq, the state-of-the-art method for obtaining genome-scale measurements of gene expression, which in turn lets us characterize the differences between the samples in an experiment. To be very explicit, the primary result of the initial analysis is an assessment of whether or not each gene in the genome is differentially active (or expressed). Graphically, what we are looking for is something like the following:

**Graphical representation of differential gene expression for a single gene.** The red triangles represent measurements of replicate treated samples, and the black diamonds represent matched control samples. Differential expression of this gene is determined by a statistical analysis of the separation between expression levels in treated and control, compared to the variation within each group of replicates.

At the end of our first round of analysis, we have a ranking of all of the genes in the genome, sorted from the most increased to the most decreased. We can visualize this graphically like this:

**Graphical representation of the relative expression changes of ALL measured genes.** The column represents a list of genes, sorted by their change in expression between treated and controls, starting at the top (in red) with those that are most clearly increased in expression, and ending at the bottom (in blue) with those that are most clearly decreased in expression.

There are very good statistical approaches to carrying out this work. These methods assess each gene to estimate both the amount of change and the statistical confidence that the gene is truly expressed at different levels. In the Computational Biology Core, one of our jobs is to be well versed in the use and interpretation of the programs that carry out this work, so that we can work with the MDIBL research groups to analyze their data.

Our ultimate goal, however, is not only to get lists of which genes are or are not affected by an experimental treatment, but more generally to understand the biological mechanisms that are affected by the treatment. Such understanding forms the basis for interventions that might alter or even guide the cellular processes, such as therapeutic interventions in human medicine.

So how do we go from a ranked list of genes to an understanding of the biological processes that might have been changed? The answer lies in combining rigorous statistics with existing bodies of scientific knowledge.

It is critical to remember that we are metaphorically standing on the shoulders of an army of giants that came before us. Biological pathways and the genes from which they are made have been studied in great detail for several decades by many research groups across the world. Critically for our efforts, this storehouse of knowledge has been captured in databases that are available for us to aid in the interpretation of our gene expression analysis.

Specifically, we already know a very large number of groups of genes that work together to carry out specific processes (e.g., cell-cycle, gene transcription, and so on). We use that knowledge of which groups of genes work together for which processes as the means of going from lists of differentially expressed genes to “likely affected biological processes.”

The process by which we carry out this analysis is frequently referred to as “statistical or probabilistic inference.” Statistical inference can best be understood by using a real-world example, so let’s use the example of a bag of marbles. Imagine that you have a bag of 100 marbles, in which 50 of them are black, 30 are red, 10 are yellow, and 10 are blue. From these numbers, we can make a relatively simple probabilistic calculation for any possible **random** selection of 10 marbles from this bag. The most likely draw is 5 black marbles, 3 red marbles, 1 yellow and 1 blue. (While that is the most likely selection, many others are possible and even likely; in all cases, we can calculate the probability or *likelihood* of any specific selection.)
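To make that calculation concrete, here is a minimal sketch in Python. It assumes the marbles are drawn without replacement and that every marble is equally likely to be picked; the function name `draw_probability` is just an illustrative choice, not something from a particular package.

```python
from math import comb

def draw_probability(bag, draw):
    """Probability of drawing exactly the color counts in `draw`
    when marbles are taken at random, without replacement, from `bag`."""
    total = sum(bag.values())      # marbles in the bag
    n_drawn = sum(draw.values())   # marbles in the selection
    ways = 1
    for color, count in bag.items():
        # Number of ways to pick the requested count of this color.
        ways *= comb(count, draw.get(color, 0))
    # Divide by the number of ways to pick any n_drawn marbles.
    return ways / comb(total, n_drawn)

bag = {"black": 50, "red": 30, "yellow": 10, "blue": 10}

p_typical = draw_probability(bag, {"black": 5, "red": 3, "yellow": 1, "blue": 1})
p_all_black = draw_probability(bag, {"black": 10})

print(f"5 black, 3 red, 1 yellow, 1 blue: {p_typical:.4f}")
print(f"10 black: {p_all_black:.6f}")
```

Even the single most likely selection only comes up about 5% of the time, which is why we talk about how surprising a result is rather than expecting one exact outcome.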

Now, the important part of our example is that we explicitly said this was a “random selection” of the marbles from the bag. If the selection is random, then each individual marble is equally likely to be chosen, so when we select the first marble, since half of the marbles are black, then there is a 50% chance that the marble we select will be black, and by the time we select 10 marbles, there is a good chance that five of them will be black. But what if the selection isn’t random?

Let’s assume that the way I drew the bag of marbles above is accurate and all of the black marbles are at the top, and below that are the red marbles, yellow marbles, and blue marbles in order. Let’s further assume that instead of randomly selecting, I instead take the top 10 marbles—obviously I will get 10 black marbles if I choose from the top of the bag. (Conversely, I’ll get 10 blue marbles if I choose from the bottom of the bag.)

So now, we need to make one more big change in our model. Instead of the picture that I drew above (with all black at the top, followed by red, yellow, and blue going down), I am going to choose some other way to sort the marbles, **but I’m not going to tell you how I sorted them.** Instead, you will still know the exact number of each (50 black, 30 red, 10 yellow, and 10 blue), and I’m going to let you select the top ten marbles from the bag. We’re then going to ask the question, “Do we get roughly the number of each color that we would expect if they were organized randomly?” Here are two examples of what could happen:

If we select the top ten marbles in Sort A, we get 6 black, 2 red, 1 blue, and 1 yellow, which is pretty close to what we expect as a random selection (so nothing in this result surprises us). In sharp contrast, if we select the top ten marbles from Sort B, we get 3 black, 1 red, 0 yellow, and 5 blue, which would be a very surprising (and low probability) result *if the sorting was random with respect to the color*. As a result, we can *infer* that the blue marbles are most likely sorted in a biased manner towards the top of the bag. (Similarly, I hope you can see that if we selected the bottom ten marbles in the bag, we would infer that the yellow marbles in Sort B are biased towards the bottom of the bag.)
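We can put a number on how surprising a result like this is. The sketch below, again assuming random selection without replacement, asks how likely it is to see at least a given number of marbles of one color among the top ten. The helper name `tail_probability` is hypothetical and chosen only for illustration.

```python
from math import comb

def tail_probability(total, in_group, drawn, observed):
    """Chance of seeing at least `observed` marbles of one color among
    `drawn` marbles taken at random from a bag of `total` marbles that
    contains `in_group` marbles of that color."""
    upper = min(in_group, drawn)
    ways = sum(
        comb(in_group, k) * comb(total - in_group, drawn - k)
        for k in range(observed, upper + 1)
    )
    return ways / comb(total, drawn)

# Sort B: 5 blue marbles in the top ten, with only 10 blue in the bag of 100.
p_blue = tail_probability(total=100, in_group=10, drawn=10, observed=5)

# Sort A: 6 black marbles in the top ten, with 50 black in the bag of 100.
p_black = tail_probability(total=100, in_group=50, drawn=10, observed=6)

print(f"5 or more blue in the top ten:  {p_blue:.5f}")
print(f"6 or more black in the top ten: {p_black:.3f}")
```

Five or more blue marbles would turn up by chance less than once in a thousand draws, while six or more black marbles is entirely ordinary. That gap is what lets us infer that the blue marbles were not placed at random.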

So now, let’s map this simple model of marbles in a bag onto the problem of interpreting gene expression changes in an experiment. The analogy works like this:

- Genes are the marbles
- Marbles of the same color are genes that are known to act in a common biological process
- The order of sorting in the bag is the change in gene expression in our experiment

So finally, to interpret our “expression-sorted” list of genes, we identify the known groups of genes that work together and that are biased towards either increased or decreased expression due to treatment. Or put simply (and leaving out a lot of details of implementation), we look at the genes most increased or decreased by treatment and look for known groups that show up more often than we would expect at random.
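Here is a toy sketch of that idea, carrying the marble calculation over to genes. Everything in it is made up for illustration: the gene names, the two gene sets, and the choice of looking at the top ten genes are all assumptions, and real analyses handle details this leaves out (for example, using the whole ranking and correcting for the fact that many gene sets are tested at once).

```python
from math import comb

def enrichment(ranked_genes, gene_set, top_n):
    """Count how many genes from `gene_set` fall in the top `top_n` of the
    ranked list, and the chance of seeing at least that many at random."""
    universe = len(ranked_genes)
    members = sum(1 for g in ranked_genes if g in gene_set)
    hits = sum(1 for g in ranked_genes[:top_n] if g in gene_set)
    upper = min(members, top_n)
    ways = sum(
        comb(members, k) * comb(universe - members, top_n - k)
        for k in range(hits, upper + 1)
    )
    return hits, ways / comb(universe, top_n)

# Hypothetical ranked list: gene1 is the most increased, gene100 the most decreased.
ranked = [f"gene{i}" for i in range(1, 101)]

# Hypothetical gene sets standing in for known biological processes.
gene_sets = {
    "process_A": {"gene1", "gene2", "gene4", "gene5", "gene8",
                  "gene40", "gene55", "gene60", "gene77", "gene90"},
    "process_B": {"gene3", "gene15", "gene27", "gene33", "gene48",
                  "gene52", "gene69", "gene71", "gene84", "gene96"},
}

results = {name: enrichment(ranked, genes, top_n=10)
           for name, genes in gene_sets.items()}

for name, (hits, p) in results.items():
    print(f"{name}: {hits} of the top 10 genes, chance under random sorting = {p:.5f}")
```

In this toy example, process_A plays the role of the blue marbles in Sort B: half of its genes sit in the top ten, which would be very unlikely if the ranking had nothing to do with that process.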

These biased groups of related genes (and associated biological processes) provide the support for which biological processes are changed by the treatment, while also leading to new hypotheses that can be tested in follow-up experiments.

Thanks for following along with me in this series. Part III will be uploaded soon.