MDI Biological Laboratory
Bioinformatics

Cloud Computing 101


Joel Graber, Ph.D., is a Senior Staff Scientist and Director of the Computational Biology and Bioinformatics Core, funded by a Center of Biomedical Research Excellence award from the National Institutes of Health.

There’s a good chance that by now, many of you have heard of “cloud computing.”  You might have even read the recent news about the project carried out by the Maine INBRE Bioinformatics Core under the auspices of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.

So what is cloud computing?  In an uncharacteristic burst of brevity (for me), I’m going to give you the answer quickly, and then you can keep reading if you want more details.  In short, “cloud computing” is the general term for performing our computational work using external computers that we access through the internet.

The benefits of carrying out our work in this manner are several. The primary benefit is cost: we can access very powerful computers for only as long as we need them, and accordingly for significantly less money. The costs are reduced in several ways. First, we don’t have to purchase the machines, but can instead essentially rent them from vendors such as Amazon, Google, or Microsoft. For just cents or dollars per hour, we can set up and use a machine with up to hundreds of CPUs and gigabytes (or even terabytes) of RAM. Second, not only do we avoid purchasing the machines, we also reduce our on-site power costs, both for running the computers and for the necessary cooling systems. Finally, we can get by with a somewhat smaller, or at least differently focused, computational staff to carry out our work.

So, in short, instead of buying, installing, and maintaining a high capacity computer system on campus, we instead access and use external computers to carry out our calculations. To do this, we need to:

  • Move our data onto the machines
  • Make sure the machines have the right programs to do the work
  • Carry out the computations
  • Retrieve the results before releasing our claim on the machine.

The key point here is that we only pay for the machine for as long as we have it running.
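
For the more technically inclined, here’s a rough sketch of that cycle in Python, using Amazon’s boto3 library; the machine image, instance type, bucket, and file names below are placeholders rather than our actual setup:

  # A minimal sketch of the rent-compute-release cycle (placeholder names throughout).
  import boto3

  ec2 = boto3.client("ec2")
  s3 = boto3.client("s3")

  # 1. Provision a virtual machine (m5.4xlarge: 16 CPUs, 64 GB of RAM).
  started = ec2.run_instances(
      ImageId="ami-0123456789abcdef0",  # placeholder machine image
      InstanceType="m5.4xlarge",
      MinCount=1,
      MaxCount=1,
  )
  instance_id = started["Instances"][0]["InstanceId"]

  # 2. Move our data onto cloud storage that the machine can read.
  s3.upload_file("reads.fastq.gz", "example-project-bucket", "inputs/reads.fastq.gz")

  # 3. Log in to the machine, load the right programs, and run the computation
  #    (not shown here).

  # 4. Retrieve the results...
  s3.download_file("example-project-bucket", "outputs/counts.txt", "counts.txt")

  # 5. ...and release our claim on the machine so the meter stops running.
  ec2.terminate_instances(InstanceIds=[instance_id])

The exact calls differ from vendor to vendor, but the rhythm is the same everywhere: start a machine, move data in, compute, move results out, and shut the machine down.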

Of course, the benefits of cloud computing don’t come without some effort, primarily learning to carry out our work in a new context – a context that comes with many, MANY options. Most scientists, especially bench-focused experimental scientists, can’t easily find the time to learn all of the options, much less how to use them to carry out their work effectively while minimizing cost. Our job in the Computational Biology Core is to be the bridge between our research teams and these powerful computing resources.

The STRIDES program arose out of NIH leadership’s recognition that cloud-based computing has become a very cost- and effort-efficient means of carrying out large-scale computations for biomedical research. Because most research teams don’t have the necessary knowledge and tools to do this work, the STRIDES program is heavily focused on educational efforts such as the one currently being carried out by the Maine INBRE Bioinformatics Core.

The education and skills needed for cloud computing include understanding the computational tools, selecting the best type of machine for the work, and choosing the most cost-effective way of transferring and storing the data, the external reference data, and finally the results. The great benefit of cloud computing is also its challenge: the machines can be available as often and for as long as we want, so the danger lies in becoming lax. If we don’t turn the machines off when they’re not in use, the benefits will evaporate quickly.
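
As one concrete (and purely illustrative) way of staying disciplined, a few lines of Python with the same boto3 library can flag any of our rented machines that have been running longer than expected:

  # Hypothetical housekeeping check: warn about machines running more than a day.
  from datetime import datetime, timezone

  import boto3

  ec2 = boto3.client("ec2")
  now = datetime.now(timezone.utc)

  for reservation in ec2.describe_instances()["Reservations"]:
      for instance in reservation["Instances"]:
          if instance["State"]["Name"] != "running":
              continue
          hours_up = (now - instance["LaunchTime"]).total_seconds() / 3600
          if hours_up > 24:
              print(f"{instance['InstanceId']} has been up {hours_up:.0f} hours -- still needed?")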

So to wrap up the key ideas: we in the Computational Biology Core have developed, and continue to develop, the skills to make cloud computing efficient in terms of both cost and effort, along with educational materials and programs to bring these tools and resources to the MDIBL and Maine INBRE communities.

Okay, now that all that’s out of the way, here are a few more details, for those of you still reading.

First, here’s a bit of insight into what makes this (relatively) new approach possible: most computers are not used to their full capacity most of the time. Smart engineers have taken advantage of this under-utilization and developed operating systems that let one physical machine behave as if it were several independent computers, called “virtual machines” or VMs. Each VM behaves like a separate computer, but many can run on one set of hardware. The external machines that we use are, in fact, VMs running on hardware owned and maintained by external providers.

Second, the computing needs of most people, labs, and companies are not constant, but sporadic, coming in bursts. This, of course, is why most individual computers sit idle much of the time. When the bursts arrive, however, the needs can be quite high. For example, one of our most common analysis steps is alignment: taking short stretches of DNA or RNA sequence and finding the most likely place in a target genome that they match. As noted in my previous blog posts, a modern RNA-seq data set generally has several tens of millions of short (about 100 nucleotides) sequences that must be aligned to their most likely spot in the genome, which for most organisms of interest runs anywhere from hundreds of millions to billions of nucleotides. To process just one sequence set, we typically use a computer with 8 to 16 CPUs and 32 or more GB of RAM, running for several hours. A machine like this can easily cost tens of thousands of dollars and needs specialized facilities with power and cooling to run. So we instead provision a cloud machine that costs us on the order of a dollar or two per hour, run it for a day or two until the work is done, and then shut it down again.
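
To put rough numbers on that choice (the figures below are illustrative assumptions, not actual prices or bills), the back-of-the-envelope arithmetic looks something like this:

  # Illustrative cost comparison with assumed, representative numbers.
  hourly_rate = 1.50      # dollars per hour for a rented 16-CPU, 64 GB machine
  hours_per_run = 48      # roughly "a day or two" per RNA-seq data set
  runs_per_year = 20      # assumed project load

  cloud_cost_per_year = hourly_rate * hours_per_run * runs_per_year
  print(f"Rented cloud machine, per year: ${cloud_cost_per_year:,.0f}")   # $1,440

  purchased_machine = 20_000   # a comparable on-site machine, before power and cooling
  print(f"Purchased machine, up front:    ${purchased_machine:,.0f}")

Even with generous assumptions, the rented machine comes out far ahead – provided, again, that we remember to turn it off when the work is done.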

Finally, I think it’s a fair wager that many of you are already doing cloud computing of your own, though you may not realize it. To put it another way: where are your digital photos? Where are your email messages (and their attachments)? Where are your other files?

Your first answer to these questions might be “in my phone.” Alternatively, it might be “on my computer’s hard drive.” Those answers are at least partially correct, but what if I follow up with “What happens if you lose your phone?” or “What happens if your hard drive crashes?” Prior to about 2010, the answer to that was either a shudder of fear or else (for the tech-savvy among us) a confident statement of “I have my backups” on CD, an external drive, or some other device that could be used for recovery.

In the modern world, however, you are much more likely to say “They’re in my iCloud,” or perhaps “They’re in my Google Photos” (or Gmail account). Congratulations, you are using cloud computing. As with the work we do in the Core, you are using a small bit of the capacity of an external computer system that you access through the internet.

In many ways cloud computing is just a modern demonstration of the idea of “economies of scale.” Commercial companies like Google, Apple, Amazon, and Microsoft have the financial and personnel resources to establish, operate, and maintain large computer systems (frequently referred to as “server farms”). Because they have such large facilities, it is a relatively smaller (and therefore more economical) burden for them to absorb the costs of security, hardware failure, and even the power for computing and cooling. With these resources and skills on hand, they can provide their remote cloud services at very cost-effective prices for the rest of us. As a result, your photos, your email, and maybe your business documents are all stored on someone else’s computer and accessible pretty much at will, as long as you have access to the internet.