Has your organization encountered the following scenario?
A scientist has finished developing an analytical tool or pipeline, but no one on the team is sure how to move it into production in a way that automates deployment, scales the underlying compute resources up and down to keep costs optimized, stays resilient to unexpected hardware and software failures, and makes it easy to roll out updates and deprecate old versions.
With so many critical engineering questions remaining unanswered, how can your scientific team switch from the mindset of testing and development in an academic playground environment to engineering truly reliable and production-scale scientific tooling, while also minimizing the man-hours required to establish and maintain such a system?
Would you like to run such workloads in the GCP or AWS cloud environments in order to take advantage of the global scale, extremely high infrastructure reliability, and cost-effectiveness these platforms provide when compared to on-prem compute clusters?
Perhaps you would also like an explanation of the core DevOps principles underlying how to run such a workload in the cloud?
I’ll use a detailed demo, with a working example for both Google Cloud Platform (GCP) and Amazon Web Services (AWS), to show how a real-world scientific computing (bioinformatics) workload can be run using several modern DevOps principles. Without further ado, let’s dive into it.
At a high level, the working example code:
- Shows how to put common bioinformatics tools (FastQC and BWA) into container images and upload them to each cloud’s image repository
- Spins up (and enables rapid tear-down of) all the cloud infrastructure required to run workloads via Terraform, an Infrastructure as Code tool
- Deploys a common workflow pipeline, based on these images and the spun-up cloud infrastructure, to each cloud’s fully managed Kubernetes service. Kubernetes is a container execution management and orchestration system, and Argo is used to orchestrate a workflow (pipeline) of tasks on the Kubernetes cluster
By using Docker, Terraform, Kubernetes, and Argo, you will learn how to:
- Spin up, update, and tear down the infrastructure required to run your workloads with simple commands
- Run end-to-end analytical DAG-directed workloads with automated retries of individual steps, should they encounter unexpected errors or infrastructure failures
- Centralize workload logging and metrics monitoring into cloud-native log and metric exploration tooling
- Run workloads on infrastructure that will automatically scale up (to global-scale capability) and down (to little or no compute resources) as needed so that you only pay for the CPU/RAM/storage resources that are required to complete your tasks. Gone are the days of leaving servers running idle and running up your bill needlessly
- Seamlessly deploy new software versions and phase out old version usage
The code base demonstrating these principles can be found here, and Part 2 discusses how to make use of it. If you want to skip an overview of DevOps principles and jump into the code, you can stop reading this Part 1 article.
However, if you want a guided tour on the technologies powering these DevOps goals then stick with me. I strongly suggest getting a cup of strong coffee ready. You will need it to power through this detailed article.
Presented below is a crash course on DevOps, Containers, Kubernetes, Terraform, and how utilizing them is critical to real-world usage of software, bioinformatics or otherwise.
Containers: Enabling Scalable Deployment of Tools
The core problem that this post and its code base address is that of DevOps, a blend of software development (Dev) and IT operations (Ops). Dockerization, Kubernetes, and Terraform all work together towards the same goal of simplifying the reliable deployment of programs at scale. Understanding the fundamentals of modern DevOps requires a basic understanding of containers. Let’s start with that.
Most bioinformaticians are all too familiar with this issue: You want to use an open-source program, probably created within academia, with outdated dependencies. Installation of this program becomes a package management nightmare as its version requirements for Python, Perl, and other dependencies create conflicts with locally installed updated versions.
You have likely gotten around this issue by creating virtual environments with Anaconda or Python’s venv module. Unfortunately, this approach has limits: at some point it becomes more difficult to maintain than it is worth. Virtual environments may be acceptable in a development and testing setting, but they are simply not sustainable at scale.
I have seen many companies, small and large, discover this the hard way. They stuck with it for far too long before throwing in the towel.
A container is an isolated software process that contains all code and dependencies required to run the application quickly with near bare-metal performance on any computing environment.
A container/Docker image is a definition for how a container should be built and run. Containers are processes that execute the packaged instructions within container images.
A container image is created with a simple Dockerfile that defines:
- The base OS the image is based on (this does not prevent a container based on the image from being run on other host OSs)
- The packages to install
- Shell commands to run when a container based on the image is executed
The Dockerfile is used by docker build to create an immutable image.
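To make the three parts of a Dockerfile listed above more concrete, here is a minimal Python sketch that renders one as text. The base image (ubuntu:22.04), the apt package name (fastqc), and the helper name render_dockerfile are assumptions chosen purely for illustration; the demo code base may well build its images differently.

```python
import json


def render_dockerfile(base_image: str, packages: list[str], entrypoint: list[str]) -> str:
    """Render a minimal Dockerfile with a base OS, packages, and a start command."""
    lines = [f"FROM {base_image}"]  # the base OS the image is built on
    if packages:
        # the packages to install (assumes a Debian/Ubuntu base that uses apt-get)
        lines.append(
            "RUN apt-get update && apt-get install -y --no-install-recommends "
            + " ".join(packages)
        )
    # the command executed when a container is started from the image
    # (exec form, which Docker expects as a JSON array)
    lines.append("ENTRYPOINT " + json.dumps(entrypoint))
    return "\n".join(lines) + "\n"


# Hypothetical FastQC image definition; names and versions are assumptions.
dockerfile_text = render_dockerfile("ubuntu:22.04", ["fastqc"], ["fastqc"])
print(dockerfile_text)
```

Saving the printed text as a file named Dockerfile and running docker build in the same directory would then produce the image, assuming the chosen base image and package are available.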
How do you provide redundancy and horizontal scaling for the service packaged in the image?
You achieve that by deploying executable instances of a container image via multiple containers running on multiple hosts residing across multiple distinct data centers, and then load balancing between these containers. Given the global scale of cloud providers, the potential for horizontal scalability and resilience to failure with containers is effectively limitless.
You could, for example, create a container image that packages together a tool such as FastQC and all of its dependencies. Containers launched from this FastQC image can then be deployed in virtually any compute environment, at any deployment scale, regardless of the hardware, software or OS running the host machine. A true boon.
You would run that container much as you would run the FastQC tool by itself. It is straightforward, for example, to pass input files into a FastQC container and get output files back from it.
Every cloud provider has its own container registry to push and pull images from. This functions similarly to how you push & pull git repository branches. For example, you run docker pull <image_name> to pull down a FastQC image hosted in your cloud’s container repository. This enables pulling the image onto a variety of compute resources in your cloud environment, which makes FastQC immediately ready to run with docker run <image_name> without employing arcane methods for managing dependency installation.
The similarity to git repos doesn't end there. Container images can also have tags applied, such as v1.0.1, so that you can keep track of which container image corresponds to which code base version. Running docker run <image_name>:<tag_name> pulls the image (if it is not already present locally) and runs a container based on that specific image version.
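As a rough illustration of tagged image references and of passing files in and out of a container, the Python sketch below only assembles a docker run command line; it does not execute it. The image name fastqc, the host directory, the /data mount point, and the assumption that the image's entrypoint is the fastqc executable are all hypothetical.

```python
import shlex


def image_ref(name: str, tag: str = "latest") -> str:
    """Combine an image name and tag into the name:tag form that docker accepts."""
    return f"{name}:{tag}"


def docker_run_command(image: str, host_dir: str, container_args: list[str]) -> list[str]:
    """Build (but do not execute) a docker run command.

    --rm removes the container once it exits; -v mounts host_dir at /data
    inside the container so input and output files are shared with the host.
    """
    return ["docker", "run", "--rm", "-v", f"{host_dir}:/data", image, *container_args]


# Hypothetical image, tag, and paths, used for illustration only.
ref = image_ref("fastqc", "v0.11.9")
cmd = docker_run_command(ref, "/home/me/reads", ["/data/sample.fastq", "--outdir", "/data"])
print(shlex.join(cmd))
```

Because the host directory is mounted into the container, the report FastQC writes to /data would appear in /home/me/reads on the host once the container exits.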
Containers vs. Virtual Machines (VMs)
If you think that containers sound a lot like virtual machines, which also package up software and make it executable in a wide variety of compute environments, you may be wondering what the differences are between the two and why containers are favored for scalable DevOps.
- Both containers and VMs can be run on a single host, but containers achieve this without the large performance hit VMs incur as a result of running their own operating system in addition to the software packaged within. Containers, by contrast, share the host OS, making them lightweight; you will typically see ~0.5% performance degradation.
With a Virtual Machine running on a modern hypervisor, you will be lucky if you see only a 1-3% performance degradation. At larger scales, these differences have a significant and tangible impact on cost.
- Containers are significantly quicker to build and deploy, whereas VMs can take minutes or hours to build. Containers therefore enable an Agile development process that is hard to achieve when working with VMs.
- Container images are immutable and each is defined by a single file, making continuous integration of software more reliable. For example, you can roll back to a previous, known-functional software version when required, safe in the knowledge that it could not have been altered since it was last deployed.
- Containers encourage the proven software architecture practice of favoring loosely coupled, distributed microservices over monolithic applications. This practice enables faster deployment when a fix or a new feature needs to go out, and it prevents a single issue in one part of a monolithic application from bringing down other parts that never needed to be tightly coupled to it.
Ideally, each program deployed at scale should be written into its own container image, thus enabling that program’s deployed container count to scale up and down to meet resource demands independently.
- Containers enable increased ease of OS-level and application-level logging and metric monitoring, as they share the host OS resources and were designed with detailed logging and metric capture in mind.
Container Execution Management
Let’s say you have containerized a program such as FastQC and tagged the image with an appropriate tag such as v0.11.9. What do you do with it to not just enable, but greatly simplify, fault-tolerant and scalable execution of FastQC?
Kubernetes (often abbreviated as K8s) is an open-source job scheduler and cluster management system, released by Google in 2014 to encourage community adoption of automated management of containerized workloads. It has gained widespread popularity due to the relative ease with which it enables global-scale operations.
Kubernetes is now the second-most popular GitHub repo in terms of authors and issues, second only to the Linux kernel.
The K8s documentation has an excellent but lengthy explanation for why Kubernetes is so useful. In short, K8s enables:
- Automatic bin packing: You provide Kubernetes with a cluster of nodes (cloud servers) that it can use to run containerized tasks. You tell Kubernetes how much CPU, memory, and storage each container needs. Kubernetes can fit containers onto your nodes to optimize resource usage.
- Self-healing: Kubernetes restarts containers that fail (due to software errors or hardware failures), replaces containers, kills containers that don’t respond to your user-defined health check, and doesn’t advertise them to clients until they are ready to serve.
- Service discovery and load balancing: Kubernetes can expose a container using a DNS name or its own IP address. If traffic load is high, Kubernetes can load balance and distribute network traffic across multiple instances of that container so that the container deployment remains stable at scale.
- Automated rollouts and rollbacks: You can describe the desired state for your deployed containers using Kubernetes, and it can change the actual state to the desired state at a controlled rate. For example, you can automate Kubernetes to create new containers for your deployment (e.g., containers of your software tagged ‘v2’), remove existing containers (e.g., tagged ‘v1’) and adopt all their compute resources to the new containers. This migration to a newer container could be done all at once or customized, for example, at a rate of 5% of existing containers being replaced every five minutes. If you observe an increased error rate as the new container is being rolled out, rolling the new deployment back is as simple as issuing a single command.
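To get a feel for what a rate of 5% every five minutes means in practice, the following Python sketch computes a fixed-rate replacement schedule. It is only arithmetic on assumed numbers (200 containers, and a function name invented for this example). In Kubernetes itself, the pace of a Deployment rollout is governed by settings such as maxSurge and maxUnavailable together with container readiness, rather than by a fixed timer.

```python
import math


def rollout_schedule(total: int, batch_percent: int, interval_min: int) -> list[tuple[int, int]]:
    """Return (minute, cumulative containers replaced) for a fixed-rate rollout."""
    batch = max(1, math.ceil(total * batch_percent / 100))  # containers per step
    schedule = []
    replaced = 0
    minute = 0
    while replaced < total:
        replaced = min(total, replaced + batch)
        schedule.append((minute, replaced))
        minute += interval_min
    return schedule


# Assumed deployment size of 200 containers, replaced 5% at a time every 5 minutes.
steps = rollout_schedule(total=200, batch_percent=5, interval_min=5)
print(f"{len(steps)} batches; first {steps[0]}, last {steps[-1]}")
```

Under these assumptions the rollout takes 20 batches, with the final batch starting 95 minutes after the first, which leaves plenty of time to notice a rising error rate and roll back.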
Fully Managed Kubernetes
You will be glad to note that Kubernetes is offered as a fully managed service by the top cloud providers today. These providers include:
- GKE (Google Kubernetes Engine) in Google Cloud
- EKS (Elastic Kubernetes Service) in Amazon Web Services
These offerings simplify the Kubernetes installation and hardware provisioning process to abstract away Kubernetes cluster creation. These services enable you to focus on launching scalable workloads against a K8s cluster that takes only minutes and a few clicks through the console to create.
Both GKE and EKS provide a serverless cluster control plane, which acts as a master node of sorts, alongside multiple worker machines called nodes. You submit containers for execution to the control plane, which then schedules them to run on nodes. When running on nodes, these tasks are referred to as pods.
To recap: Pods (typically just a single container) run on nodes, and the scaling up and down of pod count is controlled by the control plane. The fully managed K8s offerings also automate the setup of node auto-scaling (auto-scaling is built into GKE; EC2 Auto Scaling templates are created by EKS within AWS). With GKE and EKS, you ultimately achieve both pod and node auto-scaling with minimal effort.
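The relationship between pods, nodes, and node auto-scaling can be sketched with a toy calculation: given the CPU each pending pod requests and the CPU each node offers, how many nodes are needed? The Python sketch below uses a simple first-fit-decreasing packing. It is an illustration only, not the algorithm used by the Kubernetes scheduler or by any cloud autoscaler, and the pod sizes and node capacity are assumed values in millicores (1000 millicores equal one CPU core).

```python
def nodes_needed(pod_requests_millicpu: list[int], node_capacity_millicpu: int) -> int:
    """Estimate the node count by placing each pod on the first node with room."""
    free = []  # remaining CPU on each node provisioned so far
    for request in sorted(pod_requests_millicpu, reverse=True):
        if request > node_capacity_millicpu:
            raise ValueError("pod request exceeds the capacity of a single node")
        for index, remaining in enumerate(free):
            if request <= remaining:
                free[index] = remaining - request
                break
        else:
            # no existing node has room, so "scale up" by adding a node
            free.append(node_capacity_millicpu - request)
    return len(free)


# Assumed workload: seven pods queued on nodes with 4 CPU cores each.
busy = nodes_needed([2000, 2000, 1500, 1000, 500, 500, 500], 4000)
quiet = nodes_needed([1500, 500], 4000)
idle = nodes_needed([], 4000)
print(busy, quiet, idle)
```

As pods finish and fewer requests remain, the required node count shrinks, all the way to zero when nothing is queued. That is the behavior node auto-scaling aims to deliver automatically.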
When setting up a K8s cluster, you can also define node groups, which specify the family of hardware resources to use in your cluster. Although optional, node groups are commonly specified when submitting scientific computing workloads.
For example, you might create a ‘high cpu’ node group that uses CPU-rich / CPU-cost-optimized machine family hardware such as the c2 family on GCP or c5 family on AWS, then assign CPU-intensive workloads to this node group. You could have a separate ‘GPU’ node group that uses the a2-highgpu family on GCP or the p3 family on AWS, to which GPU-utilizing workloads are submitted. By operating in this way, distinct machine types can scale up and down independently of one another in a way that is tied to the workloads required to run on those machine types.
However, if you use GKE’s Autopilot mode rather than its Standard mode, you won’t have to specify node groups for your cluster. The provisioning of hardware resources is abstracted away as GKE Autopilot leads the DevOps industry by moving the K8s ecosystem more towards a serverless approach to global-scale computing. With GKE Autopilot, you simply submit your K8s job, which defines the CPU/RAM/storage requirements for the container, and GKE will scale compute resources up and down behind the scenes as required to execute that container task based on your specifications. On AWS, the closest equivalent to GKE Autopilot is running EKS pods on AWS Fargate.
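A K8s job that declares its own resource requirements might look roughly like the manifest built by the Python sketch below. The job name, image reference, and resource figures are assumptions for illustration, and the function name is invented for this example; the manifest is emitted as JSON, which kubectl accepts alongside YAML.

```python
import json


def fastqc_job_manifest(image: str, cpu: str, memory: str, storage: str) -> dict:
    """Build a Kubernetes batch/v1 Job manifest with explicit resource requests."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "fastqc-demo"},  # hypothetical job name
        "spec": {
            "backoffLimit": 3,  # retry a failed pod up to three times
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "fastqc",
                            "image": image,
                            "resources": {
                                "requests": {
                                    "cpu": cpu,
                                    "memory": memory,
                                    "ephemeral-storage": storage,
                                }
                            },
                        }
                    ],
                }
            },
        },
    }


# Assumed image reference and resource sizes.
manifest = fastqc_job_manifest("fastqc:v0.11.9", cpu="2", memory="4Gi", storage="10Gi")
print(json.dumps(manifest, indent=2))
```

The requests block is the part the cluster uses to decide how much compute the container needs; everything else about the underlying machines is left to the platform.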
Reproducible and Automated Infrastructure
Tying containerization and robust container execution on Kubernetes together is Terraform, an open-source infrastructure as code tool that makes it possible to define your desired cloud infrastructure state in easy-to-read HCL (HashiCorp Configuration Language) text. With a single command, you can create, update, or tear down your cloud infrastructure.
While containers make software versions immutable, easy to reproduce, and easy to scale, Terraform makes the cloud-based infrastructure that those containers are running on easy to replicate, update or delete, while being fully versionable.
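The idea of declaring a desired state and letting a tool work out the changes can be illustrated with a toy comparison in Python. This is a conceptual sketch only, not how Terraform is implemented (Terraform also tracks state and resolves dependencies between resources), and the resource names and attributes are invented for the example.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare desired and actual infrastructure and list the changes required."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(
            name for name in desired.keys() & actual.keys() if desired[name] != actual[name]
        ),
        "delete": sorted(actual.keys() - desired.keys()),
    }


# Hypothetical resources: what the configuration declares vs. what currently exists.
desired_state = {"k8s_cluster": {"mode": "autopilot"}, "bucket": {"location": "US"}}
actual_state = {"bucket": {"location": "EU"}, "old_vm": {"size": "small"}}

changes = plan(desired_state, actual_state)
teardown = plan({}, actual_state)  # an empty desired state removes everything
print(changes)
print(teardown)
```

In Terraform terms, the first comparison loosely corresponds to what terraform plan reports before terraform apply makes the changes, and the second to terraform destroy.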
By moving your software into containers, your workload management system into Kubernetes, and codifying your cloud infrastructure with Terraform, you will set the foundations for a company that can scale from stealth mode all the way to global ops, as well as quickly recover from common sources of failures such as buggy new releases and hardware issues, with very few core changes required in your DevOps ecosystem.
Putting Newfound Knowledge to Use
Thank you for your time. I genuinely hope this article has helped improve your understanding of DevOps. If you would like to begin making practical use of the skills discussed by following along with a functional demo code base, please continue on to Part 2 of this two-part series.