Jul 26, 2019

Building a Platform for Duct Tape

How Solutions Engineering Goals Drive Our Architecture Design

While the Solutions Engineering team at Zymergen works on software, we do so in a fundamentally different way than the Software department. Our unique requirements call for us to architect a custom platform to best provide tools for our users. In this article, I’ll be talking about what makes Solutions Engineering so singular, what that means for our development practices, and how we address that with our software framework.

Scientists at Zymergen and the Solutions Engineering Team

The mandate of the Solutions Engineering team is to support our internal users (Zymergen scientists) by providing last-mile support, rapid prototyping, and one-off software solutions. Our users work in a wide range of scientific disciplines (strain design, modeling, chemical analysis, and lab work, to name a few) and are often developing new processes, researching new opportunities, and trying new technologies. Because of this rapidly changing and growing domain of work, our existing software and data tracking may not always support what users are doing. While other Software teams work on building out our core platform, the Solutions team works to unblock or speed up scientists’ workflows that fall at the edges of that domain. We do this by providing small-scale tools — often narrowly scoped to enable or automate a few use cases. The success of a Solutions tool is measured largely by turnaround time and effort saved for the user, with less emphasis on long-term robustness or flexible use cases. Put another way, Solutions Engineering is the duct tape team for software at Zymergen — we provide quick fixes to immediate problems.

A typical request from a scientist to Solutions might be something like: “A machine I use outputs a .csv of measurement readings, which I have to reformat manually and upload to the next step in my analysis workflow. When I have to do this for twenty files at once it becomes extremely time-consuming and increasingly error-prone. Can you make a tool to automate this?” This may only affect a handful of users performing one specific kind of analysis, but having a Solutions Engineer write a couple hundred lines of code can save these users hours per week. (Not surprisingly, the Solutions Engineering team has quite a few fans among the rest of the company.)
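
To make that request concrete, a tool of this kind might look roughly like the sketch below. The column names (`well`, `reading`, `sample_id`, `value`) and both function names are hypothetical, invented for illustration; the real instrument output and target format will differ from tool to tool.

```python
import csv
import io

# Hypothetical mapping from the instrument's column names to the names
# expected by the next analysis step (both sets of names are assumptions).
COLUMN_MAP = {"well": "sample_id", "reading": "value"}


def reformat_csv(text):
    """Rename the columns of one instrument CSV and drop rows with no reading."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(COLUMN_MAP.values()), lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Skip rows with a missing or blank reading instead of passing them on.
        if not (row.get("reading") or "").strip():
            continue
        writer.writerow({new: row[old].strip() for old, new in COLUMN_MAP.items()})
    return out.getvalue()


def reformat_many(files):
    """Apply reformat_csv to a {filename: contents} mapping, e.g. twenty uploads."""
    return {name: reformat_csv(text) for name, text in files.items()}
```

Keeping the per-file logic in one small function is what lets the same code handle one file or twenty without the manual, error-prone repetition described in the request.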

Criteria for Solutions Engineering Development

Because the Solutions Engineering team’s goals differ from those of a standard software team, various aspects of our development process differ as well:

  • We want to optimize for a process that supports rapid development and deployment of small, narrowly scoped tools.
  • The tools’ specialized use cases mean their codebases are largely independent save for some common helper functions.
  • We don’t expect high traffic for these tools — at the moment each is used a few dozen times a week at most. However, we do need to account for concurrent users of the same tool, and be prepared to scale up as traffic increases.
  • These tools just need to get the job done — no fancy user interfaces are necessary, as long as they are easily accessible and navigable by non-technical users.

While the mandate of the team has always been as described in the previous section, it took some trial and error in practice to articulate exactly what that meant for our software development process, and in turn how that should inform our architecture. Identifying these criteria allowed us to evaluate prospective designs against each other and pinpoint the specific benefits and disadvantages of each one. To give a clearer picture of why we picked our final design, it’s worth describing the previous setup, since it informed several of our design choices.

Previous Architecture — Jupyter notebooks

Solutions tools were generally written using Jupyter notebooks, as the original team members were familiar with this tool as a way to present code for users to run themselves, without needing a command line or shell. Jupyter has the added benefit of having a browser-accessible interface for file upload, download, and navigation, and also provides ipywidgets, a package for creating clean, simple user interfaces.

To make these Jupyter notebooks available to users, the git repository containing the notebooks was cloned on an EC2 instance, and the Jupyter notebook server was run from there. This server exposed a link to a browser-accessible file navigator and notebook launcher.

While this setup accomplished the main goal of providing quick access to tools for users, problems arose as Zymergen grew. Running a single notebook server caused issues when multiple users tried to edit the same notebook or when too many users were running different notebooks simultaneously. The server also had to be manually restarted any time Solutions Engineers wanted to update the code for any notebook, or when the EC2 instance underwent maintenance. Finally, this approach committed us to using Jupyter for everything: tools had to be built as notebooks, and the hub for navigating them had to be the built-in Jupyter browser navigator.

New Architecture

After investigating a few alternatives and prototyping some key components during our department-wide Hack Weeks, we ultimately landed on our current platform. It takes advantage of two new technologies set up by our Infrastructure team: Binderhub and Zappa. (Many thanks to our Infrastructure team for making these happen; they ended up contributing to the Jupyterhub OSS project while implementing the former.) Binderhub is an application that spins up single-session-use containers running a Jupyter notebook server (more specifically, Jupyterhub), populated with files from a git repo. Zappa is a Python package that handles configuring AWS Lambda and API Gateway for deploying “serverless” (AKA without any permanent infrastructure) Python web apps. We wrote a Flask app that serves as the central website for collecting all our tools, which is deployed on AWS Lambda using Zappa. Each Solutions tool is made available through the app by displaying a Binderhub link, which creates a private container with the tool running in a Jupyter notebook.

Schematic of the new architecture: Using Binderhub, Zappa, and AWS Lambda to create a private container for each Jupyter notebook.


From the user’s perspective, clicking a link on our website opens a Jupyter notebook that runs the desired tool. From the developer’s perspective, all that is required is committing code to a git repo. Here’s what goes on behind the scenes:

During development (blue arrows in diagram)

  • A Solutions Engineer writes a tool, still in the form of a Jupyter notebook, and gets it reviewed and approved to merge into the git repo. The engineer also adds a page to the Solutions Flask app for the tool, along with some user documentation to display there.
  • When the code is merged into master, a Jenkins build is automatically kicked off, which runs a Zappa script.
  • Once the Zappa script is complete, the newly deployed code is available at a Solutions-specific URL. Accessing this URL triggers AWS Lambda to spin up our Flask app and serve it.
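
The deployment step in this list could be sketched as a small wrapper around the Zappa command-line tool, which a Jenkins job might call after a merge. This is an assumption-laden illustration rather than our actual script: the function names and the stage name `production` are hypothetical, and the stage must match one defined in the project's `zappa_settings.json`.

```python
import subprocess


def build_zappa_command(stage, first_deploy=False):
    """Return the Zappa CLI invocation for a stage as an argument list."""
    # `zappa deploy` creates the Lambda function and API Gateway the first time;
    # `zappa update` pushes new code to an already-deployed stage.
    action = "deploy" if first_deploy else "update"
    return ["zappa", action, stage]


def deploy(stage, first_deploy=False, runner=subprocess.run):
    """Run the Zappa command; with the default runner, a non-zero exit raises."""
    cmd = build_zappa_command(stage, first_deploy)
    # check=True makes subprocess.run raise CalledProcessError on failure,
    # which in turn would fail the CI build.
    return runner(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical stage name; requires the zappa CLI and AWS credentials.
    deploy("production")
```

Passing the runner in as a parameter is only a convenience for trying the sketch without AWS access; a real build script could call `subprocess.run` directly.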

Live on the website (green arrows in diagram)

  • When a user wants to access a tool, they click the appropriate link on our website.
  • This causes Binderhub to spin up a new Docker container running Jupyterhub, pull our code from git, and open a Jupyter notebook for the tool.
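
As a rough illustration of the link a user clicks, the sketch below builds a Binderhub launch URL for a single notebook. It assumes the repo is hosted on GitHub (the `gh` provider), which may not match our setup; the host, organization, repo, and notebook names are placeholders, and `binder_link` is a hypothetical helper rather than part of our Flask app.

```python
from urllib.parse import quote, urlencode


def binder_link(hub_url, org, repo, ref, notebook_path):
    """Build a launch URL that opens one notebook from a GitHub repo."""
    # Launch URLs have the form /v2/<provider>/<spec>; for the GitHub
    # provider ("gh") the spec is <org>/<repo>/<ref>.
    spec = "/".join(quote(part, safe="") for part in (org, repo, ref))
    # The filepath query parameter selects which notebook to open on launch.
    query = urlencode({"filepath": notebook_path})
    return f"{hub_url.rstrip('/')}/v2/gh/{spec}?{query}"


# Placeholder values for illustration only.
print(
    binder_link(
        "https://binder.example.com",
        "example-org",
        "solutions-tools",
        "master",
        "tools/plate_reader.ipynb",
    )
)
```

Because the link is just a URL, the website only needs to store one string per tool, and the target behind it can be swapped later without changing how users reach the tool.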

Advantages So Far

There are many reasons we’re enjoying the new setup:

  • We were able to use all our existing notebooks and backend code with almost no modifications. (We had to update our code from Python 2 to Python 3 to run on Binderhub, but that was long overdue anyway.) This made the migration easy for us as developers and for the users as well, since they continue to use the notebooks they’re familiar with.
  • Deployment is as simple as merging code — the developer doesn’t need to take any further steps. When the code updates, any Lambda calls currently in progress complete before being shut down, and opening Binderhub links automatically pulls the latest version of the code from git, so users don’t experience any downtime.
  • There is now a central location for all Solutions-related resources, including documentation and forms for users to submit new requests.
  • Concurrent users run code in their own private containers that exist for a single session. This means there are no leftover files or notebooks from other users or sessions and it is impossible for a user to view or modify other users’ files.
  • Thanks to our Infrastructure team, who set up Binderhub to run in a Kubernetes cluster, we can easily scale up resources as they are needed.
  • The Flask app itself is very simple, which is helpful as most of our Solutions team is fairly new to web development.
  • Binderhub links can be replaced with any other implementation we want later. For example, we now have the option to write the user interfaces for our tools without using Jupyter at all.

In Conclusion

Hopefully this article has made it clear what the purpose of Solutions Engineering at Zymergen is, and how that purpose has driven the design of the platform we’ve developed for serving tools to our users. Redesigning our system forced our team to articulate exactly what our unique needs were and how they translate into architecture requirements. We learned a lot during this process, and it’s been exciting and rewarding to watch this project grow from a “wouldn’t it be nice…” idea into a fully-fledged service that benefits our fellow scientists. We’re looking forward to finding ever more efficient and convenient ways to duct tape things together!

Roy Luo is a Senior Solutions Engineer on the Product Operations team at Zymergen.