At Zymergen we are building a data science ecosystem where machine learning algorithms help predict which genetic edits will produce strains with enhanced chemical production. These strains can reduce our reliance on chemicals derived from fossil fuels, in line with Zymergen’s mission to build a sustainable future through biology. We’ve created data pipelines that update machine learning models as new data arrives, then generate new strain recommendations. These recommendations are presented to scientists who submit strain designs to the build team.
Figure 1. Zymergen leverages biology and experimental results to improve microbial chemical production.
Strains are iteratively improved using miniaturized high-throughput experiments. We engineer and test microbe variants in parallel, gather data about their performance, then use this performance data to inform the next round of designs. Strains that look promising at the smallest scale are promoted to larger bench-top fermentation testing. When we find a real winner, the strain graduates to testing in commercial-scale tanks. Microbes that show improvement at commercial scale can help clients save millions of dollars. Our goal is to find these winning strains as efficiently as possible.
Figure 2. Small scale experiments are used to predict large scale performance. Microbes that look promising at the microliter scale are tested in bench-top tanks. Particularly promising strains are tested at industrial scale.
Genome optimization is challenging for a number of reasons. The first reason is that the space we are exploring is enormous, and only partially understood. A typical microbial genome has about 4 million characters of DNA. Of those 4 million base pairs, roughly half are associated with genes, and only about half of those genes are associated with known biological functions. Furthermore, even if the function of every gene were known, we would still be a long way from predicting how those genes interact, making it very hard to predict how genetic edits would affect a microbe’s performance.
One way to think about the size of the space we are exploring during genome optimization is to consider how long it would take to exhaustively explore that space. Though we have many ways to perturb a gene in the genome, for simplicity let’s just consider gene deletions. If the genome you are working with has 4,000 genes, perturbing each of them individually is tractable with Zymergen’s high-throughput strain engineering platform. You only need to build and test 4,000 organisms — that’s easy for Zymergen! However, to achieve the performance improvement that our clients require, a single edit simply won’t cut it. Instead we must consider combinations of edits.
Even if we only wanted to test strains with sets of six simultaneous edits, exhaustive search is out of reach. With 4,000 genes in the genome, you can compute that 4,000 choose 6 is nearly 10¹⁹. If Zymergen could build and test strains at a rate of 1/second, it would still take roughly 10¹¹ years to explore that space exhaustively. That’s more than ten times the age of the universe, so that’s not going to work! Instead we need to be smart about what we build and test, so we use biochemical modeling and machine learning to help find improved strains more efficiently. In this post, I will tell you about one of the approaches we use to help us explore this very large search space.
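The back-of-the-envelope arithmetic above is easy to check in a few lines of Python:

```python
import math

# Number of distinct 6-edit combinations drawn from 4,000 genes.
combos = math.comb(4000, 6)
print(f"{combos:.2e}")  # ~5.7e18 candidate strains

# At one strain built and tested per second:
seconds_per_year = 60 * 60 * 24 * 365.25
years = combos / seconds_per_year
print(f"{years:.1e} years")  # ~1.8e11 years, more than ten times the
                             # ~1.4e10-year age of the universe
```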
Figure 3. Exploration of the space of possible genetic edits is intractable, so Zymergen uses biology and algorithms to navigate it more efficiently.
Zymergen’s genetic engineering workflow can be explained in terms of analogies to software engineering and git workflows.
Figure 4. Engineering microbes has many analogies to software engineering.
We first start with human-interpretable ideas about genetic changes, such as perturbing a specific gene. We then need to “compile” this idea into the low-level DNA language. We do this using an in-house system called Helix, written in Python. Helix converts human-friendly strain descriptions into DNA designs. It also generates all the information needed to build and verify the construction of that DNA.
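Helix itself is an internal system, but the flavor of the “compile” step can be sketched in a few lines. Everything below — the gene coordinates, the edit-spec format, and the `compile_edit` helper — is a hypothetical illustration, not Zymergen’s actual API:

```python
# Hypothetical annotation table: gene name -> (start, end) coordinates
# on the genome. Real annotations come from curated genome databases.
GENE_COORDINATES = {
    "geneA": (1_200, 2_400),
    "geneB": (5_000, 6_100),
}

def compile_edit(edit_spec: str) -> dict:
    """Compile a human-readable edit like 'delete geneA' into a
    low-level DNA design (here, a dict of coordinates and payload)."""
    operation, gene = edit_spec.split()
    start, end = GENE_COORDINATES[gene]
    if operation == "delete":
        return {"op": "delete", "start": start, "end": end, "payload": ""}
    raise ValueError(f"unsupported operation: {operation}")

design = compile_edit("delete geneA")
print(design)  # {'op': 'delete', 'start': 1200, 'end': 2400, 'payload': ''}
```

A real compiler also emits the verification information mentioned above (primers, expected sequences, and so on), which is omitted here.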
A proposed modification gets “patched” into an organism by inserting a physical loop of DNA. The genetic edit is encoded between two regions of DNA sequence that match the upstream and downstream sequences on the genome, much as a git patch uses surrounding context lines to locate where a change applies. We test these strains in high-throughput small-scale experiments, which can be thought of as our “unit tests.” These experiments are a quick and simplistic test to get some idea of how the microbes perform, even though the conditions are a bit different from how the microbes will be grown in practice. We then perform larger-scale bench-top tests that can be thought of as “integration tests” because they occur in an environment that better reflects the enormous tanks the winning microbes will be deployed to.
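The “patch” analogy can be made concrete with plain string matching: a payload flanked by sequences matching the genome is inserted at the matching site. The DNA sequences below are made-up toy examples, far shorter than real homology arms:

```python
def apply_patch(genome: str, upstream: str, payload: str, downstream: str) -> str:
    """Insert `payload` where `upstream` is immediately followed by
    `downstream`, mimicking homology-directed integration."""
    site = upstream + downstream
    index = genome.find(site)
    if index == -1:
        raise ValueError("homology arms not found in genome")
    cut = index + len(upstream)
    return genome[:cut] + payload + genome[cut:]

genome = "ATGCCGTTAGGCTAA"
edited = apply_patch(genome, upstream="CCGT", payload="AAAA", downstream="TAGG")
print(edited)  # ATGCCGTAAAATAGGCTAA
```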
Like code, we can “commit” our strains to a repository for long-term storage. We store the strains in freezers, and their genetic representations in our databases. This allows us to “check out” past strains for new experiments. Freezing strains also enables creation of “branches” in the genetic tree, at any time in the committed strain history. We can also “check out” the DNA representation of strains for modeling, and “diff” two strains to associate performance changes with genetic edits.
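Once each strain is represented as the set of edits it carries, “diffing” two strains is a set operation. The strain and edit names here are hypothetical:

```python
def diff_strains(edits_a: set, edits_b: set) -> dict:
    """Return the edits unique to each strain and those shared,
    analogous to `git diff` between two commits."""
    return {
        "only_in_a": sorted(edits_a - edits_b),
        "only_in_b": sorted(edits_b - edits_a),
        "shared": sorted(edits_a & edits_b),
    }

parent = {"delete geneA", "promoter swap geneC"}
child = {"delete geneA", "promoter swap geneC", "delete geneB"}
print(diff_strains(parent, child)["only_in_b"])  # ['delete geneB']
```

Associating a performance change with `only_in_b` is what lets us credit (or blame) individual edits.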
Successful application of machine learning at Zymergen requires clean and reliable experimental data. The Data Science team uses Python to pull and preprocess data, update models, and produce recommendations. Our inputs to machine learning models are biological features and experimental results. These algorithms produce edit recommendations that are used to design strains. Strains are evaluated in parallelized small-scale experiments driven by robotic automation. Since experimental data can be noisy and drift over time, our workflow also removes outliers and applies statistical normalization.
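A minimal sketch of the cleaning step, using only the Python standard library: drop points far from the mean in z-score terms, then z-score normalize the remainder. The cutoff and data are illustrative; a real pipeline would use more robust statistics and account for plate and time effects:

```python
import statistics

def clean(measurements: list[float], z_cutoff: float = 2.0) -> list[float]:
    """Remove points more than `z_cutoff` standard deviations from the
    mean, then z-score normalize what remains. Note: with few points, a
    single outlier caps the achievable z-score, so the cutoff must be
    chosen with sample size in mind."""
    mean = statistics.mean(measurements)
    stdev = statistics.pstdev(measurements)
    kept = [x for x in measurements if abs(x - mean) <= z_cutoff * stdev]
    mean_k = statistics.mean(kept)
    stdev_k = statistics.pstdev(kept)
    return [(x - mean_k) / stdev_k for x in kept]

raw = [1.02, 0.98, 1.05, 0.99, 1.01, 9.50]  # last point is a bad well
normalized = clean(raw)
print(len(normalized))  # 5: the outlier was removed
```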
Figure 5. Zymergen’s Data Science team uses Python for cleaning data, modeling, and designing new strains.
Biological Feature Extraction
For the purposes of this blog post, we will only explore feature engineering for genes whose function has been previously identified: even when the function and interactions have been extensively studied, leveraging that knowledge for machine learning is non-trivial!
Biology is an exciting domain for machine learning because we have opportunities to encode information gained from decades of reductionist science as features for our algorithms. We can derive features from subsystems such as those involving DNA, RNA, proteins, and metabolic chemistry. The trick is that each has a complex, network-like structure, and learning algorithms are usually applied to tabular features. Thus, Zymergen writes software to extract tabular features from biological information.
In the following, we consider extraction of features from metabolism: the set of chemical reactions cells use to build cellular material and respond to external conditions. We are especially interested in metabolism because manipulation of metabolic genes is often the quickest route to improve production of a desired chemical.
Imagine we wanted to make a metabolite such as glutamate, which is often sold as monosodium glutamate (MSG). Microbes produce glutamate by rearranging the carbon in their food, using a sequence of chemical reactions. A subset of the metabolic network for producing MSG is shown below. Green edges show reactions that are involved in the most efficient way to produce this valuable chemical.
Figure 6. Illustration of the most efficient route for converting food to the chemical of interest.
Traditional graph metrics can be extracted from metabolic networks for use in machine learning, including graph distances to the product chemical, or node connectivities. Flux Balance Analysis (FBA) is a complementary method for interpreting that graph in a scientific context. In FBA, we represent the metabolic map as a matrix that encodes which reactions produce and consume chemicals in the cell. These rates form a set of linear equations when we assume metabolism is at steady state. We set an objective function to optimize within this many-dimensional solution space. These solutions estimate “fluxes,” which we can use as machine learning features.
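The steady-state constraint at the heart of FBA can be shown on a toy network. Below, a made-up three-reaction pathway (uptake of food A, conversion A → B, secretion of B) is encoded as a stoichiometric matrix S; a flux vector v is at steady state when S · v = 0. FBA then maximizes an objective (for example, the secretion flux) over all such v, subject to bounds, via linear programming. The network and numbers are illustrative, not a real metabolic model:

```python
# Rows = metabolites (A, B); columns = reactions
# (uptake: -> A, conversion: A -> B, secretion: B ->).
S = [
    [1, -1,  0],  # metabolite A: made by uptake, used by conversion
    [0,  1, -1],  # metabolite B: made by conversion, used by secretion
]

def is_steady_state(S, v, tol=1e-9):
    """Check S . v = 0: every metabolite is produced exactly as fast
    as it is consumed."""
    return all(
        abs(sum(s_ij * v_j for s_ij, v_j in zip(row, v))) < tol
        for row in S
    )

print(is_steady_state(S, [10.0, 10.0, 10.0]))  # True: fluxes balance
print(is_steady_state(S, [10.0, 7.0, 7.0]))    # False: A accumulates
```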
Figure 7. Flux Balance Analysis uses linear programming to determine reaction rates in a metabolic network.
With an open-source Python package called CobraPy, only a few lines of code are needed to produce such features. First, the metabolic network is loaded. Next, we look up the reaction whose flux we want to optimize and set it as the objective. Finally, we solve for fluxes that optimize that objective and write out summaries.
Figure 8. Obtaining basic flux features using CobraPy.
The example above uses MSG production as the objective. Importantly, we can loop over other objectives and apply scientifically-motivated constraints (e.g. limit oxygen uptake rates, or change food availability) to obtain a larger set of features.
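The looping pattern itself is simple. In the sketch below, `solve_fba` is a stand-in stub where a real solver call (such as CobraPy’s `model.optimize()`) would go, and the objective and constraint names are hypothetical:

```python
def solve_fba(objective: str, oxygen_limit: float) -> float:
    """Stand-in for a real FBA solve (e.g. CobraPy's model.optimize()).
    Returns a fabricated optimal flux so the loop below is runnable."""
    base = {"msg_production": 4.0, "biomass_growth": 1.2}[objective]
    return base * min(1.0, oxygen_limit / 10.0)

# One feature per (objective, constraint) combination.
features = {}
for objective in ["msg_production", "biomass_growth"]:
    for oxygen_limit in [5.0, 10.0]:
        key = f"{objective}_o2_{oxygen_limit}"
        features[key] = solve_fba(objective, oxygen_limit)

print(features)
```

Each strain or condition then contributes one row of such flux-derived features to the tabular input our learning algorithms expect.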
Figure 9. We can use numerous objectives to obtain a variety of flux-derived features for machine learning.
Using Genome Optimization to Make Useful Molecules
At Zymergen, we are programming microbes to make a variety of molecules useful across a broad set of industries, such as electronics, agriculture, and pharmaceuticals. The first step is often to prove we can get the microbe to make a small amount of the target molecule. From there, we iteratively edit the organism to increase its yield.
FBA provides real-valued features such as reaction rates, and can also be leveraged to generate binary features such as membership in important gene sets. By looping over multiple objectives, we can infer which genes are essential for cellular growth and which encode proteins needed to produce our target molecule.
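Turning flux solutions into binary features is a thresholding step. In this sketch, the per-gene knockout growth rates are made-up inputs standing in for what looping FBA over single-gene deletions would produce:

```python
# Hypothetical maximum growth flux after deleting each gene, as would
# be computed by FBA knockout simulations (not real data).
knockout_growth = {"geneA": 0.0, "geneB": 0.95, "geneC": 0.02}

GROWTH_THRESHOLD = 0.05  # below this, the cell effectively cannot grow

# Binary feature: is the gene essential for growth?
essential = {
    gene: growth < GROWTH_THRESHOLD
    for gene, growth in knockout_growth.items()
}
print(essential)  # {'geneA': True, 'geneB': False, 'geneC': True}
```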
These types of features are useful for strain recommendation algorithms because we often want to increase the rates of chemical reactions that contribute to making our target molecule, and decrease the rates of other reactions. And while we want the cells to be able to grow and reproduce, we often want to limit this capability to allow for more target molecule production. These features are also useful for the challenge of predicting commercial-scale performance from the smallest-scale tests.
The approach described above, namely coupling machine learning with biologically inspired features such as those from FBA, helps us explore the genetic space more efficiently. If we can get the combination of edits just right, we can build strains with very high efficiency, where a maximal amount of food eaten by the microbes is converted to the target molecule. Combining biological features with experimental results using this approach has accelerated genome optimization at Zymergen.
Janet Matsen is a Data Scientist working on strain recommendation algorithms at Zymergen.