Sep 22, 2017

Fast Forward: Radical Empiricism As the Root of Innovation at Zymergen

At Zymergen, we’re taking a different approach to improving the performance of materials and to discovering new molecular products that touch every industry. To do this, we apply the concept of “radical empiricism.” What do we mean by this? We asked our General Counsel, Duane Valz, to share some thoughts on the uniqueness of our approach and on how radical empiricism drives our innovation.

“I think the biggest innovations of the 21st century will be at the intersection of biology and technology. A new era is beginning.” – Steve Jobs

“We’re only at 1% of what’s possible. Despite the faster change in the industry we’re still moving slow relative to the opportunities we have. We should be building great things that don’t exist.” – Larry Page

I have the great fortune to help lead Zymergen, a company operating at the cutting edge of several fields of science and technology. I came here after working at Google, one of the most innovative companies in the world, which combines Internet, cloud, mobile, networking and machine learning technologies (among others) to provide best-in-class information services. Although Zymergen has been called an “utterly unsexy tech company” given the primary industry segments we serve, it is as exciting a place to work and innovate as Google or any of the other great companies I’ve counseled. Why is that the case? Put simply, the appeal of Zymergen lies in the sheer audacity of what we are undertaking: harnessing the transformative power of biology using robots and machine learning in order to launch a new industrial renaissance.

At Zymergen, we take a radically different approach to improving production performance and discovering new molecular products using biology. Zymergen’s work involves many subfields within molecular biology, chemistry, computer science and automation systems design, and the innovation that we produce often arises at the intersection of two or more of these subfields. As legal counsel, one of my team’s charges is to craft an IP strategy, support invention development and capture the novel value drivers emerging from that complexity. This is no small task. Whether we are determining which organic molecules can viably be produced using engineered microbes or working through how best to predict beneficial genomic modification strategies using machine learning, we revel in finding insights across disciplinary boundaries. This work and its application allow us to create tangible economic value for ourselves and our customers while simultaneously advancing scientific understanding. And that is what makes innovation at Zymergen so exciting: we have found an approach that advances not only our engineering and business goals but also the science of microbiology.

Radical Empiricism: An Approach to Taming Biology

Biology is notoriously difficult to engineer. There is a logic (or several) to how it works at the molecular level, but that logic cannot be reduced to a single, elegant principle or deterministic formula. For instance, despite having sequenced the entire human genome, we still don’t fully understand how individual genes interact with each other and the environment to produce or alter a given trait of interest. The same is true of microbes (bacteria, yeasts, other fungi), which are significantly less complex than human beings. These tiny organisms grow in colonies and can produce many things useful for humans: medicines, vitamins, fermented beverages, food supplements, and a range of building blocks for new materials. The human genome contains 19,000–20,000 protein-coding genes spread across roughly 3 billion base pairs of DNA. The genome of E. coli, a well-studied bacterium, contains approximately 4,000 genes spread across roughly 4 million base pairs. However, despite E. coli having a far smaller genome than a human, with many fewer possible points of interaction between its constituent genes, much of its genetics is still poorly understood. These significant gaps in understanding have limited our ability to engineer microbes to produce useful metabolites more effectively or to produce entirely new types of organic chemicals. Why is that? As suggested, genes interact in many complex, nonlinear ways with each other and the environment. There are many ways to perturb or modify genes, down to the base pair level. Even if there were only four degrees of design freedom (an understatement) for each gene in E. coli, the combinatorial space of design possibilities would be impossibly large for humans to address through traditional laboratory methods (4^4000 is a number with over 2,400 digits; one quadrillion has only 16 digits!).
Most random genomic modifications produce little to no measurable effect and many harm or disable the host organism. Finding combinations of measurably beneficial modifications, mathematically speaking, is thus more difficult by far than locating the proverbial needle in a haystack.
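The size of that design space is easy to check. The short Python sketch below takes the simplifying assumption stated above (four design options for each of roughly 4,000 genes) and counts the decimal digits in the resulting number. The function name is purely illustrative.

```python
def design_space_digits(options_per_gene: int, gene_count: int) -> int:
    """Return the number of decimal digits in options_per_gene ** gene_count."""
    # Python integers have arbitrary precision, so the exact value is computable.
    return len(str(options_per_gene ** gene_count))


digits = design_space_digits(4, 4000)   # simplified E. coli design space
quadrillion_digits = len(str(10 ** 15)) # one quadrillion, for comparison
print(digits, quadrillion_digits)
```

Under these assumptions the design space is a 2,409-digit number, against 16 digits for one quadrillion.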

Many genes in a given microbe responsible for core metabolic functions are well characterized. However, most genome regions are poorly characterized, and all microbes have significant (upwards of 25%) “dark regions” in their genomes, in which the functions of the constituent genes are completely unknown. While understanding how to modify core pathway genes to improve a trait of interest is a key aspect of microbial engineering, the gains to be had by working only in well-characterized metabolic pathways are limited. For example, last year, after 20 years of work, scientists at JCVI in San Diego were able to engineer a new bacterial species from scratch using a minimum viable number of genes. The new bacteria are able to feed, grow and self-replicate, but the reasons why this is possible are not yet fully understood; of the 473 genes in this bacterium, 149 have no known function. Almost a third of the genes used to create the new bacteria were critical to making it viable, but for reasons that no one can explain. That lack of understanding means a lack of insight into how these genes interact with the genes whose primary functions are known. Imagine, then, engineering any type of bacterium to cost-effectively produce a metabolic product of interest without really understanding how much of it works. That’s like redesigning a car to be faster or more fuel-efficient without fully understanding how the engine or transmission operates.

The vastness of the design space and the amount of uncharted territory in the genome explain why, at least for industrial microbes, the traditional engineering approach has defaulted to random mutagenesis: exposing microbial colonies to ultraviolet radiation or chemicals and then screening for surviving cells that exhibit desired traits. As its name suggests, this process is haphazard at best. Even when successful, generations of microbes subjected to random mutagenesis undergo many genetic changes that are deleterious, alongside those that may enhance a trait of interest. This is where radical empiricism as a Zymergen engineering principle excels. By systematically modifying the genome of a microbe and carefully measuring what phenotypic changes result, Zymergen is able to establish that a particular modification affected a trait of interest without necessarily understanding why the modification had that effect. Successful application of radical empiricism is the animating force behind our new discoveries in biology, chemistry and machine learning and lies at the heart of our novel intellectual property.

Advancing Engineering Efficacy as well as Scientific Understanding

So how, then, does radical empiricism work? Compared to random mutagenesis, Zymergen’s approach to engineering microbes is targeted, generates many beneficial genomic modifications, and builds in an iterative learning model that allows this process to be undertaken very rapidly. “Empiricism” means that the genomic modifications we create must be measurable in terms of the phenotypic changes that result from them. “Radical” refers to the process we undertake. We use automation equipment to simultaneously perform hundreds of slightly different experiments related to the same phenotypic improvement goals. Use of such equipment helps ensure experimental consistency alongside high experimental frequency. We use custom in-house software to design genomic modifications, to gather data from all experiments, and to analyze those data to generate additional modification strategies (see below). Thus, instead of a scattershot approach to engineering, radical empiricism allows us to (1) design and generate many genomic modifications; (2) identify and measure the modifications that work best towards a trait of interest (e.g., produce more of organic chemical X); (3) analyze the corpus of modifications; (4) generate predictions regarding which set of additional modifications is most likely to further advance achievement of the desired trait; and (5) rinse and repeat. Each of these steps involves the application of a cross section of the earlier mentioned technical and scientific subfields.
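The five steps above can be sketched as a simple loop. The following Python toy is only an illustration under loud assumptions: the "assay" is a made-up additive function rather than a real measurement, every name (`measure`, `run_campaign`, `mod_000`, and so on) is hypothetical, and step (4) is reduced to a naive rule that keeps the top few hits from each round. It is not a description of Zymergen's software.

```python
import random


def measure(strain, effects, baseline=100.0):
    """Toy assay: baseline output plus the hidden effect of each modification."""
    return baseline + sum(effects[m] for m in strain)


def run_campaign(effects, rounds=3, batch_size=20, keep=3, seed=0):
    rng = random.Random(seed)
    strain = frozenset()          # current best strain, no modifications yet
    untested = sorted(effects)    # candidate modifications not yet tried
    history = []
    for _ in range(rounds):
        # (1) Design and build a batch of variants, one new modification each.
        batch = rng.sample(untested, min(batch_size, len(untested)))
        # (2) Measure every variant against the current strain.
        reference = measure(strain, effects)
        results = {m: measure(strain | {m}, effects) for m in batch}
        # (3) Analyze: a "hit" is any modification that beats the reference.
        hits = sorted((m for m in batch if results[m] > reference),
                      key=results.get, reverse=True)
        # (4) Predict: here, naively assume the top hits combine additively.
        strain = strain | frozenset(hits[:keep])
        # (5) Repeat with the consolidated strain and the remaining candidates.
        untested = [m for m in untested if m not in batch]
        history.append(measure(strain, effects))
    return strain, history


# Hypothetical hidden effects: most modifications are neutral or harmful.
effect_rng = random.Random(42)
effects = {f"mod_{i:03d}": effect_rng.gauss(-1.0, 2.0) for i in range(200)}
best_strain, history = run_campaign(effects)
print(sorted(best_strain), [round(h, 2) for h in history])
```

Because the toy effects are additive, the measured output can only stay flat or rise from round to round. Real genomes offer no such guarantee, which is why each round has to be measured rather than assumed.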

Zymergen thus does not simply test predictions of rational models of a host cell’s genome to determine whether the results conform to such models. Rational models tend to be limited, principally because they fail to take into account the impact that genes not involved in core metabolic pathways have on traits of interest. Leveraging a high throughput (HTP) platform, Zymergen designs many random but targeted experiments involving off-pathway genes, while simultaneously constructing experiments according to rational hypotheses. Any modifications from either set of experiments that have a measurable impact on a trait of interest (i.e., a “hit”) provide immediate discoveries as well as useful data for conducting future experiments. Additional experimental approaches can be constructed and tested using hit data (for instance, modifying additional genes in the literal or functional ‘neighborhood’ of an altered gene that produced a hit). A results-based understanding of a trait of interest — and the extended network of genes responsible for impacting that trait — is built up gradually, inductively, with every round of experiments.
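As a rough illustration of the "neighborhood" idea, the sketch below proposes untested genes that sit next to a hit in genome order. It assumes a hypothetical ordered gene list and covers only the literal neighborhood; a functional neighborhood would need pathway or annotation data that is not modeled here. All names are invented for the example.

```python
def propose_neighbors(hits, gene_order, radius=1, tested=()):
    """Return untested genes within `radius` positions of any hit gene."""
    position = {gene: index for index, gene in enumerate(gene_order)}
    skip = set(tested) | set(hits)   # never re-propose hits or tested genes
    proposals = []
    for hit in hits:
        center = position[hit]
        start = max(0, center - radius)
        stop = min(len(gene_order), center + radius + 1)
        for gene in gene_order[start:stop]:
            if gene not in skip:
                skip.add(gene)       # avoid proposing the same gene twice
                proposals.append(gene)
    return proposals


# Hypothetical genome order and first-round results.
gene_order = ["geneA", "geneB", "geneC", "geneD",
              "geneE", "geneF", "geneG", "geneH"]
next_round = propose_neighbors(hits=["geneC", "geneF"],
                               gene_order=gene_order,
                               tested=["geneB"])
print(next_round)  # ['geneD', 'geneE', 'geneG']
```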

But if radical empiricism is an approach to engineering microbes, how does it advance scientific understanding, especially if we don’t always comprehend why certain modifications work? Conventional wisdom suggests that science is the precursor to technology: scientific theories are first developed and validated through empirical research, and only then are they put to practical use. For instance, many scientists believed that we would be able to develop targeted gene therapies and other technologies that make use of sequence-specific understanding of genes only after we mapped the entire human genome. Scientists have since realized that knowing the base pair sequence of a genome provides limited understanding if the functional properties of the sequenced genes aren’t known. The history of science, however, is rife with instances of technology development predating scientific discovery. For example, the science of thermodynamics grew out of the study of early vacuum pumps and, eventually, the operation of steam engines, and took shape as a discipline in the 19th century. Study of these devices in operation enabled a more systematic, scientific understanding of the relationship between pressure, heat and various forms of energy. This, in turn, led to the design of improved steam engines for factories, boats and trains and the development of internal combustion engines later used in automobiles and airplanes.

Another, perhaps more fitting, example comes from the birth of microbiology. Antonie van Leeuwenhoek is widely considered the “Father of Microbiology,” and his contributions include the discovery of protists and bacteria during the late 17th century. However, van Leeuwenhoek is much better known for his advances in fashioning glass lenses, which yielded microscopes powerful enough to detect and observe lifeforms on such a minute scale. It was his creation of the tools enabling microbiology research in the first instance that distinguished him and facilitated an entire new epoch of scientific discovery by all biologists to follow. Radical empiricism as an engineering approach is akin to the creation of the microscope. It embodies a series of technical developments that allow deeper probing of the genome and its often opaque mechanisms of action. Instead of discovering new life forms, Zymergen is incrementally making sense of the genetic codes responsible for the expression of known life forms through an engineering approach driven by predictive analytics, measurable results, and positive feedback loops. By studying how a particular genomic modification combined with others impacts a trait of interest, and how such a modification may impact other traits, we gradually build a better understanding of both the modified genes and the genome of the host cell overall. Radical empiricism allows us not only to successfully engineer microbes without a full understanding of why that engineering works, but also to generate entirely new scientific insights over time as we collect more and more data.

The Role of Software and Computation in Radical Empiricism

What makes what Zymergen is doing different? Computer science has been used to great effect to advance our understanding of biology. The incredible advances in genetic sequencing technology since the initiation of the Human Genome Project owe their development to innovations in both hardware design and software. Innovations in computational biology and bioinformatics have advanced our understanding of the composition of genomes and allowed us to isolate and study individual genes. Yet, despite those advancements, we have still only scratched the surface in determining how multiple genes interact to produce traits of interest: it is rare that a single gene is solely responsible for any given trait. Without such an understanding, our ability to intervene and make genomic modifications towards desired outcomes is necessarily limited.

At least two things differentiate Zymergen’s approach to applying computer science to biological applications: (1) our analytics and machine learning approaches are always informed by wet-lab experiments in vivo; and (2) we immediately apply insights gained from computational analysis back into our iterative engineering of live cells. As described, we generate hit data through the experiments we run on microbes, with “hits” being the genetic modifications we introduce that measurably impact a trait of interest. Hit data are instructive, and aggregated hit data provide the basis for deeper pattern recognition. One application of hit data is in understanding which hits are likely to combine well phenotypically. For that application, our computational models have made consolidation predictions that are as good as (and in some instances better than) those made by one of our leading scientists. Further details regarding this technique, which we refer to as epistasis mapping, can be found in one of our recent patent publications alongside many other innovations. Ever more refined hit data is what we use to feed our machine learning algorithms. A notable prior stumbling block to using and analyzing hit data in this way is quite simply the difficulty of generating hits in the first instance. With random mutagenesis, hits are few and far between. With Zymergen’s high-throughput (HTP) production platform, we can generate dozens of hits in a relatively short period of time. Hit data are the lifeblood of computational analysis at Zymergen. Grounded in empirical reality, hit data are the units of innovation that permit us to excel at microbial engineering and to deepen scientific insight.
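To make "combining well" concrete, the sketch below scores pairs of hits with a common textbook definition of pairwise epistasis: the measured gain of the combined strain minus the sum of the individual gains. This is an assumption chosen for illustration, not the epistasis mapping technique described in the patent publication, and the hit names and numbers are invented.

```python
from itertools import combinations


def pairwise_epistasis(single_gain, pair_gain):
    """Epistasis per measured pair: observed gain minus the additive expectation."""
    scores = {}
    for a, b in combinations(sorted(single_gain), 2):
        if (a, b) in pair_gain:
            expected = single_gain[a] + single_gain[b]
            scores[(a, b)] = pair_gain[(a, b)] - expected
    return scores


# Hypothetical gains over the parent strain, in arbitrary yield units.
single_gain = {"hit_1": 2.0, "hit_2": 3.0, "hit_3": 1.5}
pair_gain = {
    ("hit_1", "hit_2"): 6.0,  # better together than the sum of the parts
    ("hit_1", "hit_3"): 3.5,  # purely additive
    ("hit_2", "hit_3"): 2.0,  # the two hits interfere with each other
}

scores = pairwise_epistasis(single_gain, pair_gain)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0], scores[ranked[0]])  # ('hit_1', 'hit_2') 1.0
```

A positive score suggests a pair worth consolidating into one strain, while a strongly negative score flags hits that should probably not be stacked. A predictive model would aim to estimate these scores for pairs that have not yet been built.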

By understanding which clusters of modifications produce measurable improvements for a given trait, we can inductively derive provisional theories of causation that may be associated with an entire cluster or subsets of it. Such provisional theories are just that: temporary, unconfirmed notions formed from the analysis of a limited data set. But such theories are enough to make intelligent design choices for the pursuit of additional genomic modifications. This meaningfully focuses our efforts in attacking the very large design space presented by microbial genomes. For instance, one type of genomic modification we make in high throughput is the substitution of promoters in front of certain target genes. Promoters are non-coding regions in the genome of microbes that serve to modulate the expression of the genes to which the promoters are functionally linked. Zymergen develops a suite of 6–10 promoters for each type of microbe with which we work, with such promoters varying in strength and in the extent to which each can modulate the expression of target genes. Combinatorially, we can try every promoter in the suite in front of dozens to hundreds of target genes and measure the results of each phenotypic variant. Which genes we target may be based in part on provisional theories (e.g., dampening the expression of each of a group of genes responsible for producing an adverse byproduct is likely to increase the production of our end goal metabolite in this microbe). Simply electing to rotate promoters in front of target genes, instead of implementing every possible base pair modification in each of the target genes, allows us to explore the search space in a systematic but more targeted manner. Based on which gene-promoter combinations yielded hits, we may elect in subsequent rounds of engineering to target with those promoters additional genes sharing some common characteristics with previously targeted genes.
Our analytic models involve linear regression and polynomial regression, as well as emerging frameworks such as graph convolutional networks. Promoters are but one of many modification variables we systematically evaluate. Each subsequent round of experimentation and analysis yields additional insights. This is true particularly for modifications that persist as hits in slightly different strain backgrounds, including backgrounds that contain other modifications previously identified as hits. Over time, some provisional theories become better validated. The more experiments conducted by Zymergen, the richer our corpus of hit data and the more sophisticated our analytical and predictive capabilities become.
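A minimal sketch of the promoter-swap idea follows, under stated assumptions: six hypothetical promoters ranked by relative strength, three hypothetical target genes, and made-up measurements that respond linearly to promoter strength. It enumerates every promoter and gene pairing, then fits an ordinary least-squares slope per gene as the simplest stand-in for the linear regression mentioned above. None of the names or numbers come from Zymergen.

```python
from itertools import product


def build_library(promoter_strength, genes):
    """Pair every promoter in the suite with every target gene."""
    return list(product(promoter_strength, genes))


def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def expression_trends(measurements, promoter_strength):
    """Per-gene slope of measured output against relative promoter strength."""
    by_gene = {}
    for (promoter, gene), output in measurements.items():
        by_gene.setdefault(gene, []).append((promoter_strength[promoter], output))
    return {gene: slope(*zip(*points)) for gene, points in by_gene.items()}


# Hypothetical suite: promoter name -> relative strength (weakest to strongest).
promoter_strength = {f"P{i}": float(i) for i in range(1, 7)}
genes = ["geneA", "geneB", "geneC"]
library = build_library(promoter_strength, genes)

# Made-up assay: each gene has a hidden sensitivity to its expression level.
hidden_sensitivity = {"geneA": 2.0, "geneB": -1.0, "geneC": 0.0}
measurements = {
    (promoter, gene): 100.0 + hidden_sensitivity[gene] * promoter_strength[promoter]
    for promoter, gene in library
}

trends = expression_trends(measurements, promoter_strength)
print(len(library), trends)
```

In this toy, a positive slope suggests that stronger expression of the gene helps, and a negative slope suggests that dampening it helps, as in the byproduct example above. With six promoters and three genes the library holds 18 variants; a suite of 10 promoters in front of 200 genes would hold 2,000.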

For our customers, the results of our work are anything but “unsexy.” Given the challenges of biology and traditional methods of engineering it, even a small percentage point increase in the yield or productivity we deliver to our customers can mean hundreds of millions of dollars per year in realized value. For me personally, it is particularly rewarding to capture as intellectual property all of the novel ways we use computation technologies to probe and make sense of biology. Data analysis is the connective tissue between all of the other fields we bring to bear on strain optimization. Machine learning is quickening the pace of our discovery. These are precisely the kinds of tools the patent system was designed to protect.

Despite all that we have achieved, the Zymergen team is aware that we are only at the beginning of our innovation journey. We aspire to develop a more thorough understanding of microbial biology through refinement of our techniques and further iterative investigation. We continue to develop different ways of perturbing microbial genomes using systematic, measurable methods. We are also developing additional machine learning models that may be uniquely adapted to probing biology. We thus believe that radical empiricism can lead the way in engineering and gaining scientific understanding in other areas of biology, such as plant and mammalian cells. Stay tuned.