Dec 6, 2018

AI in the Enterprise: Challenges and Opportunities (Part I: Opportunities)

At Zymergen, we apply AI and machine learning techniques to many aspects of our high-throughput microbial genome assembly and testing systems and practices. Aaron Kimball, our CTO, offers his thoughts on the lessons we’ve learned on this AI journey, and how they generalize to a broader business context. In this two-part series, we will first discuss the opportunities and benefits that AI brings to organizations; in part two, we will examine the costs and challenges.

Technical leaders in many industries are examining the potential for AI to assist their business. Large tech companies have pioneered the approach of investing heavily in AI features in their products, with Google CEO Sundar Pichai recently declaring that Google’s strategy is “AI first” and Microsoft announcing that they “are infusing AI into every product and service.” Our smartphones are now filled with suggestions and recommendations, automatically-tagged photos of our friends, and even tools that can write our email for us. But AI is not restricted to the world of high tech digital assistants or self-driving cars. Core applications of AI include searching through large volumes of data or complex virtual domains, making predictions or recommendations about the future, and classifying sensed information; these capabilities apply to many industries. Whether one calls it artificial intelligence, machine learning (ML), or data science — the Venn diagram of these domains is nearly a circle — predictive techniques can give a company an edge in markets as diverse as agriculture, manufacturing, and of course, biology, as we know here at Zymergen.

There are many possible benefits to employing AI in an organization. But these benefits don’t come simply by hiring a data science team and telling them to get to work. In addition to being a technical challenge, employing AI in a meaningful way will challenge assumptions about the importance of different kinds of decisions, the relative authority of individuals vs. systems, and the financial value of investing in data collection and infrastructure. In this post, we’ll explore some of the benefits of adopting an AI-oriented strategy, and some of the issues that will arise along the way.

The AI opportunity

Decision-making in complex environments can be time-consuming or error-prone. Organizations that do this well often rely on specific subject matter experts to run a process, and expanding this team or replacing employees who leave can involve a challenging hiring process. In information-dense environments such as telecommunications, financial services, and heavily-instrumented laboratory or industrial systems, AI can provide a substantial advantage in a number of ways.

AI saves time

Data science toolchains free up time for subject matter experts to focus on the subject matter instead of reformatting data in Excel. Working with data “by hand,” in contrast, is time-consuming even for experts. Collating data streams together is tedious work, and usually must be done by the same people who are analyzing that data to make a decision. Humans are very good at extracting subsets of data from larger data sources (like written reports or complex spreadsheets), but this is a slow and labor-intensive process. Only once the data is properly prepared for the task at hand can it be analyzed. A survey of data scientists shows that they spend 80% of their time on data prep and just 20% on analysis. When analyses must be repeated, those prep hours are invested over and over. While building a machine learning pipeline is expensive, a pipeline that automates a complicated process such as data formatting can pay for itself in months.
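As a minimal sketch of the kind of collation step such a pipeline automates, consider joining two raw data streams — instrument readings and sample metadata — into analysis-ready rows. The column names and values here are hypothetical, not Zymergen’s actual schema:

```python
# Minimal sketch: automating the collation step that analysts
# otherwise repeat by hand. Column names and values are hypothetical.
import csv
import io

# Two raw "streams": instrument readings and sample metadata.
readings_csv = "sample_id,titer\nS1,4.2\nS2,3.9\nS3,5.1\n"
metadata_csv = "sample_id,strain,plate\nS1,A,P1\nS2,B,P1\nS3,A,P2\n"

def collate(readings_text, metadata_text):
    """Join the two streams on sample_id into analysis-ready rows."""
    meta = {row["sample_id"]: row
            for row in csv.DictReader(io.StringIO(metadata_text))}
    rows = []
    for row in csv.DictReader(io.StringIO(readings_text)):
        merged = {**meta[row["sample_id"]], "titer": float(row["titer"])}
        rows.append(merged)
    return rows

prepared = collate(readings_csv, metadata_csv)
```

Once a join like this is encoded, it runs identically every time, and the expert’s hours shift from formatting to interpretation.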

At Zymergen, we face a number of analytic challenges across our various areas of scientific effort. One of our core offerings is a service that helps chemical manufacturers improve the economic efficiency of industrial fermentation: using microbes to ferment (manufacture) chemicals out of sugar. By genetically engineering the microbes, we can improve the rate and efficiency with which sugar is converted to the product chemical.

Onboarding a new client requires a lot of upfront investigation and investment in calibration of systems and procedures for working with a new type of microbe. One component of this is determining the cultivation conditions for the organism, which we call “plate model development.” The plate model, used in lab tests of engineered microbes, must be an accurate representation of how the microbe would perform in industrial-scale fermentation. To develop a plate model, we perform a series of experiments to test varied cultivation conditions until we arrive at an optimal configuration; there are several parameters that scientists can adjust and the space of potential plate model configurations is very large.

Plate model development for a project lasts about 12 weeks and requires the full-time attention of several staff members; within that effort, data prep and analysis originally consumed almost 40 person-hours each week. While the analysis work was spread across a few individuals, it effectively represented the investment of a full-time employee per project. Tools created by our data science department have since reduced this burden to less than eight hours per week. In a multivariate space such as the plate model, changing one parameter of the experimental model can have a different effect depending on the values of other parameters. As such, understanding which configurations to try in the next iteration of tests requires a combination of biological knowledge and analysis of large volumes of data. Our new tools provide better statistical models of microbe performance and its relationship to plate model parameters, and recommend a set of experiments to run to further reduce uncertainty in the plate model. In the hands of a trained scientist, this allows for significantly faster decision-making and increases the quality of the resulting plate model.
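One simple way to recommend the next round of experiments in a multivariate space — a sketch only, not Zymergen’s actual method, and with hypothetical parameter grids — is a space-filling heuristic: among untested configurations, prefer those farthest from anything already tested, since those regions carry the most remaining uncertainty:

```python
# Sketch of a space-filling experiment recommender: rank untested
# configurations by distance to the nearest tested configuration.
# Parameter names, grids, and tested points are hypothetical.
import itertools
import math

temps = [28, 30, 32]     # cultivation temperature (deg C)
phs = [6.0, 6.5, 7.0]    # medium pH
candidates = list(itertools.product(temps, phs))

tested = [(28, 6.0), (32, 7.0)]  # configurations already run

def recommend(candidates, tested, n=2):
    """Pick the n untested configs farthest from all tested configs."""
    untested = [c for c in candidates if c not in tested]
    scored = sorted(untested,
                    key=lambda c: min(math.dist(c, t) for t in tested),
                    reverse=True)
    return scored[:n]

next_runs = recommend(candidates, tested)
```

Real plate-model tooling would fold in a performance model and biological priors, but the core idea — let the data, not habit, pick the next configurations — is the same.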

AI scales horizontally

In the example above, I described a process we need to run from the beginning as each new client project begins, one after another. Each time a human-driven process runs, it requires substantial execution time; an AI-driven process does not. This scaling issue also applies to processes run in parallel.

AI-driven processes can also be applied to a number of related domains at the same time. Consider the merchandising department of an e-commerce company. Human experts for each department — women’s clothes, men’s clothes, accessories, etc. — will need to choose which items to mark as on-sale, which items to pair in an ad creative, or which to highlight on the department landing page. An AI system trained on customer data and preferences that can help surface the right choices in one department can also be applied to other departments. As the company increases the breadth of departments, existing AI-based processes can be leveraged to keep the resource needs of each department down.

AI systematizes knowledge

When decision-making relies on experts, expert knowledge becomes the key variable input to a system. Different individuals may make different decisions based on the same input data, so the output of a process can vary week-to-week depending on who is out sick, on vacation, or on call. A given expert can only produce value one run of a process at a time. By pairing subject matter experts with data scientists, this knowledge can be built into a system that applies logic consistently, and the experts can be freed to develop other processes or to judge marginal cases that the system deems too close to call or otherwise exceptional. Data science teams that have built a feedback mechanism into their system can then turn those manually-judged cases into new training data for an improved system. At Zymergen we test thousands of genetically engineered variants of a client’s microbe to identify versions with improved fermentation performance, which we call “hits.” Hits can be small and hidden by noise in the measurement system. Our hit detection improved when we shifted from a process that relied on expert judgment to an automated system for identifying hits.
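A stripped-down illustration of this pattern — not Zymergen’s actual detection system, with purely illustrative numbers — is to call a variant a hit only when its measurement stands clear of the noise in control samples, and to route near-threshold cases to an expert, whose judgments can later become training data:

```python
# Sketch of rule-based hit detection with an escalation path for
# borderline cases. All measurements here are illustrative.
import statistics

controls = [1.00, 0.97, 1.03, 0.99, 1.01, 0.98, 1.02]  # parent-strain replicates
variants = {"v1": 1.05, "v2": 1.21, "v3": 0.95}

mu = statistics.mean(controls)
sigma = statistics.stdev(controls)

def classify(measurement, z_cutoff=3.0):
    """Return 'hit', 'borderline' (route to an expert), or 'miss'."""
    z = (measurement - mu) / sigma
    if z >= z_cutoff:
        return "hit"
    if z >= z_cutoff - 1.0:
        return "borderline"
    return "miss"

calls = {name: classify(m) for name, m in variants.items()}
```

The system applies the same cutoff every run, regardless of who is on call; only the “borderline” bucket still consumes expert time.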

AI has a long memory

AI works best when there is a lot of data to analyze and the analysis must be done consistently and quickly. While humans hold the advantage when it comes to flexibility of thought, creativity, and intuition, computers have two basic advantages over humans: they think extremely fast, and they have virtually unlimited data storage and recall capability. Humans have poor recall of detailed historical information compared to computers. Remembering even the titles of all the books I read ten years ago would be a challenge for me. By contrast, a single server-grade computer could store on disk the full text of the collection of the US Library of Congress — every book ever published in America — twice over.

When it comes to data revealed over time, humans will reach for more recent examples of useful information far more often than the optimal information. In contrast, a computer can search a time-series database with no implicit preference for newer records over older ones. At Zymergen, we use a machine learning algorithm to recommend hits for consolidation: once hits are identified, we need to consolidate a subset of the hits into a new master strain of the microbe; not all hits can be combined equally or at all. Prior to implementing this, our scientists had a tendency to recommend consolidation lineages composed of hits discovered in the most recent three-month period. Our algorithms, however, now often recommend combinations of hits involving ones discovered — and set aside — over a year earlier, creating higher-performing final microbial strains.
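The recency-free search can be sketched as follows — a toy stand-in for the actual algorithm, with hypothetical hit names, effects, and compatibility rules: score every feasible combination on predicted performance, with discovery date carried along but never consulted:

```python
# Sketch of recency-free consolidation search: rank every feasible
# pair of hits by predicted combined effect, ignoring discovery age.
# Hit names, ages, effects, and compatibility are hypothetical.
import itertools

# (hit name, months since discovery, measured effect)
hits = [("h1", 1, 0.10), ("h2", 2, 0.12), ("h3", 14, 0.20), ("h4", 18, 0.15)]
incompatible = {("h1", "h2")}  # pairs that cannot be combined

def predicted_gain(pair):
    """Naive additive model of the combined effect of two hits."""
    (_, _, e1), (_, _, e2) = pair
    return e1 + e2

def best_pair(hits):
    feasible = [p for p in itertools.combinations(hits, 2)
                if (p[0][0], p[1][0]) not in incompatible]
    return max(feasible, key=predicted_gain)

chosen = {name for name, _, _ in best_pair(hits)}
```

Here the top-scoring pair happens to be the two hits discovered over a year ago — exactly the records a recency-biased human search would be least likely to revisit.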

AI can find unintuitive patterns

In certain domains, humans are great at identifying patterns and can surpass the capability of computers. For example, only recently have computer vision systems been able to keep up with or outperform humans at identifying the subject of an image. While this has required immense computing power and sustained computer-science research effort, every toddler with a grasp of language learns to identify the subjects of pictures intuitively.

But in other domains, humans frequently rely on identifying patterns based on hypotheses that are themselves influenced by prior research. Humans may not even test the influence of certain inputs on a signal or pattern, because they reject out of hand the importance of inputs that have not previously been analyzed. A computer system fed large volumes of data, on the other hand, can apply a tool like principal component analysis to every data dimension with equal dispassion.
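Principal component analysis is one such tool; an even simpler stand-in that captures the same dispassion is a uniform correlation screen, where every input column is tested against the outcome identically — including the columns a human would dismiss. The data and column names below are purely illustrative:

```python
# Sketch of a dispassionate screen: Pearson correlation of every
# input column with the outcome, with no prior about which inputs
# matter. Column names and values are illustrative.

inputs = {
    "temperature":   [30, 31, 29, 32, 30, 31],
    "tray_position": [1, 2, 3, 4, 5, 6],   # an "obviously unimportant" input
}
outcome = [1.00, 1.02, 1.05, 1.08, 1.09, 1.12]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

screen = {name: abs(pearson(col, outcome)) for name, col in inputs.items()}
top_signal = max(screen, key=screen.get)
```

In this toy dataset the dismissed input carries nearly all the signal — the kind of result a hypothesis-driven human screen would never have produced, because that column would never have been tested.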

With their ability to run millions of simulations quickly, or analyze years of historical data, AI systems can find non-obvious variables that have predictive power and use them to gain an advantage. At Zymergen we have used data science to discover subtle aspects of our genome manufacturing process that have significant effects on outcomes. For example, by carefully tracking every step of genome manufacturing, we identified how equipment that processes many trays of samples at a time can cause variations in microbe performance several steps downstream, depending on how the trays are stacked inside it. This discovery was only possible because of our ability to analyze thousands of “control” samples that had moved through the equipment over time, along with exact information about their placement.
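The shape of such an analysis can be sketched simply — group control-sample measurements by tray position and compare the group means. The positions and measurement values below are illustrative, not our actual process data:

```python
# Sketch of a position-effect analysis on control samples: group
# measurements by tray position in the stack and compare means.
# Positions and values are illustrative.
from collections import defaultdict
from statistics import mean

# (tray position in the stack, control-sample measurement)
controls = [
    ("top", 0.92), ("top", 0.90), ("top", 0.93),
    ("middle", 1.00), ("middle", 1.01), ("middle", 0.99),
    ("bottom", 1.06), ("bottom", 1.08), ("bottom", 1.07),
]

by_position = defaultdict(list)
for position, value in controls:
    by_position[position].append(value)

position_means = {pos: mean(vals) for pos, vals in by_position.items()}
spread = max(position_means.values()) - min(position_means.values())
```

A systematic spread between positions, visible only across thousands of controls with exact placement records, is precisely the kind of effect no one thinks to look for by hand.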

Our processes have also produced genomic modifications with large impacts that no scientist would have predicted. In one case, we delivered an improved microbe strain, containing several “hits” (beneficial genetic changes), to a client. Nearly two thirds of the hits came from genes believed to be unrelated to the target product; several were in genes of no known function at all. The systematic and dispassionate search performed by our high-throughput system probed areas of the genome that human-driven pattern-seeking heuristics would not have uncovered.

Looking ahead…

AI can bring many advantages to an organization: saving time, scaling knowledge or capabilities, and finding patterns in complex data collected over many years. But these advantages don’t come for free. In part two of this series, we’ll discuss some of the costs and challenges of implementing AI systems.