The art of biological data science: Bringing biology’s great unknowns into focus
How algorithms are ushering in the era of generative biology
When you use Google search, you may have noticed that it predicts what you’re going to type before your fingers hit the keys. Google’s algorithms can suggest whole sentences and even write entire stories. (Full disclosure: this article was written by humans.)
What if we could predict, in the same way, which molecules could make new plastics for foldable phone screens before we ever test them in a lab? And what if, at the same time, we could predict and rule out the molecules that would be toxic to the environment? That’s happening, in fact, and it’s exactly the kind of machine learning that gets Kurt Thorn excited about biofacturing: using nature as inspiration and cells as factories to create breakthrough products.
Kurt is a data scientist who’s not a data scientist. His PhD is in biophysics, and for nearly a decade he was a professor at UCSF, where he led research on advanced microscopy and imaging. He left academia and joined Zymergen because he felt he could make a profound intellectual contribution solving meaningful problems in industry.
“Zymergen has lots of hard problems that need new solutions, so it’s a lot of fun for me,” he says.
Kurt’s first big project involved outlier detection: automating the process of sifting through seas of data to find signals of interest. He worked closely with the data science team (in particular with Amelia Taylor, whose Medium post and YouTube talk on the subject are a great introduction to outlier detection), and he was soon managing a group of scientists at the intersection of biology and computation. When the computation and biology teams were unified, Kurt’s path toward biological data science became clear.
“Mixing biologists with traditional data scientists gives us a really unique perspective on data science,” he says.
Amelia Taylor, Staff Data Scientist at Zymergen, gives a talk about outlier detection at PyBay2018.
From credit cards to biomolecules
Applying data science to biology is a very new idea, but data science itself goes back about twenty years. In fact, most of us have interacted with data science in other domains. Simply put, data science uses algorithms and other computational tools to extract knowledge from large sets of data.
“Visa uses it to process your credit card transactions,” Kurt says. “They have all this transaction data, and computers can sift through it and predict which transactions are legitimate and which are fraudulent.” Kurt says that data scientists who work in human resources can similarly analyze a company’s HR data to reveal gender or racial biases in their hiring or compensation.
At Zymergen, Kurt and his interdisciplinary team are taking what’s been done in these and other tech sectors and applying it to biology. He says that biological data science has only become a field — if you can call it that — in the last five years. What makes their methods and systems even more bleeding edge is that they’re applying them to biofacturing, a practice that, he says, “basically didn’t exist before 2019.”
Now comes the hard part
Applying data science to biology may sound simple, but it’s far from easy. Take the example of “biological optimization,” such as trying to get a yeast or bacterium to produce large amounts of a valuable biomolecule — something that Zymergen does every day.
“Working with microbes isn’t exactly an engineering problem,” as Kurt puts it. That’s because we lack a comprehensive, fundamental understanding of how biology really works. For example, we routinely find ways to improve the performance of a microbe by changing genes that have no known function. Fields like systems and synthetic biology have arisen specifically to try to make the engineering of biology “routine.” But compared to building a house or designing a computer, where the inner workings are well understood and very predictable, biology’s complexity often defies engineering.
For this reason, Kurt’s data science team is a combination of specialized deep domain experts and highly interdisciplinary researchers, all of whom communicate across disciplines like machine learning, software engineering, statistics, and biology to achieve their ends. This requires cross-training individuals from each of these backgrounds to work successfully across disciplinary boundaries. Again, easier said than done.
Then, if you’re able to build such an interdisciplinary dream team, you have three big technical challenges on the way to biological optimization: the size of the search space, the cost of measuring biological data, and biology’s complex and poorly understood systems. This trifecta of troublesome problems makes it hard to apply data science in the same way you would in other realms.
Challenge #1: Biology isn’t just big data — it’s huge data
From a data perspective, what sets biology apart from other realms is its sheer size. As an example, Kurt points to the typical genome of a microbe, which might have 4,000 genes. If we assume there are 1,000 meaningfully distinct variants of each gene, that’s 1,000^4,000 variant combinations to test. That’s a 1 followed by 12,000 zeroes — vastly more than we can ever test in the lab. This is where we can use machine learning approaches to narrow down the search space.
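The arithmetic behind that estimate can be checked in a few lines of Python (a quick sketch using the 4,000-gene and 1,000-variant figures from the text):

```python
import math

genes = 4_000              # genes in a typical microbial genome
variants_per_gene = 1_000  # assumed meaningfully distinct variants per gene

# Total combinations: 1,000^4,000 = (10^3)^4,000 = 10^12,000
zeroes = genes * math.log10(variants_per_gene)
print(f"a 1 followed by {zeroes:.0f} zeroes")  # a 1 followed by 12000 zeroes
```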
“One way we do so is by taking advantage of what we do know about biology,” says Kurt. “We can take all the information we have about connections between genes, proteins, and molecules in a microbe and use this to prioritize genes to edit based on how ‘close’ they are to our desired product.”
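The “closeness” idea Kurt describes could be sketched as shortest-path distance in a gene–protein–metabolite network. The network and node names below are entirely made up for illustration; this is a minimal breadth-first-search sketch, not Zymergen’s actual tooling:

```python
from collections import deque

# Hypothetical network: edges connect genes, enzymes, and metabolites.
network = {
    "geneX": ["enzymeX"],
    "enzymeX": ["geneX", "precursor"],
    "precursor": ["enzymeX", "product"],
    "product": ["precursor"],
    "geneY": ["enzymeY"],
    "enzymeY": ["geneY"],  # a dead end, unconnected to the product
}

def distance_to(target, start):
    """BFS shortest-path length from start to target; None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if node == target:
            return d
        for nxt in network.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

# Prioritize genes closest to the desired product (unreachable genes last).
genes = ["geneX", "geneY"]
ranked = sorted(genes, key=lambda g: (distance_to("product", g) is None,
                                      distance_to("product", g) or 0))
print(ranked)  # ['geneX', 'geneY'] -- geneX is 3 hops from the product
```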
Kurt’s team can also use this kind of approach the way a Netflix recommendation algorithm works: if your microbe ‘liked’ changes to one gene, their tools can recommend similar genes that, if changed, might yield a better-performing microbe.
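As an illustration of that recommendation idea (the gene names and effect numbers below are entirely hypothetical), one simple version is item–item similarity over past edit outcomes:

```python
import math

# Hypothetical history: for each gene edited, the measured effect of that
# edit across four strain backgrounds (toy numbers, not real data).
edit_effects = {
    "geneA": [1.2, 0.8, 1.1, 0.9],
    "geneB": [1.1, 0.9, 1.0, 1.0],
    "geneC": [-0.5, 0.2, -0.3, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two effect profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# If edits to geneA helped, recommend the gene whose profile is most similar.
liked = "geneA"
recs = sorted(
    (g for g in edit_effects if g != liked),
    key=lambda g: cosine(edit_effects[liked], edit_effects[g]),
    reverse=True,
)
print(recs[0])  # geneB -- its effect profile most resembles geneA's
```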
Challenge #2: Large, quality biological datasets are expensive
As it turns out, most of the things we really care about are expensive and time-consuming to measure. An example from our everyday work is testing how microbes will perform when we move them from small-scale lab experiments to large-scale, commercial fermentation settings. How much product does it make? How fast? How efficiently? Kurt’s team seeks the optimal ways to answer these questions, balancing small amounts of highly meaningful data and larger amounts of less meaningful data.
“We use bench-scale models of industrial fermentation, scaling down from 100,000 liters to 1 liter,” Kurt says. “Amazingly, what we measure at the 1-liter scale nearly perfectly predicts performance at the industrial scale. But running a single experiment in a 1-liter fermenter can cost $1,000 or more, which means measuring the properties of 1,000 strains could cost a million dollars, before we do any replicates or controls.”
In other words, Kurt says, it’s probably not cost-effective to get every data point we might like.
The solution — at least part of it — is to go smaller. “We can develop further scaled-down models that let us run more experiments. We can do high-throughput screening in 96-well plates. This cuts our cost to, say, $1 an experiment, but these measurements are not quite as predictive of performance at the industrial scale.”
A 96-well plate, one of the many levels at which data and biological scientists can model their work as they scale up from a single cell design to hundreds of thousands of liters in industrial fermentation settings.
The data science team can also incorporate data from workflows developed by the Zymergen Exploration & Development team, or ZED, which has gone a step further with flow cytometry and other ultra-high-throughput experiments that can screen millions of strains at once.
The smaller you go, the more distantly related the measurements are to industrial-scale performance, which is the ultimate goal. This tradeoff holds across biofacturing — whether it’s lab-grown meat, proteins for food, or any of the myriad other places biology is used. Kurt believes data science plays a key role in making this work cost-effective: by developing models that best predict large-scale performance from high-throughput data, and by ensuring we extract maximum information and value from the expensive large-scale experiments.
“We approach testing with both a micro and macro lens,” Kurt says. “This is not simply a big data problem, and in many cases, it’s actually a small data problem.” In the lab, he explains, sometimes the best data is limited to examples only in the hundreds. “In those cases, we need to have good statistical practices and expert statisticians to extract the most relevant signals from the limited number of data points we have.”
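One standard small-data practice a statistician might reach for here (a generic sketch using made-up numbers, not Zymergen’s actual pipeline) is bootstrapping a confidence interval rather than trusting a single point estimate from a handful of expensive runs:

```python
import random
import statistics

random.seed(0)

# Hypothetical titers (g/L) from a small, expensive bench-scale experiment.
titers = [4.1, 3.8, 4.5, 4.0, 3.9, 4.3, 4.2, 3.7]

# Bootstrap: resample with replacement many times to estimate how much the
# mean would vary if we could afford to rerun the experiment.
boot_means = sorted(
    statistics.mean(random.choices(titers, k=len(titers)))
    for _ in range(10_000)
)
lo, hi = boot_means[250], boot_means[9_750]  # ~95% interval
print(f"mean {statistics.mean(titers):.2f} g/L, 95% CI ({lo:.2f}, {hi:.2f})")
```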
Challenge #3: Large swaths of biology are still dark matter
There are parts of biology that are well understood. The biologists on Kurt’s team are tuned into the biological literature, for example, and part of their job is to help guide the team toward problems that haven’t already been solved.
“It’s really easy to train an ML model in a naive way, only to rediscover something any biologist could have told you,” Kurt says.
For everything else, the team has to build tools to interrogate biology’s vast unknowns.
Kurt explains: “Biology is very complicated. We don’t understand that much about any organism — even yeast or E. coli, the best-studied lab organisms in the world — because they’re so extraordinarily complex. So, trying to understand organic matter, let alone design it, is a very broad space. It’s still a big black box.”
Machine learning helps make these databases more searchable and tractable for biologists, whose human understanding and instincts are paired with the power of AI to create a kind of investigative ecosystem.
Generative biology: A glimpse into the future
“Our algorithm is getting really good at predicting biological mutations, in much the same way Google’s algorithm predicts language,” Kurt says, “and there’s actually very recent work that shows you can learn the grammar and syntax of proteins.”
Just as some researchers are using AI to predict things like COVID mutations that may go unrecognized by the human immune system, Zymergen hopes to learn the “language” of biology well enough to classify and predict things like the best biosynthetic pathways to make new molecules, ways to rapidly identify and design enzymes that carry out the biosynthetic reactions we want, and ways to optimize the hosts to maximize production speed and efficiency.
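A toy way to see what “learning the grammar of proteins” means (a deliberately tiny sketch with invented peptide fragments; real protein language models are vastly larger): count which amino acid tends to follow which in known sequences, then score how “grammatical” a new sequence looks.

```python
from collections import defaultdict

# Toy training set of short peptide fragments (hypothetical sequences).
fragments = ["MKTAY", "MKTAL", "MKSAY", "MKTAV"]

# Count bigram transitions: which residue follows which.
counts = defaultdict(lambda: defaultdict(int))
for seq in fragments:
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1

def score(seq):
    """Sum of observed-transition counts: higher = more 'grammatical'."""
    return sum(counts[a][b] for a, b in zip(seq, seq[1:]))

print(score("MKTAY"))  # 12 -- every transition was seen in training
print(score("YATKM"))  # 0 -- the reversed sequence uses unseen transitions
```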
But collating the language of biology into ever-growing databases creates its own challenge. Public databases contain over 250 million unique protein sequences, and Zymergen has another 400 million proprietary sequences on top of that, all containing valuable information about protein function. Extracting that information requires the right techniques and the right people: machine learning experts, software engineers, and data engineers equipped to tackle “Big Data.”
So, the next time you use Google Search or see your iPhone serve predictive text suggestions, imagine a world where advancements in biology happen just as fast. Where proteins and vaccines are served up by algorithms, queried by scientists, then downloaded into production. Biofacturing is bridging the gap between machine learning and human expertise, ushering in a new era of generative biology.
Check out our technical blog for more in-depth stories about Zymergen’s science and technology.
Kurt Thorn is Senior Director of Data Science at Zymergen. He has a PhD in biophysics and joined Zymergen as a biologist before moving to Data Science. Prior to Zymergen, he was a professor at UC San Francisco, where he directed the Nikon Imaging Center and ran a lab developing novel biochemical assays.