How machine learning is changing the way we design products
Marian Farah’s team is homing in on better chemicals and materials, without the AI hype.
Marian Farah leads the Experiments and Optimization Team, which is pushing the envelope in applying machine learning (ML) techniques to make high-performance products with biology.
Machine learning may be Marian Farah’s expertise, but mathematics was her first true love.
“I think I learned geometry before I learned to read and write,” she says. “My father was a mathematics teacher in our farming village in Syria, and he showed me that math was fun. I was solving high school mathematics problems when I was in elementary school.”
After high school, Marian came to the United States with her family. From humble immigrant beginnings, she worked to earn her Ph.D. in Bayesian Statistics and embark on a career at the forefront of computation and natural sciences. At Zymergen, Marian is pushing the envelope in applying machine learning (ML) techniques to make high-performance products with biology.
“I can’t think of a better way to apply machine learning to make the world a better place,” she says.
Love the learning, but avoid the hype
Marian leads Zymergen’s Experiments and Optimization Team, whose primary mission is to build ML tools that accelerate the discovery of novel molecules and materials that can be used to make many of Zymergen’s products. The team includes members with expertise in biology, genomics, chemical engineering, statistics, and machine learning.
Superheroes of the Experiments & Optimization Team
Crystal Humphries, Senior Data Scientist
Superpower: Learning and Innovating
Favorite quote: “We cannot solve our problems with the same thinking we used when we created them” —Albert Einstein
Ehsan Moharreri, Data Scientist II
Superpower: Screening yet-to-be-made materials
Favorite quote: “Far better an approximate answer to the right question, than an exact answer to the wrong question.” —Tukey
Kavindri Ranasinghe, Data Science Intern
Superpower: Bending time to meet impossible deadlines!
Favorite quote: “Your imagination is your preview of life’s coming attractions.” —Albert Einstein
Laurel Wright, Senior Data Scientist
Superpower: Discoverer of truths and navigator of innumerable browser tabs
Favorite quote: “Make it work.” —Tim Gunn
Patrick Aboyoun, Staff Data Scientist
Superpower: Turning thoughts into actions
Favorite quote: “When confronted with multiple solutions to a complex problem, choose the simplest one.” —William of Occam
Tim Gushue, Staff Data Scientist
Superpower: Causal relationship therapist
Favorite quote: “You can’t fix by analysis what you bungled by design.” —Light, Singer and Willett
As passionate as she is about ML, Marian will be the first to tell you that the story of “artificial intelligence” has been vastly oversold as a solution to everything in recent years. “I look to my heroes in this field — Andrew Ng, Michael I. Jordan, Fei-Fei Li, and others — who say we need to have some humility, stop calling everything AI, and avoid the AI hype.”
At Zymergen, her more grounded approach is based on anchoring ML in science and coupling it with experiments and mechanistic models for maximum effect.
“There are many ways we learn in biotechnology, and each one offers unique advantages,” Marian says. Lab experiments are the gold standard, but they are too time-consuming and expensive to test the vast design space that biology offers. Mechanistic models implemented as computer simulations can also be very informative, but they too take a lot of time. More crucially, those simulations require a comprehensive, unequivocal understanding of the physical mechanisms involved in order to be effective and accurate. “On the other hand, ML is data-driven, strikes a good balance between accuracy and speed,” Marian says, “and it can help guide us toward the best use of our resources.”
Machine learning is a game of consequences, and data is a key player
Here, Marian points out that predictions from ML models are only as good as the training data used to build the model. She gives the example of training a machine to detect melanoma using thousands of images taken of skin spots and other blemishes, where each image is labeled: melanoma or not. The machine “learns” from the data and can then “look” at a new image and provide a prediction as to whether melanoma is present.
But if the training data is not representative of the population at large, the model may have a bias and give unreliable results. In this example, if the training data is based mostly on people with fair skin, then the model will be more reliable at predicting cancers in people with fair skin but less accurate in people with other skin tones. Additionally, if the images in the training set came from hospitals with different imaging equipment, the ML algorithm may begin to detect differences due to the type of equipment and use that to inform its predictions. Thus, we could end up in a situation where effectively the ML algorithm uses hospital information — not the skin spot itself — for prediction.
Marian says, “To guard against bias, I usually ask: What’s the origin story of this dataset? What are the protocols under which it was generated? And is the data diverse enough to achieve the kinds of predictions we hope for?” Ensuring that the data have the right quality and quantity for the job is top-of-mind for Marian. Additionally, she emphasizes that validating ML algorithms against a variety of test datasets is crucial in helping guard against bias.
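One way to make that kind of validation concrete is to break a model's accuracy out by subgroup rather than reporting a single number. The sketch below is illustrative only: the labels, predictions, and group names are made-up data, and accuracy_by_group is a hypothetical helper, not a tool described by Marian's team.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: fraction of correct predictions} for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


# Made-up held-out data: 1 = melanoma, 0 = not melanoma.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
groups = ["fair", "fair", "fair", "fair", "other", "other", "other", "other"]

per_group = accuracy_by_group(y_true, y_pred, groups)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
gap = max(per_group.values()) - min(per_group.values())

print("overall:", overall)
print("per group:", per_group)
print("largest gap between groups:", gap)
```

In this toy data the overall accuracy of 0.75 hides a gap: the model is right every time for one group and only half the time for the other. Running the same breakdown by hospital or imaging device is one way to check whether a model is keying on the equipment instead of the skin spot.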
To learn is to map
The same concepts apply to discovering breakthrough molecules and materials. But instead of mapping visual images to skin cancer, Marian’s team maps the chemical structures of molecules to their physical properties in the real world. Combining massive public and proprietary databases, the team builds complex models to generalize how a given structure relates to a specific set of desired properties. This results in an equation that looks like this:
$$y_i = f(x_{i1}, \ldots, x_{iM}) + e_i$$
where $i$ indexes a material, $y_i$ is its property (or a combination of properties), $x_{i1}, \ldots, x_{iM}$ are its features, $f$ is the function that maps those features to the property, and $e_i$ is the measurement error. When searching for a molecule with a specific combination of properties (temperature tolerance, flexibility, transparency, etc.), researchers specify the desired values for these properties and let the computer suggest a list of promising candidates. This is done either by screening a large library of molecules using the estimated $f$, or by using its inverse, when possible, to directly identify a winning molecular structure.
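The screening route can be sketched in a few lines. This is a minimal illustration under stated assumptions: f_hat is a hypothetical linear stand-in for the estimated f (a real model would be learned from data and far more complex), and the molecule names and feature values are invented.

```python
def f_hat(features):
    """Hypothetical stand-in for the estimated f: features -> predicted property."""
    weights = [2.0, -1.0, 0.5]  # assumed values, for illustration only
    return sum(w * x for w, x in zip(weights, features))


def screen(library, target, top_k):
    """Rank library entries by how close their predicted property is to target."""
    scored = [(abs(f_hat(x) - target), name) for name, x in library.items()]
    scored.sort()
    return [name for _, name in scored[:top_k]]


# Invented library: each molecule is described by M = 3 numeric features.
library = {
    "mol_A": [1.0, 0.0, 2.0],
    "mol_B": [2.0, 1.0, 4.0],
    "mol_C": [0.5, 2.0, 0.0],
    "mol_D": [2.0, 0.0, 1.0],
}

shortlist = screen(library, target=4.8, top_k=2)
print(shortlist)
```

The shortlist contains the two molecules whose predicted property lies closest to the requested value. Nothing here has been tested in a lab; the ranking only suggests which candidates to make and measure first.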
Hyaline ZYM0101 electronic film: A case study
So how does this work in real life? The story of Hyaline ZYM0101, an electronic film now in development, shows how this approach might soon help us go from natural molecules to better mobile phones.
If you could peel apart your phone’s touch screen, you’d find that it is made of many layers, including a number of polymer films, each with a specific job. But phone manufacturers have hit a snag: those polymer films are made from the limited handful of petrochemical molecules available, which lack the thermal, mechanical, and optical properties that phone designers need to make folding screens, take advantage of the latest ultra-high-resolution microLED technology, or pursue other futuristic display applications.
Not so far-fetched: By scouring the diversity of natural chemicals, materials innovators and product designers hope to produce devices like this in the not-so-distant future.
“To design a stronger, thinner, clearer, and more temperature resistant film than anything out there, we specify the desired physical properties in our model to see which candidates are most promising,” says Marian. That is the goal of ZYM0101, which is exceptionally clear like Zymergen’s first film, Hyaline Z2, but also aims for a unique balance of strength, thermal stability, and flexibility.
Marian’s team meets regularly with the film product team as they formulate and test new candidates. The data from those tests help inform the next round of candidates. With every cycle of experiments, the ML team fine-tunes the model and develops new measures of performance based on observed relationships, and the algorithms get smarter and smarter.
Beyond designing experiments and developing and deploying ML models, Marian’s team also develops intuitive tools to enable quick decision-making. For example, the team created a dashboard that lets product designers and lab scientists narrow tens of thousands of potential polymers, none yet tested in the lab, down to the dozens of candidates predicted to come closest to the desired properties.
The art of choosing your battles
Two more areas in machine learning where Marian’s team is pushing the envelope are active learning and transfer learning.
Active learning helps overcome one of the biggest challenges in materials discovery: the search space is far too big for a conventional trial-and-error approach. Active learning works much like the typical design-build-test cycle, but with an important difference: algorithms recommend the subsequent experiments that will best fill gaps in the dataset and so improve the algorithm’s ability to make good predictions. By efficiently guiding experiments toward the most informative molecules, active learning also helps product developers get to molecules with desired properties faster.
This figure, provided by Crystal Humphries, illustrates how we can tune active learning algorithms to accelerate discovery.
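A toy version of that loop might look like the following. It is a sketch, not the team's pipeline: run_experiment is a hypothetical stand-in for a lab measurement, each candidate is reduced to a single made-up feature, and distance to the nearest measured candidate is used as a simple proxy for model uncertainty, which is an assumption made here to keep the example short.

```python
def run_experiment(x):
    """Hypothetical lab measurement; stands in for a real experiment."""
    return (x - 3.0) ** 2


def most_informative(pool, measured):
    """Pick the untested candidate farthest from every measured candidate.

    Distance to the nearest measured point is an assumed proxy for how
    uncertain a model would be about that candidate.
    """
    return max(pool, key=lambda x: min(abs(x - m) for m in measured))


# Untested candidates, each described by one made-up feature value.
pool = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Start with a single measured candidate.
measured = {0.0: run_experiment(0.0)}
pool.remove(0.0)

# Three rounds of: recommend an experiment, run it, add the result.
for _ in range(3):
    x_next = most_informative(pool, measured)
    measured[x_next] = run_experiment(x_next)
    pool.remove(x_next)

print(list(measured))
```

Each round sends the next experiment to the least-explored part of the candidate space, so a small number of measurements covers the range far more evenly than testing candidates in list order would. Production systems typically replace the distance proxy with uncertainty estimates from the model itself and often trade exploration off against the predicted property value.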
In transfer learning, researchers apply knowledge gained in one problem to another. In Marian’s work, this often means leveraging knowledge (features, weights, model architecture, etc.) from previously trained models on one class of molecules or properties to train new models for other classes of molecules or properties. The data can come from multiple sources (such as proprietary or public databases), even if they have different formats. This helps to build larger, more complete datasets from which the algorithm can make qualified predictions.
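One simple form of this idea is to reuse the weights of a previously trained model as the starting point for a new one. The sketch below is a deliberately tiny illustration, assuming a one-parameter model y = w * x and invented data: the slope learned on a data-rich source class is used as a prior when fitting a data-poor target class. The function fit_slope and its strength parameter are hypothetical names chosen for this example.

```python
def fit_slope(xs, ys, prior_slope=0.0, strength=0.0):
    """Least-squares slope for y = w * x, shrunk toward prior_slope.

    Minimizes sum((y - w * x) ** 2) + strength * (w - prior_slope) ** 2.
    With strength = 0 this is ordinary least squares through the origin.
    """
    num = sum(x * y for x, y in zip(xs, ys)) + strength * prior_slope
    den = sum(x * x for x in xs) + strength
    return num / den


# Source class of molecules: plenty of (invented) data.
src_x = [1.0, 2.0, 3.0, 4.0]
src_y = [2.0, 4.0, 6.0, 8.0]
w_source = fit_slope(src_x, src_y)

# Target class: only two (invented) measurements.
tgt_x = [1.0, 2.0]
tgt_y = [3.0, 4.0]

w_scratch = fit_slope(tgt_x, tgt_y)
w_transfer = fit_slope(tgt_x, tgt_y, prior_slope=w_source, strength=5.0)

print("source slope:", w_source)
print("target, from scratch:", w_scratch)
print("target, with transfer:", w_transfer)
```

The transferred estimate lands between the source slope and the from-scratch target slope. How hard to pull toward the source model (strength here) is a judgment call: too little wastes the prior knowledge, while too much hides real differences between the two classes of molecules.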
“Initial results coming out of our active learning and transfer learning pipelines are very exciting, and they give me great confidence in our ability to make never-before-imagined molecules and products faster and cheaper,” Marian says.
So, will machine learning save the day?
Her optimism notwithstanding, Marian is a pragmatist. She is intensely focused not on potential but on actual results.
“No matter what an algorithm or a fast-talking AI snake oil salesman tells you, success can only be measured and achieved in the lab,” she says.
Marian’s antidote to AI hype is simply this: trust. Marian believes we, and the broader scientific community, need to build trust in three areas: our data, our algorithms, and the people making choices over data and algorithms.
“I’m humbled and proud to work with some of the brightest, most thoughtful people in the field, people who are serious about our models and predictions and use them to advance our mission without embellishment,” she says. “We must be creative, and we must be honest about the limitations of our methods.”