Robots, Biology, and Unsupervised Model Selection
Our end-to-end approach for model selection for unsupervised outlier detection algorithms, with an emphasis on parameter tuning.
At Zymergen, we integrate robotics, software, and biology to predictably and reliably improve microbial strains through genetic engineering. We design large numbers of microbe variants, then test their performance in high throughput screens. We use data science to rapidly analyze data from these thousands of variants, predict performance in larger tanks, and identify statistically significant improvements.
Our robots allow us to run hundreds of experiments in parallel. Scientists need timely and reliable processing of this data to make the next round of designs, allowing them to iteratively improve strain performance.
One tool we use to clean up experimental observations is unsupervised outlier detection. Model selection, including parameter tuning, is well studied for supervised and even semi-supervised methods, but there are few resources available for unsupervised algorithms (i.e., those operating on unlabeled data). This post presents our end-to-end approach for model selection in Python, with an emphasis on parameter tuning, for unsupervised outlier detection algorithms. We discuss both the challenges of dealing with unlabeled data and how we tackle parameter tuning in that context.
Outlier Detection: An Unsupervised Learning Problem
We test thousands of microbe variants (strains) weekly, and we want scientists making hard scientific decisions, not manually inspecting data to speculate about outliers and process bias. So we built a machine learning pipeline that handles this work.
Internally, we break the process into several steps: “strain build” is where we design and build the large numbers of strains and grow them in plates. High throughput screening (HTS) is where we test the performance of those strains at plate scale and gather data. Then we use machine learning to identify outliers, handle process bias through normalization, and identify improved strains. Finally we use that data to inform the next round of predictive strain design, starting the cycle over again.
In HTS, we grow individual replicates of a strain in wells of a microtiter plate. Errors can arise in any physical process; for example, a pipette tip on a robot may get blocked, or a well may be contaminated. The data from affected wells are outliers, not representative of the “true” distribution of strain performance. One way to identify outliers is human inspection, but that approach doesn't scale. Reproducibility is also an issue: different humans give different answers, and even the same human might give different answers at different times. As an example of human bias: in the graph below, are the data in the blue circles outliers or not?
A solution to these challenges is to train a machine learning (ML) model that predicts outliers from the data. The most common approaches for training ML models require the training data to be labeled; this is known as supervised learning. In our case, these approaches would require that the outliers be marked, and the amount of data that must be labeled for this type of training is large. Many of the more famous uses of supervised learning require a lot of human labor to label the training data. For example, successful image recognition starts with large datasets labeled by humans, like ImageNet and Open Images, each of which provides more than ten million hand-labeled images.
For us to use supervised learning we would have to have scientists hand label large data sets. We prefer to use a machine learning algorithm that doesn’t require labels; this is known as unsupervised learning.
Models and Hyperparameters
There are numerous off-the-shelf unsupervised outlier detection algorithms. Four available in Python's scikit-learn are:
- Elliptic Envelope using Mahalanobis Distance (aka Robust Covariance)
- One class SVM (Support Vector Machine)
- Isolation Forest
- Local Outlier Factor
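To make the comparison concrete, here is a minimal sketch of running all four detectors in scikit-learn on invented 2-D data; the `contamination` and `nu` values are illustrative, not recommendations:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic data: a tight inlier cluster plus a few far-away outliers.
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(8, 1, size=(5, 2))])

detectors = {
    "robust_covariance": EllipticEnvelope(contamination=0.05, random_state=0),
    "one_class_svm": OneClassSVM(nu=0.05),
    "isolation_forest": IsolationForest(contamination=0.05, random_state=0),
    "local_outlier_factor": LocalOutlierFactor(contamination=0.05),
}
for name, det in detectors.items():
    labels = det.fit_predict(X)  # +1 = inlier, -1 = outlier
    print(name, (labels == -1).sum(), "outliers flagged")
```

All four share the same `fit_predict` interface; the interesting differences are in which borderline points each algorithm flags.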
Each of these is actually a class of models: like most machine learning algorithms, they all involve hyperparameters, and different choices of hyperparameters lead to different models.
Our approach to unsupervised model selection can be used to select between different classes of machine learning models, or to choose a model within a class by learning which hyperparameters to use. We illustrate our solution by learning hyperparameters for elliptic envelopes using the Mahalanobis Distance.
The Mahalanobis Distance (MD) is given by

MD(x) = √((x − μ)ᵀ Σ⁻¹ (x − μ))

for each point x, where μ and Σ are parameters inferred from the data. To determine whether any point is an outlier, we compare its MD to some threshold. For example, we might set

MD(x) > 10 ⟹ x is an outlier
Then 10 is a hyperparameter value that defines the size of (or distance across) an elliptic envelope. Points whose MD falls inside the elliptic envelope’s boundaries are inliers and those outside are outliers.
In the figure above, we see one ellipse (red) where we used robust estimators to learn μ and Σ, and other ellipses (blue) where we used maximum likelihood estimators (MLE) with a range of hyperparameters (MLE vs. robust estimators is analogous to the mean vs. the median). This brings us to the two key decisions we need to make.
- The type of estimator for the center μ and the covariance Σ. Figure 4 illustrates that with the maximum likelihood estimator (MLE), no matter what we set the MD threshold to, we won’t get a good separation between the inliers and the outliers. Therefore, we need to use estimators that are robust to the outliers. We use the minimum covariance determinant for Σ and its associated location for μ.
- The MD threshold. Our example used 10; is that the right value? In this example, that distance is precisely the hyperparameter we need to figure out. The rest of this post is about how we use the data to choose that value.
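For the first decision, scikit-learn's `MinCovDet` gives us the robust μ and Σ directly, and the second decision then becomes a simple cut on the resulting distances. A minimal sketch on synthetic data (the data and the threshold of 10 are illustrative):

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),    # the "true" distribution
               rng.normal(10, 0.5, size=(10, 2))]) # e.g., contaminated wells

# Robust estimates of the center (mu) and covariance (Sigma) via the
# Minimum Covariance Determinant.
mcd = MinCovDet(random_state=0).fit(X)
md = np.sqrt(mcd.mahalanobis(X))  # mahalanobis() returns squared distances

THRESHOLD = 10  # the hyperparameter we need to tune
is_outlier = md > THRESHOLD
print(is_outlier.sum(), "points flagged as outliers")
```

Because the MCD ignores the contaminated points when estimating μ and Σ, their distances end up far outside the ellipse instead of dragging it toward them.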
Unsupervised Model Selection
When the data are labeled, as in supervised learning or supervised model selection, the most common way to select hyperparameters is cross-validation, traversing the parameter space with either a brute-force grid search or an optimization algorithm such as gradient descent or Bayesian optimization. This can all be done easily using scikit-learn, and many online resources explain how it works (you can get started with Wikipedia!).
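For contrast, here is what that supervised workflow looks like in scikit-learn; the model and grid are arbitrary toy choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# A labeled toy dataset stands in for data where the "right" answers are known.
X, y = make_classification(n_samples=200, random_state=0)

# Cross-validated brute-force grid search over a small hyperparameter grid.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

None of this works without the labels `y`, which is exactly what we lack.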
This becomes a much harder problem when the data aren’t labeled. We can’t just test a bunch of hyperparameters and see which one is the best, because we don’t know the “right” answer. To solve this, we set out to obtain a performance metric without labels.
“On the internal evaluation of unsupervised outlier detection” (Marques HO, Campello RJGB, Zimek A, Sander J)¹ proposes a metric that is high if the predicted outliers are “well separated” from the predicted inliers. Outliers should come from a different data generating process than inliers: that’s what makes them outliers. Thus we expect them to be separated from the inliers in space, just as they are with the robust ellipse shown in Figure 4 above.
This diagram illustrates the approach proposed by Marques et al. for hyperparameter tuning in the context of outlier detection.
For any outlier detection model you want to compare:
- Run the model on your data.
- Treat the results of that process as a labeled dataset.
- Run a classifier that provides probabilities on those labeled data, such as Kernel Logistic Regression in the diagram above. These probabilities serve as a metric for how far the outliers are from the decision boundary. For example, if the probability that a point is an outlier is high, then the point is more separated from the inliers than if the probability is low.
- Coalesce these probabilities into a metric we call the Chance Adjusted Metric (CAM), which we adapted from the “adjusted IREOS” in Marques et al. Higher values of the CAM correspond to better separation between inliers and outliers. The hyperparameters providing the largest value of the metric are the ones to use.
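The four steps above can be sketched end to end. This is not our production code: the kernel logistic regression is approximated here with a Nystroem kernel map feeding an ordinary logistic regression, and the final combination of M(S) and M(X) is our simplified reading of the chance adjustment described below:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def cam_score(X, contamination, random_state=0):
    """Illustrative helper: fit one candidate detector, then score how
    well its predicted outliers separate from its predicted inliers."""
    # Steps 1-2: run the detector and treat its output as labels.
    labels = EllipticEnvelope(contamination=contamination,
                              random_state=random_state).fit_predict(X)
    y = (labels == -1).astype(int)  # 1 = predicted outlier
    if y.sum() in (0, len(y)):      # degenerate labeling: nothing to separate
        return -np.inf
    # Step 3: kernel logistic regression, approximated by a Nystroem
    # kernel map followed by plain logistic regression.
    klr = make_pipeline(Nystroem(random_state=random_state),
                        LogisticRegression(max_iter=1000))
    klr.fit(X, y)
    p = klr.predict_proba(X)[:, 1]  # P(outlier) for every point
    # Step 4: coalesce into a chance-adjusted metric (simplified CAM).
    m_X, m_S = p.mean(), p[y == 1].mean()
    return (m_S - m_X) / (1 - m_X)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(150, 2)),
               rng.normal(7, 0.5, size=(8, 2))])
for c in (0.01, 0.05, 0.2):
    print("contamination", c, "-> CAM", round(cam_score(X, c), 3))
```

Running this over a grid of candidate hyperparameter values and keeping the argmax is the tuning loop.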
Marques et al. “advocate the use of a maximum margin classifier” and indicate the importance of a nonlinear classifier, like nonlinear SVMs or Kernel Logistic Regression (KLR), because we expect the decision boundary to be nonlinear and we need a classifier that quantifies the distance of data from the boundary. We chose KLR because it is easy to implement, handles the nonlinear classification problem at hand, and provides classification probabilities as a fundamental part of the model.
The CAM compares the probabilities of the full set of points, M(X), with those of the outliers, M(S), so we have to calculate those probabilities before we can compute our metric. We include a simplified version of the metric to illustrate the main idea; details can be found in Marques et al. Let

X = {x₁, x₂, …, x_N}

be the full set of data, where some points were labeled as outliers by the first step in the diagram above. Let S ⊂ X be the subset of n points in X labeled as outliers, and let p(xᵢ) be the probability provided by the classifier. Then M(X) is the average of all the probabilities,

M(X) = (1/N) Σᵢ₌₁ᴺ p(xᵢ),

and M(S) is the average of the probabilities for points labeled as outliers,

M(S) = (1/n) Σ_{xᵢ ∈ S} p(xᵢ).
This is a complicated metric, and it helps to remember our goal of measuring how separated the inliers are from the outliers. If an inlier and an outlier are close in (non-linear) space, then the KLR will struggle to predict the class of both points. If this happens across many points, then the distributions of the probabilities over S and X will be very similar and M(S) will be roughly the same as M(X), making the CAM close to 0 or even negative. In contrast, if the points are better separated, then it is easier for the KLR to classify the points. The distributions will differ more, M(S) will be closer to its maximum value of 1, M(X) will be roughly the expected value (assuming the n labeled outliers were assigned at random), and the CAM will be positive and larger than when the points are less well separated.
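A tiny worked example with invented probabilities makes this behavior concrete (the combination of M(S) and M(X) below is our simplified version of the chance adjustment):

```python
import numpy as np

p = np.array([0.05, 0.10, 0.05, 0.10, 0.95, 0.90])  # classifier P(outlier)
is_outlier = np.array([False, False, False, False, True, True])

m_X = p.mean()              # average over all points
m_S = p[is_outlier].mean()  # average over predicted outliers only
cam = (m_S - m_X) / (1 - m_X)
print(round(cam, 3))  # 0.883: well separated, so the CAM is large
```

If the last two probabilities were 0.4 instead of 0.95 and 0.90, M(S) would shrink toward M(X) and the CAM would drop toward 0.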
When we first implemented the steps described above, we ran into further complications due to the complexity of our data generating processes. When we test strains, samples of each strain (microbial variant) are grown in wells. There are many wells in a microtiter plate, many plates within an experiment, and experiments are run at different times: a rich hierarchy. Observations for a single strain may come from different plates and different experiments. We expected the resulting CAM values to be robust to both differences in strains and process variation from this complex hierarchy. If this first implementation had worked as described in the paper, we would expect results like these, where the circled largest CAM value indicates which hyperparameters to choose.
Further, we expected that if we used hyperparameters corresponding to the largest CAM value the results of the outlier detection algorithm applied to our data would look intuitive as shown below.
After the initial implementation, our results looked nothing like these expected plots! For some experiments, the hyperparameters suggested by the highest CAM marked all points as inliers or all as outliers. When we looked at the results per strain, we realized that our problem was much more complicated. The plot below shows CAM by hyperparameter for six different strains: four of the strains reach their maximum CAM at different values of the hyperparameter, and for the other two we can't do better than random.
Given our throughput, we can’t choose the hyperparameter for each strain, or each experimental run — the data sizes, time, cost, and general scalability issues make this infeasible. We need an outlier detection algorithm that can generalize.
We had to come up with a way to think about the data on all these levels (strain, experiment, plate, well, etc.). Using what we know of our actual manufacturing processes, we modified the Marques et al. approach to add several decision points and tests for the performance of the metric. The diagram above illustrates some of these decision points and optimizations, like computing the metric for each strain, in each plate, and across multiple experiments, then normalizing and coalescing the CAMs to get a single generalizable metric. This hierarchical CAM approach was successful: it made the CAM robust to variations both in strains and in our manufacturing process, identifying hyperparameters that work well for us week after week across a wide range of strains. Further, we have used this adjusted CAM to compare different underlying outlier detection algorithms, not just to select hyperparameters.
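We can't share the production pipeline, but a skeletal version of that roll-up, with invented column names and toy numbers, looks something like this:

```python
import pandas as pd

# Illustrative table: one CAM per (strain, experiment, plate) for a single
# candidate hyperparameter value. All names and values are hypothetical.
cams = pd.DataFrame({
    "strain":     ["A", "A", "A", "B", "B", "C"],
    "experiment": [1, 1, 2, 1, 2, 2],
    "plate":      [1, 2, 1, 1, 1, 2],
    "cam":        [0.8, 0.7, 0.9, 0.4, 0.5, 0.6],
})

# Coalesce upward through the hierarchy: average plates within an experiment,
# experiments within a strain, then strains into one number, so that no
# single strain or experiment dominates the final metric.
per_experiment = cams.groupby(["strain", "experiment"])["cam"].mean()
per_strain = per_experiment.groupby(level="strain").mean()
overall = per_strain.mean()
print(round(overall, 3))  # 0.625
```

Repeating this for each candidate hyperparameter value yields one generalizable score per candidate, and we pick the candidate with the highest score.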
Before we sign off, we should acknowledge that KLR has hyperparameters of its own. We follow Marques et al., who show that this approach works well if you average over a range of standard values for those parameters.
Tuning unsupervised outlier detection algorithms is hard and a bit neglected in the literature. In this post, we shared how we tackled the problem. We learned that there is a useful metric out there (CAM). We also learned that it didn’t work “out of the box” — and the modifications we had to make pushed us to rethink how we work with our data.
Because our robots allow us to run hundreds of experiments in parallel, rapid and robust processing of this data is critical. It provides scientists with the information they need to make the next round of designs, iteratively improving strain performance. Our revised CAM metric gives a solid foundation for making sure scientists have accurate, outlier-free data in a timely manner, and ensures our process for getting them that data is scalable.
Amelia Taylor is a Staff Data Scientist at Zymergen.
¹ Marques HO, Campello RJGB, Zimek A, Sander J (2015) On the internal evaluation of unsupervised outlier detection. In: Proceedings of the 27th international conference on scientific and statistical database management (SSDBM), San Diego, pp 7:1–12. doi:10.1145/2791347.2791352