It takes a trained eye to determine whether you’ve succeeded in turning a skin cell into a stem cell, or to distinguish between two related cell populations based on a handful of their surface markers. And even when such distinctions become obvious, looking for them in thousands of samples gets tedious. The appeal of machine learning is that a computer program can take over this heavy lifting for you, and do it even better by seeing what you can’t.
Machine learning aims to make accurate predictions from large sets of data based on prior training using a smaller set of examples. In cell biology, this could mean, for example, being able to predict a cell’s phase or its identity based on its shape, size, or staining pattern.
Cell biology will increasingly rely on machine learning and other computational approaches as automated fluorescence microscopy (high-content screening) continues to capture massive sets of images that can be mined in multiple ways. Imaging applications of machine learning work by breaking an image down into numerical or other descriptors, called “features.” The algorithm then selects the most informative of those features and uses them to classify what is in the image. In a branch of machine-learning methods called supervised learning, the algorithm learns from examples that a researcher has already labeled, and its classifications are tested for accuracy against a separate test set of labeled data. Once the machine-learning algorithm or program is “trained,” it can be applied to a larger set of data. In contrast, unsupervised machine-learning methods mine the data and infer its structure without any labeled training examples.
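To make the supervised workflow concrete, here is a minimal Python sketch of the train-then-test loop described above. Everything in it is an assumption for illustration: the two features (area and intensity), the class names, the synthetic numbers, and the choice of a simple nearest-centroid classifier, which is far cruder than the algorithms in the tools discussed below.

```python
import random

def make_cells(n, label, mean_area, mean_intensity, rng):
    """Return n synthetic (features, label) pairs; features = [area, intensity]."""
    return [([rng.gauss(mean_area, 10.0), rng.gauss(mean_intensity, 0.1)], label)
            for _ in range(n)]

def train_centroids(examples):
    """Training step: average the feature vectors of each labeled class."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))

def accuracy(centroids, examples):
    """Fraction of labeled examples that the trained model classifies correctly."""
    hits = sum(predict(centroids, f) == y for f, y in examples)
    return hits / len(examples)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
cells = (make_cells(100, "class_a", 300.0, 0.8, rng)
         + make_cells(100, "class_b", 150.0, 0.3, rng))
rng.shuffle(cells)

train, test = cells[:50], cells[50:]   # small labeled training set, held-out test set
model = train_centroids(train)
print("held-out accuracy:", accuracy(model, test))
```

The point of the split is that accuracy is measured on labeled cells the model never saw during training; only after that check would the model be applied to the larger, unlabeled data set.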
Of course, there’s a level of trust involved in allowing machine learning to take the reins. The Scientist spoke with developers of machine-learning approaches in cell biology to help demystify these tools. Here’s what we learned.
Intro: Soon after the launch of CellProfiler—a popular imaging software platform that allows biologists to recognize different cell types, phases, and conditions—its users were faced with a new problem: How do you process the thousands of measurements for each of hundreds of cells in a single image? “In many cases the data don’t even fit into Excel, and certainly the tools there are limiting,” says developer Anne Carpenter of the Broad Institute of MIT and Harvard University.
To address the data problem, Carpenter and her colleagues developed CellProfiler Analyst, an open-source platform that allows researchers to explore and visualize their data. (See Machine-Learning Glossary at bottom of page.) The latest version of the software, 2.0, is rewritten in Python and is equipped with several machine-learning algorithms that classify multiple biological phenotypes. The original version of Analyst, coded in Java, classified only single phenotypes. Also, a new visualization tool allows researchers to see their results overlaid on their multiwell plate experiments (Bioinformatics, 32:3210-12, 2016).
Application example: Aiming to create human replacement livers, Sangeeta Bhatia’s MIT lab cocultured two cell types, fibroblasts and hepatocytes. Hepatocytes don’t proliferate in culture, so the group created a screen for compounds that would cause the cells to self-renew. CellProfiler Analyst enabled the scientists to classify cells within the screened coculture as being either hepatocytes or fibroblasts (Nature Chem Biol, 9:514-20, 2013).
Getting started: Users can download CellProfiler Analyst 2.0, which is Mac- and Windows-compatible, via its website (www.cellprofiler.org). General and application-specific tutorials are also available on CellProfiler’s site (cellprofiler.org/tutorials/). Training the program to recognize the majority of phenotypes accurately takes from half an hour to an hour, Carpenter says.
Considerations: CellProfiler Analyst’s versatility extends beyond traditional microscopy data; it was recently used to analyze data from imaging flow cytometry, an emerging method that captures several shots of each of thousands of single cells as they pass through a conventional flow cytometry system (Methods, 112:201-10, 2017).
Besides CellProfiler Analyst, another user-friendly machine-learning program that complements CellProfiler is called ilastik (Methods, 96:6-11, 2016). Ilastik’s pixel-based classifier can process images that can then be exported into a CellProfiler pipeline. You can download ilastik for free via its site (ilastik.org/download.html), and it is Windows-, Mac-, and Linux-compatible.
Future: If the classical machine-learning algorithms in CellProfiler Analyst are not effective for identifying the phenotype you want to study, then you might need to move on to deep learning, Carpenter says. Deep learning is a type of machine learning that stacks more layers of features into a hierarchy, and it often performs far better than classical algorithms. For example, “identifying the stages of malaria infection in red blood cells is impossible using classical machine learning methods but our recent work has shown a deep-learning model can match the accuracy of experts,” she adds. There are currently no user-friendly tools allowing biologists to readily apply deep learning to their imaging problems, but Carpenter says her lab is working on this.
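A toy Python sketch can show why a hierarchy of features helps. It is not an image model and nothing in it is trained: the weights are set by hand, and the task (flag a cell when exactly one of two hypothetical markers is present) is an assumption chosen because no single weighted threshold on the raw inputs can solve it, while a second layer built on two intermediate features can.

```python
from itertools import product

def step(x):
    """Threshold unit: outputs 1 when its weighted input is positive."""
    return 1 if x > 0 else 0

def single_unit(x1, x2, w1, w2, bias):
    """One layer: a single weighted threshold applied to the raw inputs."""
    return step(w1 * x1 + w2 * x2 + bias)

def two_layer(x1, x2):
    """Two layers: intermediate features first, then a decision built on them."""
    has_any = step(x1 + x2 - 0.5)    # feature 1: at least one marker present
    has_both = step(x1 + x2 - 1.5)   # feature 2: both markers present
    return step(has_any - has_both - 0.5)  # "any but not both"

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
target = [0, 1, 1, 0]  # 1 only when exactly one marker is present

# Try every single-unit setting on a coarse grid of weights and biases.
grid = [i / 2 for i in range(-8, 9)]
solutions = sum(
    all(single_unit(a, b, w1, w2, bias) == t for (a, b), t in zip(inputs, target))
    for w1, w2, bias in product(grid, repeat=3)
)

print("single-unit settings that solve the task:", solutions)
print("two-layer outputs:", [two_layer(a, b) for a, b in inputs])
```

Real deep-learning models learn their intermediate features from data and stack many more layers, but the principle is the same: later layers work on features computed by earlier ones.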
Intro: Developed by researchers at the National Institutes of Health, WND-CHARM (Weighted Neighbor Distances using a Compound Hierarchy of Algorithms Representing Morphology) comprises a four-step algorithm for pattern recognition: extract features, reduce their dimensions, classify them, and validate them. It is available as an open-source command-line program via GitHub (Pattern Recognit Lett, 29:1684-93, 2008)……
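The four steps named above (extract, reduce, classify, validate) can be sketched in a few lines of Python. This is a simplified stand-in, not WND-CHARM’s implementation: the three features, the Fisher-style scoring used to rank them, the weighted nearest-neighbor rule, the synthetic 8 x 8 “images,” and all function names are assumptions made for illustration.

```python
import random
import statistics

def make_image(spread, rng, size=8):
    """Synthetic grayscale image: pixel values drawn around 0.5 with a given spread."""
    return [[rng.gauss(0.5, spread) for _ in range(size)] for _ in range(size)]

def extract_features(image):
    """Step 1: describe the image as numbers (mean, spread, edge strength)."""
    pixels = [p for row in image for p in row]
    edges = [abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)]
    return [statistics.mean(pixels), statistics.pstdev(pixels), statistics.mean(edges)]

def fisher_scores(samples):
    """Score each feature: variance between class means over variance within classes."""
    by_class = {}
    for feats, label in samples:
        by_class.setdefault(label, []).append(feats)
    scores = []
    for j in range(len(samples[0][0])):
        columns = [[f[j] for f in rows] for rows in by_class.values()]
        between = statistics.pvariance([statistics.mean(c) for c in columns])
        within = statistics.mean([statistics.pvariance(c) for c in columns]) + 1e-12
        scores.append(between / within)
    return scores

def select_top(scores, k):
    """Step 2: reduce dimensions by keeping the k best-scoring features."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def classify(train, feats, keep, scores):
    """Step 3: label of the nearest training sample, using score-weighted distance."""
    def dist(other):
        return sum(scores[j] * (other[j] - feats[j]) ** 2 for j in keep)
    return min(train, key=lambda sample: dist(sample[0]))[1]

rng = random.Random(1)  # fixed seed so the sketch is repeatable
samples = [(extract_features(make_image(0.05, rng)), "smooth") for _ in range(40)]
samples += [(extract_features(make_image(0.25, rng)), "textured") for _ in range(40)]
rng.shuffle(samples)
train, held_out = samples[:40], samples[40:]

scores = fisher_scores(train)
keep = select_top(scores, k=2)

# Step 4: validate on images that were not used for scoring or training.
correct = sum(classify(train, f, keep, scores) == y for f, y in held_out)
validation_accuracy = correct / len(held_out)
print("kept feature indices:", keep)
print("held-out accuracy:", validation_accuracy)
```

Because both synthetic classes share the same average brightness, the scoring step should rank the mean-intensity feature last and discard it, which is the kind of pruning the “reduce dimensions” step is meant to do.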