Researchers from the Helmholtz Zentrum in Munich and the Technical University of Munich (TUM) have developed a deep learning strategy for better querying of single-cell atlases.
Called ‘scArches’ (single-cell architecture surgery), the new algorithm uses transfer learning and parameter optimization to allow more effective queries without sharing raw data, the researchers say. They have already tested the algorithm on several conditions, including COVID-19, and they say their model has several advantages.
“Instead of sharing raw data between clinics or research centers, the algorithm uses transfer learning to compare new datasets from single-cell genomics with existing references and thus preserves privacy and anonymity. This also makes annotating and interpreting of new data sets very easy and democratizes the usage of single-cell reference atlases dramatically,” according to Mohammad Lotfollahi, the leading scientist working on the algorithm. Lotfollahi is a team leader in Fabian Theis’ lab at the Helmholtz Zentrum in Munich and a doctoral student at the TUM School of Life Sciences.
The team’s work is published today in Nature Biotechnology. In their report, the researchers write: “Large single-cell atlases are now routinely generated to serve as references for analysis of smaller-scale studies. Yet learning from reference data is complicated by batch effects between datasets, limited availability of computational resources and sharing restrictions on raw data.”
Batch effects arise because the datasets are typically compiled by different laboratories that may use different protocols.
“Data-integration methods are typically used to overcome these batch effects in reference construction,” note the authors, but this process requires access to all the relevant data, and obtaining that access can involve a complex legal process.
Further, analyzing and integrating these data typically requires considerable time and sophisticated informatics resources. “Traditional data-integration methods consider any perturbation between datasets that affects most cells as a technical batch effect, but biological perturbations may also affect most cells,” writes the team.
As a result, they say, most conventional approaches are not up to the task of productively using these atlases.
In one key example, the researchers applied scArches to study COVID-19 in several lung bronchial samples. They compared the cells of COVID-19 patients to healthy references using single-cell transcriptomics. The algorithm was able to separate diseased cells from the references, for both mild and severe COVID-19 cases. Biological variation between patients did not affect the quality of the mapping process.
“Our vision is that in the future we will use cell references as easily as we nowadays do for genome references. In other words, if you want to bake a cake, you usually do not want to try coming up with your own recipe – instead you just look one up in a cookbook. With scArches, we formalize and simplify this lookup process,” says Fabian Theis, who is Director of the Institute of Computational Biology at the Helmholtz Zentrum in Munich and Coordinator of the Helmholtz Artificial Intelligence Cooperation Unit (Helmholtz AI).
Reference datasets can help researchers understand the influences of aging, environment and disease on a cell – and ultimately diagnose and treat patients better. Yet, as the authors point out, single-cell datasets may contain measurement errors (batch effect), the global availability of computational resources is limited, and the sharing of raw data is often legally restricted.
In the publication, the researchers report experiments on mouse brain, pancreas, immune, and whole-organism atlases to show that scArches “preserves biological state information while removing batch effects, despite using four orders of magnitude fewer parameters than de novo integration.”