Analysis of flow cytometry data with domain-adversarial autoencoders
Title (trans.): Análisis de datos de citometría de flujo mediante el uso de domain-adversarial autoencoders
Entity: UAM. Departamento de Ingeniería Informática
Subjects: Flow cytometry; Batch effects; Autoencoders; Unsupervised learning; Domain adaptation; Clustering; Dimensionality reduction; Computer science
Note: Master's thesis in Bioinformatics and Computational Biology
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.
Machine learning is a field of artificial intelligence focused on automatic data analysis. In the era of big data, algorithms have emerged that analyze large quantities of data efficiently, incorporating more knowledge into our studies. One of the main fields of application for these algorithms is bioinformatics, where large amounts of high-dimensional data are typically analyzed. However, one of the main difficulties in the automatic analysis of biological data is the inevitable variation in experimental conditions, which causes the well-known batch effects. These make it difficult to integrate data from different experimental sources, which limits joint analysis and loses relevant biological information. Focusing on flow cytometry data, in this work we propose a new unsupervised learning algorithm that aims to smooth out the influence of batch effects simultaneously across an arbitrary number of experimental conditions. Applying state-of-the-art machine learning techniques, such as domain adaptation and adversarial learning, we present the domain-adversarial autoencoder (DAE). To validate the DAE as a domain adaptation or batch normalization algorithm, we carry out experiments with three datasets. The first two are simple, artificial datasets composed of beads that have been run through the cytometer in a controlled environment. In one of them, clogging or misalignment of the cytometer is artificially simulated. In the other, the same data are analyzed on two different machines. The third is a real dataset of mouse dendritic cells that were also acquired on two different cytometers. First, we show how these batch effects influence the analyses typically applied by flow cytometry users, such as clustering with Phenograph or visualization with t-SNE.
Second, we show that the DAE efficiently alleviates the batch effects in these examples and improves the clustering results, achieving a notable increase in the F1-score after correction. In addition, we provide a visual evaluation of the two-dimensional representations learned with a standard autoencoder (SAE), t-SNE and a DAE. We also present a novel method to evaluate the quality of batch normalization using statistical distances. In particular, we use the multidimensional version of the Kolmogorov-Smirnov distance between distributions. We show that the distribution of the data in the latent representation of the DAE is very similar when the data come from different experiments, with a smaller distance than in the case of the SAE, where the algorithm is not given domain information during training. We therefore conclude that domain adaptation in flow cytometry data opens a new line of research focused on developing tools for the integration of data from different experiments.
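To make the evaluation idea concrete, the following is a minimal sketch of a two-sample Kolmogorov-Smirnov statistic in two dimensions. The abstract does not say which multidimensional variant the thesis uses, so this sketch assumes a quadrant-counting construction in the style of Fasano and Franceschini. The function names (`ks2d_statistic` and its helpers) are hypothetical and are not taken from the thesis. The two inputs stand for the two-dimensional latent coordinates of cells from two batches.

```python
import random


def _quadrant_fraction(points, x0, y0, sx, sy):
    """Fraction of points lying strictly inside one quadrant around (x0, y0).

    sx and sy are +1 or -1 and select the quadrant.
    """
    inside = sum(1 for x, y in points if sx * (x - x0) > 0 and sy * (y - y0) > 0)
    return inside / len(points)


def _max_quadrant_gap(origins, a, b):
    """Largest difference between the quadrant fractions of a and b,
    taken over every origin and all four quadrants."""
    gap = 0.0
    for x0, y0 in origins:
        for sx in (1, -1):
            for sy in (1, -1):
                fa = _quadrant_fraction(a, x0, y0, sx, sy)
                fb = _quadrant_fraction(b, x0, y0, sx, sy)
                gap = max(gap, abs(fa - fb))
    return gap


def ks2d_statistic(a, b):
    """Two-sample 2D KS-style statistic in [0, 1] (assumed variant).

    Averages the largest quadrant gap found with origins at the points
    of a and with origins at the points of b.
    """
    return 0.5 * (_max_quadrant_gap(a, a, b) + _max_quadrant_gap(b, a, b))


# Illustrative data only: two synthetic "batches" with a small shift.
rng = random.Random(0)
batch_1 = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
batch_2 = [(rng.gauss(0.5, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
print(ks2d_statistic(batch_1, batch_2))
```

Under this assumption, a value close to 0 means the two batches occupy the latent space in nearly the same way, and a value close to 1 means they are almost fully separated. The comparison described above would then amount to computing this statistic on the SAE and DAE representations of the same batches. The double loop is quadratic in the number of points, so a real cytometry dataset would likely need subsampling.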
Google Scholar: Dorado Alfaro, Sara