Application of multimodal machine learning to visual question answering
Author
Galvé Mateo, Carlos
Advisor
Fiérrez Aguilar, Julián
Entity
UAM. Departamento de Tecnología Electrónica y de las Comunicaciones
Date
2021-09
Subjects
Visual question answering; Computer vision; Natural language processing; Telecommunications
Note
Master’s Degree in ICT Research and Innovation (i2-ICT)
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.
Abstract
Recent advances in Natural Language Processing and Computer Vision, driven by neural networks and attention mechanisms, have sparked great interest in Visual Question Answering (VQA). VQA is starting to be regarded as the "Visual Turing Test" for modern AI systems: the task is to answer a question about an image, so the system must learn to understand and reason about both the image and the question. One of the main reasons for this interest is the large number of potential applications, such as medical diagnosis from images, assistants for blind people, and e-learning applications.
This Master's thesis presents a study of the state of the art in VQA, covering both techniques and existing datasets. It then describes a development effort that attempts to reproduce state-of-the-art results with the latest VQA models, with the aim of applying them to and experimenting on new datasets.
Experiments are first carried out with the MoViE+MCAN model [1] [2] (winner of the 2020 VQA Challenge). After finding it non-viable due to resource constraints, we switched to the LXMERT model [3], which is pre-trained on five subtasks and can be fine-tuned on several tasks; in this case, the VQA task on the VQA v2.0 dataset [4].
As the main result of this thesis, we show experimentally that LXMERT, starting from the pre-trained model provided by the GitHub repository [5], achieves results similar to those of MoViE+MCAN (the best known method for VQA) on the most recent and demanding benchmarks while using fewer resources.
Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/