The volume and diversity of video content have grown substantially in the digital era, and this growth has highlighted the need for advanced video understanding technologies. Motivated by this need, this thesis explores the semantic understanding of videos, drawing on multiple perceptual modalities, as human cognition does, and on efficient learning with limited supervision, as human learning does. The thesis focuses specifically on video question answering, one of the main video understanding tasks. Our first contribution addresses long-range video question answering, which requires understanding extended video content. While recent approaches rely on human-generated external sources, we process raw data to generate video summaries. Our second contribution explores zero-shot and few-shot video question answering, aiming to learn efficiently from limited data. We leverage the knowledge of existing large-scale models by addressing the challenges of adapting pre-trained models to limited data. We demonstrate that these contributions significantly enhance the capabilities of multimodal video question-answering systems, particularly when human-annotated data is limited or unavailable.
[ATTENTION: under the VIGIPIRATE security plan, public access to this defense requires mandatory prior registration with aurelie.patier@inria.fr. Access will not be granted without prior registration. In addition, visitors must not carry any luggage or bags.]
Luce MORIN - Professor, INSA Rennes - President
Josef SIVIC - Distinguished Researcher, Czech Technical University - Examiner
Karteek ALAHARI - Research Director, Inria Grenoble - Examiner
Ivan LAPTEV - Visiting Professor, MBZUAI - Reviewer
Matthieu CORD - Professor, Sorbonne University - Reviewer