Metric learning for instance and category-level visual representation

Defense type: Thesis
Location: IRISA Rennes
Room: Petri-Turing
Speaker: Shashanka Venkataramanan (Linkmedia)

The primary goal in computer vision is to enable machines to extract meaningful information from visual data, such as images and videos, and to leverage this information to perform a wide range of tasks. To this end, substantial research has focused on developing deep learning models capable of encoding comprehensive and robust visual representations. A prominent strategy in this context involves pretraining models on large-scale datasets, such as ImageNet, to learn representations that transfer across tasks and support diverse downstream tasks with minimal effort.

To facilitate learning on these large-scale datasets and encode good representations, complex data augmentation strategies have been used. However, these augmentations can be limited in their scope, either being hand-crafted and lacking diversity, or generating images that appear unnatural. Moreover, the focus of these augmentation techniques has primarily been on the ImageNet dataset and its downstream tasks, limiting their applicability to a broader range of computer vision problems.

In this thesis, we aim to tackle these limitations by exploring different approaches to enhance the efficiency and effectiveness of representation learning. The common thread across the works presented is the use of interpolation-based techniques, such as mixup, to generate diverse and informative training examples beyond the original dataset. In the first work, we are motivated by the idea of deformation as a natural way of interpolating images, rather than using a convex combination. We show that geometrically aligning the two images in the feature space allows for a more natural interpolation that retains the geometry of one image and the texture of the other, connecting it to style transfer. Drawing on these observations, we explore the combination of mixup and deep metric learning. We develop a generalized formulation that accommodates mixup in metric learning, leading to improved representations that explore areas of the embedding space beyond the training classes. Building on these insights, we revisit the original motivation of mixup and generate a larger number of interpolated examples beyond the mini-batch size by interpolating in the embedding space. This approach allows us to sample on the entire convex hull of the mini-batch, rather than just along linear segments between pairs of examples. Finally, we investigate the potential of using natural augmentations of objects from videos. We introduce a "Walking Tours" dataset of first-person egocentric videos, which capture a diverse range of objects and actions in natural scene transitions. We then propose DoRA, a novel self-supervised pretraining method that detects and tracks objects across video frames, deriving multiple views from the tracks and using them as a self-supervised signal.
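
To make the interpolation ideas concrete, the following minimal NumPy sketch contrasts standard pairwise mixup with embedding-space interpolation over the convex hull of a mini-batch. It is illustrative only, not the thesis implementation: the function names, the Dirichlet parameterization of the convex weights, and the toy dimensions are assumptions made for this example.

import numpy as np

rng = np.random.default_rng(0)

def mixup_pair(x_i, x_j, alpha=1.0):
    # Standard mixup: a convex combination of two examples, with the
    # mixing coefficient drawn from a Beta(alpha, alpha) distribution.
    lam = rng.beta(alpha, alpha)
    return lam * x_i + (1.0 - lam) * x_j

def convex_hull_interpolation(embeddings, n_samples, alpha=1.0):
    # Sample points inside the convex hull of a mini-batch of embeddings.
    # Dirichlet-distributed weights over all B embeddings generalize the
    # pairwise Beta coefficient of mixup: each sample is a convex
    # combination of the whole batch, not just of one pair.
    B = embeddings.shape[0]
    weights = rng.dirichlet(alpha * np.ones(B), size=n_samples)  # (n_samples, B)
    return weights @ embeddings  # (n_samples, d)

# Toy usage: a mini-batch of 8 embeddings in R^16 (hypothetical sizes).
batch = rng.standard_normal((8, 16))
mixed_pair = mixup_pair(batch[0], batch[1])
mixed_many = convex_hull_interpolation(batch, n_samples=128)
print(mixed_pair.shape, mixed_many.shape)  # (16,) (128, 16)

Note how the second function can produce many more interpolated examples than the mini-batch contains (128 from a batch of 8 here), whereas pairwise mixup is confined to segments between two examples; this is the sense in which embedding-space interpolation goes beyond pairs and beyond the mini-batch size.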

[ATTENTION, under the VIGIPIRATE security plan: public access to this defense requires prior registration with lydie [*] mabilatinria [*] fr (aurelie[dot]patier[at]inria[dot]fr). Access will not be granted without prior registration. In addition, visitors may not bring luggage or bags.]

Composition of the jury
Prof. Dr. Andrew Zisserman, Professor, FRS, University of Oxford
Prof. Dr. Jiri Matas, Professor, Czech Technical University, Prague
Prof. Dr. Nicolas Courty, Professor, Université Bretagne Sud
Dr. Piotr Bojanowski, Principal Research Scientist, Meta AI (FAIR)
Dr. Diane Larlus, Principal Research Scientist, Naver Labs Europe
Dr. Andrei Bursuc, Senior Research Scientist, valeo.ai