PROPOSAL

Answering Questions from Multimedia Collections


Supervisor: Omar Shahbaz Khan
Semester: Spring 2026
Tags: Multimedia Retrieval, VLM, RAG, Vector Store, Multimedia Indexing

Contrastive learning models have made it easier to find relevant content in multimedia collections through descriptive text queries, reducing the interactivity needed to solve simple tasks. However, complex tasks that go beyond purely visual elements, or tasks focused on answering questions about the contents of one or more media items (videos/images), still require a fair amount of interactivity:

  1. Retrieve results from a query (a minimal retrieval sketch follows this list)
  2. Inspect the content (summaries, watching, browsing)
  3. Refine the query / add more context / try a different search strategy
  4. Repeat until the answer to the question, or the relevant content, has been discovered
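
As a concrete illustration of step 1, the sketch below indexes a set of pre-extracted keyframe images with a contrastive model and ranks them against a free-text query. It assumes the openai/clip-vit-base-patch32 checkpoint from Hugging Face, PyTorch, and keyframes stored as JPEG files; it is a minimal example under those assumptions, not the final retrieval component.

    # Minimal sketch of contrastive text-to-image retrieval (step 1 above).
    # Assumptions: keyframes already extracted as JPEG files, the
    # "openai/clip-vit-base-patch32" checkpoint, and PyTorch installed.
    from pathlib import Path

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def index_images(image_dir: str) -> tuple[list[Path], torch.Tensor]:
        """Embed every image in a directory into a normalized CLIP vector."""
        paths = sorted(Path(image_dir).glob("*.jpg"))
        images = [Image.open(p).convert("RGB") for p in paths]
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        return paths, feats / feats.norm(dim=-1, keepdim=True)

    def search(query: str, paths: list[Path], feats: torch.Tensor, k: int = 5):
        """Rank indexed images by cosine similarity to a free-text query."""
        inputs = processor(text=[query], return_tensors="pt", padding=True)
        with torch.no_grad():
            q = model.get_text_features(**inputs)
        q = q / q.norm(dim=-1, keepdim=True)
        scores = (feats @ q.T).squeeze(1)
        top = scores.topk(min(k, len(paths)))
        return [(paths[i], scores[i].item()) for i in top.indices]

    # Example: paths, feats = index_images("keyframes/")
    #          search("a red car at night", paths, feats, k=5)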

RAG (Retrieval-Augmented Generation) approaches use LLMs (Large Language Models) to answer questions from a set of documents. Similarly, VLMs (Vision-Language Models) have been shown to be capable of answering questions about the visual contents of videos.
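
The generation step of such a pipeline could then pass the retrieved keyframes to a VLM together with the question. The sketch below assumes the frames have already been retrieved (e.g., with the CLIP index above) and uses the OpenAI Python client as one possible API-based VLM backend; a locally hosted VLM could be substituted, and the model name is only a placeholder.

    # Sketch of the generation step in a RAG-style pipeline: the question plus
    # the retrieved keyframes are sent to a VLM, which answers from the images.
    # Assumptions: frames were retrieved beforehand (e.g., with the CLIP index
    # above); the OpenAI Python client stands in for whichever VLM is chosen.
    import base64
    from pathlib import Path

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def encode_image(path: Path) -> str:
        """Base64-encode a keyframe so it can be embedded in the request."""
        return base64.b64encode(path.read_bytes()).decode("utf-8")

    def answer_from_frames(question: str, frames: list[Path]) -> str:
        """Ask the VLM to answer the question using only the retrieved keyframes."""
        content = [{"type": "text", "text": question}]
        for frame in frames:
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{encode_image(frame)}"},
            })
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; any vision-capable model works
            messages=[{"role": "user", "content": content}],
        )
        return response.choices[0].message.content

    # Example: answer_from_frames("When was the coffee machine last used?", retrieved_frames)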

The goal of this project is to use a VLM to answer questions about the contents of image and video collections, specifically the lifelog images from the Lifelog Search Challenge, videos from V3C, and the CASTLE dataset.