Video & Multimodal Retrieval Papers: November 2025
Stay up-to-date with the rapidly evolving fields of video and multimodal retrieval! This article summarizes the latest research papers published as of November 25, 2025, focusing on advances in both video and multimodal information retrieval. Explore cutting-edge developments, innovative approaches, and newly released datasets that are shaping how we search, access, and interact with video and multimodal content. For a better reading experience and more papers, check the GitHub page.
Video Retrieval: Pushing the Boundaries of Video Understanding
Video retrieval, a core area of computer vision and information retrieval, has seen significant progress in recent years. The ability to efficiently and accurately search for videos based on criteria such as semantic content, temporal context, and user intent is crucial for numerous applications, from video surveillance and content recommendation to education and entertainment. Recent research has focused on enhancing video understanding through techniques such as multimodal learning, attention mechanisms, and deep neural networks. Below, we explore the newest publications contributing to these advances.
ViMix-14M: A Curated Multi-Source Video-Text Dataset
Datasets are the bedrock of machine learning, and the release of ViMix-14M marks a significant contribution. This curated multi-source video-text dataset features long-form, high-quality captions and offers crawl-free access, a major advantage for researchers. Its scale and caption quality should support the training and evaluation of more robust and generalizable video retrieval models that capture the intricate relationship between video content and descriptive text, paving the way for more accurate and context-aware retrieval systems.
Video-RAG: Retrieval-Augmented Long Video Comprehension
Long video comprehension is a challenging task, and Video-RAG addresses this head-on by leveraging retrieval-augmented generation. This approach combines the strengths of retrieval-based and generation-based methods to achieve a more comprehensive understanding of long-form video content. By retrieving relevant information from external sources, the model can enhance its ability to answer questions, summarize content, and perform other video understanding tasks. The acceptance of this paper at NeurIPS 2025 underscores its significance and impact on the field. The innovative technique behind Video-RAG promises to improve the performance of video retrieval systems in handling complex and lengthy videos.
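To make the retrieval-augmented pattern concrete, here is a minimal sketch of the general idea, not of the paper's actual pipeline: clips of a long video are embedded offline, the clips most similar to a question are retrieved, and their transcripts are packed into a prompt for a language model. The function names and the random toy embeddings below are assumptions for illustration.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_clips(question_emb, clip_embs, transcripts, k=3):
    """Rank pre-embedded video clips by similarity to the question."""
    scores = [cosine(question_emb, e) for e in clip_embs]
    top = np.argsort(scores)[::-1][:k]
    return [transcripts[i] for i in top]

def build_prompt(question, retrieved):
    """Assemble a retrieval-augmented prompt for a downstream LLM."""
    context = "\n".join(f"- {t}" for t in retrieved)
    return f"Video context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Toy example: random stand-ins for real clip/question embeddings.
rng = np.random.default_rng(0)
clip_embs = [rng.normal(size=128) for _ in range(10)]
transcripts = [f"clip {i}: ..." for i in range(10)]
question_emb = rng.normal(size=128)

print(build_prompt("What happens after the goal?",
                   retrieve_clips(question_emb, clip_embs, transcripts)))
```

The key design point is that only the top-k retrieved clips enter the prompt, which keeps context length bounded no matter how long the video is.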
X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification
Person re-identification (ReID) is crucial for video surveillance and security applications. X-ReID introduces a novel approach by employing multi-granularity information interaction for video-based visible-infrared person re-identification. This technique allows the model to effectively learn and match person identities across different modalities (visible and infrared), enhancing its robustness to variations in lighting conditions and other environmental factors. The acceptance of this work by AAAI 2026 highlights its potential to advance the state-of-the-art in person re-identification. The multi-granularity approach used in X-ReID provides a more nuanced and accurate way to match individuals across different video streams.
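As a rough illustration of multi-granularity matching in general, not of X-ReID's architecture, a visible-light track and an infrared track can be compared with both a global descriptor and part-level descriptors; the part split and the granularity weight below are assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def multi_granularity_score(vis, inf, part_weight=0.5):
    """Score a visible/infrared pair with global plus part-level features.

    vis and inf are dicts holding a 'global' vector and a list of 'parts'
    vectors (e.g., head/torso/legs stripes) for one person track each.
    """
    global_sim = cosine(vis["global"], inf["global"])
    part_sim = np.mean([cosine(a, b) for a, b in zip(vis["parts"], inf["parts"])])
    return (1 - part_weight) * global_sim + part_weight * part_sim

# Toy tracks with random features standing in for real model outputs.
rng = np.random.default_rng(1)
def make_track():
    return {"global": rng.normal(size=256),
            "parts": [rng.normal(size=64) for _ in range(3)]}

print(multi_granularity_score(make_track(), make_track()))
```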
A Superpersuasive Autonomous Policy Debating System
Policy debating systems are a fascinating application of AI, and this paper presents a superpersuasive autonomous policy debating system. Accepted to the CLIP workshop at AAAI 2026, the work explores how AI can generate persuasive arguments in policy debates, potentially enabling more informed and data-driven discussions. Its ability to construct persuasive arguments demonstrates the growing sophistication of AI in understanding and reasoning about complex topics, and marks a step toward systems that can effectively communicate and debate policy issues.
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
Reasoning about videos that contain rich textual information requires sophisticated techniques. Video-R4 introduces the concept of visual rumination to reinforce text-rich video reasoning. This approach allows the model to iteratively process and refine its understanding of the video content by leveraging both visual and textual cues. By mimicking the human process of thinking things over, the model can achieve a deeper and more accurate comprehension of the video. Video-R4's approach of combining visual and textual information is a promising direction for enhancing video reasoning capabilities.
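The rumination loop can be sketched abstractly as below; everything here, including the narrowing heuristic and the convergence check, is a hypothetical stand-in for the control flow rather than Video-R4's implementation, which would call a multimodal model at each step.

```python
def ruminate(frames, question, max_rounds=3):
    """Toy rumination loop: repeatedly revisit frames and refine an answer.

    A real system would use a multimodal model to read text in the frames
    and to update the draft answer; here simple strings stand in for both.
    """
    answer, focus = "unknown", list(range(len(frames)))
    for round_ in range(max_rounds):
        evidence = [frames[i] for i in focus]           # re-inspect regions
        new_answer = f"draft {round_}: used {len(evidence)} frames"
        if new_answer == answer:                        # stop once stable
            break
        answer = new_answer
        focus = focus[: max(1, len(focus) // 2)]        # narrow attention
    return answer

print(ruminate(["frame0", "frame1", "frame2", "frame3"],
               "What does the sign say?"))
```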
VSI: Visual Subtitle Integration for Keyframe Selection
Keyframe selection is a crucial step in video summarization and understanding. VSI enhances this process by integrating visual subtitles, providing valuable textual context for identifying the most important frames in a video. By combining visual and textual information, VSI can more effectively capture the essence of the video content, leading to more accurate and informative summaries. The integration of visual subtitles is a clever way to improve keyframe selection, as it leverages the textual information that is often present in videos.
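One simple way subtitles can inform keyframe selection, sketched here as a generic baseline rather than VSI's method: score each subtitle cue by word overlap with a query and keep the frame timestamps of the best cues. The .srt-style cue format and the overlap scoring are assumptions.

```python
def select_keyframes(subtitles, query, k=2):
    """Pick frame timestamps whose subtitle text best matches a query.

    subtitles: list of (start_sec, end_sec, text) cues, as in an .srt file.
    Scoring is plain word overlap; a learned model would use embeddings.
    """
    query_words = set(query.lower().split())
    scored = []
    for start, end, text in subtitles:
        overlap = len(query_words & set(text.lower().split()))
        midpoint = (start + end) / 2          # representative frame time
        scored.append((overlap, midpoint))
    scored.sort(reverse=True)
    return [t for _, t in scored[:k]]

cues = [(0, 4, "welcome to the cooking show"),
        (4, 9, "first we chop the onions"),
        (9, 15, "now fry the onions until golden")]
print(select_keyframes(cues, "chop the onions"))   # midpoints of best cues
```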
Universal Video Temporal Grounding with Generative Multi-modal Large Language Models
Temporal grounding is the task of identifying the specific time segments in a video that correspond to a given query. This paper explores the use of generative multi-modal large language models for universal video temporal grounding. This approach allows the model to handle a wide range of queries and video content, making it a versatile solution for temporal grounding tasks. The use of large language models in this context demonstrates their potential for understanding and reasoning about video content.
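Conceptually, temporal grounding reduces to scoring candidate time windows against a query. The sliding-window baseline below illustrates that framing only; it is not the paper's generative approach, and the embeddings are random placeholders.

```python
import numpy as np

def ground_query(query_emb, clip_embs, window=3, clip_len=2.0):
    """Return the (start_sec, end_sec) span whose pooled clip embeddings
    best match the query; clip_embs[i] covers [i*clip_len, (i+1)*clip_len)."""
    best_score, best_span = -np.inf, (0.0, window * clip_len)
    for i in range(len(clip_embs) - window + 1):
        pooled = np.mean(clip_embs[i:i + window], axis=0)
        score = pooled @ query_emb / (np.linalg.norm(pooled) *
                                      np.linalg.norm(query_emb) + 1e-8)
        if score > best_score:
            best_score, best_span = score, (i * clip_len, (i + window) * clip_len)
    return best_span

rng = np.random.default_rng(2)
clips = [rng.normal(size=64) for _ in range(20)]   # 20 two-second clips
print(ground_query(rng.normal(size=64), clips))
```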
Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Reasoning over visual evidence is a complex task that requires the ability to analyze information at multiple scales. Conan adopts a progressive learning approach to mimic the way a detective reasons, gradually building an understanding of the situation by considering evidence at different levels of detail. This approach allows the model to effectively handle complex visual scenarios and make accurate inferences. The detective-like reasoning approach used in Conan is an intriguing way to tackle the challenges of visual reasoning.
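A hedged sketch of coarse-to-fine evidence gathering in this spirit, not Conan's actual procedure: scan per-frame relevance scores, then repeatedly halve the active region around the strongest evidence. The scoring values are stand-ins for model outputs.

```python
import numpy as np

def coarse_to_fine(scores, levels=3):
    """Progressively narrow in on the most promising region of a video.

    scores: per-frame relevance values (stand-ins for model evidence).
    Each level shrinks the active region around the current best frame,
    like a detective zooming in on the strongest clue.
    """
    lo, hi = 0, len(scores)
    for _ in range(levels):
        best = lo + int(np.argmax(scores[lo:hi]))   # strongest evidence
        width = max(2, (hi - lo) // 4)
        lo, hi = max(0, best - width), min(len(scores), best + width)
    return lo, hi

rng = np.random.default_rng(3)
frame_scores = rng.random(200)
print(coarse_to_fine(frame_scores))   # (start_idx, end_idx) of focus region
```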
From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos
Retrieving videos with fine-grained temporal details is a challenging task, as it requires the model to understand the temporal relationships between events in the video. This paper introduces a composed video retrieval approach, in which the query pairs a reference video with a modification text, and the system retrieves a target video that differs from the reference in fine-grained temporal detail.
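A minimal late-fusion sketch of composed retrieval in general, not of this paper's model: fuse a reference-video embedding with a modification-text embedding and search the gallery for the nearest video. The additive fusion and the random embeddings are assumptions; real systems learn the fusion.

```python
import numpy as np

def compose_and_retrieve(ref_emb, mod_emb, gallery, alpha=0.5):
    """Fuse reference-video and modification-text embeddings, then return
    the index of the closest gallery video under cosine similarity."""
    query = (1 - alpha) * ref_emb + alpha * mod_emb
    query /= np.linalg.norm(query) + 1e-8
    sims = [g @ query / (np.linalg.norm(g) + 1e-8) for g in gallery]
    return int(np.argmax(sims))

rng = np.random.default_rng(4)
gallery = [rng.normal(size=128) for _ in range(50)]
ref_emb, mod_emb = rng.normal(size=128), rng.normal(size=128)
print(compose_and_retrieve(ref_emb, mod_emb, gallery))
```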