Top Video & Multimodal Retrieval Papers: November 2025

by Alex Johnson

Stay up-to-date with the latest advancements in video and multimodal retrieval! This article summarizes the top research papers from November 26, 2025, focusing on innovations in video understanding, cross-modal learning, and retrieval techniques. For an enhanced reading experience and access to even more papers, be sure to check out the GitHub page. Let's dive into the exciting developments in these fields.

Video Retrieval

Video retrieval is a rapidly evolving field, with new techniques emerging to improve the accuracy and efficiency of video search and analysis. The following papers highlight some of the most innovative approaches in this area. These papers cover a range of topics, including adversarial robustness, video explanation, vision-language navigation, and trajectory prediction. Understanding these advancements is crucial for anyone working with video data, from researchers to industry professionals. Let's explore the key findings and contributions of each paper.

1. Adversarial Robustness for Unified Multi-Modal Encoders via Efficient Calibration

This paper addresses the critical issue of adversarial robustness in multi-modal encoders. Multi-modal encoders are designed to process and integrate information from various sources, such as video, audio, and text. However, these systems are vulnerable to adversarial attacks, where subtle perturbations in the input data can lead to incorrect outputs. The authors of this paper propose an efficient calibration technique to enhance the robustness of multi-modal encoders against adversarial attacks. Their approach focuses on improving the alignment and consistency of representations across different modalities. This is particularly important in applications where reliability and security are paramount, such as autonomous driving and surveillance systems. The proposed method not only improves robustness but also maintains the performance of the encoder on clean data, making it a practical solution for real-world applications. The research contributes significantly to the field by providing a method to defend against potential threats, thereby increasing the trustworthiness of multi-modal systems.
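To make the idea concrete, here is a minimal sketch of what such a calibration objective could look like for a CLIP-style dual encoder: a small FGSM-style perturbation is crafted against the image branch, and the loss then pulls the perturbed embedding back toward both the clean embedding and the paired text embedding. The encoder interface, the attack, and the loss weights are illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch of robustness calibration for a CLIP-style dual encoder.
# The encoder interface, the FGSM-style attack, and the loss weights are
# illustrative assumptions, not the paper's actual method.
import torch
import torch.nn.functional as F

def fgsm_perturb(image_encoder, images, text_emb, eps=2 / 255):
    """Craft a small perturbation that pushes image embeddings away from
    their paired text embeddings."""
    images = images.clone().detach().requires_grad_(True)
    img_emb = F.normalize(image_encoder(images), dim=-1)
    txt_emb = F.normalize(text_emb, dim=-1)
    loss = -(img_emb * txt_emb).sum(dim=-1).mean()   # maximizing this lowers similarity
    grad = torch.autograd.grad(loss, images)[0]
    return (images + eps * grad.sign()).clamp(0, 1).detach()

def calibration_loss(image_encoder, images, text_emb, eps=2 / 255, w=0.5):
    """Keep adversarial image embeddings close to both the clean embeddings
    and the paired text embeddings."""
    adv_images = fgsm_perturb(image_encoder, images, text_emb, eps)
    clean = F.normalize(image_encoder(images), dim=-1)
    adv = F.normalize(image_encoder(adv_images), dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    consistency = 1 - (adv * clean).sum(dim=-1).mean()  # clean vs. adversarial views
    alignment = 1 - (adv * txt).sum(dim=-1).mean()      # adversarial view vs. text
    return consistency + w * alignment
```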

2. Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations

Explaining the decisions of video classifiers is crucial for building trust and understanding how these systems work. This paper introduces a novel approach to generating video counterfactual explanations, which provide insights into why a video classifier made a particular prediction. Counterfactual explanations identify the minimal changes to the input video that would alter the classifier's output. This method helps users understand the features that the classifier relies on and identify potential biases or limitations. By providing clear and interpretable explanations, this research makes video classification models more transparent and accountable. This is especially important in applications where the stakes are high, such as medical diagnosis and legal proceedings. The ability to generate video counterfactual explanations enhances the usability and trustworthiness of video classifiers, paving the way for their broader adoption in various domains.
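As an illustration of the general mechanism (not the paper's algorithm), the sketch below searches for a small, sparse edit to a video clip that pushes the classifier toward a chosen target class; the classifier interface and hyperparameters are assumed.

```python
# Hedged sketch of a gradient-based counterfactual search (not the paper's
# algorithm): find a small, sparse edit to a clip that pushes the classifier
# toward a chosen target class. The classifier interface is assumed.
import torch
import torch.nn.functional as F

def video_counterfactual(classifier, clip, target_class, steps=200,
                         lr=0.01, sparsity=0.05):
    """clip: (T, C, H, W) tensor in [0, 1]. Returns an edited clip whose
    prediction is pushed toward target_class by a sparse delta."""
    delta = torch.zeros_like(clip, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    target = torch.tensor([target_class])
    for _ in range(steps):
        edited = (clip + delta).clamp(0, 1)
        logits = classifier(edited.unsqueeze(0))     # (1, num_classes)
        loss = F.cross_entropy(logits, target)
        loss = loss + sparsity * delta.abs().mean()  # keep the edit minimal
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (clip + delta.detach()).clamp(0, 1)
```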

3. FSR-VLN: Fast and Slow Reasoning for Vision-Language Navigation with Hierarchical Multi-modal Scene Graph

Vision-Language Navigation (VLN) is a challenging task that requires an agent to navigate through an environment based on natural language instructions. This paper presents FSR-VLN, a novel approach that combines fast and slow reasoning mechanisms using a hierarchical multi-modal scene graph. The fast reasoning component enables the agent to quickly process visual and textual information, while the slow reasoning component allows for more deliberate and contextual understanding. The hierarchical scene graph represents the environment at different levels of granularity, enabling the agent to make informed decisions at each step. This method significantly improves the efficiency and accuracy of VLN, allowing agents to navigate complex environments more effectively. Demo videos are available at https://horizonrobotics.github.io/robot_lab/fsr-vln/. This research advances the state-of-the-art in VLN and has implications for robotics, autonomous navigation, and human-computer interaction.
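The fast/slow split can be pictured with a toy loop over a hierarchical scene graph: a cheap similarity search shortlists candidate nodes, and a slower reasoner (stubbed here) makes the final call. The graph schema and the verify step are assumptions for illustration only, not FSR-VLN's actual components.

```python
# Illustrative fast/slow candidate selection over a hierarchical scene graph.
# The graph schema and the slow verification step are assumptions for
# illustration; FSR-VLN's actual components differ.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Node:
    name: str                  # e.g. "kitchen" (room) or "red chair" (object)
    embedding: np.ndarray      # vision-language feature for this node
    children: list = field(default_factory=list)

def fast_candidates(root, instruction_emb, k=3):
    """Fast path: rank every graph node by cosine similarity to the instruction."""
    nodes, stack = [], [root]
    while stack:
        n = stack.pop()
        nodes.append(n)
        stack.extend(n.children)
    sims = [float(np.dot(n.embedding, instruction_emb) /
                  (np.linalg.norm(n.embedding) * np.linalg.norm(instruction_emb)))
            for n in nodes]
    order = np.argsort(sims)[::-1][:k]
    return [nodes[i] for i in order]

def slow_verify(candidates, instruction, reasoner):
    """Slow path: hand the shortlist to a deliberate reasoner (e.g. an LLM/VLM
    call, stubbed here) that picks the final navigation target."""
    return reasoner(instruction, [c.name for c in candidates])
```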

4. KEPT: Knowledge-Enhanced Prediction of Trajectories from Consecutive Driving Frames with Vision-Language Models

Predicting the trajectories of vehicles and pedestrians is essential for autonomous driving systems. This paper introduces KEPT, a knowledge-enhanced approach that leverages vision-language models to predict trajectories from consecutive driving frames. KEPT integrates external knowledge, such as traffic rules and common driving patterns, to improve the accuracy and reliability of trajectory predictions. The vision-language models enable the system to understand the context of the scene and anticipate the movements of other agents. This research enhances the safety and efficiency of autonomous driving by providing more accurate and context-aware trajectory predictions. The incorporation of external knowledge is a key contribution, as it allows the system to make more informed decisions in complex driving scenarios. KEPT represents a significant step forward in the development of robust and reliable autonomous driving systems.

5. X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification

Person re-identification (ReID) is the task of identifying the same person across different cameras and viewpoints. This paper introduces X-ReID, a novel approach for video-based visible-infrared person re-identification that utilizes multi-granularity information interaction. X-ReID effectively integrates information from both visible and infrared cameras, addressing the challenges posed by variations in lighting and pose. By considering information at multiple granularities, the system can capture both fine-grained details and high-level contextual cues. This approach significantly improves the accuracy and robustness of person re-identification, making it suitable for a wide range of applications, including surveillance and security systems. Accepted to AAAI 2026, X-ReID showcases the potential of multi-granularity information interaction in addressing complex computer vision problems.
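One common way to realize multi-granularity features in ReID is to combine a global descriptor with part-level descriptors pooled from horizontal stripes. The sketch below shows that general pattern with illustrative layer sizes; X-ReID's actual interaction modules are more involved.

```python
# Minimal sketch of multi-granularity feature extraction, a common ReID
# pattern (one global descriptor plus horizontal part descriptors). Layer
# sizes are illustrative; X-ReID's interaction modules are more involved.
import torch
import torch.nn as nn

class MultiGranularityHead(nn.Module):
    def __init__(self, parts=4):
        super().__init__()
        self.global_pool = nn.AdaptiveAvgPool2d(1)          # coarse, whole-body cue
        self.part_pool = nn.AdaptiveAvgPool2d((parts, 1))   # fine, stripe-level cues

    def forward(self, feat_map):                  # feat_map: (B, C, H, W) backbone output
        g = self.global_pool(feat_map).flatten(1)            # (B, C)
        p = self.part_pool(feat_map).squeeze(-1)             # (B, C, parts)
        p = p.permute(0, 2, 1).flatten(1)                    # (B, parts * C)
        return torch.cat([g, p], dim=1)                      # fused multi-granularity descriptor
```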

6. Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

This paper introduces Video-R4, a framework that reinforces text-rich video reasoning with visual rumination. Video-R4 enhances the ability of models to understand and reason about videos by iteratively processing both visual and textual information. The visual rumination process allows the model to revisit and refine its understanding of the video content, leading to more accurate and coherent reasoning. This approach is particularly effective for videos that contain complex narratives and require deep understanding of both visual and textual cues. Video-R4 represents a significant advancement in video understanding and has the potential to improve performance in a variety of applications, such as video summarization, question answering, and content analysis. The iterative reasoning process is a key innovation, enabling the model to capture subtle details and nuances in the video content.
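The rumination loop can be sketched abstractly as answer, revisit, refine: the model produces an answer, decides which frames or regions deserve a second look, gathers the extra evidence, and answers again. The model methods below (answer, select_regions, read_text) are placeholders standing in for the paper's actual components.

```python
# Hedged sketch of an iterative "rumination" loop: answer, decide what to
# revisit, gather evidence, refine. The model methods (answer, select_regions,
# read_text) are placeholders, not Video-R4's actual interfaces.
def ruminate(model, frames, question, rounds=3):
    context = []                                   # evidence gathered so far
    answer = model.answer(frames, question, context)
    for _ in range(rounds):
        regions = model.select_regions(frames, question, answer)  # where to look again
        context.extend(model.read_text(regions))   # e.g. read text-rich crops
        new_answer = model.answer(frames, question, context)
        if new_answer == answer:                   # stop once revisiting changes nothing
            break
        answer = new_answer
    return answer
```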

7. Stitch-a-Demo: Video Demonstrations from Multistep Descriptions

Creating video demonstrations from multistep descriptions is a challenging task that requires the system to understand and execute complex instructions. This paper presents Stitch-a-Demo, a method for generating video demonstrations from textual descriptions that involve multiple steps. The system breaks down the description into individual actions and then stitches together relevant video segments to create a coherent demonstration. This approach is particularly useful for creating instructional videos and tutorials. Stitch-a-Demo automates the process of video creation, making it easier and more efficient to generate high-quality video content from textual descriptions. This research has implications for education, training, and content creation, enabling users to create engaging and informative videos with minimal effort.
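A minimal retrieve-and-stitch sketch of this idea appears below, assuming a text encoder and a pre-indexed clip library (both hypothetical): embed each step, pick the best-matching clip, and append it in order.

```python
# Minimal retrieve-and-stitch sketch, assuming a text encoder and a
# pre-indexed clip library (both hypothetical here, not Stitch-a-Demo's
# actual pipeline).
import numpy as np

def stitch_demo(steps, clip_library, embed):
    """steps: list of step descriptions; clip_library: list of (clip, embedding)
    pairs with unit-norm embeddings; embed: text encoder returning a unit-norm vector."""
    demo = []
    for step in steps:
        query = embed(step)
        scores = [float(np.dot(query, emb)) for _, emb in clip_library]
        best_clip, _ = clip_library[int(np.argmax(scores))]
        demo.append(best_clip)                     # one segment per step, in order
    return demo                                    # concatenate / re-encode downstream
```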

8. FOCUS: Efficient Keyframe Selection for Long Video Understanding

Understanding long videos is computationally expensive due to the large amount of data involved. This paper introduces FOCUS, an efficient keyframe selection method that reduces the computational cost of long video understanding. FOCUS selects the most informative frames from the video, allowing the system to focus on the critical content while discarding redundant information. This approach significantly improves the efficiency of video processing without sacrificing accuracy. By selecting keyframes strategically, FOCUS enables the system to process long videos more quickly and effectively. This research is crucial for applications that require real-time analysis of long videos, such as surveillance and video conferencing.
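One simple way to instantiate query-aware keyframe selection is to rank frames by similarity to a query embedding and greedily skip near-duplicates. The sketch below shows that baseline, which is an assumption for illustration rather than FOCUS's exact criterion.

```python
# Illustrative keyframe selection: rank frames by similarity to a query
# embedding and greedily skip near-duplicates. This baseline is an assumption
# for illustration, not FOCUS's exact selection criterion.
import numpy as np

def select_keyframes(frame_embs, query_emb, k=8, redundancy=0.9):
    """frame_embs: (T, D) unit-norm frame features; query_emb: (D,) unit-norm query."""
    relevance = frame_embs @ query_emb             # cosine similarity per frame
    order = np.argsort(relevance)[::-1]            # most relevant first
    keep = []
    for idx in order:
        # skip frames nearly identical to an already-selected keyframe
        if any(float(frame_embs[idx] @ frame_embs[j]) > redundancy for j in keep):
            continue
        keep.append(int(idx))
        if len(keep) == k:
            break
    return sorted(keep)                            # restore temporal order
```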

9. VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

Detecting highlights and retrieving specific moments from videos are essential tasks for video analysis and retrieval. This paper introduces VideoLights, a novel framework that combines feature refinement and a cross-task alignment transformer for joint video highlight detection and moment retrieval. VideoLights enhances the quality of video features and aligns them across different tasks, leading to improved performance in both highlight detection and moment retrieval. The cross-task alignment transformer enables the system to leverage information from both tasks, further enhancing its capabilities. This research represents a significant advancement in video analysis and retrieval, providing a unified framework for addressing multiple tasks simultaneously.
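The joint setup can be pictured as a shared transformer encoder over fused video-text features with two lightweight heads, one scoring per-clip saliency for highlights and one regressing moment spans. The layer sizes and head designs below are illustrative, not VideoLights' architecture.

```python
# Two-head sketch of joint highlight detection and moment retrieval over
# fused video-text features. Layer sizes and head designs are illustrative,
# not VideoLights' architecture.
import torch.nn as nn

class JointHighlightMomentHead(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, layers)
        self.saliency = nn.Linear(dim, 1)   # per-clip highlight score
        self.span = nn.Linear(dim, 2)       # per-clip (center, width) span offsets

    def forward(self, fused):               # fused: (B, T, dim) video+text features
        h = self.encoder(fused)             # shared representation for both tasks
        return self.saliency(h).squeeze(-1), self.span(h)
```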

10. Forgetful by Design? A Critical Audit of YouTube's Search API for Academic Research

This paper presents a critical audit of YouTube's Search API, highlighting potential limitations and biases that may affect academic research. The authors investigate whether the search API provides a complete and representative view of the platform's content, and discuss what its limitations mean for researchers who rely on it for data collection.