TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

SRI, Johns Hopkins University, United States Military Academy, University of Colorado Boulder
ICCV 2025

Video Presentation

Abstract

We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup, without any temporal annotations. Given a video and a question, we generate an open-ended answer grounded with the start and end time of the supporting segment. For this task, we propose TOGA: a vision-language model for Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune TOGA to jointly generate the answer and its temporal grounding. Since grounding annotations are not available in the weakly supervised setup, we generate pseudo labels for temporal grounding and ensure their validity by imposing a consistency constraint between the response to a grounding question and the response to a referring question over the same temporal segment. We observe that jointly generating answers with their grounding improves performance on question answering as well as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks and achieve state-of-the-art performance on benchmarks for both tasks.

Motivation

Current Video Question Answering (VideoQA) systems are becoming increasingly powerful, but they often struggle with two key challenges. First, they typically require expensive, manually created temporal annotations (timestamps) to learn how to locate the evidence for an answer within a video. Second, many systems are limited to selecting from multiple-choice options rather than generating free-form, open-ended answers.

This work tackles these limitations by proposing a framework for grounded VideoQA that operates under weak supervision—meaning it learns to ground answers in time without ever seeing ground-truth timestamps during training. Our approach is designed for long videos with complex questions about actor interactions and temporal event ordering, and it generates open-ended, natural language answers.

Example of TOGA answering questions about a video with temporal grounding.

TOGA takes a video and an open-ended question, and outputs a free-form answer along with the corresponding start and end times in the video.

Methodology

Our model, TOGA, is a Vision-Language Model (VLM) composed of four main modules: a frozen vision encoder, a frozen text encoder, a trainable Multi-Scale Vision-Language Connector (MS-VLC), and a trainable language decoder. The key to our approach lies in the MS-VLC and our multi-stage training strategy.
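Before detailing the individual pieces, the following PyTorch sketch shows how such a four-module layout might fit together. It is a minimal illustration only: the class, method, and argument names are assumptions, not TOGA's released code.

# Minimal sketch of the four-module layout: frozen encoders, trainable connector and decoder.
import torch
import torch.nn as nn


class GroundedVideoQA(nn.Module):
    def __init__(self, vision_encoder, text_encoder, connector, decoder):
        super().__init__()
        # Frozen backbones: gradients are disabled for their parameters.
        self.vision_encoder = vision_encoder.requires_grad_(False)
        self.text_encoder = text_encoder.requires_grad_(False)
        # Trainable parts: the multi-scale connector and the language decoder.
        self.connector = connector
        self.decoder = decoder

    def forward(self, sparse_frames, dense_frames, question_tokens):
        # Encode both temporal scales with the shared, frozen vision encoder.
        with torch.no_grad():
            sparse_feats = self.vision_encoder(sparse_frames)
            dense_feats = self.vision_encoder(dense_frames)
            question_feats = self.text_encoder(question_tokens)
        # The connector projects video features into the decoder's input space.
        video_tokens = self.connector(sparse_feats, dense_feats)
        # The decoder generates the free-form answer plus "[start, end]" tokens.
        return self.decoder(video_tokens, question_feats)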

  • Multi-Scale Vision-Language Connector (MS-VLC): This module processes video frames at two different temporal resolutions: a sparse scale (low frame rate) to capture long-term context and a dense scale (high frame rate) to capture fine-grained, short-term actions. This allows the model to effectively ground both long and short events (see the sampling sketch after this list).
  • Weakly Supervised Multi-Stage Training: Since we don't have grounding annotations, we train TOGA in three stages.
    1. Vision-Text Alignment: We first train the MS-VLC to align video features with text descriptions.
    2. Instruction Tuning with Pseudo-Labels: We generate noisy "pseudo-labels" for grounding by describing short clips from videos. We use these to teach the model the format of a grounded answer (e.g., "Answer [start, end]").
    3. Consistent Grounding: To refine the noisy labels, we enforce a consistency constraint. For a generated answer like "a boy is running [10, 20]", we check if the answer to a follow-up question, "What is happening in [10, 20]?", is also "a boy is running". This self-correction mechanism allows the model to learn accurate grounding without explicit labels.
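As referenced in the MS-VLC bullet above, the sketch below illustrates the idea of sampling the same video at two temporal rates. The frame budgets (8 and 64) are placeholder values, not TOGA's actual configuration.

# Minimal sketch of dual-rate frame sampling for a multi-scale connector.
import numpy as np


def sample_two_scales(num_frames: int, n_sparse: int = 8, n_dense: int = 64):
    """Return frame indices at a sparse scale (long-term context) and a
    dense scale (fine-grained, short-term actions)."""
    sparse_idx = np.linspace(0, num_frames - 1, n_sparse).round().astype(int)
    dense_idx = np.linspace(0, num_frames - 1, n_dense).round().astype(int)
    return sparse_idx, dense_idx


# Example: a 30-second clip at 30 fps.
sparse_idx, dense_idx = sample_two_scales(num_frames=900)
print(len(sparse_idx), len(dense_idx))  # 8 64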
Overview of the TOGA architecture

The TOGA framework uses a multi-scale connector to process video features and a multi-stage training strategy with a consistency constraint to learn grounding without temporal annotations.

Diagram of the consistency framework for weak supervision

Our consistency framework for training without temporal annotations. We generate pseudo-labels by captioning random video clips and then filter noisy ones by enforcing consistency between the model's response to a grounding question and a referring question for the same time segment.
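The snippet below is a hedged sketch of this consistency check. Here, model.answer_with_grounding, model.answer, and text_similarity are hypothetical interfaces standing in for the instruct-tuned VLM and a sentence-similarity scorer, and the 0.7 threshold is an arbitrary placeholder rather than TOGA's actual setting.

# Hedged sketch of the consistency filter for noisy grounding pseudo-labels.
def passes_consistency_check(model, video, question, text_similarity, threshold=0.7):
    # Grounding direction: answer the question and predict a [start, end] span.
    answer, (start, end) = model.answer_with_grounding(video, question)
    # Referring direction: ask what happens inside that same span.
    referring_question = f"What is happening between {start} and {end} seconds?"
    referred_answer = model.answer(video, referring_question)
    # Keep the training sample only if the two responses describe the same
    # event, e.g. "a boy is running" in both directions.
    return text_similarity(answer, referred_answer) >= threshold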

Results

We evaluated TOGA on several challenging benchmarks for both grounded and open-ended VideoQA. Our method achieves state-of-the-art performance, outperforming previous approaches in both settings.

On the NExT-GQA benchmark for weakly supervised grounded QA, TOGA surpasses existing methods on all grounding metrics, including mIoU and mIoP. Notably, it achieves this while operating in the more difficult open-ended setup, where the model generates answers from scratch instead of selecting from provided options.
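For reference, the snippet below computes these two segment-level metrics in the standard way: IoU normalizes the temporal intersection by the union of the two segments, while IoP (intersection over prediction) normalizes by the predicted segment alone. The example segments are invented.

# Temporal IoU and IoP for a predicted segment against a ground-truth segment.
# Segments are (start, end) pairs in seconds.
def temporal_iou_iop(pred, gt):
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    iou = intersection / union if union > 0 else 0.0
    # IoP normalizes by the predicted segment length ("intersection over prediction").
    iop = intersection / (pred[1] - pred[0]) if pred[1] > pred[0] else 0.0
    return iou, iop


print(temporal_iou_iop(pred=(10.0, 20.0), gt=(12.0, 22.0)))  # -> (0.666..., 0.8)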

Table of results on the NExT-GQA benchmark

Comparison with the state of the art on NExT-GQA. TOGA achieves the best performance on grounding metrics in the challenging open-ended evaluation setting.

For open-ended QA on the MSVD-QA and ActivityNet-QA datasets, TOGA also sets a new state of the art, demonstrating its strong capabilities in generating accurate, free-form answers.

Table of results for open-ended QA benchmarks

TOGA outperforms existing methods on both accuracy and score metrics for open-ended VideoQA.

Ablation Studies

We conducted several ablation studies to validate our key design choices:

  • Multi-Scale vs. Single-Scale Connector: We found that the MS-VLC significantly outperforms models using only a sparse or only a dense scale. The multi-scale approach is particularly effective for grounding very short and very long events, where single-scale models struggle.
    Table comparing multi-scale vs. single-scale performance

    The multi-scale (MS-VLC) model achieves the best grounding performance (IoU), especially for short- and long-duration events.

  • Impact of Consistency Constraint: Removing the final training stage with the consistency constraint causes a massive drop in performance (mIoU falls from 24.4 to 12.1). This demonstrates that our consistency check is crucial for learning accurate grounding from noisy pseudo-labels.
  • Analysis of Question Types: We analyzed performance on different question types in NExT-GQA. The model finds temporal questions (especially those about past or future events) more difficult than causal ("why/how") questions, as they require more complex reasoning about the sequence of events.

Qualitative Examples

We provide several qualitative examples to illustrate TOGA's performance. The model is capable of generating correct answers and accurately grounding them in the video for a variety of complex causal and temporal questions. Due to its open-ended nature, TOGA sometimes generates answers that are semantically equivalent but not an exact string match to the ground truth.

Qualitative examples on the NExT-GQA dataset

Qualitative results on the NExT-GQA dataset. TOGA provides grounded answers for both temporal and causal questions. Ground-truth segments are in green, and TOGA's predictions are in yellow.

BibTeX

@inproceedings{Gupta2025TOGA,
  title={TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision},
  author={Gupta, Ayush and Roy, Anirban and Chellappa, Rama and Bastian, Nathaniel D. and Velasquez, Alvaro and Jha, Susmit},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025},
  url={https://arxiv.org/abs/2506.09445}
}