EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

Huang, Shiqi; Wang, Ziyue; Zuo, Zhongrong; Qiu, Han; She, Qi; Wen, Bihan

EvoVid:
Temporal-Centric Self-Evolution for Video Large Language Models

Shiqi Huang¹, Ziyue Wang², Zhongrong Zuo², Han Qiu², Qi She², Bihan Wen¹^*

¹School of Electrical and Electronic Engineering, Nanyang Technological University
²ByteDance
^*Corresponding Author

Paper arXiv Code [coming soon]

Overview of EvoVid. Questioner π_Q and Solver π_S co-evolve through two temporal-centric rewards.

Abstract

Recent Video Large Language Models (Video-LLMs) have demonstrated strong capabilities in video reasoning through reinforcement learning (RL). However, existing RL pipelines rely heavily on human-annotated tasks and solutions, making them costly to scale and fundamentally constrained by human expertise. Self-evolving frameworks have recently emerged as a promising alternative through autonomous Questioner–Solver self-play. Unfortunately, these approaches are primarily designed for static modalities such as text and images, fundamentally failing to capture the temporal dynamics that are central to video reasoning. In this work, we propose EvoVid, a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from raw, unannotated videos. Specifically, we introduce two complementary temporal-centric rewards: a temporal-aware Questioner reward that encourages temporally dependent question generation through temporal perturbation sensitivity, and a temporal-grounded Solver reward that provides automatic temporal supervision via inherent video segment localization. Extensive experiments across four base models and six benchmarks demonstrate consistent improvements over both base models and existing self-evolving baselines, achieving competitive performance with supervised methods. These results highlight temporal-centric self-evolution as an effective and scalable paradigm for video understanding and reasoning.

Contributions

First self-evolving framework for Video-LLMs. EvoVid improves video understanding and reasoning directly from raw, unannotated videos, without relying on any human supervision.
Two temporal-centric verifiable rewards. A temporal-aware Questioner reward that preserves temporal perturbation sensitivity, and a temporal-grounded Solver reward that leverages inherent temporal segment supervision.
Extensive validation. Across four MLLMs and six video understanding and reasoning benchmarks, EvoVid delivers consistent gains, validating the effectiveness of video-based self-evolution.

Main Results

Across four MLLMs (Qwen2.5-VL-3B/7B and Qwen3-VL-4B/8B) and six video reasoning and general benchmarks, three iterations of EvoVid co-evolution yield consistent improvements over each base model—despite using no human-curated questions or annotations. The model progressively improves through a fully autonomous learning process, with the largest gains on temporally challenging benchmarks (e.g., +19.3% TempCompass on Qwen2.5-VL-3B).

Performance on video reasoning and general benchmarks across four MLLMs. For each base model we compare the untrained Base Model, the frozen-Questioner baseline, and three iterations of EvoVid co-evolution. Bold marks the best result per model; blue superscripts on the Iter-3 row report the absolute improvement over the corresponding base model.

Methods	Video-Holmes	VSI-Bench	VideoMMMU	MMVU	TempCompass	VideoMME
Qwen2.5-VL-3B-Instruct
Base Model (w/o training)	26.8	25.1	30.6	48.5	28.9	48.0
EvoVid ( Questioner)	27.8	26.3	35.0	53.3	34.5	48.3
EvoVid (Iter 1)	28.0	29.5	34.4	51.7	33.2	47.0
EvoVid (Iter 2)	26.9	28.4	36.9	53.1	46.3	48.9
EvoVid (Iter 3)	27.2^+0.4	29.5^+4.4	36.7^+6.1	54.6^+6.1	48.2^+19.3	48.6^+0.6
Qwen2.5-VL-7B-Instruct
Base Model (w/o training)	27.8	27.7	47.8	59.2	72.2	53.1
EvoVid ( Questioner)	28.4	29.3	46.6	60.5	72.5	51.5
EvoVid (Iter 1)	28.6	28.5	48.9	60.6	72.1	51.4
EvoVid (Iter 2)	29.9	30.9	48.3	60.5	73.1	53.8
EvoVid (Iter 3)	29.3^+1.5	31.7^+4.0	50.0^+2.2	62.4^+3.2	73.2^+1.0	53.6^+0.5
Qwen3-VL-4B-Instruct
Base Model (w/o training)	29.9	40.1	46.7	60.3	70.9	49.6
EvoVid ( Questioner)	30.7	40.6	45.1	61.1	70.6	49.6
EvoVid (Iter 1)	30.4	41.4	47.3	61.0	70.8	49.7
EvoVid (Iter 2)	29.1	43.1	47.3	62.4	70.9	49.7
EvoVid (Iter 3)	31.1^+1.2	42.8^+2.7	47.6^+0.9	62.6^+2.3	71.7^+0.8	51.4^+1.8
Qwen3-VL-8B-Instruct
Base Model (w/o training)	36.2	37.4	46.8	64.8	73.6	50.3
EvoVid ( Questioner)	36.3	35.3	47.8	63.7	73.4	50.4
EvoVid (Iter 1)	36.0	38.0	49.0	63.8	74.3	50.3
EvoVid (Iter 2)	35.2	38.3	48.1	67.0	73.7	51.4
EvoVid (Iter 3)	36.6^+0.4	39.8^+2.4	49.1^+2.3	66.2^+1.4	74.3^+0.7	51.2^+0.9

EvoVid:Temporal-Centric Self-Evolution for Video Large Language Models

Abstract

Contributions

Main Results

EvoVid:
Temporal-Centric Self-Evolution for Video Large Language Models