This repository contains the official implementation of the paper "C3-Bench: A Context-Aware Change Captioning Benchmark" (ECCV 2026).
Jaewoo Kim · Hyeongbeom Kim · Uehwan Kim
ECCV 2026
To meaningfully communicate and determine the correct change description among multiple logically valid alternatives in a heterogeneous visual world, each change must be grounded in specific contexts and associated criteria which clearly define the underlying semantics.
We introduce C3-Bench, a comprehensive benchmark for Context-aware Change Captioning, featuring:
- 4,996 human-annotated image pairs with change caption and context-specific criteria
- 51 real-world change contexts
- 4 visual domains:
  - Natural Scenes
  - Remote Sensing Imagery
  - Image Editing
  - Anomalies
- Human-aligned LLM-as-a-Judge evaluation for fine-grained semantics and reversibility
- Comprehensive benchmarking of 32 models, including:
  - 6 conventional change captioning models, such as DUDA
  - 9 leading proprietary MLLMs, such as GPT-5.2 and Gemini 3
  - 17 open-source MLLMs, such as Qwen3 and InternVL3.5
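To make the structure of a benchmark item concrete, here is a minimal sketch of how one sample could be represented in Python. It only mirrors what the list above states (an image pair, one of the 4 visual domains, a change context, context-specific criteria, and a change caption). The class names, field names, file paths, and example values are all assumptions made for illustration and are not the repository's actual data format or API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Tuple


class Domain(Enum):
    """The 4 visual domains listed above."""
    NATURAL_SCENES = "Natural Scenes"
    REMOTE_SENSING = "Remote Sensing Imagery"
    IMAGE_EDITING = "Image Editing"
    ANOMALIES = "Anomalies"


@dataclass(frozen=True)
class ChangeSample:
    """Hypothetical record for one annotated image pair (field names are assumed)."""
    before_image: str          # path to the "before" image
    after_image: str           # path to the "after" image
    domain: Domain             # one of the 4 visual domains
    context: str               # the change context the caption is grounded in
    criteria: Tuple[str, ...]  # context-specific criteria for a correct caption
    caption: str               # human-annotated change caption


def count_by_domain(samples: Iterable[ChangeSample]) -> Dict[Domain, int]:
    """Count samples per visual domain, including domains with zero samples."""
    counts = {domain: 0 for domain in Domain}
    for sample in samples:
        counts[sample.domain] += 1
    return counts


# Made-up example values, not taken from the dataset.
example = ChangeSample(
    before_image="pairs/0001_before.png",
    after_image="pairs/0001_after.png",
    domain=Domain.REMOTE_SENSING,
    context="urban construction monitoring",
    criteria=("mention newly built structures", "ignore seasonal vegetation changes"),
    caption="A new building has been constructed in the lower-left corner.",
)
```

Keeping the context and criteria next to the caption reflects the idea that a caption is only judged correct relative to its context; the official data loader may organize this differently.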
- Humans still set the upper bound.
  Human evaluators outperform the strongest LMM, GPT-5.2, by 1.73 points in Aggregation and achieve a high Reversibility score of 0.93, revealing a clear gap between current models and human-level change understanding.
- Fluency is not understanding.
  Conventional change captioning models often generate fluent sentences, but their performance drops sharply across diverse real-world contexts, showing that linguistic quality alone does not guarantee correct change reasoning.
- Context matters.
  The failure of conventional models highlights the limitation of prior benchmarks: models trained on narrow, dataset-specific change definitions struggle when the target change semantics shift across contexts.
- LMMs reshape the landscape.
  Proprietary LMMs deliver the strongest overall performance, with GPT-5.2 leading the benchmark, demonstrating the benefit of large-scale multimodal reasoning under explicit context conditioning.
- Open-source LMMs are catching up fast.
  Qwen3-VL-32B achieves highly competitive results, approaching proprietary models and trailing GPT-5.2 by only 0.35 points in Aggregation.
If you find this work useful for your research, please cite:
@InProceedings{Kim_2026_ECCV,
author = {Kim, Jae-Woo and Kim, Hyeongbeom and Kim, Ue-Hwan},
title = {C3-Bench: A Context-Aware Change Captioning Benchmark},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2026}
}
