C3-Bench: A Context-Aware Change Captioning Benchmark

This repository is the official implementation of the paper "C3-Bench: A Context-Aware Change Captioning Benchmark" (ECCV 2026).

Jaewoo Kim · Hyeongbeom Kim · Uehwan Kim
ECCV 2026

Overview of C3-Bench. The examples are from each context in C3-Bench.

💡 Problem Formulation

Change Captioning aims to describe the changes between two images.

However, what counts as a change is inherently context-dependent. For example, when asked to describe the change between the given image pair (see right), the first question that comes to mind is "in which context?", because the correct description varies with the context.

With respect to weather, the valid description is "the snow has covered the ground, and cloud cover has decreased." For railway surveillance, it is "a train has appeared on the left side of the tracks", with weather differences treated as pseudo-changes.

What has changed? (Motivation)
Without context, this generic question can admit multiple logically valid descriptions.

To meaningfully communicate and determine the correct change description among multiple logically valid alternatives in a heterogeneous visual world, each change must be grounded in specific contexts and associated criteria which clearly define the underlying semantics.
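To make the idea of grounding a change in a context and its criteria concrete, here is a minimal sketch of how one context-grounded sample could be represented. The class name, field names, file paths, and the wording of the criteria are all hypothetical and are not taken from the C3-Bench data format. Only the two contexts and their captions come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeSample:
    """One context-grounded change description (hypothetical schema)."""

    before_image: str  # placeholder path to the first image
    after_image: str   # placeholder path to the second image
    context: str       # context under which "change" is defined
    criteria: str      # illustrative wording of what counts as a change
    caption: str       # description that is valid under this context


def same_pair(a: ChangeSample, b: ChangeSample) -> bool:
    """True when two samples refer to the same image pair."""
    return (a.before_image, a.after_image) == (b.before_image, b.after_image)


# The same image pair, read under two different contexts.
weather = ChangeSample(
    before_image="pair_before.jpg",
    after_image="pair_after.jpg",
    context="weather",
    criteria="Describe changes in weather conditions.",
    caption="The snow has covered the ground, and cloud cover has decreased.",
)
railway = ChangeSample(
    before_image="pair_before.jpg",
    after_image="pair_after.jpg",
    context="railway surveillance",
    criteria="Describe changes on the tracks; treat weather differences as pseudo-changes.",
    caption="A train has appeared on the left side of the tracks.",
)

# Identical images, different valid captions: the context decides which one is correct.
print(same_pair(weather, railway), weather.caption != railway.caption)
```

Keeping the context and criteria next to the caption makes explicit that a caption is only judged correct relative to them, not relative to the image pair alone.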

🌐 C3-Bench

We introduce C3-Bench, a comprehensive benchmark for Context-aware Change Captioning, featuring:

4,996 human-annotated image pairs, each with a change caption and context-specific criteria
51 real-world change contexts
4 visual domains:
- Natural Scenes
- Remote Sensing Imagery
- Image Editing
- Anomalies
Human-aligned LLM-as-a-Judge evaluation for fine-grained semantics and reversibility
Comprehensive benchmarking of 32 models, including:
- 6 conventional change captioning models, such as DUDA
- 9 leading proprietary MLLMs, such as GPT-5.2 and Gemini 3
- 17 open-source MLLMs, such as Qwen3 and InternVL3.5

Examples from C3-Bench. Each image pair is displayed with its Domain: Context.

🏆 Results

Key Findings

Humans still set the upper bound.
Human evaluators outperform the strongest LMM, GPT-5.2, by 1.73 points in Aggregation and achieve a high Reversibility score of 0.93, revealing a clear gap between current models and human-level change understanding.
Fluency is not understanding.
Conventional change captioning models often generate fluent sentences, but their performance drops sharply across diverse real-world contexts, showing that linguistic quality alone does not guarantee correct change reasoning.
Context matters.
The failure of conventional models highlights the limitation of prior benchmarks: models trained on narrow, dataset-specific change definitions struggle when the target change semantics shift across contexts.
LMMs reshape the landscape.
Proprietary LMMs deliver the strongest overall performance, with GPT-5.2 leading the benchmark, demonstrating the benefit of large-scale multimodal reasoning under explicit context conditioning.
Open-source LMMs are catching up fast.
Qwen3-VL-32B achieves highly competitive results, approaching proprietary models and trailing GPT-5.2 by only 0.35 points in Aggregation.

C3-Bench results. Mean and standard deviation are reported over three GPT-5.2 runs.
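As a rough illustration of the reporting convention in the caption above, the sketch below aggregates one model's scores from repeated judge runs into a mean and a standard deviation. The function name and the scores are placeholders, not real results, and the use of the sample standard deviation (`statistics.stdev`) is an assumption: the source does not say whether the sample or population form is reported.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Placeholder scores for one model across three judge runs (not real results).
placeholder_runs = [3.0, 4.0, 5.0]
avg, sd = summarize_runs(placeholder_runs)
print(f"{avg:.2f} ± {sd:.2f}")
```

If the population form is intended instead, `statistics.pstdev` is the drop-in replacement.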

📃 Citation

If you find the work useful for your research, please cite:

@InProceedings{Kim_2026_ECCV,
    author    = {Kim, Jae-Woo and Kim, Hyeongbeom and Kim, Ue-Hwan},
    title     = {C3-Bench: A Context-Aware Change Captioning Benchmark},
    booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
    year      = {2026}
}
