ECCV 2026

VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning

Lingxiao Li², Yifan Wang¹, Xinyan Gao³, Chen Tang³, Xiangyu Yue³, Chenyu You¹

¹Stony Brook University    ²Boston University    ³MMLab, CUHK

VisReason teaser showing multi-round visual reasoning across fine-grained, document, and spatial tasks.
VisReason trains MLLMs to reason globally, localize relevant regions, zoom in when needed, and verify answers through spatially grounded visual Chain-of-Thought. The teaser highlights the central behavior of the dataset: the model first reads the whole image, then narrows attention to answer-critical evidence instead of relying on a single global representation.

Abstract

Chain-of-Thought prompting has proven remarkably effective for eliciting complex reasoning in large language models. Yet its potential in multimodal large language models remains largely untapped, hindered by the absence of large-scale datasets that capture the rich, spatially grounded reasoning intrinsic to visual understanding. We introduce VisReason, a large-scale dataset designed to advance visual Chain-of-Thought reasoning. VisReason comprises 489K annotated examples spanning four diverse domains, each featuring multi-round, RoI-grounded rationales that guide MLLMs through interpretable visual reasoning steps. We further curate VisReason-Pro, a 165K subset produced with a stronger GPT annotator and enriched with detailed reasoning traces and depth-augmented spatial annotations. Fine-tuning strong MLLM backbones on VisReason and VisReason-Pro improves step-by-step reasoning accuracy, RoI localization, interpretability, and fine-grained/spatial reasoning.

Highlights

489K visual-CoT training examples
165K VisReason-Pro expert subset
4 domains: text/doc, fine-grained, VQA, spatial
0.807 best average score on the Visual-CoT suite

VisReason focuses on process supervision rather than final-answer supervision alone. Each example can include a scene sketch, a predicted region of interest, a rationale for the selected region, and a final answer. This format teaches models to allocate visual attention adaptively instead of processing every image as a single uniform view.
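
To make the annotation format concrete, the record below is a minimal sketch of how one example with a scene sketch, per-round RoIs and rationales, and a final answer could be represented. The class names, field names, box convention, and the serialized text layout are illustrative assumptions, not the released VisReason schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed box convention: (x1, y1, x2, y2) in pixels of the original image.
Box = Tuple[float, float, float, float]


@dataclass
class ReasoningRound:
    """One round of visual CoT: a rationale and an optional region to zoom into."""
    rationale: str
    roi: Optional[Box] = None  # None when the round needs no zoom


@dataclass
class VisualCoTExample:
    """Hypothetical container for one process-supervised training example."""
    image_path: str
    question: str
    scene_sketch: str
    rounds: List[ReasoningRound] = field(default_factory=list)
    answer: str = ""

    def to_target_text(self) -> str:
        """Serialize the reasoning trace into a plain-text training target."""
        parts = [f"Sketch: {self.scene_sketch}"]
        for i, rnd in enumerate(self.rounds, start=1):
            if rnd.roi is not None:
                x1, y1, x2, y2 = rnd.roi
                parts.append(f"RoI {i}: [{x1:g}, {y1:g}, {x2:g}, {y2:g}]")
            parts.append(f"Rationale {i}: {rnd.rationale}")
        parts.append(f"Answer: {self.answer}")
        return "\n".join(parts)


# Made-up example values, for illustration only.
example = VisualCoTExample(
    image_path="images/receipt_001.jpg",
    question="What is the total amount?",
    scene_sketch="A printed receipt photographed on a table.",
    rounds=[
        ReasoningRound(
            rationale="The total is printed near the bottom of the receipt.",
            roi=(120, 540, 380, 600),
        )
    ],
    answer="12.50",
)
print(example.to_target_text())
```

Keeping the RoI optional per round mirrors the idea that zooming happens only when it is useful, so a single-view example and a multi-round example share one format.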

VisReason Dataset

Existing visual reasoning datasets are often limited in scale, domain breadth, or stepwise grounding. VisReason addresses these gaps with a unified annotation format that supports global-to-local reasoning, RoI localization, and depth-aware spatial supervision from monocular depth and segmentation cues. The examples below show how this supervision is represented directly in the training data: each answer is paired with the intermediate visual evidence that the model should inspect.

Examples of VisReason annotations with multi-round visual Chain-of-Thought and region grounding.
Each image-question pair includes compact multi-round visual CoT: a scene sketch, optional zoom to a predicted RoI, and a rationale that grounds the final answer in localized evidence. The figure illustrates multiple domains, including fine-grained recognition, text/document understanding, and spatial relation reasoning, to show that the same annotation format can supervise very different visual reasoning skills.

Dataset coverage. VisReason aggregates diverse source datasets while adding process-level, spatially aware supervision.
Domain | Source datasets | Train size | Val size | Annotation focus
Text / Doc | TextVQA, TextCaps, DocVQA, DUDE, SROIE | 111K | 3,462 | Text and document image reasoning
Fine-grained | Birds-200-2011 | 10K | 491 | Small visual evidence and category detail
General VQA | Flickr30k, Visual7W | 156K | 2,449 | General visual question answering
Spatial reasoning | VSR, GQA-Pro, Open Images | 211K | 2,326 | 2D relations and ordinal-depth cues
Total | 11 source datasets | 489K | 8,728 | Multi-round RoI-grounded visual CoT

Statistics of VisReason including CoT rounds, bounding box size, and response length.
VisReason provides rich multi-round supervision across diverse sources. Answer-critical RoIs are often small, reinforcing the need for models to localize, zoom, and verify. The statistics summarize how many reasoning rounds are needed, how large the selected regions tend to be, and how much textual supervision each round contributes.
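
One way to quantify how small an answer-critical RoI is would be its area relative to the full image. The helper below is a small sketch of that statistic; the function name, the (x1, y1, x2, y2) pixel box convention, and the example numbers are assumptions for illustration, not values taken from the dataset.

```python
from typing import Tuple

# Assumed box convention: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def roi_area_fraction(box: Box, image_size: Tuple[int, int]) -> float:
    """Return the share of the image area covered by the box, in [0, 1]."""
    width, height = image_size
    if width <= 0 or height <= 0:
        raise ValueError("image_size must be positive")
    x1, y1, x2, y2 = box
    # Clip the box to the image so out-of-bounds boxes cannot exceed 1.0.
    x1, x2 = max(0.0, x1), min(float(width), x2)
    y1, y2 = max(0.0, y1), min(float(height), y2)
    box_area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return box_area / (width * height)


# A 260 x 60 pixel region inside an 800 x 1000 image (made-up numbers).
fraction = roi_area_fraction((120, 540, 380, 600), (800, 1000))
print(f"{fraction:.2%}")
```

A region covering only a few percent of the image is a natural candidate for zooming, since its details may be lost in a single global view.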

Data Generation Pipeline

VisReason starts from image-question-answer triples and expands them into process-level supervision. VisReason-Pro further augments images with semantic segments and monocular depth to produce depth-aware spatial questions, target boxes, compact visual CoT traces, and verified final annotations. This pipeline is designed to keep the annotations compact while still making the reasoning path auditable: generated RoIs and answers are checked and corrected before being released as supervision.

Pipeline for VisReason and VisReason-Pro generation and supervision.
The pipeline derives semantic segments and ordinal-depth cues, generates spatial QA pairs, then iteratively emits and verifies scene sketches, RoIs, rationales, and answers. VisReason-Pro uses these additional spatial priors to emphasize relations such as left/right, above/below, in front of, and behind, without treating monocular depth as metric 3D ground truth.
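
The ordinal-depth step can be sketched as a comparison of per-segment depth statistics. The snippet below is a minimal illustration, assuming a relative depth map in which larger values mean farther from the camera, binary segment masks of the same shape, and a hypothetical tolerance `margin`; none of these choices are confirmed details of the VisReason-Pro pipeline. Only the ordering of the two medians is used, in keeping with not treating monocular depth as metric ground truth.

```python
from statistics import median
from typing import List

DepthMap = List[List[float]]  # relative depth; larger = farther (assumed)
Mask = List[List[int]]        # 1 where the segment covers the pixel


def median_depth(depth: DepthMap, mask: Mask) -> float:
    """Median relative depth over the pixels selected by the mask."""
    values = [
        d
        for depth_row, mask_row in zip(depth, mask)
        for d, m in zip(depth_row, mask_row)
        if m
    ]
    if not values:
        raise ValueError("mask selects no pixels")
    return median(values)


def ordinal_relation(depth: DepthMap, mask_a: Mask, mask_b: Mask,
                     margin: float = 0.05) -> str:
    """Describe segment A relative to segment B using only depth ordering."""
    depth_a = median_depth(depth, mask_a)
    depth_b = median_depth(depth, mask_b)
    if depth_a < depth_b - margin:
        return "in front of"
    if depth_a > depth_b + margin:
        return "behind"
    return "similar depth"  # too close to call; skip such pairs when generating QA


# Toy 2 x 4 scene: a near object on the left, a far object on the right.
depth = [[0.2, 0.2, 0.8, 0.8],
         [0.2, 0.3, 0.9, 0.8]]
mask_cup = [[1, 1, 0, 0], [1, 1, 0, 0]]
mask_lamp = [[0, 0, 1, 1], [0, 0, 1, 1]]

relation = ordinal_relation(depth, mask_cup, mask_lamp)
print(f"Q: Is the cup in front of or behind the lamp? A: {relation}")
```

Using the median rather than the mean makes the comparison less sensitive to a few noisy depth pixels along segment boundaries.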

Training MLLMs to Zoom and Verify

Given an image and a query, the trained model produces a sequence of reasoning actions. At each step, it emits a textual rationale and, when useful, a bounding box for the next region of interest. The crop from that box is fed back into the context, creating a global-to-local reasoning loop. We fine-tune Qwen2.5-VL-7B with supervised learning and LoRA, and additionally test transfer to InternVL-2.5-8B.

Overview of the VisReason iterative MLLM reasoning paradigm.
The VisReason paradigm lets an MLLM generate rationales and RoIs, crop the original image, append new visual features, and continue reasoning until the final answer is supported by localized evidence. This turns a static image input into an adaptive inference process where later steps can focus on details that were too small or ambiguous in the full view.
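
The global-to-local loop described above can be sketched in a few lines. In this illustration the MLLM is replaced by a caller-supplied `step_fn`, images are plain nested lists, and `max_rounds` is a hypothetical cap; all names and the interface are assumptions, not the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]           # toy grayscale image as rows of pixels
Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2), x2/y2 exclusive (assumed)
# One reasoning step: (rationale, next RoI or None, final answer or None).
Step = Tuple[str, Optional[Box], Optional[str]]
StepFn = Callable[[List[Image], str, List[str]], Step]


def crop(image: Image, box: Box) -> Image:
    """Crop the box from the image, clipping it to the image bounds."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    if x1 >= x2 or y1 >= y2:
        raise ValueError("box does not overlap the image")
    return [row[x1:x2] for row in image[y1:y2]]


def reason(image: Image, question: str, step_fn: StepFn,
           max_rounds: int = 3) -> Tuple[Optional[str], List[str]]:
    """Alternate between reasoning and zooming until an answer is produced."""
    views = [image]          # context starts with the full image
    trace: List[str] = []
    for _ in range(max_rounds):
        rationale, box, answer = step_fn(views, question, trace)
        trace.append(rationale)
        if answer is not None:
            return answer, trace
        if box is None:
            break            # no answer and nowhere to look next
        # Always crop from the original image, then append the new view.
        views.append(crop(image, box))
    return None, trace


def toy_step(views: List[Image], question: str, trace: List[str]) -> Step:
    """Stand-in for a model: zoom once, then read the brightest pixel."""
    if len(views) == 1:
        return "The target is small; zoom into the top-right corner.", (2, 0, 4, 2), None
    brightest = max(max(row) for row in views[-1])
    return f"The zoomed view peaks at {brightest}.", None, str(brightest)


image = [[0, 0, 1, 9],
         [0, 0, 2, 3],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
answer, trace = reason(image, "What is the brightest value?", toy_step)
print(answer, trace)
```

Bounding the number of rounds keeps inference cost predictable, and returning the trace alongside the answer makes each zoom decision inspectable.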

Evaluation Results

VisReason-Pro-7B achieves the strongest average score (0.807) among the compared open-source MLLMs on the Visual-CoT benchmark. Gains are most pronounced for fine-grained recognition (0.831) and spatial relation reasoning (0.710), the settings most aligned with RoI-grounded global-to-local supervision. Qwen2.5-VL-7B remains the strongest model on Text/Doc tasks (0.918 versus 0.890 for VisReason-Pro-7B), while VisReason-Pro improves the overall balance by specializing the model for localized visual evidence, relation reasoning, and stepwise grounded explanations.

Comparison on the Visual-CoT benchmark. Scores are averaged by task group; higher is better.
Model | Text / Doc | Fine-grained | General VQA | Spatial relation | Average
MiniGPTv2 | 0.200 | 0.678 | 0.599 | 0.632 | 0.466
VisCoT-7B | 0.556 | 0.559 | 0.613 | 0.689 | 0.580
LLaVA-NeXT-8B | 0.720 | 0.715 | 0.729 | 0.647 | 0.705
InternVL-2.5-8B | 0.841 | 0.747 | 0.697 | 0.643 | 0.738
Qwen2.5-VL-7B | 0.918 | 0.681 | 0.731 | 0.618 | 0.770
VisReason-7B | 0.879 | 0.792 | 0.731 | 0.690 | 0.791
VisReason-Pro-7B | 0.890 | 0.831 | 0.738 | 0.710 | 0.807

The improvements are not only answer-level. On the VisReason-Pro held-out suite, the model also improves RoI grounding (IoU@0.5 of 0.34, versus 0.29 for the next-best model, LLaVA-NeXT) and better aligns its predicted regions with ordinal-depth cues (depth error of 0.266, the lowest among the compared models). This matters because visual CoT is useful only when the intermediate steps point to evidence that actually supports the answer.

Grounding and spatial reasoning diagnostics on the VisReason-Pro held-out suite.
Model | IoU@0.5 | IoU@0.75 | Grounded ratio | BBox IoU | Depth error
MiniGPTv2 | 0.14 | 0.06 | - | - | -
LLaVA-NeXT | 0.29 | 0.19 | 0.039 | 0.207 | 0.394
InternVL-2.5 | 0.08 | 0.03 | 0.011 | 0.214 | 0.290
Qwen2.5-VL | - | - | 0.035 | 0.115 | 0.294
VisReason-Pro | 0.34 | 0.23 | 0.276 | 0.278 | 0.266
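
For readers unfamiliar with the IoU columns, the sketch below computes box intersection-over-union and a thresholded accuracy. It assumes that IoU@0.5 and IoU@0.75 denote the fraction of predicted boxes whose IoU with the reference box reaches the threshold, which is the common convention but is not spelled out in the table; the function names and example boxes are illustrative.

```python
from typing import Sequence, Tuple

# Assumed box convention: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box],
                    threshold: float) -> float:
    """Fraction of predictions whose IoU with the reference is >= threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


# Made-up boxes: a perfect match, a weak overlap, and a partial overlap.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 5, 15, 15), (0, 4, 10, 10)]

print(accuracy_at_iou(preds, gts, 0.5))
print(accuracy_at_iou(preds, gts, 0.75))
```

The stricter 0.75 threshold rewards tight localization, which is why scores in that column are lower than at 0.5 for every model.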

Ablation study. Baseline denotes VisReason, Pro denotes VisReason-Pro, and AZ denotes adaptive zoom-in.
Baseline / Pro / AZ | Text / Doc | General VQA | Relation | Fine-grained | Average
none | 0.920 | 0.739 | 0.598 | 0.681 | 0.770
yes | 0.864 | 0.744 | 0.678 | 0.798 | 0.779
yes, yes | 0.882 | 0.738 | 0.705 | 0.792 | 0.791
yes, yes | 0.856 | 0.750 | 0.693 | 0.809 | 0.781
yes, yes, yes | 0.908 | 0.745 | 0.729 | 0.831 | 0.807

The ablation shows that the full recipe matters: the base VisReason data improves fine-grained and relation-heavy tasks, VisReason-Pro adds higher-fidelity spatial rationales, and adaptive zoom-in gives the strongest final average (0.807 with all three components, versus 0.770 with none). A blinded user study further suggests that human raters also perceive the gains, scoring the reasoning traces as more faithful and clearer.

Human evaluation on answer accuracy, grounded faithfulness, and stepwise clarity/sufficiency. Scores are on a 1-5 scale.
Method | Answer accuracy | Grounded faithfulness | Stepwise clarity | Mean
MiniGPTv2 | 2.37 | 1.94 | 1.88 | 2.06
VisCoT-7B | 2.83 | 2.58 | 2.34 | 2.58
LLaVA-NeXT-8B | 3.52 | 2.93 | 3.08 | 3.18
InternVL-2.5-8B | 3.87 | 3.32 | 3.24 | 3.48
VisReason-7B | 4.07 | 4.18 | 4.12 | 4.12
VisReason-Pro-7B | 4.19 | 4.46 | 4.37 | 4.34

Qualitative visualization of VisReason inference modes and predicted bounding boxes.
Qualitative examples show how multi-round visual CoT progressively localizes critical regions and integrates evidence from original and zoomed-in views. The visualization compares inference modes and highlights predicted boxes across rounds, making it easier to see when the model attends to the correct evidence and when grounding errors affect the answer.

Conclusion

VisReason provides large-scale, RoI-grounded process supervision for visual Chain-of-Thought reasoning. By combining broad domain coverage with compact multi-round rationales and depth-augmented spatial cues, it teaches MLLMs to reason through a more human-like global-to-local workflow. Experiments show that this supervision improves fine-grained recognition, spatial relation reasoning, RoI localization, and the perceived faithfulness of reasoning traces. We hope VisReason serves as a foundation for future work on interpretable, verifiable, and spatially grounded multimodal intelligence.

BibTeX

@inproceedings{li2026visreason,
  title = {VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning},
  author = {Li, Lingxiao and Wang, Yifan and Gao, Xinyan and Tang, Chen and Yue, Xiangyu and You, Chenyu},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year = {2026}
}