VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning

Abstract

Chain-of-Thought prompting has proven remarkably effective for eliciting complex reasoning in large language models. Yet its potential in multimodal large language models remains largely untapped, hindered by the absence of large-scale datasets that capture the rich, spatially grounded reasoning intrinsic to visual understanding. We introduce VisReason, a large-scale dataset designed to advance visual Chain-of-Thought reasoning. VisReason comprises 489K annotated examples spanning four diverse domains, each featuring multi-round, RoI-grounded rationales that guide MLLMs through interpretable visual reasoning steps. We further curate VisReason-Pro, a 165K subset produced with a stronger GPT annotator and enriched with detailed reasoning traces and depth-augmented spatial annotations. Fine-tuning strong MLLM backbones on VisReason and VisReason-Pro improves step-by-step reasoning accuracy, RoI localization, interpretability, and fine-grained/spatial reasoning.

Highlights

489K visual-CoT training examples

165K VisReason-Pro expert subset

4 domains: text/doc, fine-grained, VQA, spatial

0.807 best average score on the Visual-CoT suite

VisReason focuses on process supervision rather than final-answer supervision alone. Each example can include a scene sketch, a predicted region of interest, a rationale for the selected region, and a final answer. This format teaches models to allocate visual attention adaptively instead of processing every image as a single uniform view.
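One way to picture the annotation format described above is as a small record type. The sketch below is illustrative only: the class and field names (`VisualCoTExample`, `ReasoningRound`, `scene_sketch`, `roi`) and the sample values are assumptions made for this example, not the released schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical schema: names are illustrative, not the released data format.
BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class ReasoningRound:
    rationale: str              # why this step or region matters for the question
    roi: Optional[BBox] = None  # predicted region of interest, if the round zooms


@dataclass
class VisualCoTExample:
    image_path: str
    question: str
    scene_sketch: str           # brief global description of the image
    rounds: List[ReasoningRound] = field(default_factory=list)
    answer: str = ""

    def grounded_rounds(self) -> List[ReasoningRound]:
        """Return only the rounds that carry an RoI."""
        return [r for r in self.rounds if r.roi is not None]


# Made-up sample values, used only to show the shape of one record.
example = VisualCoTExample(
    image_path="images/placeholder.jpg",
    question="What is written on the sign?",
    scene_sketch="A street corner with a storefront and a small sign.",
    rounds=[
        ReasoningRound("The sign is small, so zoom in on it.", roi=(412.0, 96.0, 508.0, 140.0)),
        ReasoningRound("The zoomed crop shows the text clearly."),
    ],
    answer="OPEN",
)
print(len(example.grounded_rounds()))
```

Keeping the RoI optional per round reflects the statement that an example "can include" a region: some rounds zoom, others only reason over what is already visible.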

VisReason Dataset

Existing visual reasoning datasets are often limited in scale, domain breadth, or stepwise grounding. VisReason addresses these gaps with a unified annotation format that supports global-to-local reasoning, RoI localization, and depth-aware spatial supervision from monocular depth and segmentation cues. The examples below show how this supervision is represented directly in the training data: each answer is paired with the intermediate visual evidence that the model should inspect.

Examples of VisReason annotations with multi-round visual Chain-of-Thought and region grounding. — Each image-question pair includes compact multi-round visual CoT: a scene sketch, optional zoom to a predicted RoI, and a rationale that grounds the final answer in localized evidence. The figure illustrates multiple domains, including fine-grained recognition, text/document understanding, and spatial relation reasoning, to show that the same annotation format can supervise very different visual reasoning skills.

Dataset coverage. VisReason aggregates diverse source datasets while adding process-level, spatially aware supervision.
Domain	Source datasets	Train size	Val size	Annotation focus
Text / Doc	TextVQA, TextCaps, DocVQA, DUDE, SROIE	111K	3,462	Text and document image reasoning
Fine-grained	Birds-200-2011	10K	491	Small visual evidence and category detail
General VQA	Flickr30k, Visual7W	156K	2,449	General visual question answering
Spatial reasoning	VSR, GQA-Pro, Open Images	211K	2,326	2D relations and ordinal-depth cues
Total	11 source datasets	489K	8,728	Multi-round RoI-grounded visual CoT
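The totals row can be cross-checked against the per-domain rows with a few lines of arithmetic. The snippet below copies the counts from the table; the dictionary layout and key names are assumptions made for this check. Because the per-domain training sizes are rounded to thousands, their sum can differ from the reported total by a rounding step.

```python
# Per-domain counts copied from the coverage table above.
# Training sizes are in thousands and already rounded in the source.
coverage = {
    "Text / Doc":        {"sources": 5, "train_k": 111, "val": 3462},
    "Fine-grained":      {"sources": 1, "train_k": 10,  "val": 491},
    "General VQA":       {"sources": 2, "train_k": 156, "val": 2449},
    "Spatial reasoning": {"sources": 3, "train_k": 211, "val": 2326},
}

total_sources = sum(d["sources"] for d in coverage.values())
total_train_k = sum(d["train_k"] for d in coverage.values())
total_val = sum(d["val"] for d in coverage.values())

print(total_sources)  # 11, matching the totals row
print(total_train_k)  # 488, within rounding of the reported 489K
print(total_val)      # 8728, matching the totals row
```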

Statistics of VisReason including CoT rounds, bounding box size, and response length. — VisReason provides rich multi-round supervision across diverse sources. Answer-critical RoIs are often small, reinforcing the need for models to localize, zoom, and verify. The statistics summarize how many reasoning rounds are needed, how large the selected regions tend to be, and how much textual supervision each round contributes.

Data Generation Pipeline

VisReason starts from image-question-answer triples and expands them into process-level supervision. VisReason-Pro further augments images with semantic segments and monocular depth to produce depth-aware spatial questions, target boxes, compact visual CoT traces, and verified final annotations. This pipeline is designed to keep the annotations compact while still making the reasoning path auditable: generated RoIs and answers are checked and corrected before being released as supervision.

Pipeline for VisReason and VisReason-Pro generation and supervision. — The pipeline derives semantic segments and ordinal-depth cues, generates spatial QA pairs, then iteratively emits and verifies scene sketches, RoIs, rationales, and answers. VisReason-Pro uses these additional spatial priors to emphasize relations such as left/right, above/below, in front of, and behind, without treating monocular depth as metric 3D ground truth.
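To make the ordinal-depth idea concrete, the sketch below compares two segments by their median depth and emits a relation label instead of a metric distance. It is a simplified illustration, not the authors' pipeline: the function names, the `margin` parameter, and the convention that smaller depth values are closer to the camera are all assumptions (some monocular estimators output inverse depth, in which case the comparison flips).

```python
from statistics import median
from typing import List

Grid = List[List[float]]
Mask = List[List[bool]]


def segment_depth(depth: Grid, mask: Mask) -> float:
    """Median depth over the pixels covered by a segment mask."""
    values = [d for drow, mrow in zip(depth, mask) for d, m in zip(drow, mrow) if m]
    if not values:
        raise ValueError("empty segment mask")
    return median(values)


def ordinal_relation(depth: Grid, mask_a: Mask, mask_b: Mask, margin: float = 0.1) -> str:
    """Label segment A relative to B, assuming smaller depth means closer."""
    da, db = segment_depth(depth, mask_a), segment_depth(depth, mask_b)
    # Treat nearly equal depths as undecided rather than forcing an order.
    if abs(da - db) <= margin * max(da, db):
        return "similar depth"
    return "in front of" if da < db else "behind"


# Toy 2x4 depth map: the left half is near, the right half is far.
depth = [[1.0, 1.0, 3.0, 3.0],
         [1.0, 1.2, 3.0, 3.2]]
left = [[True, True, False, False]] * 2
right = [[False, False, True, True]] * 2

print(ordinal_relation(depth, left, right))
print(ordinal_relation(depth, right, left))
```

Using only the ordering of per-segment depths, with a tolerance band, is one way to avoid treating monocular depth as metric ground truth.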

Training MLLMs to Zoom and Verify

Given an image and a query, the trained model produces a sequence of reasoning actions. At each step, it emits a textual rationale and, when useful, a bounding box for the next region of interest. The crop from that box is fed back into the context, creating a global-to-local reasoning loop. We fine-tune Qwen2.5-VL-7B with supervised fine-tuning using LoRA, and additionally test whether the recipe transfers to InternVL-2.5-8B.

Overview of the VisReason iterative MLLM reasoning paradigm. — The VisReason paradigm lets an MLLM generate rationales and RoIs, crop the original image, append new visual features, and continue reasoning until the final answer is supported by localized evidence. This turns a static image input into an adaptive inference process where later steps can focus on details that were too small or ambiguous in the full view.
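The loop described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `Step`, `crop`, `run_visual_cot`, and `toy_model` are hypothetical names, the image is a toy nested list instead of a tensor, and the stub stands in for a fine-tuned MLLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]           # toy grayscale image as rows of pixel values
BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


@dataclass
class Step:
    rationale: str
    roi: Optional[BBox] = None    # region to zoom into next, if any
    answer: Optional[str] = None  # set once the model is ready to answer


def crop(image: Image, box: BBox) -> Image:
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def run_visual_cot(
    image: Image,
    question: str,
    model: Callable[[List[Image], str, List[Step]], Step],
    max_rounds: int = 4,
) -> Tuple[Optional[str], List[Step]]:
    views = [image]               # visual context grows as crops are appended
    trace: List[Step] = []
    for _ in range(max_rounds):
        step = model(views, question, trace)
        trace.append(step)
        if step.answer is not None:
            return step.answer, trace
        if step.roi is not None:
            views.append(crop(image, step.roi))  # always crop the original image
    return None, trace            # no answer within the round budget


def toy_model(views: List[Image], question: str, trace: List[Step]) -> Step:
    """Stub: zoom once into the top-right corner, then answer from the crop."""
    if len(views) == 1:
        return Step("The target is small; zoom into the top-right corner.", roi=(2, 0, 4, 2))
    label = "bright" if views[-1][0][0] > 128 else "dark"
    return Step("The zoomed crop settles the question.", answer=label)


image = [[0, 0, 255, 255],
         [0, 0, 255, 255],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
answer, trace = run_visual_cot(image, "Is the top-right corner bright or dark?", toy_model)
print(answer, len(trace))
```

The `max_rounds` cap is an assumption added so the sketch always terminates; the source does not state how the number of rounds is bounded.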

Evaluation Results

VisReason-Pro-7B achieves the strongest average score (0.807) among the compared open-source MLLMs on the Visual-CoT benchmark, ahead of VisReason-7B (0.791) and the Qwen2.5-VL-7B backbone (0.770). Gains are most pronounced for fine-grained recognition (0.831) and spatial relation reasoning (0.710), the settings most aligned with RoI-grounded global-to-local supervision. Qwen2.5-VL-7B remains the strongest model on Text/Doc tasks (0.918); VisReason-Pro trades a small amount of that accuracy for better use of localized visual evidence, relation reasoning, and stepwise grounded explanations.

Comparison on the Visual-CoT benchmark. Scores are averaged by task group; higher is better.
Model	Text / Doc	Fine-grained	General VQA	Spatial relation	Average
MiniGPTv2	0.200	0.678	0.599	0.632	0.466
VisCoT-7B	0.556	0.559	0.613	0.689	0.580
LLaVA-NeXT-8B	0.720	0.715	0.729	0.647	0.705
InternVL-2.5-8B	0.841	0.747	0.697	0.643	0.738
Qwen2.5-VL-7B	0.918	0.681	0.731	0.618	0.770
VisReason-7B	0.879	0.792	0.731	0.690	0.791
VisReason-Pro-7B	0.890	0.831	0.738	0.710	0.807

The improvements are not only answer-level. On the VisReason-Pro held-out suite, the model also improves RoI grounding and better aligns its predicted regions with ordinal-depth cues. This is important because visual CoT is useful only when the intermediate steps point to evidence that actually supports the answer.

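Grounding and spatial reasoning diagnostics on the VisReason-Pro held-out suite. Higher is better for every column except depth error; a dash marks an entry with no reported value.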
Model	IoU@0.5	IoU@0.75	Grounded ratio	BBox IoU	Depth error
MiniGPTv2	0.14	0.06	-	-	-
LLaVA-NeXT	0.29	0.19	0.039	0.207	0.394
InternVL-2.5	0.08	0.03	0.011	0.214	0.290
Qwen2.5-VL	-	-	0.035	0.115	0.294
VisReason-Pro	0.34	0.23	0.276	0.278	0.266
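For readers unfamiliar with the grounding columns, the sketch below shows the standard intersection-over-union computation for axis-aligned boxes and a thresholded hit rate. It assumes that IoU@0.5 and IoU@0.75 denote the fraction of predicted boxes whose IoU with the target reaches the threshold, which is the usual reading but is not spelled out in the source; the function names and toy boxes are illustrative.

```python
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def hit_rate(preds: Sequence[BBox], targets: Sequence[BBox], threshold: float) -> float:
    """Fraction of predictions whose IoU with the paired target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


# Toy boxes: an exact match, a half-shifted box, and a complete miss.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)]
targets = [(0, 0, 10, 10), (5, 0, 15, 10), (0, 0, 10, 10)]

print(hit_rate(preds, targets, 0.5))
print(hit_rate(preds, targets, 0.25))
```

In the toy data only the exact match clears the 0.5 threshold, while the half-shifted box (IoU of one third) also counts at 0.25, which illustrates why the stricter threshold yields lower numbers in the table.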

Ablation study. Baseline denotes VisReason, Pro denotes VisReason-Pro, and AZ denotes adaptive zoom-in.
Baseline	Pro	AZ	Text / Doc	General VQA	Relation	Fine-grained	Average
no	no	no	0.920	0.739	0.598	0.681	0.770
yes	no	no	0.864	0.744	0.678	0.798	0.779
yes	no	yes	0.882	0.738	0.705	0.792	0.791
yes	yes	no	0.856	0.750	0.693	0.809	0.781
yes	yes	yes	0.908	0.745	0.729	0.831	0.807

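The ablation shows that the full recipe matters. Training on the base VisReason data lifts the relation score from 0.598 to 0.678 and the fine-grained score from 0.681 to 0.798, at some cost on Text/Doc (0.920 to 0.864). VisReason-Pro adds higher-fidelity spatial rationales, and combining it with adaptive zoom-in gives the strongest final average (0.807) while recovering most of the Text/Doc accuracy (0.908). A blinded user study further suggests that the gains are visible to humans as more faithful, clearer reasoning traces.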
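The per-column effect of each configuration can be read off the ablation table by subtracting the row without any VisReason components. The snippet below does that arithmetic on the table values; the tuple keys (VisReason data, Pro, adaptive zoom-in) and the helper name `deltas` are assumptions made for this illustration.

```python
# Rows copied from the ablation table, keyed by (VisReason data, Pro, adaptive zoom-in).
columns = ("Text / Doc", "General VQA", "Relation", "Fine-grained", "Average")
rows = {
    (False, False, False): (0.920, 0.739, 0.598, 0.681, 0.770),
    (True, False, False):  (0.864, 0.744, 0.678, 0.798, 0.779),
    (True, False, True):   (0.882, 0.738, 0.705, 0.792, 0.791),
    (True, True, False):   (0.856, 0.750, 0.693, 0.809, 0.781),
    (True, True, True):    (0.908, 0.745, 0.729, 0.831, 0.807),
}
reference = rows[(False, False, False)]


def deltas(config):
    """Score change of a configuration relative to the row with no components."""
    return {c: round(s - r, 3) for c, s, r in zip(columns, rows[config], reference)}


full = deltas((True, True, True))
for name, change in full.items():
    print(f"{name}: {change:+.3f}")
```

For the full recipe this gives the largest gains on the relation and fine-grained columns and a small decrease on Text / Doc, consistent with the discussion above.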

Human evaluation on answer accuracy, grounded faithfulness, and stepwise clarity/sufficiency. Scores are on a 1-5 scale.
Method	Answer accuracy	Grounded faithfulness	Stepwise clarity	Mean
MiniGPTv2	2.37	1.94	1.88	2.06
VisCoT-7B	2.83	2.58	2.34	2.58
LLaVA-NeXT-8B	3.52	2.93	3.08	3.18
InternVL-2.5-8B	3.87	3.32	3.24	3.48
VisReason-7B	4.07	4.18	4.12	4.12
VisReason-Pro-7B	4.19	4.46	4.37	4.34
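The Mean column is consistent with an unweighted average of the three criteria, rounded to two decimals. The check below recomputes it from the table values; the variable names are assumptions made for this example.

```python
# (answer accuracy, grounded faithfulness, stepwise clarity, reported mean)
scores = {
    "MiniGPTv2":        (2.37, 1.94, 1.88, 2.06),
    "VisCoT-7B":        (2.83, 2.58, 2.34, 2.58),
    "LLaVA-NeXT-8B":    (3.52, 2.93, 3.08, 3.18),
    "InternVL-2.5-8B":  (3.87, 3.32, 3.24, 3.48),
    "VisReason-7B":     (4.07, 4.18, 4.12, 4.12),
    "VisReason-Pro-7B": (4.19, 4.46, 4.37, 4.34),
}

# Recompute each mean from the three criteria and round as in the table.
recomputed = {
    name: round(sum(vals[:3]) / 3, 2) for name, vals in scores.items()
}

for name, vals in scores.items():
    print(f"{name}: reported {vals[3]:.2f}, recomputed {recomputed[name]:.2f}")
```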

Qualitative visualization of VisReason inference modes and predicted bounding boxes. — Qualitative examples show how multi-round visual CoT progressively localizes critical regions and integrates evidence from original and zoomed-in views. The visualization compares inference modes and highlights predicted boxes across rounds, making it easier to see when the model attends to the correct evidence and when grounding errors affect the answer.

Conclusion

VisReason provides large-scale, RoI-grounded process supervision for visual Chain-of-Thought reasoning. By combining broad domain coverage with compact multi-round rationales and depth-augmented spatial cues, it teaches MLLMs to reason through a more human-like global-to-local workflow. Experiments show that this supervision improves fine-grained recognition, spatial relation reasoning, RoI localization, and the perceived faithfulness of reasoning traces. We hope VisReason serves as a foundation for future work on interpretable, verifiable, and spatially grounded multimodal intelligence.

BibTeX

@inproceedings{li2026visreason,
  title = {VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning},
  author = {Li, Lingxiao and Wang, Yifan and Gao, Xinyan and Tang, Chen and Yue, Xiangyu and You, Chenyu},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year = {2026}
}