ECCV 2026
VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning
¹Stony Brook University; ²Boston University; ³MMLab, CUHK
Abstract
Chain-of-Thought prompting has proven remarkably effective for eliciting complex reasoning in large language models. Yet its potential in multimodal large language models remains largely untapped, hindered by the absence of large-scale datasets that capture the rich, spatially grounded reasoning intrinsic to visual understanding. We introduce VisReason, a large-scale dataset designed to advance visual Chain-of-Thought reasoning. VisReason comprises 489K annotated examples spanning four diverse domains, each featuring multi-round, RoI-grounded rationales that guide MLLMs through interpretable visual reasoning steps. We further curate VisReason-Pro, a 165K subset produced with a stronger GPT annotator and enriched with detailed reasoning traces and depth-augmented spatial annotations. Fine-tuning strong MLLM backbones on VisReason and VisReason-Pro improves step-by-step reasoning accuracy, RoI localization, interpretability, and fine-grained/spatial reasoning.
Highlights
VisReason focuses on process supervision rather than final-answer supervision alone. Each example can include a scene sketch, a predicted region of interest, a rationale for the selected region, and a final answer. This format teaches models to allocate visual attention adaptively instead of processing every image as a single uniform view.
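As an illustration of this format, the sketch below models one example as a small Python record. The class names, field names, box convention, and sample values are all hypothetical; the released annotation schema may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed box convention (not from the dataset spec): (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class ReasoningRound:
    """One reasoning round: a rationale and, optionally, the region it inspects."""
    rationale: str
    roi: Optional[Box] = None


@dataclass
class VisualCoTExample:
    """Hypothetical container for one process-supervised example."""
    image_path: str
    question: str
    scene_sketch: str
    rounds: List[ReasoningRound] = field(default_factory=list)
    answer: str = ""

    def grounded_rounds(self) -> int:
        # Count the rounds that carry a region of interest.
        return sum(1 for r in self.rounds if r.roi is not None)


# Illustrative values only; this is not a real dataset entry.
example = VisualCoTExample(
    image_path="images/receipt_0001.jpg",
    question="What is the total amount on the receipt?",
    scene_sketch="A printed receipt photographed on a table.",
    rounds=[
        ReasoningRound("Totals are usually printed near the bottom.",
                       roi=(120.0, 640.0, 480.0, 700.0)),
        ReasoningRound("The cropped line contains the total amount."),
    ],
    answer="(value read from the crop)",
)
```

Keeping the rationale and the region in the same round makes it straightforward to check that every localized step also explains why that region was chosen.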
VisReason Dataset
Existing visual reasoning datasets are often limited in scale, domain breadth, or stepwise grounding. VisReason addresses these gaps with a unified annotation format that supports global-to-local reasoning, RoI localization, and depth-aware spatial supervision from monocular depth and segmentation cues. The examples below show how this supervision is represented directly in the training data: each answer is paired with the intermediate visual evidence that the model should inspect.
| Domain | Source datasets | Train size | Val size | Annotation focus |
|---|---|---|---|---|
| Text / Doc | TextVQA, TextCaps, DocVQA, DUDE, SROIE | 111K | 3,462 | Text and document image reasoning |
| Fine-grained | Birds-200-2011 | 10K | 491 | Small visual evidence and category detail |
| General VQA | Flickr30k, Visual7W | 156K | 2,449 | General visual question answering |
| Spatial reasoning | VSR, GQA-Pro, Open Images | 211K | 2,326 | 2D relations and ordinal-depth cues |
| Total | 11 source datasets | 489K | 8,728 | Multi-round RoI-grounded visual CoT |
Data Generation Pipeline
VisReason starts from image-question-answer triples and expands them into process-level supervision. VisReason-Pro further augments images with semantic segments and monocular depth to produce depth-aware spatial questions, target boxes, compact visual CoT traces, and verified final annotations. This pipeline is designed to keep the annotations compact while still making the reasoning path auditable: generated RoIs and answers are checked and corrected before being released as supervision.
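One way a depth-aware spatial label could be derived from these two cues is sketched below: take the median depth inside each segment and compare the two medians. This is a minimal sketch, not the pipeline's actual procedure. The function names, the `margin` parameter, and the convention that smaller depth values mean closer to the camera are assumptions (some monocular depth models output inverse depth instead).

```python
from statistics import median
from typing import Sequence

DepthMap = Sequence[Sequence[float]]  # per-pixel depth, row-major
Mask = Sequence[Sequence[bool]]       # per-pixel segment membership


def median_depth(depth: DepthMap, mask: Mask) -> float:
    """Median depth over the pixels selected by the mask."""
    values = [d for d_row, m_row in zip(depth, mask)
              for d, m in zip(d_row, m_row) if m]
    if not values:
        raise ValueError("mask selects no pixels")
    return median(values)


def ordinal_relation(depth: DepthMap, mask_a: Mask, mask_b: Mask,
                     margin: float = 0.05) -> str:
    """Return how segment A relates to segment B in depth.

    Assumes smaller depth means closer to the camera. `margin` is a
    hypothetical tolerance below which the two segments count as similar.
    """
    d_a = median_depth(depth, mask_a)
    d_b = median_depth(depth, mask_b)
    if abs(d_a - d_b) <= margin:
        return "similar"
    return "closer" if d_a < d_b else "farther"


# Toy 2x3 scene: the left object is near, the right object is far.
depth = [[1.0, 1.0, 3.0],
         [1.2, 3.0, 3.2]]
left = [[True, True, False],
        [True, False, False]]
right = [[False, False, True],
         [False, True, True]]
relation = ordinal_relation(depth, left, right)
```

Using the median rather than the mean keeps the label stable when a segment mask bleeds slightly onto the background.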
Training MLLMs to Zoom and Verify
Given an image and a query, the trained model produces a sequence of reasoning actions. At each step, it emits a textual rationale and, when useful, a bounding box for the next region of interest. The crop from that box is fed back into the context, creating a global-to-local reasoning loop. We fine-tune Qwen2.5-VL-7B with supervised learning and LoRA, and additionally test transfer to InternVL-2.5-8B.
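The loop described above can be sketched as follows. This is an illustrative control flow only: the `model` callable, the `Step` record, the `max_rounds` cap, and the choice to crop from the full image are assumptions, and a toy nested-list image stands in for real pixels.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed convention: (x_min, y_min, x_max, y_max)
Image = List[List[int]]          # toy image: rows of pixel values


@dataclass
class Step:
    rationale: str
    box: Optional[Box] = None    # next region of interest, if the model requests one
    answer: Optional[str] = None # set only on the final step


def crop(image: Image, box: Box) -> Image:
    """Cut the box out of the image (max coordinates are exclusive)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def reason(model: Callable[[List[Image], str, List[Step]], Step],
           image: Image, query: str, max_rounds: int = 4) -> List[Step]:
    """Run a global-to-local loop until the model answers or rounds run out."""
    views = [image]  # the context starts with the full image
    trace: List[Step] = []
    for _ in range(max_rounds):
        step = model(views, query, trace)
        trace.append(step)
        if step.answer is not None:
            break
        if step.box is not None:
            # Assumption: boxes are given in full-image coordinates.
            views.append(crop(views[0], step.box))
    return trace


def toy_model(views: List[Image], query: str, trace: List[Step]) -> Step:
    """Stand-in for an MLLM: zoom once, then answer from the crop."""
    if len(views) == 1:
        return Step("The bright pixels sit in the right half.", box=(2, 1, 4, 3))
    brightest = max(max(row) for row in views[-1])
    return Step("The crop shows the brightest value.", answer=str(brightest))


image = [[0, 0, 0, 0],
         [0, 0, 5, 6],
         [0, 0, 7, 9],
         [0, 0, 0, 0]]
trace = reason(toy_model, image, "What is the brightest value?")
```

Here the trace has two steps: one that requests a zoom and one that answers from the crop. Because every step is recorded with its rationale and box, the intermediate evidence can be inspected after the fact.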
Evaluation Results
VisReason-Pro-7B achieves the strongest average score among the compared open-source MLLMs on the Visual-CoT benchmark. Gains are most pronounced for fine-grained recognition and spatial relation reasoning, the settings most aligned with RoI-grounded global-to-local supervision. Qwen2.5-VL-7B remains the strongest model on Text/Doc tasks, while VisReason-Pro improves the overall balance by specializing the model for localized visual evidence, relation reasoning, and stepwise grounded explanations.
| Model | Text / Doc | Fine-grained | General VQA | Spatial relation | Average |
|---|---|---|---|---|---|
| MiniGPTv2 | 0.200 | 0.678 | 0.599 | 0.632 | 0.466 |
| VisCoT-7B | 0.556 | 0.559 | 0.613 | 0.689 | 0.580 |
| LLaVA-NeXT-8B | 0.720 | 0.715 | 0.729 | 0.647 | 0.705 |
| InternVL-2.5-8B | 0.841 | 0.747 | 0.697 | 0.643 | 0.738 |
| Qwen2.5-VL-7B | 0.918 | 0.681 | 0.731 | 0.618 | 0.770 |
| VisReason-7B | 0.879 | 0.792 | 0.731 | 0.690 | 0.791 |
| VisReason-Pro-7B | 0.890 | 0.831 | 0.738 | 0.710 | 0.807 |
The improvements are not only answer-level. On the VisReason-Pro held-out suite, the model also improves RoI grounding and better aligns its predicted regions with ordinal-depth cues. This is important because visual CoT is useful only when the intermediate steps point to evidence that actually supports the answer.
| Model | IoU@0.5 | IoU@0.75 | Grounded ratio | BBox IoU | Depth error |
|---|---|---|---|---|---|
| MiniGPTv2 | 0.14 | 0.06 | - | - | - |
| LLaVA-NeXT | 0.29 | 0.19 | 0.039 | 0.207 | 0.394 |
| InternVL-2.5 | 0.08 | 0.03 | 0.011 | 0.214 | 0.290 |
| Qwen2.5-VL | - | - | 0.035 | 0.115 | 0.294 |
| VisReason-Pro | 0.34 | 0.23 | 0.276 | 0.278 | 0.266 |
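For readers unfamiliar with thresholded IoU, the sketch below computes box IoU and the fraction of predictions that clear a threshold. It follows the common convention that IoU@0.5 counts a prediction as correct when its IoU with the ground-truth box is at least 0.5; the exact definitions behind the columns above (including whether the threshold is inclusive) are assumptions here, and the function names are illustrative.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed: (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def hit_rate(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the paired ground truth meets the threshold."""
    if not gts or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and the same length")
    hits = sum(1 for p, g in zip(preds, gts) if box_iou(p, g) >= threshold)
    return hits / len(gts)


preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 15, 10)]  # second pair overlaps by half a box
score_at_050 = hit_rate(preds, gts, 0.5)
```

In this toy case the second pair has an IoU of 1/3, so only one of the two predictions counts at the 0.5 threshold.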
| Components enabled (of Baseline, Pro, AZ) | Text / Doc | General VQA | Relation | Fine-grained | Average |
|---|---|---|---|---|---|
| none | 0.920 | 0.739 | 0.598 | 0.681 | 0.770 |
| one | 0.864 | 0.744 | 0.678 | 0.798 | 0.779 |
| two | 0.882 | 0.738 | 0.705 | 0.792 | 0.791 |
| two | 0.856 | 0.750 | 0.693 | 0.809 | 0.781 |
| all three | 0.908 | 0.745 | 0.729 | 0.831 | 0.807 |
The ablation shows that the full recipe matters: the base VisReason data improves fine-grained and relation-heavy tasks, VisReason-Pro adds higher-fidelity spatial rationales, and adaptive zoom-in gives the strongest final average. A blinded user study further suggests that the gains are visible to humans as more faithful, clearer reasoning traces.
| Method | Answer accuracy | Grounded faithfulness | Stepwise clarity | Mean |
|---|---|---|---|---|
| MiniGPTv2 | 2.37 | 1.94 | 1.88 | 2.06 |
| VisCoT-7B | 2.83 | 2.58 | 2.34 | 2.58 |
| LLaVA-NeXT-8B | 3.52 | 2.93 | 3.08 | 3.18 |
| InternVL-2.5-8B | 3.87 | 3.32 | 3.24 | 3.48 |
| VisReason-7B | 4.07 | 4.18 | 4.12 | 4.12 |
| VisReason-Pro-7B | 4.19 | 4.46 | 4.37 | 4.34 |
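The Mean column is consistent with an unweighted average of the three criteria, rounded to two decimals. The short check below recomputes it from the scores listed in the table; the variable and function names are illustrative.

```python
# Scores copied from the table: (answer accuracy, grounded faithfulness, stepwise clarity).
ratings = {
    "MiniGPTv2": (2.37, 1.94, 1.88),
    "VisCoT-7B": (2.83, 2.58, 2.34),
    "LLaVA-NeXT-8B": (3.52, 2.93, 3.08),
    "InternVL-2.5-8B": (3.87, 3.32, 3.24),
    "VisReason-7B": (4.07, 4.18, 4.12),
    "VisReason-Pro-7B": (4.19, 4.46, 4.37),
}


def mean_score(scores):
    """Unweighted mean of the criteria, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


means = {name: mean_score(scores) for name, scores in ratings.items()}
for name, value in means.items():
    print(f"{name}: {value:.2f}")
```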
Conclusion
VisReason provides large-scale, RoI-grounded process supervision for visual Chain-of-Thought reasoning. By combining broad domain coverage with compact multi-round rationales and depth-augmented spatial cues, it teaches MLLMs to reason through a more human-like global-to-local workflow. Experiments show that this supervision improves fine-grained recognition, spatial relation reasoning, RoI localization, and the perceived faithfulness of reasoning traces. We hope VisReason serves as a foundation for future work on interpretable, verifiable, and spatially grounded multimodal intelligence.
BibTeX
@inproceedings{li2026visreason,
title = {VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning},
author = {Li, Lingxiao and Wang, Yifan and Gao, Xinyan and Tang, Chen and Yue, Xiangyu and You, Chenyu},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}