ImageMining: A Vision-Centric Deep Search Benchmark

Hugging Face dataset: https://huggingface.co/datasets/zai-org/ImageMining

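# Requires the Hugging Face `datasets` library (pip install datasets).
from datasets import load_dataset

# Fetch the benchmark from the Hugging Face Hub by its repository ID.
dataset = load_dataset("zai-org/ImageMining")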

ImageMining is a benchmark designed to evaluate multimodal agents on their ability to integrate high-density visual understanding with autonomous deep search. Unlike traditional visual question answering (VQA) tasks, ImageMining requires models to actively mine visual inputs through agentic behaviors: multi-step tool calls such as localized cropping, magnification of minute details, and cross-referencing of visual inputs to refine search queries.

Overview

The benchmark tests the paradigm of "think with image, deep search with image": anchoring reasoning in visual context rather than relying on textual shortcuts or parametric knowledge. Task performance correlates strongly with the precision of on-image tool usage. Models are evaluated across a "Deep-Wide-Search" spectrum that spans both the breadth of search across sources and the depth of visual reasoning.

Statistics

Metric	Count
Total test cases	217
Domains	7
Sub-categories	23
Reasoning types	5

Data Format

Each entry in data.jsonl contains the following fields:

Field	Description
`id`	Unique identifier
`category_l1`	Primary domain category
`category_l2`	Fine-grained sub-category
`difficulty_tags`	List of reasoning types required
`image`	Corresponding image filename in `images/`
`question`	Question text (English)
`answer`	Ground-truth answer (English)
`question_zh`	Original question (Chinese)
`answer_zh`	Original answer (Chinese)
`reasoning`	Step-by-step reasoning chain (English)
`reasoning_zh`	Original reasoning chain (Chinese)
`need_image_before_search`	Whether visual input is needed before starting search (`yes`/`no`)
`need_image_during_search`	Whether visual input is needed during the search process (`yes`/`no`)
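
The field list above can be turned into a small schema check. The following is a minimal sketch, assuming each line of data.jsonl is one JSON object with exactly the documented keys; the helper names `parse_entry` and `load_entries` are hypothetical and not part of the dataset's tooling.

```python
import json

# Field names are copied from the table above; the helpers are illustrative only.
REQUIRED_FIELDS = {
    "id", "category_l1", "category_l2", "difficulty_tags", "image",
    "question", "answer", "question_zh", "answer_zh",
    "reasoning", "reasoning_zh",
    "need_image_before_search", "need_image_during_search",
}
FLAG_FIELDS = ("need_image_before_search", "need_image_during_search")


def parse_entry(line: str) -> dict:
    """Parse one JSONL line and check it against the documented fields."""
    entry = json.loads(line)
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(entry["difficulty_tags"], list):
        raise ValueError("difficulty_tags must be a list")
    for flag in FLAG_FIELDS:
        if entry[flag] not in ("yes", "no"):
            raise ValueError(f"{flag} must be 'yes' or 'no'")
    return entry


def load_entries(path: str = "data.jsonl") -> list:
    """Read every non-empty line of the file into a validated dict."""
    with open(path, encoding="utf-8") as handle:
        return [parse_entry(line) for line in handle if line.strip()]
```

Calling `load_entries("data.jsonl")` from the dataset root would then return a list of dicts, raising `ValueError` on the first entry that deviates from the table.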

Example

{
  "id": 1,
  "category_l1": "Social & Humanities",
  "category_l2": "Politics",
  "difficulty_tags": ["Event Reasoning", "Image Retrieval Reasoning"],
  "image": "1.png",
  "question": "While holding an important position, this singer bought a rock record during a visit to China. What English text was written on the cover of that album?",
  "answer": "DOU WEI BLACK DREAM",
  "question_zh": "这位歌手在他担任要职时...",
  "answer_zh": "DOU WEI BLACK DREAM",
  "reasoning": "1. An image search reveals that the singer is Blinken...",
  "reasoning_zh": "1. 搜图得到该歌手是布林肯...",
  "need_image_before_search": "yes",
  "need_image_during_search": "yes"
}
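
As an illustration of how an entry like the one above might be consumed, the sketch below resolves the image file and converts the two `yes`/`no` flags to booleans. The helpers `image_path` and `image_requirements` and the default root directory name are assumptions made for this example, and the entry is abbreviated to the fields the helpers use.

```python
from pathlib import Path

# Abbreviated copy of the example entry; only the fields used below are kept.
example = {
    "id": 1,
    "image": "1.png",
    "need_image_before_search": "yes",
    "need_image_during_search": "yes",
}


def image_path(entry: dict, root: str = "ImageMining") -> Path:
    """Locate the entry's image under <root>/images/ (root is an assumed checkout path)."""
    return Path(root) / "images" / entry["image"]


def image_requirements(entry: dict) -> dict:
    """Convert the 'yes'/'no' flags into booleans keyed by search phase."""
    return {
        "before_search": entry["need_image_before_search"] == "yes",
        "during_search": entry["need_image_during_search"] == "yes",
    }


print(image_path(example))
print(image_requirements(example))
```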

Categories

Domains (category_l1)

Category	Count
Rich Text	41
Science	40
Place	36
Social & Humanities	31
Product	31
Entertainment & Sports	25
Nature	13
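
The domain counts above partition the benchmark, so they should add up to the 217 test cases reported under Statistics. A short sketch of that check follows, together with a hypothetical `domain_histogram` helper that recomputes the table from loaded entries (assumed to be dicts with a `category_l1` key).

```python
from collections import Counter

# Counts copied from the table above.
DOMAIN_COUNTS = {
    "Rich Text": 41,
    "Science": 40,
    "Place": 36,
    "Social & Humanities": 31,
    "Product": 31,
    "Entertainment & Sports": 25,
    "Nature": 13,
}
TOTAL_CASES = 217


def domain_histogram(entries) -> Counter:
    """Count entries per primary domain (each entry has exactly one category_l1)."""
    return Counter(entry["category_l1"] for entry in entries)


# Every test case belongs to exactly one domain, so the counts sum to the total.
assert sum(DOMAIN_COUNTS.values()) == TOTAL_CASES
assert len(DOMAIN_COUNTS) == 7
```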

Sub-categories (category_l2)

Top sub-categories include: Place (36), Others (36), Complex Posters and Drawings (13), Electronics & Digital (10), Biology (9), Astronomy (8), Chemistry (8), Physics (7), Arts (7), Document (7), among others.

Reasoning Types (difficulty_tags)

Type	Count
Image Retrieval Reasoning	102
Text Reasoning	99
Object Recognition	99
Event Reasoning	66
Spatiotemporal Reasoning	66

Note: Each test case may involve multiple reasoning types.
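
Because `difficulty_tags` is a list, the counts above sum to more than 217. The sketch below makes that multi-label accounting explicit and shows one way to select the entries that carry a given tag. The helpers `tag_histogram` and `with_tag` are hypothetical names introduced for this example.

```python
from collections import Counter

# Counts copied from the table above.
TAG_COUNTS = {
    "Image Retrieval Reasoning": 102,
    "Text Reasoning": 99,
    "Object Recognition": 99,
    "Event Reasoning": 66,
    "Spatiotemporal Reasoning": 66,
}
TOTAL_CASES = 217


def tag_histogram(entries) -> Counter:
    """Count tag occurrences; an entry with several tags contributes to each."""
    counts = Counter()
    for entry in entries:
        counts.update(entry["difficulty_tags"])
    return counts


def with_tag(entries, tag: str) -> list:
    """Return the entries whose difficulty_tags include the given tag."""
    return [entry for entry in entries if tag in entry["difficulty_tags"]]


total_assignments = sum(TAG_COUNTS.values())
mean_tags_per_case = total_assignments / TOTAL_CASES
print(total_assignments, round(mean_tags_per_case, 2))
```

The table therefore records 432 tag assignments over 217 cases, or roughly two reasoning types per test case on average.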

Descriptions:

Object Recognition: Fine-grained identification of flora, fauna, artifacts, and real-world objects.
Spatiotemporal Reasoning: Geographic and temporal deduction grounded in visual cues.
Event Reasoning: Comprehension of news events, political milestones, and product launches.
Text Reasoning: Reasoning over embedded rich text such as academic papers, financial reports, and documents.
Image Retrieval Reasoning: Cross-referencing visual inputs to retrieve specific artworks, imagery, or information through search.

Dataset Structure

ImageMining/
├── data.jsonl          # Main dataset (217 entries)
├── images/             # Associated images (217 files, need to be downloaded separately)
│   ├── 1.png
│   ├── 2.jpg
│   └── ...
└── README.md

Download Images: The associated images can be downloaded from Tsinghua Cloud. After downloading, please extract the files into the images/ directory.
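
Since the images are fetched separately, it can help to confirm that the extraction landed in the right place. The following is a minimal sketch, assuming the directory layout shown above; `find_missing_images` is a hypothetical helper, not a script shipped with the dataset.

```python
import json
from pathlib import Path


def find_missing_images(root: str) -> list:
    """List image filenames referenced in <root>/data.jsonl but absent from <root>/images/."""
    root_dir = Path(root)
    missing = []
    with open(root_dir / "data.jsonl", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines in the JSONL file
            name = json.loads(line)["image"]
            if not (root_dir / "images" / name).is_file():
                missing.append(name)
    return missing
```

An empty list from `find_missing_images("ImageMining")` would indicate that every entry's image is in place.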

Key Design Principles

Visual-First Reasoning: Questions are designed so that visual input is essential — models cannot solve them through text alone.
Multi-Step Search: Success requires iterative search with visual feedback, including localized cropping and magnification of details.
Diverse Domains: Covers 7 domains and 23 sub-categories spanning science, politics, culture, nature, and more.
Annotated Reasoning Chains: Each entry includes a human-verified step-by-step reasoning process for interpretability.
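
To make "localized cropping and magnification" concrete, here is a toy sketch of the two operations on an image represented as a nested list of pixel values. It is purely illustrative: the benchmark does not prescribe a tool interface, the function names `crop` and `magnify` are assumptions, and a real agent would more likely call an image library.

```python
def crop(pixels, left, top, right, bottom):
    """Return the sub-grid covering rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in pixels[top:bottom]]


def magnify(pixels, factor):
    """Upscale by an integer factor using nearest-neighbour repetition."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    enlarged = []
    for row in pixels:
        # Repeat each pixel horizontally, then repeat the widened row vertically.
        wide = [value for value in row for _ in range(factor)]
        enlarged.extend(list(wide) for _ in range(factor))
    return enlarged


grid = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
region = crop(grid, left=1, top=1, right=3, bottom=3)
zoomed = magnify(region, 2)
```

Here `region` is the lower-right 2×2 block of `grid`, and `zoomed` is that block enlarged to 4×4, which mimics zooming into a small detail before issuing the next search query.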

License

This dataset is released for research purposes. Please refer to the license file for details.
