
Conversation

W-Geong commented Jan 16, 2026


[Feat] Add support for OmniDocBench dataset evaluation

📝 Description

This PR integrates the OmniDocBench dataset into the GAGE evaluation framework. It enables the evaluation of multimodal models on document understanding tasks including text blocks, formulas, tables, and reading order.

🛠️ Implementation Details

  1. Configuration: Added config/custom/omnidocbench_qwen_mllm.yaml, using litellm as the backend.
  2. Preprocessor: Implemented the OmniDocPreprocessor class to convert the raw OmniDocBench data into GAGE-compatible samples.
  3. Metrics:
    • Added the OmniDocBenchMetric class.
    • Implemented OmniDocLazyCalcAggregator to handle the more involved evaluation logic. It uses the OMNIDOCBENCH_HOME environment variable to locate the official OmniDocBench repository, invokes its evaluation scripts, and retrieves the aggregated metrics.
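
A rough sketch of the aggregator's external-call pattern (only OMNIDOCBENCH_HOME comes from this PR; the entry script and result-file names below are taken from the OmniDocBench docs and may differ from the actual implementation):

import json
import os
import subprocess
from pathlib import Path


def run_official_eval(config_path: str, result_path: str) -> dict:
    """Run the official OmniDocBench evaluation inside the repository pointed to
    by OMNIDOCBENCH_HOME, then read back the metrics file it writes.
    Script and result-file names are illustrative.
    """
    home = Path(os.environ["OMNIDOCBENCH_HOME"])
    subprocess.run(
        ["python", "pdf_validation.py", "--config", config_path],  # entry script per the OmniDocBench docs
        cwd=home,
        check=True,
    )
    with open(result_path, encoding="utf-8") as fh:
        return json.load(fh)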

⚙️ Environment Setup

To run this benchmark, ensure the GAGE environment is installed. Additionally, you need to set up the OmniDocBench dependencies:

  1. Follow the instructions at OmniDocBench GitHub.
  2. Note: A LaTeX compiler must be installed to calculate the CDM (Character Detection Matching) metric for mathematical formulas.

🚀 How to Run

Please configure the environment variables and run the evaluation script as follows:

# Set your OpenAI API Key (or backend key)
export OPENAI_API_KEY="sk-placeholder"

# Set the path to the cloned OmniDocBench repository
export OMNIDOCBENCH_HOME="/path/to/your/OmniDocBench-main"

# Run the evaluation
python run.py \
 --config config/custom/omnidocbench_qwen_mllm.yaml \
 --output-dir runs \
 --run-id Qwen3-Omni-30B-A3B-Instruct \
 --concurrency 64 \
 --max-samples 99999999

📊 Evaluation Results

The evaluation was performed on Qwen3-VL-30B-A3B-Instruct, Qwen2.5-Omni-7B, and Qwen3-Omni-30B-A3B-Instruct.

Note:

  • The raw metrics are stored in summary.json under ["metrics"]["metadata"]["overall"].
  • Per-page metrics are available in ["tasks"]["metrics"].
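
A small sketch for pulling these values back out programmatically (the run-directory path is illustrative and follows the --output-dir / --run-id used above):

import json

# Illustrative path: <output-dir>/<run-id>/summary.json from the command above.
with open("runs/Qwen3-Omni-30B-A3B-Instruct/summary.json", encoding="utf-8") as fh:
    summary = json.load(fh)

overall = summary["metrics"]["metadata"]["overall"]  # aggregated (raw) metrics
per_page = summary["tasks"]["metrics"]               # per-page metrics
print(json.dumps(overall, indent=2))
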
| Model Name | Text Block (Edit Dist) ↓ | Formula (CDM) ↑ | Table (TEDs) ↑ | Table Struct (TEDs) ↑ | Reading Order (Edit Dist) ↓ | Overall |
|---|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | 0.064 | 99.949 | 76.584 | 80.959 | 0.083 | 90.04 |
| Qwen2.5-Omni-7B | 0.141 | 99.973 | 73.250 | 78.539 | 0.152 | 86.37 |
| Qwen3-Omni-30B-A3B-Instruct | 0.085 | 99.974 | 79.983 | 84.543 | 0.089 | 90.49 |

(↓ indicates lower is better, ↑ indicates higher is better)

💡 Key Observations

  • Performance vs. Scale: As expected, models with more parameters generally yield better results.
  • Model Comparison: Qwen3-VL-30B and Qwen3-Omni-30B show comparable performance across most metrics.
  • Inference Speed: Qwen3-Omni-30B has slower inference than the VL version because its audio-specific operators are not yet supported by high-performance backends such as SGLang.

Collaborator

WNQzhu left a comment


Thank you for your interest in our project and your code contribution. However, the code currently has the following issue:
Our sample format adheres to the new protocol (sample.py). Could you please modify the to_sample method to return samples in the new format?
Could you also send the program's running results to my email address? (zhuwnq@outlook.com)
Thank you.

Author

W-Geong commented Jan 19, 2026

Hi Mr. Zhu,

Thank you for your feedback.

I have sent the running results to your email (zhuwnq@outlook.com) as requested.

Regarding the code implementation, I noticed that _PreprocessorAdapter automatically invokes the legacy method _preprocessor.transform. I am currently unable to locate the entry point where the engine calls the to_sample method.

Could you please provide a specific data sample that follows the new format (e.g., from DocVQA or MMMU)? Alternatively, could you point out where the engine specifically calls to_sample()? This would help me align the implementation with the new protocol.

Best regards,

Wente Young

Collaborator

WNQzhu commented Jan 19, 2026

Hi Wentao,
Thank you again for your contribution!
Each raw data entry will be processed by a PreProcessor, which implements a to_sample() method to convert the raw input into our standardized sample format.
The sample format is defined here:
https://github.com/HiThink-Research/GAGE/blob/main/src/gage_eval/assets/datasets/sample.py

You can find two example preprocessors that follow this protocol:

(1) text:
https://github.com/HiThink-Research/GAGE/blob/main/src/gage_eval/assets/datasets/preprocessors/gpqa/gpqa_diamond_preprocessor.py

(2) multimodal:
https://github.com/HiThink-Research/GAGE/blob/main/src/gage_eval/assets/datasets/preprocessors/mathvista/mathvista_chat_preprocessor.py
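
For orientation, here is a minimal sketch of what such a to_sample() could look like for OmniDocBench (the Message constructor call and the raw-record field names are placeholders; the examples above show the exact API):

from gage_eval.assets.datasets.sample import Message, Sample  # Message signature assumed below


class OmniDocPreprocessor:
    """Illustrative sketch: convert one raw OmniDocBench record into a Sample."""

    def to_sample(self, record: dict) -> Sample:
        # Field names in `record` (instruction, gt_markdown, image_path) are placeholders
        # for whatever the raw OmniDocBench entry actually contains.
        messages = [
            Message(role="user", content=record["instruction"]),  # assumed constructor
        ]
        return Sample(
            schema_version="1.0",  # assumed value
            id=str(record["id"]),
            messages=messages,
            references=[record["gt_markdown"]],
            metadata={"image_path": record["image_path"]},
        )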

Additionally, we encourage you to include unit tests to verify the correctness of your preprocessor and any associated metrics. Example test cases can be found in the tests/ directory.
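
Something along these lines (module paths and record fields are hypothetical) is usually enough to pin down the conversion:

# tests/test_omnidoc_preprocessor.py  (path and record fields are hypothetical)
from gage_eval.assets.datasets.sample import Sample
from gage_eval.assets.datasets.preprocessors.omnidocbench.omnidoc_preprocessor import (  # hypothetical module path
    OmniDocPreprocessor,
)


def test_to_sample_returns_valid_sample():
    record = {"id": 1, "instruction": "Parse this page.", "gt_markdown": "# Title", "image_path": "page_0001.jpg"}
    sample = OmniDocPreprocessor().to_sample(record)
    assert isinstance(sample, Sample)
    assert sample.id == "1"
    assert sample.messages, "expected at least one message"
    assert sample.references == ["# Title"]
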
Looking forward to your updates!

Author

W-Geong commented Jan 20, 2026

Thank you for the detailed feedback! I have refactored the implementation to use the PreProcessor class and the to_sample() method as requested. Additionally, I added OmniDocPreprocessorTests to verify the correctness of the data conversion logic. Let me know if further changes are needed.

Collaborator

WNQzhu commented Jan 21, 2026

I’m sorry, but the preprocessor.to_sample code you provided still does not conform to the updated Sample protocol I mentioned previously, which is defined in this file:
https://github.com/HiThink-Research/GAGE/blob/main/src/gage_eval/assets/datasets/sample.py

@dataclass
class Sample:
    schema_version: str
    id: str
    messages: List[Message]
    task_type: Optional[Any] = None
    options: Optional[List[str]] = None
    references: List[Any] = field(default_factory=list)
    label: Optional[str] = None
    few_shot_examples: Optional[List[Any]] = None
    golden_trajectories: Optional[List[Any]] = None
    sandbox: Optional[Dict[str, Any]] = None    
    metadata: Optional[Dict[str, Any]] = None
    data_tag: Optional[Dict[str, Any]] = None
    raw_assets: Optional[Dict[str, Any]] = None
    tools: Optional[List[Any]] = None
    tool_choice: Optional[Union[str, Dict[str, Any]]] = None
    sampling_params: Optional[Dict[str, Any]] = None
    generation_params: Optional[Dict[str, Any]] = None
    eval_config: Optional[Dict[str, Any]] = None
    unconditioned_input: Optional[Union[str, list[Any]]] = None
    predict_result: List[PredictResult] = field(default_factory=list)
    eval_result: Dict[str, Any] = field(default_factory=dict)

Please ensure that your to_sample method returns a valid Sample object that matches this structure.
Please read the example code carefully.
Thank you!
