
Conversation

W-Geong commented Jan 16, 2026


[Feat] Add support for OmniDocBench dataset evaluation

📝 Description

This PR integrates the OmniDocBench dataset into the GAGE evaluation framework. It enables the evaluation of multimodal models on document understanding tasks including text blocks, formulas, tables, and reading order.

🛠️ Implementation Details

  1. Configuration: Added config/custom/omnidocbench_qwen_mllm.yaml, using litellm as the backend.
  2. Preprocessor: Implemented the OmniDocPreprocessor class to convert the raw OmniDocBench data into GAGE-compatible samples.
  3. Metrics:
    • Added the OmniDocBenchMetric class.
    • Implemented OmniDocLazyCalcAggregator to handle the more involved evaluation logic. It uses the OMNIDOCBENCH_HOME environment variable to locate the official OmniDocBench repository, invokes its evaluation scripts, and retrieves the aggregated metrics.
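
A rough sketch of the aggregator's external-call pattern (only OMNIDOCBENCH_HOME comes from this PR; the entry script and result-file names below are taken from the OmniDocBench docs and may differ from the actual implementation):

import json
import os
import subprocess
from pathlib import Path


def run_official_eval(config_path: str, result_path: str) -> dict:
    """Run the official OmniDocBench evaluation inside the repository pointed to
    by OMNIDOCBENCH_HOME, then read back the metrics file it writes.
    Script and result-file names are illustrative.
    """
    home = Path(os.environ["OMNIDOCBENCH_HOME"])
    subprocess.run(
        ["python", "pdf_validation.py", "--config", config_path],  # entry script per the OmniDocBench docs
        cwd=home,
        check=True,
    )
    with open(result_path, encoding="utf-8") as fh:
        return json.load(fh)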

⚙️ Environment Setup

To run this benchmark, ensure the GAGE environment is installed. Additionally, you need to set up the OmniDocBench dependencies:

  1. Follow the instructions at OmniDocBench GitHub.
  2. Note: A LaTeX compiler must be installed to calculate the CDM (Character Detection Matching) metric for mathematical formulas.

🚀 How to Run

Please configure the environment variables and run the evaluation script as follows:

# Set your OpenAI API Key (or backend key)
export OPENAI_API_KEY="sk-placeholder"

# Set the path to the cloned OmniDocBench repository
export OMNIDOCBENCH_HOME="/path/to/your/OmniDocBench-main"

# Run the evaluation
python run.py \
 --config config/custom/omnidocbench_qwen_mllm.yaml \
 --output-dir runs \
 --run-id Qwen3-Omni-30B-A3B-Instruct \
 --concurrency 64 \
 --max-samples 99999999

📊 Evaluation Results

The evaluation was performed on Qwen3-VL-30B-A3B-Instruct, Qwen2.5-Omni-7B, and Qwen3-Omni-30B-A3B-Instruct.

Note:

  • The raw metrics are stored in summary.json under ["metrics"]["metadata"]["overall"].
  • Per-page metrics are available in ["tasks"]["metrics"].
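
A small sketch for pulling these values back out programmatically (the run-directory path is illustrative and follows the --output-dir / --run-id used above):

import json

# Illustrative path: <output-dir>/<run-id>/summary.json from the command above.
with open("runs/Qwen3-Omni-30B-A3B-Instruct/summary.json", encoding="utf-8") as fh:
    summary = json.load(fh)

overall = summary["metrics"]["metadata"]["overall"]  # aggregated (raw) metrics
per_page = summary["tasks"]["metrics"]               # per-page metrics
print(json.dumps(overall, indent=2))
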
| Model Name | Text Block (Edit Dist) ↓ | Formula (CDM) ↑ | Table (TEDs) ↑ | Table Struct (TEDs) ↑ | Reading Order (Edit Dist) ↓ | Overall |
|---|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | 0.064 | 99.949 | 76.584 | 80.959 | 0.083 | 90.04 |
| Qwen2.5-Omni-7B | 0.141 | 99.973 | 73.250 | 78.539 | 0.152 | 86.37 |
| Qwen3-Omni-30B-A3B-Instruct | 0.085 | 99.974 | 79.983 | 84.543 | 0.089 | 90.49 |

(↓ indicates lower is better, ↑ indicates higher is better)

💡 Key Observations

  • Performance vs. Scale: As expected, models with more parameters generally yield better results.
  • Model Comparison: Qwen3-VL-30B and Qwen3-Omni-30B show comparable performance across most metrics.
  • Inference Speed: Qwen3-Omni-30B has slower inference than the VL version because its audio-specific operators are not yet supported by high-performance backends such as SGLang.

Collaborator

WNQzhu left a comment


Thank you for your interest in our project and your code contribution. However, the code currently has the following issue:
Our sample format adheres to the new protocol (sample.py). Could you please modify the to_sample method to return samples in the new format?
Could you also send the program's running results to my email address? (zhuwnq@outlook.com)
Thank you.

Author

W-Geong commented Jan 19, 2026

Hi Mr. Zhu,

Thank you for your feedback.

I have sent the running results to your email (zhuwnq@outlook.com) as requested.

Regarding the code implementation, I noticed that _PreprocessorAdapter automatically invokes the legacy method _preprocessor.transform. I am currently unable to locate the entry point where the engine calls the to_sample method.

Could you please provide a specific data sample that follows the new format (e.g., from DocVQA or MMMU)? Alternatively, could you point out where the engine specifically calls to_sample()? This would help me align the implementation with the new protocol.

Best regards,

Wente Young

Collaborator

WNQzhu commented Jan 19, 2026

Hi Wentao,
Thank you again for your contribution!
Each raw data entry will be processed by a PreProcessor, which implements a to_sample() method to convert the raw input into our standardized sample format.
The sample format is defined here:
https://github.com/HiThink-Research/GAGE/blob/main/src/gage_eval/assets/datasets/sample.py

You can find two example preprocessors that follow this protocol:

(1) text:
https://github.com/HiThink-Research/GAGE/blob/main/src/gage_eval/assets/datasets/preprocessors/gpqa/gpqa_diamond_preprocessor.py

(2) multimodal:
https://github.com/HiThink-Research/GAGE/blob/main/src/gage_eval/assets/datasets/preprocessors/mathvista/mathvista_chat_preprocessor.py
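
For orientation, here is a minimal sketch of what such a to_sample() could look like for OmniDocBench (the Message constructor call and the raw-record field names are placeholders; the examples above show the exact API):

from gage_eval.assets.datasets.sample import Message, Sample  # Message signature assumed below


class OmniDocPreprocessor:
    """Illustrative sketch: convert one raw OmniDocBench record into a Sample."""

    def to_sample(self, record: dict) -> Sample:
        # Field names in `record` (instruction, gt_markdown, image_path) are placeholders
        # for whatever the raw OmniDocBench entry actually contains.
        messages = [
            Message(role="user", content=record["instruction"]),  # assumed constructor
        ]
        return Sample(
            schema_version="1.0",  # assumed value
            id=str(record["id"]),
            messages=messages,
            references=[record["gt_markdown"]],
            metadata={"image_path": record["image_path"]},
        )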

Additionally, we encourage you to include unit tests to verify the correctness of your preprocessor and any associated metrics. Example test cases can be found in the tests/ directory.
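
Something along these lines (module paths and record fields are hypothetical) is usually enough to pin down the conversion:

# tests/test_omnidoc_preprocessor.py  (path and record fields are hypothetical)
from gage_eval.assets.datasets.sample import Sample
from gage_eval.assets.datasets.preprocessors.omnidocbench.omnidoc_preprocessor import (  # hypothetical module path
    OmniDocPreprocessor,
)


def test_to_sample_returns_valid_sample():
    record = {"id": 1, "instruction": "Parse this page.", "gt_markdown": "# Title", "image_path": "page_0001.jpg"}
    sample = OmniDocPreprocessor().to_sample(record)
    assert isinstance(sample, Sample)
    assert sample.id == "1"
    assert sample.messages, "expected at least one message"
    assert sample.references == ["# Title"]
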
Looking forward to your updates!

Author

W-Geong commented Jan 20, 2026

Thank you for the detailed feedback! I have refactored the implementation to use the PreProcessor class and the to_sample() method as requested. Additionally, I added OmniDocPreprocessorTests to verify the correctness of the data conversion logic. Let me know if further changes are needed.

Collaborator

WNQzhu commented Jan 21, 2026

I’m sorry, but the preprocessor.to_sample code you provided still does not conform to the updated Sample protocol I mentioned previously, which is defined in this file:
https://github.com/HiThink-Research/GAGE/blob/main/src/gage_eval/assets/datasets/sample.py

@dataclass
class Sample:
    schema_version: str
    id: str
    messages: List[Message]
    task_type: Optional[Any] = None
    options: Optional[List[str]] = None
    references: List[Any] = field(default_factory=list)
    label: Optional[str] = None
    few_shot_examples: Optional[List[Any]] = None
    golden_trajectories: Optional[List[Any]] = None
    sandbox: Optional[Dict[str, Any]] = None    
    metadata: Optional[Dict[str, Any]] = None
    data_tag: Optional[Dict[str, Any]] = None
    raw_assets: Optional[Dict[str, Any]] = None
    tools: Optional[List[Any]] = None
    tool_choice: Optional[Union[str, Dict[str, Any]]] = None
    sampling_params: Optional[Dict[str, Any]] = None
    generation_params: Optional[Dict[str, Any]] = None
    eval_config: Optional[Dict[str, Any]] = None
    unconditioned_input: Optional[Union[str, list[Any]]] = None
    predict_result: List[PredictResult] = field(default_factory=list)
    eval_result: Dict[str, Any] = field(default_factory=dict)

Please ensure that your to_sample method returns a valid Sample object that matches this structure.
Please read the example code carefully.
Thank you!
