Feat: Add support for OmniDocBench dataset evaluation #48
Conversation
WNQzhu left a comment:
Thank you for your interest in our project and your code contribution. However, the code currently has the following issue:
Our sample format adheres to the new protocol (sample.py). Could you please modify the to_sample method to return samples in the new format?
Could you also send the program's running results to my email address? (zhuwnq@outlook.com)
Thank you.
Hi Mr. Zhu,

Thank you for your feedback. I have sent the running results to your email (zhuwnq@outlook.com) as requested.

Regarding the code implementation, I noticed that `_PreprocessorAdapter` automatically invokes the legacy method `_preprocessor.transform`. I am currently unable to locate the entry point where the engine calls the `to_sample` method. Could you please provide a specific data sample that follows the new format (e.g., from DocVQA or MMMU)? Alternatively, could you point out where the engine specifically calls `to_sample()`? This would help me align the implementation with the new protocol.

Best regards,
Wente Young
Hi Wentao,

You can find two example preprocessors that follow this protocol: (2) multimodal:

Additionally, we encourage you to include unit tests to verify the correctness of your preprocessor and any associated metrics. Example test cases can be found in the `tests/` directory.
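As a rough illustration of what such a unit test could look like, here is a minimal pytest-style sketch. The import path, the `OmniDocPreprocessor` constructor, the `to_sample(record, idx)` signature, and the fake record keys are all assumptions for illustration, not the project's actual test layout.

```python
# Hypothetical unit-test sketch for an OmniDocBench preprocessor.
# The import path, constructor, and record keys below are assumptions;
# adapt them to the real conventions in the tests/ directory.
from gage.preprocessors.omnidocbench import OmniDocPreprocessor  # assumed module path


def test_to_sample_returns_new_protocol_fields():
    # Fake OmniDocBench-style record; real field names may differ.
    record = {"image_path": "page_001.jpg", "gt_markdown": "# Title\n\nBody text"}

    sample = OmniDocPreprocessor().to_sample(record, idx=0)

    # The new protocol expects these core fields to be populated.
    assert sample.schema_version
    assert sample.id
    assert sample.messages, "prompt messages should not be empty"
    assert sample.references, "ground-truth reference should be attached"
```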
Thank you for the detailed feedback! I have refactored the implementation to use the new `Sample` protocol.
I’m sorry, but the `preprocessor.to_sample` code you provided still does not conform to the updated `Sample` protocol that I mentioned previously in this file:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

# Message and PredictResult are defined elsewhere in sample.py.


@dataclass
class Sample:
    schema_version: str
    id: str
    messages: List[Message]
    task_type: Optional[Any] = None
    options: Optional[List[str]] = None
    references: List[Any] = field(default_factory=list)
    label: Optional[str] = None
    few_shot_examples: Optional[List[Any]] = None
    golden_trajectories: Optional[List[Any]] = None
    sandbox: Optional[Dict[str, Any]] = None
    metadata: Optional[Dict[str, Any]] = None
    data_tag: Optional[Dict[str, Any]] = None
    raw_assets: Optional[Dict[str, Any]] = None
    tools: Optional[List[Any]] = None
    tool_choice: Optional[Union[str, Dict[str, Any]]] = None
    sampling_params: Optional[Dict[str, Any]] = None
    generation_params: Optional[Dict[str, Any]] = None
    eval_config: Optional[Dict[str, Any]] = None
    unconditioned_input: Optional[Union[str, list[Any]]] = None
    predict_result: List[PredictResult] = field(default_factory=list)
    eval_result: Dict[str, Any] = field(default_factory=dict)
```

Please ensure that your `to_sample` method returns a valid `Sample` object that matches this structure.
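For illustration only, here is a minimal sketch of what a `to_sample` return value in this shape could look like. The record keys (`image_path`, `gt_markdown`), the `Message(role=..., content=...)` constructor, and the `schema_version`/`task_type` values are assumptions for the sketch, not the project's confirmed API.

```python
# Illustrative sketch only: build a new-protocol Sample from one raw record.
# Record keys, the Message constructor, and the literal values are assumptions.
def to_sample(self, record: dict, idx: int) -> Sample:
    messages = [
        Message(
            role="user",
            content=[
                {"type": "image", "image": record["image_path"]},  # hypothetical key
                {"type": "text", "text": "Convert this document page to Markdown."},
            ],
        )
    ]
    return Sample(
        schema_version="1.0",                # assumed version string
        id=f"omnidocbench-{idx}",
        messages=messages,
        task_type="doc_parsing",             # hypothetical task tag
        references=[record["gt_markdown"]],  # hypothetical ground-truth key
        metadata={"dataset": "OmniDocBench"},
    )
```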
[Feat] Add support for OmniDocBench dataset evaluation
📝 Description
This PR integrates the OmniDocBench dataset into the GAGE evaluation framework. It enables the evaluation of multimodal models on document understanding tasks including text blocks, formulas, tables, and reading order.
🛠️ Implementation Details
- `config/custom/omnidocbench_qwen_mllm.yaml`, using `litellm` as the backend.
- `OmniDocPreprocessor` class to convert custom data formats into GAGE-compatible formats.
- `OmniDocBenchMetric` class.
- `OmniDocLazyCalcAggregator` to handle the complex evaluation logic. It utilizes the external `OMNIDOCBENCH_HOME` environment variable to invoke the official OmniDocBench evaluation scripts and retrieve aggregated metrics.

⚙️ Environment Setup
To run this benchmark, ensure the GAGE environment is installed. Additionally, you need to set up the OmniDocBench dependencies:
- CDM (Character Detection Matching) metric for mathematical formulas.

🚀 How to Run
Please configure the environment variables and run the evaluation script as follows:
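Whatever the exact GAGE launch command, `OMNIDOCBENCH_HOME` must point at a local checkout of the OmniDocBench repository. The sketch below shows how the aggregator can resolve that variable and shell out to the official evaluation scripts; the entry-script name and its arguments are placeholders, not the benchmark's confirmed CLI.

```python
# Sketch of how OmniDocLazyCalcAggregator can locate and call the official
# OmniDocBench evaluation scripts via OMNIDOCBENCH_HOME. The entry-script name
# and the --config argument are placeholders; check the OmniDocBench repository
# for the real evaluation entry point and options.
import os
import subprocess
import sys
from pathlib import Path


def run_official_eval() -> None:
    home = Path(os.environ["OMNIDOCBENCH_HOME"])  # local OmniDocBench checkout
    entry = home / "pdf_validation.py"            # hypothetical entry script
    config = home / "configs" / "end2end.yaml"    # placeholder config path
    subprocess.run(
        [sys.executable, str(entry), "--config", str(config)],
        cwd=home,
        check=True,
    )
```

In practice the aggregator would also point that config at the model's prediction files and parse the aggregated metrics the script writes back.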
📊 Evaluation Results
The evaluation was performed on Qwen3-VL-30B, Qwen2.5-Omni, and Qwen3-Omni.
[Results table omitted; metric columns: (Edit Dist) ↓, (CDM) ↑, (TEDs) ↑, (TEDs) ↑, (Edit Dist) ↓]
(↓ indicates lower is better, ↑ indicates higher is better)
💡 Key Observations
- `Qwen3-VL-30B` and `Qwen3-Omni-30B` show comparable performance across most metrics.
- `Qwen3-Omni-30B` has slower inference speed compared to the VL version; this is due to special audio operators, which currently do not support high-performance backends like SGLang.