Add configurable Azure Document Intelligence analysis features via CLI and kwargs #1561

Open
kei-yamazaki wants to merge 1 commit into microsoft:main from kei-yamazaki:doc-intel-features

@kei-yamazaki

Summary

This PR makes Azure Document Intelligence analysis features configurable while preserving the existing default behavior.

What changed

  • Added CLI support for custom DI analysis features:
    • --docintel-feature (repeatable and comma-separated)
  • Added Python kwargs support:
    • MarkItDown(..., docintel_features=[...])
    • per-call override via convert(..., docintel_features=[...])
  • Switched feature handling to SDK constants (DocumentAnalysisFeature) for type safety
  • Added normalization for common feature input forms (e.g. FORMULAS, ocr_high_resolution, DocumentAnalysisFeature.STYLE_FONT)
  • Added validation for invalid feature names (raises ValueError)
  • Kept default behavior when features are not specified:
    • OCR-capable formats: FORMULAS, OCR_HIGH_RESOLUTION, STYLE_FONT
    • .docx, .pptx, .xlsx, .html: no analysis features
  • Updated README:
    • documented CLI feature option
    • documented default feature behavior when not specified
    • removed Python feature-customization sample section per latest doc direction
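
The normalization, validation, and default-selection behavior listed above could look roughly like the sketch below. It is not the code from this PR: `Feature` is a hypothetical stand-in for the SDK's `DocumentAnalysisFeature` (only the three members named above, with assumed values), and `normalize_features`, `resolve_features`, and `parse_cli` are illustrative names. The sketch also assumes that a per-call value takes precedence over the constructor value, that duplicates are dropped, and that any extension outside the "no features" list counts as OCR-capable.

```python
import argparse
from enum import Enum
from typing import Iterable, List, Optional, Union


class Feature(Enum):
    """Hypothetical stand-in for the SDK's DocumentAnalysisFeature (values assumed)."""
    FORMULAS = "formulas"
    OCR_HIGH_RESOLUTION = "ocrHighResolution"
    STYLE_FONT = "styleFont"


# Extensions the PR lists as getting no analysis features by default.
NO_FEATURE_EXTENSIONS = {".docx", ".pptx", ".xlsx", ".html"}
DEFAULT_OCR_FEATURES = [Feature.FORMULAS, Feature.OCR_HIGH_RESOLUTION, Feature.STYLE_FONT]


def normalize_features(raw: Iterable[Union[str, Feature]]) -> List[Feature]:
    """Map user-supplied names to enum members; raise ValueError on unknown names."""
    result: List[Feature] = []
    for item in raw:
        # Enum members pass through by name; strings may hold comma-separated names.
        tokens = [item.name] if isinstance(item, Feature) else str(item).split(",")
        for token in tokens:
            text = token.strip()
            if not text:
                continue
            # Accept "DocumentAnalysisFeature.STYLE_FONT" and lowercase spellings.
            name = text.rsplit(".", 1)[-1].upper()
            if name not in Feature.__members__:
                valid = ", ".join(Feature.__members__)
                raise ValueError(f"Invalid feature {text!r}; expected one of: {valid}")
            feature = Feature[name]
            if feature not in result:  # drop duplicates, keep first-seen order
                result.append(feature)
    return result


def resolve_features(
    extension: str,
    per_call: Optional[Iterable[Union[str, Feature]]] = None,
    configured: Optional[Iterable[Union[str, Feature]]] = None,
) -> List[Feature]:
    """Use per-call features, then constructor features, then the defaults."""
    for choice in (per_call, configured):
        if choice is not None:
            return normalize_features(choice)
    if extension.lower() in NO_FEATURE_EXTENSIONS:
        return []
    return list(DEFAULT_OCR_FEATURES)


def parse_cli(argv: List[str]) -> Optional[List[str]]:
    """Collect repeated --docintel-feature values; None when the flag is absent."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--docintel-feature", action="append", dest="docintel_features")
    return parser.parse_args(argv).docintel_features
```

With this sketch, `parse_cli(["--docintel-feature", "formulas,style_font"])` yields raw strings that `normalize_features` turns into enum members, while an absent flag leaves the value as `None` so that `resolve_features` falls back to the documented defaults.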

Why

  • Users need to control DI analysis features depending on accuracy/cost/performance requirements.
  • The previous implementation hard-coded the feature set, so users could not change it.
  • Using SDK constants improves safety and reduces string-based errors.
