Add parallel processing to OCR text extraction of full documents by ntodd · Pull Request #124 · documentcloud/docsplit

ntodd · 2014-12-18T22:51:55Z

Leverage the GNU Parallel tool to OCR multiple pages in parallel. If Parallel is installed, a full-document extraction generates an image for each page and then spawns a tesseract process for each available core. If Parallel is not installed, or only a subset of pages is requested, the old behavior is used. This speeds up OCR processing significantly on multi-core machines.

With a bit more work, this could be leveraged by the other OCR code paths.

Use GNU Parallel if installed to parallelize tesseract OCR on full document text extraction. If Parallel is not installed, use previous behavior.
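
The dispatch logic described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code from this pull request: the function names `plan_ocr_commands` and `run_ocr`, the `.tif` file names, and the `eng` language default are all assumptions made for the example. It relies only on GNU Parallel's `{}` / `{.}` replacement strings and `:::` argument separator, and on tesseract's `tesseract IMAGE OUTPUTBASE -l LANG` invocation.

```python
import shutil
import subprocess
from pathlib import Path


def plan_ocr_commands(page_images, pages=None, parallel_path=None, language="eng"):
    """Return the argv lists needed to OCR the given page images.

    page_images   -- image paths to OCR (hypothetical names, one per page)
    pages         -- explicit page subset, or None for a full document
    parallel_path -- path to the GNU Parallel binary, or None if not installed
    """
    if parallel_path and pages is None:
        # Full document and Parallel available: one invocation fans out the work.
        # {} is the input image; {.} is the same path without its extension,
        # which tesseract uses as the output base name.
        # By default GNU Parallel runs one job per CPU core.
        return [[parallel_path, "tesseract", "{}", "{.}", "-l", language,
                 ":::", *page_images]]

    # Fallback (no Parallel, or a page subset): one tesseract call per image,
    # to be run sequentially.
    return [["tesseract", image, str(Path(image).with_suffix("")), "-l", language]
            for image in page_images]


def run_ocr(page_images, pages=None):
    """Detect GNU Parallel on PATH and execute the planned commands."""
    commands = plan_ocr_commands(page_images, pages, shutil.which("parallel"))
    for argv in commands:
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution makes the fallback rule easy to check without having tesseract or Parallel installed; `run_ocr(["page_1.tif", "page_2.tif"])` would then execute whichever plan applies on the current machine.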

deuxshaish · 2014-12-21T11:08:12Z

I like this a lot. Will test and observe; thanks for the commit.

pickhardt · 2023-05-06T01:52:27Z

This is a great idea.

Nate Todd added 2 commits December 18, 2014 17:20

Add parallel processing to OCR text extraction

1f1ec93

Use GNU Parallel if installed to parallelize tesseract OCR on full document text extraction. If Parallel is not installed, use previous behavior.

Add Parallel installation to documentation

7427d08

ntodd wants to merge 2 commits into documentcloud:master from ntodd:master
