Add parallel processing to OCR text extraction of full documents #124

Open
ntodd wants to merge 2 commits into documentcloud:master from ntodd:master

Conversation

@ntodd ntodd commented Dec 18, 2014

Leverage GNU Parallel to OCR multiple pages at once. If Parallel is installed, a full-document extraction generates an image for each page and then runs as many tesseract processes at a time as there are available cores. If Parallel is not installed, or only a subset of pages is requested, the previous behavior is used. This speeds up OCR significantly on multi-core machines.

With a bit more work, this could be leveraged by the other OCR code paths.
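
As a rough illustration of the fallback logic described above, here is a minimal Python sketch. It is not the code in this pull request: the helper names (`parallel_available`, `should_use_parallel`, `build_ocr_commands`) and the page image file names are hypothetical, and the sketch only builds the command lines without running them. It assumes the usual `tesseract <image> <output base>` invocation and GNU Parallel's `{}` / `{.}` replacement strings.

```python
import os
import shutil
from pathlib import Path


def parallel_available() -> bool:
    """Return True if a `parallel` executable is found on PATH."""
    return shutil.which("parallel") is not None


def should_use_parallel(page_subset=None) -> bool:
    """Use GNU Parallel only for full-document extraction when it is installed.

    `page_subset` is a hypothetical argument: None means "all pages".
    """
    return page_subset is None and parallel_available()


def build_ocr_commands(page_images, use_parallel, jobs=None):
    """Return a list of argv lists that would OCR the given page images.

    Hypothetical helper: it only builds the commands and does not run them.
    """
    pages = [str(Path(p)) for p in page_images]
    if not pages:
        return []
    if use_parallel:
        jobs = jobs or os.cpu_count() or 1
        # One GNU Parallel invocation: {} is the input file and {.} is the
        # same path without its extension, used as tesseract's output base.
        return [["parallel", "-j", str(jobs), "tesseract", "{}", "{.}", ":::"] + pages]
    # Fallback: one tesseract command per page, to be run one after another.
    return [["tesseract", p, str(Path(p).with_suffix(""))] for p in pages]
```

Each returned argv list could then be passed to something like `subprocess.run(cmd, check=True)`. Keeping command construction separate from execution makes the choice between the two paths easy to inspect on a machine without tesseract or Parallel installed.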

Nate Todd added 2 commits December 18, 2014 17:20
Use GNU Parallel if installed to parallelize tesseract OCR on full document text extraction.  If Parallel is not installed, use previous behavior.
@deuxshaish

I like this a lot. Will test and observe. Thanks for the commit.

@pickhardt

This is a great idea.
