OCRmyPDF
Closed due to inactivity · Public

Authored By

	livingsilver94
	Sep 14 2018, 11:01 PM

Description

Name: OCRmyPDF
Homepage: https://github.com/jbarlow83/OCRmyPDF
Why should this be included in the repository? We already have tesseract, but it doesn't handle PDF files directly: you must first run imagemagick to convert the pages, and so on. OCRmyPDF automates this and adds some extra features. Please read the project README for details.
Is it Open Source: yes
Who and how many users do you anticipate will use this software? I don't know HOW many users, but students and old book collectors (thus with no digital copies) will benefit from this for sure.
Link to source tarball/zip file: https://github.com/jbarlow83/OCRmyPDF/archive/v7.0.5.tar.gz

Related Objects

Mentioned Here: T8246: pybind11

Event Timeline

livingsilver94 created this task. Sep 14 2018, 11:01 PM

livingsilver94 updated the task description. Sep 15 2018, 9:27 AM

JoshStrobl triaged this task as Normal priority. Oct 16 2018, 10:18 PM

JoshStrobl moved this task from Backlog to Accepted For Inclusion on the Package Requests board.

Herald added a project: Needs Maintainer. Oct 16 2018, 10:18 PM

livingsilver94 claimed this task. Oct 16 2018, 10:46 PM

It requires ruffus as rundep, which looks dead. What's your opinion on this?

Not to include it.

Apologies for bringing this up again, but I would like to point out two things:

It requires ruffus as rundep, which looks dead.

As of v9.0.0, this dependency has been removed (see the release notes).

Why should this be included in the repository? We already have tesseract, but it doesn't directly handle PDF files. You must first run imagemagick to convert the pages, etc etc. OCRmyPDF just automates this, plus it has some additional features.

All of OCRmyPDF's required dependencies (and 1 of its 2 *optional* dependencies) are already available in the repositories. It's an actively developed program, small in size, that gives quick, high-quality results. The only somewhat practical open-source alternative I've come across is OCRFeeder, a GUI application distributed as a flatpak. It delivers far worse results, is slow, and its flatpak hasn't been updated in almost 5 years. In addition, it requires a ~2.5 GB download and 5 GB of disk space once installed.

Thus, I'd like to claim that OCRmyPDF is the better solution and much more in accordance with what I understand to be Solus' goals!

If ruffus was the reason for not including it, that obstacle is gone now.
If general maintenance of the package is the problem, I completely understand. But if that's the reason for not including it, please clarify.

Thank you for your time!

Running this application requires several dependencies that are missing from the repository:

img2pdf
jbig2enc
libimagequant
pngquant
pybind11 (also requested in T8246)
python-pdfminer.six
python-pikepdf
python-setuptools-scm-git-archive

Other:

python-pillow needs to be updated.

Edit: Added full list of new packages.

DataDrake changed the task status from Wontfix to Frozen. Feb 21 2022, 6:16 AM
