
OCRmyPDF
Closed due to inactivity, Public

Description

  • Name: OCRmyPDF
  • Homepage: https://github.com/jbarlow83/OCRmyPDF
  • Why should this be included in the repository? We already have tesseract, but it doesn't directly handle PDF files. You must first run imagemagick to convert the pages, etc. OCRmyPDF just automates this, plus it has some additional features. Please read the project README for details.
  • Is it Open Source: yes
  • Who and how many users do you anticipate will use this software? I don't know how many users, but students and collectors of old books (which have no digital copies) will certainly benefit from this.
  • Link to source tarball/zip file: https://github.com/jbarlow83/OCRmyPDF/archive/v7.0.5.tar.gz
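For illustration, the manual route described above and the OCRmyPDF route can be sketched as command lists. This is only a rough sketch, not taken from the project README: the helper names (`manual_pipeline`, `ocrmypdf_pipeline`), the file names, and the 300 DPI value are assumptions, and the commands are printed rather than executed.

```python
import shlex


def manual_pipeline(pdf, stem="page", dpi=300):
    """Build the commands for the manual route (hypothetical helper).

    Step 1 rasterizes every PDF page to PNG with ImageMagick.
    Step 2 runs tesseract on one page image and asks for PDF output.
    Only the first page is shown; the remaining pages and the final
    merge into a single PDF would still have to be handled by hand.
    """
    return [
        ["convert", "-density", str(dpi), pdf, f"{stem}-%03d.png"],
        ["tesseract", f"{stem}-000.png", f"{stem}-000", "pdf"],
    ]


def ocrmypdf_pipeline(pdf, out):
    """Build the single command for the OCRmyPDF route (hypothetical helper)."""
    return [["ocrmypdf", pdf, out]]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    commands = manual_pipeline("scan.pdf") + ocrmypdf_pipeline("scan.pdf", "scan-ocr.pdf")
    for cmd in commands:
        print(shlex.join(cmd))
```

The point of the comparison is only that the manual route needs several steps per page, while OCRmyPDF takes the input and output PDF directly.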

Related Objects

Mentioned Here
T8246: pybind11

Event Timeline

JoshStrobl moved this task from Backlog to Accepted For Inclusion on the Package Requests board.

It requires ruffus as rundep, which looks dead. What's your opinion on this?

DataDrake added a subscriber: DataDrake.

Not to include it.

Apologies for bringing this up again, but I would like to point 2 things out:

It requires ruffus as rundep, which looks dead.

As of v9.0.0 this dependency was removed (see the release notes).

Why should this be included in the repository? We already have tesseract, but it doesn't directly handle PDF files. You must first run imagemagick to convert the pages, etc. OCRmyPDF just automates this, plus it has some additional features.

All of OCRmyPDF's required dependencies (and 1 of its 2 *optional* dependencies) are already available in the repositories. It's an actively developed program, small in size, with quick and great results. The only somewhat practical open-source alternative I've come across is OCRFeeder, a GUI application that comes as a flatpak. But it delivers far worse results, it's slow, and the flatpak hasn't seen any updates in almost 5 years. In addition, it requires a ~2.5 GB download and 5 GB of disk space once installed.

Thus, I'd like to claim that OCRmyPDF is the better solution and much more in accordance with what I understand to be Solus' goals!

If ruffus was the reason for not including it, that obstacle is gone now.
If general maintenance of the package is the problem, I completely understand. But if that's the reason for not including it, please clarify.

Thank you for your time!

Running this application requires several dependencies that are missing from the repository:

  • img2pdf
  • jbig2enc
  • libimagequant
  • pngquant
  • pybind11 (also requested in T8246)
  • python-pdfminer.six
  • python-pikepdf
  • python-setuptools-scm-git-archive

Other:

  • python-pillow needs to be updated.

Edit: Added full list of new packages.

DataDrake changed the task status from Wontfix to Frozen. Feb 21 2022, 6:16 AM