I have photo-scanned one of my most-used reference works, Watching the Skies. The aim was to let me search the text instead of using the (somewhat odd) index. I copied the images off my iPhone and used Preview on my Mac to produce a single PDF containing all of the pages. The result is surprisingly good.

So now I have a huge document of fairly readable scanned images. But there is no "text layer" that I can search. Is there a way to do that within Preview, or some other tool I can use?

  • Today I learned that even though Preview OCRs images and lets you copy text out of an image, it will not let you search for text in an image. I also learned that it does not OCR images in PDF files at all. I thought Preview would do both of those things, and now I am wondering why it does not. – Dave Nelson Apr 18 '23 at 20:36
  • Can you post a page (one with lots of text) from the book and I can run OwlOCR and Nitro PDF with it. – Gilby Apr 18 '23 at 22:54
  • Strictly speaking, this is a duplicate of https://apple.stackexchange.com/questions/76471/make-existing-pdf-searchable-ocr-via-command-line-script?rq=1. But that question and its answers are old enough that it is IMO worth addressing again. – Gilby Apr 18 '23 at 23:01
  • @DaveNelson - this is a mystery to me, especially after they added Live Text, which does precisely this, but only manually, and only on a selection. – Maury Markowitz Apr 19 '23 at 15:23

2 Answers

Use Tesseract

This is open-source OCR software, available through MacPorts or Homebrew, that can write several output formats:

  • txt
  • pdf
  • hocr
  • tsv
  • pdf with text layer only

Ideally, you’d start from the page images (before making a PDF) and let Tesseract create the searchable PDF for you:

tesseract foobar.tif foobar pdf
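For a whole book of page images, Tesseract can also take a plain text file listing one image path per line and write a single multi-page PDF. The following is a minimal sketch of that approach, not a tested recipe for this particular scan: the function names, the `~/scans` folder and the file extensions are assumptions for illustration, and it assumes the page files sort into reading order by name. iPhone photos saved as HEIC would probably need converting to PNG or JPEG first.

```shell
# Hypothetical helper: write the page images found in a folder to a list
# file, one path per line, sorted by name so pages stay in scan order.
make_page_list() {
    dir=$1
    list=$2
    find "$dir" -maxdepth 1 -type f \
        \( -name '*.png' -o -name '*.jpg' -o -name '*.tif' \) \
        | sort > "$list"
}

# Hypothetical helper: OCR every image in the list into one searchable PDF.
# The second argument is the output base name; Tesseract appends ".pdf".
ocr_book() {
    list=$1
    out=$2
    tesseract "$list" "$out" pdf
}

# Example use (paths are placeholders):
#   make_page_list ~/scans pages.txt
#   ocr_book pages.txt watching-the-skies
```

Sorting by file name only gives the right page order if the names are zero-padded or otherwise sort naturally, so it is worth checking the list file before running the OCR step.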

However, if you already have an existing PDF, the Tesseract FAQ covers this case:

Use the config variable -c textonly_pdf=1 to produce a text-only PDF, then merge it with your image-only PDF.

See https://github.com/tesseract-ocr/tesseract/issues/660#issuecomment-274213632 for details.
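As a rough sketch of what that could look like end to end: Tesseract does not read PDFs directly, so the pages have to be rendered to images first. The tools used here for rendering and merging (`pdftoppm` from Poppler and `qpdf`) are my own assumptions, not something the linked issue prescribes, and the function name and file names are placeholders. Whether the text layer lines up with the scan depends on the rendered images keeping their resolution information, so check a few pages of the result.

```shell
# Sketch: add an invisible text layer to an image-only PDF.
# Assumes pdftoppm (Poppler), tesseract and qpdf are installed.
# Set RUN=echo to print the external commands instead of running them.
add_text_layer() {
    in=$1
    out=$2
    work=$(mktemp -d) || return 1

    # 1. Render each PDF page to a 300 dpi PNG named page-<n>.png.
    ${RUN:-} pdftoppm -r 300 -png "$in" "$work/page" || return 1

    # 2. List the rendered pages in order for Tesseract.
    find "$work" -name 'page-*.png' | sort > "$work/pages.txt"

    # 3. OCR all pages into one text-only PDF (text layer, no images).
    ${RUN:-} tesseract "$work/pages.txt" "$work/text" -c textonly_pdf=1 pdf || return 1

    # 4. Lay each text-only page over the matching original page.
    ${RUN:-} qpdf "$in" --overlay "$work/text.pdf" -- "$out" || return 1

    rm -rf "$work"
}

# Example use (file names are placeholders):
#   add_text_layer scanned.pdf searchable.pdf
```

Running it once with `RUN=echo add_text_layer scanned.pdf searchable.pdf` prints the commands without touching any files, which is a cheap way to see what it would do before committing to a long OCR run.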

Gilby
  • 10,852
Allan
  • 101,432

Use OwlOCR

This is a modestly priced OCR product (from the Apple App Store), which in my experience performs just as well as more expensive products like PDFPen Pro (now Nitro PDF Pro).

Whenever I have tried free OCR (most notably Tesseract), I have been thoroughly disappointed in the results.

Opinion: OCR on the Mac has been revolutionised by the inbuilt Image to Text engine. Quality OCR is no longer restricted to expensive products. My understanding is that OwlOCR (and a few others) tap into this engine.

Gilby
  • 10,852