Software Companions - Viewers and Converters

Image Optical Character Recognition (OCR)

How to do OCR in ViewCompanion Premium with Tesseract

ViewCompanion Premium can convert an image, such as a scanned document, into a searchable PDF using optical character recognition (OCR) with the help of Tesseract. Tesseract is a free OCR engine that you can install on your system at no cost. You can find a download link for Tesseract at the bottom of this page.
The picture below shows a typical scanned document opened in ViewCompanion:


Scanned document with foxing

If your scanned document is old, it may have stains, browning, or other age-related deterioration. Browning, also known as foxing and visible in the picture above, can be removed first using the built-in Defoxing or Binarization filters.
Please note that if your file is a scanned PDF, you must press the Edit PDF as Image button before using these filters or the OCR function.

Since we primarily need the text, we used the Binarization filter, which produces a black-and-white image, as shown below:


Document after binarization
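To illustrate what binarization does in general, here is a minimal sketch in Python. It uses Otsu's method, a common automatic thresholding technique; this page does not say which algorithm ViewCompanion's Binarization filter uses, so treat Otsu here as an assumption chosen for illustration. The function names and the tiny sample image are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that best separates dark from light.

    Pixels with value <= t are treated as dark. The threshold is chosen by
    maximizing the between-class variance (Otsu's method).
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        between_var = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image):
    """Map an 8-bit grayscale image (list of rows) to pure black (0) and white (255)."""
    flat = [value for row in image for value in row]
    t = otsu_threshold(flat)
    return [[0 if value <= t else 255 for value in row] for row in image]


# Hypothetical 2x3 grayscale sample: dark ink (30-40) on a light page (210-230).
sample = [[30, 40, 220],
          [35, 210, 230]]
binary = binarize(sample)
```

Because every pixel ends up either black or white, mid-tone discoloration such as browning is pushed to one side of the threshold, which is why binarization is useful when only the text matters.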

After running the filter, some stains may still remain.
You can remove the remaining noise using the clear area tool and the clear polygon tool.
When you're ready to run the OCR, locate and click the OCR button in the Premium tab, as shown below:


Premium OCR tool

After a while, you will be prompted to enter a file name for the resulting PDF file. When the OCR conversion is complete, you can open the resulting PDF in a viewer such as Acrobat to verify that it is now searchable:


Searchable PDF after OCR
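For comparison, the standalone Tesseract command line can also produce a searchable PDF from an image. The sketch below wraps that command in Python. It is not a description of how ViewCompanion calls Tesseract internally, which this page does not document. The file names are hypothetical, and the sketch assumes the tesseract executable is on your PATH and the English language data is installed.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the Tesseract command line for a searchable PDF.

    General form: tesseract <image> <output base> [-l <lang>] [configfile]
    The "pdf" config makes Tesseract write <output_base>.pdf.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def ocr_to_searchable_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract and return the path of the PDF it is expected to write."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".pdf"


# Hypothetical file names, for illustration only.
command = build_tesseract_command("scan_binarized.png", "scan_searchable")
```

Calling ocr_to_searchable_pdf("scan_binarized.png", "scan_searchable") would then be expected to write scan_searchable.pdf, provided the input image exists and Tesseract is installed.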

Tesseract


You can download Tesseract here: https://github.com/UB-Mannheim/tesseract/wiki


ViewCompanion Premium


Read more about ViewCompanion Premium


Click here to download a 30-day trial now