I like to say that OCR is in many ways like healthcare: you should try to prevent reading garbage before reaching for symptom cures like spelling correction on OCR output.

This article is more of a logically ordered collection of practical tips than a full-blown tutorial. I hope you will find it useful when dealing with difficult sets of texts to process.

I tend to use the open source tool developed by Google called Tesseract. You run it from the command line, and you can OCR a whole directory of images with a Bash script looking, for example, like this:

for i in {0001..0500}
do
   tesseract 'IMAG'$i'.JPG' 'text-'$i -l pol+lat --psm 6
done

It runs Tesseract on images from IMAG0001.JPG to IMAG0500.JPG and outputs the text to files from text-0001.txt up to text-0500.txt (the program adds the .txt extension by itself). In this case, we specify two languages that can appear in the texts (Polish and Latin).
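As a side note, it's the zero-padded brace expansion that keeps the loop's numbers aligned with the camera's filenames; you can check what it produces before running the whole job:

```shell
# Bash pads the expansion when the range bounds have leading zeros:
echo {0001..0003}
# prints: 0001 0002 0003
```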

You can find the three-letter language codes here (ISO 639-2/T and ISO 639-2/B), although you should also see them when installing the appropriate language model for Tesseract – on Linux, I just install them from my system's packages.
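For example, on Debian or Ubuntu the models used above can be installed roughly like this (the package names are an assumption – check your distribution's repositories), and you can then ask Tesseract which languages it actually sees:

```shell
# Package names below are Debian/Ubuntu-style and may differ on your distro:
sudo apt install tesseract-ocr-pol tesseract-ocr-lat

# List the language models Tesseract can find:
tesseract --list-langs
```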

Page segmentation mode

Left to itself, Tesseract has to guess an awful lot about the text that it tries to read. We should try to tell it as much as possible.

In many cases, we want to tell Tesseract that there's only one column of text on the page. That's what the --psm 6 flag does. Here are some other options, extracted from Tesseract's advanced help output (you can see it in the console with tesseract --help-extra):

3    Fully automatic page segmentation, but no OSD. (Default)
4    Assume a single column of text of variable sizes. /so the font size fluctuates inside the page/
5    Assume a single uniform block of vertically aligned text. /this seems to mean that text runs vertically, 90 degrees rotated?/
6    Assume a single uniform block of text. /this is what I use in this example/
7    Treat the image as a single text line.
8    Treat the image as a single word.
9    Treat the image as a single word in a circle.
10    Treat the image as a single character.
11    Sparse text. Find as much text as possible in no particular order.
12    Sparse text with OSD.
13    Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

OSD here means orientation and script detection. Normally Tesseract assumes that you give it the images in the proper orientation (i.e., not rotated) and that it knows the script (such as the Latin script, for example) from the language declaration.
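If you're unsure about the orientation or script of a page, you can ask Tesseract to run OSD alone: --psm 0 does detection without doing any OCR (note that it needs the separate osd model installed; the filename here is just the one from the earlier example):

```shell
# --psm 0: orientation and script detection only; 'stdout' tells
# Tesseract to print the results instead of writing an output file.
tesseract IMAG0001.JPG stdout --psm 0
```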

When converting my images, I initially saw that the OCR had trouble recognizing that there is only one column of text. This may not be the best example, since it's not an easy text to comprehend, but here's a one-column text that Tesseract nonsensically garbled into "many columns":

13. DUbI'a RQ P. w starych summaCh'zawiedzione, które za czworgiem
iż O tym[ pewny wiadomości nie masz,

dożywocia wolne zostawać miały,
za wyści-em czworga dożywocia

któreby po konstytucyi anni 1566il już
Wolne hely, a_byg bel nalezion obyczaj, jakoby się z tym nikth in frau-
demi Reipublicae nie utaiwał, i owszem żeby i z tych dóbr kwarta Bei-
pub'licae za taksą- wyżej mianowana. i prowent K. JM. przymnożony- bel,
także .z. dóbr-aad rationem quingentorum milium flor. zawiedzionych,
i' tych, które po odjachaniu króla. Henryka z Korony przez żołnierze
w zapłacie zatrzymany okkupowane są; takowe sumy aby ważne 'nie hely.

And here’s the same fragment after improving the image somewhat and setting the page segmentation:

© 13. Dobra R. P. w starych summach zawiedzione, które za czworgiem
dożywocia wolne zostawać miały, iż 0 tym” pewny wiadomości nie masz,
któreby po konstytucyi anni 1566* już za wyściem czworga doży wocia
wolne beły, aby” beł nalezion obyczaj, jakoby się z tym nikt" in frau-
dem! Reipublicae nie utaiwal, i owszem ieby i z tych dóbr kwarta Rei-
publieae za taksą wyżej mianowaną i prowent K. JM. przymnożony: beł,
także z. dóbr: ad rationem quingentórum milium flor. zawiedzionych,
i tych, które po odjachaniu króla Henryka z Korony przez żołnierze
w zapłacie zatrzymany okkupowane Są; takowe sumy aby ważne nie beły.

Batch cropping with ImageMagick

If you’re dissatisfied with the quality of Tesseract’s output, the best way to improve it is to try some cropping and enhancement techniques on your images – then compare how OCR performs on the different versions.

One common problem is images with too little contrast (perhaps they're shaded), making it harder to distinguish the text from the background. In my experience, playing with this didn't improve much – but your images and scanning hardware artifacts may be different. That's why it's very important to test, on small batches, which image enhancements actually lead to improvement.
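As a sketch of such a small-batch experiment (assuming ImageMagick is installed; the exact -contrast-stretch percentages are something to tune, not a recommendation):

```shell
# Try grayscale conversion plus mild contrast stretching on the
# first three images only, writing to separate files for comparison:
for i in {0001..0003}
do
   convert 'IMAG'$i'.JPG' -colorspace Gray -contrast-stretch 1%x1% 'contrast-'$i'.jpg'
done
```

You can then OCR both versions and diff the output to see whether the enhancement actually helped.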

Many transformations related to dewarping, cropping and so on can be done in a GUI with the free and very sensible ScanTailor program. It sometimes crashed for me, so I resorted to the old-school, Linux-standard ImageMagick CLI toolkit. It has fairly good docs and there is plenty of information on the Internet about how to use it.

I have two important tips related to ImageMagick:

  1. There are two main commands: convert saves the result image to a different file, and mogrify modifies the file in-place. They have otherwise pretty much the same syntax (see below for an example). If you use mogrify, make sure to have backup copies!
  2. If you do batch processing natively in ImageMagick (e.g. running convert *.jpg ... for the whole directory), it tries to do everything in RAM before writing anything to disk. With large batches this can freeze the OS and crash the program without saving anything. My advice: run batch processing in a Bash script instead, like so:
for i in {0001..0500}
do
   convert 'IMAG'$i'.JPG' -crop 85%x100%+0+0 'out-'$i'.jpg'
done

What this particular script does is crop away the right 15% of the image's width.

To break down the 85%x100%+0+0 argument to -crop:

  • keep 85% of the original width (or, crop away 15% of the image),
  • keep 100% of the original height,
  • start the crop region 0 px from the left edge (if we put 100 there, the crop would start 100 px in, so 100 px would be removed on the left and 100 px less of the 15% on the right),
  • start the crop region 0 px from the top.
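In concrete pixels, for a hypothetical 3000×2000 px scan the geometry works out like this:

```shell
# 85%x100%+0+0 on a 3000x2000 image keeps a 2550x2000 region
# anchored at the top-left corner:
width=3000; height=2000
echo "$(( width * 85 / 100 ))x$(( height * 100 / 100 ))+0+0"
# prints: 2550x2000+0+0
```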

Often you want to apply this transformation one image at a time, watching how each thumbnail changes in your file browser (assuming you have time for this). Use

mogrify -crop 85%x100%+0+0 IMAG001.JPG

instead of

convert IMAG001.JPG -crop 85%x100%+0+0 'out-001.jpg'

if you want to change the file in-place (in that case, just skip the output filename).

To rotate 90 degrees clockwise:

convert IMAG001.JPG -rotate "90" 'out-001.jpg'

Using -rotate "-90" would make it counter-clockwise.
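Related to rotation: if your pages are only slightly tilted rather than rotated by a right angle, ImageMagick's -deskew operator can straighten them. Its threshold argument controls how aggressively skew is detected; 40% is the commonly used value, but treat it as a starting point to test, not a recommendation:

```shell
# Straighten a slightly tilted scan (small skew angles only):
convert IMAG001.JPG -deskew 40% 'out-001.jpg'
```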

Locally stretching an image in GIMP


I want to share one more tip with you. Sometimes, when using a hand scanner, some fragments of text get squeezed and become difficult for OCR software to read. In fact, images distorted in this way used to be what you would get as CAPTCHA challenges back in the day. The method I found for dealing with the worst cases is using GIMP (email me if you know something better and I'll update this). It doesn't scale very well, but it seems rather effective. Chances are you can find something similar in Photoshop and the like.

The tool in question is called Handle Transform, and in my installation I can access it with Shift+L. Place three dots by clicking: two at the boundaries of the affected region and one in the middle.

Now grab the middle dot and gently move your mouse sideways (not vertically!). The image will start to stretch. Be bold! Make it too wide. All that matters is that Tesseract can make out the letters; it doesn't really have feelings or a sense of comfort (just so you know).

You should confirm by pressing Enter or clicking “Transform”. You can now press Ctrl+E to export the image.

You may ask: why not uniformly stretch the whole image? The reason is that we want to make the width of each letter more or less the same – so that Tesseract can look at the image and assume that the letter width is roughly x. We want to get rid of areas where this assumption would be violated: for example, where the letters are relatively too narrow, narrower than roughly x. Making characters as easily distinguishable as possible is the key to OCR preprocessing.

So now, hopefully, you have your images converted to text as well as possible, and you can proceed to all the fancy processing and NLP. Good luck!