RedTitan RS2 JIT Compiler - OCR using Tesseract advanced configuration

RedTitan EscapeE may be used to recover textual information from most print file and document formats. An area of the page called a 'field' can be drawn on the screen. The textual content of any number of fields can be recovered and saved for later processing.

Some printer files (such as inkjet driver output) and image formats (such as JPEG and TIF) present special problems. Where the document was not created using fonts, the textual information may still be recovered using 'Optical Character Recognition' (OCR).

EscapeE uses RS2 script to interface with the TESSERACT OCR engine.

Contents

Quick start guide
OCR limitations
Installation requirements
Recognised character sets
Special input parameters and whitelist blacklist filters
Download binaries for RS2 Scripting and Tesseract OCR (15 MB)

OCR limitations

If you intend to use OCR to process documents, it is important to understand the limitations of the process. The character shapes used in printed text were not designed to make computer recognition easy, and printing and scanning technologies can also cause surprising errors. Examples include, but are not limited to, the following.

Similar Glyphs

Uppercase I and the digit 1
Digit zero (0) and uppercase letter O
Upper and lower case versions of characters - o,O x,X

Discrimination errors

C and G
D O and Q
Y and V

Character separation

M treated as IVI
Kerning or no inter-character spacing

Of course, scanning at low resolution or from poor originals can introduce further noise-related issues. Very few OCR systems are accurate with very small characters, floating accents, low contrast or punctuation. The very nature of the process makes proofreading difficult: the results may at first appear very good because the redundancy in written language is sufficient for a human reader. If the results are intended for computer processing, it is important to build checking into the process, e.g. compute check digits in reference numbers, look up valid names and cross-reference.
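As an illustration of building checks into the process, the following Python sketch cleans up a numeric reference recovered by OCR and tests its check digit. It is not part of EscapeE or RS2. The table of look-alike characters and the choice of the Luhn check-digit scheme are assumptions made for the example; substitute whatever scheme your own reference numbers use.

```python
# Hypothetical post-OCR check for a numeric reference field.
# ASSUMPTIONS: the look-alike table below and the Luhn scheme are
# illustrative only; they are not defined by EscapeE or Tesseract.

CONFUSABLE_DIGITS = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})


def normalise_numeric(text):
    """Drop whitespace and map look-alike letters to digits."""
    return "".join(text.split()).translate(CONFUSABLE_DIGITS)


def luhn_valid(number):
    """Return True if `number` is all ASCII digits and passes the Luhn check."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def check_reference(ocr_text):
    """Return the cleaned text and whether its check digit is valid."""
    cleaned = normalise_numeric(ocr_text)
    return cleaned, luhn_valid(cleaned)
```

A reference that fails the check can be routed to manual proofreading rather than passed on silently. Note that mapping letters to digits is only safe for fields known to be purely numeric.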

Installation requirements

Ask RedTitan support for IDF licence permission. EscapeE OCR requires TESSERACT 3.01 as a minimum, installed as shown below, where {root} is the folder in which ESCAPEE.EXE is installed. The minimum set may be downloaded from the Tesseract OCR site. RedTitan uses a direct API and does not use the command line tools. To support a non-English language it is sufficient to download a single 'trained data' file, e.g. deu.traineddata contains the special characters (like umlauts) used in German text.

    {root}\plugins\tesseract\tessdata\eng.cube.bigrams
    {root}\plugins\tesseract\tessdata\eng.cube.fold
    {root}\plugins\tesseract\tessdata\eng.cube.lm
    {root}\plugins\tesseract\tessdata\eng.cube.nn
    {root}\plugins\tesseract\tessdata\eng.cube.params
    {root}\plugins\tesseract\tessdata\eng.cube.size
    {root}\plugins\tesseract\tessdata\eng.cube.word-freq
    {root}\plugins\tesseract\tessdata\eng.DangAmbigs
    {root}\plugins\tesseract\tessdata\eng.freq-dawg
    {root}\plugins\tesseract\tessdata\eng.inttemp
    {root}\plugins\tesseract\tessdata\eng.normproto
    {root}\plugins\tesseract\tessdata\eng.pffmtable
    {root}\plugins\tesseract\tessdata\eng.tesseract_cube.nn
    {root}\plugins\tesseract\tessdata\eng.traineddata
    {root}\plugins\tesseract\tessdata\eng.unicharset
    {root}\plugins\tesseract\tessdata\eng.user-words
    {root}\plugins\tesseract\tessdata\eng.word-dawg
    {root}\plugins\tesseract\tessdata\osd.traineddata

The following files must be present to provide the RedTitan technical interface.

    {root}\evaldial.exe                       RS2 scripting wizard
    {root}\plugins\EVALUATE.EEP               evaluate plugin - RS2 scripting API
    {root}\plugins\tesseract\tess301.dll      interface to Tesseract OCR engine
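The file layout above can be checked before OCR is attempted. The Python sketch below is a hypothetical helper, not a RedTitan tool: it assumes evaldial.exe sits directly in {root}, and as a simplification it tests only for the single `<lang>.traineddata` file rather than the full English file set.

```python
from pathlib import Path

# Relative locations of the RedTitan interface files, taken from the
# list above. ASSUMPTION: evaldial.exe lives directly in {root}.
INTERFACE_FILES = (
    ("evaldial.exe",),
    ("plugins", "EVALUATE.EEP"),
    ("plugins", "tesseract", "tess301.dll"),
)


def required_files(root, lang="eng"):
    """List the interface files plus the traineddata file for `lang`."""
    base = Path(root)
    files = [base.joinpath(*parts) for parts in INTERFACE_FILES]
    files.append(
        base.joinpath("plugins", "tesseract", "tessdata", lang + ".traineddata")
    )
    return files


def missing_files(root, lang="eng"):
    """Return the required files that are not present under `root`."""
    return [path for path in required_files(root, lang) if not path.is_file()]
```

Calling `missing_files` with the EscapeE install folder and a language code such as `"deu"` returns an empty list when everything needed for that language is in place.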

Recognised character sets

Tesseract OCR uses a database that contains the recognition information for a number of glyphs. This information is derived from a process called 'training'. Target character shapes are compared to the database information and given a recognition 'confidence' value. The OCR engine is unaware of the positional context of a character shape but extended parameters may be used to give hints to the recognition system. e.g. the 'whitelist' parameter lists what characters may appear in the scanned graphic. A character that is not in the selected language set will not be recognised.

By default, all 'trained' characters in the English (eng) set are processed.

Note: The English trained data uses punctuation characters defined in the 16-bit Unicode U+20xx range. To explicitly reference these characters from a simple RS2 script language statement you must use a UTF-8 capable editor (like NOTEPAD).

To add support for a particular language, search for the support files using the ISO 639-3 language code, e.g.

deu.traineddata	German
fra.traineddata	French
nld.traineddata	Dutch
spa.traineddata	Spanish
ita.traineddata	Italian

Quick start

STEP 1 - Draw around and name the fields you want to OCR.

STEP 2 - Create a dummy field called RS2 (say). On the FIELDS dialogue ADVANCED tab attach the RS2 field to the EVALUATE plugin. Click the CONFIGURE button to launch the RS2 wizard.

STEP 3 - Click EDIT to create an RS2 script file.

STEP 4 - Choose 'OCR and set field value' from the wizard task list and click ADD.

STEP 5 - Check the fields NAME and DOB and click OK.

STEP 6 - The wizard has added the RS2 script to OCR the NAME and DOB fields and set the field text values in EscapeE. Click OK and SAVE the changes to the RS2 file.

STEP 7 - The OCR fields have now been recovered as text. Use the FILE EXPORT menu to reprocess the data.

Special input parameters

Example RS2 EVALUATE script.

                   // REDTITAN RS2 CONTROL
                   L:=[];
                   ocr('REF',L);
                   text(10,10,L);

Note 1: The Pascal syntax is extended in RS2 to include a simple list type.
Note 2: The first line of an RS2 file is treated as a signature and is mandatory.

In this simple example, the field called REF is processed by the OCR engine and re-displayed on a different part of the page (useful when debugging OCR behaviour). Any input parameters are supplied in the list parameter as literal name=value pairs.

e.g. L:=["lang=deu","whitelist=0123456789,."];

whitelist gives a list of characters that are permitted in the recognition process.
blacklist gives a list of characters that are NOT permitted in the recognition process.
lang gives the language character set to be used. The equivalent traineddata file must be installed.
dir gives a complete path to the tessdata directory. Used with care, this can be used to experiment with a new recognition database or to add dictionary files.
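
Where many fields need different parameters, the RS2 statements can be generated rather than typed by hand. The Python sketch below is a hypothetical helper that reproduces the script layout of the example above. The function names are invented for illustration, and because quote escaping in RS2 strings is not described here, values containing quotes are rejected rather than guessed at.

```python
def rs2_param_list(params):
    """Format name/value pairs as an RS2 list literal, e.g. ["lang=deu"]."""
    items = []
    for name, value in params.items():
        if '"' in name or '"' in value:
            # ASSUMPTION: RS2 quote escaping is unknown, so refuse.
            raise ValueError(f"double quote not supported in {name!r}")
        items.append(f'"{name}={value}"')
    return "[" + ",".join(items) + "]"


def rs2_ocr_script(field, params):
    """Build a script mirroring the example: OCR `field`, then display it."""
    if "'" in field:
        raise ValueError("single quote not supported in field name")
    return "\n".join([
        "// REDTITAN RS2 CONTROL",  # mandatory signature line
        f"L:={rs2_param_list(params)};",
        f"ocr('{field}',L);",
        "text(10,10,L);",
    ])
```

For example, `rs2_param_list({"lang": "deu", "whitelist": "0123456789,."})` yields the same list literal as the e.g. line above. Save the generated file with a UTF-8 capable editor or encoding if the whitelist contains non-ASCII characters.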