VietOCR

VietOCR, available as Java and .NET executables, is a GUI frontend for the Tesseract OCR engine. Both versions have a similar graphical user interface and can recognize text in images of common formats. The program can also run as a console application from the command line.

Batch processing is supported as well. The program monitors a watch folder for new image files, automatically processes them through the OCR engine, and outputs recognition results to an output folder.
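
The watch-folder idea can be sketched as a single polling pass. This is only an illustration, not VietOCR's implementation: the function name process_new_images, the suffix list, and the run_ocr callable (image path in, recognized text out) are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; the real program may accept others.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg", ".jpeg", ".bmp", ".gif"}

def process_new_images(watch_dir, output_dir, run_ocr, seen):
    """Run OCR once over image files in watch_dir that are not yet in `seen`.

    run_ocr is a hypothetical callable: image path -> recognized text.
    Returns the list of result files written during this pass.
    """
    watch_dir, output_dir = Path(watch_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(watch_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES or image.name in seen:
            continue
        result = output_dir / (image.stem + ".txt")
        result.write_text(run_ocr(image), encoding="utf-8")
        seen.add(image.name)
        written.append(result)
    return written

# Demo with a stand-in "engine" that only echoes the file name.
with tempfile.TemporaryDirectory() as tmp:
    watch, out = Path(tmp, "watch"), Path(tmp, "out")
    watch.mkdir()
    (watch / "page1.png").write_bytes(b"")
    (watch / "notes.txt").write_text("not an image", encoding="utf-8")
    seen = set()
    fake_ocr = lambda p: "text of " + p.name
    first = process_new_images(watch, out, fake_ocr, seen)
    second = process_new_images(watch, out, fake_ocr, seen)
    first_names = [p.name for p in first]
    first_text = first[0].read_text(encoding="utf-8")
```

Calling such a function in a loop with a short pause between passes gives a simple polling watcher; the second pass in the demo finds nothing new because page1.png is already recorded in seen.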

Language data for Vietnamese and English is already bundled with the program. Data for other languages can be downloaded from the Tesseract website and should be placed in the tessdata folder. Note that the language data files for Tesseract 2.0x and 3.0 are in different formats and are not interchangeable, so be sure to download the ones compatible with your Tesseract version (2.0x - 3.02, 3.03, and 4.00).

Note: Some languages, such as Arabic and Hindi, have cube components; these need to be downloaded and copied into tessdata as well.

Installation

The Java version requires Java Runtime Environment 8 or later (installation instructions). On Windows, the Microsoft Visual C++ 2015-2022 Redistributable Package is also required.

For Linux, Tesseract and its language data packages are in the Graphics (universe) repository. They can be installed using Synaptic or by the following command:

sudo apt-get install tesseract-ocr tesseract-ocr-vie

The files will be placed in /usr/bin and /usr/share/tesseract-ocr/tessdata, respectively.

On the other hand, if Tesseract is built and installed from source, the files will be placed in /usr/local/bin and /usr/local/share/tessdata. VietOCR looks for the language data files at those locations; if tessdata is located in any other directory, you will need to set the environment variable TESSDATA_PREFIX, for example:

export TESSDATA_PREFIX=/usr/local/share/

Put this line (or its equivalent, such as a setenv command in csh-style shells) in your .profile or other shell startup file. Note that the directory path must end in a /.
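
As a small illustration of the trailing-slash rule, the sketch below derives the value from the location of a tessdata directory. It follows the convention of the example above, where the variable names the directory that contains tessdata; the helper name tessdata_prefix_for and the /opt path are hypothetical, and your Tesseract version's documentation remains the authority on what the variable should point to.

```python
import posixpath

def tessdata_prefix_for(tessdata_dir):
    """Return the parent of an absolute tessdata directory, ending in '/'.

    Hypothetical helper: mirrors the example above, in which
    /usr/local/share/tessdata corresponds to /usr/local/share/.
    """
    parent = posixpath.dirname(tessdata_dir.rstrip("/"))
    return parent if parent.endswith("/") else parent + "/"

# Build the export line for a tessdata folder in a non-standard location.
print("export TESSDATA_PREFIX=" + tessdata_prefix_for("/opt/tesseract/share/tessdata"))
```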

Optional support for the Tess4J library is provided. Note that any exception raised inside Tess4J will cause the program to crash.

The .NET version requires Microsoft .NET Framework 4.8. If you encounter an "Exception has been thrown by the target of an invocation" or "The program can't start because VCRUNTIME140.dll is missing from your computer" error, please install the Microsoft Visual C++ 2015-2022 Redistributable Package.


Figure 1: VietOCR.NET WinForm GUI
Figure 2: VietOCR Swing GUI

Scanning support on Windows is provided via the Windows Image Acquisition Library v2.0, which requires Windows XP Service Pack 1 (SP1) or later; the library comes standard with Windows Vista and 7. To install the WIA Library on Windows XP, copy the wiaaut.dll file to your System32 directory (usually located at C:\Windows\System32) and run from the command line:

regsvr32 C:\Windows\System32\wiaaut.dll

On Linux, scanning requires installation of SANE packages:

sudo apt-get install libsane sane sane-utils libsane-extras xsane

PDF support is available via GPL Ghostscript. After installing the library, ensure that the shared library object (gsdll64.dll or libgs.so) is in the search path by setting the appropriate environment variable. On Windows, append the following to the Path value (accessible through Control Panel > System > Advanced > Environment Variables) for GS version 10.03.1:

;C:\Program Files\gs\gs10.03.1\bin

To install GS on Linux:

sudo apt-get install ghostscript

To set the library search path:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

Here, /usr/local/lib is the directory that contains libgs.so, a link to libgs.so.10.03.1. This step may be unnecessary, since the path may already have been set during the installation of GS.

Spellcheck functionality is available through Hunspell, whose dictionary files (.aff, .dic) should be placed in the dict folder of VietOCR. user.dic is a UTF-8-encoded file that contains a list of custom words, one word per line.
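
Because user.dic is just a UTF-8 list with one word per line, it can be maintained with a few lines of script. The sketch below is an illustration only; the helper name add_custom_words is hypothetical, and the demo writes to a temporary folder rather than to a real VietOCR dict folder.

```python
import tempfile
from pathlib import Path

def add_custom_words(dic_path, words):
    """Merge words into a one-word-per-line UTF-8 file, skipping duplicates."""
    dic_path = Path(dic_path)
    existing = dic_path.read_text(encoding="utf-8").splitlines() if dic_path.exists() else []
    merged = [w.strip() for w in existing if w.strip()]
    for word in words:
        word = word.strip()
        if word and word not in merged:
            merged.append(word)
    # One word per line, with a trailing newline after the last word.
    dic_path.write_text("\n".join(merged) + "\n", encoding="utf-8")
    return merged

# Demo against a temporary file standing in for dict/user.dic.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp, "user.dic")
    add_custom_words(path, ["VietOCR", "Tesseract"])
    words = add_custom_words(path, ["Tesseract", "tessdata"])
    content = path.read_text(encoding="utf-8")
```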

On Linux, Hunspell and its dictionaries can be installed using Synaptic or apt, as follows:

sudo apt-get install hunspell hunspell-en-us

OCR Operation

The included Vietnamese language data was generated specifically for the Times New Roman, Arial, Verdana, and Courier New fonts. Recognition is therefore more successful on images whose glyphs resemble those fonts. OCRing images whose glyphs look different from the supported fonts will generally require training Tesseract to create another language data pack for those typefaces.

Update: More language data has been generated for legacy Vietnamese fonts, VNI and TCVN3 (ABC). It can be downloaded through the Download Language Data submenu.

Images to be OCRed should be scanned at a resolution between 200 and 400 DPI (dots per inch). Scanning at higher resolutions will not necessarily improve recognition accuracy, which currently can be higher than 97% for Vietnamese (reference image); the next release of Tesseract may improve it further. Even so, the actual rates still depend greatly on the quality of the scanned image.

Typical scan settings are 300 DPI at 1 bpp (bit per pixel) black and white or 8 bpp grayscale, saved in uncompressed TIFF or PNG format. PNG keeps files small without losing quality because it uses lossless compression; TIFF has the advantage that a single file can contain multiple images (pages).

The Screenshot Mode offers better recognition rates for low-resolution images, such as screen prints, by rescaling them to 300 DPI.
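
The arithmetic behind such rescaling is a single ratio. The sketch below assumes a 96 DPI source, a common but not universal screen resolution; that value and the function name rescaled_size are assumptions for illustration, not VietOCR's actual implementation.

```python
def rescaled_size(width, height, source_dpi=96, target_dpi=300):
    """Pixel dimensions after resampling an image from source_dpi to target_dpi."""
    scale = target_dpi / source_dpi  # 300 / 96 = 3.125 for the assumed default
    return round(width * scale), round(height * scale)

# A 1920x1080 screen capture grows to 6000x3375 pixels at 300 DPI.
print(rescaled_size(1920, 1080))  # (6000, 3375)
```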

Tip: Running OCR on a region of interest, selected by dragging the mouse over the image, generally produces better accuracy.

In addition to the built-in text postprocessing algorithm, you can add your own custom text replacement scheme via a UTF-8-encoded, tab-delimited text file named x.DangAmbigs.txt, where x is the ISO 639-3 language code. Both plain-text and regular-expression replacements are supported.
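
A rough sketch of how a tab-delimited replacement list could be applied is shown here. The exact layout VietOCR expects is not described in this document, so the two-column "find, tab, replacement" layout, the skipping of lines that start with #, the use_regex switch, and the function names are all assumptions made for illustration.

```python
import re

def parse_rules(lines):
    """Parse 'find<TAB>replace' lines (assumed layout) into (find, replace) pairs."""
    rules = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: blank lines and lines starting with '#' carry no rule.
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) >= 2:
            rules.append((parts[0], parts[1]))
    return rules

def apply_rules(text, rules, use_regex=False):
    """Apply each rule in order, as literal text or as a regular expression."""
    for find, replace in rules:
        if use_regex:
            text = re.sub(find, replace, text)
        else:
            text = text.replace(find, replace)
    return text

plain_rules = parse_rules(["# find<TAB>replace", "h0a\thoa", "marg\tmang"])
fixed_plain = apply_rules("h0a marg", plain_rules)
# Regex rule: replace "1a" only when it stands alone as a word.
fixed_regex = apply_rules("1a 1an", [(r"\b1a\b", "la")], use_regex=True)
```

The regular-expression variant is useful when a literal replacement would be too greedy: here the word boundary keeps "1an" untouched while "1a" becomes "la".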

Built-in tools are provided to merge several images or PDF files into a single file for convenient OCR operations, and to split a PDF file into smaller ones when it is too large, since a very large file can cause out-of-memory exceptions. Pasting images from the clipboard is also supported.

Postprocessing

Recognition errors can be classified into three categories. Many errors are related to letter case (for example, hOa, nhắC) and can be easily corrected by popular Unicode text editors. Many other errors result from the OCR process itself, such as missing diacritical marks or wrong letters of similar shape: huu for hưu, marg for mang, h0a for hoa, 1a for la, uhìu for nhìn. These can also be easily fixed by Vietnamese spell checker programs. VietOCR's built-in Postprocessing function can help correct many of the above errors.

The last category of errors is more difficult to detect because the errors are semantic: the words are valid dictionary entries but are wrong in context (e.g., tinh versus tình, vân versus vấn). These errors require the editor to read through the text and correct them manually against the original image.

The following editing process using the built-in functionality is suggested:

1. Group lines. After OCR, each line becomes a separate one-line paragraph, so the lines need to be regrouped into the paragraphs they belong to. Use the Remove Line Breaks function under the Format menu. Note that this operation may not be needed for poems.
2. Select Change Case, also under the Format menu, and choose Sentence case to correct letter case errors. Then locate and fix any remaining letter case errors.
3. Correct misspellings using the integrated Spell Check.
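
The first two steps can be approximated outside the program as follows. This is a minimal sketch under stated assumptions, not the code behind VietOCR's Format menu: paragraphs are assumed to be separated by blank lines, sentences are assumed to end with ., ! or ? followed by whitespace, and both function names are chosen for the example.

```python
import re

def remove_line_breaks(text):
    """Join the lines of each paragraph; blank lines are kept as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    )

def sentence_case(text):
    """Lowercase everything, then capitalize the first letter of each sentence."""
    lowered = text.lower()
    return re.sub(
        r"(^|[.!?]\s+)(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
    )

raw = "hOa nở\ntrong vườn.\nnhắC lại\nlần nữa.\n\nđoạn hai."
grouped = remove_line_breaks(raw)
cleaned = sentence_case(grouped)
```

Note that this naive sentence casing also lowercases proper nouns and acronyms, which is one reason a manual pass for the remaining letter case errors is still needed.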

After the above steps, most common errors should be eliminated. The remaining semantic errors are few, but they require a human editor to read through the text and make the edits needed for the document to match the original scanned document. If heavier editing is required, you can use a word processor or full-featured text editor (Word, Writer, Notepad, VietPad, etc.) for the task.

Limitations

Tesseract 2.0x does not support page layout analysis and can therefore recognize only single-column text. Tesseract 3.0x includes page layout analysis and supports recognition of multi-column documents.
