jTessBoxEditor

jTessBoxEditor is a box editor for Tesseract OCR data that supports editing box data in both the Tesseract 2.0x and 3.0x formats. It can read images in common formats, including multi-page TIFF. The program requires Java Runtime Environment 6.0 or later.

jTessBoxEditor is released and distributed under the Apache License, v2.0.

Double-click the JAR file to launch the program, or execute the following command:

java -Xms128m -Xmx512m -jar jTessBoxEditor.jar

[Screenshots: jTessBoxEditor Swing UI; Box View]

You will need to provide the TIFF/Box files as input to the program. Images to be used in training should be 300 DPI, uncompressed TIFF files in either 1 bpp (bit per pixel) black-and-white or 8 bpp grayscale.

The box files, encoded in UTF-8, are generated by the Tesseract executables with the appropriate command-line options (see Tesseract Wiki). The training process can be automated using train.ps1, a Windows PowerShell script.

Note that the coordinate system used in the box file has its origin (0,0) at the bottom-left corner, whereas computer graphics devices define (0,0) as the top-left corner. jTessBoxEditor works with and displays coordinates in the graphics device system. Edited box files are still read and written in the proper box file format.
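The conversion between the two coordinate systems can be sketched as follows. This is a minimal Python illustration, not jTessBoxEditor's actual code. It assumes the Tesseract 3.0x box line layout `<char> <left> <bottom> <right> <top> <page>` (2.0x lines omit the page field) and that the image height in pixels is known. The names `Box`, `parse_box_line`, and `format_box_line` are hypothetical.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """A box in graphics device coordinates (origin at top-left)."""
    char: str
    x: int                # left edge
    y: int                # top edge, measured downward from the top of the image
    width: int
    height: int
    page: Optional[int]   # None for 2.0x-style lines without a page field


def parse_box_line(line: str, image_height: int) -> Box:
    """Parse one box file line and flip it to top-left-origin coordinates."""
    parts = line.split()
    char = parts[0]
    left, bottom, right, top = (int(v) for v in parts[1:5])
    page = int(parts[5]) if len(parts) > 5 else None
    # Box file y values are measured upward from the bottom edge, so flip them.
    return Box(char, left, image_height - top, right - left, top - bottom, page)


def format_box_line(box: Box, image_height: int) -> str:
    """Convert a top-left-origin box back to a box file line."""
    left = box.x
    right = box.x + box.width
    top = image_height - box.y
    bottom = top - box.height
    fields = [box.char, left, bottom, right, top]
    if box.page is not None:
        fields.append(box.page)
    return " ".join(str(f) for f in fields)
```

For an image 100 pixels tall, the line `A 10 20 30 50 0` maps to a box at x=10, y=50 with width 20 and height 30, and converting it back yields the original line.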

A conversion function is included to convert numeric character references (NCRs) and escape sequences entered in the Character text field to Unicode characters.
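As an illustration of what such a conversion involves, here is a minimal Python sketch. It is not the program's implementation. It assumes the inputs are decimal or hexadecimal NCRs (such as `&#2325;` or `&#x915;`) and Java-style `\uXXXX` escapes; the exact set of forms the program accepts is not stated here. The function name `to_unicode` is hypothetical.

```python
import re

# Decimal (&#2325;) or hexadecimal (&#x915;) numeric character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")
# Java-style escapes with exactly four hex digits, e.g. \u0915.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def _ncr_to_char(match: re.Match) -> str:
    hex_digits, dec_digits = match.group(1), match.group(2)
    return chr(int(hex_digits, 16)) if hex_digits else chr(int(dec_digits))


def to_unicode(text: str) -> str:
    """Replace NCRs and \\uXXXX escapes with the characters they denote.

    Surrogate pairs written as two \\uXXXX escapes are not merged in this sketch.
    """
    text = _NCR.sub(_ncr_to_char, text)
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

With this sketch, `to_unicode("&#2325;")`, `to_unicode("&#x915;")`, and `to_unicode(r"\u0915")` all return the Devanagari letter KA (U+0915).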

The Tools menu provides a couple of convenient methods for creating images for training. The Merge TIFF function saves multiple images containing text of the same font into a single multi-page TIFF file.

The Generate TIFF/Box function generates, for a given input UTF-8 text file, a TIFF/Box pair of files suitable for training with Tesseract. The generated image is an uncompressed, multi-page TIFF at 300 DPI, either binary or 8-bpp grayscale depending on whether anti-aliasing is enabled. Letter tracking, the spacing between characters, can be adjusted with the Tracking spinner control to eliminate overlapping bounding boxes. Note that some boxes may differ slightly (by 1 or 2 pixels) from the ones Tesseract itself would generate. Nevertheless, the generated box file can be used to validate the one created by Tesseract with a Unicode-compatible file comparison tool, such as WinMerge.
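Because a plain text diff flags every 1- or 2-pixel discrepancy, a tolerance-aware comparison can help when validating one box file against another. The following Python sketch is only an illustration and is not part of jTessBoxEditor. It assumes both files list the same boxes in the same order and use the `<char> <left> <bottom> <right> <top> [<page>]` line layout. The names `read_boxes` and `compare_boxes` are hypothetical.

```python
from typing import List, Tuple

BoxEntry = Tuple[str, Tuple[int, int, int, int]]


def read_boxes(text: str) -> List[BoxEntry]:
    """Parse box file content into (char, (left, bottom, right, top)) entries."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        boxes.append((parts[0], tuple(int(v) for v in parts[1:5])))
    return boxes


def compare_boxes(a: List[BoxEntry], b: List[BoxEntry], tolerance: int = 2):
    """Return (index, entry_a, entry_b) for entries that differ in character
    or in any coordinate by more than `tolerance` pixels."""
    if len(a) != len(b):
        raise ValueError(f"box counts differ: {len(a)} vs {len(b)}")
    diffs = []
    for i, (entry_a, entry_b) in enumerate(zip(a, b)):
        (char_a, coords_a), (char_b, coords_b) = entry_a, entry_b
        too_far = any(abs(p - q) > tolerance for p, q in zip(coords_a, coords_b))
        if char_a != char_b or too_far:
            diffs.append((i, entry_a, entry_b))
    return diffs
```

To use it, read both files as UTF-8 text, pass each through `read_boxes`, and inspect the list returned by `compare_boxes`. An empty list means the files agree within the tolerance.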

[Screenshot: Generate TIFF/Box]

Combining symbols or diacritics that must be combined with the main base character, such as those found in Devanagari and other Indic scripts, can be specified by the user in a UTF-8 text file, data/combiningsymbols.txt, which is read by the Generate TIFF/Box function. This setup gives users the flexibility to define the combining symbols and diacritics for their own language scripts.
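The idea of attaching combining symbols to their base character can be sketched in Python as below. This is an assumption-laden illustration, not the program's algorithm. The layout of data/combiningsymbols.txt is not described here, so the sketch takes an already-loaded set of single-character combining symbols, and it assumes each symbol attaches to the character immediately before it. The function name `group_with_combining` is hypothetical.

```python
from typing import Iterable, List


def group_with_combining(text: str, combining: Iterable[str]) -> List[str]:
    """Split text into clusters, appending each combining symbol to the
    preceding base character so that one cluster would receive one box."""
    combining_set = set(combining)
    clusters: List[str] = []
    for ch in text:
        if ch in combining_set and clusters:
            clusters[-1] += ch    # attach to the preceding base character
        else:
            clusters.append(ch)   # start a new cluster
    return clusters
```

For example, with the vowel signs U+093F and U+093E in the set, the Devanagari word `किताब` splits into three clusters (`कि`, `ता`, `ब`) rather than five separate code points.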

If you have any questions, please post in the VietOCR Forums.