Book Digitization

Optical character recognition (OCR) is the translation of scanned images of typewritten or printed text into editable text. It is widely used to convert books and documents into electronic files or ebooks, to archive publications in digital libraries, or to publish the text on a website.

Images are usually acquired with document scanners or digital cameras. Destructive scanning involves cutting or debinding the book, whereas non-destructive scanning uses a digital camera setup to capture images of the pages quickly and efficiently without damaging the original document.

For OCR to be successful, some post-processing of the captured images may be required to correct artifacts introduced during image capture or defects in the original documents themselves. Several free or open-source programs are designed specifically for these post-processing steps. These tools automate image cropping, rotation, denoising, despeckling, dewarping, and similar operations.
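To make one of these operations concrete, here is a minimal sketch of despeckling with a 3x3 median filter, written in plain Python. It is only an illustration of the idea and not the method used by any of the tools listed here; the function name `despeckle` and the list-of-rows image representation are assumptions made for this example.

```python
from statistics import median_low


def despeckle(image):
    """Return a copy of a grayscale image with a 3x3 median filter applied.

    `image` is a list of rows, each a list of pixel values
    (0 = black, 255 = white). This representation is an assumption
    made for the sketch; real tools work on image files.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the pixel and its neighbours, clipping at the borders.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns a value that occurs in the window.
            row.append(median_low(window))
        result.append(row)
    return result


page = [
    [255, 255, 255, 255],
    [255, 0, 255, 255],  # a single dark speck on a white background
    [255, 255, 255, 255],
]
cleaned = despeckle(page)
```

An isolated speck is outvoted by its white neighbours and disappears, while larger dark regions such as letter strokes mostly survive. A filter this simple can also erode thin strokes and small punctuation, which is one reason dedicated tools offer more careful variants.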

ImageMagick: a software suite to create, edit, compose, or convert bitmap images. Use ImageMagick to resize, flip, mirror, rotate, distort, shear and transform images, adjust image colors, etc.

TextCleaner: an ImageMagick script that processes a scanned document of text to clean the text background.

Scan Tailor: an interactive post-processing tool for scanned pages. It performs operations such as page splitting, deskewing, and adding or removing borders.

unpaper: tries to enhance the quality of scanned images by removing dark edges that appear, through scanning or copying, in areas outside the actual page content (e.g., dark areas between the left-hand side and the right-hand side of a double-sided book-page scan). The program also tries to deskew a misaligned image. Input and output files must be in .pbm, .pgm, or .ppm format (collectively, .pnm). Because of this limitation, images in more popular formats, such as TIFF or PNG, generally have to be converted before the tool can be used.

Book Scan Wizard: geared towards correcting defects in images captured by camera. It provides cropping, rotation, keystone correction, and rescaling functions.
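Some of the tools above can be chained from a script. The following is a hedged sketch, assuming ImageMagick and unpaper are installed and on the PATH: it builds one ImageMagick command that converts a scan to grayscale .pgm (optionally rotating it) and one unpaper command that cleans the result. The helper names `build_commands` and `run_all`, the file names, and the `_clean` suffix are all hypothetical, and unpaper is called with its default settings only.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(scan, work_dir=".", rotate_degrees=None):
    """Build the ImageMagick and unpaper command lines for one scanned page."""
    src = Path(scan)
    pgm_in = Path(work_dir) / (src.stem + ".pgm")
    pgm_out = Path(work_dir) / (src.stem + "_clean.pgm")

    # ImageMagick 7 installs `magick`; older releases install `convert`.
    imagemagick = "magick" if shutil.which("magick") else "convert"

    # The output format is chosen from the .pgm extension of the target file.
    convert_cmd = [imagemagick, str(src), "-colorspace", "Gray"]
    if rotate_degrees is not None:
        convert_cmd += ["-rotate", str(rotate_degrees)]
    convert_cmd.append(str(pgm_in))

    # unpaper takes the input file followed by the output file.
    unpaper_cmd = ["unpaper", str(pgm_in), str(pgm_out)]
    return [convert_cmd, unpaper_cmd]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


commands = build_commands("page_007.tif", work_dir="out", rotate_degrees=90)
for command in commands:
    print(" ".join(command))
```

Calling `run_all(commands)` would execute both steps; it requires the input file and the `out` directory to exist. Building the commands separately from running them makes it easy to inspect or log exactly what will be executed before touching a large batch of pages.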

For post-OCR corrections, you will need a word processor or text editor that supports spell checking and text find/replace. For editing Vietnamese-language text documents, VietPad can be used.
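Repetitive find/replace corrections can also be scripted. The sketch below uses Python regular expressions to apply three illustrative rules; the rules and the function name `tidy_ocr_text` are assumptions chosen for the example, not a standard set, and real corrections depend on the errors a particular scan produces.

```python
import re


def tidy_ocr_text(text):
    """Apply a few simple, illustrative clean-up rules to OCR output."""
    # Rejoin words split by a hyphen at the end of a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces or tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove a stray space before common punctuation marks.
    text = re.sub(r" ([,.;:!?])", r"\1", text)
    return text


sample = "Optical character recog-\nnition  is useful , but errors remain ."
print(tidy_ocr_text(sample))
# prints: Optical character recognition is useful, but errors remain.
```

The first rule will also join genuinely hyphenated compounds that happen to break at a line end, so the result should still be proofread. In Python 3, `\w` matches Unicode letters, so the same rules apply to Vietnamese text.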
