Optical Character Recognition
Revision as of 21:43, 8 June 2013 by Karol (typos)
The OCR process has several steps, and the OCR engine itself handles only one of them:
- document layout analysis
- optical character recognition
- post-processing (formatting, PDF creation)
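As a rough illustration of how these steps can be chained, the sketch below drives the Tesseract command line from Python. It assumes the `tesseract` binary is on the PATH and that the installed release ships the `pdf` output config (older releases may not). The helper names `build_tesseract_command` and `ocr_image` are hypothetical and exist only for this example. Tesseract performs its own page layout analysis before recognition, so a single call covers the first two steps, and the `pdf` config covers PDF creation.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="eng", make_pdf=False):
    """Return the argument list for one Tesseract run (nothing is executed)."""
    # Basic CLI shape: tesseract <image> <outputbase> -l <lang> [config...]
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if make_pdf:
        # "pdf" is a config name asking for a searchable PDF instead of
        # plain text; assumed to be available in the installed release.
        cmd.append("pdf")
    return cmd


def ocr_image(image, output_base, lang="eng", make_pdf=False):
    """Run Tesseract if it is installed; return the output path or None."""
    if shutil.which("tesseract") is None:
        return None  # engine not installed, nothing to do
    cmd = build_tesseract_command(image, output_base, lang, make_pdf)
    subprocess.run(cmd, check=True)
    # Tesseract appends the extension to the output base itself.
    suffix = ".pdf" if make_pdf else ".txt"
    return Path(str(output_base) + suffix)
```

For example, `ocr_image("page.png", "page", lang="eng", make_pdf=True)` would write `page.pdf` next to the script when Tesseract and the English data files are installed, and return `None` otherwise.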
OCR (Optical Character Recognition) Engines
- CuneiForm — A command-line OCR system originally developed and open sourced by Cognitive Technologies. Supported languages: eng, ger, fra, rus, swe, spa, ita, ruseng, ukr, srp, hrv, pol, dan, por, dut, cze, rum, hun, bul, slo, lav, lit, est, tur.
- GOCR/JOCR — An OCR engine which also supports barcode recognition.
- Ocrad — An OCR program based on a feature extraction method.
- Tesseract — "Probably one of the most accurate open source OCR engines available". The package is split, so you need to install a separate data file package for each language you want to recognize.
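To see which of these engines are present on a system, a small check along the following lines may help. The executable names in `ENGINE_BINARIES` are assumptions based on the usual upstream command names and may differ between packages. The function name `installed_engines` is hypothetical.

```python
import shutil

# Assumed executable name for each engine; adjust if your packages differ.
ENGINE_BINARIES = {
    "CuneiForm": "cuneiform",
    "GOCR": "gocr",
    "Ocrad": "ocrad",
    "Tesseract": "tesseract",
}


def installed_engines(binaries=ENGINE_BINARIES):
    """Map each engine name to the path of its executable, or None if absent."""
    return {name: shutil.which(exe) for name, exe in binaries.items()}


if __name__ == "__main__":
    for name, path in installed_engines().items():
        print(f"{name}: {path or 'not found'}")
```

Front ends such as the ones listed in the next section need at least one of these engines installed to do any recognition.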
Layout analyzers and user interfaces
- OCRFeeder — Python GUI for GNOME which performs document analysis and rendition, and can use CuneiForm, GOCR, Ocrad or Tesseract as the OCR engine. It can import from PDF or image files, and export to HTML or OpenDocument.
- YAGF — A graphical interface for the CuneiForm text recognition program on Linux. Available from the community repository.
- gImageReader — A graphical GTK frontend to Tesseract
- gscan2pdf — scans, runs Tesseract and creates a PDF all in one go
- OCRopus — An OCR platform with modules for document layout analysis, OCR engines (it can use Tesseract or its own engine), natural language modeling, etc. Available from the AUR.