What is it?
BOCRA is a recursive acronym that expands to Bocra Optical Character Recognition Application. The final A is sort of forced, mainly to make the name pronounceable, but also as an insider joke (if you don't already know, don't bother asking because you wouldn't find it amusing). The initial B could also stand for Bengali since that's the primary target language that motivated the authors, but in principle it could be used for other languages as well. In practice, the approach [to be] implemented has no special benefit for languages whose characters/glyphs are separated (i.e. not connected) when printed.
Why?
OCR has been around for a while, but most available implementations are geared towards English, whereas I was interested in Bengali. One would think that the standard techniques in the field could be adapted, and this is probably true to some extent. There were two problems: first, I am not particularly familiar with the field (and don't know anyone who is), and second, the Bengali script has some idiosyncrasies which make the standard approaches somewhat non-trivial.
Let me expand on this last point. As far as I have been able to make out, the standard OCR paradigm can be summarized as segmentation followed by recognition. Specifically, one starts with an image, segments it into (presumably) lines of text, then words and then glyphs. These individual glyphs are then each recognized separately. This way, each individual problem is a simple classification problem: given a piece of an image, classify it into one of a finite set of possibilities. For printed material, this is often as simple as finding the connected pieces on the page.
Here's the catch: in Bengali and some related scripts (like Devanagari), the basic shapes that need to be recognized are connected (usually by a `headline' from which most glyphs `hang'), and are thus not easily segmented. For example:
One obvious approach would then be to develop good segmentation algorithms. Several researchers have taken this approach; unfortunately, it is difficult to judge the usefulness of such approaches because, as far as we know, no actual implementations are available for public use. We take a different approach, for the following reasons:
- It's always more fun to try something new
- Some of my own earlier experiments suggest that even though segmentation is much simplified once the headline is identified, separating those little marks that appear under `base' glyphs (হ্রস্ব উ-কার, দীর্ঘ ঊ-কার, ঋ-কার, হসন্ত, etc) can be particularly difficult
- The fact that there are no publicly available software implementations (be it commercial or free) using the segmentation approach is an indirect indication that such methods might have limitations
Our approach is entirely different. We segment an image into lines and words, but no further. Instead, we try to detect the presence (and location) of templates (previously identified during the training phase) in the word. Our main technical contribution is the trick that allows this detection. This approach has the obvious advantage of not needing segmentation. Unfortunately, it is not a panacea; it only works if the font (including size) remains the same throughout the training and actual recognition process.
Caveats
The goal of BOCRA is to be OCR software that works for Bengali and is available (as Free Software) for use. However, there are several things anyone planning to use it should be aware of:
- I'm not an OCR expert, so I'm probably doing many stupid things (I would love to hear from those who know better)
- I experimented with the segmentation approach for a while, but I wasn't satisfied with the results. This is very likely because I know so little about OCR and image processing in general, and I'm sure there are better ways of doing it. Be that as it may, I have given up on that approach, and consequently no support for such approaches is planned in BOCRA. That said, I have nothing against it, and if anyone wants to work on such support, they are welcome. However, I should mention GOCR, and more recently Tesseract and OCRopus, as possibly more mature platforms for experiments along those lines.
- BOCRA targets a very specific type of OCR problem, namely one where the input images are high quality scans of high quality printed text written in a single font in a uniform point size. This sounds very limited, but it covers what I'm interested in, which is to make public domain Bengali literature available in electronic form (which, as you can imagine, involves transcribing many, many pages of a book printed in a uniform font). If this disappoints you, I understand, but there's not much I can do to help, because the ideas implemented in BOCRA do not easily generalize to situations where these assumptions do not hold.
About this document
This document serves two purposes: it describes the status of the software implementation, and it gives an overview of its usage. This should probably be done with better tools like Doxygen and DocBook, but I'll postpone that until the project has matured a bit. For now, this will be a single HTML page. I also intend to describe the ideas behind the implementation, often in statistical terminology. I would normally have done this in LaTeX, but writing the occasional Bengali is much easier in HTML. I haven't done much of this yet.
Implementation
License
This software may be distributed under the terms of the GNU General Public License, version 2, or at your option, any later version.
Prerequisites
To work with the code, you need, in addition to a standard GCC toolchain:
- The Qt 4 opensource edition
- R
- A client for the Subversion revision control system
I don't know much C++, but this seems like as good a chance to learn as any. Qt 4 has a couple of advantages over Qt 3; it's GPL on Windows too, and some namespace changes (I think) make it easier to embed R, which will be used for some things.
I work on Linux, although other platforms (Windows, Mac OS X) should work too (but I don't know the caveats, if any). To fix ideas, let's standardize (for the moment at least) on Qt 4.1.0 and R 2.2.1. R needs to be compiled with the --enable-R-shlib flag during configuration. This is not the default, so make sure you do this if you compile R yourself. If you use a vendor supplied binary, this may already be enabled; check for the existence of libR.so on your system. If not, you will need to recompile from source. Don't hesitate to ask if you have problems.
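If you are unsure, a quick way to check from within an R session (just a convenience, not part of BOCRA):

## TRUE if this R was built as a shared library (i.e. libR.so exists)
file.exists(file.path(R.home(), "lib", "libR.so"))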
For Qt, the default configuration should be fine. Once you have installed a subversion client (e.g. the subversion package on Debian), ask me for the archive address (I won't post it here for certain reasons).
Status
The software is moderately functional. Some constants are currently hard-coded that shouldn't be. The GUI can be used to add to the collection of templates (i.e. training), but it cannot be used to remove templates or even to display the current collection. This is not a priority for me, because both can easily be done in R, but of course in the long run the use of R should be completely hidden from the user. Note that it won't actually be hard to do this. Some other things can be improved as well (e.g. identification of the headline, especially for punctuation marks, where use must be made of neighbouring words). There are some definite bugs, since the program sometimes segfaults. Other than that, the software mostly works.
To enable fast prototyping, and to avoid premature optimization and/or reinvention of wheels, especially with regard to serialization (i.e. saving information persistently across sessions) and post-processing of text (transforming orthographic order into textual order), I have tentatively decided to do much of the work in R. The application starts (an embedded instance of) R in a separate thread, which is used liberally at various stages.
The Typical Workflow
A typical project will start with a collection of several grayscale images with a common font, scan resolution, etc. I usually normalize them by something like
convert -normalize foo.png foo-norm.png

One then starts by launching the software and loading the first image. Any skew correction, if required, has to be done beforehand (I had some tools to calculate this automatically; I can try to resurrect them if there is interest). After loading, the first step is to segment the image into lines and words, e.g. by pressing CTRL+I. Pressing CTRL+T then toggles the display between the original grayscale and the thresholded monochrome version used for the segmentation. The image can be zoomed as well. Here are some screenshots using this toy image:
Grayscale image

Segmented image

If sufficient training has already been done, one can then proceed to OCR the whole page by pressing CTRL+R. For a fresh project, this training has to be done first. Training is done in two stages. First, look for a word that has a glyph you know needs to be added (where by glyph we mean any subpart that is expected to recur in the image). Then, double click on that word with the left mouse button. This should bring up a pop-up window with a contour of that word. This contour is computed using R, which is run in the background in a separate thread. The goal is to select a subset of this contour that represents a coherent template. This can consist of multiple disconnected segments. A single click near a point while holding the SHIFT key down removes the nearest point from the contour. Double clicking near a point selects (or deselects) the point and all points belonging to the same contour segment. Multiple contour segments, when selected together, define a template. For example:
Template selection

Note the horizontal red line. This is important, and indicates the (automatically detected) location of the top of the headline. If this is wrong, that word should not be used. The current detection algorithm needs to be improved (which is easy in principle, it just needs a bit of work).
A selected template can be added to the template collection by clicking the right mouse button somewhere on the pop-up window while holding down the SHIFT key, and entering a romanised symbol for the glyph in the dialog that pops up. The rules of naming are important to ensure proper conversion (later) into Unicode, and are explained below. Note that a vertical red line appears at the location of the mouse click. This defines the origin for the glyph, which is used to order the glyph relative to other glyphs in words in which it is detected.
Template addition

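To make the role of the origin concrete, here is a toy illustration in R (hypothetical data, not actual BOCRA output): sorting the glyphs detected in a word by the x-coordinates of their origins yields the visual (orthographic) order of the word.

## Hypothetical example: glyphs detected in a word, each tagged with
## the x-coordinate of its origin; sorting by origin recovers the
## visual (orthographic) order.
detected <- data.frame(glyph = c("s", "_e_"), origin = c(9, 2),
                       stringsAsFactors = FALSE)
detected$glyph[order(detected$origin)]   # "_e_" "s"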
The template collection thus built up persists across sessions, and is actually stored in an R data file. Currently, this file is called "templates.rda" in the current working directory, though in future the path should be made customizable. Here is an example of one such file, consisting of the following templates gleaned from several pages of a story by Upendrakishore Raychaudhuri:
A template collection

Using this template collection, we can perform OCR on the toy image we have been using for our screenshots. In principle, the OCR step only detects the presence and position of the template glyphs in each word; some further post-processing needs to be done to deal with the idiosyncratic orthography of the Bengali script. All this is done, producing the following output:
Api karala nA gupike Ara bAjAbAra jAghagA kothAYa habe ? mAjhakhAne ese pa.Dala tArapara

This javascript converter is designed to convert this romanised text into Unicode Bengali. Using it, we get:

আপি করল না গুপিকে আর বাজাবার জাঘগা কোথায় হবে ? মাঝখানে এসে পড়ল তারপর

A fuller sample, consisting of 11 pages of text, is given here.
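As an illustration of the kind of post-processing involved (a toy sketch using the notation explained below; the function is hypothetical, not BOCRA's actual code): pre-base vowel signs such as _e_ appear before the consonant in visual order but after it in textual (Unicode) order, so the detected glyph sequence must be reordered.

## Toy sketch: move pre-base vowel signs after the glyph they precede,
## turning visual (orthographic) order into textual (Unicode) order.
reorder_glyphs <- function(glyphs, prebase = c("_i_", "_e_", "_E_")) {
  i <- 1
  while (i < length(glyphs)) {
    if (glyphs[i] %in% prebase) {
      glyphs[c(i, i + 1)] <- glyphs[c(i + 1, i)]   # swap with next glyph
      i <- i + 2
    } else i <- i + 1
  }
  glyphs
}
reorder_glyphs(c("_e_", "s"))   # "s" "_e_", i.e. সে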
Design
Each .h/.cpp pair of files represents a class. This is work in progress, so just look in the source code for now (it's not very complicated). Some abstract discussions follow.
Classes
At least the following classes need to exist:
The main application window
This should contain the interface that has access to all the components.
Image
This class would represent a raw image. It could be rotated, smoothed, etc., though initially we will do those things externally. Mostly, it needs methods to display itself and to segment itself into lines and words. These words then need to be converted to their boundary contours (using R). I'm not sure whether to do it all at the beginning or only when necessary; this would depend on memory usage. R should have as few things in memory as possible at any time, but I'm not sure how well Qt would handle things.
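As an indication of what the line segmentation involves, here is a minimal sketch in R (a toy version under simplifying assumptions, not the actual implementation): in a thresholded image, rows containing no ink separate consecutive lines of text, and the same idea applied to columns within a line yields words.

## Toy line segmentation via the horizontal projection profile.
## 'img' is a 0/1 matrix with 1 = ink; maximal runs of rows
## containing ink are reported as text lines.
find_lines <- function(img) {
  ink <- rowSums(img) > 0
  r <- rle(ink)
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  data.frame(top = starts[r$values], bottom = ends[r$values])
}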
Project
Serializable representation of a project (usually representing a particular book) that corresponds to a particular combination of font, size and scanning parameters. All projects have to be trained, which essentially involves building up a collection of glyphs: templates with information on what they really are. This information has to be retained across sessions, since otherwise we would need to do the tedious task of training all over again. I currently achieve this by piggybacking on R's serialization mechanism.
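Concretely, the piggybacking amounts to little more than the following (with templates standing for whatever R object holds the collection):

## Persist the template collection across sessions using R's
## built-in serialization:
save(templates, file = "templates.rda")   # after training
load("templates.rda")                     # on startup; restores 'templates'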
Contour
A useful form of what the R function contourLines returns. This is essentially an ordered sequence of points in the 2-D Euclidean plane, with some form of NAs to indicate breaks. This will be the basic representation of words, as well as templates. In the latter case, the templates, with some meta information, will be a critical component of a project object.
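As a sketch of the intended representation (not the actual class), the list returned by contourLines() can be flattened into a two-column matrix of points, with NA rows marking the breaks:

## Flatten contourLines() output: one (x, y) matrix per word, with
## NA rows separating the individual contour segments.
as_contour <- function(cl) {
  do.call(rbind, lapply(cl, function(s) rbind(cbind(s$x, s$y), c(NA, NA))))
}
cl <- contourLines(volcano, levels = 120)   # toy example
xy <- as_contour(cl)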
Contour editor
A GUI to manipulate a contour object. This is essential when choosing templates, which involves choosing words, choosing subsets of contours, and saving them along with meta information. The current interface works as follows: contour points are shown. SHIFT+click on a point makes it NA, breaking the contour. Double clicking on a point selects or deselects the contour segment that point belongs to. One or more selected contours can together be saved as a template.
The rendering is done by a paintEvent() method. The interaction changes the underlying contour object, and the paintEvent() method just renders the latest one.
Actions
The main user-controlled action is training: select a word by double clicking on it, select subsequences of a contour, add to list of templates.
The interesting part is recognition: using templates to recognize words. To speed things up, we use deduced information about the vertical location of the headline (which may need processing at least a line at a time). There is some code to identify the headline which mostly works, but it will need some further work. Currently, the headline (actually the top of the headline) is identified one word at a time, so something is attempted even for `daanri's, question marks, etc. Even for proper words, sometimes the bottom of the headline is identified instead of the top.
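For concreteness, here is a simplified sketch of a headline detector in R (my own toy version, not BOCRA's actual code): in a thresholded word, the headline shows up as a sharp peak in the horizontal projection profile, and its top is found by walking upwards from that peak. Its one-word-at-a-time view mirrors the limitation described above.

## Toy headline detector.  'word' is a 0/1 matrix with 1 = ink; the
## headline is taken to be the row with the most ink, and we walk
## upwards while the ink count stays close to the peak value.
find_headline_top <- function(word) {
  profile <- rowSums(word)
  peak <- which.max(profile)
  top <- peak
  while (top > 1 && profile[top - 1] > 0.8 * profile[peak])
    top <- top - 1
  top   # row index of the top of the headline
}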
Thoughts
The template editor GUI should allow for choosing specific thresholds for specific glyphs.
Notation
Table of conventions for romanising Bengali characters (currently inaccurate):
Bengali Codepoint | User map | Internal Map | Comments |
অ, অ-কার | a | a | |
আ, ◌া | A, _A_ | A | |
ই, ি◌ | i, _i_ | i | |
ঈ, ◌ী | I, _I_ | I | |
উ, ◌ু | u, _u_ | u | |
ঊ, ◌ূ | U, _U_ | U | |
ঋ, ◌ৃ | Ri, _Ri_ ? | R | |
এ, ে◌ | e, _e_ | e | |
ঐ, ৈ◌ | E, _E_ | E | |
ও, ে◌া | o, NA | o | |
ঔ, ে◌ৗ | O, _O_ | O | _O_ should indicate right half of the split vowel sign only |
Bengali Codepoint | User map | Internal Map | Comments |
ক | k | k | |
খ | kh | kh | |
গ | g | g | |
ঘ | gh | gh | |
ঙ | G | G | |
চ | ch | c | |
ছ | chh | ch | |
জ | j | j | |
ঝ | jh | jh | |
ঞ | J | J | |
ট | T | T | |
ঠ | Th | Th | |
ড | D | D | |
ঢ | Dh | Dh | |
ণ | N | N | |
ত | t | t | |
থ | th | th | |
দ | d | d | |
ধ | dh | dh | |
ন | n | n | |
প | p | p | |
ফ | ph | ph | |
ব | b, v | b | |
ভ | bh | bh | |
ম | m | m | |
য | y | y | |
র | r | r | |
ল | l | l | |
শ | sh | sh | |
ষ | Sh, S | S | |
স | s | s | |
হ | h | h | |
ড় | .D | X | |
ঢ় | .Dh | Z | |
য় | Y, .y | Y | |
০-৯ | 0-9 | 0-9 | not implemented yet |
Bengali Codepoint | User map | Internal Map | Comments |
◌ং | .n, M | M | |
◌ঁ | _CBINDU_ | C | |
◌ঃ | :, H | H | |
। | | | | | daanri |
্ | # | # | explicit hasanta |
Test cases
For testing purposes, 11 pages of scanned images are available here. The results obtained from them are described here.
Theory
This will be written later. The basic idea is actually pretty simple; the pain is mostly in getting the details right. To even get that far, we need to write a lot of software first, so I'll focus on that for now.
Development
Sources
A development snapshot of the sources that mostly works is available here. The subversion sources can be checked out from
https://svn.sourceforge.net/svnroot/bocra

For example, to check out the trunk (where development takes place), you might use:

mkdir bocra
cd bocra
svn checkout https://svn.sourceforge.net/svnroot/bocra/bocra/trunk/
Mailing List
A mailing list is available to discuss BOCRA. This may be split into separate lists (for users and developers) in the future if there's enough interest. The SF.net project page, of course, is http://www.sf.net/projects/bocra/, where developers can join the project.
Frequently Asked Questions
Did anyone really ever ask these questions?
Probably not. These are more like questions that I anticipate might have been asked frequently (had they not already been answered here) if and when the project ever became popular enough (one can always hope!).
Why are they called Frequently Asked Questions then?
Don't ask me! Everybody does it.
Why does the website look weird in Internet Explorer?
Think of it as a favor. The fact that you are still using an obsolete browser when better alternatives exist is a very good indicator that you will not find the contents of this website very interesting. If you are discouraged by the bad rendering and decide to go away, that saves both you and me some time.
Will BOCRA work for handwritten text?
Not likely, unless the writer is an expert calligrapher with at least seven years of rigorous training.
Can BOCRA separate regions of text from images?
Not currently. In fact, it is even confused by an image containing more than one column of text. To use BOCRA now, you will have to do all preprocessing (i.e. skew correction, cropping, etc.) with external software. Hopefully, some day BOCRA will be able to do these things automatically (and perhaps even have a scanner interface), but that's not likely to happen soon.