<HTML><HEAD><TITLE>Clara Book</TITLE></HEAD>
<BODY BGCOLOR=#D0D0D0>
<TABLE WIDTH=100% BORDER=1 BGCOLOR=#E2D3FC><TR><TD><CENTER><H1><BR>Clara OCR Advanced User's Manual<BR></H1></CENTER></TD></TR></TABLE>
<P>
<CENTER>
[<A href=index.html>Main</A>]
[<A href=clara-faq.html>FAQ</A>]
[<A href=clara-tut.html>Tutorial</A>]
[<A href=clara-adv.html>User's Manual</A>]
[<A href=clara-dev.html>Developer's Guide</A>]
</CENTER>

<P>
Welcome. Clara OCR is a free OCR program, written for systems that
support the C library and the X Window System. Clara OCR is intended
for the cooperative OCR of books. Some screenshots are
available at <A HREF=http://www.claraocr.org/>http://www.claraocr.org/</A>.

<P>
This documentation is extracted automatically from the comments
of the Clara OCR source code. It is known as "The Clara OCR
Advanced User's Manual". It's currently unfinished. First-time
users are invited to read "The Clara OCR Tutorial". Developers
must read "The Clara OCR Developer's Guide".

<P>

<P>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#79BEC6><FONT SIZE=+1><B> CONTENTS</B></FONT></TD></TR></TABLE>
<UL>
<P>
<LI> <A HREF=#1.>1. Welcome to Clara OCR</A>
<UL>
<P>
<LI> <A HREF=#1.1>    1.1 Early historical notes</A>
<LI> <A HREF=#1.2>    1.2 Design notes</A>
<LI> <A HREF=#1.3>    1.3 Supported Alphabets</A>
<LI> <A HREF=#1.4>    1.4 Clara vs the others</A>
<LI> <A HREF=#1.5>    1.5 The requirements</A>
<LI> <A HREF=#1.6>    1.6 How to download and compile Clara</A>
<LI> <A HREF=#1.7>    1.7 Compilation and startup pitfalls</A>
<P>
</UL>
<LI> <A HREF=#2.>2. A first OCR project</A>
<UL>
<P>
<LI> <A HREF=#2.1>    2.1 Scanning and thresholding</A>
<LI> <A HREF=#2.2>    2.2 Avoiding skew</A>
<LI> <A HREF=#2.3>    2.3 The work directory</A>
<LI> <A HREF=#2.4>    2.4 The pattern types</A>
<LI> <A HREF=#2.5>    2.5 Skeleton tuning</A>
<LI> <A HREF=#2.6>    2.6 Classification tentatives</A>
<LI> <A HREF=#2.7>    2.7 Alignment tuning</A>
<P>
</UL>
<LI> <A HREF=#3.>3. Complex procedures</A>
<UL>
<P>
<LI> <A HREF=#3.1>    3.1 Using two directories</A>
<LI> <A HREF=#3.2>    3.2 Adding a page (to be written)</A>
<LI> <A HREF=#3.3>    3.3 Multiple books</A>
<LI> <A HREF=#3.4>    3.4 Adding a book (to be written)</A>
<LI> <A HREF=#3.5>    3.5 Removing a page</A>
<LI> <A HREF=#3.6>    3.6 Building the bookfont</A>
<LI> <A HREF=#3.7>    3.7 Dealing with classification errors</A>
<LI> <A HREF=#3.8>    3.8 Rebuilding session files (to be written)</A>
<LI> <A HREF=#3.9>    3.9 Importing revision data</A>
<LI> <A HREF=#3.10>    3.10 How to use the web interface</A>
<LI> <A HREF=#3.11>    3.11 Revision acts maintenance</A>
<LI> <A HREF=#3.12>    3.12 Analysing the statistics</A>
<LI> <A HREF=#3.13>    3.13 Upgrading Clara OCR (to be written)</A>
<P>
</UL>
<LI> <A HREF=#4.>4. Reference of the Clara GUI</A>
<UL>
<P>
<LI> <A HREF=#4.1>    4.1 The application window</A>
<LI> <A HREF=#4.2>    4.2 Tabs and windows</A>
<LI> <A HREF=#4.3>    4.3 The Application Buttons</A>
<LI> <A HREF=#4.4>    4.4 The Alphabet Map</A>
<P>
</UL>
<LI> <A HREF=#5.>5. Reference of the menus</A>
<UL>
<P>
<LI> <A HREF=#5.1>    5.1 File menu</A>
<LI> <A HREF=#5.2>    5.2 Edit menu</A>
<LI> <A HREF=#5.3>    5.3 View menu</A>
<LI> <A HREF=#5.4>    5.4 Alphabets menu</A>
<LI> <A HREF=#5.5>    5.5 Options menu</A>
<LI> <A HREF=#5.6>    5.6 Window options menu</A>
<LI> <A HREF=#5.7>    5.7 OCR steps menu</A>
<P>
</UL>
<LI> <A HREF=#6.>6. Reference of command-line switches</A>
<UL>
<P>
</UL>
<LI> <A HREF=#7.>7. AVAILABILITY</A>
<UL>
<P>
</UL>
<LI> <A HREF=#8.>8. CREDITS</A>
<UL>
</UL>
</UL>
<A NAME=1.>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#79BEC6><FONT SIZE=+1><B>1. Welcome to Clara OCR</B></FONT></TD></TR></TABLE>
<P>
Clara is optical character recognition (OCR) software, that
is, software that tries to identify the graphic images of the
characters on a scanned document, converting them from their
images to ASCII, ISO or other codes.

<P>
The name Clara stands for "Cooperative Lightweight chAracter
Recognizer".

<P>
Clara offers a revision interface that may be used from a
standalone GUI or through the WWW by several reviewers
simultaneously. Because of this feature it's a "cooperative" OCR
(it's also "cooperative" in the sense of its free/open status and
development model).

<P>

<P>

<P>

<P>

<P>
<A NAME=1.1>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>1.1 Early historical notes</B></FONT></TD></TR></TABLE>
<P>
For some years now we have tested and used OCR programs, mainly
for old books. Popular OCR programs (those bundled with
scanners) are useful tools. However, OCR is not a simple
task. The results obtained using those programs vary widely
depending on the printed document, and, for most texts we're
interested in, the results are really poor or even unusable. In
fact, it's no surprise that many digitization projects
prefer not to use OCR, but typists only.

<P>
For a programmer, it is somewhat intuitive that OCR could achieve
good results even from low quality texts when an ad hoc
approach is used, focusing on one specific book (for
instance). Within this approach, OCR becomes a matter of finding
software adequate for the texts you're trying to OCR, or
perhaps developing new software. So a free OCR that is easy to
customize (at the source code level) would be a valuable resource
for text digitization projects.

<P>
Dealing with graphics is not among our main occupations, but
after analysing many scanned materials, we began to write some
simple and specialized recognition tools. More recently (in the
third quarter of 1999) a simple X interface linked to a naive
bitmap comparison heuristic was written. From that prototype,
Clara OCR evolved. Since then, many new ideas from various
people have helped to make it better.

<P>

<P>
<A NAME=1.2>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>1.2 Design notes</B></FONT></TD></TR></TABLE>
<P>
It's not a bad idea to enumerate some principles that have driven
Clara OCR development. They'll make it easier to understand the
features and limitations of the software (these principles may
change over time).

<P>
1. Clara is an OCR for printed texts, not for handwritten
texts.

<P>
2. Clara was not designed to OCR one or two single
pages, but to OCR a large number of documents with the same
graphic characteristics (font, size, etc). So it can take
advantage of a fine (and perhaps expensive) training. This will
typically be the case when OCRing an entire book.

<P>
3. We chose not to support multiple graphic formats directly, but
only Jef Poskanzer's raw PBM. Non-PBM files will be read through
filters.

<P>
4. Clara OCR wants to be a tool that makes it viable to accumulate
and reuse human revision effort. Because of this, in the OCR model
implemented by Clara, training and revision are one and the same
thing. The revision is a sum of isolated and independent acts and
alternates with reprocessing steps along a refinement process.

<P>
5. The Clara GUI was implemented and behaves like a minimalistic
HTML viewer. This is just an easy and standard way to implement a
forms interface.

<P>
6. We have tried to make the source code portable across
platforms that support the C library and Xlib. Clara has no
special provision to be ported to environments that do not
support Xlib. We avoided using a higher-level graphic
environment like Motif, GTK or Qt, but we do not discourage
initiatives to add code that adapts Clara OCR, or adapts it
better, to these or other graphic environments.

<P>
7. We generally try to make the code efficient in terms of RAM
usage. CPU and disk usage (for session files) have lower priority.

<P>

<P>
<A NAME=1.3>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>1.3 Supported Alphabets</B></FONT></TD></TR></TABLE>
<P>
Clara OCR focuses on the Latin alphabet ("a", "b", "c", ...),
used by most European languages, and the Indo-Arabic digits
("0", "1", "2", ...), but we're trying to support as many
alphabets as possible.

<P>
To say that Clara OCR supports a given alphabet means that
Clara OCR

<P>
(a) is able to be trained from the keyboard for the symbols of
that alphabet, possibly applying some transliteration from that
alphabet to Latin. For instance, when OCRing a Greek text, if the
user presses the Latin "a" key (assuming that the keyboard has
Latin labels), Clara is expected to train the current symbol as
"alpha";

<P>
(b) knows the vertical alignment of each letter of that alphabet,
for instance, knows that the bottom of an "e" is aligned at the
baseline;

<P>
(c) knows which letters accept or require which signs (accents
and others, like the dot found on "i" and "j");

<P>
(d) contains code to help avoid common mistakes, like
recognizing "e" as "c", "l" as "1", etc.
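
<P>
As an illustration of item (a), keyboard training with
transliteration can be pictured as a lookup from the pressed Latin
key to the symbol name in the target alphabet. The sketch below is
hypothetical: the table LATIN_TO_GREEK and the function train_key
are invented for this example and are not taken from the Clara OCR
sources.

```python
# Hypothetical key-to-symbol table; only a few letters are listed.
LATIN_TO_GREEK = {"a": "alpha", "b": "beta", "g": "gamma", "d": "delta"}


def train_key(key, alphabet="latin"):
    """Return the transliteration recorded for a pressed key."""
    if alphabet == "greek":
        # Unmapped keys are returned unchanged in this sketch.
        return LATIN_TO_GREEK.get(key, key)
    return key


print(train_key("a", alphabet="greek"))
```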

<P>
To say that Clara OCR supports a given alphabet does not
necessarily mean that Clara OCR

<P>
(a) knows some particular encoding (ISO-8859-X, Unicode, etc)
for that alphabet;

<P>
(b) contains or is able to use fonts for that alphabet to
display the OCR output on the PAGE (OUTPUT) window.

<P>
Even without knowing the standard encodings for a given
alphabet (e.g. ISO-8859-7 for Greek), Clara will eventually
be able to produce output using TeX macros, like
{\Alpha}.

<P>

<P>
<A NAME=1.4>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>1.4 Clara vs the others</B></FONT></TD></TR></TABLE>
<P>
Clara differs from other OCR programs in various aspects:

<P>
1. Most known OCRs are non-free and Clara is free. Clara focuses on
the X Window System. Clara offers batch processing, a web
interface and supports cooperative revision effort.

<P>
2. Most OCR programs focus on omnifont technology, disregarding
training. Clara does not implement omnifont techniques and
concentrates on building specialized fonts (some day in the
future, however, maybe we'll try classification techniques that
do not require training).

<P>
3. Most OCR programs make the revision of the recognized text a
process totally separate from the recognition. Clara
pragmatically joins the two processes, and makes training and
revision one and the same thing. In fact, the OCR model implemented by
Clara is an interactive effort where the usage of the heuristics
alternates with revision and visual fine-tuning of the OCR,
guided by the user's experience and feeling.

<P>
4. Clara lets the user enter the transliteration of each pattern
using an interface that displays a graphic cursor directly over
the image of the scanned page, and builds and maintains a mapping
between graphic symbols and their transliterations in the OCR
output. This is a potentially useful mechanism for documentation
systems, and a valuable tool for typists and reviewers. In fact,
Clara OCR may be seen as a productivity tool for typists, instead
of a typical OCR.

<P>
5. Most OCR programs are integrated with scanning tools, offering
the user a unified interface to execute all steps from
scanning to recognition. Clara does not offer such an integrated
interface, so you need separate software (e.g. SANE) to
perform scanning.

<P>
6. Most OCR programs expect the input to be a graphic file
encoded in TIFF or other formats. Clara supports only raw PBM.

<P>

<P>

<P>
<A NAME=1.5>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>1.5 The requirements</B></FONT></TD></TR></TABLE>
<P>
Clara OCR will run on a PC (386, 486 or Pentium) with GNU/Linux
and the X Window System. Clara OCR will hopefully compile and run
on a PC with any Unix-like operating system and X. Currently Clara
OCR won't run on big-endian CPUs (e.g. Sparc) nor on systems
lacking X support (e.g. MS-Windows). Higher-level
libraries like Motif, GTK or Qt are not required.

<P>
A relatively fast CPU is recommended (300MHz or more). Clara OCR
won't require large amounts of RAM for processing a few
documents (typically around 5 megabytes of virtual memory), but
the memory requirements for a large collection of documents may
be huge (40 megabytes or more). Normal operation will create
session files on your hard disk, so some megabytes of free disk
space are required (a large project may require many
gigabytes). Clara OCR can read and write gzipped files (see the
-z command-line switch).

<P>
If you need to build the executable and/or the documentation,
then an ANSI C compiler (with some GNU extensions) and a Perl
interpreter (version 5) are required.

<P>

<P>
<A NAME=1.6>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>1.6 How to download and compile Clara</B></FONT></TD></TR></TABLE>
<P>
For those who need to download and compile the source code
(hopefully this will be unnecessary for most users as soon as
Clara binary distributions become available), it may be
downloaded from <A HREF=http://www.claraocr.org/>http://www.claraocr.org/</A>. It's a
compressed tar archive with a name like clara-x.y.tar.gz (x.y is
the version number).

<P>
The compilation will generally require no more than issuing the
following commands at the shell prompt:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ gunzip clara-x.y.tar.gz
    $ tar xvf clara-x.y.tar
    $ cd clara-x.y
    $ make
    $ make doc</PRE>
</TD></TR></TABLE></CENTER>
Now you can copy the executable (the file "clara") to some
directory of binaries (like /usr/local/bin), and the man page
(file "clara.1") to some directory of man pages (like
/usr/local/man/man1). For now there is no "make install" to
perform these copies automatically.

<P>
If some of these steps fail, please try to obtain assistance from
your local experts. They will solve most simple problems
concerning wrong paths or compiler options. You can also read the
subsection "Compilation and startup pitfalls".

<P>

<P>
<A NAME=1.7>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>1.7 Compilation and startup pitfalls</B></FONT></TD></TR></TABLE>
<P>
This subsection is intended to help people that are experiencing
fatal errors when building the executable or when starting
it. After each error message we'll point out some hints.

<P>
Bear in mind that most hints given below cover very elementary
aspects of Unix-like systems. If you have problems, try to read
all hints, because details explained once are not repeated. If you
cannot understand them, please try to ask your local experts, or
try to read an introductory book on Unix. Please don't
email questions like these to the Clara developers, except when
the hint suggests it.

<P>
1. Path-related pitfalls

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ make
    bash: make: command not found</PRE>
</TD></TR></TABLE></CENTER>
The shell could not find the "make" utility. Maybe there is no
such utility installed on your system, or maybe the path to it is
unknown to the shell. You can try to find the "make" utility with
a command like

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ find /usr -name make -print</PRE>
</TD></TR></TABLE></CENTER>
The following command will display the current path:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ echo $PATH</PRE>
</TD></TR></TABLE></CENTER>
Remember that on Unix-like systems the environment is
per-process. So if you change the PATH variable at the shell
prompt within an xterm, this won't affect the other running
shells (in the other xterms). Remember also that Unix shells
expect to be told explicitly which variables must be
exported to subprocesses (use "export" in Bourne-like shells and
"setenv" in C-like shells).

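<P>
The per-process nature of the environment can be observed with a
short experiment. The sketch below is only an illustration (the
function name and the value /tmp/only-in-child are made up): it
starts a child process that rewrites its own PATH and shows that
the parent's PATH is left untouched.

```python
import os
import subprocess
import sys

# Code run by the child: change PATH, then print the child's own value.
CHILD = (
    "import os; os.environ['PATH'] = '/tmp/only-in-child'; "
    "print(os.environ['PATH'])"
)


def path_in_child_and_parent():
    """Run a child that rewrites PATH and report both views of the variable."""
    before = os.environ.get("PATH", "")
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True, text=True, check=True,
    )
    child_path = result.stdout.strip()
    after = os.environ.get("PATH", "")
    return child_path, before, after


if __name__ == "__main__":
    child_path, before, after = path_in_child_and_parent()
    print("child sees: ", child_path)
    print("parent kept:", before == after)
```

<P>
The same holds between two shells running in different xterms:
each one has its own copy of the variables.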
<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ make
    gcc -I/usr/X11R6/include -g   -c gui.c -o gui.o
    make: gcc: Command not found
    make: *** [gui.o] Error 127</PRE>
</TD></TR></TABLE></CENTER>
The make utility could not find the gcc compiler. Check if gcc is
installed. If not, check if some other C compiler is installed
(for instance, "cc"), and edit the makefile to change the value of
the CC variable.

<P>
If you don't know what I'm talking about, take a look at the
directory where the Clara sources are, and you'll see there
a file named "makefile". This file contains the names of the
tools to be used and rules to build the Clara executable. It
also contains important paths, like those where the system
headers (.h files) and libraries can be found. If the names or
the paths don't reflect those on your system, you need to edit
the makefile accordingly.

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ make
    gcc -I/usr/X11R6/include -g   -c gui.c -o gui.o
    In file included from gui.c:16:
    gui.h:12: X11/Xlib.h: No such file or directory
    make: *** [gui.o] Error 1</PRE>
</TD></TR></TABLE></CENTER>
The compiler could not find the header Xlib.h. Maybe your system
does not include this header, or maybe it is in another directory
not listed in the makefile through the INCLUDE variable.

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ make
    gcc -o clara clara.o skel.o gui.o mc.o ...
    /usr/bin/ld: cannot open -lX11: No such file or directory
    make: *** [clara] Error 1</PRE>
</TD></TR></TABLE></CENTER>
The linker could not find the X11 library. Maybe your system does
not include this library, or maybe it is in another directory not
listed in the makefile through the LIBPATH variable.

<P>
2. Compilation pitfalls

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ make
    gcc -I/usr/X11R6/include -g   -c clara.c -o clara.o
    clara.c:70: parse error before `int'
    make: *** [clara.o] Error 1</PRE>
</TD></TR></TABLE></CENTER>
A syntax error on line 70 of the file clara.c. Double-check
that the sources were not changed. Try to obtain the sources
again. If you're a programmer, try to fix the problem. In any
case, report it to claraocr@claraocr.org.

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ make
    clara.c: In function `process_cl':
    clara.c:2293: `ZPS' undeclared (first use in this function)
    clara.c:2293: (Each undeclared identifier is reported only once
    clara.c:2293: for each function it appears in.)
    make: *** [clara.o] Error 1</PRE>
</TD></TR></TABLE></CENTER>
A reference to an undeclared variable. Double-check that the
sources were not changed. Try to obtain the sources again. If
you're a programmer, try to fix the problem. In any case, report
it to claraocr@claraocr.org.

<P>

<P>
3. Runtime pitfalls

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ clara &
    [1] 1924
    bash: clara: command not found</PRE>
</TD></TR></TABLE></CENTER>
The Clara executable does not exist or is not on the path. Most
Unix systems don't include the current directory ("./") on the
path, so if you're trying to start Clara from the directory where
it was compiled, specify the current directory ("./clara").

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ ./clara &
    [1] 1922
    _X11TransSocketUNIXConnect: Can't connect: errno = 111
    cannot connect to X server</PRE>
</TD></TR></TABLE></CENTER>
Clara could not connect to the X server. The X Window System is a
client-server system. The applications (xterm, xclock, etc)
connect to a display server (the X server). If the server is not
running, clients cannot connect to it. In some cases, the client
must be told explicitly which server to connect to, using the
environment variable DISPLAY.
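
<P>
A display name has the general form [host]:display[.screen], for
instance ":0" or "localhost:10.0". The sketch below (the function
parse_display is our own helper, not part of Clara OCR or Xlib)
splits such a name, which can help to check what a client will
try to connect to.

```python
import os


def parse_display(value):
    """Split an X display name of the form [host]:display[.screen]."""
    host, sep, rest = value.rpartition(":")
    if not sep or not rest:
        raise ValueError("not a display name: %r" % value)
    number, _, screen = rest.partition(".")
    # The screen part is optional; 0 is assumed when it is missing.
    return host, int(number), int(screen) if screen else 0


if __name__ == "__main__":
    value = os.environ.get("DISPLAY")
    if value is None:
        print("DISPLAY is not set")
    else:
        print(parse_display(value))
```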

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ ./clara
    Segmentation fault (core dumped)</PRE>
</TD></TR></TABLE></CENTER>
If you can reproduce the problem, report it
to claraocr@claraocr.org. If you're a programmer and Clara was
compiled with the -g option, try a debugger to locate the point
of the source code where the segmentation fault happened. Using
gdb, it's quite easy:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ gdb clara
    (gdb) run</PRE>
</TD></TR></TABLE></CENTER>
Now try to reproduce the steps that led to the segmentation
fault.

<P>

<P>

<P>
<A NAME=2.>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#79BEC6><FONT SIZE=+1><B>2. A first OCR project</B></FONT></TD></TR></TABLE>
<P>
Clara OCR is intended to OCR a relatively large collection of pages at
once, typically a book. So we will refer to the material that we are
OCRing as "the book".

<P>
Let's describe a small but real project as an example of how to use
Clara OCR to OCR one "book". This section is in fact an in-depth
tutorial on using Clara OCR. In order to try all techniques explained
throughout this section, please download and uncompress the file referred
to as "page 143" of Manuel Bernardes Branco's Dictionary (Lisbon, 1879),
available at <A HREF=http://www.claraocr.org>http://www.claraocr.org</A>. It's a tarball containing the
two text columns (one per file) of that page.

<P>
Just to make things easier, we will assume that the files
143-l.pgm and 143-r.pgm were downloaded to the directory
/home/clara/books/BC/pgm/. We will also assume that the programs
"clara" and "selthresh.pl" are on the PATH. Some programs
that handle PBM files (pgmtopbm, pnmrotate and others, by
Jef Poskanzer) are also required. These programs are easy to
find, and are included in most free operating
systems.

<P>

<P>
<A NAME=2.1>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>2.1 Scanning and thresholding</B></FONT></TD></TR></TABLE>
<P>
Clara OCR cannot scan paper documents by itself. Scanning must be
performed by other software. The Clara OCR development effort
uses SANE (<A HREF=http://www.mostang.com/sane>http://www.mostang.com/sane</A>) to produce 600 or 300 dpi
images. The Clara OCR heuristics are tuned to 600 dpi.

<P>
Scanners offer three scanning modes: "lineart" (black-and-white),
"grayscale" and "color". Clara OCR requires PBM input. By
definition, PBM is a "bitmap" or, in other words, each pixel can
only assume the values "white" or "black". A good choice for
scanning new, high quality printed materials is lineart mode, 600
dpi, PBM format. If your scanning program does not support the PBM
format, save the images in TIFF format and convert to PBM using
the tifftopnm command. If for some reason the TIFF format cannot
be used, choose any other format that preserves all data (avoid
lossy "compressing" formats like JPEG) and for which a conversion tool
is available to produce PBMs.

<P>
Obs. The PBM format does not carry the original resolution
(dots-per-inch) at which the image was scanned. As some
heuristics use information such as dimensions, for now Clara OCR
expects to be told the resolution through the
command-line switch -y.
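
<P>
To see why the resolution must be given separately, it helps to
look at what a raw PBM file contains: the magic "P4", the width
and the height in pixels, and then the packed pixels (one bit per
pixel, rows padded to a whole byte, 1 meaning black). The reader
below is a minimal sketch written for this manual, not the code
used by Clara OCR; the names read_raw_pbm and size_in_inches are
ours.

```python
def read_raw_pbm(data):
    """Parse raw (P4) PBM bytes; return (width, height, rows of 0/1 pixels)."""
    if data[:2] != b"P4":
        raise ValueError("not a raw PBM file")
    pos = 2
    fields = []
    while len(fields) != 2:
        # Skip whitespace and '#' comments between header fields.
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while data[pos:pos + 1].isdigit():
            pos += 1
        if start == pos:
            raise ValueError("bad PBM header")
        fields.append(int(data[start:pos]))
    width, height = fields
    pos += 1  # exactly one whitespace byte separates header and pixel data
    row_bytes = (width + 7) // 8  # rows are padded to a whole byte
    rows = []
    for y in range(height):
        row = data[pos + y * row_bytes: pos + (y + 1) * row_bytes]
        # Bits are packed most significant first; 1 means black.
        rows.append([(row[x // 8] >> (7 - x % 8)) & 1 for x in range(width)])
    return width, height, rows


def size_in_inches(width, height, dpi):
    """Physical size of the image, which only the scanning dpi can tell."""
    return width / dpi, height / dpi
```

<P>
Nothing in the header says how many pixels make an inch, so a
physical size can only be computed once the dpi is supplied from
outside, as size_in_inches does.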

<P>
Grayscale means that each pixel assumes one gray "level",
typically from 0 (black) to 255 (white). This is a good choice
for scanning old or low-quality printed materials, because it's
possible to use specialized programs to analyse the image and
choose a "threshold", in such a way that all pixels above that
threshold will be considered "white", and all others will be
considered black (when scanning in lineart, the threshold is
chosen by the scanning program or by the user). So in most cases
grayscale will permit achieving better results. However, as
grayscale images are much larger than bitmap images, 300 dpi
(instead of 600 dpi) may be mandatory when using grayscale due to
disk space constraints.
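
<P>
The difference in size is easy to estimate. The sketch below
assumes 8 bits per pixel for grayscale and 1 bit per pixel for
bitmaps, and a hypothetical scan area of 150 mm by 210 mm; the
function image_bytes is ours and ignores the small file header.

```python
def image_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Approximate uncompressed size of a scan, ignoring the file header."""
    width_px = round(width_mm / 25.4 * dpi)
    height_px = round(height_mm / 25.4 * dpi)
    row_bits = width_px * bits_per_pixel
    row_bytes = (row_bits + 7) // 8  # each row is padded to a whole byte
    return row_bytes * height_px


for label, dpi, bits in (("gray 300 dpi", 300, 8),
                         ("gray 600 dpi", 600, 8),
                         ("bitmap 600 dpi", 600, 1)):
    print(label, image_bytes(150, 210, dpi, bits), "bytes")
```

<P>
Under these assumptions one page takes roughly 4.4 megabytes in
grayscale at 300 dpi, 17.6 megabytes in grayscale at 600 dpi and
2.2 megabytes as a 600 dpi bitmap.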

<P>
Obs. The page 143 of Manuel Bernardes Branco's Dictionary that
we're using throughout these tests was scanned using the SANE
scanimage command:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    scanimage -d microtek2:/dev/sga --mode gray -x 150 -y 210 \
              --resolution 300 > 143.pgm</PRE>
</TD></TR></TABLE></CENTER>
Thresholding is not the only method for converting grayscale
images into bitmaps (such conversion is also called
"binarization"), but it's the method currently used by Clara OCR.
In practice, a threshold that is too low will break many symbols at
their thin parts, and a threshold that is too high will link symbols
together (in the figure, an "a-i" link and a broken "u").

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

               XX                  
               XX                  

     XXXXX    XXX      XXX   XXX   
    X     XX   XX       XX    XX   
          XX   XX       XX    XX   
     XXXXXXX   XX       XX    XX   
    X     XX   XX       XX    XX   
    X     XX   XX       XX    XX   
     XXXXX XXXXXXX       XX  XXXX  </PRE>
</TD></TR></TABLE></CENTER>
It's a hard task to detect broken and linked symbols. The Clara OCR
heuristics that handle these cases are still rudimentary, so
thresholding must be carefully performed, in order not to compromise
the OCR results. If the printing intensity, the noise level or the paper
quality vary from page to page, thresholding must be performed on a
per-page basis.
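
<P>
The effect of the threshold on broken and linked symbols can be
reproduced on a toy scan line. The sketch below assumes the
convention described earlier (levels above the threshold become
white, all others black) and uses invented gray levels; binarize
is our own helper, not the pgmtopbm implementation.

```python
def binarize(gray_rows, threshold, maxval=255):
    """Map gray levels to PBM-style bits: 1 (black) at or below the threshold."""
    limit = threshold * maxval
    return [[0 if level > limit else 1 for level in row] for row in gray_rows]


# One scan line: stroke, faint thin part, stroke, light smudge, stroke.
# The first three pixels belong to one symbol, the last one to another.
LINE = [[30, 100, 30, 150, 30]]

for threshold in (0.3, 0.5, 0.65):
    bits = binarize(LINE, threshold)[0]
    print(threshold, "".join("X" if bit else "." for bit in bits))
```

<P>
With 0.3 the faint part of the first symbol turns white and the
symbol breaks; with 0.65 the smudge turns black and the two
symbols become linked; 0.5 keeps both symbols whole and separate.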

<P>
Clara OCR includes a simple threshold selection script. Let's try it
on our 2-page book. Just create a directory, cd to it and run the
selthresh.pl script, giving the resolution and the names of the
images:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ cd /home/clara/books/BC
    $ mkdir pbm
    $ cd pbm
    $ selthresh.pl -y 300 -l 0.45 0.55 ../pgm/*pgm
    selthresh.pl: scaling 2 times
    Best thresholds:
    143-l.pgm 0.49
    143-r.pgm 0.51</PRE>
</TD></TR></TABLE></CENTER>
In this case, selthresh.pl will require around 4 minutes to
complete on a 500MHz CPU. For larger collections of pages,
selthresh.pl may take much longer to complete (hours or days). If
needed, the execution can be safely interrupted using Control-C
(it's ok to shut down the machine while selthresh.pl is
running). The execution can be safely restarted from the point
where it was interrupted by typing the same command again:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ cd /home/clara/books/BC/pbm
    $ selthresh.pl -y 300 -l 0.45 0.55 ../pgm/*pgm</PRE>
</TD></TR></TABLE></CENTER>
The option -l is used to specify an interval of thresholds to
try. For now, selthresh.pl is unable to choose a "good"
interval by itself. The user must manually check the results for some
thresholds in order to make a choice. For instance, to examine
the results for threshold 0.4 on page 143-l.pgm, try:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ pgmtopbm -threshold -value 0.4 ../pgm/143-l.pgm >143-l.pbm
    $ display 143-l.pbm</PRE>
</TD></TR></TABLE></CENTER>
Change the threshold and repeat. Once you have found a threshold value
that produces a "nice" visual result, pass to -l the interval
centered at that threshold, with total width 0.1 or 0.2. The same
interval may be used for all pages because selthresh.pl will warn
about a bad interval choice. Example:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ selthresh.pl -y 300 -l 0.30 0.35 ../pgm/143-l.pgm
    selthresh.pl: scaling 2 times
    Best thresholds:
    143-l.pgm 0.32 (bad interval, try -l 0.30 0.4)</PRE>
</TD></TR></TABLE></CENTER>
If a "bad interval" warning appears on the final output for some
pages, it's ok to restart selthresh.pl giving a new, wider
interval, as suggested by selthresh.pl. Only the suspicious pages
will be re-examined. In fact, selecting a narrow initial interval
(and making it larger as required) may be a good strategy to
reduce the total running time.
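
<P>
The arithmetic behind the -l interval is simple enough to script.
The helpers below are a sketch of ours (interval and
selthresh_command are not part of the Clara OCR distribution);
they compute the interval centered at a visually chosen threshold
and print the corresponding command line, using only the -y and
-l switches shown above.

```python
def interval(center, width=0.1):
    """Return the (left, right) limits for -l, centered at a threshold."""
    left = max(0.0, center - width / 2)
    right = min(1.0, center + width / 2)
    return round(left, 2), round(right, 2)


def selthresh_command(center, width=0.1, dpi=300):
    """Build the selthresh.pl command line for the chosen interval."""
    left, right = interval(center, width)
    return "selthresh.pl -y %d -l %.2f %.2f ../pgm/*pgm" % (dpi, left, right)


print(selthresh_command(0.5))
```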

<P>
Once the best thresholds are known, use pgmtopbm to produce the
bitmaps. It's also a good idea to bring the resolution closer to 600 dpi
using pnmenlarge. Although pnmenlarge does not add information to the
image, the classification heuristics will behave better. In our case,
the commands should be

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ cd /home/clara/books/BC/pbm
    $ pnmenlarge 2 ../pgm/143-l.pgm | \
          pgmtopbm -threshold -value 0.49 >143-l.pbm
    $ pnmenlarge 2 ../pgm/143-r.pgm | \
          pgmtopbm -threshold -value 0.51 >143-r.pbm</PRE>
</TD></TR></TABLE></CENTER>
Obs. It's not a bad idea to view the PBM files, or at least
some of them. Although selthresh.pl produced good results for us,
your mileage may vary.

<P>
In order to capture the output of selthresh.pl (to extract the
per-page best thresholds), it's ok to regenerate it as many
times as needed: just repeat the same selthresh.pl command.
Once all computations have been performed, the script will
just read the results from selthresh.out and output them.
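
<P>
The captured output can then be turned into the conversion
commands automatically. The sketch below assumes that the output
looks exactly like the samples shown in this section ("page
threshold", optionally followed by a "bad interval" note); the
functions best_thresholds and conversion_commands are ours, and
the format of selthresh.out itself is not used.

```python
import re

# One result line, e.g. "143-l.pgm 0.49" or
# "143-l.pgm 0.32 (bad interval, try -l 0.30 0.4)".
LINE = re.compile(r"^(\S+\.pgm)\s+([0-9.]+)(\s+\(bad interval.*\))?\s*$")


def best_thresholds(output):
    """Return {page: (threshold, flagged)} from the text printed by the script."""
    result = {}
    for line in output.splitlines():
        match = LINE.match(line.strip())
        if match:
            page, value, warning = match.groups()
            result[page] = (float(value), warning is not None)
    return result


def conversion_commands(thresholds, scale=2):
    """Build one pnmenlarge | pgmtopbm pipeline per page that is not flagged."""
    commands = []
    for page, (value, flagged) in sorted(thresholds.items()):
        if flagged:
            continue  # pages with a bad interval need another selthresh.pl run
        target = page[:-len(".pgm")] + ".pbm"
        commands.append(
            "pnmenlarge %d ../pgm/%s | pgmtopbm -threshold -value %.2f >%s"
            % (scale, page, value, target)
        )
    return commands
```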

<P>
A final warning: selthresh.pl may be fooled by too dark images. So
if the right limit is much larger than it should be, selthresh.pl
may produce bad results. So be careful concerning the right limit
of the interval. As practical advice, keep in mind that the
best threshold for most images is less than 0.6. In the near
future we'll use statistical measurements to choose the interval
to analyse, in order to prevent such problems and to make
a manual choice unnecessary.

<P>

<P>
<A NAME=2.2>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>2.2 Avoiding skew</B></FONT></TD></TR></TABLE>
<P>
Sometimes the printing is skewed relative to the paper. Skew is
a problem for the OCR heuristics. Clara OCR currently does not
offer a good solution for skew correction. As its engine just
detects components by pixel contiguity and builds classes of
symbols, in practice the effect of skew will be a larger number
of patterns, and therefore a larger revision cost.
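
<P>
Detection of components by pixel contiguity can be sketched as a
flood fill over the bitmap. The code below is an illustration
only: it assumes 8-neighbour contiguity, which may differ from
what the Clara OCR engine actually does, and the function
components is ours.

```python
def components(bitmap):
    """Group black pixels (1) into components by 8-neighbour contiguity."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = set()
    found = []
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or (x, y) in seen:
                continue
            # Flood fill from this pixel with an explicit stack.
            stack = [(x, y)]
            seen.add((x, y))
            pixels = []
            while stack:
                cx, cy = stack.pop()
                pixels.append((cx, cy))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and bitmap[ny][nx] == 1
                                and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            stack.append((nx, ny))
            found.append(sorted(pixels))
    return found


# A dot over a stem (like an "i") next to a small block.
PAGE = [
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 1, 0, 1, 1],
]
print([len(c) for c in components(PAGE)])
```

<P>
Each component is just a set of pixel coordinates, so the same
letter printed at a slightly different angle yields a different
set, which is why skew multiplies the number of patterns.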

<P>
In some cases, a careful manual scanning can solve the
problem. When acceptable, a set-square solves the problem: just
align one text line at one set-square rule and the edge of the
scanner glass at the other rule (we're supposing that the
bookbinding was disassembled).

<P>
A preprocessor able to compute and correct skew is expected to
become available soon.

<P>

<P>
<A NAME=2.3>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>2.3 The work directory</B></FONT></TD></TR></TABLE>
<P>
Clara OCR expects to find one or more images of scanned pages
in a single directory. In our case, this directory is assumed
to be /home/clara/books/BC/pbm. By default, in this same
directory, various files will be created to store the OCR data
structures. So, if 143-l.pbm and 143-r.pbm are the pages to OCR, then
after processing all pages at least once (not done yet) the work
directory will contain the following files:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    143-l.pbm
    143-l.html
    143-l.session
    143-r.pbm
    143-r.html
    143-r.session
    acts
    patterns</PRE>
</TD></TR></TABLE></CENTER>
The files "*.pbm" are the PBM images, the files "*.html" are the
current OCR output, the file "patterns" is the current
"bookfont", the file "143-l.session" contains the OCR data structures
for the page 143-l.pbm, and the file "acts" stores the human revision
effort already spent.
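
<P>
The naming rule above can be written down as a small function.
This is only a sketch to make the layout explicit; the function
work_directory_files is ours and does not exist in Clara OCR.

```python
import os


def work_directory_files(pages):
    """List the files expected after every page was processed at least once."""
    names = []
    for page in sorted(pages):
        stem, _ = os.path.splitext(page)
        # Each page gets its own OCR output and its own session file.
        names.extend([page, stem + ".html", stem + ".session"])
    # The revision acts and the bookfont are shared by all pages.
    return names + ["acts", "patterns"]


print(work_directory_files(["143-l.pbm", "143-r.pbm"]))
```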

<P>
When Clara OCR is processing the page x.pbm, the files
"x.session", "acts" and "patterns" are in memory. These three
files together are generally referred to as "the session". So the
menu option "save session" means saving all three files.

<P>

<P>
<A NAME=2.4>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>2.4 The pattern types</B></FONT></TD></TR></TABLE>
<P>
Patterns are selected symbols from the book. They're obtained
from manual training, or from automatic selection. The patterns
are used to deduce the transliteration of the unknown symbols by
the bitmap comparison heuristics. In other words, the OCR
discovers that one symbol is the letter "a" or the digit "1"
by comparing it with the patterns.
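
<P>
The idea can be illustrated with the most naive comparison
possible: count the pixels that differ and pick the closest
pattern. This is a sketch of the principle only; the heuristics
implemented by Clara OCR are more elaborate, and the names
distance and classify, as well as the tiny 3x3 bitmaps, are
invented for the example.

```python
def distance(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(symbol, patterns):
    """Return the transliteration of the pattern closest to the symbol."""
    best = min(patterns, key=lambda item: distance(symbol, item[1]))
    return best[0]


# Each pattern is a (transliteration, bitmap) pair.
PATTERNS = [
    ("l", [[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    ("o", [[1, 1, 1], [1, 0, 1], [1, 1, 1]]),
]

# An "o" with one missing pixel is still closer to "o" than to "l".
UNKNOWN = [[1, 1, 1], [1, 0, 1], [1, 1, 0]]
print(classify(UNKNOWN, PATTERNS))
```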

<P>
The book font is the collection of all patterns. The term "book
font" was chosen to make sure that we're not talking about the X
font used by the GUI. The book font is stored in a separate file
("patterns", in the work directory). Clara OCR classifies the
patterns into "types", one type for each printing font. For now,
most of this work must be done manually. Someday in the future,
the auto-tuning features and the pre-built customizations will
hopefully make this process less painful.

<P>
So, before OCRing a book, it's convenient to observe the
different fonts used. In our case, we have three fonts (the
quotations refer to the page 5.pbm):

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    Unknown Latin 9pt         ("Todos sao iguais...")
    Unknown Latin 9pt bold    ("Art. 5")
    Unknown Latin 8pt italic  (footings)</PRE>
</TD></TR></TABLE></CENTER>
It's not mandatory to identify each font exactly by its "correct" name
(Roman, Arial, Courier, etc). In our case, we've chosen "Unknown
Latin". These labels can be entered manually using the PATTERN (TYPES)
tab, one "type" for each "font". So we'll have 3 "types", and, for
each one, various parameters can be entered manually. At least the
alphabet must be specified. In fact, the PATTERN (TYPES) tab allows
a very careful structuring of all fonts used along the book. Even some
intricate details, like the classification techniques that can be
used for each symbol, can be set.

<P>
Now we can select some patterns from the pages 5.pbm and
6.pbm. Try:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ cd /home/clara/books/BC/pbm
    $ clara &</PRE>
</TD></TR></TABLE></CENTER>
Load the page 5.pbm. Observe the symbols, select a nice one using
the mouse button 1 or the arrows (say, a small letter "a") and
train it by pressing the corresponding key (the "a" key). Repeat
this process for various symbols, all of the same type (so do
not mix bold with non-bold, etc). The entered patterns belong by
default to "type 0". The "Set pattern type" entry of the Edit
menu can be used to move all "type 0" patterns to some other type
(1, 2 or 3 in our case). The "Display absent symbols" entry on the
"Options" menu can be used to list the symbols for which a
reasonable quantity (5) of samples does not exist yet. This way,
one can complete all fonts used along the book.

<P>
Now save the session (menu "File"), exit Clara OCR (menu "File"),
and enter Clara OCR again using the same commands above. Try to
load one file and/or to observe the patterns on the tabs PATTERN,
PATTERN (list), TUNE (SKEL), etc. This is a good way to
experience that Clara OCR is started and exited many times along
the duration of one OCR project.

<P>
The last remark in this subsection: instead of the just described
manual pattern selection, Clara OCR is able to select by itself
the patterns to use from the pages. In order to use this feature,
after selecting the checkbox "Build the bookfont automatically"
(TUNE tab), classify the symbols (just press the OCR button using
the mouse button 1, or press the mouse button 2 over it and
select the "classify" item). However, the current recommendation
is to prefer the manual selection of patterns, at least as a
first step.

<P>

<P>
<A NAME=2.5>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>2.5 Skeleton tuning</B></FONT></TD></TR></TABLE>
<P>
Currently, symbol classification can be performed by three
different classifiers: skeleton fitting, border mapping or pixel
distance. The choice is made on the TUNE tab. Border mapping is
currently experimental. Pixel distance has been used as an
auxiliary classifier. Skeleton fitting is more mature code and
is highly customizable. It's the default classification method for
now.

<P>
When using skeleton fitting, two symbols are considered similar
when each one contains the skeleton of the other. So the
classification result depends strongly on how skeletons are
computed.
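<P>
The mutual containment test described above can be sketched in
Python as follows. This is a deliberately simplified illustration and
not the Clara OCR code: bitmaps are modelled as sets of (x, y) pixels
that are assumed to be already aligned, the skeletons are made up by
hand, and no tolerance is applied. All names are hypothetical.

```python
def contains(symbol_pixels, skeleton_pixels):
    """True when every skeleton pixel is also a pixel of the symbol."""
    return set(skeleton_pixels).issubset(symbol_pixels)


def similar(symbol_a, skeleton_a, symbol_b, skeleton_b):
    """Mutual test: each symbol must contain the skeleton of the other."""
    return contains(symbol_a, skeleton_b) and contains(symbol_b, skeleton_a)


# Two vertical bars, 3 and 4 pixels wide, sharing the same origin.
bar_a = {(x, y) for x in range(3) for y in range(5)}
bar_b = {(x, y) for x in range(4) for y in range(5)}
skel_a = {(1, y) for y in range(1, 4)}    # hand-made skeleton of bar_a
skel_b = {(2, y) for y in range(1, 4)}    # hand-made skeleton of bar_b

# A horizontal bar whose skeleton sticks out of the vertical bars.
bar_h = {(x, y) for x in range(5) for y in range(3)}
skel_h = {(x, 1) for x in range(1, 4)}

same_shape = similar(bar_a, skel_a, bar_b, skel_b)
different_shape = similar(bar_a, skel_a, bar_h, skel_h)
```

<P>
In this toy model a thinner skeleton makes the test more permissive
and a thicker one makes it stricter, which is why the result depends
on how the skeletons are computed.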

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

  FIGURE: SKELETON FITTING</PRE>
</TD></TR></TABLE></CENTER>
Clara OCR offers seven different methods for computing
skeletons. Each method has tunable parameters. The choice of the
method and the parameters can be made through a visual interface
on the TUNE (SKEL) tab. To try it, first save the session (menu
"File"), then enter that tab. At least one pattern must
exist. Vary the parameters and observe the results. Press the
left and right arrows to navigate through the patterns, and use
the "zoom" button to choose a comfortable image size. The last
selection will be used for all skeleton computations. To discard
it, exit Clara OCR without saving the session.

<P>
Instead of trying the TUNE (SKEL) tab, it's possible to specify
skeleton computation parameters through the -k command-line
switch. Note however that if a selection was performed through
the TUNE (SKEL) tab, that selection will override the parameters
given to -k, so be careful.
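<P>
The argument of -k is a comma-separated list of numbers, and the
examples in this section also show lists with empty fields. The Python
sketch below splits such a list and overlays it on a set of current
values. It ASSUMES that an empty field means "leave this parameter
unchanged"; that meaning is not stated here, and the function names
are hypothetical.

```python
def parse_k(spec):
    """Split a -k argument into a list of numbers.

    Empty fields are returned as None. Treating them as "unchanged"
    is an assumption of this sketch, not documented behaviour.
    """
    return [float(field) if field.strip() else None
            for field in spec.split(",")]


def apply_k(current, spec):
    """Overlay the non-empty fields of spec on a list of current values."""
    parsed = parse_k(spec)
    if len(parsed) != len(current):
        raise ValueError(
            "expected %d fields, got %d" % (len(current), len(parsed)))
    return [old if new is None else new
            for old, new in zip(current, parsed)]


base = parse_k("2,1.4,1.57,10,3.8,10,4,4")
changed = apply_k(base, "4,,,,,,3,")
```

<P>
Keeping the chosen list in one place (a shell alias or a small
wrapper script, for example) makes it easier to pass the same values
every time Clara OCR is started.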

<P>
Clara OCR has an auto-tune feature to choose the "best" skeleton
computation parameters. However, recent changes left that feature
broken, so ignore it for now. As choosing those parameters by hand
is a hard task, unless you want to play with it, please use the
following minimalistic approach:

<P>
1. Quality printing without thin details

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    use -k 2,1.4,1.57,10,3.8,10,4,4
     or -k 0,1.4,1.57,10,3.8,10,4,4</PRE>
</TD></TR></TABLE></CENTER>
2. Quality printing with thin details

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    use -k 2,1.4,1.57,10,3.8,10,1,1
     or -k 4,,,,,,3,</PRE>
</TD></TR></TABLE></CENTER>
3. Poor printing without thin details

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    use -k 2,1.4,1.57,10,3.8,10,1,1</PRE>
</TD></TR></TABLE></CENTER>
4. Poor printing with thin details

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    use -k 2,1.4,1.57,10,3.8,10,1,1</PRE>
</TD></TR></TABLE></CENTER>
Although the pattern computation parameters may change along the way,
it's wise to choose adequate skeleton computation parameters
before OCRing, and keep them fixed along the project. Every time
Clara OCR is started, pass the same parameters chosen. In our
case, we can use the default parameters. To do so, just enter
Clara OCR as before:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ cd /home/clara/books/BC/pbm
    $ clara &</PRE>
</TD></TR></TABLE></CENTER>
<A NAME=2.6>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>2.6 Classification tentatives</B></FONT></TD></TR></TABLE>
<P>
To classify the book symbols (i.e. to discover the
transliteration of unknown symbols using the patterns), enter
Clara OCR, select "Work on all pages" ("Options" menu) and press
the OCR button using the mouse button 1, or press the mouse
button 2 and select "Classification". The classification may be
performed many times. Each time, different parameters may be
tried to refine the results already achieved.

<P>
When the classification finishes, observe the pages 5.pbm and
6.pbm. Most probably, some symbols will be greyed. In other
words, the classifier was unable to classify all symbols. The
statistics presented on the PAGE (LIST) tab may be useful now. To
reduce the number of unknown symbols there are three choices: add
more patterns, change the skeleton computation parameters, or try
another classifier.

<P>
To add more patterns, just train some greyed symbols and
reclassify all pages again. The reclassification will be faster
than the first classification because most symbols, already
classified, won't be touched.

<P>
To change the skeleton computation parameters, exit Clara OCR,
restart it passing the new parameters through -k, select
"Re-scan all patterns" ("Edit" menu), select "Work on all pages"
("Options" menu) and reclassify. It may be easier to choose and set
the new parameters using the TUNE (SKEL) tab, as explained
earlier. However, remember that the parameters chosen through the
TUNE (SKEL) tab override the parameters given through -k.

<P>
To try another classifier, first select the "Re-scan all
patterns" entry on the "Edit" menu. Then enter the TUNE tab and
select the classifier to use from the available choices
(skeleton-based, border mapping and pixel distance). The pixel
distance may be a good choice. Then reclassify all pages.

<P>
The "Re-scan all patterns" step is required because for each symbol
Clara OCR remembers the patterns already tried to classify it,
and does not try those patterns again. However, when the skeleton
computation parameters change, or when the classifier changes,
those same patterns must be tried again. Maybe in the future
Clara OCR will decide by itself about re-scanning all patterns.
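<P>
The bookkeeping described above can be pictured with the following
Python sketch. It is only a model of the idea (each symbol remembers
which patterns were already tried), not the Clara OCR data structures.
The class, the function names and the match callback are hypothetical,
and the sketch ASSUMES that re-scanning clears only the memory of
tried patterns and leaves existing transliterations alone.

```python
class Symbol:
    """A symbol plus the ids of the patterns already tried on it."""

    def __init__(self, name):
        self.name = name
        self.tried = set()      # ids of the patterns already compared
        self.label = None       # transliteration, once found


def classify(symbol, patterns, match):
    """Try only the patterns not tried yet; return how many were compared."""
    compared = 0
    for pattern_id, transliteration in patterns.items():
        if symbol.label is not None:
            break                           # already classified
        if pattern_id in symbol.tried:
            continue                        # remembered from an earlier run
        symbol.tried.add(pattern_id)
        compared += 1
        if match(symbol, pattern_id):
            symbol.label = transliteration
    return compared


def rescan_all(symbols):
    """Forget which patterns were tried, so that they are compared again."""
    for symbol in symbols:
        symbol.tried.clear()
```

<P>
With this model, a second classification run over unchanged patterns
compares nothing, which is why a change of classifier or of skeleton
parameters has no effect until the tried sets are cleared.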

<P>

<P>
<A NAME=2.7>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>2.7 Alignment tuning</B></FONT></TD></TR></TABLE>
<P>
At this point, we can generate the output for all pages. The
output is already available if the classification was performed
clicking the OCR button with mouse button 1. If not, just select
the "Work on all pages" item on the "Options" menu, and click the
OCR button using the mouse button 1. The per-page output will be
saved to the files 5.html and 6.html.

<P>
Maybe the output will contain unknown symbols. Maybe the output
presents broken lines or broken words. If so, the numbers used to
perform symbol alignment must be changed. These numbers are
configured on the TUNE tab ("Magic numbers" section). They're
part of the session data, so they'll be saved to disk.

<P>
There are 8 such numbers:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    max word distance as percentage of x_height
    max symbol distance as percentage of x_height
    dot diameter measured in millimeters
    max alignment error as percentage of DD
    descent (relative to baseline) as percentage of DD
    ascent (relative to baseline) as percentage of DD
    x_height (relative to baseline) as percentage of DD
    steps required to complete the unity</PRE>
</TD></TR></TABLE></CENTER>
In order to understand why these numbers are relevant, suppose,
for instance, that Clara OCR already knows that the "b" symbol
below is a letter "b", but does not know that the "p" symbol is a
letter "p". To decide if the "p" symbol seems to be a letter
instead of a blot, Clara OCR checks if it fits the typical
dimensions of a letter. To do so, alignment hints are needed. On
the figure we can see the baseline-relative ascent (AS), descent
(DS) and x_height (XH), and the dot diameter (DD).

<P>

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    XXX                          --\--
     XX                            |
     XX                            |
     XX                            |
     XX XXXXX   XX  XXXXX          |     --\--
     XXX     X   XXX     X         | AS    |  
     XX      XX  XX      XX        |       |  
     XX      XX  XX      XX        |       | XH  
     XX      XX  XX      XX        |       |  
     XX      XX  XX      XX  X     |       |     --\--
     XXX     X   XXX     X  XXX    |       |       | DD
     XX XXXXX    XX XXXXX    X   --\--   --\--   --\--
                 XX                |
                 XX                | DS
                 XX                |
                XXXX             --\--</PRE>
</TD></TR></TABLE></CENTER>
The most relevant numbers to configure are the dot diameter, the
maximum alignment error, the descent, the ascent and the x_height.
The last three give the baseline-relative descent, ascent and
x_height as percentages of the dot diameter. The usage of these
numbers is expected to stop some day in the future, when the
pattern types implementation becomes more mature.
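<P>
To make the units concrete, the Python sketch below converts the dot
diameter (in millimeters) and the DD-relative percentages into pixel
distances, and applies a simple bounds check to a symbol. It is an
illustration under stated assumptions, not the Clara OCR heuristic:
the scanning resolution (dpi) is ASSUMED to be known, the check itself
is hypothetical, and the numbers in the example are made up rather
than recommended values.

```python
MM_PER_INCH = 25.4


def alignment_limits(dd_mm, dpi, ascent_pct, descent_pct,
                     x_height_pct, error_pct):
    """Convert the DD-relative percentages into pixel distances."""
    dd_px = dd_mm / MM_PER_INCH * dpi       # dot diameter in pixels
    return {
        "dd": dd_px,
        "ascent": dd_px * ascent_pct / 100.0,
        "descent": dd_px * descent_pct / 100.0,
        "x_height": dd_px * x_height_pct / 100.0,
        "error": dd_px * error_pct / 100.0,
    }


def fits_letter(height_above, depth_below, limits):
    """Hypothetical check of a symbol against the ascent and descent.

    height_above and depth_below are pixel distances measured from
    the baseline; the maximum alignment error is used as tolerance.
    """
    tolerance = limits["error"]
    return (height_above <= limits["ascent"] + tolerance
            and depth_below <= limits["descent"] + tolerance)


# Made-up example: 1 mm dot diameter scanned at 254 dpi (10 pixels).
limits = alignment_limits(1.0, 254, 400, 200, 300, 50)
```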

<P>
<A NAME=3.>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#79BEC6><FONT SIZE=+1><B>3. Complex procedures</B></FONT></TD></TR></TABLE>
<P>
To OCR an entire book is a long process, and problems may be
detected along the way: a bad choice of skeleton computation
parameters, a bad page contaminating the bookfont, files
lost due to a crash, etc. How can they be solved?

<P>
Clara OCR does not currently offer a complete set of tools to
solve all these problems. In some cases, a simple solution is
available. In others, a solution is expected to become available
in future versions. This section will describe some practical
cases, and explain what can be done and what cannot be done for
each one.

<P>
<A NAME=3.1>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>3.1 Using two directories</B></FONT></TD></TR></TABLE>
<P>
To make it easier to use read-only media, Clara OCR
allows splitting the files into two directories, one for images and
the other for work files. The path of the first is stored in
pagesdir, and the path of the second in workdir. For instance:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

  (pagesdir)

    |
    +- 1.pbm
    |
    +- 2.pbm


  (workdir)

    |
    +- 1.session
    |
    +- 1.html
    |
    +- 2.session
    |
    +- 2.html
    |
    +- acts
    |
    +- pattern</PRE>
</TD></TR></TABLE></CENTER>
In this example, there are 2 pages (files "1.pbm" and
"2.pbm"). The current font is the file "pattern". The files
1.session and 2.session are the dumps of the data structures
built when processing the pages 1 and 2. The files 1.html and
2.html contain the current OCR output generated for pages 1 and 2.

<P>

<P>
<P><A NAME=3.2><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>3.2 Adding a page (to be written)</B></FONT></TD></TR></TABLE>
<A NAME=3.3>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>3.3 Multiple books</B></FONT></TD></TR></TABLE>
<P>
A somewhat rigid directory structure is recommended for
high-volume digitization projects based on Clara and using the
web interface. In this case, there will be multiple "pagesdir"
directories ("book1" and "book2" under the booksroot in the figure)
and, for each one, a corresponding "workdir" ("book1" and "book2"
under the workroot in the figure).

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

  (booksroot)

    |
    +- book1/
    |       +- 1.pbm
    |       |
    |       +- 2.pbm
    |
    |
    +- book2/
            +- 1.pbm
            |
            +- 2.pbm


  (workroot)

    |
    +- book1/
    |       +- 1.session
    |       |
    |       +- 1.html
    |       |
    |       +- 2.session
    |       |
    |       +- 2.html
    |       |
    |       +- acts
    |       |
    |       +- doubts/
    |       |        +- s.1.319.pbm
    |       |        |
    |       |        +- u.2.7015.pbm
    |       |        |
    |       |        +- 1.958225189.17423.hal
    |       |
    |       +- pattern
    |
    +- book2/
    |       +- 1.session
    |       |</PRE>
</TD></TR></TABLE></CENTER>
For each book subdirectory on the workroot subtree, there will be
a "doubts" directory, used to communicate with the web
server. Each OCR run on some page of this book will generate
files of the form "u.page.symbol.pbm", each containing a PBM image
of one symbol. When the CGI is
asked to produce a revision page, it will choose one of these
files and rename it to "s.page.symbol.pbm". This procedure is
performed without using locks, so two simultaneous revision
accesses may pick the same symbol. The revision submission
generates a qmail-style file "doc.time.pid.host".
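<P>
The naming convention of the "doubts" directory can be illustrated
with the Python sketch below, which parses the file names and claims
one unrevised symbol by renaming it. It is not the code of the CGI;
the function names are hypothetical, and the way a failed rename is
handled is an assumption of this sketch.

```python
import os
import re

# "u" or "s", then the page, then the symbol number, then ".pbm".
DOUBT_RE = re.compile(r"^([us])\.(.+)\.(\d+)\.pbm$")


def parse_doubt(filename):
    """Split "u.page.symbol.pbm" into (state, page, symbol), or None."""
    found = DOUBT_RE.match(filename)
    if found is None:
        return None
    return found.group(1), found.group(2), int(found.group(3))


def claim_doubt(doubts_dir):
    """Rename one "u." file to "s." and return the new name, or None.

    As in the procedure described above, no lock is taken. If another
    process renames the same file first, os.rename raises
    FileNotFoundError and the next candidate is tried.
    """
    for name in sorted(os.listdir(doubts_dir)):
        parsed = parse_doubt(name)
        if parsed is None or parsed[0] != "u":
            continue
        claimed = "s." + name[2:]
        try:
            os.rename(os.path.join(doubts_dir, name),
                      os.path.join(doubts_dir, claimed))
        except FileNotFoundError:
            continue
        return claimed
    return None
```

<P>
Note that a rename only narrows the window for a collision: two
requests that list the directory at the same moment can still be
handed the same symbol, as the text above warns.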

<P>

<P>
<P><A NAME=3.4><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>3.4 Adding a book (to be written)</B></FONT></TD></TR></TABLE>
<A NAME=3.5>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>3.5 Removing a page</B></FONT></TD></TR></TABLE>
<P>
From the stats presented by the PAGE (LIST) tab it's possible to
detect problems on specific pages. A low factorization may be a
symptom of a bad choice of brightness for that page. In such a
case, it's probably a good idea to remove that page completely.

<P>
To remove a page is a delicate operation. Clara OCR currently
does not offer a "remove page" feature. Basically, such a feature
would have to remove all patterns from that page, remove the
revision data acquired from that page, and remove the page image
and its session file.

<P>
<A NAME=3.6>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>3.6 Building the bookfont</B></FONT></TD></TR></TABLE>
<P>
Creating the pattern types (to be written).

<P>
Completing the pattern types (to be written).

<P>
Tuning skeleton parameters (to be written).

<P>
Auto-tuning skeleton parameters (to be written: when computing
pattern skeletons, auto-tune the parameters instead of using the
default parameters, entered through -k).

<P>

<P>
<A NAME=3.7>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>3.7 Dealing with classification errors</B></FONT></TD></TR></TABLE>
<P>
What to do when the OCR classifies incorrectly a large quantity of
symbols? (to be written)

<P>

<P>
<P><A NAME=3.8><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>3.8 Rebuilding session files (to be written)</B></FONT></TD></TR></TABLE>
<A NAME=3.9>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>3.9 Importing revision data</B></FONT></TD></TR></TABLE>
<P>
When OCRing a large book, a good approach is to divide its pages
into a number of smaller sections and OCR each one. So for a book
with, say, 1000 pages, we could OCR pages 1-200, then 201-400,
etc.
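<P>
The split into sections is simple arithmetic, sketched below in
Python. The function name is hypothetical and the section size is
whatever the project finds convenient.

```python
def page_sections(total_pages, section_size):
    """Return (first, last) page numbers for each section, from page 1."""
    if total_pages < 0 or section_size < 1:
        raise ValueError("total_pages must be >= 0 and section_size >= 1")
    return [(first, min(first + section_size - 1, total_pages))
            for first in range(1, total_pages + 1, section_size)]


sections = page_sections(1000, 200)
```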

<P>
After finishing the first section, of course we want to reuse on
the second section the training and revision effort already
spent. This is not the same as adding the pages 201-400 to the
first section, because we do not want to handle the pages 1-200
anymore.

<P>
Basically we need to import the patterns of the first section
when starting to process the second. However, Clara OCR is currently
unable to perform this operation.

<P>

<P>
<A NAME=3.10>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>3.10 How to use the web interface</B></FONT></TD></TR></TABLE>
<P>
The Clara OCR web interface allows remote training of symbols. To use
it, a web server able to run Perl CGIs (e.g. Apache) is
required. Let's present the steps to activate the web interface for a
simple case, with only one book (named "book1"). Basically, one needs
to create a subtree anywhere on the server disk (say,
"/home/clara/www/"), owned by the user that will manage the project
(say, "clara"), with the subdirectories "bin", "book1" and
"book1/doubts":

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ id
    uid=511(clara) gid=511(clara) groups=511(clara)
    $ cd /home/clara/
    $ mkdir www
    $ cd www
    $ mkdir bin book1
    $ mkdir book1/doubts</PRE>
</TD></TR></TABLE></CENTER>
Then copy to the directory "bin" the files clara.pl and sclara.c from
the Clara OCR distribution (say, /usr/local/src/clara), edit sclara.c
to change the hardcoded definition of the root directory to
"/home/clara/www", compile it and make it setuid:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ cd bin
    $ cp /usr/local/src/clara/clara.pl .
    $ cp /usr/local/src/clara/sclara.c .
    $ emacs sclara.c
    $ grep '^char \*root' sclara.c
    char *root = "/home/clara/www";
    $ cc -o sclara -static sclara.c
    $ rm sclara.c
    $ chmod a+s sclara</PRE>
</TD></TR></TABLE></CENTER>
Edit the script clara.pl. Example for the clara.pl configuration
section (the script clara.pl contains default definitions for some of
these variables, please comment out those definitions):

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $CROOT = "/home/clara/www";
    $U = "/cgi-bin/clara";
    $book[0] = 'Author, &lt;I&gt;Test 1&lt;/I&gt;, City, year';
    $subdir[0] = "book1";
    $LANG = 'en';
    $opt = '-W -R 10 -b -k 2,1.4,1.57,10,3.8,10,4,1';</PRE>
</TD></TR></TABLE></CENTER>
Now copy the PBM files to the directory "book1", create low-quality
jpeg previews, gzip the PBM files, and select some patterns:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ cd /home/clara/www/book1
    $ cp /usr/local/src/clara/imre.pbm .
    $ pbmreduce 8 imre.pbm | convert -quality 25 - imre.jpg
    $ gzip -9 imre.pbm
    $ clara -k 2,1.4,1.57,10,3.8,10,4,1</PRE>
</TD></TR></TABLE></CENTER>
(load one PBM file, train some symbols, save the session and quit the
program).

<P>
Now we need to process the PBM files in order to create some
"doubts". The script clara.pl also requires a symlink to the clara
binary (change the path /usr/local/bin/clara as required):

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ cd /home/clara/www/bin
    $ ln -s /usr/local/bin/clara clara
    $ ./clara.pl -s book1
    $ rm ../book1/*html
    $ ./clara.pl -p</PRE>
</TD></TR></TABLE></CENTER>
Now your server must be instructed to exec /home/clara/www/bin/clara.pl
when a visitor requests "/cgi-bin/clara" (if you prefer another URL,
change the clara.pl customization too). An easy way to accomplish that
is creating a symlink on the default directory for CGIs. The default
directory of CGIs is platform-dependent (e.g. /home/httpd/cgi-bin,
/usr/local/httpd/cgi-bin, /var/lib/apache/cgi-bin, etc). Example:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    # cd /home/httpd/cgi-bin
    # ln -sf /home/clara/www/bin/clara.pl clara</PRE>
</TD></TR></TABLE></CENTER>
Try to access the URL "/cgi-bin/clara" on your web server. The correct
behaviour is successfully loading a page entitled "Prototype of the
Cooperative Revision". If it does not work, check the following
common problems:

<P>
1. Apache expects to be explicitly allowed to follow symlinks. The
file access.conf should contain, in our case, a section similar to the
following:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    &lt;Directory /home/httpd/cgi-bin&gt;
    AllowOverride None
    Options ExecCGI FollowSymLinks
    &lt;/Directory&gt;</PRE>
</TD></TR></TABLE></CENTER>
2. The directory /home/clara must be world readable:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    # ls -ld /home/clara
    drwxr-xr-x  4 clara clara  1024 Sep 17 09:56 /home/clara</PRE>
</TD></TR></TABLE></CENTER>
If you succeeded, congratulations! Note that from time to time it'll
be necessary to reprocess the pages, adding to the session files the
data collected from the web, just as was done before:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    $ cd /home/clara/www/bin
    $ ./clara.pl -p
    $ ./clara.pl -s book1</PRE>
</TD></TR></TABLE></CENTER>
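<P>
The second common problem listed above (the permissions of
/home/clara) can also be checked from a script. The Python sketch
below tests whether "others" have both read and search permission on
a directory, which corresponds to the r-x shown in the ls output. The
function name is hypothetical, and the sketch assumes a POSIX system.

```python
import os
import stat


def world_accessible(path):
    """True when "others" may both read and enter the directory (r-x)."""
    mode = os.stat(path).st_mode
    needed = stat.S_IROTH | stat.S_IXOTH
    return stat.S_ISDIR(mode) and (mode & needed) == needed
```

<P>
For the layout used in this section, one would call
world_accessible("/home/clara") on the server.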
<A NAME=3.11>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>3.11 Revision acts maintenance</B></FONT></TD></TR></TABLE>
<P>
Types of revision acts (to be written).

<P>
Discarding deduced data (to be written).

<P>

<P>

<P>

<P>
<A NAME=3.12>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>3.12 Analysing the statistics</B></FONT></TD></TR></TABLE>
<P>
The "page (list)" tab offers recognition statistics on a per-page
basis. The contents of each column on this tab are described
below:

<P>
POS: The sequential position on the list. The current page
is marked by an asterisk on this column.

<P>
FILE: The name of the file that contains the PBM image of the
document.

<P>
RUNS: The number of OCR runs on this page. Partial OCR runs,
like classification (started by the "classify" button), also count
as one run.

<P>
TIME: Total CPU time spent on OCR operations on this
page. I/O time (reading and saving session files) is not
included.

<P>
WORDS: Current number of words on this page. This variable is
updated by the "build" step.

<P>
SYMBOLS: Current number of symbols on this page. This variable
is updated by the "build" step.

<P>
DOUBTS: Current number of untransliterated CHAR symbols on
this page. This variable is updated by the "build" step.

<P>
CLASSES: Current number of classes on this page.

<P>
FACT: Quotient between the number of symbols and the number of
classes.

<P>
RECOG: Quotient between (symbols-doubts) and symbols, where
"symbols" is the number of symbols and "doubts" is the number of
doubts as defined above.

<P>
PROGRESS: difference between the current recog rate and the
recog rate for the previous run.
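<P>
The three derived columns follow directly from the counters. The
Python sketch below computes them from the definitions above. The
function name is hypothetical, and returning None for a zero
denominator (or when there is no previous run) is a choice of this
sketch; how the GUI displays those cases is not covered here.

```python
def page_stats(symbols, doubts, classes, previous_recog=None):
    """Compute FACT, RECOG and PROGRESS from the per-page counters."""
    fact = symbols / classes if classes else None
    recog = (symbols - doubts) / symbols if symbols else None
    if recog is None or previous_recog is None:
        progress = None
    else:
        progress = recog - previous_recog
    return {"FACT": fact, "RECOG": recog, "PROGRESS": progress}


# Made-up counters: 2000 symbols, 100 doubts, 250 classes.
stats = page_stats(2000, 100, 250, previous_recog=0.9)
```

<P>
With these made-up counters, each class stands for 8 symbols on
average and 95% of the symbols are transliterated.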

<P>
<P><A NAME=3.13><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>3.13 Upgrading Clara OCR (to be written)</B></FONT></TD></TR></TABLE>
<A NAME=4.>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#79BEC6><FONT SIZE=+1><B>4. Reference of the Clara GUI</B></FONT></TD></TR></TABLE>
<P>
In this section, the Clara application window will be described
in detail, both to document all its features and to define the
terminology.

<P>

<P>

<P>
<A NAME=4.1>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>4.1 The application window</B></FONT></TD></TR></TABLE>
<P>
The application window is divided into three major areas: the
buttons ("zoom", "OCR", "stop", etc) on the left, the "plate" on the
right, including the tabs ("page", "pattern" and "tune"), and one or more
"document windows" inside the plate.

<P>
We say "document window" because each window is exhibiting one
"document". This "document" may be the scanned page (PAGE
window), the current OCR output for this page (PAGE OUTPUT
window), the symbol form (PAGE SYMBOL window), the GPL (GPL
window) and so on. However, we'll refer to the document windows
simply as "windows".

<P>
Around each window there are two scrollbars. On the bottom of the
application window there is a status line. On the top there is
a menu bar (fully documented on the section "Reference of the
menus").

<P>

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    +-----------------------------------------------+
    | File Edit OCR ...                             |
    +-----------------------------------------------+
    | +--------+     +----+ +--------+ +-------+    |
    | |  zoom  |     |page| |patterns| | tune  |    |
    | +--------+   +-+    +-+        +-+       +-+  |
    | +--------+   | +-------------------------+ |  |
    | |  zone  |   | |                         | |  |
    | +--------+   | |                         | |  |
    | +--------+   | |                         | |  |
    | |  OCR   |   | |        WELCOME TO       | |  |
    | +--------+   | |                         | |  |
    | +--------+   | |    C L A R A    O C R   | |  |
    | |  stop  |   | |                         | |  |
    | +--------+   | |                         | |  |
    |      .       | |                         | |  |
    |      .       | |                         | |  |
    |              | |                         | |  |
    |              | |                         | |  |
    |              | +-------------------------+ |  |
    |              +-----------------------------+  |
    |                                               |
    | (status line)                                 |
    +-----------------------------------------------+</PRE>
</TD></TR></TABLE></CENTER>
<A NAME=4.2>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>4.2 Tabs and windows</B></FONT></TD></TR></TABLE>
<P>
Three tabs are offered, and each one may operate in one or more
"modes". For instance, pressing the PATTERN tab many times will
cycle through two modes: one presenting the windows "pattern" and
"pattern (props)" and another with the window "pattern
(list)".

<P>
On each tab, Clara OCR displays on the plate one or more
windows. Each such window is called a "document window" to
distinguish it from the application window. Each such window
is supposed to be displaying a portion of a larger document, for
instance

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    The scanned page (graphic)
    The OCR output (text)
    The list of pages (text)
    The list of patterns (text)
    The symbol description (text)</PRE>
</TD></TR></TABLE></CENTER>
Unless the user hides them, two scrollbars are displayed for each
document window, one horizontal and one vertical. On each one, a
cursor is drawn to show the relative portion of the full document
currently visible on the display.
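<P>
The proportion shown by the scrollbar cursor can be computed as in
the Python sketch below. This is a generic illustration of the idea,
not the drawing code of Clara OCR; the function name, the integer
rounding and the minimum cursor length of one pixel are assumptions.

```python
def scrollbar_cursor(document_size, visible_start, visible_size, bar_length):
    """Return (offset, length) of the cursor along a scrollbar.

    All arguments use one common unit (pixels, for instance). The
    cursor covers the same fraction of the bar as the visible part
    covers of the document.
    """
    if document_size <= 0 or visible_size >= document_size:
        return 0, bar_length                # the whole document is visible
    offset = bar_length * visible_start // document_size
    length = max(1, bar_length * visible_size // document_size)
    return offset, length
```

<P>
The same function serves both scrollbars: it is called once with
widths for the horizontal one and once with heights for the vertical
one.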

<P>
All available tabs and the modes for each one are listed
below. The numbers (1, 2, etc) are there only to make it easier to
distinguish one mode from the others. There is no effective
association between the modes and the numbers.

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

     tab      mode      windows
    -------------------------------

               1       WELCOME

               2       GPL

    page       3       PAGE_LIST

               4       PAGE
                       PAGE_OUTPUT
                       PAGE_SYMBOL

               5       PAGE_FATBITS
                       PAGE_MATCHES

    pattern    6       PATTERN

               7       PATTERN_LIST

               8       PATTERN_TYPES

    tune       9       TUNE

              10       TUNE_PATTERN
                       TUNE_SKEL

              11       TUNE_ACTS</PRE>
</TD></TR></TABLE></CENTER>
Note that the windows WELCOME and GPL have no corresponding
tab. When these windows are displayed, there is no active
tab. Except in these cases, the name of the current window is
always presented as the label of the active tab.
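<P>
The relation between tabs, modes and windows can be modelled with
the Python sketch below. The table of modes is copied from the
listing above; the class is hypothetical, and two behaviours are
ASSUMED rather than documented here: that the modes cycle in the
order listed, and that switching to another tab starts at its first
listed mode.

```python
# Tab name, then its modes; each mode is the list of windows it shows.
# WELCOME and GPL have no tab and are therefore omitted.
MODES = {
    "page": [["PAGE_LIST"],
             ["PAGE", "PAGE_OUTPUT", "PAGE_SYMBOL"],
             ["PAGE_FATBITS", "PAGE_MATCHES"]],
    "pattern": [["PATTERN"], ["PATTERN_LIST"], ["PATTERN_TYPES"]],
    "tune": [["TUNE"], ["TUNE_PATTERN", "TUNE_SKEL"], ["TUNE_ACTS"]],
}


class Plate:
    """Tracks the active tab and which of its modes is displayed."""

    def __init__(self):
        self.tab = None     # no active tab while WELCOME or GPL is shown
        self.index = 0

    def press(self, tab):
        """Press a tab and return the windows of the resulting mode."""
        if tab == self.tab:
            # Pressing the active tab again advances to its next mode.
            self.index = (self.index + 1) % len(MODES[tab])
        else:
            # ASSUMPTION: a newly selected tab starts at its first mode.
            self.tab, self.index = tab, 0
        return MODES[tab][self.index]
```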

<P>
<A NAME=4.3>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>4.3 The Application Buttons</B></FONT></TD></TR></TABLE>
<P>
The application buttons are those displayed on the
left portion of the Clara X window. They're
labelled "zoom", "OCR", etc. Three types of
buttons are available. There are on/off buttons
(like "italic"), multi-state buttons (like the
alphabet button), where the state is shown by
the current label, and there are buttons that
merely capture mouse clicks, like the "zoom"
button. Some are sensitive both to mouse button 1
and to mouse button 2, others are sensitive only to
mouse button 1.

<P>

<P>
zoom - enlarge or reduce bitmaps. The mouse button 1
enlarges bitmaps, the mouse button 2 reduces bitmaps.
The bitmaps to enlarge or reduce are determined
by the current window. If the PAGE window is active,
then the scanned document is enlarged or reduced.
If the PAGE (fatbits) or the PATTERN window is active,
then the grid is enlarged or reduced. If the PAGE
(symbol) or the PATTERN (props) or the PATTERN (list)
window is active, then the web clip is enlarged or
reduced.

<P>

<P>
OCR - start a full OCR run on the current page or on
all pages, depending on the state of the "Work on
current page only" item of the Options menu.

<P>

<P>
stop - stop the current OCR run (if any). OCR
does not stop immediately, but will stop
as soon as possible.

<P>

<P>
zone - start definition of the OCR zone. Currently
zoning in Clara OCR is useful only for saving the
zone as a PBM file, using the "save zone" item
on the "File" menu. For now, only one zone can be
defined and the OCR operations consider the
entire document, ignoring the zone.

<P>

<P>
type - read-only button, set according to the pattern
type of the current symbol or pattern. The various letter
sizes or styles (normal, footnote, etc) used by the book are
numbered from 0 by Clara OCR ("type 0", "type 1", etc).

<P>

<P>
bold - read-only button, set according to the bold
status of the current symbol or pattern.

<P>

<P>
italic - read-only button, set according to the italic
status of the current symbol or pattern.

<P>

<P>
bad - toggles the button state. The bad flag
is used to identify damaged bitmaps.

<P>

<P>
latin/greek/etc - read-only button, set according to
the alphabet of the current symbol or pattern.

<P>

<P>
<A NAME=4.4>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>4.4 The Alphabet Map</B></FONT></TD></TR></TABLE>
<P>
When the "Show alphabet map" option of the "View" menu is selected,
the GUI will include an alphabet map between the buttons and the
plate. This map presents all symbols from the current alphabet. The
current alphabet is selected using the alphabet button. The alphabet
button cycles through all alphabets selected on the "Alphabets" menu.

<P>
Clara OCR offers initial support for multiple alphabets. To become
useful, it needs more work. The alphabet map currently does not offer
any functionality. For some alphabets (Cyrillic and Arabic) the
alphabet map is disabled on the source code due to the large alphabet
size. Currently Clara OCR does not contain bitmaps for displaying
Katakana.

<P>

<P>
<A NAME=5.>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#79BEC6><FONT SIZE=+1><B>5. Reference of the menus</B></FONT></TD></TR></TABLE>
<P>
Most menus are accessible from their labels on the menu bar (at the top of the
application window). The labels are "File", "Edit", etc. Other menus
are presented when the user clicks mouse button 2 on some special
places (for instance the "OCR" button). Let's describe all menus and
their entries.

<P>

<P>
<A NAME=5.1>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>5.1 File menu</B></FONT></TD></TR></TABLE>
<P>
This menu is activated from the menu bar on the top of the
application X window.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Load page
</B></TD></TR></TABLE><P>
<P>
Enter the page list to select a page to be loaded.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Save session
</B></TD></TR></TABLE><P>
<P>
Save on disk the page session (file page.session), the patterns
(file "patterns") and the revision acts (file "acts").

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Save zone
</B></TD></TR></TABLE><P>
<P>
Save on disk the current zone as the file zone.pbm.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Write report
</B></TD></TR></TABLE><P>
<P>
Save the contents of the PAGE LIST window to the file
report.txt on the working directory.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Quit the program
</B></TD></TR></TABLE><P>
<P>
Quit the program, first asking whether the session is to
be saved.

<P>
<A NAME=5.2>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>5.2 Edit menu</B></FONT></TD></TR></TABLE>
<P>
This menu is activated from the menu bar on the top of the
application X window.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Only doubts
</B></TD></TR></TABLE><P>
<P>
When selected, the right or the left arrows used on the
PATTERN or the PATTERN PROPS windows will move to the next
or the previous untransliterated patterns.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Re-scan all patterns
</B></TD></TR></TABLE><P>
<P>
When selected, the classification heuristic will retry
all patterns for each symbol. This is required when trying
to resolve the unclassified symbols using a second
classification method.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Fill region
</B></TD></TR></TABLE><P>
<P>
When selected, mouse button 1 will fill the region
around one pixel on the pattern bitmap being edited on the
font tab.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Paint pixel
</B></TD></TR></TABLE><P>
<P>
When selected, mouse button 1 will paint individual
pixels on the pattern bitmap being edited on the font tab.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Clear region
</B></TD></TR></TABLE><P>
<P>
When selected, mouse button 1 will clear the region
around one pixel on the pattern bitmap being edited on the
font tab.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Clear pixel
</B></TD></TR></TABLE><P>
<P>
When selected, mouse button 1 will clear individual
pixels on the pattern bitmap being edited on the font tab.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Sort by page
</B></TD></TR></TABLE><P>
<P>
When selected, the pattern list window will divide
the patterns into blocks according to their source
pages.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Sort by matches
</B></TD></TR></TABLE><P>
<P>
When selected, the pattern list window will use as the
first criterion when sorting the patterns, the number of
matches of each pattern.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Sort by transliteration
</B></TD></TR></TABLE><P>
<P>
When selected, the pattern list window will use as the
second criterion when sorting the patterns, their
transliterations.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Sort by number of pixels
</B></TD></TR></TABLE><P>
<P>
When selected, the pattern list window will use as the
third criterion when sorting the patterns, their
number of pixels.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Sort by width
</B></TD></TR></TABLE><P>
<P>
When selected, the pattern list window will use as the
fourth criterion when sorting the patterns, their
widths.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Sort by height
</B></TD></TR></TABLE><P>
<P>
When selected, the pattern list window will use as the
fifth criterion when sorting the patterns, their
heights.
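
<P>
The five "Sort by" items above behave like keys of one multi-level
sort, each with a fixed priority. The following Python fragment is
only a sketch of that idea, not Clara's code: the record fields, the
function name and the ascending direction of every key are
assumptions made for illustration.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    """Hypothetical pattern record; field names are illustrative only."""
    matches: int
    transliteration: str
    pixels: int
    width: int
    height: int


# Fixed priority of the criteria, first to fifth, as listed on the Edit menu.
PRIORITY = ("matches", "transliteration", "pixels", "width", "height")


def sort_patterns(patterns, selected):
    """Sort by the selected criteria only, keeping their fixed priority.

    Ascending order is assumed for every key; the manual does not say
    which direction Clara actually uses.
    """
    keys = [name for name in PRIORITY if name in selected]
    return sorted(patterns, key=lambda p: tuple(getattr(p, k) for k in keys))


sample = [
    Pattern(3, "b", 120, 10, 14),
    Pattern(3, "a", 150, 11, 14),
    Pattern(1, "z", 90, 8, 12),
    Pattern(3, "a", 150, 9, 14),
]
ordered = sort_patterns(sample, {"matches", "transliteration", "width"})
```

<P>
With no item selected the key is empty for every pattern, so the
list keeps its original order.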

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Del Untransliterated patterns
</B></TD></TR></TABLE><P>
<P>
Remove from the font all untransliterated patterns.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Reset match counters
</B></TD></TR></TABLE><P>
<P>
Change to zero the cumulative matches field of all
patterns.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Set pattern type.
</B></TD></TR></TABLE><P>
<P>
Set the pattern type for all patterns marked as "other".

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Reset skeleton parameters
</B></TD></TR></TABLE><P>
<P>
Reset the parameters for skeleton computation for all
patterns.

<P>
<A NAME=5.3>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>5.3 View menu</B></FONT></TD></TR></TABLE>
<P>
This menu is activated from the menu bar on the top of the
application X window.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show skeletons
</B></TD></TR></TABLE><P>
<P>
Show the skeletons on the PAGE_FATBITS window. The
skeletons are computed on the fly.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show border
</B></TD></TR></TABLE><P>
<P>
Show the border on the PAGE_FATBITS window. The
border is computed on the fly.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show pattern skeleton
</B></TD></TR></TABLE><P>
<P>
For each symbol, will show the skeleton of its best match
on the PAGE (fatbits) window.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show pattern border
</B></TD></TR></TABLE><P>
<P>
For each symbol, will show the border of its best match
on the PAGE (fatbits) window.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show HTML source
</B></TD></TR></TABLE><P>
<P>
Show the HTML source of the document, instead of the
graphic rendering.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show web clip
</B></TD></TR></TABLE><P>
<P>
Toggle the web clip feature. When enabled, the PAGE_SYMBOL
window will include the clip of the document around the
current symbol that will be used through web revision.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show alphabet map
</B></TD></TR></TABLE><P>
<P>
Toggle the alphabet map display. When enabled, a mapping
from Latin letters to the current alphabet will be
displayed.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show current class
</B></TD></TR></TABLE><P>
<P>
Identify the symbols on the current class using
a gray ellipse.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show closures
</B></TD></TR></TABLE><P>
<P>
Identify the individual closures when displaying the
current document.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show symbols
</B></TD></TR></TABLE><P>
<P>
Identify the individual symbols when displaying the
current document.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show words
</B></TD></TR></TABLE><P>
<P>
Identify the individual words when displaying the
current document.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show lines (geometrical)
</B></TD></TR></TABLE><P>
<P>
Identify the lines (computed using geometrical criteria)
when displaying the current document.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Bold Only
</B></TD></TR></TABLE><P>
<P>
Restrict the show symbols or show words feature to
bold symbols or words.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Italic Only
</B></TD></TR></TABLE><P>
<P>
Restrict the show symbols or show words feature to
italic symbols or words.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show matches
</B></TD></TR></TABLE><P>
<P>
Display bitmap matches when performing OCR.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show comparisons
</B></TD></TR></TABLE><P>
<P>
Display all bitmap comparisons when performing OCR.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show matches and wait
</B></TD></TR></TABLE><P>
<P>
Display bitmap matches when performing OCR,
waiting for a key press after each display.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show comparisons and wait
</B></TD></TR></TABLE><P>
<P>
Display all bitmap comparisons when performing OCR,
waiting for a key press after each display.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Show skeleton tuning
</B></TD></TR></TABLE><P>
<P>
Display each candidate when tuning the skeletons of the
patterns.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Presentation
</B></TD></TR></TABLE><P>
<P>
Perform a presentation. This item is visible on the menu
only when the program is started with the -A option.

<P>
<A NAME=5.4>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>5.4 Alphabets menu</B></FONT></TD></TR></TABLE>
<P>
This menu selects the alphabets that will be available on the
alphabet button.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Arabic
</B></TD></TR></TABLE><P>
<P>
This is a provision for future support of Arabic
alphabet.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Cyrillic
</B></TD></TR></TABLE><P>
<P>
This is a provision for future support of Cyrillic
alphabet.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Greek
</B></TD></TR></TABLE><P>
<P>
This is a provision for future support of Greek
alphabet.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Hebrew
</B></TD></TR></TABLE><P>
<P>
This is a provision for future support of Hebrew
alphabet.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Kana
</B></TD></TR></TABLE><P>
<P>
This is a provision for future support of Kana
alphabet.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Latin
</B></TD></TR></TABLE><P>
<P>
Words that use the Latin alphabet include those from
the languages of most Western European countries (English,
German, French, Spanish, Portuguese and others).

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Number
</B></TD></TR></TABLE><P>
<P>
Indo-arabic decimal numbers like
1234, +55-11-12345678 or 2000.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Ideogram
</B></TD></TR></TABLE><P>
<P>
Ideograms.

<P>
<A NAME=5.5>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>5.5 Options menu</B></FONT></TD></TR></TABLE>
<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Work on current page only
</B></TD></TR></TABLE><P>
<P>
OCR operations (classification, merge, etc) will be
performed only on the current page.

<P>

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Work on all pages
</B></TD></TR></TABLE><P>
<P>
OCR operations (classification, merge, etc) will be
performed on all pages.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Small font
</B></TD></TR></TABLE><P>
<P>
Use a small X font (6x13).

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Medium font
</B></TD></TR></TABLE><P>
<P>
Use the medium font (9x15).

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Large font
</B></TD></TR></TABLE><P>
<P>
Use a large X font (10x20).

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Default font
</B></TD></TR></TABLE><P>
<P>
Use the default font (7x13 or "fixed" or the one informed
on the command line).

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Hide scrollbars
</B></TD></TR></TABLE><P>
<P>
Toggle the hide scrollbars flag. When active, this
flag hides the scrollbars on all windows.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Omit fragments
</B></TD></TR></TABLE><P>
<P>
Toggle the hide fragments flag. When active,
fragments won't be included on the list of
patterns.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Report scale
</B></TD></TR></TABLE><P>
<P>
Report the scale on the tab when the PAGE window is
active.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Display box instead of symbol
</B></TD></TR></TABLE><P>
<P>
On the PAGE window displays the bounding boxes instead of
the symbols themselves. This is useful when designing new
heuristics.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Emulate deadkeys
</B></TD></TR></TABLE><P>
<P>
Toggle the emulate deadkeys flag. Deadkeys are useful for
generating accented characters. Deadkey emulation is disabled
by default. It may be enabled on startup
through the -i command-line switch.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Menu auto popup
</B></TD></TR></TABLE><P>
<P>
Toggle the automenu feature. When enabled, the menus on the
menu bar will pop up automatically when the pointer reaches
the menu bar.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Search unexpected mismatches
</B></TD></TR></TABLE><P>
<P>
Compare all patterns with the same type and transliteration. Must
be used with the "Show comparisons and wait" option on,
to diagnose symbol comparison problems.
<A NAME=5.6>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>5.6 Window options menu</B></FONT></TD></TR></TABLE>
<P>
This menu is activated when the mouse button 2 is pressed on
the window that displays the scanned document (that is, the
PAGE or the PAGE FATBITS window).

<P>

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Use as pattern
</B></TD></TR></TABLE><P>
<P>
The pattern of the class of this symbol will be the unique
pattern used on all subsequent symbol classifications. This
feature is intended to be used with the "OCR this symbol"
feature, so it becomes possible to choose two symbols to
be compared, in order to test the classification routines.

<P>

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>OCR this symbol
</B></TD></TR></TABLE><P>
<P>
Starts classifying only the symbol under the pointer. The
classification will re-scan all patterns even if the "re-scan
all patterns" option is unselected.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Bottom left here
</B></TD></TR></TABLE><P>
<P>
Scroll the window contents so that the
current pointer position becomes the bottom left corner.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Merge with current symbol
</B></TD></TR></TABLE><P>
<P>
Merge this fragment with the current symbol.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Link as next symbol
</B></TD></TR></TABLE><P>
<P>
Create a symbol link from the current symbol (the one
identified by the graphic cursor) to this symbol.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Disassemble
</B></TD></TR></TABLE><P>
<P>
Make the current symbol nonpreferred and each of its
components preferred.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Link as accent
</B></TD></TR></TABLE><P>
<P>
Create an accent link from the current symbol (the one
identified by the graphic cursor) to this symbol.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Diagnose symbol pairing
</B></TD></TR></TABLE><P>
<P>
Run locally the symbol pairing heuristic to try to join
this symbol to the word containing the current symbol.
This is useful to find out why the OCR is not joining two
symbols into the same word.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Diagnose word pairing
</B></TD></TR></TABLE><P>
<P>
Run locally the word pairing heuristic to try to join
this word with the word containing the current symbol
on the same line. This is useful to find out why the OCR is
not joining two words on the same line.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Diagnose merging
</B></TD></TR></TABLE><P>
<P>
Run locally the geometrical merging heuristic to try
to merge this piece to the current symbol.

<P>
<A NAME=5.7>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#E2D3FC><FONT SIZE=+1><B>5.7 OCR steps menu</B></FONT></TD></TR></TABLE>
<P>
This menu is activated when the mouse button 2 is pressed on
the OCR button. It allows running specific OCR steps (all
steps run in sequence when the OCR button is pressed).

<P>

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Consist structures
</B></TD></TR></TABLE><P>
<P>
All OCR data structures are submitted to consistency
tests. This is under implementation.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Prepare patterns
</B></TD></TR></TABLE><P>
<P>
Compute the skeletons and analyse the patterns for
the achievement of best results by the classifier. Not
fully implemented yet.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Read revision data
</B></TD></TR></TABLE><P>
<P>
Revision data from the web interface is read, and
added to the current OCR training knowledge.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Detect blocks
</B></TD></TR></TABLE><P>
<P>
Start detecting blocks of text on the page.

<P>

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Classification
</B></TD></TR></TABLE><P>
<P>
Start classifying the symbols of the current
page or of all pages, depending on the state of the
"Work on current page only" item of the Options menu. It
will also build the font automatically if the
corresponding item is selected on the Options menu.

<P>

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Geometric merging
</B></TD></TR></TABLE><P>
<P>
Merge closures on symbols depending on their
geometry.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Build words and lines
</B></TD></TR></TABLE><P>
<P>
Start building the words and lines. These heuristics will
be applied on the
current page or on all pages, depending on the state of
the "Work on current page only" item of the Options menu.

<P>

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Generate spelling hints
</B></TD></TR></TABLE><P>
<P>
Note: this is not implemented yet.

<P>
Start filtering through ispell to generate
transliterations for unknown symbols or alternative
transliterations for known symbols. Clara will use the
dictionaries available for the languages selected on
the Languages menu. Filtering will be performed on the
current page or on all pages, depending on the state of
the "Work on current page only" item of the Options menu.

<P>

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Generate output
</B></TD></TR></TABLE><P>
<P>
The OCR output is generated to be displayed on the
"PAGE (output)" window. The output is also saved to the
file page.html.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>Generate web doubts
</B></TD></TR></TABLE><P>
<P>
Files containing symbols to be revised through the web
interface are created on the "doubts" subdirectory of
the work directory. This step is performed only when
Clara OCR is started with the -W command-line switch.

<P>
<A NAME=6.>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#79BEC6><FONT SIZE=+1><B>6. Reference of command-line switches</B></FONT></TD></TR></TABLE>
<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-a bf_auto,st_auto,classifier
</B></TD></TR></TABLE><P>Bookfont handling options

<P>

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-b
</B></TD></TR></TABLE><P>Run in batch mode.

<P>
The application window will not be created, and the OCR
will automatically execute a full OCR run on all pages
(or on the page specified through -f).

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-C 0|1
</B></TD></TR></TABLE><P>Mode for displaying matches.

<P>
This option selects how the matches of symbols and
patterns from the bookfont are displayed. Mode 0 will
use black for the symbol pixels and gray for
the pattern skeleton pixels. Mode 1 will swap
these colors.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-c c|black,gray,white,darkgray,vdgray
</B></TD></TR></TABLE><P>Choose the colors to be used by the window components.

<P>
The Clara X window uses only five colors, internally called
"white", "black", "gray", "darkgray" and "vdgray" ("very dark
gray"). There are two predefined schemes to map these
internal colors into RGB values: "c" (color) and
the default (grayscale). Alternatively, the mapping may
be given explicitly, as five color names separated by commas.
The notation #RRGGBB is not supported; RGB values must be
specified through color names known by the X server (e.g.
"brown", "pink", "navyblue", etc; see the file
/etc/X11R6/lib/X11/rgb.txt). The following example
specifies the default mapping:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    -c black,gray80,white,gray60,gray40</PRE>
</TD></TR></TABLE></CENTER>
To simulate reverse video try:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    -c white,gray40,black,gray60,gray80</PRE>
</TD></TR></TABLE></CENTER>
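<P>
The explicit form of -c can be read as a positional mapping from the
five internal colors to X color names. The Python fragment below is a
sketch of that reading, not Clara's parser: the function name is
hypothetical, and the "c" scheme is not handled because its colors are
not listed here.

```python
# Internal color names, in the order used by the -c switch.
INTERNAL_COLORS = ("black", "gray", "white", "darkgray", "vdgray")


def parse_color_option(arg):
    """Map the five internal colors to the X color names given in `arg`.

    `arg` is the text following -c, e.g. "black,gray80,white,gray60,gray40".
    The #RRGGBB notation is rejected, as the manual states it is unsupported.
    """
    names = arg.split(",")
    if len(names) != len(INTERNAL_COLORS):
        raise ValueError("expected five comma-separated color names")
    for name in names:
        if not name or name.startswith("#"):
            raise ValueError("use an X color name, not #RRGGBB: %r" % name)
    return dict(zip(INTERNAL_COLORS, names))


default_map = parse_color_option("black,gray80,white,gray60,gray40")
reverse_map = parse_color_option("white,gray40,black,gray60,gray80")
```

<P>
In the reverse video example the internal "black" is drawn in white
and the internal "white" in black, while "darkgray" stays at gray60.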
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-D
</B></TD></TR></TABLE><P>X display to connect to (by default, taken from the environment
variable DISPLAY).

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-d
</B></TD></TR></TABLE><P>Run in debug mode. Debug messages will be sent to stderr.
Debug messages are generated when an acceptable but
not reasonable event is detected.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-e reviewer,type
</B></TD></TR></TABLE><P>Reviewer and reviewer type.

<P>
All revision data is assigned by Clara to its originator.
By default the reviewer name is "nobody" and its type
is "A".

<P>
The reviewer will generally be an email address or a
nickname. The type may be T (trusted), A (arbiter) or
N (anonymous). Example:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    -e ueda@ime.usp.br,T</PRE>
</TD></TR></TABLE></CENTER>
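<P>
As an illustration of the -e argument format, the following Python
fragment splits the argument into reviewer and type and applies the
documented defaults. It is a sketch only: the function name is
hypothetical, and requiring both fields is an assumption (the manual
does not say whether the type may be omitted).

```python
REVIEWER_TYPES = {"T": "trusted", "A": "arbiter", "N": "anonymous"}


def parse_reviewer(arg=None):
    """Return (reviewer, type); without -e the defaults are "nobody" and "A"."""
    if arg is None:
        return ("nobody", "A")
    # Split on the last comma, so the reviewer name itself may contain commas.
    name, sep, rtype = arg.rpartition(",")
    if not sep or not name or rtype not in REVIEWER_TYPES:
        raise ValueError("expected reviewer,type with type T, A or N")
    return (name, rtype)


reviewer, rtype = parse_reviewer("ueda@ime.usp.br,T")
```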
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-F fontname
</B></TD></TR></TABLE><P>The X font to use (must be a font with fixed column size,
e.g. "fixed" or "9x15").

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-f path
</B></TD></TR></TABLE><P>Scanned page or page directory. Defaults
to the current directory.

<P>
The argument must be a pbm file (with absolute or relative
path) or the path (absolute or relative) of the directory
where the pbm file(s) was (were) placed.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-G list
</B></TD></TR></TABLE><P>Page geometric parameters used as hints for detecting
blocks and other heuristics.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-g wxh(+|-)x(+|-)y
</B></TD></TR></TABLE><P>X geometry.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-h
</B></TD></TR></TABLE><P>Display short help and exit.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-i
</B></TD></TR></TABLE><P>Emulate dead keys functionality.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-k list
</B></TD></TR></TABLE><P>Parameters SA,RR,MA,MP,ML,MB,RX,BT used to compute
skeletons.

<P>
BUG: these parameters are ignored when a "patterns" file
already exists. In this case, Clara will read the
parameters from the "patterns" file.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-p n
</B></TD></TR></TABLE><P>Presentation.

<P>
Automatic presentation of the OCR
features. If n==1, just allow the presentation to be started
from the "View" menu. If n==2, start the presentation
automatically, ignore the delays and exit just after
finishing the presentation.

<P>
The presentation is under development, but it already
offers a small tour. It expects to find the file "imre.pbm" in
the current directory (or the one given through -f).

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-P PNT1,PNT2,MD
</B></TD></TR></TABLE><P>Parameters for filtering symbol comparison.

<P>
PNT1 and PNT2 are the pixel number thresholds. These
thresholds are used to filter out bad candidates
when classifying symbols. The first threshold
is for strong similarity and the second for weak
similarity. The comparison algorithm performs two
passes. The first pass uses PNT1 to filter. The
second pass uses PNT2. So on the first pass only
patterns "quite similar" to the symbol to
classify are tried. On the second pass, we relax
and permit more patterns to be tried. This method
helps to achieve good performance. As PNT1 becomes
larger, fewer patterns will be tried on the first
pass. As PNT2 becomes smaller, more patterns will be
tried on the second pass.

<P>
MD is the maximum clearance to try a skeleton. The
clearance must be an integer in the range 4..30
(default 6). The shape recognition algorithm will
refuse to try to fit a skeleton into a symbol if
their widths or heights differ by more than
twice the clearance.

<P>
Examples:

<P>
<TABLE WIDTH=100%><TR><TD BGCOLOR=#E0E0E0><PRE>

    -P 50,5,8
    -P 40,3,6</PRE>
</TD></TR></TABLE></CENTER>
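<P>
The clearance rule described for MD can be written as a simple size
test. The Python fragment below is a sketch under stated assumptions,
not Clara's implementation: the function names are hypothetical, sizes
are taken as (width, height) pairs in pixels, and the PNT1/PNT2
filtering itself is not modelled.

```python
def parse_filter_option(arg):
    """Split "PNT1,PNT2,MD" into three integers and check the range of MD."""
    pnt1, pnt2, md = (int(field) for field in arg.split(","))
    if not 4 <= md <= 30:
        raise ValueError("MD must be an integer in the range 4..30")
    return pnt1, pnt2, md


def worth_trying(symbol_size, skeleton_size, md=6):
    """Return False when fitting the skeleton should be refused.

    The fit is refused when the widths or the heights differ by more
    than twice the clearance `md` (default 6, as for the -P switch).
    """
    symbol_w, symbol_h = symbol_size
    skel_w, skel_h = skeleton_size
    return abs(symbol_w - skel_w) <= 2 * md and abs(symbol_h - skel_h) <= 2 * md


pnt1, pnt2, md = parse_filter_option("50,5,8")
```

<P>
With the default clearance of 6, a symbol 20 pixels wide is still
compared with a skeleton 30 pixels wide (difference 10), but not with
one 33 pixels wide (difference 13, larger than 12).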
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-R doubts
</B></TD></TR></TABLE><P>Maximum number of doubts per run. The argument must be
an integer (default 30).

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-S n
</B></TD></TR></TABLE><P>Set scrollbar width. Use a negative value to hide
scrollbars on startup.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-T
</B></TD></TR></TABLE><P>Avoid loading and creation of session files. Also reports bookfont size on
stdout before exiting. This option is intended to be used by the selthresh.pl
script.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-t
</B></TD></TR></TABLE><P>Switch on trace messages. Trace messages depict
the execution flow, and are useful for developers.
Trace messages are written to stderr.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-v
</B></TD></TR></TABLE><P>Verbose mode. Without this option, Clara runs quietly
(default). Otherwise, informative warnings about
potentially relevant events are sent to stderr.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-V
</B></TD></TR></TABLE><P>Print version and compilation options and exit.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-W
</B></TD></TR></TABLE><P>Web mode. Will read from the doubts subdir the input
collected from web, and will dump on that same directory
the doubts to be reviewed.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-w path
</B></TD></TR></TABLE><P>Work directory. Defaults to the page
directory (see -f).

<P>
The path of the directory where the OCR will write
the output, the acts, the book font and the session
files. The doubts directory (web operation) is assumed
to be a subdirectory of the work directory.
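
<P>
The defaults of -f and -w chain together: the work directory falls
back to the page directory, which falls back to the current directory.
The Python fragment below sketches that chain. The function name is
hypothetical, and treating the directory of a pbm file as the page
directory is an assumption, not something stated above.

```python
import os


def resolve_dirs(f_arg=None, w_arg=None):
    """Return (page_dir, work_dir, doubts_dir) following the stated defaults."""
    if f_arg is None:
        page_dir = os.curdir                      # -f defaults to the current directory
    elif f_arg.lower().endswith(".pbm"):
        # Assumption: a single page file implies its own directory.
        page_dir = os.path.dirname(f_arg) or os.curdir
    else:
        page_dir = f_arg                          # -f names the page directory itself
    work_dir = w_arg if w_arg is not None else page_dir   # -w defaults to the page directory
    doubts_dir = os.path.join(work_dir, "doubts")         # used for web operation
    return page_dir, work_dir, doubts_dir


dirs_default = resolve_dirs()
dirs_custom = resolve_dirs("scans", "out")
```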

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-X 0|1
</B></TD></TR></TABLE><P>Switch off (0) or on (1) index checking. Index checking is
performed at some critical points in order to detect memory
leaks. Index checking is unavailable when Clara is compiled
with the symbol MEMCHECK undefined.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-y resolution
</B></TD></TR></TABLE><P>Specify the resolution of the scanned image in dots per inch
(default 600). This resolution applies to all pages
processed until the program exits.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-z
</B></TD></TR></TABLE><P>Write (and read) compressed session files (*.session, acts
and patterns will be compressed using GNU zip).

<P>
Be careful: if -z is used, any existing uncompressed file
(*.session, acts or patterns) will be ignored. So if you
start using uncompressed files and later decide to
begin using compressed files, then manually compress all
existing files before starting Clara with the -z switch.

<P>
Clara OCR support for reading and writing compressed files
depends on the platform, and requires gzip and gunzip to
be installed in some directory of binaries included in
the PATH.

<P>
<P><TABLE BORDER=1><TR><TD BGCOLOR=#F0F0F0><B>-Z ZPS
</B></TD></TR></TABLE><P>ZPS, that is, the size of the bitmap pixels measured in
display pixels, when in fat bit mode. Must be an odd
integer.
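
<P>
To make the meaning of ZPS concrete, the Python fragment below
enlarges a small bitmap so that each bitmap pixel covers ZPS x ZPS
display pixels. It is a sketch only: the function name is
hypothetical, the bitmap is a plain list of rows, and any grid or
cursor decoration drawn in fat bit mode is ignored.

```python
def fatbits(bitmap, zps):
    """Enlarge `bitmap` so that each pixel covers zps x zps display pixels."""
    if zps < 1 or zps % 2 == 0:
        raise ValueError("ZPS must be an odd positive integer")
    out = []
    for row in bitmap:
        # Repeat each pixel zps times horizontally...
        wide = [pixel for pixel in row for _ in range(zps)]
        # ...and repeat the widened row zps times vertically.
        out.extend(list(wide) for _ in range(zps))
    return out


enlarged = fatbits([[1, 0]], 3)
```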
<A NAME=7.>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#79BEC6><FONT SIZE=+1><B>7. AVAILABILITY</B></FONT></TD></TR></TABLE>
<P>
Clara OCR is free software. Its source code is distributed under
the terms of the GNU GPL (General Public License), and is
available at <A HREF=http://www.claraocr.org/>http://www.claraocr.org/</A>. If you don't know what the GPL is,
please read it and check the GPL FAQ at
<A HREF=http://www.gnu.org/copyleft/gpl-faq.html>http://www.gnu.org/copyleft/gpl-faq.html</A>. You should have
received a copy of the GNU General Public License along with this
software; if not, write to the Free Software Foundation, Inc., 59
Temple Place - Suite 330, Boston, MA 02111-1307, USA. The Free
Software Foundation can be found at <A HREF=http://www.fsf.org>http://www.fsf.org</A>.

<P>

<P>
<A NAME=8.>
<P><TABLE BORDER=1 WIDTH=100%><TR><TD BGCOLOR=#79BEC6><FONT SIZE=+1><B>8. CREDITS</B></FONT></TD></TR></TABLE>
<P>
Clara OCR was written by Ricardo Ueda Karpischek. Imre Simon
contributed high-volume tests, discussions with experts,
selection of bibliographic resources, propaganda and many ideas
on how to make the software more useful.

<P>
Ricardo authored various free materials, some included in
Conectiva, Debian, FreeBSD and SuSE (the verb conjugator
"conjugue", the ispell dictionary br.ispell and the proxy
axw3). He recently ported the EiC interpreter to the Psion 5
handheld. Imre Simon promotes the usage and development of free
technologies and information from his research, teaching and
administrative labour at the University.

<P>
Ricardo Ueda Karpischek works as an independent developer and
instructor, and received no financial support to develop Clara
OCR. He's not an employee of any company or organization.

<P>
Roberto Hirata Junior and Marcelo Marcilio Silva contributed
ideas on character isolation and recognition. Richard Stallman
suggested improvements on how to generate HTML output. Marius
Vollmer is helping to add Guile support. Jacques Le Marois helped
with the announcement process. We acknowledge Mike O'Donnell and Junior
Barrera for their good criticism. We acknowledge Peter Lyman for
his remarks about the Berkeley Digital Library, and Wanderley
Antonio Cavassin, Janos Simon and Roberto Marcondes Cesar Junior
for some web and bibliographic pointers. Bruno Barbieri Gnecco
provided hints and explanations about GOCR (main author: Jorg
Schulenburg). Luis Jose Cearra Zabala (author of OCRE) is kindly
supporting our attempts to use portions of his code. Adriano
Nagelschmidt Rodrigues and Carlos Juiti Watanabe carefully tried
the tutorial before the first announcement. Eduardo Marcel Macan
packaged Clara OCR for Debian and suggested some
improvements. Mandrakesoft is hosting claraocr.org. We
acknowledge Conectiva and SuSE for providing copies of their
outstanding distributions. Finally, we acknowledge the late Jose
Hugo de Oliveira Bussab for his interest in our work.

<P>
The fonts used by the "view alphabet map" feature came from
Roman Czyborra's "The ISO 8859 Alphabet Soup" page at
<A HREF=http://czyborra.com/charsets/iso8859.html>http://czyborra.com/charsets/iso8859.html</A>.

<P>
See also the Changelog (<A HREF=http://www.claraocr.org/CHANGELOG>http://www.claraocr.org/CHANGELOG</A>).

<P>
</HR></BODY></HTML>