tesseract  3.03
/usr/local/google/home/jbreiden/tesseract-ocr-read-only/textord/cjkpitch.h
// File:        cjkpitch.h
// Description: Code to determine fixed pitchness and the pitch if fixed,
//              for CJK text.
// Copyright 2011 Google Inc. All Rights Reserved.
// Author: takenaka@google.com (Hiroshi Takenaka)
// Created:     Mon Jun 27 12:48:35 JST 2011
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//

#ifndef CJKPITCH_H_
#define CJKPITCH_H_

#include "blobbox.h"

// Function to test the "fixed-pitchness" of the input text and to
// estimate character pitch parameters for it, based on a CJK
// fixed-pitch layout model.
//
// This function assumes that fixed-pitch CJK text has the following
// characteristics:
//
// - Most glyphs are designed to fit within the same sized square
//   (imaginary body). Also they are aligned to the center of their
//   imaginary bodies.
// - The imaginary body is always a regular rectangle.
// - There may be some extra space between character bodies
//   (tracking).
// - There may be some extra space after punctuation.
// - The text is *not* space-delimited. Thus spaces are rare.
// - A character may consist of multiple unconnected blobs.
//
// The function works in two passes. On pass 1, it looks for "good"
// blobs that have the same pitch on both sides and look like complete
// CJK characters. It then estimates the character pitch for every
// row, based on those good blobs. If not enough good blobs can be
// found for a row, the pitch is estimated from other rows with
// similar character height instead.
//
// Pass 2 is an iterative process to fit the blobs into fixed-pitch
// character cells. Once we have estimated the character pitch, blobs
// that are almost as large as the pitch can be considered to be
// complete characters. And once we know that some blobs are complete
// characters, we can estimate the regions occupied by their
// neighbors. And so on.
//
// We repeat the process until all ambiguities are resolved. Then we
// make the final decision about the fixed-pitchness of each row and
// compute pitch and spacing parameters.
//
// (If a row is considered to be proportional, pitch_decision for the
// row is set to PITCH_CORR_PROP and a later phase
// (i.e. Textord::to_spacing()) should determine its spacing
// parameters.)
//
// This function doesn't provide all the information required by
// fixed_pitch_words(), so the rows need to be processed with
// make_prop_words() even if they are fixed pitch.
void compute_fixed_pitch_cjk(ICOORD page_tr,               // top right
                             TO_BLOCK_LIST *port_blocks);  // input list

#endif  // CJKPITCH_H_
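
The pass 1 idea described in the header comment (pick "good" blobs whose pitch is the same on both sides and that look like complete characters, then estimate the row pitch from them) can be illustrated with a small standalone sketch. This is not the code in cjkpitch.cpp. The BlobSpan type, the EstimatePitchFromGoodBlobs function, the use of a median, and the two threshold constants are all hypothetical, chosen only to make the idea concrete.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the horizontal extent of one blob.
// This is NOT a Tesseract type.
struct BlobSpan {
  int left;
  int right;
};

inline double Center(const BlobSpan& b) { return (b.left + b.right) / 2.0; }

// Estimates the character pitch of one row from its "good" blobs.
// `row` must be sorted from left to right. Returns 0.0 when fewer than
// `min_good` good blobs are found, in which case a caller would have to
// fall back to another source (e.g. rows with similar character height).
double EstimatePitchFromGoodBlobs(const std::vector<BlobSpan>& row,
                                  double row_height,
                                  std::size_t min_good = 3) {
  const double kPitchTolerance = 0.1;  // assumed: pitches may differ by 10%
  const double kMinWidthRatio = 0.6;   // assumed: "complete" if >= 60% of height

  std::vector<double> pitches;
  // Only interior blobs have a neighbor on both sides.
  for (std::size_t i = 1; i + 1 < row.size(); ++i) {
    const double left_pitch = Center(row[i]) - Center(row[i - 1]);
    const double right_pitch = Center(row[i + 1]) - Center(row[i]);
    const double width = row[i].right - row[i].left;

    const bool same_pitch =
        std::fabs(left_pitch - right_pitch) <=
        kPitchTolerance * std::max(left_pitch, right_pitch);
    const bool looks_complete = width >= kMinWidthRatio * row_height;

    if (same_pitch && looks_complete) {
      pitches.push_back((left_pitch + right_pitch) / 2.0);
    }
  }
  if (pitches.size() < min_good) return 0.0;

  // Use the median so that a few outliers do not skew the estimate.
  const std::size_t mid = pitches.size() / 2;
  std::nth_element(pitches.begin(), pitches.begin() + mid, pitches.end());
  return pitches[mid];
}
```

For a row of five blobs that are each 18 pixels wide and start every 20 pixels, with a row height of 20, the three interior blobs all qualify as good and the estimate is a pitch of 20. A row with only two blobs has no interior blob, so the function returns 0.0 and the pitch would need to come from elsewhere.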