Tuesday, October 30, 2012

Document Structure and Layout Analysis

A document image is composed of a variety of physical entities or regions such as text blocks, lines, words, figures, tables, and background. We could also assign functional or logical labels such as sentences, titles, captions, author names, and addresses to some of these regions. The process of document structure and layout analysis tries to decompose a given document image into its component regions and understand their functional roles and relationships. The processing is carried out in multiple steps, such as preprocessing, page decomposition, and structure understanding. We will look into each of these steps in detail in the following sections.

Document images are often generated from physical documents by digitization using scanners or digital cameras. Many documents, such as newspapers, magazines and brochures, have very complex layouts due to the placement of figures, titles, and captions, complex backgrounds, artistic text formatting, etc. (see Figure 1). A human reader uses a variety of additional cues such as context, conventions and information about language/script, along with a complex reasoning process, to decipher the contents of a document. Automatic analysis of an arbitrary document with a complex layout is an extremely difficult task and is beyond the capabilities of state-of-the-art document structure and layout analysis systems. This is notable, since documents, unlike natural images, are designed to be clear and effective for human interpretation.

Fig. 1. Examples of document images with complex layouts.

As mentioned before, we distinguish between the physical layout of a document and its logical structure [4]. One could also divide the document analysis process into two parts accordingly.

1.1 Physical Layout and Logical Structure

The physical layout of a document refers to the physical location and boundaries of various regions in the document image. The process of Document Layout Analysis aims to decompose a document image into a hierarchy of homogeneous regions, such as figures, background, text blocks, text lines, words, characters, etc. The algorithms for layout analysis can be classified primarily into two groups depending on their approach. Bottom-up algorithms start with the smallest components of a document (pixels or connected components) and repeatedly group them to form larger, homogeneous regions. In contrast, top-down algorithms start with the complete document image and divide it repeatedly to form smaller and smaller regions. Each approach has its own advantages, and each works well in specific situations.
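
To make the top-down idea concrete, here is a minimal sketch of a recursive projection-profile split (often called an XY cut). It is one illustrative top-down strategy, not necessarily the method the source document goes on to describe. The function names (`trim`, `widest_gap`, `xy_cut`), the `min_gap` parameter, and the toy binary page are all assumptions made for this example.

```python
def trim(img, top, bottom, left, right):
    """Shrink a box to the bounding box of its ink pixels; None if empty."""
    rows = [r for r in range(top, bottom) if any(img[r][left:right])]
    if not rows:
        return None
    cols = [c for c in range(left, right)
            if any(img[r][c] for r in range(top, bottom))]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def widest_gap(profile, min_gap):
    """Return (start, end) of the longest run of zeros of length >= min_gap."""
    best, start = None, None
    for i, value in enumerate(profile + [1]):  # sentinel closes a trailing run
        if value == 0:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_gap and (best is None or i - start > best[1] - best[0]):
                best = (start, i)
            start = None
    return best


def xy_cut(img, box=None, min_gap=2):
    """Recursively split a binary image (1 = ink) at wide white gaps.

    Returns leaf regions as (top, bottom, left, right), half-open ranges.
    """
    if box is None:
        box = (0, len(img), 0, len(img[0]))
    box = trim(img, *box)
    if box is None:
        return []
    top, bottom, left, right = box

    # Try a horizontal cut first: look for a band of empty rows.
    row_profile = [sum(img[r][left:right]) for r in range(top, bottom)]
    gap = widest_gap(row_profile, min_gap)
    if gap:
        return (xy_cut(img, (top, top + gap[0], left, right), min_gap)
                + xy_cut(img, (top + gap[1], bottom, left, right), min_gap))

    # Otherwise try a vertical cut: look for a band of empty columns.
    col_profile = [sum(img[r][c] for r in range(top, bottom))
                   for c in range(left, right)]
    gap = widest_gap(col_profile, min_gap)
    if gap:
        return (xy_cut(img, (top, bottom, left, left + gap[0]), min_gap)
                + xy_cut(img, (top, bottom, left + gap[1], right), min_gap))

    return [box]  # no gap wide enough: treat the box as one homogeneous region


# Toy page: a full-width title above two columns of text.
page = [
    "1111111111",
    "0000000000",
    "0000000000",
    "1111001111",
    "1111001111",
]
img = [[int(ch) for ch in row] for row in page]
print(xy_cut(img, min_gap=2))
```

On the toy page this yields three regions: the title row and the two columns beneath it. A bottom-up method would instead start from connected components and merge neighbours; the sketch above does not attempt that.
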
In addition, one could also employ a hybrid approach that uses a combination of top-down and bottom-up strategies.

Beyond the physical layout, documents carry additional information about their contents, such as titles, paragraphs, captions, etc. Such labels are logical or functional in nature, as opposed to the structural labels of regions assigned by layout analysis. Most documents also have a notion of reading order, which is a sequencing of the textual contents that makes comprehension of the document easier. Languages such as Arabic and Chinese can have different reading directions as well (right-to-left, top-to-bottom). The set of logical or functional entities in a document, along with their inter-relationships, is referred to as the Logical Structure of the document. The analysis...
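
As a rough illustration of how reading direction affects reading order, the sketch below sorts region boxes under three directions. It is a deliberately naive heuristic that only works when blocks line up in clean rows or columns. The function name `reading_order`, the direction codes, and the box format (top, bottom, left, right) are assumptions made for this example, not taken from the source.

```python
def reading_order(blocks, direction="ltr"):
    """Sort (top, bottom, left, right) boxes into a naive reading order.

    direction: "ltr" = rows read left-to-right,
               "rtl" = rows read right-to-left,
               "ttb" = vertical columns, read top-to-bottom, columns right-to-left.
    """
    if direction == "ltr":
        key = lambda b: (b[0], b[2])    # top edge first, then left edge
    elif direction == "rtl":
        key = lambda b: (b[0], -b[3])   # top edge first, then rightmost first
    elif direction == "ttb":
        key = lambda b: (-b[3], b[0])   # rightmost column first, then top edge
    else:
        raise ValueError(f"unknown direction: {direction}")
    return sorted(blocks, key=key)


# A title spanning the page, with two blocks side by side below it.
blocks = [(3, 5, 6, 10), (0, 1, 0, 10), (3, 5, 0, 4)]
print(reading_order(blocks, "ltr"))
print(reading_order(blocks, "rtl"))
```

Real documents with multiple columns, floating figures, or mixed scripts need the logical structure to order regions correctly; geometric sorting alone is only a starting point.
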

Website: cvit.iiit.ac.in | Filesize: -
No of Page(s): 17
