Data extraction from unstructured documents while maintaining logical document composition
Data extraction tools that operate on unstructured documents, such as those created with Microsoft Office and IBM Symphony, ignore the logical composition expressed by document creators. The logical composition of a document refers to its constituent parts, such as the header, footer, body, and custom properties.
The logical composition is essential when operating on the extracted data. For example, if I am classifying a document, finding a particular string of text in the header might be more important than finding it in the footer or the body.
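As a rough sketch of the classification example, assume extraction has already produced (part, text) pairs. The part names, the weights, and the function name score_term below are hypothetical and chosen only for illustration.

```python
# Hypothetical weights: a match in the header counts more than a match
# in the body, which in turn counts more than a match in the footer.
PART_WEIGHTS = {"header": 3.0, "body": 1.0, "footer": 0.5}


def score_term(segments, term):
    """Sum the weights of every logical part whose text contains `term`."""
    term = term.lower()
    total = 0.0
    for part, text in segments:
        if term in text.lower():
            # Parts without an explicit weight fall back to a neutral 1.0.
            total += PART_WEIGHTS.get(part, 1.0)
    return total


# Illustrative input: each extracted string keeps the part it came from.
segments = [
    ("header", "CONFIDENTIAL - Internal use only"),
    ("body", "Quarterly results are attached."),
    ("footer", "Page 1 of 3"),
]

print(score_term(segments, "confidential"))
print(score_term(segments, "page"))
```

With flat extraction, both matches would look identical to the classifier. Here the header match scores higher than the footer match only because the part label survived extraction.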
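The idea here is to perform data extraction while preserving the logical document composition, so that each piece of extracted text remains associated with the part of the document it came from.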
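One way this could look for a .docx file, which is a ZIP package of XML parts, is sketched below using only the Python standard library. This is an illustration under stated assumptions, not a complete extractor. It assumes headers and footers are stored under the conventional member names word/header*.xml and word/footer*.xml, and it does not handle custom properties. The names logical_part and extract_with_composition are hypothetical, and the sample package is a stripped-down stand-in rather than a valid Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace; text runs are stored in <w:t> elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def logical_part(member_name):
    """Map a package member name to a logical part, or None if unhandled."""
    if member_name == "word/document.xml":
        return "body"
    if member_name.startswith("word/header"):
        return "header"
    if member_name.startswith("word/footer"):
        return "footer"
    return None


def extract_with_composition(docx_bytes):
    """Return a list of (part, text) pairs instead of one flat string."""
    results = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        for name in package.namelist():
            part = logical_part(name)
            if part is None:
                continue
            root = ET.fromstring(package.read(name))
            # Concatenate all text runs in this part (no paragraph breaks).
            text = "".join(t.text or "" for t in root.iter(W_NS + "t"))
            results.append((part, text))
    return results


def make_sample():
    """Build a tiny in-memory package with one header, body, and footer."""
    ns = 'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
    members = {
        "word/header1.xml": f"<w:hdr {ns}><w:p><w:r><w:t>CONFIDENTIAL</w:t></w:r></w:p></w:hdr>",
        "word/document.xml": f"<w:document {ns}><w:body><w:p><w:r><w:t>Quarterly results</w:t></w:r></w:p></w:body></w:document>",
        "word/footer1.xml": f"<w:ftr {ns}><w:p><w:r><w:t>Page 1</w:t></w:r></w:p></w:ftr>",
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name, xml in members.items():
            package.writestr(name, xml)
    return buffer.getvalue()


sample = extract_with_composition(make_sample())
for part, text in sample:
    print(part, "->", text)
```

Returning (part, text) pairs rather than a single string is the design choice that matters here: downstream steps such as classification can then treat text from the header, body, and footer differently.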