Snigdha Venkata

Publication Date


Advisor(s) - Committee Chair

Guangming Xing


Access granted to WKU students, faculty and staff only.

After an extensive but unsuccessful search for the author, this thesis is considered an orphan work, which may be protected by copyright. The inclusion of this orphan work on TopScholar does not guarantee that the work may be used for any purpose, and any use of it may subject the user to a claim of copyright infringement. WKU reproduces this work without any purpose of direct or indirect commercial advantage, for purposes of preservation and research.

See also WKU Archives - Authorization for Use of Thesis, Special Project & Dissertation

Degree Program

Department of Computer Science

Degree Type

Master of Science


The World Wide Web is a huge repository of information, and retrieving the desired information from it is a challenging task. XML, a W3C recommendation, has become one of the most widely used document formats on the Web. Mining information from XML documents requires techniques that measure the structural similarity between documents.

Tree edit distance is a misleading measure of similarity for documents that have similar structure but differ greatly in size: the edit distance is high simply because of the size difference. A schema represents document structure much better than the document tree itself.
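As a rough illustration of the size effect, the sketch below uses Python's standard xml.etree.ElementTree module. It assumes unit-cost insert, delete, and relabel operations, under which the tree edit distance can never be smaller than the difference in node counts. The two sample documents and the helper names are hypothetical, and the set of parent-child tag pairs is only a crude stand-in for a schema, not the representation used in the thesis.

```python
import xml.etree.ElementTree as ET


def node_count(root):
    """Number of elements in the tree, including the root."""
    return sum(1 for _ in root.iter())


def parent_child_pairs(root):
    """Set of (parent tag, child tag) pairs: a crude structural summary."""
    return {(parent.tag, child.tag) for parent in root.iter() for child in parent}


# Hypothetical documents: same structure, very different size.
small = ET.fromstring("<library><book><title/><author/></book></library>")
large = ET.fromstring(
    "<library>" + "<book><title/><author/></book>" * 50 + "</library>"
)

# With unit-cost operations, each edit changes the node count by at most one,
# so the size difference is a lower bound on the tree edit distance.
lower_bound = abs(node_count(large) - node_count(small))

# The structural summaries are nevertheless identical.
same_structure = parent_child_pairs(small) == parent_child_pairs(large)

print(lower_bound, same_structure)
```

The two documents would be judged far apart by tree edit distance even though a structural summary cannot tell them apart.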

This thesis presents a novel approach that uses the edit distance between a document and a set of schemata as a similarity measure to classify XML documents. Generalized schema rules are extracted from the document trees by grammatical inference to build a representative schema for each class. The edit distance between an XML document and a schema is computed with the efficient algorithm proposed by Xing et al. (2005) [1]. These distances map each XML document to a point, or position vector, in a multi-dimensional space, where a classification algorithm is then applied to classify the documents.
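The pipeline described above can be sketched in a few lines of Python, under strong simplifying assumptions. The function toy_distance is a placeholder and is NOT the algorithm of Xing et al.: it merely counts parent-child edges that a "schema" (here just a set of allowed tag pairs) does not permit. The abstract does not name the classifier, so a 1-nearest-neighbour rule with Euclidean distance is assumed. All names, schemata, and documents are hypothetical.

```python
import math
import xml.etree.ElementTree as ET


def edges(xml_text):
    """List of (parent tag, child tag) edges in the document tree."""
    root = ET.fromstring(xml_text)
    return [(parent.tag, child.tag) for parent in root.iter() for child in parent]


def toy_distance(xml_text, schema_edges):
    """Placeholder distance (assumption): edges the schema does not allow."""
    return sum(1 for edge in edges(xml_text) if edge not in schema_edges)


def to_vector(xml_text, schemata):
    """Map a document to one coordinate per class schema."""
    return [toy_distance(xml_text, schema) for schema in schemata]


def nearest_label(vector, labeled_vectors):
    """1-nearest-neighbour classification (assumed classifier)."""
    return min(labeled_vectors, key=lambda pair: math.dist(vector, pair[0]))[1]


# Hypothetical per-class schemata, reduced to sets of allowed edges.
BOOK_SCHEMA = {("library", "book"), ("book", "title"), ("book", "author")}
ORDER_SCHEMA = {("orders", "order"), ("order", "item"), ("order", "total")}
SCHEMATA = [BOOK_SCHEMA, ORDER_SCHEMA]

training = [
    ("<library><book><title/><author/></book></library>", "books"),
    ("<orders><order><item/><total/></order></orders>", "orders"),
]
labeled = [(to_vector(doc, SCHEMATA), label) for doc, label in training]

query = "<library><book><title/></book><book><title/><author/></book></library>"
query_vector = to_vector(query, SCHEMATA)
predicted = nearest_label(query_vector, labeled)

print(query_vector, predicted)
```

Each document becomes a vector with one distance per class schema, so the dimension of the space equals the number of classes in this sketch; swapping in a real document-to-schema edit distance would leave the rest of the code unchanged.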


Computer Sciences | Physical Sciences and Mathematics