Department of Computer Science
Master of Science
The World Wide Web is a huge repository of information, and retrieving the desired information from it is a challenging task. XML, a W3C recommendation, has become one of the most widely used document formats on the Web. Mining information from XML documents requires techniques that measure the structural similarity between documents.
Tree edit distance is a misleading measure of similarity for documents that share a similar structure but differ greatly in size: the size difference alone makes the edit distance high. A schema is a much better representation of document structure than the document tree itself.
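The size effect can be illustrated with a minimal Python sketch. It does not compute a full tree edit distance; it relies only on the fact that, with unit-cost insert, delete, and relabel operations, the edit distance between two trees is at least the difference in their node counts. The sample documents and the helper names node_count and parent_child_pairs are hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET

def node_count(root):
    """Number of elements in the tree rooted at `root`."""
    return sum(1 for _ in root.iter())

def parent_child_pairs(root):
    """Set of (parent tag, child tag) pairs: a crude structural summary."""
    return {(parent.tag, child.tag) for parent in root.iter() for child in parent}

# Two hypothetical documents with the same structure but very different sizes.
small = ET.fromstring("<library><book><title/></book></library>")
large = ET.fromstring("<library>" + "<book><title/></book>" * 50 + "</library>")

# Each insert or delete changes the node count by one and a relabel by zero,
# so this gap is a lower bound on the unit-cost tree edit distance.
size_gap = abs(node_count(large) - node_count(small))

# Yet both documents use exactly the same parent-child relationships.
same_structure = parent_child_pairs(small) == parent_child_pairs(large)

print(size_gap, same_structure)
```

Under these assumptions the two documents are at least 98 edit operations apart even though a single schema rule set describes both.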
This thesis presents a novel approach that classifies XML documents using the edit distance between a document and a set of schemata as the similarity measure. Generalized schema rules are extracted by applying grammatical inference to the document trees, yielding a representative schema for each class. Using the efficient algorithm proposed by Xing et al. (2005) for computing the edit distance between an XML document and a schema, each XML document is mapped to a point (position vector) in a multi-dimensional space, where a classification algorithm then assigns the document to a class.
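The mapping from documents to position vectors can be sketched in Python as follows. This is only an illustration under stated assumptions: toy_distance is a crude stand-in that counts schema violations and is not the algorithm of Xing et al. (2005); the SCHEMAS dictionary, the sample documents, and the 1-nearest-neighbour classifier are all hypothetical choices, since the abstract does not name the classification algorithm used.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical toy schemas: a root tag plus the children allowed under each parent.
SCHEMAS = {
    "library": {"root": "library",
                "children": {"library": {"book"}, "book": {"title", "author"}}},
    "catalog": {"root": "catalog",
                "children": {"catalog": {"item"}, "item": {"name", "price"}}},
}

def toy_distance(root, schema):
    """Count elements that violate the toy schema (stand-in, NOT Xing et al.)."""
    violations = 0 if root.tag == schema["root"] else 1
    for parent in root.iter():
        allowed = schema["children"].get(parent.tag, set())
        violations += sum(1 for child in parent if child.tag not in allowed)
    return violations

def to_vector(xml_text):
    """Map a document to one distance per class schema (fixed, sorted order)."""
    root = ET.fromstring(xml_text)
    return [toy_distance(root, SCHEMAS[name]) for name in sorted(SCHEMAS)]

def classify(vector, labelled_vectors):
    """1-nearest-neighbour in the distance space (an assumed classifier)."""
    return min(labelled_vectors, key=lambda lv: math.dist(vector, lv[0]))[1]

# Hypothetical labelled training documents, one per class.
training = [
    (to_vector("<library><book><title/><author/></book></library>"), "library"),
    (to_vector("<catalog><item><name/><price/></item></catalog>"), "catalog"),
]

# An unseen document with one element the library schema does not allow.
query = to_vector(
    "<library><book><title/></book><book><title/><note/></book></library>"
)
predicted = classify(query, training)
print(query, predicted)
```

Each vector component is the distance to one class schema, so a document lies close to the axis of the class whose schema it nearly satisfies. Any classifier that operates on numeric vectors could replace the nearest-neighbour step here.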
Computer Sciences | Physical Sciences and Mathematics
Venkata, Snigdha, "Web Document Classification Using Edit Distances Between XML Document & Schemata" (2005). Masters Theses & Specialist Projects. Paper 3445.