XML cleaning model for data quality improvement using conditional integrity constraints

Hakawati, Mohammed Ragheb

Abstract

Extensible Markup Language (XML) is emerging as the primary standard for representing and exchanging data. At more than 60% of the total, XML is considered the most dominant document type on the web; nevertheless, the quality of XML documents falls short of expectations. Consequently, it has become increasingly important to provide a complete model that can detect and correct inconsistencies, that is, violations of data dependencies that degrade XML data quality. XML integrity constraints play an important role in keeping XML datasets as consistent as possible, but their ability to solve data quality issues has yet to be realized. The main reason is that traditional data dependencies were introduced to maintain the consistency of the schema rather than that of the data. The purpose of this study is to improve the quality of XML documents by introducing an enhanced cleaning model based on new types of XML integrity constraints called XML Conditional Inclusion Dependencies (XCIND) and XML Conditional Functional Dependencies (XCFD). The notation of the new rules is designed mainly for improving data instances and extends traditional XML dependencies by enforcing pattern tableaux of semantically related constants. Subsequently, a set of minimal approximate conditional dependencies (XCFD, XCIND) is discovered from the XML tree using a set of mining algorithms. Finally, data inconsistencies are detected using denial queries for the mined rules and repaired using a set of update statements that resolve the inconsistent data values. In an extensive experimental evaluation on real XML datasets, the proposed mining algorithms demonstrated their efficacy and high performance in discovering all conditional dependencies under different support and confidence thresholds. The results showed that the new model could increase XML quality by detecting more genuinely spurious data values than previous models based on traditional dependencies.
Furthermore, the XML Cleaner can detect inconsistencies between tree tuples at the same level, or even between multilevel tree tuples, inside the XML tree using these conditional dependencies. Moreover, the quality of the documents was assessed using two measures, precision and recall, and the accuracy of the XML documents was improved to over 94% and 83%, respectively, for these measures. Thus, XML conditional integrity constraints, like their relational counterparts, prove able to pave the way toward new standards for cleaning applications for the XML data model, especially in the big data era.
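
The abstract does not give the XCFD notation or the denial queries themselves, so the following is only a rough sketch of the detection idea, modeled on relational conditional functional dependencies rather than on the thesis's own definitions. The sample document, the `_` wildcard convention, and the names `tree_tuples`, `matches` and `find_violations` are all assumptions made for illustration. A rule says that, for tree tuples whose left-hand-side values match a pattern row, the left-hand side determines the right-hand side, and any constant in the pattern must also hold.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data; not taken from the thesis.
SAMPLE_XML = """
<customers>
  <customer><country>UK</country><area>131</area><city>Edinburgh</city></customer>
  <customer><country>UK</country><area>131</area><city>London</city></customer>
  <customer><country>US</country><area>908</area><city>Murray Hill</city></customer>
  <customer><country>US</country><area>908</area><city>Murray Hill</city></customer>
</customers>
"""

WILDCARD = "_"  # assumed convention: matches any value in a pattern row


def tree_tuples(root, tuple_path, fields):
    """Flatten each element under tuple_path into a dict of child text values."""
    for node in root.findall(tuple_path):
        yield {f: (node.findtext(f) or "").strip() for f in fields}


def matches(values, pattern):
    """True if every value equals its pattern constant or the pattern is a wildcard."""
    return all(p == WILDCARD or p == v for v, p in zip(values, pattern))


def find_violations(root, tuple_path, lhs, rhs, pattern_row):
    """Return violations of the conditional rule lhs -> rhs under pattern_row.

    pattern_row maps every lhs field and the rhs field to a constant or WILDCARD.
    A "constant" violation is a single tuple whose rhs differs from the pattern
    constant; a "variable" violation is a group of tuples that agree on lhs but
    disagree on rhs.
    """
    rows = list(tree_tuples(root, tuple_path, lhs + [rhs]))
    lhs_pattern = [pattern_row[f] for f in lhs]
    rhs_pattern = pattern_row[rhs]

    violations = []
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(row[f] for f in lhs)
        if not matches(key, lhs_pattern):
            continue  # the rule is conditional: it only applies to matching tuples
        if rhs_pattern != WILDCARD and row[rhs] != rhs_pattern:
            violations.append(("constant", [i]))
        groups[key].append(i)

    for idxs in groups.values():
        if len({rows[i][rhs] for i in idxs}) > 1:
            violations.append(("variable", idxs))
    return violations


root = ET.fromstring(SAMPLE_XML)
# Within the UK, (country, area) should determine city.
uk_rule = {"country": "UK", "area": WILDCARD, "city": WILDCARD}
print(find_violations(root, "customer", ["country", "area"], "city", uk_rule))
# → [('variable', [0, 1])]
```

The rule is scoped to UK tuples only, which is what separates a conditional dependency from a traditional one that must hold across the whole document. Repair, mining with support and confidence thresholds, and multilevel tree tuples are outside this sketch.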

URI

http://dspace.unimap.edu.my:80/xmlui/handle/123456789/79144

Collections

School of Computer and Communication Engineering (Theses) [175]