XML cleaning model for data quality improvement using conditional integrity constraints

Hakawati, Mohammed Ragheb

Please use this identifier to cite or link to this item: http://dspace.unimap.edu.my:80/xmlui/handle/123456789/79144

Title:	XML cleaning model for data quality improvement using conditional integrity constraints
Authors:	Yasmin, Mohd Yacob, Dr.
Keywords:	XML (Document markup language); Extensible Markup Language (XML)
Publisher:	Universiti Malaysia Perlis (UniMAP)
Abstract:	Extensible Markup Language (XML) is emerging as the primary standard for representing and exchanging data. Accounting for more than 60% of the total, XML is considered the most dominant document type on the web; nevertheless, the quality of XML documents is not as expected. Consequently, it has become increasingly important to provide a full model that can detect and correct inconsistencies, recognized as violations of data dependencies, that decrease XML data quality. XML integrity constraints play an important role in keeping XML datasets as consistent as possible, but their ability to solve data quality issues remains largely unrealized. The main reason is that traditional data dependencies were introduced to maintain the consistency of the schema rather than that of the data. The purpose of this study is to improve the quality of XML documents by introducing an enhanced cleaning model based on new types of XML integrity constraints called XML Conditional Inclusion Dependencies (XCIND) and XML Conditional Functional Dependencies (XCFD). The notations of the new rules are designed mainly for improving data instances and extend traditional XML dependencies by enforcing pattern tableaus of semantically related constants. Subsequently, a set of minimal approximate conditional dependencies (XCFD, XCIND) is discovered and learned from the XML tree using a set of mining algorithms. Finally, data inconsistencies are detected using denial queries for the mined rules and repaired using a set of update statements that supply solutions for inconsistent data values. In an extensive experimental evaluation on real XML datasets, the proposed mining algorithms demonstrated their efficacy and high performance in discovering all conditional dependencies under different support and confidence thresholds. The results showed that the new model could increase XML quality by detecting more real spurious data values than previous models based on traditional dependencies. Furthermore, the XML Cleaner can detect inconsistencies within the same tree tuples or even between multilevel tree tuples inside the XML tree using the mentioned conditional dependencies. Moreover, the quality of the documents was assessed using two measures (Precision and Recall), and the accuracy of XML documents was improved, with values over 94% and 83%, respectively, for these measures. Thus, XML conditional integrity constraints, just like their relational counterparts, prove their ability to pave the way toward new standards of cleaning applications for the XML data model, especially in the big data era.
Description:	Doctor of Philosophy in Computer Engineering
URI:	http://dspace.unimap.edu.my:80/xmlui/handle/123456789/79144
Appears in Collections:	School of Computer and Communication Engineering (Theses)
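
As a rough illustration of the detection step described in the abstract, the following Python sketch checks a single conditional functional dependency against a small XML tree. Everything in it is an assumption made for illustration: the sample data, the element names (customer, country, zip, city), the rule "where country = UK, zip determines city", and the helper find_violations are hypothetical and are not taken from the thesis, whose own XCFD notation, mining algorithms and denial queries are not reproduced here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data; not taken from the thesis.
SAMPLE = """
<customers>
  <customer id="1"><country>UK</country><zip>EH4</zip><city>Edinburgh</city></customer>
  <customer id="2"><country>UK</country><zip>EH4</zip><city>London</city></customer>
  <customer id="3"><country>US</country><zip>EH4</zip><city>Boston</city></customer>
  <customer id="4"><country>UK</country><zip>W1B</zip><city>London</city></customer>
</customers>
"""


def find_violations(root, record_path, condition, lhs, rhs):
    """Return {lhs_value: sorted record ids} for groups that break lhs -> rhs.

    Only records whose `condition` child equals the given constant are
    checked, which is what makes the dependency conditional.
    """
    cond_tag, cond_value = condition
    groups = defaultdict(list)
    for rec in root.findall(record_path):
        if rec.findtext(cond_tag) != cond_value:
            continue  # the rule does not apply to this record
        groups[rec.findtext(lhs)].append((rec.get("id"), rec.findtext(rhs)))

    violations = {}
    for key, members in groups.items():
        # Same lhs value but more than one rhs value: the rule is violated.
        if len({value for _, value in members}) > 1:
            violations[key] = sorted(rec_id for rec_id, _ in members)
    return violations


root = ET.fromstring(SAMPLE)
# Assumed rule: among customers with country = UK, zip determines city.
violations = find_violations(root, "customer", ("country", "UK"), "zip", "city")
print(violations)
```

In this sketch, records 1 and 2 are flagged because they share a zip value under the UK condition but disagree on city, while record 3 is ignored because the condition does not hold for it. An unconditional dependency zip -> city would also have flagged record 3.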

Files in This Item:

File	Description	Size	Format
Page 1-24.pdf	Access is limited to UniMAP community.	655.47 kB	Adobe PDF
Full text.pdf	This item is protected by original copyright.	2.36 MB	Adobe PDF
Mohammed Ragheb.pdf	Declaration Form	188.61 kB	Adobe PDF
