XML cleaning model for data quality improvement using conditional integrity constraints
Abstract
Extensible Markup Language (XML) is emerging as the primary standard for
representing and exchanging data; accounting for more than 60% of the total, XML is
considered the most dominant document type on the web. Nevertheless, the quality of XML
documents is not as expected. Consequently, it has become increasingly important to
provide a full model that can detect and correct inconsistencies, recognized as violations
of data dependencies, that degrade XML data quality. XML integrity constraints play an
important role in keeping XML datasets as consistent as possible, but their ability to
solve data quality issues has yet to be realized. The main reason is that traditional data
dependencies were introduced to maintain the consistency of the schema rather than that
of the data.
The purpose of this study is to improve the quality of XML documents by introducing an
enhanced cleaning model based on two new types of XML integrity constraints: XML
Conditional Inclusion Dependencies (XCIND) and XML Conditional Functional
Dependencies (XCFD). The new rules are designed mainly to improve data instances, and
they extend traditional XML dependencies by enforcing pattern tableaus of semantically
related constants. Subsequent to this, a set of minimal approximate
conditional dependencies (XCFDs and XCINDs) is discovered from the XML tree
using a set of mining algorithms. Finally, data inconsistencies are detected using denial
queries derived from the mined rules and repaired using a set of update statements that
resolve the inconsistent data values. In an extensive experimental evaluation on real XML
datasets, the proposed mining algorithms demonstrated their efficacy and high performance
in discovering all conditional dependencies under different support and confidence
thresholds. The results showed that the new model could increase XML quality by
detecting more genuinely spurious data values than previous models based on traditional
dependencies. Furthermore, the XML Cleaner can detect inconsistencies between tree
tuples at the same level or even between multilevel tree tuples inside the XML tree using
these conditional dependencies. Moreover, the quality of the documents was
assessed using two measures, precision and recall, and the accuracy of the XML documents
improved to over 94% and 83%, respectively, on these measures. In conclusion, XML
conditional integrity constraints, like their relational counterparts, prove able to
pave the way toward new standards for cleaning applications for the XML data model,
especially in the big data era.
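
To make the idea of a conditional functional dependency concrete, the following is a minimal sketch and not the mining or detection algorithm of this study. It assumes a toy document, flat tree tuples, and one hand-written rule ("for customers whose country is UK, zip determines city"). The function names, the sample data, the definitions of support and confidence, and the majority-vote choice of the expected value are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical sample document; not taken from the evaluated datasets.
XML_DOC = """
<customers>
  <customer><country>UK</country><zip>EH4</zip><city>Edinburgh</city></customer>
  <customer><country>UK</country><zip>EH4</zip><city>Edinburgh</city></customer>
  <customer><country>UK</country><zip>EH4</zip><city>London</city></customer>
  <customer><country>US</country><zip>EH4</zip><city>Boston</city></customer>
</customers>
"""


def tree_tuples(xml_text, record_tag):
    """Flatten each record element into a dict of child tag -> text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in record}
        for record in root.iter(record_tag)
    ]


def check_cfd(tuples, condition, lhs, rhs):
    """Check the rule: where `condition` holds, `lhs` determines `rhs`.

    Assumed definitions (for illustration only):
      support    = share of all tuples that match the constant pattern
      confidence = share of matching tuples that agree with the majority
                   `rhs` value of their `lhs` group
    Returns (support, confidence, violating_tuples).
    """
    matching = [
        t for t in tuples
        if all(t.get(attr) == value for attr, value in condition.items())
    ]
    groups = defaultdict(list)
    for t in matching:
        groups[tuple(t.get(attr) for attr in lhs)].append(t)

    violations = []
    for group in groups.values():
        # Treat the most frequent rhs value in the group as the expected one.
        expected, _ = Counter(t.get(rhs) for t in group).most_common(1)[0]
        violations.extend(t for t in group if t.get(rhs) != expected)

    support = len(matching) / len(tuples) if tuples else 0.0
    confidence = 1 - len(violations) / len(matching) if matching else 1.0
    return support, confidence, violations


records = tree_tuples(XML_DOC, "customer")
support, confidence, violations = check_cfd(
    records, condition={"country": "UK"}, lhs=["zip"], rhs="city"
)
print(support, confidence, violations)
```

In this sketch the US record is ignored because it does not match the constant pattern, which is the sense in which the dependency is conditional; the flagged tuple plays the role that a denial query result would play in the cleaning model.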