Data Mining with Semantic Features Represented as Vectors of Semantic Clusters
Data mining with taxonomies merged with categorical data has been studied in the past, but the work has often been limited to small taxonomies. Taxonomies are used to aggregate categorical data so that patterns induced from the data can be expressed at higher levels of conceptual generality. Semantic similarity and relatedness measures can be used to aggregate categorical values for cluster-based data mining algorithms. Many aggregation techniques rely solely on hierarchical relationships to aggregate categorical values. While computationally attractive, these approaches have conceptual limitations that can lead to spurious data mining results. Alternatively, categorical data can be aggregated using both hierarchical relationships and the other semantic relationships expressed in ontologies and conceptual graphs, which requires graph-based similarity/relatedness measures. Scaling these techniques to large ontologies can be computationally expensive because the search space for expressing patterns is wider. An alternative representation of semantic data is presented that has attractive computational properties when applied to data mining: semantic data is represented as vectors of cluster memberships. The representation supports the use of cosine similarity measures to improve the run-time performance of data mining with ontologies. The method is illustrated via examples of KMeans clustering and association rule mining.
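As a rough illustration of the representation described above, the following minimal sketch encodes each categorical value as a vector of memberships in semantic clusters and compares values with cosine similarity. The cluster names, the categorical values, the membership weights, and the helper functions are all hypothetical, chosen only for illustration; the abstract does not specify how the clusters or memberships are derived.

```python
import math

# Hypothetical semantic clusters (an assumption for illustration only).
CLUSTERS = ["animal", "pet", "vehicle"]

# Hypothetical membership of each categorical value in each cluster,
# in the same order as CLUSTERS. These weights are made up.
MEMBERSHIP = {
    "dog":   [1.0, 1.0, 0.0],
    "wolf":  [1.0, 0.0, 0.0],
    "truck": [0.0, 0.0, 1.0],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A value with no cluster memberships is treated as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(value_a, value_b):
    """Compare two categorical values via their cluster-membership vectors."""
    return cosine_similarity(MEMBERSHIP[value_a], MEMBERSHIP[value_b])


if __name__ == "__main__":
    for a, b in [("dog", "wolf"), ("dog", "truck")]:
        print(a, b, round(similarity(a, b), 3))
```

In this sketch, comparing two values costs one pass over fixed-length vectors, with no graph traversal at comparison time. This may be the kind of computational property the abstract refers to, although the actual method could differ.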