Author: Leeuwen, Matthijs; Vreeken, Jilles; Siebes, Arno
Publisher: Springer Publishing Company
ISSN: 1384-5810
Source: Data Mining and Knowledge Discovery, Vol. 19, Iss. 2, 2009-10, pp. 176-193
Abstract
Most, if not all, databases are mixtures of samples from different distributions. Transactional data is no exception. For the prototypical example, supermarket basket analysis, one also expects a mixture of different buying patterns. Households of retired people buy different collections of items than households with young children. Models that take such underlying distributions into account are in general superior to those that do not. In this paper we introduce two MDL-based algorithms that follow orthogonal approaches to identify the components in a transaction database. The first follows a model-based approach, while the second is data-driven. Both are parameter-free: the number of components and the components themselves are chosen such that the combined complexity of data and models is minimised. Further, neither prior knowledge on the distributions nor a distance metric on the data is required. Experiments with both methods show that highly characteristic components are identified.
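To make the selection criterion in the abstract concrete, the following is a minimal sketch of choosing a partition by total encoded size. It is not either of the algorithms introduced in the paper: it assumes a deliberately simple two-part code in which each component models items as independent, and all names (`component_bits`, `total_bits`, `retired`, `families`) and the toy data are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def component_bits(transactions, items):
    """Two-part code length of one component (assumed independence model)."""
    n = len(transactions)
    if n == 0:
        return 0.0
    # Model cost: one count in 0..n per item.
    model_bits = len(items) * math.log2(n + 1)
    # Data cost: each item column encoded with its own Bernoulli code.
    data_bits = 0.0
    for item in items:
        count = sum(1 for t in transactions if item in t)
        data_bits += n * binary_entropy(count / n)
    return model_bits + data_bits

def total_bits(partition, items):
    """Combined complexity of models and data for a whole partition."""
    n_total = sum(len(c) for c in partition)
    k = len(partition)
    # Cost of stating which component each transaction belongs to.
    assignment_bits = n_total * math.log2(k) if k > 1 else 0.0
    return assignment_bits + sum(component_bits(c, items) for c in partition)

# Hypothetical toy database with two buying patterns.
items = ["tea", "biscuits", "diapers", "juice"]
retired = [{"tea", "biscuits"}] * 20
families = [{"diapers", "juice"}] * 20

one_component = [retired + families]
true_split = [retired, families]
mixed_split = [retired[:10] + families[:10], retired[10:] + families[10:]]

candidates = {
    "one component": one_component,
    "true split": true_split,
    "mixed split": mixed_split,
}
best = min(candidates, key=lambda name: total_bits(candidates[name], items))
print(best, round(total_bits(candidates[best], items), 1))
```

Under this assumed encoding, a split is only preferred when the bits saved on the data outweigh the extra model and assignment cost, so the number of components does not have to be fixed in advance.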