Description |
Most knowledge discovery processes are biased, because part of the knowledge structure must be specified before extraction. In many cases, such as the analysis of large network logs, the user does not have sufficient domain knowledge to conduct an efficient knowledge discovery process. To address this, we propose a framework [21] that avoids this bias by supporting all major model structures (e.g. clusterings, sequences), as well as the specification of data and DM (Data Mining) algorithms, in the same language. The key concept of our work is the notion of schema, which is rooted in category theory and, more precisely, in sketch theory extended to meet data mining requirements. A unification operation automatically matches the data to the relevant DM algorithms in order to extract models and their related structure. In other words, the specification allows specified algorithms to be reused and adapted automatically to the user's data. Finally, extracted models are evaluated and ranked using the MDL (Minimum Description Length) principle: given a data sample and an effective means of enumerating the relevant alternative theories that can explain the data, the best model is the one that minimizes the sum of the length of the model description and the length of the data description under this model.
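As an illustration of the MDL ranking described above, the following minimal sketch scores candidate models of a symbol sequence by a two-part code length, L(M) + L(D|M). It is not the framework's implementation: the choice of fixed-order Markov chains as candidate models, the (k/2) log2 n approximation for the model cost, and the names `data_bits`, `description_length` and `best_model` are all assumptions made for the example.

```python
import math
from collections import Counter


def data_bits(counts_by_context):
    """L(D|M): code length of the data in bits, i.e. the negative log2
    likelihood under maximum-likelihood estimates in each context."""
    bits = 0.0
    for counts in counts_by_context.values():
        total = sum(counts.values())
        for c in counts.values():
            bits -= c * math.log2(c / total)
    return bits


def description_length(seq, order, alphabet_size):
    """Two-part code length L(M) + L(D|M) for a Markov model of the given order.

    Simplification (assumption): the first `order` symbols are not encoded,
    and L(M) is approximated by (k / 2) * log2(n) bits for k free parameters.
    """
    contexts = {}
    for i in range(order, len(seq)):
        ctx = tuple(seq[i - order:i])
        contexts.setdefault(ctx, Counter())[seq[i]] += 1
    n = len(seq) - order
    k = (alphabet_size ** order) * (alphabet_size - 1)
    model_bits = 0.5 * k * math.log2(n)
    return model_bits + data_bits(contexts)


def best_model(seq, orders=(0, 1, 2)):
    """Rank the candidate orders and return the one with the shortest code."""
    alphabet_size = len(set(seq))
    scores = {o: description_length(seq, o, alphabet_size) for o in orders}
    return min(scores, key=scores.get), scores


# Hypothetical toy log: two alarm types that strictly alternate.
order, scores = best_model("AB" * 50)
print(order, scores)
```

On this toy sequence the order-1 model is selected: it predicts each symbol exactly from its predecessor, so its data cost is zero, while the order-2 model pays for extra parameters without describing the data any better.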
The proposed framework is currently being tested both on data randomly generated by models, in order to validate the approach, and on real network alarms. More precisely, in the framework of a CRE contract with France Telecom R&D (cf. section 7.3), we are working on two types of alarm logs: logs from a Virtual Private Network server, which generates many alarms due to normal and abnormal connections, and alarm logs generated by France Telecom routers when they detect suspicious flows [88]. |