Knowledge Discovery in Databases and Data Mining
Dave John
Introduction
The problem of data overload looms as our ability to analyse massive datasets lags far behind our ability to gather and store them. This is where knowledge discovery in databases (KDD) and data mining (DM) come into the picture, offering computational techniques and tools for extracting useful patterns (knowledge) from the rapidly growing volume of data [1]. Data mining techniques use statistical algorithms to unveil hidden patterns in large datasets. Because useful information is now expected almost instantly, DM has spread into almost every field that information technology touches, and its ability to mine useful information from vast datasets is what gives it value [8].
Knowledge Discovery in Databases (KDD)
There is a common misconception that data mining is the whole process of retrieving useful information (patterns) from a dataset, whereas data mining is in fact one major step in the overall process of knowledge discovery in databases (KDD). KDD is the “non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data” [2]. For predictive tasks, KDD builds a DM model from past records with a known target class (output), and the resulting model is then used to predict the output of new records. The KDD process can be subdivided into the following phases:-
Fig 1: KDD road map
1. Problem Specification:- The main objective of this phase is to define the problem to be addressed. Data sources are specified and the database is examined to establish the number of records, the number of attributes, the proportion of missing data, outliers, the noise level, and the data type of each attribute (numeric or categorical). The feasibility of the project and the specific data mining task to be applied are also determined at this stage [3].
2. Resourcing:- In this phase, the data sources identified during problem specification are obtained. The data from the different sources are transformed into a structured format in an operational database, which is supplied as the input to the next stage. Data formatting is important because the data must be consistent from the next stage onwards [4].
3. Data Cleansing:- The main steps of this phase are removing errors and handling missing data and outliers, since the formatted data are often noisy, with some percentage of missing values along with outliers and inaccurate values. Because the usefulness of a data mining tool depends mainly on an accurate database and on the sample data used for model training and testing, the impact of missing data can be high [3]. If missing data are neglected, the result may be false data mining conclusions, since the effect on prediction grows with the percentage of missing data [5].
4. Pre-processing:- Data pre-processing, also known as data preparation, is a major step in KDD as it has a significant impact on prediction accuracy. Different methods are used for pre-processing, and the appropriate choice may differ between datasets and attributes. Filter and wrapper methods, removing anomalies, and eliminating duplicate records (with the help of a domain expert) are some of the methods that can be adopted. Pre-processing steps are performed iteratively rather than in a prescribed order, and produce a quality dataset smaller than the original, which in turn leads to quality patterns (knowledge) [6].
5. Data Mining:- In this phase one or more data mining techniques (algorithms) are selected and applied. Depending on the data mining task, a wide range of algorithms is available, each of which can be tuned by setting different parameter values. Before an algorithm is applied, the dataset is usually divided into training and test subsets, using either cross-validation or a percentage split. The purpose of the split is to use the training subset to train the model (define rules) and then to evaluate the performance of the classifier on new, unseen instances (the test subset). The model built is applied to every instance in the test subset to predict its class value, and the predicted class value is compared with the real class value to find the percentage classified correctly. The created model is only useful to the business if the agreement between the predicted and real class values is statistically significant. If it is, the model can be applied to another similar dataset to predict the class value [7].
Fig 2: Split of the dataset into training and test subsets.
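The percentage split and the comparison of predicted and real class values described in phase 5 can be illustrated with a minimal sketch. It is not taken from any of the cited works: the toy dataset, the 70/30 split and the 1-nearest-neighbour classifier are all assumptions chosen to keep the example short, and the function names are hypothetical. Cross-validation is not shown.

```python
import random


def percentage_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into training and test subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def predict(train, value):
    """1-nearest-neighbour: return the class of the closest training record."""
    nearest = min(train, key=lambda record: abs(record[0] - value))
    return nearest[1]


def accuracy(train, test):
    """Fraction of test records whose predicted class equals the real class."""
    correct = sum(1 for value, real in test if predict(train, value) == real)
    return correct / len(test)


# Hypothetical toy dataset: one numeric attribute and a known class label.
records = [(x, "low" if x < 50 else "high") for x in range(100)]

train, test = percentage_split(records, train_fraction=0.7)
print(f"{len(train)} training records, {len(test)} test records")
print(f"correctly classified: {accuracy(train, test):.0%}")
```

The test subset is never shown to the classifier during training, so the percentage printed reflects performance on unseen instances only. Whether that percentage is statistically significant would need a separate test, which this sketch does not attempt.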
6. Evaluation:- Different approaches are used to evaluate the results of the generated model, using the test partition of the dataset. The discovered knowledge is evaluated in terms of simplicity, performance on the database, suitability for the application area, generality and visualisation [3].
7. Interpretation:- In this stage domain experts evaluate the discovered knowledge further. The newly discovered knowledge is also compared with existing knowledge to validate it.
8. Exploitation:- Once properly evaluated, the new knowledge is applied in real-world scenarios while minimising the risk and maximising the potential benefits.
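To make the cleansing and pre-processing phases (3 and 4) more concrete, the sketch below shows one possible treatment of missing values, outliers and duplicate records. The choices here are assumptions made for illustration, not the methods of the cited works: missing values are replaced by the mean, outliers are flagged with the common 1.5 x IQR rule, and duplicates are dropped by exact match. The function names and sample values are hypothetical.

```python
from statistics import mean, quantiles


def impute_missing(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


# Hypothetical attribute column with one missing value and one suspicious value.
readings = [10, 12, None, 11, 13, 12, 11, 95]

observed = [v for v in readings if v is not None]
print("possible outliers:", flag_outliers(observed))
print("after imputation:", impute_missing(readings))
print("unique records:", drop_duplicates([("a", 1), ("b", 2), ("a", 1)]))
```

Outliers are flagged on the observed values before imputation, because an extreme value would otherwise distort the mean used to fill the gaps. In practice a domain expert would decide whether a flagged value is an error or a genuine observation.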
References
[1] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996a). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, volume 39, pages 27–34. http://shawndra.pbworks.com/ (accessed 23 March 2020).
[2] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996b). Knowledge discovery and data mining: Towards a unifying framework. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, volume 1, pages 82–88.
[3] Debuse, J., de la Iglesia, B., Howard, C., and Rayward-Smith, V. (2001). Building the KDD roadmap: A methodology for knowledge discovery. Industrial Knowledge Management, 1(1):179–196.
[4] Wang, J. (2003). Data Mining: Opportunities and Challenges. IGI Global, 701 E. Chocolate Avenue, Suite 200, Hershey, PA 17033.
[5] Acuna, E. and Rodriguez, C. (2004). The treatment of missing values and its effect on classifier accuracy. Classification, Clustering, and Data Mining Applications, pages 639–647.
[6] Hamad, M. M. and Qader, B. A. Data pre-processing for knowledge discovery. College of Computer, University of Anbar.
[7] Witten, I. H., Frank, E., Hall, M. A., and Pal, C. J. Data Mining: Practical Machine Learning Tools and Techniques.
[8] Kamber, M. and Han, J. (2001). Data Mining: Concepts and Techniques, volume 228. Morgan Kaufmann Publishers.