Statistical Modules for Discovering and Learning from Data Mining Applications
Data mining and subsequent retrieval have become key issues in the era of rapid adoption of information technology for transaction processing by all forms of business and government, which has resulted in large-scale databases. Such applications include bank transactions, import-export data, mining data, land records, voter I-card data, Aadhaar card data, supermarket transaction data, credit card usage data and telephone call details. Other technologies, such as satellites, remote data-acquisition techniques and medical record systems, have also given birth to databases of enormous size. Data mining is a technique that helps extract information and hidden patterns of value to the owners and other stakeholders of the data. It is used in many areas of interest, including the identification of financial fraud and the evaluation of the risk associated with a customer, from which business organizations derive special value. The knowledge and interpretations gained from data mining exercises are therefore of immense value in customer relationship management (CRM), employee relationship management (ERM) and other decision-making areas.
(PART-A) - Different modules of data mining deal with the process of learning from archived data obtained through various sources. Traditional approaches to learning relied on statistical techniques for obtaining knowledge from available data, while many contemporary approaches are also influenced by the learning processes of biological systems such as humans. It is observed that biological systems learn empirically from an environment of unknown statistical nature, while remaining ignorant of the structure or principle behind the process; an example of such learning is a baby learning to walk while being unaware of the principles of physics. For analysis and interpretation it is useful to categorize learning through data mining into types of tasks, corresponding to different objectives of analyzing the data. Such categorization, however, is not unique, and may call for further division into finer tasks.
(PART-B) - Exploratory Data Analysis (EDA):
The goal here is simply to explore the data in many different ways with no specific search criteria. Generally, EDA techniques are interactive and visual, and effective graphical display methods are used for representing relatively small, low-dimensional data sets.
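As a minimal sketch of the EDA idea, the snippet below computes basic summary statistics for a small sample using only the Python standard library; the `transactions` values are hypothetical sample data, not taken from any of the applications mentioned above.

```python
# A minimal EDA sketch: summarize a small, one-dimensional data set.
# The transaction amounts below are hypothetical sample data.
import statistics

transactions = [120.5, 80.0, 95.25, 300.0, 110.75, 85.5, 99.0]

summary = {
    "count": len(transactions),
    "mean": statistics.mean(transactions),
    "median": statistics.median(transactions),
    "stdev": statistics.stdev(transactions),
    "min": min(transactions),
    "max": max(transactions),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Note how the large gap between the median (99.0) and the maximum (300.0) already hints at an outlier, which is exactly the kind of observation EDA is meant to surface before any formal modeling.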
B1. Descriptive Modeling: The goal of a descriptive model is to describe all of the data or the process generating the data. Examples of such descriptions include models for the probability distribution of the data, partitioning of the multi-dimensional data space into groups (cluster analysis and segmentation), and models describing the relationship between variables (dependency modeling).
B2. Predictive Modeling: The objective here is to build a model that will permit the value of one variable to be predicted from the known values of other variables. The common predictive models are classification and regression. While in classification the variable being predicted is categorical, in regression it is quantitative. The term “prediction” is used here in a general sense, and no notion of a time continuum is implied.
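A minimal sketch of the regression case can be given with simple least squares: one variable `y` is predicted from the known values of another variable `x`. The data points below are hypothetical.

```python
# Minimal sketch of predictive modeling by regression: fit y = a + b*x
# by ordinary least squares. The (x, y) pairs are hypothetical data.

def fit_least_squares(xs, ys):
    """Estimate intercept a and slope b minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Known values of the predictor x and the quantitative target y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

a, b = fit_least_squares(xs, ys)
prediction = a + b * 6.0   # predict y for an unseen input value
```

In the classification variant the same structure applies, except that the predicted value would be a category label rather than a number.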
B3. Discovering Patterns and Rules: The aim here is to detect patterns and rules in the data.
B.3.1. Retrieval by Content: The objective here is to find known or hypothesized patterns in the data set. This task is most commonly used for text and image data sets. For text, the pattern may be a set of keywords, and the objective is to find relevant documents within a large set of possibly relevant documents (e.g., Web pages).
B.3.2. For images, the user may have a sample image, a sketch of an image, or a description of an image, and wish to find similar images from a large set of images. In both cases the definition of similarity is critical, but so are the details of the search strategy.
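For the text case, retrieval by content can be sketched very simply: score each document by how many of the query keywords it contains and rank by that score. The documents and keywords below are hypothetical, and real systems would of course use far more refined similarity measures.

```python
# A minimal sketch of retrieval by content for text: rank documents by
# the number of query keywords they contain. Documents are hypothetical.

documents = {
    "doc1": "bank transactions and credit card usage data",
    "doc2": "satellite remote sensing imagery",
    "doc3": "credit card fraud detection in bank data",
}

def retrieve(query_keywords, docs):
    """Rank documents by count of matching keywords (a crude relevance score)."""
    scores = {}
    for name, text in docs.items():
        words = set(text.split())
        scores[name] = sum(1 for kw in query_keywords if kw in words)
    # Highest-scoring (most relevant) documents first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = retrieve({"credit", "card", "fraud"}, documents)
```

Here the definition of similarity is just keyword overlap; as the text notes, choosing that definition well is the critical part of the task.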
It may be noted that the problem of learning from data samples has its origin in the general notion of inference in classical philosophy. Every predictive learning process consists of two main phases: a. learning or estimating unknown dependencies in the system from a given set of samples, and b. using the estimated dependencies to predict new outputs for future input values of the system.
B4. These two steps correspond to the two classical types of inference: induction (progressing from particular cases, i.e. training data, to a general mapping or model) and deduction (progressing from a general model and given input values to particular cases of output values). Although statistical learning theory (SLT) is relatively new, it is a highly formalized theory of finite-sample inductive learning. It defines all the important concepts for inductive learning and provides mathematical proofs for most inductive-learning results. Other approaches, such as neural networks, Bayesian inference, and decision rules, are more engineering-oriented: they emphasize practical implementation and avoid the need for strong theoretical proofs and formalization. SLT describes the process of statistical estimation and hypothesis testing with small data samples. It explicitly considers the sample size and provides a quantitative description of the trade-off between the complexity of the model and the available information. Understanding SLT is important for designing methods of inductive learning, and many nonlinear learning procedures recently developed in neural networks, artificial intelligence, data mining, and statistics can be understood and interpreted in terms of general SLT principles.
Though SLT is quite general, it was originally developed for pattern recognition or classification problems, and the theory is therefore mainly applied to classification tasks. There is growing empirical evidence, however, of successful application of the theory to other types of learning problems.
There are two common types of inductive-learning methods:
B4.1. Supervised learning (or learning with a teacher): Supervised learning is useful in estimating an unknown relationship from given input-output samples. Classification and regression are common applications supported by this type of inductive learning. It may be noted that supervised learning assumes the existence of a teacher, i.e. a fitness function for evaluating the proposed model; the term “supervised” denotes that the output values for the training samples are known.
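A minimal sketch of supervised learning is a 1-nearest-neighbour classifier: the "teacher" supplies the known output labels for the training samples, and a new input is assigned the label of its closest training sample. All feature values and labels below are hypothetical.

```python
# Minimal supervised-learning sketch: 1-nearest-neighbour classification.
# The training samples (features, known label) are hypothetical.
import math

training = [
    ((1.0, 1.0), "low_risk"),
    ((1.2, 0.8), "low_risk"),
    ((5.0, 5.2), "high_risk"),
    ((4.8, 5.5), "high_risk"),
]

def classify(x):
    """Predict the label of the training sample closest to x."""
    def dist(sample):
        return math.dist(sample[0], x)
    return min(training, key=dist)[1]

print(classify((1.1, 0.9)))   # prints: low_risk
```

The known labels play the role of the teacher: without them, the distance computation alone could group the points but could not name the classes.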
B4.2. Unsupervised learning (or learning without a teacher): In the case of unsupervised learning there is no teacher; only sample data with input values are given, and the learner is required to form and evaluate models by itself. It may be noted that the objective of unsupervised learning is to discover natural structure within the sample data.
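A minimal sketch of unsupervised learning is k-means clustering on one-dimensional data: only input values are given, and the learner discovers two natural groups by itself. The data points and initial centres are hypothetical, and this toy version simply drops any cluster that becomes empty.

```python
# Minimal unsupervised-learning sketch: k-means clustering (k = 2) on
# one-dimensional data. Points and initial centres are hypothetical.

def kmeans_1d(points, centres, iterations=10):
    """Alternate assignment and centre-update steps."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            # Assign each point to its nearest current centre.
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Recompute each centre as the mean of its cluster
        # (empty clusters are dropped in this simple sketch).
        centres = [sum(c) / len(c) for c in clusters if c]
    return centres, clusters

points = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
centres, clusters = kmeans_1d(points, centres=[0.0, 10.0])
```

No labels were provided, yet the procedure recovers the two obvious groups of points; naming or interpreting those groups would still be left to the analyst.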
In any case, data mining applications are concerned with discovering and learning hidden and unknown patterns of knowledge from archived data. The prime learning tasks are classification, prediction, clustering, segmentation, dependency modeling and visualization. The techniques for performing these tasks are derived from many disciplines, the key contributor being statistics. Other disciplines related to probability and statistics have also contributed to data mining, principally through association rules, clustering techniques and decision trees. More recently, soft computing techniques such as artificial neural networks and genetic algorithms have contributed to the data mining toolkit.