Text Mining
The Problem!
When conducting a research study on a specific problem, data must be collected in order to analyze it. For instance, in the construction field, occupational accidents have become a major problem. According to recent research, the most severe causes of accidents are the worker factor, the technological factor, natural factors, and surrounding activities. When conducting research on such a case, the necessary data must often be collected through interviews and questionnaires. With these data collection methods there is a wide possibility of biased answers, skipped questions, misinterpretation, or accessibility issues. In addition, the collected data are transformed into reports saved in datasheets, and extracting the data from them manually is time-consuming.
The Solution.
The text mining approach can collect data from reports made available by organizations; it attempts to extract the sources of accidents and analyze the hazards using Pareto analysis. The resulting data are more accurate and reliable, and the method also saves a great deal of time.
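To make the idea concrete, here is a minimal sketch of a Pareto analysis in Python. The cause labels come from the study above, but the report counts are hypothetical, invented purely for illustration:

```python
from collections import Counter

# Hypothetical counts of accident reports attributed to each cause.
cause_counts = Counter({
    "worker factor": 120,
    "technological factor": 45,
    "natural factors": 20,
    "surrounding activities": 15,
})

total = sum(cause_counts.values())
cumulative = 0.0
marked = False
print(f"{'Cause':<25}{'Share':>8}{'Cumulative':>12}")
for cause, count in cause_counts.most_common():
    share = count / total
    cumulative += share
    print(f"{cause:<25}{share:>7.0%}{cumulative:>11.0%}")
    # Pareto rule of thumb: a 'vital few' causes cover ~80% of accidents.
    if not marked and cumulative >= 0.8:
        print("-- causes above this line account for ~80% of reports --")
        marked = True
```

The typical Pareto reading is that a "vital few" causes account for the bulk of accidents, so safety efforts can be prioritized on those causes first.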
What is Text Mining?
Text mining is an artificial intelligence technology in which free text in documents is converted to machine-understandable structured data using Natural Language Processing (NLP). NLP is a technique for interaction between computers and human languages. The process provides high-quality information from the text after conversion.
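As a rough illustration of that conversion, the sketch below uses spaCy, one common NLP library (the article does not name a specific tool), to turn a free-text sentence into structured token records:

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

report = "The worker slipped on the wet scaffold and injured his arm."
doc = nlp(report)

# Each free-text word becomes a structured record the machine can query.
for token in doc:
    print(f"{token.text:<10}{token.lemma_:<10}{token.pos_}")
```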
The Process.
Text mining relies on three main classifiers: Support Vector Machine (SVM), Decision Tree (DT), and Random Forest (RF). These three classifiers analyze the converted data according to given attributes and conditions.
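Here is a minimal sketch of how these three classifiers could be trained on accident reports, assuming scikit-learn and a tiny hand-made set of labelled texts (both are illustrative choices, not something prescribed by the article):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled accident reports: text plus its cause category.
texts = [
    "worker fell from ladder without harness",
    "crane motor overheated and failed",
    "heavy rain flooded the excavation site",
    "debris from the neighbouring site struck a worker",
]
labels = ["worker factor", "technological factor",
          "natural factors", "surrounding activities"]

# One pipeline per classifier; TF-IDF turns the text into numeric attributes.
classifiers = {
    "SVM": SVC(kernel="linear"),
    "Decision tree": DecisionTreeClassifier(),
    "Random forest": RandomForestClassifier(n_estimators=100),
}
for name, clf in classifiers.items():
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(texts, labels)
    print(name, "->", model.predict(["ladder collapsed under worker"])[0])
```

A TF-IDF step is included because SVM, DT, and RF all expect numeric attributes rather than raw text.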
After the classification is done, the data are processed according to a set of procedures. In the first step, the collected data are separated into words, and a token is created for each word. The next step is stop-word removal, where the most common words in a sentence or paragraph are filtered out so that the unique words are kept. The third step is stemming and lemmatization, where inflected words are reduced to their base forms as the sentence is broken down. In the last step, all the documents are combined into a corpus and presented as a collection of texts.
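These four steps map directly onto standard NLP tooling. Below is a minimal sketch using NLTK (an assumed library choice; the report sentence is invented for illustration):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer, stop-word list, and WordNet data.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

report = "The workers were lifting heavy beams when the scaffolding collapsed."

# Step 1: tokenization - split the text into individual word tokens.
tokens = [t.lower() for t in word_tokenize(report) if t.isalpha()]

# Step 2: stop-word removal - drop common words, keep the informative ones.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# Step 3: stemming and lemmatization - reduce words to their base forms.
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stems = [stemmer.stem(t) for t in tokens]
lemmas = [lemmatizer.lemmatize(t, pos="v") for t in tokens]
print(stems)   # e.g. 'lifting' -> 'lift', 'collapsed' -> 'collaps'
print(lemmas)  # e.g. 'lifting' -> 'lift', 'collapsed' -> 'collapse'

# Step 4: the cleaned documents together form the corpus.
corpus = [" ".join(lemmas)]
```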
In addition to this machine categorization, manual categorization can also be done. After these steps, an N-gram table is formed for each cause, in which the activity behind the cause is described in one word, two words, and three words. Finally, a validation process is carried out to find the average weighted F1 score, which evaluates the classification performance for each cause.
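A brief sketch of these last two ideas, assuming scikit-learn: CountVectorizer with ngram_range=(1, 3) builds the one-, two-, and three-word N-gram table, and f1_score with average="weighted" computes the weighted F1 used for validation. The documents and labels below are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score

# N-gram table: count every 1-, 2-, and 3-word phrase in the cleaned reports.
docs = ["worker fell ladder", "worker slipped wet floor"]
vectorizer = CountVectorizer(ngram_range=(1, 3))
counts = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())

# Validation: weighted F1 averages the per-class F1 scores, weighted by how
# often each cause appears, so frequent causes count proportionally more.
y_true = ["worker factor", "worker factor", "natural factors"]
y_pred = ["worker factor", "natural factors", "natural factors"]
print(f1_score(y_true, y_pred, average="weighted"))
```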