Data Modeling: Profile & Position Analysis

The first two steps in Complete Data Quality’s data modeling solution are profile analysis and position analysis. The data classification engine is highly accurate but not perfect, so the data modelers are tasked with reviewing and correcting the classification results. Clean Cloud provides the profile and position analysis capabilities needed to complete this task.

Profile Analysis

Profile analysis allows the data modelers to review the classification results summarized into model data type frequencies. The modelers are able to edit or delete erroneous values from the results and apply the changes across the entire project. This solution is designed for the data modelers, who are given complete control over not only the classification engine but the classification results as well.

In the profile analysis step, the data modelers can consolidate classification data types and reassign specific values when they deem it appropriate. They can also enforce the validated classified values across all of the data, which eliminates re-work within the current project. The intention is to give the data modelers the tools to correct each classification error once, then enforce the correction across all of the project data.

In our demonstration example, we chose real job posting emails for two reasons. First, most professionals are familiar with these types of emails. Second, email data represents all three forms of data: structured, semi-structured, and unstructured. Clean Cloud classifies business data across all three forms seamlessly.

The following is a subset of the profiling results for the TITLE classification data type from the job posting email example:

Value                                            Frequency
Data Architect with Data Modeler                 2
Informatica Middleware B2B Solution Lead         2
Information Management Analyst                   1
Information Management Analyst San Antonio TX    1

The last title value includes the job location. The data modeler corrects this by removing the location from the title and saving it as a JOB LOCATION data type value. Saving the adjusted title removes the erroneous value and updates the frequency count:

Value                                            Frequency
Data Architect with Data Modeler                 2
Informatica Middleware B2B Solution Lead         2
Information Management Analyst                   2
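The correction above can be sketched in a few lines of Python. This is only an illustration of the idea, not Clean Cloud's implementation: the function names `profile` and `split_overloaded` are hypothetical, and the input list simply repeats the values from the table.

```python
from collections import Counter

# Classified TITLE values, repeated according to the frequencies in the table.
title_values = [
    "Data Architect with Data Modeler",
    "Data Architect with Data Modeler",
    "Informatica Middleware B2B Solution Lead",
    "Informatica Middleware B2B Solution Lead",
    "Information Management Analyst",
    "Information Management Analyst San Antonio TX",
]


def profile(values):
    """Summarize classified values into a value -> frequency table."""
    return Counter(values)


def split_overloaded(values, bad_value, corrected_value, extracted_value):
    """Replace every occurrence of bad_value with corrected_value and
    collect the removed part (here, the location) as separate values."""
    corrected, extracted = [], []
    for value in values:
        if value == bad_value:
            corrected.append(corrected_value)
            extracted.append(extracted_value)
        else:
            corrected.append(value)
    return corrected, extracted


titles, locations = split_overloaded(
    title_values,
    bad_value="Information Management Analyst San Antonio TX",
    corrected_value="Information Management Analyst",
    extracted_value="San Antonio TX",
)
title_freq = profile(titles)        # TITLE frequencies after the correction
location_freq = profile(locations)  # new JOB LOCATION frequencies

for value, count in title_freq.items():
    print(f"{value:<45}{count}")
```

Because the frequencies are recomputed from the corrected values, the erroneous entry disappears and the count for "Information Management Analyst" rises without any separate bookkeeping.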

Once the data modeler completes all of the corrections to the TITLE data type frequencies, the TITLE frequencies are assigned to the JOB TITLE data type, which consolidates all of the titles into the primary data type. The data modeler is then able to enforce the consolidated JOB TITLE frequencies across the entire project, meaning that every occurrence of a validated JOB TITLE value is automatically classified wherever it appears in a project value.
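As a rough sketch of consolidation and enforcement, assume the validated frequencies are kept in a dictionary keyed by data type and that enforcement is a plain literal text match. Both are assumptions made for illustration: the names `consolidate` and `enforce` are hypothetical, the "Data Modeler" count is made up, and the actual engine may match values quite differently.

```python
import re

# Hypothetical store of validated frequencies, keyed by data type.
frequencies = {
    "TITLE": {
        "Data Architect with Data Modeler": 2,
        "Information Management Analyst": 2,
    },
    "JOB TITLE": {"Data Modeler": 3},  # illustrative count, not from the article
}


def consolidate(frequencies, source_type, target_type):
    """Move every value of source_type into target_type, summing counts."""
    target = frequencies.setdefault(target_type, {})
    for value, count in frequencies.pop(source_type, {}).items():
        target[value] = target.get(value, 0) + count
    return frequencies


def enforce(project_values, validated_values, data_type):
    """Return (value_index, data_type, matched_text) for every occurrence
    of a validated value inside the project values."""
    # Try longer values first so a long title wins over a shorter title
    # that starts at the same position.
    ordered = sorted(validated_values, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))
    hits = []
    for index, text in enumerate(project_values):
        for match in pattern.finditer(text):
            hits.append((index, data_type, match.group(0)))
    return hits


consolidate(frequencies, "TITLE", "JOB TITLE")
project_values = [
    "Immediate Interview for a Data Modeler- San Antonio, TX",
    "Data Architect with Data Modeler needed",
]
hits = enforce(project_values, frequencies["JOB TITLE"], "JOB TITLE")
```

Ordering the alternatives longest-first is a design choice for this sketch: it keeps "Data Architect with Data Modeler" from being matched at its first character by some shorter validated title.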

The data modeler repeats this analysis for all of the classified data type frequencies until all of the valid classifications are enforced across the entire project. The modeler then performs the position analysis step to complete the modeling process.

Position Analysis

Position analysis presents a redacted version of each value analyzed for the data modelers to review. The data modelers are able to visually identify unclassified content that needs to be addressed. The following is an example of a subject value from the job posting emails presented by the position analysis tool:

" Immediate Interview for a || JOB TITLE || - || JOB LOCATION || - || JOB YEARS OF EXPERIENCE || of Experience is Must "

The data modeler is able to see exactly what was and was not classified by the engine. The modeler compares the results with the original value to ensure proper classification of all business values:

"Immediate Interview for a Data Modeler- San Antonio, TX - 10+ Years of Experience is Must"

The data modeler is likely to also classify two pieces of content the engine missed: that the interview is immediate and that the experience is a must. Here are the results after the modeler adjusts the classification:

" || JOB INTERVIEW || for a || JOB TITLE || - || JOB LOCATION || - || JOB WORK EXPERIENCE || "

Where:

JOB INTERVIEW = Immediate Interview

JOB TITLE = Data Modeler

JOB LOCATION = San Antonio, TX

JOB WORK EXPERIENCE = 10+ Years of Experience is Must

JOB YEARS OF EXPERIENCE = 10+ Years
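The redacted view in this example can be approximated by replacing each classified span of the original value with a placeholder. The sketch below assumes classifications are available as non-overlapping (start, end, data type) spans; `redact` and `span_of` are hypothetical helpers, and a nested type such as JOB YEARS OF EXPERIENCE inside JOB WORK EXPERIENCE is represented only by its outermost span. The spacing around placeholders follows the original text, so it may differ slightly from the tool's display.

```python
def redact(text, spans):
    """Replace each classified span with a || DATA TYPE || placeholder.

    spans is a list of non-overlapping (start, end, data_type) tuples.
    """
    parts, cursor = [], 0
    for start, end, data_type in sorted(spans):
        parts.append(text[cursor:start])  # unclassified text stays visible
        parts.append(f"|| {data_type} ||")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


def span_of(text, fragment, data_type):
    """Build a span for the first occurrence of fragment in text."""
    start = text.index(fragment)
    return (start, start + len(fragment), data_type)


subject = ("Immediate Interview for a Data Modeler- San Antonio, TX"
           " - 10+ Years of Experience is Must")

# What the engine classified on its own.
engine_spans = [
    span_of(subject, "Data Modeler", "JOB TITLE"),
    span_of(subject, "San Antonio, TX", "JOB LOCATION"),
    span_of(subject, "10+ Years", "JOB YEARS OF EXPERIENCE"),
]
before = redact(subject, engine_spans)

# After the modeler classifies the overlooked content.
modeler_spans = [
    span_of(subject, "Immediate Interview", "JOB INTERVIEW"),
    span_of(subject, "Data Modeler", "JOB TITLE"),
    span_of(subject, "San Antonio, TX", "JOB LOCATION"),
    span_of(subject, "10+ Years of Experience is Must", "JOB WORK EXPERIENCE"),
]
after = redact(subject, modeler_spans)

print(before)
print(after)
```

Anything left outside a placeholder is exactly the content the modeler still needs to review, which is what makes the redacted form useful for spotting overlooked values.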

Position analysis provides these capabilities to the data modeler to ensure that all viable business data is properly classified from the original values. The data modeler is able to quickly identify overlooked or erroneously classified values and correct the classification results. The modeler is then able to apply the changes to the current value, all the values in the attribute, or across all the project values. This ensures that the modeler is only required to make the change one time.
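The three scopes for applying a change (the current value, the attribute, or the whole project) can be pictured with a small sketch. Everything here is hypothetical: the `Value` record, the `apply_correction` function, and the literal substring match are assumptions made for illustration, not a description of how Clean Cloud stores or propagates corrections.

```python
from dataclasses import dataclass, field


@dataclass
class Value:
    attribute: str  # source attribute, e.g. "subject" or "body"
    text: str
    classifications: list = field(default_factory=list)  # (fragment, data_type)


def apply_correction(values, fragment, data_type, scope, current=None):
    """Classify fragment as data_type within the chosen scope.

    scope is "value" (only current), "attribute" (every value sharing
    current's attribute), or "project" (every value).
    Returns the number of values that were changed.
    """
    if scope not in ("value", "attribute", "project"):
        raise ValueError(f"unknown scope: {scope}")
    if scope != "project" and current is None:
        raise ValueError("current is required for value and attribute scope")
    changed = 0
    for value in values:
        in_scope = (
            scope == "project"
            or (scope == "attribute" and value.attribute == current.attribute)
            or (scope == "value" and value is current)
        )
        entry = (fragment, data_type)
        if in_scope and fragment in value.text and entry not in value.classifications:
            value.classifications.append(entry)
            changed += 1
    return changed


values = [
    Value("subject", "Immediate Interview for a Data Modeler"),
    Value("subject", "Immediate Interview - Data Architect"),
    Value("body", "We offer an Immediate Interview to qualified candidates."),
]

n_value = apply_correction(values, "Immediate Interview", "JOB INTERVIEW",
                           scope="value", current=values[0])
n_attribute = apply_correction(values, "Immediate Interview", "JOB INTERVIEW",
                               scope="attribute", current=values[0])
n_project = apply_correction(values, "Immediate Interview", "JOB INTERVIEW",
                             scope="project")
```

Each wider scope only touches values that the narrower scope had not already updated, which mirrors the goal of making each correction once.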

The position analysis step is performed against all values from the source attributes. For the job posting emails, this includes the subject, body, from name, and from address. The subject and body contain a mixture of unstructured and semi-structured content. The from name and from address contain structured data. The modeler is able to quickly pass through the data and correct any classification problems.

Modeling the Data

The data modelers have complete control over the classification engine to create and maintain data types, type classes, classes, and domains. Modelers create the classification components specific to the data being modeled from the data source. This is accomplished with supervised machine learning algorithms that create and maintain the classification data types: the modeler feeds valid values to the classification engine to create the data types. The modeler is also able to customize the inferred metadata and add capabilities to enhance the classification results.

Once the modelers are satisfied with the classification data types and components, they run the classification engine against the data source. The profile and position analysis tools are then used to correct overlooked and erroneous classifications. The profile analysis tool allows the modeler to correct the classified values. The position analysis tool allows the modeler to correct overlooked classifications. The combination of these two analysis capabilities allows the data modelers to drive the classification accuracy toward 100%.

In the next article, we will discuss the data quality assessment completed against the classification results.

By Antonio Amorin