Creation of Data Quality Rules for Data Profiling

I mentioned the 3-dimensional layout of the data profiling in the last post (Data Profiling).

3D of the data profiling

The column level has been depicted quite detailed. Because the description and the practical example the second dimension caused some questions, I devote a deeper analysis of the intra-table (table-intern) level using the open source Data Profiler of Talend.

Normalization

The goal of the data profiling is not only to detect the user-created errors, but you can also determine the logical design defects, which can cause duplications, redundancy in your database. The intra-table (used as table-intern in the last post) profiling level is related with the dependency checks within the table. With this the existence is of the 2nd normal form of the table design is checked, in other word if there any columns in the table, they are dependent from a non-key column.

Logical Design of the table Address

This table is not fully normalized for the second normal form, because the columns City and Country could be stored in an additional table and they can be saved as foreign keys in the table Address. However it was my decision not to create to many tables in the database, and the creation of the full addressing is also easier and quicker. However it does not fit to the principles of the second normal form. So the column City is dependent from the non-key column Country and it could contain defected entries by the user.

Theoretically the column PostCode is also not normalized for the 2nd normal form, because it contains not unfinite combinations and it is also depending of the column City. In the next example we will check this column by creating the Data Quality Rule in the Talend.

Data Quality Rule for the PostCode

The definition is for this rule is the following: ‘All PostCodes from the City Budapest are beginning with 1 and have the length of 4‘.

For this reason the Where klauzel of the DQ-Rule is: City = ‘Budapest’ and (left(postcode,1) != ‘1′ or length(postcode) != 4)

Data Quality rule

The quality of the table in aspect of this rule is 99.76%, that means we have a few defected rows, they do not match the criterium.

DQ statistics

In the last step let’s check the defected rows, where we can see, there are NULL values for the column PostCode, there are non PostCode-like entries and there are some which begin with a whitespace or end with a dot.

DQ statistics detailed

Summary

The creation and execution of this Data Quality Rule took me 10 minutes. However you can see, for a large database you can define hundreds and tousands of  these rules which can create some man-month work. The bigger effort is the resolution of the data quality issues – where you, as the leader the data migration project, cannot decide alone, which errors are important and which not. But the detecting of such errors is worth it to sacrifice the effort.

6 Responses to “Creation of Data Quality Rules for Data Profiling”

  • h143570 says:

    Interesting, it is good to know that such tools are exist especially in free versions.

  • yes, Talend is open source, what I tried is free, but the Integration Suite with more functionality has of course a price.

  • Comment by Henrik Liliendahl Sørensen from LinkedIn:

    Tibor, if we stick to the Data Quality rules of an attribute as a PostalCode it will somehow be a question of built or buy, as this an attribute that have the same rules in every organisation and these rules has already been examined and implemented in several ready made solutions around.

    It’s a learning journey to do it and I have been down that road myself by introducing more and more elegant solutions as:

    • Defining PostalCode formats in a table with an entry for each country
    • Noticing the Irish didn’t use PostalCodes
    • Realising the British has several formats (the UK PostalCode practice is considered very good in the UK and very strange outside the UK ;-)
    • Profiling/verifying with national PostalCode tables
    • Including street names in the above process
    • …..

  • Comment by Sebastiao Correia, Data Quality and MDM Team Leader of Talend from LinkedIn:

    Tibor, thanks for your tutorial about how to apply a DQ rule by using Talend Open Profiler. It may be helpful to the users.

    Here is my comment.
    As says Henrik, the postal code may not be the most appropriate data for which a DQ rule is necessary and there are several other ways to analyze a postal code.

    In Talend Open Profiler, you may also use “Regex Patterns” to see the matching postal codes. This could be another way to monitor your data quality issues on this kind of column (with what you called a “column level” analysis).

    Another way to follow the data quality on postal codes that I would recommend is to use a table with all the valid postal codes (and other optional data such as the country…) and analyze the matching values between your table and the postal codes table. This can be done with Talend Open Profiler as what is called a “redundancy analysis”. And this is what you would call an “inter-table” analysis.

    I would rather use DQ rules to monitor more functional business rules. For example, a rule that says that your customers must be at least 16 years old: “age > 15″. I mean something that is not as standardized as a postal code.

  • Answer to the comments:

    Henrik, I mentioned in the article, that PostalCodes can be theoretically sourced out into an own table – so You saw this in the practice. It is of course depending on the usage of the database if it is worth it to standardize so much. The database was used by a little company with under 20 users.

    I could more imagine here, as Sebastiao mentioned, a real business case, where we can put some semantic into the DQ rule: i.e. An MS Certified supplier probably wont have linked articles such “Apple”.

  • Osniel says:

    hi, i need to know how i can export a profile builded in Talend Open Profiler

Leave a Reply

Language
Info-Feed
Searching
Testimonials

... Great community player, exceptional knowledge of data migration"

Dylan Jones, Founder & editor of the community sites datamigrationpro.com and dataqualitypro.com

Testimonials

Tibor is a highly motivated and dedicated IT development leader.

Tibor and I worked together at Siemens and Nokia Siemens Networks running a major migration project for a french telecommunications operator where I was the project manager and Tibor the migration expert.

Tibor was an invaluable teammate on this project, and its success was directly related to his participation.

Tibor managed to analyse customers complex business processes in short time. He designed the migration concept and led tool development to ensure a smooth migration. Additionally to his professional qualifications, Tibor is a great person to work with.

When having such a migration project again I would be pleased to get Tibor again into the project team.

Hans-Joachim Kleist, Program Manager Business Support Systems, Nokia Siemens Networks


Testimonials
Working with Tibor was a pleasure as he showed excellent initiative and responded very fast to changing situations.

Tom De Ruysser, Project Manager, Migration Expert, TietoEnator