Creation of Data Quality Rules for Data Profiling
I mentioned the 3-dimensional layout of the data profiling in the last post (Data Profiling).

The column level has been depicted quite detailed. Because the description and the practical example the second dimension caused some questions, I devote a deeper analysis of the intra-table (table-intern) level using the open source Data Profiler of Talend.
Normalization
The goal of the data profiling is not only to detect the user-created errors, but you can also determine the logical design defects, which can cause duplications, redundancy in your database. The intra-table (used as table-intern in the last post) profiling level is related with the dependency checks within the table. With this the existence is of the 2nd normal form of the table design is checked, in other word if there any columns in the table, they are dependent from a non-key column.

This table is not fully normalized for the second normal form, because the columns City and Country could be stored in an additional table and they can be saved as foreign keys in the table Address. However it was my decision not to create to many tables in the database, and the creation of the full addressing is also easier and quicker. However it does not fit to the principles of the second normal form. So the column City is dependent from the non-key column Country and it could contain defected entries by the user.
Theoretically the column PostCode is also not normalized for the 2nd normal form, because it contains not unfinite combinations and it is also depending of the column City. In the next example we will check this column by creating the Data Quality Rule in the Talend.
Data Quality Rule for the PostCode
The definition is for this rule is the following: ‘All PostCodes from the City Budapest are beginning with 1 and have the length of 4‘.
For this reason the Where klauzel of the DQ-Rule is: City = ‘Budapest’ and (left(postcode,1) != ‘1′ or length(postcode) != 4)

The quality of the table in aspect of this rule is 99.76%, that means we have a few defected rows, they do not match the criterium.

In the last step let’s check the defected rows, where we can see, there are NULL values for the column PostCode, there are non PostCode-like entries and there are some which begin with a whitespace or end with a dot.

Summary
The creation and execution of this Data Quality Rule took me 10 minutes. However you can see, for a large database you can define hundreds and tousands of these rules which can create some man-month work. The bigger effort is the resolution of the data quality issues – where you, as the leader the data migration project, cannot decide alone, which errors are important and which not. But the detecting of such errors is worth it to sacrifice the effort.
english
magyar

Interesting, it is good to know that such tools are exist especially in free versions.
yes, Talend is open source, what I tried is free, but the Integration Suite with more functionality has of course a price.
Comment by Henrik Liliendahl Sørensen from LinkedIn:
Tibor, if we stick to the Data Quality rules of an attribute as a PostalCode it will somehow be a question of built or buy, as this an attribute that have the same rules in every organisation and these rules has already been examined and implemented in several ready made solutions around.
It’s a learning journey to do it and I have been down that road myself by introducing more and more elegant solutions as:
• Defining PostalCode formats in a table with an entry for each country
• Noticing the Irish didn’t use PostalCodes
• Realising the British has several formats (the UK PostalCode practice is considered very good in the UK and very strange outside the UK
• Profiling/verifying with national PostalCode tables
• Including street names in the above process
• …..
Comment by Sebastiao Correia, Data Quality and MDM Team Leader of Talend from LinkedIn:
Tibor, thanks for your tutorial about how to apply a DQ rule by using Talend Open Profiler. It may be helpful to the users.
Here is my comment.
As says Henrik, the postal code may not be the most appropriate data for which a DQ rule is necessary and there are several other ways to analyze a postal code.
In Talend Open Profiler, you may also use “Regex Patterns” to see the matching postal codes. This could be another way to monitor your data quality issues on this kind of column (with what you called a “column level” analysis).
Another way to follow the data quality on postal codes that I would recommend is to use a table with all the valid postal codes (and other optional data such as the country…) and analyze the matching values between your table and the postal codes table. This can be done with Talend Open Profiler as what is called a “redundancy analysis”. And this is what you would call an “inter-table” analysis.
I would rather use DQ rules to monitor more functional business rules. For example, a rule that says that your customers must be at least 16 years old: “age > 15″. I mean something that is not as standardized as a postal code.
Answer to the comments:
Henrik, I mentioned in the article, that PostalCodes can be theoretically sourced out into an own table – so You saw this in the practice. It is of course depending on the usage of the database if it is worth it to standardize so much. The database was used by a little company with under 20 users.
I could more imagine here, as Sebastiao mentioned, a real business case, where we can put some semantic into the DQ rule: i.e. An MS Certified supplier probably wont have linked articles such “Apple”.
hi, i need to know how i can export a profile builded in Talend Open Profiler