Archive for the ‘data profiling’ Category
Creation of Data Quality Rules for Data Profiling
I mentioned the 3-dimensional layout of the data profiling in the last post (Data Profiling).

The column level has been depicted quite detailed. Because the description and the practical example the second dimension caused some questions, I devote a deeper analysis of the intra-table (table-intern) level using the open source Data Profiler of Talend.
Normalization
The goal of the data profiling is not only to detect the user-created errors, but you can also determine the logical design defects, which can cause duplications, redundancy in your database. The intra-table (used as table-intern in the last post) profiling level is related with the dependency checks within the table. With this the existence is of the 2nd normal form of the table design is checked, in other word if there any columns in the table, they are dependent from a non-key column.

This table is not fully normalized for the second normal form, because the columns City and Country could be stored in an additional table and they can be saved as foreign keys in the table Address. However it was my decision not to create to many tables in the database, and the creation of the full addressing is also easier and quicker. However it does not fit to the principles of the second normal form. So the column City is dependent from the non-key column Country and it could contain defected entries by the user.
Theoretically the column PostCode is also not normalized for the 2nd normal form, because it contains not unfinite combinations and it is also depending of the column City. In the next example we will check this column by creating the Data Quality Rule in the Talend.
Data Quality Rule for the PostCode
The definition is for this rule is the following: ‘All PostCodes from the City Budapest are beginning with 1 and have the length of 4‘.
For this reason the Where klauzel of the DQ-Rule is: City = ‘Budapest’ and (left(postcode,1) != ‘1′ or length(postcode) != 4)

The quality of the table in aspect of this rule is 99.76%, that means we have a few defected rows, they do not match the criterium.

In the last step let’s check the defected rows, where we can see, there are NULL values for the column PostCode, there are non PostCode-like entries and there are some which begin with a whitespace or end with a dot.

Summary
The creation and execution of this Data Quality Rule took me 10 minutes. However you can see, for a large database you can define hundreds and tousands of these rules which can create some man-month work. The bigger effort is the resolution of the data quality issues – where you, as the leader the data migration project, cannot decide alone, which errors are important and which not. But the detecting of such errors is worth it to sacrifice the effort.
Data Profiling – Remove the spiderweb from the back of the wardrobe
You were probably in the situation when your furniture had to be moved from their old place because of relocation to another city or before flat painting. You move the floor-rooted piece of furniture and you realize that the back is covered with discusting spiderweb and other grime. What will you do: leave it as it is, because it cannot be seen near the wall or you will remove it immadietly?
Next time you will reckon and check the corners, not visible places of the room what should be removed what does not match into the room. That means you execute a dirt profiling from time to time in your flat.
A similar approach must done during the data migration project. Checking the data quality is one of the first steps in the project: with the execution of the data profiling can we build a first impression about the scale of the project. That means this step must be executed before the final effort estimation!
What is the data profiling?
During the data profiling process you will examinate the data in the legacy source to collect statistics and information for building the data quality rules. For the dimensioning of the definition of the data profiling, I have chosen the determination by Informatica: The data profiling has 3 dimensions:
- column level
- table-intern level
- inter-table level
Metadata profiling
The first two levels (column level and table-intern level) can be examined with the metadata profiling. On the column level you can check the
- data types
- domain, range of the values (i.e. post code must be within the interval of 1001-9999 in Hungary)
- pattern (i.e. the phone number has the pattern: +nnWnnWnnnnnn)
- frequency counts (i.e. most of the sells happen on workdays: Tue, Wed, Thu)
- Statistic numbers (min, max, median value, avarage value, etc)
- dependencies
- Redundancy
You can draw information within the table by dependeny checks (this category takes also place in the third dimension: in the inter-table level). You can determinate the dependency between column values by the normalization rules from the logical data model design: i.e. a national code of a phone number is related to the ‘city‘ column.
Finding dependencies between the tables (inter-table level dimension) are based on table model design: i.e. a foreign-key value customer-id in the orders table must appear as primary key in the customer table. After my experience the most the data garbage is coming from the missing referential integrity between the tables which was caused by poor data model design.
Example by an open source tool
I have chosen the Data Profiling Tool by Talend on a database with some tousand records of the ‘Address‘ table. The free downloadable version supports the first two dimension of the data profiling types: column level and table-intern level.
In the picture below you see the example of two columns: Address and AddressID. The meaning of the colors of the column Address.
- Red: number of all records in the table (6575)
- Yellow: NULL values
- Orange: Distinct Count (6180)
- Blue: Uniqe Count (5910)
- Pink: Duplicated Count (270)
- Light Blue: Blank Count (39)

The Primary Key AddressID seems to be OK, because the count of the Unique values are the same with the count of the rows. But what can we do with the column Address? Theoretically it can be also uniqe, however the column contains street and number.

You can see in the picture above the defect rows. There are clearly bad administrated data, instead of the real address we find city name, phone number and duplicated addresses. To eliminate the duplicated rows the connected tables of address must be also checked by the data profiling tool.
In this short example you could see more data profiling types: redundancy for the duplicated rows, range for the recognized phone numbers, frequency counts, etc.
Summary
The data profiling is the anteroom for the creation of the data quality rules. To execute the whole data profiling, you have to check each column, each table and each connections between the tables. If you have the statistics and all information about the defects, then the data migration team, where the right stakeholders from the business side are also member of the team, must decide about the measurements and the invested efforts for the fixing.













english
magyar
