Data Profiling – Remove the spiderweb from the back of the wardrobe
You have probably had to move your furniture at some point, either because you were relocating to another city or because the flat was about to be painted. You move a wardrobe that has stood in the same spot for years and realize that its back is covered with disgusting cobwebs and other grime. What do you do: leave it as it is, because nobody can see it against the wall, or remove it immediately?
Next time you will think ahead and check the corners and hidden places of the room for anything that should be removed or does not belong there. In other words, you run a dirt profiling in your flat from time to time.
A similar approach is needed in a data migration project. Checking the data quality is one of the first steps in the project: by running data profiling we get a first impression of the scale of the project. That means this step must be executed before the final effort estimation!
What is data profiling?
During data profiling you examine the data in the legacy source to collect statistics and information for building the data quality rules. To structure the definition of data profiling, I have chosen the classification by Informatica: data profiling has 3 dimensions:
- column level
- table-intern level
- inter-table level
Metadata profiling
The first two levels (column level and table-intern level) can be examined with metadata profiling. On the column level you can check the
- data types
- domain, range of the values (e.g. a post code must be within the interval 1001-9999 in Hungary)
- pattern (e.g. the phone number has the pattern: +nnWnnWnnnnnn)
- frequency counts (e.g. most of the sales happen on workdays: Tue, Wed, Thu)
- statistics (min, max, median value, average value, etc.)
- dependencies
- redundancy
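Several of the column-level checks above can be sketched in a few lines of Python. This is only a minimal illustration and not how any particular profiling tool works: the helper names `to_pattern` and `profile_column` and the sample values are made up for this example, and the pattern notation simply assumes that `n` stands for a digit and `W` for a whitespace character.

```python
import re
from collections import Counter
from statistics import mean, median


def to_pattern(value):
    """Replace every digit with 'n' and every whitespace character with 'W'."""
    return re.sub(r"\s", "W", re.sub(r"\d", "n", value))


def profile_column(values):
    """Collect simple column-level statistics; None stands for NULL."""
    present = [v for v in values if v is not None]
    numeric = [v for v in present if isinstance(v, (int, float))]
    profile = {
        "types": Counter(type(v).__name__ for v in present),        # data types
        "frequencies": Counter(present),                            # frequency counts
        "patterns": Counter(to_pattern(v) for v in present if isinstance(v, str)),
    }
    if numeric:
        profile.update(min=min(numeric), max=max(numeric),
                       median=median(numeric), average=mean(numeric))
    return profile


# Hypothetical sample data, invented for this sketch
post_codes = [1011, 1052, 9999, 12345, 1011]
phones = ["+36 30 123456", "+36 20 654321", "06301234567"]

post_profile = profile_column(post_codes)
phone_profile = profile_column(phones)

# Range check: Hungarian post codes must lie within 1001-9999
out_of_range = [code for code in post_codes if not 1001 <= code <= 9999]

print(post_profile["min"], post_profile["max"], post_profile["median"])
print(phone_profile["patterns"].most_common())
print(out_of_range)
```

The pattern counter is often the quickest way to spot badly formatted values: every phone number that does not fall into the dominant pattern is a candidate for a data quality rule.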
You can draw information within the table from dependency checks (this category also belongs to the third dimension, the inter-table level). You can determine the dependency between column values using the normalization rules from the logical data model design: e.g. the area code of a phone number is related to the ‘city‘ column.
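A dependency check inside one table can be sketched as follows. The idea is that if one column is supposed to determine another, each value of the first column should map to exactly one value of the second. The function name `dependency_violations`, the column names and the sample rows are assumptions for this illustration only.

```python
from collections import defaultdict


def dependency_violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical sample rows: the area code should determine the city
rows = [
    {"area_code": "1", "city": "Budapest"},
    {"area_code": "62", "city": "Szeged"},
    {"area_code": "62", "city": "Budapest"},  # suspicious combination
    {"area_code": "1", "city": "Budapest"},
]

violations = dependency_violations(rows, "area_code", "city")
print(violations)
```

A non-empty result does not say which row is wrong, only that the expected dependency does not hold; the decision about the fix stays with the people who know the data.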
Finding dependencies between the tables (the inter-table level dimension) is based on the table model design: e.g. a foreign-key value customer-id in the orders table must appear as a primary key in the customer table. In my experience, most of the data garbage comes from missing referential integrity between the tables, caused by poor data model design.
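The referential integrity check described here boils down to looking for orphaned foreign keys. The sketch below assumes the two tables are already loaded as lists of dictionaries; the function name `orphaned_keys`, the column names and the sample rows are hypothetical.

```python
def orphaned_keys(child_rows, foreign_key, parent_rows, primary_key):
    """Return foreign-key values that have no matching primary key in the parent table."""
    parent_keys = {row[primary_key] for row in parent_rows}
    return sorted({row[foreign_key] for row in child_rows
                   if row[foreign_key] not in parent_keys})


# Hypothetical sample tables
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
    {"order_id": 12, "customer_id": 4},  # no such customer
    {"order_id": 13, "customer_id": 4},
]

orphans = orphaned_keys(orders, "customer_id", customers, "customer_id")
print(orphans)
```

In a real legacy database the same question is usually asked in SQL with an outer join or a NOT EXISTS query; the Python version only shows the logic.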
Example with an open source tool
I have chosen the Data Profiling Tool by Talend and ran it on a database with a few thousand records in the ‘Address‘ table. The free downloadable version supports the first two dimensions of data profiling: column level and table-intern level.
In the picture below you see the example of two columns: Address and AddressID. The colors for the column Address have the following meaning:
- Red: number of all records in the table (6575)
- Yellow: NULL values
- Orange: Distinct Count (6180)
- Blue: Unique Count (5910)
- Pink: Duplicated Count (270)
- Light Blue: Blank Count (39)
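The figures above suggest how the indicators relate to each other: 5910 unique values plus 270 duplicated values give the 6180 distinct values. The sketch below reproduces the indicators under that reading. It is an assumption about the semantics, not the tool's documented behavior, and the function name `count_profile` and the sample values are invented.

```python
from collections import Counter


def count_profile(values):
    """Compute simple count indicators for one column; None stands for NULL."""
    non_null = [v for v in values if v is not None]
    occurrences = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        # number of different values in the column
        "distinct_count": len(occurrences),
        # values that occur exactly once
        "unique_count": sum(1 for n in occurrences.values() if n == 1),
        # values that occur more than once
        "duplicate_count": sum(1 for n in occurrences.values() if n > 1),
        # rows that contain only whitespace or an empty string
        "blank_count": sum(1 for v in non_null
                           if isinstance(v, str) and v.strip() == ""),
    }


# Hypothetical sample of an address column
addresses = ["Main St 1", "Main St 1", "Oak Ave 2", "", "  ", None, "Budapest"]

counts = count_profile(addresses)
print(counts)
```

Under these definitions the distinct count always equals the unique count plus the duplicate count, which matches the numbers in the list above.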

The primary key AddressID seems to be OK, because the count of unique values is the same as the count of rows. But what can we do with the column Address? Theoretically it could also be unique, since the column contains street and number.

In the picture above you can see the defective rows. The data has clearly been badly maintained: instead of the real address we find city names, phone numbers and duplicated addresses. To eliminate the duplicated rows, the tables connected to the address table must also be checked with the data profiling tool.
In this short example you could see several data profiling types: redundancy for the duplicated rows, range for the recognized phone numbers, frequency counts, etc.
Summary
Data profiling is the anteroom to the creation of the data quality rules. To execute the whole data profiling, you have to check each column, each table and each connection between the tables. Once you have the statistics and all information about the defects, the data migration team, which should also include the right stakeholders from the business side, must decide on the measures to take and the effort to invest in the fixing.

Tibor,
This is a great introduction to Data Profiling. I actually just started using Talend Open Profiler and like it so far. It is easy to understand and like you show above it gives you a basic profile of your data that you can use to build your data quality rules.
Good job on the post.
Regards,
Charles
Another question from Charles:
“In your blog you mention:
I have chosen the Data Profiling Tool by Talend on a database with some tousand records of the ‘Address‘ table. The free downloadable version supports the first two dimension of the data profiling types: column level and table-intern level.
Is there a tutorial that shows how to use Talend to do the table-intern level profiling? Actually I would really appreciate it if you could explain what that level is in a little more detail (maybe another blog post?)
I look forward to learning from you as I am just beginning to get into Data Quality. Any free resources you can point me to so I can learn as much as I can, would be very much appreciated.
Thank you for your help.”
Answer:
Thanks for commenting Charles!
To be honest, I have used my own terms for the 3 dimensions of data profiling. It is better to answer the question in a separate blog post.