Archive for the ‘data quality’ Category
Data migration project success
How to succeed a (2-day) data migration project?
Contact list data migration part 2

You can laugh about the volume of the project, but if you cannot succeed a short project, how do you think you will be successful in a large project?
This post is the second and last writing of the The true story of contact list data migration, where many contacts from 4 different sources are migrated in a possibly maximum data quality to iPhone.
Transient data store for migration Nr.3
You saw in the last post, that I had problem in the “last name“-”first name” sequence in Google Address List, and there is no way to repair it automatically (which in the end effect faster then manual correction). The next attempt is to export all the contacts into a CSV file, which can be imported into Excel (or in my case OpenOffice Calc), where the name sequences can be repaired relative quickly. This is already the third transient data store.
Note: At this point I can check off 2 legacy sources: contacts from the old Nokia phone and from the Thunderbird contact list. The Excel table with old numbers and an Outlook Express Address Book is remaining.
An obvious choice is copy-paste the Excel contacts to the OpenOffice Calc (in CSV format), of course with merging the right columns.
The contacts in Outlook Express are in the WAB (Windows Address Book) – which can be imported into a CSV format – so I have again a direct path to the end list with copy-paste.
The last steps as importing back the Google-kind CSV format to the Google Contact list and final synchronisation with iPhone through the iTunes is without any problem and I can be happy.
The key element of the migration
After three attempts I have found the safest and quickest method (and transient data store), where the data import, data modification and data quality optimization has the fastest way: using the Google Address Book CSV format in the application OpenOffice Calc. Look at the final workflow which became complicated in the first look.

Data Quality Aspects
The original scope in aspect of Data Quality is the maximum data quality, there are no excuses, I don’t want to use any garbage in the new phone. Here is the hypothetical question: what is maximum data quality? All data records in the target are fine, so I can use them? Or: all data records in the source(s) are in usable form in the target? Of course not, there are many useless e-mail addresses in the sources which have been added automatically after writing a mail – even from 5-10 years ago, where I don’t know any more, who is behind the e-mail address.
Here are the important points in the data quality:
- the sequence “last name”-”first name” which was explained very detailed in the last post
- the using of the Hungarian special characters as: ö, ő, ü, ű, í, á, é, ó – fortunately using the UTF8 coding between the export-imports that was no problem.
- The right source fields should be placed into the right target fields (e.g. it has no sense a phone number in the e-mail column)
- The phone number formats should have this pattern: (+<international_code> <national_code> <phone_number> )
Elements from Project Management
If I were enough schizophrenic, then I could have the following concerns for a meeting:
- The rough effort estimation said: 2 days time for this activity.
- The risk was mentioned in the beginning of the first post: the biggest risk is not to reach the project goal within time, because of unknown functionality of the available tools. And this was the main focus during the 2 days: to find the fastest solution to merge the contacts. The risk evolved to a real problem, but it has been handled by focusing the solution.
- Measuring problem: How could I be certain, that all important contacts have been migrated? The usual counting technique (counting of the contact of each source and comparison with the number of the final migrated contacts) is not efficient because of many useless e-mail addresses (with the pattern info@xyz.com). What do you think, what would be an acceptable method to be sure not to forget any contact?
Summary
I declare this mini-project successful, because the migration has been executed in time and in the target point of view with max. data quality. However there were lions in the path even in this relative simple and short activity which endanger the reaching the main goal.
The true story of contact list data migration
Project contact list migration part 1
My old Nokia phone reached this year the school-age (became 6 years old), and I was satisfied with him in the past. Unfortunately it became deaf and mute recently, so I was forced to purchase a new phone (iPhone).
I possess a mobile phone since 1997 and I carried the contacts from phone to phone. Beside this I have an Address Book of the E-Mail software (Thunderbird) with contacts, an address book of Outlook Express from an old PC, and an Excel table which is a copy of my former business phone.
So I have 4 sources with various contacts (redundancy is there), and a question: How to merge and migrate all contacts within the shortest time and maximum quality? (in this question I have to be maximalist: I do not accept 95% data quality, because what should I do with 20-50 garbage contact in my new phone?)
Migration Planning and Project Initiation
After the scope is defined very clearly, I form the topic of effort estimation generously this time: max. 2 days.
Configuration Management and software tools:
Because typing on the PC is 10 times faster than with the tiny buttons of iPhone, I have to find 1 or 2 transient data store, where the synchronization can happen very fast. For the last step iTunes synchronisation tool is the only appropriate way and it works properly.
Plan of workflow
The key element of the workflow is the right choice of the transient data store. Because the export-import possibilities of the different data sources are unknown at the moment, this key momentum will be clarified during the project.
Because of this we found in this tiny project a risk which can affect the delivery time (in this case the time of full contact list on the iPhone) – and as we know the most of the data migration projects (84%) fail because of delayed delivery time.
Transient data store for migration Nr.1
The extraction of the contacts from the Nokia is theoretically very simple: I have no tool, no data cable to download the contact list – so the manual job remains whatever it hurts. The destination: plain paper or an address book directly. I have chosen the direct entry into the Thunderbird Address Book, because nowadays I type faster than write. And with this move the number of data sources will decrease with one.
Surprise Nr 1: There is no direct connection between iTunes and Thunderbird. The possible contact sources of iTunes are Windows Address Book, Google Contacts and Outlook Express Contacts.
So lets export the data from Thunderbird to Windows Address Book. Of course there is no direct connection in this direction, so I have 2 possibilities: CSV export and LDIF (LDAP server).Note that at this time we have 3 transient data sources: Thunderbird Adress Book, CSV (or LDIF) and WAB.
Short summarized: about 30% of the contacts were simply not migrated, I don’t know the reason. After one or two attempts I dropped this possibility and have chosen the remaining option: Google Contacts
Transient data store for migration Nr 2.
The life is not easy, there is no direct connection between Thunderbird and Google Contacts. After a short research I found the Zindus add-on for Thunderbird, where the synchronisation between Thunderbird and Google Contacts supposed to be solved.
And it worked: the data synchronisation between Thunderbird and Google and also between Google Contacts and iPhone through iTunes.
Surprise Nr 2: However as I checked the application contacts in iPhone, I faced a new problem: the sequence of last name and first name.
For explanation: Hungary is one of the rare places of the world (beside Japan I think), where the official sequence of the name is “Last name” “first name“. So I want to see my contacts in the phone that way, however Google Contacts messed this sequence up.

Of course the Google application does not recognise (even if the settings are for Hungary), what is the right sequence. The fast solution is put a comma between “Last name” and “first name“, like Bossányi Tibor. And this is again manual work for many hundred contacts. (I don’t want to tell long stories about the attempts to set the settings to English (UK),exporting the list in CSV to find the fastest solution to put this comma into the names. It does not work.)
Summary
To not to loose too much time with this simple operation (at the first look), I had to think in a project way: finding a clear scope, a simple effort estimation and a measurable action plan to avoid shifting this activity. In this first part I found the most optimal way between the possibilities by attempting more different ways and measure which is the most effective in aspect of time and data quality.
The next (last) part I will write about the data quality rules which have been set up, and the migration from remaining legacy sources.
Sources
Data Migration Project Checklist from datamigrationpro.com
Creation of Data Quality Rules for Data Profiling
I mentioned the 3-dimensional layout of the data profiling in the last post (Data Profiling).

The column level has been depicted quite detailed. Because the description and the practical example the second dimension caused some questions, I devote a deeper analysis of the intra-table (table-intern) level using the open source Data Profiler of Talend.
Normalization
The goal of the data profiling is not only to detect the user-created errors, but you can also determine the logical design defects, which can cause duplications, redundancy in your database. The intra-table (used as table-intern in the last post) profiling level is related with the dependency checks within the table. With this the existence is of the 2nd normal form of the table design is checked, in other word if there any columns in the table, they are dependent from a non-key column.

This table is not fully normalized for the second normal form, because the columns City and Country could be stored in an additional table and they can be saved as foreign keys in the table Address. However it was my decision not to create to many tables in the database, and the creation of the full addressing is also easier and quicker. However it does not fit to the principles of the second normal form. So the column City is dependent from the non-key column Country and it could contain defected entries by the user.
Theoretically the column PostCode is also not normalized for the 2nd normal form, because it contains not unfinite combinations and it is also depending of the column City. In the next example we will check this column by creating the Data Quality Rule in the Talend.
Data Quality Rule for the PostCode
The definition is for this rule is the following: ‘All PostCodes from the City Budapest are beginning with 1 and have the length of 4‘.
For this reason the Where klauzel of the DQ-Rule is: City = ‘Budapest’ and (left(postcode,1) != ‘1′ or length(postcode) != 4)

The quality of the table in aspect of this rule is 99.76%, that means we have a few defected rows, they do not match the criterium.

In the last step let’s check the defected rows, where we can see, there are NULL values for the column PostCode, there are non PostCode-like entries and there are some which begin with a whitespace or end with a dot.

Summary
The creation and execution of this Data Quality Rule took me 10 minutes. However you can see, for a large database you can define hundreds and tousands of these rules which can create some man-month work. The bigger effort is the resolution of the data quality issues – where you, as the leader the data migration project, cannot decide alone, which errors are important and which not. But the detecting of such errors is worth it to sacrifice the effort.
Data Profiling – Remove the spiderweb from the back of the wardrobe
You were probably in the situation when your furniture had to be moved from their old place because of relocation to another city or before flat painting. You move the floor-rooted piece of furniture and you realize that the back is covered with discusting spiderweb and other grime. What will you do: leave it as it is, because it cannot be seen near the wall or you will remove it immadietly?
Next time you will reckon and check the corners, not visible places of the room what should be removed what does not match into the room. That means you execute a dirt profiling from time to time in your flat.
A similar approach must done during the data migration project. Checking the data quality is one of the first steps in the project: with the execution of the data profiling can we build a first impression about the scale of the project. That means this step must be executed before the final effort estimation!
What is the data profiling?
During the data profiling process you will examinate the data in the legacy source to collect statistics and information for building the data quality rules. For the dimensioning of the definition of the data profiling, I have chosen the determination by Informatica: The data profiling has 3 dimensions:
- column level
- table-intern level
- inter-table level
Metadata profiling
The first two levels (column level and table-intern level) can be examined with the metadata profiling. On the column level you can check the
- data types
- domain, range of the values (i.e. post code must be within the interval of 1001-9999 in Hungary)
- pattern (i.e. the phone number has the pattern: +nnWnnWnnnnnn)
- frequency counts (i.e. most of the sells happen on workdays: Tue, Wed, Thu)
- Statistic numbers (min, max, median value, avarage value, etc)
- dependencies
- Redundancy
You can draw information within the table by dependeny checks (this category takes also place in the third dimension: in the inter-table level). You can determinate the dependency between column values by the normalization rules from the logical data model design: i.e. a national code of a phone number is related to the ‘city‘ column.
Finding dependencies between the tables (inter-table level dimension) are based on table model design: i.e. a foreign-key value customer-id in the orders table must appear as primary key in the customer table. After my experience the most the data garbage is coming from the missing referential integrity between the tables which was caused by poor data model design.
Example by an open source tool
I have chosen the Data Profiling Tool by Talend on a database with some tousand records of the ‘Address‘ table. The free downloadable version supports the first two dimension of the data profiling types: column level and table-intern level.
In the picture below you see the example of two columns: Address and AddressID. The meaning of the colors of the column Address.
- Red: number of all records in the table (6575)
- Yellow: NULL values
- Orange: Distinct Count (6180)
- Blue: Uniqe Count (5910)
- Pink: Duplicated Count (270)
- Light Blue: Blank Count (39)

The Primary Key AddressID seems to be OK, because the count of the Unique values are the same with the count of the rows. But what can we do with the column Address? Theoretically it can be also uniqe, however the column contains street and number.

You can see in the picture above the defect rows. There are clearly bad administrated data, instead of the real address we find city name, phone number and duplicated addresses. To eliminate the duplicated rows the connected tables of address must be also checked by the data profiling tool.
In this short example you could see more data profiling types: redundancy for the duplicated rows, range for the recognized phone numbers, frequency counts, etc.
Summary
The data profiling is the anteroom for the creation of the data quality rules. To execute the whole data profiling, you have to check each column, each table and each connections between the tables. If you have the statistics and all information about the defects, then the data migration team, where the right stakeholders from the business side are also member of the team, must decide about the measurements and the invested efforts for the fixing.














english
magyar
