By Stephen Putman, Senior Consultant
Data Profiling is the act of extracting descriptive and statistical data about what resides in a database in order to facilitate data quality and design efforts, among other things. Tools for performing profiling range from open-source software (DataCleaner, Talend) to database system tools (Microsoft SQL Server, Oracle) to specific and purpose-built products (vendors Trillium and DataFlux both offer data profiling solutions). All of these tools automate and standardize the data profiling task, and integrate with other areas of the design effort to varying degrees. However, most organizations do not use them, relying instead on SQL queries, run manually. Why does this happen?
The main reason is that data profiling is not seen as a high-priority activity. Therefore, the investment in licenses, hardware resources, and training for a profiling system is seen to be an unnecessary expense. Consequently, data and business analysts are forced to use existing tools such as SQL query clients to perform basic profiling tasks.
Many times, this is due to a basic misunderstanding of the power of a data profiling tool, and the data it analyzes. Profiling data is useful in logical and physical data modeling, functional system requirement production, and test script generation. Profiling gives the overall system scope and context, and ensures that the system produced satisfies the requirements of the business by characterizing the values present in source systems.
Manual SQL data profiling is fraught with risk. It relies on the skill and knowledge of the analysts producing the queries for accuracy and effectiveness, and there is no guarantee of consistency of results or standardization.. Modern profiling tools have optimized data gathering technology that is faster than manual SQL and scenario repositories that can be shared between analysts and projects to promote reuse.
Implementing a robust data profiling system is an essential part of an effective data management environment. The cost of putting such a system into place is more than offset by the benefits of deep analysis of data sources in both the transactional and analytic spaces. The essential activity is to discover just how powerful such a system can be in your organization.
Photo by mandolux via Flickr (Creative Commons license)
Stephen
Good post. I find it difficult to convince clients mid-project that data profiling is worth their time (and my expense). I can't imagine too many of them getting data profiling religion and buying a software license.
I think that you're completely right, but lamentably many organizations just don't get it.
Perhaps that will change soon.
Posted by: Phil Simon | April 22, 2010 at 07:15 AM
Stephen,
I definitely agree that data profiling is an essential activity. It is one of the commonly overlooked aspects of enterprise information initiatives -- sometimes even by data quality projects.
Data profiling tools can definitely help with the required analysis, and there is significant diversity in the vendor offerings, including some excellent open source alternatives that can alleviate the associated costs.
However, I would argue that sometimes non-tool options such as SQL queries can occasionally be a legitimate approach because although a data profiling tool can definitely help you automate repeatable processes for some of the grunt work needed to begin your analysis, it is important to remember that the analysis itself can not be automated.
Subject matter expertise and business knowledge of the data's meaning to organization's tactical and strategic goals are integral to understanding the reports produced by data profiling, whether they were created by ad hoc SQL queries or a data profiling tool.
Therefore, I agree that implementing a robust data profiling system is an essential part of an effective data management environment. However, this can be one of the many areas where organizations believe that simply purchasing a tool satisfies the requirement -- which I know is definitely NOT the point you are making.
Best Regards,
Jim
Posted by: Jim Harris | April 22, 2010 at 09:30 AM
@philsimon - At my last client, I was somewhat successful in pointing out that they were actually doing profiling in other contexts. When I pointed out the similarity, the reply was, "Oh, that's all you're talking about> That's easy!"
Posted by: Steve Putman | April 22, 2010 at 09:49 AM
@jim - Point well taken. It's easy to sound exclusionary in 300 or so words, and SQL is great for quick-and-dirty analysis. I'm more interested in the ability to build a reusable library in a tool, which is more than a bunch of text files on a file share ;-)
Posted by: Steve Putman | April 22, 2010 at 10:12 AM
Stephen,
I agree with you on both counts:
1. It is better to use a tool designed to make data profiling easier.
2. It is difficult to convince clients of this need.
I have found this to be the case, especially when there is no history or culture of data content profiling in the organisation.
The worst case are the classic lines from IT "The business is responsible for the quality of the data", and "love the data".
But then IT refuse to provide standard data quality metrics on what is actually in the data - Standard metrics that a data profiling tool could readily provide.
Sorry for the rant.... it's a pet hate of mine.
Yes - Love the data.... but first and foremost, one must "know the data".
Rgds Ken
Posted by: Ken O'Connor | April 22, 2010 at 10:21 AM
I have the best luck positioning data profiling within the context of a specific initiative and/or use case, preferably with a couple of case studies showing how other projects have benefitted. Ideally, they've already attempted to do manual data analysis -- then they're ready to hear the message of using tooling for the stat-gathering and humans to focus on the results.
Subject matter experts (SMEs) are critical. Tooling is most effective when leveraged to drive focused conversations with the SME, preferably on only one data domain at a time.
Posted by: Beth Breidenbach | April 26, 2010 at 11:13 AM
Great blog posting and thanks for mentioning DataCleaner :-) You are absolutely right - while it is possible to go a long way with custom SQL scripts I find that using a tool with a wide set of predefined measures you will almost always find something that you would otherwise never think of asserting through an handwritten query.
Posted by: Kasper Sørensen | August 20, 2010 at 04:34 AM