By Bob Wall, Senior Consultant
I have found that one of the biggest challenges on the data migration, data integration and data warehousing projects I’ve worked on is the requirement for clean data. Typically, one of a plethora of data quality/cleansing tools is deployed. The tool employs sophisticated heuristic, probabilistic, deterministic, phonetic, linguistic and empirical methods and algorithms to perform data quality analysis. For example, for customer data being integrated into a data warehouse, we want the tool to reconcile Easthartford, Hartford East, Hartford and East Hartford to the same physical address. The data quality process is usually run as a batch job, first for the initial load and then on a recurring basis after that.
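To make the Hartford example concrete, here is a minimal Python sketch of one deterministic step (a normalized match key) followed by one fuzzy step (a similarity score). It is only an illustration, not how any particular commercial tool works: the function names `match_key` and `same_place` and the 0.75 threshold are my own assumptions, and whether a bare "Hartford" should really collapse into "East Hartford" is a business rule that a real project would have to decide.

```python
import re
from difflib import SequenceMatcher


def match_key(raw: str) -> str:
    """Lowercase the value, keep only runs of letters, sort the words,
    and join them without spaces (hypothetical helper)."""
    words = re.findall(r"[a-z]+", raw.lower())
    return "".join(sorted(words))


def same_place(a: str, b: str, threshold: float = 0.75) -> bool:
    """Treat two values as the same place when the similarity of their
    match keys reaches the threshold (0.75 is an assumed cutoff)."""
    score = SequenceMatcher(None, match_key(a), match_key(b)).ratio()
    return score >= threshold


if __name__ == "__main__":
    canonical = "East Hartford"
    for value in ["Easthartford", "Hartford East", "Hartford", "New Haven"]:
        print(f"{value!r} matches {canonical!r}: {same_place(value, canonical)}")
```

The match key alone handles the spacing and word-order variants; the fuzzy score is what pulls in the shorter "Hartford", which is also why the threshold deserves careful tuning against real data.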
Today, operational MDM hubs present an even bigger challenge. We have to ensure synchronization across multiple source systems, and all data must be consistently correct. The data quality checks must be applied at various stages within the master data lifecycle, and must support federated sharing across and between systems, databases and applications. In a sense, operational MDM requires real-time data quality.
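The difference from batch cleansing can be sketched in a few lines of Python: the check runs on each record at the moment a source system writes it, rather than in a later sweep. Everything here is hypothetical and simplified for illustration, including the `MasterHub` class, the `check_record` rules and the field names `customer_id` and `city`; a real hub would apply far richer matching and standardization.

```python
from dataclasses import dataclass, field


def check_record(record: dict) -> list:
    """Return a list of data quality problems for one record (assumed rules)."""
    problems = []
    if not record.get("customer_id"):
        problems.append("missing customer_id")
    if not str(record.get("city", "")).strip():
        problems.append("missing city")
    return problems


@dataclass
class MasterHub:
    """Hypothetical in-memory stand-in for an operational MDM hub."""
    records: dict = field(default_factory=dict)
    rejected: list = field(default_factory=list)

    def upsert(self, source: str, record: dict) -> bool:
        """Check the record inline; store a standardized copy only if it passes."""
        problems = check_record(record)
        if problems:
            self.rejected.append((source, record, problems))
            return False
        clean = {**record, "city": record["city"].strip().title()}
        self.records[record["customer_id"]] = clean
        return True


if __name__ == "__main__":
    hub = MasterHub()
    print(hub.upsert("crm", {"customer_id": "C1", "city": " east hartford "}))
    print(hub.upsert("billing", {"customer_id": "", "city": "Hartford"}))
    print(hub.records, len(hub.rejected))
```

Rejected records are kept with the name of the source system so that the problem can be corrected where it originated.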
Digressing for a moment, I have also worked on projects that employed various search engine technologies. On one in particular, a client wanted an automated way to categorize structured and unstructured web data (emails, documents, images, etc.) and to search it for certain conditions. We investigated technology based on what was referred to as the semantic web, which combined advanced semantic and linguistic analysis with classification schemes (using Hyper Text Markup Language (HTML), eXtensible Markup Language (XML), Resource Description Framework (RDF), and Web Ontology Language (OWL)) to render web content machine-readable and make it capable of being searched in an automated fashion.
Software tools that support inline SOA data quality services combined with advanced semantic and linguistic analysis/machine learning capabilities are beginning to evolve. Microsoft’s purchase last year of Zoomix is a testament to the strategic value of these types of products. The convergence of data quality and semantic web technologies may provide operational MDM projects with the ability to automate ongoing classification, matching, and standardization of master data records. I think this convergence is definitely worth keeping an eye on to see if it leads to real-time data quality capabilities embedded in MDM tools.
photo by Blude (via Flickr)
Bob Wall is a senior consultant with Baseline Consulting. He is an information technology specialist with 30 years of experience in all areas of data warehouse administration, data architecture, data resource management, training, and applications systems development, as well as in corporate management.

Agree with most of this post. But I would like to see data quality capabilities as true SOA components rather than have them embedded, technically and commercially, into distinct MDM solutions.
More on this here:
http://liliendahl.wordpress.com/2009/07/07/service-oriented-data-quality/
Posted by: Henrik Liliendahl Sørensen | July 09, 2009 at 08:11 AM
Bob,
Data quality vendors do have real time integration with the major MDM vendors.
I work for Trillium Software and I can tell you that you can integrate Trillium with, say, Oracle CDH or Siperian or Tibco, to name a few, and use them together in a real-time MDM implementation. The solution is often equipped with a load-balancing application server that accepts and manages transactions in a high-volume SOA environment. See http://tinyurl.com/lthy2r
Posted by: Steve Sarsfield | July 09, 2009 at 02:16 PM
Henrik,
If I can digress a moment and share a story: my son was an accomplished woodworker (having made our cherry wood china closet and china closet/hutch, for example). When he took his first object-oriented programming class at the university, he excitedly told me that on the first day the professor had them learn to access a pre-built component to perform a mathematical function, rather than work on programming the function itself. He had a big smile on his face when he said that object-oriented programming was like woodworking and that if you build a drawer you can use it in a desk, a cabinet, or a china closet/hutch. He had grasped the power of reusable components, and to this day he is a proponent of building easy-to-swap, mix-and-match, plug-and-play software components. In your article, you aptly pointed out the power of reusable SOA components for data quality. I agree with your basic tenet that data quality functionality deployed as independent SOA components offers great benefits for reusability, interoperability and composability. Steve has pointed out some existing components that Trillium offers for real-time data quality. It sounds like we have an opportunity to head in that direction.
Posted by: Bob | July 10, 2009 at 05:14 AM