Wednesday, March 27, 2013

Java/SQL: List Dropped Records

At Melissa Data we purchase many billions of personal contact records from multiple sources representing different facets of the world of commerce.  We combine them, eliminating duplication and obsolete information, much like the goal of database normalization, to distill them into the richest, fullest set of records anywhere -- baked fresh in three week cycles.  Companies who were once our competitors now submit their records to us so they can be both, updated to contain the most current information, and appended to fill in their blanks.

The processing of these records is done in a sequence of phases, starting with a simple conversion from the source formats to our standard. Through our software build process each address acts like a magnet that attracts all of the data associated with it from the batch of original records.  We start with a set of as many as twenty billion records and end up with around one billion product records.

We sometimes encounter a large, but imperceptible loss of records during the transition from one build phase to another after we've made changes to the code that controls the build process. It is not uncommon to find we've lost around 400 million records between steps -- and not realize it unless something unexpected pops into view because that's less than 2% of the data involved. 

The application I wrote, known as "List Dropped Records" tracks every record, whether it has been reduced to an archive or remains active, in every phase. It opens the hundreds of thousands of files, reads the record lists therein, and compares the phases in order to report what has disappeared in the form of a database table that lists the record ID's dropped in each phase.  With that we can quickly learn whether or not we have loss, and how the amount of loss changes from one build to another.

No comments:

Post a Comment