Tuesday, January 8, 2013

Java/SQL: Record Indexer

At Melissa Data we purchase large mailing lists from several different facets of the business world and distill them into a form that customers from many business perspectives can easily query to gain valuable information. We start with 15-20 billion records from our suppliers and combine them, eliminating redundancy and outdated information, until we end up with a much smaller and richer set of customer contact records.

The size of that data set makes it very difficult to see the effect of every rule we code into the software distillation process that converts purchased data into our product. Each small error or misunderstood user story in our code can obscure or destroy a lot of value in the final product. Several times we've seen more than 100 million records lost -- unrealized -- because of a minor oversight in the code at some distillation stage. Because the output at each phase is so large, the time necessary to properly review it would deadlock our development efforts, so we need a way to quickly locate the text of suspicious records, as well as that of their ancestors, at every distillation stage.

At the origin of the record distillation process are the thousands of purchased raw text files, each of which, depending on the source, may contain anywhere from hundreds of thousands to millions of records. The data passes through four distinct stages -- its raw state, two distillation steps, and the final product -- and all of the data from the original files remains present at each level. To review and test the effects of our software most efficiently at each development stage, we need access to the records' text at all four phases.

The solution we're using for this is a SQLite database named IndexID.db. It contains two simple tables: the first, "recordsIndex", holds three fields -- a record ID, a file ID, and a file offset. The second, "filesIndex", holds a file ID and the fully-qualified path to the file. When a developer -- or better yet, a program -- wants to quickly see the contents of a record, the record's ID is used to look up the file and offset and read the record's text.
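To make that concrete, here is a minimal sketch in Java of what such a lookup might look like, assuming the sqlite-jdbc driver is on the classpath and that records are stored one per line; the class name and the column names (recordId, fileId, offset, path) are my own illustrations, not necessarily the production schema.

    import java.io.RandomAccessFile;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class RecordLookup {
        // Look up a record's file and offset in IndexID.db, then read its text.
        public static String readRecord(String recordId) throws Exception {
            try (Connection db = DriverManager.getConnection("jdbc:sqlite:IndexID.db");
                 PreparedStatement ps = db.prepareStatement(
                     "SELECT f.path, r.offset FROM recordsIndex r " +
                     "JOIN filesIndex f ON f.fileId = r.fileId WHERE r.recordId = ?")) {
                ps.setString(1, recordId);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) return null;               // unknown record ID
                    String path = rs.getString(1);
                    long offset = rs.getLong(2);
                    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                        raf.seek(offset);                      // jump straight to the record
                        return raf.readLine();                 // assumes one record per line
                    }
                }
            }
        }
    }

Because the offset is stored alongside the file ID, the lookup never scans a file; it seeks directly to the record of interest.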

After four paragraphs of background we finally get to what I think is the interesting part -- the application that creates and loads the IndexID database, the IdIndexer. Every record formatted for our system contains a field that holds a list of the record's ancestor record ID's -- the ID's of the records whose data found its way into that record. As each stage of the distillation process finishes, every bit of the original data can still be found within one of the records, but always in one whose ID differs from the previous or next phase. The results of all build phases are saved to files in their own directory tree. The IdIndexer opens every file in that tree and records each file's path and name, along with every record's ID, its list of ancestor record ID's, and its offset within the file.
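The indexing pass can be pictured roughly like the sketch below -- not the production IdIndexer, just the shape of it. It recurses through the build tree, registers each file in filesIndex, and writes each record's ID and byte offset into recordsIndex. The helper extractRecordId() is hypothetical, as are the assumptions that records sit one per line and that fileId is an auto-generated key.

    import java.io.File;
    import java.io.RandomAccessFile;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class IdIndexer {
        // Walk the build directory tree and index every record in every file.
        public static void indexTree(Connection db, File dir) throws Exception {
            File[] entries = dir.listFiles();
            if (entries == null) return;
            for (File f : entries) {
                if (f.isDirectory()) { indexTree(db, f); continue; }
                long fileId = insertFile(db, f.getAbsolutePath());
                try (RandomAccessFile raf = new RandomAccessFile(f, "r");
                     PreparedStatement ps = db.prepareStatement(
                         "INSERT INTO recordsIndex (recordId, fileId, offset) VALUES (?, ?, ?)")) {
                    long offset = raf.getFilePointer();        // byte position of the next record
                    String line;
                    while ((line = raf.readLine()) != null) {
                        ps.setString(1, extractRecordId(line));
                        ps.setLong(2, fileId);
                        ps.setLong(3, offset);
                        ps.addBatch();
                        offset = raf.getFilePointer();
                    }
                    ps.executeBatch();
                }
            }
        }

        // Register a file's path and return its generated fileId.
        private static long insertFile(Connection db, String path) throws SQLException {
            try (PreparedStatement ps = db.prepareStatement(
                     "INSERT INTO filesIndex (path) VALUES (?)",
                     Statement.RETURN_GENERATED_KEYS)) {
                ps.setString(1, path);
                ps.executeUpdate();
                try (ResultSet keys = ps.getGeneratedKeys()) {
                    keys.next();
                    return keys.getLong(1);
                }
            }
        }

        // Hypothetical parser: assumes the record ID is the first tab-separated field.
        private static String extractRecordId(String line) {
            return line.split("\t", 2)[0];
        }
    }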

The amount of disk space a database such as IndexID.db consumes is proportional to the number of characters stored in its field values. Because IndexID.db holds so many records, the characters needed to spell out each record ID add up quickly. Our record ID syntax uses 40 alphanumeric characters divided into four groups by three dashes. The database storage needed for the 40-character ID's alone is 40 times the number of records -- which in our case is around 25 billion -- so we are talking about roughly one trillion bytes. To minimize the storage for the record ID's I took two approaches, the first of which was to convert the last quarter of the ID from a ten-digit string of base-10 numerals to its base-36 equivalent, which typically shrinks it from ten digits to around three.
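The conversion itself is nearly a one-liner in Java; the segment value below is made up purely to show the shrinkage and the lossless round trip.

    public class SegmentBase36 {
        public static void main(String[] args) {
            long tail = Long.parseLong("0000010469");       // ten base-10 digits from the ID's last segment
            String compact = Long.toString(tail, 36);       // "82t" -- three characters instead of ten
            long restored = Long.parseLong(compact, 36);    // lossless round trip back to 10469
            System.out.println(compact + " -> " + restored);
        }
    }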

As for the other three ID segments, the first contains only four characters, and they spell one of a very small number of strings. The second and third segments each contain around ten alphanumeric characters -- which happily increment from zero up to some alphanumeric value, and therefore contain lots of zeros at the front end. It quickly became obvious that there was a lot of duplication in all three segments. What I did was map each segment's value to a base-36 number that was incremented each time a new segment value was encountered. So, for instance, the first segment was converted to a numeral 1 in place of the original "ABCD". In the second and third segments a string like "000001R0469" was converted to a base-36 number like 1140, because "000001R0469" was the 48096th unique string encountered, and 48096 translates to 1140 in base 36. With those translations to the four segments we now store the record ID's for the 25 billion records mentioned earlier in approximately 8 characters instead of the original 40, which reduces the disk space used from around one trillion bytes to closer to 200 billion.
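Here is a sketch of that per-segment dictionary, with illustrative names and values rather than the real code: each distinct segment string gets the next counter value, stored as a base-36 string.

    import java.util.HashMap;
    import java.util.Map;

    public class SegmentDictionary {
        private final Map<String, String> codes = new HashMap<String, String>();
        private long nextValue = 1;                         // first distinct value maps to "1"

        public String encode(String segment) {
            String code = codes.get(segment);
            if (code == null) {
                code = Long.toString(nextValue++, 36);      // new value gets the next base-36 counter
                codes.put(segment, code);
            }
            return code;
        }

        public static void main(String[] args) {
            SegmentDictionary dict = new SegmentDictionary();
            System.out.println(dict.encode("ABCD"));        // "1" -- first segment value seen
            System.out.println(dict.encode("000001R0469")); // "2" -- second distinct value
            System.out.println(dict.encode("ABCD"));        // "1" again -- repeats reuse their code
        }
    }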