When we see the data that we work with at Melissa Data we generally see a huge blob of a few billion records in a few files. As the program which assimilates the billions of records into a set that is merely 10-20% of the size of the orginal records runs it is pretty difficult to perceive a problem from looking at logs or inspections of the records. Even if you devote the time to a well-distributed sampling of a few million records across the dataset you are probably only peeking into less than .1% of the records there. No warm feeling of confidence is found there, even if we had the staff to study the few million records in that .1%. Add to that the fact that the assimilation application sequence uses terrabytes of disk space and runs over two weeks time each time it generates a product dataset and you'll see that we have a practical need for a way to locate and focus on problems.
We needed an application that provides a means to select and process a much smaller set of records so we have a chance to view the source of problems and correct them.
The Test Case Builder is an application I wrote that allows the generation of a Melissa Data dataset that contains perhaps thirty or forty records instead of a few billion. This is valuable because it requires relatively little time to complete, produces a set of data of a size that can be juggled within a person's brain, and allows for sampling that relieves our fear that we've not seen enough to be confident.
The Test Case Builder is a Java class that is started in a script file at a Linux command line. A command line parameter must be specified that names the configuration file that contains the locations for the database, the output location, and where driver files should be placed -- and where they will be daemon monitored. It provides no UI as it is intended to be run on a remote Linux server with no UI capability beyond text-based menus.
Running the Test Case Builder involves preparing a couple of prerequisites so it has the data available to retrieve the data records. First up, a Melissa Data dataset must be generated -- typically containing around 3 billion records. The second prerequisite requires another application I wrote named the IDIndexer to comb through the huge dataset and produce a database that links the record ID of every record in the huge dataset with the name of the file in which the record is stored and the numerical offset into the file where the record's text can be read. From there a user can drop a file into a "requests" directory that contains a list of the record ID's whose record text is wanted for study. These request files usually contain a list of record ID's that were flagged as interesting during the Test department's work.
The Test Case Builder monitors the requests directory and when it finds a request file it opens it and reads the list of record ID's. For each record ID in the file's list the Test Case Builder queries the IDIndex-produced database for the name of the file it came from and the file offset of the record. It next opens that file and reads the record text and writes it to a file that will contain the input for a tiny version of an assimilation build.
Each of the six phases of the assimilation process produces a set of slightly improved and refined records. At each phase any records that were combined to create a record in the new phase are stored within a field of the new child record so the ancestry can be traced. Each time a record is read by the Test Case Builder the list of its parent records is read and they, too, are recursively searched so every source of data that went into a finished product record will be included in the output file.
When the Test Case Builder is done it has generated three files that contain different generations of records, and that can be opened and viewed in the Build Viewer application or can be fed to a very small dataset build process so the effects of each stage of the build can be examined and understood by the development team.
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment