Monday, June 27, 2011

C#/SQL: Search Builder: Defining a Fuzzy Search

At Melissa Data we assimilate the ten to twelve billion records we collect via donation or purchase into just over a billion high-quality records that are current, accurate, and more complete. Each record consists of many fields, and the contents of those fields carry different commercial value to our various customers. Some customers are interested in a record if one of its fields contains a specific value, or a value belonging to a certain set, while others are interested in particular combinations of values across a set of fields.

Typically, customers assemble an SQL query to access the record set of interest in a database, and that query often becomes long and confusing as the customer's intent is fine-tuned. An ordered list of fuzzy constraints applied to the search is easier to write down and understand, and usually results in better matches to the customer's intent. These simple fuzzy statements specify which fields are of interest, as well as concerns such as: the precedence in which field matching should be applied when multiple fields are being considered; the match quality required; the level of match completeness desired; which fields, and how many of a record's fields, must match; whether one date range is preferable to others; and more. The Search Builder is the application Melissa Data sales engineers use to convert customers' intentions into a data search.

The DataGridView provides add, remove, sort, and modify options.

This application was designed and originally coded in C# with WPF and SQLite. I later switched to WinForms when I couldn't find the capabilities in the WPF data grid needed for user stories that arose later in the project.

The checked and unchecked states of the grid of checkboxes shown above in the DataGridView, in both the Rank and Terms tabs, are read as binary strings from the SQLite database (e.g. 10010110000101011 and 01110011011101110 each encode one row's 17 columns).
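To give a feel for that encoding, here is a minimal sketch of decoding one stored row string into checkbox states and re-encoding it (shown in Java for consistency with the later examples in this post; the Search Builder itself is C#, and the class and method names are illustrative only):

// Sketch of the checkbox-state encoding: each grid row is persisted as a string of
// '1'/'0' characters, one character per column (17 columns in the grids above).
public final class RowStateCodec {

    // Decode a stored string such as "10010110000101011" into checkbox states.
    static boolean[] decode(String bits) {
        boolean[] states = new boolean[bits.length()];
        for (int i = 0; i < bits.length(); i++) {
            states[i] = bits.charAt(i) == '1';
        }
        return states;
    }

    // Encode the current checkbox states of one row back into a storable string.
    static String encode(boolean[] states) {
        StringBuilder sb = new StringBuilder(states.length);
        for (boolean checked : states) {
            sb.append(checked ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        boolean[] row = decode("10010110000101011");
        System.out.println("column 0 checked: " + row[0]);     // true
        System.out.println("round trip: " + encode(row));      // original string
    }
}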

The Setup tab, seen below, contains a list of the database files previously opened and a list of the user-specified names for saved fuzzy queries which are stored in an SQLite database.

The Setup Tab

The strings seen in the controls in the Candidates and Tests tabs are loaded from a dump of the table and field names in each customer's contact database. Those names are combined with verbs to construct good fuzzy statements such as "lastname.required.unaltered".
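A minimal sketch of that assembly step, assuming the statements are simply the field name joined to the chosen verbs with dots (any verb beyond the one example above is hypothetical):

// Sketch of building fuzzy statements from dumped field names and UI-selected verbs.
import java.util.List;

public class FuzzyStatementSketch {

    // Join a field name with the selected verbs, e.g. "lastname.required.unaltered".
    static String statement(String fieldName, List<String> verbs) {
        return fieldName + "." + String.join(".", verbs);
    }

    public static void main(String[] args) {
        System.out.println(statement("lastname", List.of("required", "unaltered")));
        System.out.println(statement("phone", List.of("optional")));   // hypothetical verb
    }
}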


The Candidates Tab

Java/SQL: A Test Case Builder

When we look at the data we work with at Melissa Data, we generally see a huge blob of a few billion records in a few files. While the program that assimilates those billions of records into a set merely 10-20% of the size of the original runs, it is pretty difficult to perceive a problem by looking at logs or inspecting records. Even if you devote the time to a well-distributed sampling of a few million records across the dataset, you are peeking into less than 0.1% of the records there. There is no warm feeling of confidence in that, even if we had the staff to study those few million records. Add the fact that the assimilation sequence uses terabytes of disk space and runs for over two weeks each time it generates a product dataset, and you'll see that we have a practical need for a way to locate and focus on problems.

We needed an application that provides a means to select and process a much smaller set of records so we have a chance to see the source of problems and correct them.
The Test Case Builder is an application I wrote that generates a Melissa Data dataset containing perhaps thirty or forty records instead of a few billion. This is valuable because it takes relatively little time to complete, produces a set of data small enough to be juggled in a person's head, and allows sampling that relieves the fear that we haven't seen enough to be confident.

The Test Case Builder is a Java class started from a script at a Linux command line. A command-line parameter names a configuration file containing the location of the database, the output location, and the directory where driver files should be placed -- the directory the application monitors as a daemon. It provides no UI, as it is intended to run on a remote Linux server with no UI capability beyond text-based menus.
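A minimal sketch of that start-up step, assuming a simple key/value configuration file; the property key names here are illustrative, not the actual ones the application uses:

// Sketch of start-up: the single command-line argument names a configuration file
// giving the index database, the output location, and the monitored directory.
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public class TestCaseBuilderConfig {
    final Path indexDatabase;      // location of the IDIndexer-produced database
    final Path outputDirectory;    // where the generated record files are written
    final Path requestsDirectory;  // directory the daemon watches for request files

    TestCaseBuilderConfig(Properties p) {
        indexDatabase = Paths.get(p.getProperty("index.database"));
        outputDirectory = Paths.get(p.getProperty("output.directory"));
        requestsDirectory = Paths.get(p.getProperty("requests.directory"));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: TestCaseBuilderConfig <config-file>");
            System.exit(1);
        }
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(args[0])) {
            props.load(in);
        }
        TestCaseBuilderConfig config = new TestCaseBuilderConfig(props);
        System.out.println("watching " + config.requestsDirectory);
    }
}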

Running the Test Case Builder involves preparing a couple of prerequisites so it has the data available from which to retrieve records. First, a Melissa Data dataset must be generated -- typically containing around 3 billion records. Second, another application I wrote, named the IDIndexer, combs through the huge dataset and produces a database that links the record ID of every record in the dataset to the name of the file in which the record is stored and the numerical offset into that file where the record's text can be read. From there a user can drop a file into a "requests" directory that contains a list of the record IDs whose record text is wanted for study. These request files usually list record IDs that were flagged as interesting during the Test department's work.
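The IDIndexer's internals aren't described here, but a rough sketch of the kind of pass it makes might look like the following, assuming line-delimited records with the record ID in the first pipe-delimited field and an SQLite index database reached through JDBC (all assumptions for illustration):

// Rough sketch of the indexing pass: for every record in every dataset file, store
// (record ID, file name, byte offset) so a record can later be fetched directly.
import java.io.RandomAccessFile;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IdIndexerSketch {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:sqlite:record_index.db")) {
            db.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS record_index (" +
                "record_id TEXT PRIMARY KEY, file_name TEXT, file_offset INTEGER)");
            PreparedStatement insert = db.prepareStatement(
                "INSERT OR REPLACE INTO record_index VALUES (?, ?, ?)");
            for (String fileName : args) {                         // dataset files to index
                try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
                    long offset = raf.getFilePointer();            // offset of the next record
                    String line;
                    while ((line = raf.readLine()) != null) {
                        String recordId = line.split("\\|", 2)[0]; // assumed ID position
                        insert.setString(1, recordId);
                        insert.setString(2, fileName);
                        insert.setLong(3, offset);
                        insert.executeUpdate();
                        offset = raf.getFilePointer();             // start of the next record
                    }
                }
            }
        }
    }
}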
The Test Case Builder monitors the requests directory, and when it finds a request file it opens it and reads the list of record IDs. For each record ID in that list, the Test Case Builder queries the IDIndexer-produced database for the name of the file the record came from and the record's offset within that file. It then opens that file, reads the record text, and writes it to a file that will serve as the input for a tiny version of an assimilation build.
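A minimal sketch of that per-request step, reusing the assumed record_index table from the indexer sketch above; the directory monitoring itself is left out so the lookup-seek-read-write core stands on its own:

// Sketch of processing one request file: look each record ID up in the index,
// seek straight to its byte offset, and append its text to the tiny build input.
import java.io.BufferedWriter;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.List;

public class RequestProcessorSketch {
    public static void main(String[] args) throws Exception {
        Path requestFile = Paths.get(args[0]);   // list of record IDs, one per line
        Path outputFile = Paths.get(args[1]);    // input file for the tiny build
        List<String> recordIds = Files.readAllLines(requestFile, StandardCharsets.UTF_8);

        try (Connection db = DriverManager.getConnection("jdbc:sqlite:record_index.db");
             BufferedWriter out = Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8)) {
            PreparedStatement lookup = db.prepareStatement(
                "SELECT file_name, file_offset FROM record_index WHERE record_id = ?");
            for (String id : recordIds) {
                lookup.setString(1, id.trim());
                try (ResultSet rs = lookup.executeQuery()) {
                    if (!rs.next()) continue;                      // unknown ID: skip it
                    try (RandomAccessFile raf =
                             new RandomAccessFile(rs.getString("file_name"), "r")) {
                        raf.seek(rs.getLong("file_offset"));       // jump to the record
                        out.write(raf.readLine());                 // one record per line
                        out.newLine();
                    }
                }
            }
        }
    }
}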
Each of the six phases of the assimilation process produces a set of slightly improved and refined records. At each phase, any records that were combined to create a record in the new phase are recorded in a field of the new child record so the ancestry can be traced. Each time the Test Case Builder reads a record, it reads the list of that record's parents and recursively fetches them as well, so every source of data that went into a finished product record is included in the output file.
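The ancestry walk reduces to a small recursion. In this sketch the parent-ID lookup is abstracted behind a function, since the actual field layout isn't shown here:

// Sketch of the recursive ancestry walk: collect a record and every ancestor that
// contributed to it. The parent lookup is abstracted; in the real application the
// parent IDs come from a field of the record's text.
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class AncestryWalkSketch {

    // Collects recordId plus, recursively, every parent record it was built from.
    static void collect(String recordId,
                        Function<String, List<String>> parentsOf,
                        Set<String> collected) {
        if (!collected.add(recordId)) {
            return;                          // already visited; shared parents read once
        }
        for (String parentId : parentsOf.apply(recordId)) {
            collect(parentId, parentsOf, collected);
        }
    }

    public static void main(String[] args) {
        // Toy ancestry: C was built from A and B; A was built from A1.
        Map<String, List<String>> parents = Map.of(
            "C", List.of("A", "B"),
            "A", List.of("A1"));
        Function<String, List<String>> parentsOf =
            id -> parents.getOrDefault(id, List.of());
        Set<String> all = new LinkedHashSet<>();
        collect("C", parentsOf, all);
        System.out.println(all);             // [C, A, A1, B]
    }
}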

When the Test Case Builder is done it has generated three files containing different generations of records. These can be opened and viewed in the Build Viewer application, or fed to a very small dataset build so the effects of each stage of the build can be examined and understood by the development team.