With millions of fields of data to read within a transactional enterprise, the accuracy of that data is vital. Validation procedures and statistical analysis can help your recognition engines process that data more accurately.

Any method available to a user that can inform a recognition system about the kind of documents to which it will be exposed in advance of the classification process will produce a bias toward improved recognition performance. Today's recognition engines are used to classify many types of data, including machine print, handprint, barcodes, cursive writing, check boxes, fill-in ovals and "pay this amount" fields on checks, both numeric and legal. For purposes of discussion, although OCR and ICR data dominate the examples in this article, most of the referred-to techniques can be used to improve recognition of the remaining data types.

Applying the Techniques

When recognition software is used to process structured forms (in which each field position is completely predefined), many of the following techniques can be applied very early in the process, often by the recognition engine itself. The same techniques can be used when processing semi-structured or unstructured forms, but they may have to be used on a post-recognition basis, i.e., after the field data has been located and extracted from the form.

Context Analysis

Context analysis is the most elementary and popular way of improving document recognition accuracy. It involves programming grammatical and lexical rules, edit masks and dictionaries into the forms and document processing systems prior to recognition in order to make the results match the data format of the form being recognized. For example, using a menu of different options, an end user may be given a choice of applying various grammatical rules, such as "I before E except after C," or "look for the letter U after the letter Q," in order to improve recognition of words in an open field on a survey form.

Another device that improves accuracy is the edit mask, which is applied to a field to inform the ICR engine of the field's alphanumeric syntax. For example, if a six-character product code is always composed of three initial alphabetic characters followed by two numbers and a final alphabetic character, then the edit mask "AAANNA" may be applied to the field. This prevents the letter "O" from being confused with the number zero, the number one from being mistaken for the letter "l", and the letter "S" from being confused with the number five, to name some of the most common substitution errors. Collections of different edit masks, such as time fields, date fields, social security numbers and product codes, can be created by the user and stored in memory for later use when setting up a form for recognition.
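As a rough illustration, an edit mask check might look like the following Python sketch. The function name and the two confusion tables are assumptions made for this example, not part of any particular ICR product; only the "A" (alphabetic) and "N" (numeric) mask symbols come from the text above.

```python
# Hypothetical confusion tables: which look-alike character to substitute
# when a digit turns up in a letter position, or a letter in a digit position.
TO_LETTER = {"0": "O", "1": "L", "5": "S"}
TO_DIGIT = {"O": "0", "L": "1", "I": "1", "S": "5"}


def apply_edit_mask(text, mask):
    """Return text coerced to the mask, or None if it cannot conform."""
    if len(text) != len(mask):
        return None
    out = []
    for ch, kind in zip(text.upper(), mask):
        if kind == "A":  # alphabetic position
            ch = TO_LETTER.get(ch, ch)
            if not ch.isalpha():
                return None
        elif kind == "N":  # numeric position
            ch = TO_DIGIT.get(ch, ch)
            if not ch.isdigit():
                return None
        else:
            raise ValueError(f"unknown mask symbol: {kind!r}")
        out.append(ch)
    return "".join(out)


print(apply_edit_mask("AB012C", "AAANNA"))  # zero in a letter slot becomes "O"
```

A result of None means the field cannot be made to fit the mask and should be flagged for review.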

By applying a user-defined, application-specific dictionary of words to a given field, accuracy will be improved. Multiple dictionaries can be created and stored for association with different fields on a form. For example, a dictionary of patient names might be applied against the "name" field on a medical claim submitted to an insurance company. A dictionary of medical procedures might be applied against the "procedure" field, and so forth. However, an ordinary English dictionary, often used to boost accuracy in generic text processing, is virtually useless when applied to forms processing. A form is always specific to something — a department, product or procedure — and so customized dictionaries specifically created for particular fields on a given form always work best.

Data Validation Routines

Data validation routines are used to ensure that recognition results are consistent with the universe of data associated with a specific application. Validation routines can be used to spot errors that are created by humans as well as by ICR. Simply put, validation routines are algorithms that examine ICR results for reasonableness against predetermined standards. There are three basic types of validation routines: (1) look-up tables, (2) data/range checks and (3) relationship validation. Most systems use some validation routines to help solve for ICR errors, although not every system uses all three types. Unlike generic text recognition applications that utilize standard English language spell checkers, the content of a forms processing application always contains words and terms which are idiosyncratic to the industry of the form type that is being processed. Consequently, forms processing applications often require more sophisticated validation routines than full text processing applications.

Lookup Tables

Lookup tables are used extensively to improve accuracy. In this case, one or more fields are validated against a database. If there is an exact match, the odds are good that the data is correct since it proved to be congruent with a set of known quantities. Integrated spelling checkers are a common table lookup used for full text applications. Good examples of table lookups would be lists of product codes in mail order entry applications or a patient identification number on a medical claim form. Interestingly enough, there are cases where for some fields — for example, addresses — not all of the characters need to be correct. If 80% of the characters match, the system can correct the remaining 20% and accept the database record as valid. This approach eliminates the manual correction of that 20%.
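The partial-match idea can be sketched as follows. This is only an approximation under stated assumptions: it uses the similarity ratio from Python's standard difflib module as a stand-in for "80% of the characters match", and the function name and the rule of rejecting ambiguous matches are choices made for this example.

```python
from difflib import SequenceMatcher


def lookup_with_correction(recognized, table, min_ratio=0.8):
    """Return the table entry to accept, or None to flag for manual review."""
    if recognized in table:
        return recognized  # exact match
    scored = sorted(
        ((SequenceMatcher(None, recognized, entry).ratio(), entry) for entry in table),
        reverse=True,
    )
    if not scored or scored[0][0] < min_ratio:
        return None  # nothing in the table is close enough
    if len(scored) > 1 and scored[1][0] >= min_ratio:
        return None  # two entries are both close: too risky to auto-correct
    return scored[0][1]


addresses = ["123 MAIN ST", "77 OAK AVE", "9 ELM RD"]
print(lookup_with_correction("123 MA1N ST", addresses))
```

Refusing to correct when two table entries both clear the threshold keeps the routine from silently choosing between near-identical records.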

When it comes to addresses, consider the usefulness of table-based validation in barcoding of outgoing mail in a large insurance company. It is not unusual for a large insurance company to send out over 200,000 letters to its customers every day. If the insurance company chooses to barcode the letters instead of leaving that task to the U.S. Postal Service, then it will receive a reduction in postage, amounting to at least five cents per letter on a First-Class envelope. This process is achieved by scanning each letter, intelligently recognizing each machine-printed address, then converting that information into a standard barcode format and spraying the barcode in the same location on each First-Class envelope that is mailed. Table-based validation is used extensively in this operation. The procedure employs the same database that was used to generate the printed addresses to begin with.

Some argue that table-based validation is only useful when the database is relatively small (and the client population of an insurance company is small compared to the entire US population), but the USPS would disagree. The USPS uses the ICR-based, Remote Computer Reader (RCR) system to handle rejected characters from its letter-sorting equipment. Since these characters tend to be problem characters, RCR typically only receives the worst possible images for recognition. RCR uses the massive USPS database containing the name and address for every person and company in the country. The ICR results are compared against this huge database to eliminate millions of characters for manual review. While large databases like RCR are expensive, their usefulness (in terms of labor savings, time savings and increased productivity) can easily outweigh the costs.

Data/Range Checks

Nearly every system includes data/range checks. In these situations, the field or character information is compared with a model to see that it conforms to a specific type, such as a particular alphanumeric sequence or a date (for example, the date of purchase on a bill of sale used to support a title guarantee application). A validation routine might also check to see that a number is within a specific range or contains a certain number of digits. Specifying the correct number of digits also aids character segmentation in ICR applications, particularly in handprint character recognition.
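A minimal Python sketch of such checks is shown below. The function names, the default date format and the parameter names are assumptions chosen for illustration.

```python
from datetime import datetime


def check_date(text, fmt="%m/%d/%Y"):
    """True if text parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


def check_number(text, digits=None, low=None, high=None):
    """True if text is all digits, has the expected length, and is in range."""
    if not (text.isascii() and text.isdigit()):
        return False
    if digits is not None and len(text) != digits:
        return False
    value = int(text)
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


print(check_date("02/30/2001"))            # February 30 does not exist
print(check_number("1O5", low=1, high=999))  # letter "O" read in place of zero
```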

Relationship Validation

Relationship validation is probably the least commonly used validation type, but the one that, in all likelihood, yields the greatest improvements in recognition performance, especially when combined with a redesign of the form that clarifies the data being validated. A simple, effective relationship test is to see if a column of numbers yields the correct sum when their recognition results are totaled and compared to the recognition results of the total. If there is a match, then the probability is extremely high that all the numbers are recognized correctly. On forms that report items and their values in property insurance — for example, a parking lot full of cars or a list of covered inventory — the item/price/total relationship is a validation routine that can yield spectacular results.
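The column-total test could be sketched like this in Python. The function name and the sample amounts are illustrative assumptions; decimal arithmetic is used so that currency values compare exactly.

```python
from decimal import Decimal, InvalidOperation


def column_matches_total(item_texts, total_text):
    """True if the recognized line items add up to the recognized total."""
    try:
        items = [Decimal(t) for t in item_texts]
        total = Decimal(total_text)
    except InvalidOperation:
        return False  # a field is not even a number, so flag it
    return sum(items) == total


# Hypothetical recognition results for three insured items and their total.
print(column_matches_total(["12500.00", "8300.00", "4150.50"], "24950.50"))
```

When the check fails, the routine only says that at least one of the fields is wrong; deciding which one still requires alternate choices or manual review.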

Surprisingly, certain types of relationship validation routines are used more often in full text ICR systems than in forms processing. In fact, they cross over into the realm of context analysis. Full text vendors know that specific letter combinations occur more frequently than others do, so they use them to improve raw ICR accuracy in order to facilitate word recognition. "U", for example, will almost always follow the letter "Q". The validation routines will rate "QVICK" as inferior to "QUICK" since the relationship for "Q" is stronger with the "U" than the "V". Along these lines, ICR and full text engines alike use letter combinations called trigrams to improve recognition accuracy. For instance, the trigram "THQ" is found in only one word in the English language: earthquake.
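One simple way to picture letter-combination scoring is to count letter pairs in a reference word list and prefer the candidate whose pairs are more common. The sketch below is a toy version under that assumption: the five-word sample list and the function names are invented for the example, and a real engine would draw its counts from a large corpus.

```python
from collections import Counter


def bigram_counts(words):
    """Count every adjacent letter pair across a list of reference words."""
    counts = Counter()
    for w in words:
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    return counts


def score(word, counts):
    """Sum the reference counts of the word's letter pairs (unseen pairs add 0)."""
    return sum(counts[word[i:i + 2]] for i in range(len(word) - 1))


def best_candidate(candidates, counts):
    return max(candidates, key=lambda w: score(w, counts))


# Tiny illustrative word list, not real language statistics.
counts = bigram_counts(["QUICK", "QUIET", "QUEEN", "QUOTE", "VICAR"])
print(best_candidate(["QVICK", "QUICK"], counts))
```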

Check Sum Digits

Check sum digits can add considerable force to ensuring accuracy in applications where strings of numeric data are involved. Take, for example, VISA and MasterCard validation in credit card form processing applications. Credit card numbers are critical since each error can result in delays in processing a customer order. Most systems just capture the characters without any sort of validation other than limiting the field to digits. What appears to be a random sequence of numbers is actually a simple algorithm that can eliminate most ICR errors. The Modulus 10 algorithm is a check sum routine that is very easy to add to applications, such as order processing.

Modern credit cards have 16 digits. A few older credit cards still have only 13 digits. The first four digits correspond to the issuing bank. The next two digits specify the company program the owner is enrolled in (such as gold cards). VISA and MasterCard process enough cards that this is useful, but most ICR applications can make little use of that information.

The last digit of the credit card is the check digit. The rest of the numbers are processed through the algorithm and must result in the correct check digit. If the sequence matches, the credit card is valid. Otherwise, the card is phony — or the ICR engine made a mistake.

The algorithm works like this. Every digit except the check digit is multiplied by two or one, alternating, starting with two at the rightmost digit and moving left.

Remember, the last eight in the card number is not included in the calculation. Notice that in some cases the result is two digits (16 and 18 in this example). Next, add up all the digits in the answer (18 counts as 1+8=9, not 18). This total is 62. The check digit should equal the next highest multiple of 10 minus the total, for example, 70 - 62 = 8. Since eight is the last digit on the credit card, this sequence is a valid credit card number.
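The steps just described can be sketched in Python as follows. The function names are assumptions for this example, and the sample number is a widely used 16-digit test number rather than one taken from this article. The sketch also assumes that a total already ending in zero gives a check digit of zero.

```python
def mod10_check_digit(payload):
    """Compute the Modulus 10 check digit for all digits except the last."""
    total = 0
    # Walk right to left, multiplying by 2, 1, 2, 1, ...
    for i, ch in enumerate(reversed(payload)):
        product = int(ch) * (2 if i % 2 == 0 else 1)
        total += product // 10 + product % 10  # 18 counts as 1 + 8 = 9
    # Distance from the total up to the next multiple of 10.
    return (10 - total % 10) % 10


def passes_mod10(number):
    """True if the final digit matches the check digit of the rest."""
    if not number.isdigit() or len(number) < 2:
        return False
    return mod10_check_digit(number[:-1]) == int(number[-1])


print(passes_mod10("4111111111111111"))  # common test number
print(passes_mod10("4111111111111112"))  # last digit altered
```

With the article's total of 62, the same formula gives (10 - 62 % 10) % 10 = 8, which agrees with the worked figure above.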

One way to use this information is to take the best guess or highest confidence characters in each position and test the second best guess characters until the sequence passes. Since this is an integer-based algorithm, many calculations can be performed in parallel very quickly. Often, powerful validation routines, such as this one, can be easily implemented and yield tremendous benefits for overall system performance.
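A hedged sketch of that search is shown below. It assumes the engine reports, for each character position, a list of candidate digits ordered best first. The function names, the limit on how many positions may be swapped, and the rule of accepting only a single unambiguous repair are all choices made for this example.

```python
from itertools import product


def passes_mod10(number):
    """Modulus 10 test: the last digit must match the check digit of the rest."""
    total = 0
    for i, ch in enumerate(reversed(number[:-1])):
        p = int(ch) * (2 if i % 2 == 0 else 1)
        total += p // 10 + p % 10
    return (10 - total % 10) % 10 == int(number[-1])


def repair_with_alternates(choices, max_swaps=1):
    """choices: one list of candidate digits per position, best guess first."""
    best = "".join(opts[0] for opts in choices)
    if passes_mod10(best):
        return best
    hits = set()
    for combo in product(*choices):
        swaps = sum(1 for c, opts in zip(combo, choices) if c != opts[0])
        if 0 < swaps <= max_swaps and passes_mod10("".join(combo)):
            hits.add("".join(combo))
    # Accept only when exactly one repaired sequence passes the check.
    return hits.pop() if len(hits) == 1 else None


# First position read as "9" with "4" as the second-best guess.
choices = [["9", "4"]] + [["1"]] * 15
print(repair_with_alternates(choices))
```

If two different repairs both pass the check, the function returns None so that the field still goes to manual review.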

Statistical Analysis

The use of statistics in conjunction with probability theory is one of the more sophisticated means of improving recognition accuracy. Statistics can be used at four different levels to improve recognition accuracy: (1) primary recognition engines, (2) modern recognition engines, (3) document capture subsystems and (4) document capture applications.

The difference between a primary recognition engine and a modern recognition engine is that a modern recognition engine often combines multiple primary recognition engines, each of which recognizes the selected data from one algorithmic perspective. A modern recognition engine then compares and combines the results from each of its primary engines to improve recognition accuracy. Modern recognition engines are often part of a more comprehensive document capture subsystem that assigns captured data to specific document fields.

1. Primary Recognition Engines

Once a primary recognition engine isolates the image for a specific character (or in specialized cases, entire words), the engine uses the entire isolated image, or extracted features (such as location of lines, curves, their intersections and positions), to decide which is the most likely character to report. Recognition engine manufacturers use large test suites (most proprietary, but some public domain) with known character results to tune both the classification and the confidence level of a recognized character. Often there are other, less likely, values for a particular character, which are then reported as alternate choices with corresponding (lower) confidences.

2. Modern Recognition Engines

Modern recognition engines have adopted the idea of using test suites to measure and evaluate the accuracy of each primary engine. The measurements are then used to tune the algorithm that chooses a specific character among different classification results from primary engines and/or is used to adjust the confidence levels of those alternative character selections.

When processing structured forms where the data field locations are predefined, many engines also use lookup tables to directly flag invalid field data. A few engines even attempt to auto-correct fields by choosing between lower confidence alternate character classifications.

3. Document Capture Subsystems

Document capture subsystems use the statistical likelihood of character combinations to promote "less likely" alternate character results to the status of first choices. The combinatorial likelihoods can be for a language as a whole (such as the likelihood of "zth" or "nth" across words in the English language, by frequency of the words occurring in various text sources). The character combinations can also be field-specific, selected from a lookup table or database of acceptable field entries.

4. Document Capture Applications

Many powerful techniques for improving accuracy can only occur at the document capture application level. They use knowledge of the specific document field semantics to guide operations on the recognized field data to produce significantly more accurate final answers. The procedures listed below represent some of the most effective techniques for applying statistical analyses at the document capture level.

>> Trading acceptance for accuracy

One of the most powerful ways to improve recognition performance is to deliberately trade acceptance for accuracy. The traditional approach is to manually verify only the low confidence characters in each field, as these characters tend to be the ones with the most errors. Each modern recognition engine supplier provides guidelines on how to set the "low confidence" threshold to achieve varying levels of character recognition accuracy. When dealing with applications, however, the appropriate level for each field may vary, due to attributes such as print type, dot shading, font size or characters overwritten on form background features. A statistical analysis of the same test suite run repeatedly at differing thresholds can be used to set the best value for each field.
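As an illustration, the threshold analysis might be sketched as follows. It assumes a truth-checked test suite that yields, for each character in a field, the engine's confidence and whether the result was correct. The function name, the 0 to 100 confidence scale and the sample data are hypothetical.

```python
def pick_threshold(samples, target_accuracy):
    """Find the lowest confidence threshold that meets the target accuracy.

    samples: (confidence, was_correct) pairs from a truth-checked test suite.
    Returns (threshold, acceptance_rate, accuracy), or None if no threshold works.
    """
    for threshold in sorted({conf for conf, _ in samples}):
        accepted = [ok for conf, ok in samples if conf >= threshold]
        accuracy = sum(accepted) / len(accepted)
        if accuracy >= target_accuracy:
            return threshold, len(accepted) / len(samples), accuracy
    return None


# Hypothetical results for one field.
samples = [(95, True), (90, True), (85, True), (80, False), (70, True), (60, False)]
print(pick_threshold(samples, 0.99))
```

Trying thresholds from lowest to highest means the first one that meets the target also accepts the most characters, which is the trade-off described above. Running the same analysis per field gives each field its own threshold.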

>> Applying the cost of errors to trading acceptance for accuracy

In some cases, it costs more to correct an error than it does to ignore it. Moreover, for any specific field on a specific document, downstream liabilities can result. Usually there is a different liability for different types of errors in the same field. For example, in a street address "123 Main St.", a mis-recognition of "Main" to "Maim" has very little consequence, so there may be no point to expending the labor to correct the error. But a substitution error of "723" instead of "123" could delay mail delivery or worse, misdirect it to a different destination entirely. In this instance, then, a statistical analysis of errors might very well lead to an application-level decision to manually examine and correct low-confidence building numbers but to ignore verification of low-confidence trailing characters in street names. Other ways of applying the acceptance/accuracy trade-off include the following:

§ Tighter range checks: Some fields may have a valid range, such as zero to $999,999.99. However, a statistical analysis of errors may show that very low values, such as 0.01 to 0.09, are almost always errors. Even if rare high values are not frequently in error, the legal liability that results from ignoring the error may be much higher. Therefore, it is worthwhile to always flag these values, even though a valid result is occasionally re-verified.

§ Exclusion of some valid items from lookup tables: This technique is very similar in principle to tightening range checks. It is appropriate where the lookup table has entries that are very similar to one another. For example, if a local U.S. company processes invoices, the overwhelming majority of its invoices might very well come from suppliers in the same or nearby states. A lookup table of state codes should, therefore, be limited to those states. Doing this may cause a few long-distance suppliers' state codes to be falsely flagged; however, many others that contain recognition errors that cause a match to a different state code (e.g., MA vs. MD) will be properly flagged.

§ Use of "fudge factors" in mathematical cross-checks: The fudge factor is an amount by which an equation can mis-match and still be deemed acceptable. The IRS commonly uses this to ignore errors so small that the cost of resolving the mistake far exceeds the cost of writing it off.
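The last item in the list above, the fudge factor, can be sketched as a tolerance on an invoice cross-check. The function name, the field names and the five-cent default are assumptions for this example.

```python
from decimal import Decimal


def cross_check(unit_price, units, line_total, tolerance=Decimal("0.05")):
    """True if unit_price * units equals line_total within the tolerance."""
    difference = abs(Decimal(unit_price) * units - Decimal(line_total))
    return difference <= tolerance


print(cross_check("19.99", 3, "59.95"))  # two cents off: accepted
print(cross_check("19.99", 3, "69.97"))  # ten dollars off: flagged
```

The tolerance would normally be set from a cost analysis, so that only mismatches worth an operator's time are flagged.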

>> Using alternate character choices effectively

In many cases, a field or set of fields has a limited set of possible values. For example, a "Physician's Last Name" field would be limited to the set of physician's names already on file. A mathematical formula can be used to define the possible values for an invoice line item's "Unit Price", "Units Purchased" and "Line Item Total" fields.

Traditionally, if a field or field-set doesn't match any of its possible values, it is flagged for downstream manual correction. To reduce operator labor, a little-used, though clever, approach instead tries various combinations of application-specific alternate guesses to produce a valid match. Unfortunately, doing this blindly can result in false positives, i.e., accepting field(s) that are, in fact, incorrect. Since false positives that "slip through the cracks" can cost more in business consequences than manually correcting the flagged field(s) up front, this approach is rarely put into practice. However, a statistical analysis of the set of possible values, in conjunction with the alternate choices and their confidence levels, can safely increase field accuracy and acceptance.
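One possible form of such an analysis is sketched below. It is an assumption made for illustration, not a description of any vendor's method: each combination of alternate characters that appears in the set of valid values is scored by the product of its character confidences, and the best match is accepted only if it beats the runner-up by a chosen margin.

```python
from itertools import product
from math import prod


def resolve_field(choices, valid_values, min_margin=2.0):
    """choices: per-character lists of (char, confidence) pairs, best first.

    Returns the accepted value, or None to flag the field for manual review.
    """
    matches = []
    for combo in product(*choices):
        text = "".join(ch for ch, _ in combo)
        if text in valid_values:
            matches.append((prod(conf for _, conf in combo), text))
    matches.sort(reverse=True)
    if not matches:
        return None  # no combination is a valid value
    if len(matches) > 1 and matches[0][0] < min_margin * matches[1][0]:
        return None  # best and second-best are too close to call
    return matches[0][1]


# Hypothetical state-code field: second character read as "O", "D" or "A".
states = {"MA", "MD", "ME"}
choices = [[("M", 0.99)], [("O", 0.5), ("D", 0.4), ("A", 0.05)]]
print(resolve_field(choices, states))
```

The margin test is what guards against false positives: when two valid values are nearly equally likely, the field is left for an operator.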

>> Use of cross-lookup tables

When a single field's lookup table has many similar values, the table is called a dense table. Conversely, if the values are all very different from each other, it is called a sparse table. To achieve efficiency, a single-field dense lookup can be turned into a sparse multi-field cross-lookup. The disadvantage of a cross-lookup is increased flagging of valid fields because a single incorrect character in any field will cause the cross-lookup to fail. A statistical measure of density is used to determine when cross-lookups are required by using the following steps:

§ For each single-field lookup table, count each table value that differs from any other table value by only one character.

§ To get the density rating, divide the count of similar items by the total number of items in the table.

§ A low number indicates a sparse table; a high number indicates a dense table.
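The steps above can be sketched in Python as follows. The function names are assumptions, and "differs by only one character" is interpreted here as two values of equal length with exactly one substituted character.

```python
def differs_by_one(a, b):
    """True if a and b have equal length and exactly one differing position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def density(table):
    """Fraction of table values that have at least one near-identical neighbor."""
    values = list(dict.fromkeys(table))  # drop duplicates, keep order
    if not values:
        return 0.0
    similar = sum(
        1
        for i, v in enumerate(values)
        if any(differs_by_one(v, w) for j, w in enumerate(values) if i != j)
    )
    return similar / len(values)


print(density(["MA", "MD", "ME", "NY", "TX"]))  # three of five are look-alikes
print(density(["MA", "NY", "TX"]))              # no look-alikes
```

The cutoff above which a table counts as dense is not specified in the text and would need to be set per application.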

Verifying the Quality

The power of validation procedures and statistical analysis is often overlooked, even when their use can substantially improve overall processing. Of course, the extent to which validation and statistical analysis techniques can improve ICR accuracy on documents varies by application, but as their use improves, raw ICR accuracy has less impact on bottom-line recognition results. This is particularly true when there are high degrees of correlation between and among fields on forms that are densely populated and when multiple validation routines can be run in parallel in a hierarchically structured mode. The net results translate into higher quality data. Although document processing software packages may provide the end user with the tools for writing the necessary rules and procedures, it is usually best to assign the programming task to a systems integrator who is familiar with the application and the database management system that the end user employs.

Arthur Gingrande, ICP, is co-founder and partner of IMERGE Consulting, a document-centric management consulting firm. Mr. Gingrande is a nationally recognized expert in document recognition technology. For more information, email arthur@imergeconsult.com or call 781-258-8181. Paul Traite, ICP, is co-founder and CTO of AliusDoc LLC. AliusDoc offers mathResults for Recognition (pat. pending), vendor-independent, business rule add-on software based upon statistical analyses of recognition engine performance. Mr. Traite can be reached at 781-267-5264 or paultraite@aliusdoc.com.