The Art of Collecting the Right Data

The form is the thing. In fact, it lies at the heart of all document-centric applications. Yet, as science marches on and technology convergence runs its inevitable course, it becomes easier and easier to forget the difficult development path that led to the document processing technology that is taken for granted today. It becomes even easier to forget the original perspective that governed it. Just take forms automation, for instance. In the days when knowledge management (KM), instead of electronic content management (ECM), was the rage, structured data was known as "explicit knowledge," and the form was renamed as an "explicit knowledge capture vehicle." Along those lines, students of knowledge management were careful to make the distinction between explicit knowledge and tacit knowledge: explicit knowledge is knowledge we know that we know, whereas tacit knowledge is knowledge we don't know that we know. In other words, tacit knowledge is implicit knowledge that has not been formally documented.

Interestingly enough, tacit knowledge can be transformed into explicit knowledge by gathering, structuring and formatting it with a form. An example would be the informal practices, methods, tips and shortcuts that company employees routinely use in carrying out work processes but have never recorded: once that information is collected and entered as data into a form template, it can be issued as a standardized report. The point is that just as it is impossible to create a coherent sentence without using rules of grammar, so it is impossible to process data meaningfully without using either a form or a form template of some sort to structure it. Hence, the notion of an "unstructured form" notwithstanding, the form is the fundamental vehicle for capturing and communicating structured data.

A distinction regarding explicit knowledge, one as overlooked as it is important, addresses the difference between the content and syntax of knowledge. Content is the "what" of knowledge, its subject matter, whereas syntax is the "how" of knowledge, i.e., the form that it takes. By definition, syntax, like an unfilled form, is devoid of content. Along those lines, logic is pure syntax. Logic contains no content whatsoever and can appear, among other things, as a logical statement or as a mathematical algorithm. Logic by itself has nothing to say about the real world, but it is a powerful tool for organizing empirical content without changing it, while mathematical formulas allow us to quantify that same content while remaining indifferent to it.

Accordingly, relative to a particular form-based application, while the form is not data itself, it is the perfect means through which to define, collect, structure, classify, contain, present and report data. Indeed, in this respect, it can be said without exaggeration that forms automation systems were initially — and still are — true electronic content management systems. So long as there is data to collect, there will exist a need for forms.

In order to design a departmental form and define its data fields, it is first necessary to circumscribe the parameters of the departmental information itself. That is to say, forms design, which includes data field selection, is a critical component of any data collection operation. The first paper forms were the vehicles through which departmental information was specified, collected and deposited within metal file cabinets. Departmental data mirrors departmental functionality; accordingly, the early form designers were defining the information flow, functionality and work processes of the departments for which they made forms. In effect, the first form designers were the pioneer systems analysts and workflow specialists — and for that matter, by extension, the first ECM and BPM practitioners.

On the most fundamental level, then, the lowly paper form was initially responsible for creating and preserving corporate information flow while it simultaneously established the corporate hierarchy. The paper form always leaves a paper trail that encourages and preserves formal territorial boundaries; in so doing, it becomes a major guardian of bureaucracy and with it, a custodian of the current power structure. In the modern digital age, the electronic form serves much the same function.

Forms automation has made its way along the technology adoption curve to the point today where the boundaries separating the various media that support forms processing (paper forms, electronic forms and document images) have blurred to a state of virtual dissolution. Email has replaced paper-based memos, just as electronic forms have replaced internal paper forms. At the same time, Internet-based e-forms are replacing the paperwork behind millions of business transactions, while most of the paper forms that remain can be processed using image-based forms automation software. Indeed, innovations in forms recognition technology have enabled large classes of unstructured forms, including invoices, explanations of benefits (EOBs), insurance claims and mortgage applications, to be recognized and fulfilled with near-human accuracy. In keeping with Moore's Law, as computing power doubles roughly every 18 to 24 months while memory gets cheaper, the arbitrary design and layout of a given form have virtually ceased to be an obstacle to high recognition accuracy.

Furthermore, form identifiers and data symbols can be embedded in any type of form via printed DataGlyphs and/or two-dimensional (2D) barcodes. Both are input technologies that permit a complete description of all the passive form attributes of a particular document to be encoded in binary form, including data that informs an ICR engine of the location coordinates of the active user data fields on that form. In other words, the growing use of these graphical input symbologies makes it highly probable that someday all forms will be ICR/OCR-friendly. Indeed, because 2D barcodes and DataGlyphs printed on forms encode binary data, they can carry any type of data format, including CSV (Comma-Separated Values), XML (eXtensible Markup Language) or a custom-designed data format. For example, it is possible to wrap the data on a filled-in paper form in XML that conforms to the proper schema and insert the result into a 2D barcode that can then be printed somewhere on that same paper form. The barcode, in turn, can be scanned and converted back to XML for use in XML-driven software applications.
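The XML round trip described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (form, field), the form identifier and the sample values are hypothetical, and the step that renders the bytes as a printed 2D barcode and scans them back is deliberately left out, since it depends on whichever barcode library is in use.

```python
import xml.etree.ElementTree as ET


def form_to_xml_bytes(form_id, fields):
    """Wrap filled-in form data in XML and return the bytes a barcode would carry."""
    root = ET.Element("form", attrib={"id": form_id})  # hypothetical element names
    for name, value in fields.items():
        field = ET.SubElement(root, "field", attrib={"name": name})
        field.text = value
    return ET.tostring(root, encoding="utf-8")


def xml_bytes_to_form(payload):
    """Recover the form identifier and field values from a scanned payload."""
    root = ET.fromstring(payload)
    fields = {f.get("name"): (f.text or "") for f in root.findall("field")}
    return root.get("id"), fields


# Hypothetical sample data; a real application would validate against its schema.
filled = {"claimant": "Jane Doe", "amount": "125.00"}
payload = form_to_xml_bytes("CLAIM-01", filled)      # bytes to encode in the barcode
form_id, recovered = xml_bytes_to_form(payload)      # what the scanning side rebuilds
```

Because the payload is plain bytes, the same two functions would work unchanged whether the symbology on the page is a 2D barcode or a DataGlyph.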

In the past, documents were organized differently for image capture than for forms automation, particularly in production applications. Typically, with document imaging, scanned images of varied and mixed document types were (and, to a large extent, still are) indexed and organized into folders for on-demand viewing. In comparison, many present-day forms automation operations process highly structured documents by scanning, identifying and sorting their images by form type and then batch-processing them by using recognition templates that automatically find the form data fields through the use of predefined location coordinate information. Of course, the use of recognition templates presumes the existence of well-designed, ICR-friendly forms. In the case of so-called "unstructured" form documents, the form templates are defined more by rules, morphological analysis and on-the-fly feature extraction than they are by stored, location-specific information; in both cases, however, an explicitly defined data set is detected, classified and then exported to a database.
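To make the idea of a recognition template concrete, here is a minimal Python sketch of the structured case, in which fields are found purely by stored coordinates. It rests on simplifying assumptions: the "page" is a grid of already-recognized text rows rather than a scanned image, and the FieldZone type, the template contents and the sample form are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldZone:
    """Predefined location of one data field (hypothetical structure)."""
    name: str
    row: int
    col_start: int
    col_end: int  # exclusive


# A template is simply the list of zones stored for one form type.
INVOICE_TEMPLATE = [
    FieldZone("name", row=1, col_start=8, col_end=20),
    FieldZone("total", row=2, col_start=8, col_end=20),
]


def extract_fields(page, template):
    """Read each field from its stored coordinates and trim the padding."""
    return {
        zone.name: page[zone.row][zone.col_start:zone.col_end].strip()
        for zone in template
    }


# Stand-in for a recognized page: a real system would work on image regions.
page = [
    "FORM: INV-A",
    "Name:   Jane Doe    ",
    "Total:  125.00      ",
]
record = extract_fields(page, INVOICE_TEMPLATE)  # ready to export to a database
```

The sketch also shows why such templates presume a well-designed form: if the layout shifts by even one column, the stored coordinates no longer line up with the data.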

It goes without saying that forms processing software does not completely automate all data entry tasks because some intervention by human intelligence is always required to correct the ICR/OCR substitution errors. But technology adoption is accelerating because users now understand that the function of automated document recognition applications is not to replace human labor altogether; rather, it is to ease the arduous task of converting mountains of paper-based information into easily managed, computer-usable data.
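One common way to divide the work between software and human operators is to send only doubtful fields to a person. The Python sketch below illustrates that idea under loud assumptions: it presumes the recognition engine reports a confidence score per field, and the threshold value, function name and sample data are hypothetical rather than taken from any particular product.

```python
REVIEW_THRESHOLD = 0.90  # assumed cutoff; real systems tune this per field type


def split_for_review(recognized, threshold=REVIEW_THRESHOLD):
    """Separate confidently recognized fields from those needing a human check."""
    accepted, needs_review = {}, {}
    for name, (value, confidence) in recognized.items():
        target = accepted if confidence >= threshold else needs_review
        target[name] = value
    return accepted, needs_review


# Hypothetical engine output: field name -> (recognized value, confidence).
recognized = {
    "name": ("Jane Doe", 0.98),
    "total": ("l25.00", 0.61),  # a substitution error: letter "l" read for digit "1"
}
accepted, needs_review = split_for_review(recognized)
```

Here the operator corrects one field instead of keying the whole form, which is the labor-easing role the technology is meant to play.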

Today, combined image capture and forms automation applications sit squarely in the midst of mainstream technology adoption, where customers demand uncomplicated, user-friendly applications with a short payback and a high ROI. Typically, these applications employ a common image capture step that sends scanned images through a forked path to either a predefined indexing or recognition workflow, depending upon the form type and the particular classification criteria in use. Diminishing scanning and computing costs, ubiquitous IT infrastructures and increasingly sophisticated document recognition technology are all combining to drive the growth of forms processing software installations. Although it has been a rough journey at times, it is now clear that forms automation technology, enabled by other intelligent document-centric software, will soon reach its fabled tipping point. Then it will be possible in one stroke to successfully extract and efficiently manage virtually all of the paper-based and digital data received daily by modern businesses — with bulletproof accuracy.
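The forked path mentioned above, a shared capture step followed by either indexing or recognition, can be sketched as a small routing function. This Python example is illustrative only: the form type labels, queue names and the rule that decides between the two paths are hypothetical stand-ins for whatever classification criteria a given installation uses.

```python
# Assumed rule: form types with a recognition template go to recognition;
# everything else is indexed for on-demand viewing.
RECOGNITION_FORM_TYPES = {"claim", "invoice"}


def route(form_type):
    """Pick the workflow for one scanned document based on its form type."""
    return "recognition" if form_type in RECOGNITION_FORM_TYPES else "indexing"


def dispatch(batch):
    """Send each (document id, form type) pair from the capture step down its path."""
    queues = {"recognition": [], "indexing": []}
    for doc_id, form_type in batch:
        queues[route(form_type)].append(doc_id)
    return queues


# Hypothetical batch coming out of the common image capture step.
batch = [("doc-1", "invoice"), ("doc-2", "correspondence"), ("doc-3", "claim")]
queues = dispatch(batch)
```

Keeping the routing rule in one place means new form types can be moved from indexing to recognition without touching the capture step.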

Arthur Gingrande [arthur@imergeconsult.com], ICP, is co-founder and partner of IMERGE Consulting, a document-centric management consulting firm. Mr. Gingrande is a nationally recognized expert in document recognition technology.
