When looking at data capture solutions, the term OCR often pops up. A technology that has been around for years, it is often the ‘go-to’ for companies looking to automate their data capture. In this blog post, the first in a trilogy explaining the CloudTrade data-capture solution, Richard Develyn, CloudTrade CTO, looks at how, although OCR may capture some of the data needed, it cannot provide the understanding required to know what the data means or what to do with it. When it comes to the future of data capture and enabling automation, we need to look at data perception and understanding…
I am often asked to explain the difference between the service that we provide here at CloudTrade and those services which are sold under the banner of “Optical Character Recognition” (OCR).
There is almost a straight answer to this, which is that OCR deals with what we might call “human perception” whereas CloudTrade is more about “human understanding”.
I say “almost” because the waters get muddied on a couple of counts. I shall come to these later; but let me first define exactly what I mean by “perception” and “understanding”.
What do we mean by data perception and data understanding?
Perception is all about recognition in its most basic form. It’s the bit in our brains which translates swirly lines and dots and circles into meaningful letters in the English language. It’s also the bit that has to struggle with differentiating between “i” and “j” or “b” and “h” so that we don’t end up wishing people “bappy hjrthdays” or catching “fjshes” on a “fjshjng book”.
Understanding, however, is all about meaning. It’s the bit that comes in after perception has done its job (assuming that it gets it right!) and figures out, say, that the word “fishing” in “fishing for compliments” has nothing to do with the word “fishing” when you’re fishing in the sea.
Where the difference between perception and understanding starts to get muddied is that both the providers of OCR-based solutions and we ourselves, at CloudTrade, offer services which are based on a combination of both of these technologies.
You can’t have one without the other
You can’t, after all, have understanding without perception (unless you’re some sort of yogi floating over a mat in the Himalayas), or perception without understanding (imagine trying to find your way around the Tokyo underground system when you don’t speak Japanese). Both CloudTrade and OCR-based solutions need both of these elements, because providing this service means not only extracting the right numbers and letters from the documents that are sent to us, but also understanding them well enough to know that, for example, “quantity 1” in an order line next to “car mats” is probably referring to a pack of 4, whereas the same phrase next to “Lamborghini Veneno Roadster” is most unlikely to be referring to a pack of 4 of them at all.
Traditionally, OCR-based solutions have focussed on the perception side of the problem because that is where they have invested the bulk of their R&D, leaving the understanding part to be provided mostly by humans.
The value is in the understanding
CloudTrade, on the other hand, has invested all of its R&D effort in understanding, bypassing the perception part completely by focusing on “data” documents such as “data” PDFs (where, for example, the letter “s” is unambiguously stored as the letter “s”, rather than as a set of drawing instructions resulting in something which merely looks like the letter “s” to the human eye).
Data PDFs do not need OCR and can therefore be thought of as producing a “perception” result which is 100% accurate. That 100% perception is the key enabler for the process of understanding: it allows natural language analysis to take place with high levels of sophistication, with no fear that the logical steps within it will be broken by some stray spanner in the works which changes the word “battery” to “hattery”, or omits a very important decimal point in the phrase “don’t exceed the recommended dose of 1.234 ml every 24 hours”.
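To illustrate why a “data” PDF sidesteps perception entirely: the characters live inside the file as text operands, so a program can read them back exactly as they were written. The sketch below is deliberately simplified (a real PDF parser must deal with fonts, encodings and compressed streams), and the content stream shown is invented for illustration; but it shows the principle that the “s” really is stored as an “s”.

```python
import re

# A fragment of a (highly simplified) PDF content stream. In a "data" PDF
# the characters are stored as actual text operands of the Tj ("show text")
# operator, not as drawn shapes.
content_stream = b"""
BT /F1 12 Tf 72 712 Td (Invoice Number: INV-0042) Tj ET
BT /F1 12 Tf 72 694 Td (Dose: 1.234 ml every 24 hours) Tj ET
"""

def extract_text_operands(stream: bytes) -> list[str]:
    """Pull out the string operand of every Tj operator in the stream."""
    return [m.decode("latin-1")
            for m in re.findall(rb"\((.*?)\)\s*Tj", stream)]

lines = extract_text_operands(content_stream)
# The decimal point in "1.234 ml" survives byte-for-byte: no perception step,
# so no chance of a perception error.
```

Note that nothing here ever has to decide whether a glyph is an “i” or a “j”; the text is simply read back, which is what makes the 100% figure possible.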
Providing the fuel for automation
Sophisticated systems of understanding remove the need for human operators and allow services to operate in a fully automated manner. At the time of writing, CloudTrade is processing ten million documents a year in this fashion. As soon as errors in perception are introduced, such as by using OCR, failures start to occur in the grammatical rules which underpin the process of understanding, and more and more human intervention is needed, resulting in less and less automation.
OCR solutions, by contrast, operate in this field by embracing the human element of document processing. Their advantage is that they are not limited to processing only data PDFs. Their disadvantage is that they cannot fully automate.
To assume is to…
The second way in which the difference between perception and understanding gets muddied is that the technology behind OCR has now made inroads into the world of understanding. To quote Douglas Hofstadter from his seminal paper on OCR and AI, “On Seeing A’s and Seeing As”:
“A tacit assumption is thus that the components of sentences – individual words, or the concepts lying beneath them – are not deeply problematical aspects of intelligence, but rather that the mystery of thought is how these small, elemental, “trivial” items work together in large, complex (and perforce nontrivial) structures.”
This assumption certainly holds for data PDFs, and that “mystery of thought” is clearly where CloudTrade has put all of its R&D effort. Should the need for OCR not disappear completely, however (as it might, if all interactions become electronic and “data” documents become the norm), then the most promising future for OCR is likely to come from a hybridisation of perception and understanding.
Variety is the spice of life? Not for data.
Although, as I said earlier, OCR makes mistakes such as reading “fjsh” for “fish”, what it actually does is identify lists of variations, rather than hard and fast answers, and then present those variations with their individual certainty values to a user for arbitration (i.e. it could be “fjsh” (60%), or perhaps it’s “fish” (50%)). OCR vendors can then use dictionaries to automatically strip out nonsense words like “fjsh” and so narrow down the possibilities to arrive at the right answer. This doesn’t work, however, when the OCR mistakes still result in words present in the dictionary, or when the word in question isn’t an English word at all (like a part number in a catalogue).
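The arbitration step described above can be sketched in a few lines. The candidate words, confidence values and dictionary below are all hypothetical, and real OCR engines score at the glyph level rather than the word level; but the shape of the logic (filter by dictionary, then take the highest-confidence survivor) is the same.

```python
# Hypothetical OCR output: each word position carries a list of candidate
# readings with confidence scores, e.g. "fjsh" (0.60) vs "fish" (0.50).
candidates = [
    [("fjsh", 0.60), ("fish", 0.50)],
    [("and", 0.90)],
    [("chjps", 0.55), ("chips", 0.52)],
]

# A toy dictionary standing in for a vendor's full word list.
DICTIONARY = {"fish", "and", "chips", "fishing"}

def arbitrate(word_candidates, dictionary):
    """Strip out nonsense words, then pick the highest-confidence survivor.

    Falls back to the raw best guess if nothing is in the dictionary,
    which is exactly the failure mode for catalogue part numbers."""
    in_dict = [(w, c) for w, c in word_candidates if w in dictionary]
    pool = in_dict or word_candidates
    return max(pool, key=lambda wc: wc[1])[0]

resolved = [arbitrate(wc, DICTIONARY) for wc in candidates]
```

Notice that the dictionary overrules the confidence scores here: “fjsh” at 60% loses to “fish” at 50%. That is also why the approach breaks down when a misreading happens to produce another valid dictionary word.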
A far more sophisticated solution would be to feed all of these variations in perception straight into the “understanding” engine and then allow the latter to crunch through all of the grammatical options.
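One very crude way to picture this crunching: try every combination of the perception variants and keep the reading that satisfies a grammatical rule. The token candidates and the “order line” pattern below are invented for illustration (CloudTrade’s actual rules engine is far richer than a single regular expression), but the sketch shows how understanding can resolve ambiguities that a dictionary cannot.

```python
import itertools
import re

# Hypothetical per-token OCR candidates for one invoice line, each position
# listing the plausible readings the perception stage produced.
token_candidates = [
    ["Qty:", "Oty:"],
    ["4", "A"],
    ["@", "©"],
    ["1.50", "1,50"],
]

# A toy "understanding" rule: an order line looks like
#   Qty: <integer> @ <decimal price>
LINE_PATTERN = re.compile(r"^Qty: \d+ @ \d+\.\d{2}$")

def best_reading(candidates, pattern):
    """Crunch through every combination of perception variants and return
    the first one that satisfies the grammatical rule, else None."""
    for combo in itertools.product(*candidates):
        line = " ".join(combo)
        if pattern.match(line):
            return line
    return None

reading = best_reading(token_candidates, LINE_PATTERN)
```

The combinatorial search is exactly why this gets “painful and slow” on poor scans: the number of combinations multiplies with every uncertain token, which matches the experience described in the next paragraph.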
This is something that we have experimented with at CloudTrade, since it is possible for us to connect OCR as the “perception” part of our solution. In doing so we have, indeed, found that with a bit of patience and tailoring we can deliver an OCR-based service which is just about acceptable and automatic for header-level capture, but it is too painful and slow to be feasible on complex documents or on scanned images that are not near-perfect.
Dictionary lookups have been a standard feature with OCR vendors for some time, and advances in Machine Learning may well improve matters further in the future. I doubt very much that these improvements will help with things like invoices and purchase orders, where much of the key information doesn’t have enough context to draw upon for significant automatic corrections to be made, but there could be mileage in using this technology with historical documents written in proper flowing prose.
OCR may well have an interesting future when it comes to scanning documents that were written in the past, but it’s more than likely to now be a past technology when it comes to documents that are to be written in the future.