CloudTrade Podcast - Episode 4 - Why is data capture a logical problem?

Watch the podcast here. If you’d prefer to read about what was discussed, please read on.
In their latest CloudTrade podcast, Richard Develyn, CTO, and Steve Britton, Director of EMEA Sales, discuss the importance of understanding the data that is captured for downstream processing. Richard goes on to explain how CloudTrade incorporates logical rules-writing into its capture solution and how the company is planning to further develop the rules-writing concept.

Steve: Good morning, welcome to the call, and I am delighted to have one of our founders and our CTO on the call to talk about industry matters, which we are often asked to comment on from various customers and analysts and other people who are active in our market space.

I have chosen the subject for us to discuss today, which I think you are in agreement with, and that is, “Why is data capture a logical problem?” And I raise that because this is the message you say to the company and to the Sales team as one of the reasons why you decided to develop the technology a number of years ago and perhaps we ought to talk about that a little bit to understand your motivation and what you saw was a challenge in the marketplace.

Just by way of background for the listeners, this is a series of blog posts that we are presenting on a weekly basis around the industry, around CloudTrade. Later on this week, because it's International Women's Week this week, I have a conversation with one of our colleagues, Amee Patel, on women in Tech and so it will be interesting to see what Amee has come up with.

But moving back to the subject today, as you know, Richard, for the last 20 odd years I've been working with OCR and various capture technologies, and outsourcers, to try and help people who receive many different document types into their business and they need to extract various fields, elements of those documents, and put them into a back-end system.

And you know it's not just about capture, it's about understanding that data, and how you need to present it to your back-end system. The objective being to stop people having to read and touch documents, and I suppose the panacea is to have end-to-end processing with no human intervention. Maybe that is a possibility, but I know in the conversations you and I've had over the years, you are very passionate about your beliefs and your view on how people should approach the challenge of capturing information off human-readable documents.

Perhaps you could share a bit about your view of the world and why you went down this route with CloudTrade, and indeed got global patent recognition for our approach and technology.

Richard: OK, thank you Steve, indeed. So I think you have to differentiate between unidentified data capture and identified data capture. Unfortunately, they both tend to be called data capture, which confuses things a little bit, but they're quite different beasts and they require quite different technology.

So unidentified data capture is, at its root, like something that OCR does. It goes into an image and it extracts the raw data, the words and the numbers from that image, and puts it into another file. In fact, you do the same thing with PDF data extraction. You could see it as unidentified data, and that's the first stage you have to do anyway, which is where you navigate your way through all the very convoluted file structure which the PDF standard has in it, and you find those bits of data and then you extract them and put them somewhere else. And we tend to call that sort of thing data extraction, but it is unidentified data extraction.

Now unidentified data is really of no use to man nor beast. If you have the number 12345 sitting somewhere and you don't know what that 12345 is, it's of no use to you. The really important part happens next, which is the identification of that data, and to go back to the question which you asked originally - Is it a logical problem? - it is the identification of that data, which is a logical problem. Now, traditionally, people would completely shy away, they still do shy away from that part and they would use the OCR systems or the PDF data extraction system to get that raw data out and then rely on humans to identify it or have some very, very simple mechanisms to identify that data. They would implement fairly simple logic. For example, something might always be in the same place. That's a logical step to say it's always over here on the document, or even you find something unique and you take a little line to it and say, well, it's always there. And that's about as far as it went.
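To make the distinction concrete, here is a minimal sketch in Python. It is not CloudTrade's implementation: the `Token` type, the coordinates, and the `value_right_of` helper are all hypothetical, invented for illustration. The list of tokens stands in for unidentified data (raw words with positions), and the helper is one example of the "fairly simple logic" Richard describes, where a value is identified by its position relative to a label.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    text: str
    x: float  # left edge of the word (hypothetical units)
    y: float  # line position; words on one line share roughly the same y


# Unidentified data: the words are extracted, but nothing says what they mean.
tokens = [
    Token("Invoice", 50, 100),
    Token("No:", 95, 100),
    Token("12345", 130, 100),
    Token("Total", 50, 300),
    Token("98.50", 130, 300),
]


def value_right_of(tokens: List[Token], label: str,
                   y_tolerance: float = 2.0) -> Optional[str]:
    """Identify a value as the nearest word to the right of a label on the same line."""
    anchor = next((t for t in tokens if t.text == label), None)
    if anchor is None:
        return None  # the label is missing, so this simple rule cannot identify anything
    same_line = [
        t for t in tokens
        if abs(t.y - anchor.y) <= y_tolerance and t.x > anchor.x
    ]
    if not same_line:
        return None
    return min(same_line, key=lambda t: t.x).text


# Identified data: the same words, now attached to a meaning.
invoice_number = value_right_of(tokens, "No:")
total = value_right_of(tokens, "Total")
```

The sketch only covers the simplest case. It says nothing about documents where the label moves, is missing, or appears more than once, which is where the identification problem becomes hard.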

But when I started CloudTrade with David and Richard Manson, I focused immediately on the problem of the identification. It is rather amusing when I look back and think of it, that originally I saw the extraction of data from the PDF as something that would take me maybe a week at most and all my efforts went into the identification side of it. In fact, the PDF standard is so convoluted that it took probably a year of programming, of refinements and refinements and refinements, to get us to the stage where we could work with all the strange, odd variations that that standard has in it, but that is still just unidentified data extraction. The whole thing that I built in after that was this rules system which would allow rules writers to, in essence, feed into our engine the means by which data was identified, and that, I think, is our key differentiator, and what it was all about.

Steve: I suppose it begs the question, Richard, that data capture, as you say OCR and unidentified data capture, has been around a long time, and those technologies have developed over the years. Some level of AI or algorithms have been built into them to try and improve the efficiency. It begs the question, “Why didn't you just take one of those engines rather than spending a year dealing with the capture piece, utilize what was on the shelf, as it were, and try and bolt something on top?”

Richard: There wasn't, and there still isn't actually, a sophisticated PDF data-extraction library. The OCR is a different matter. We tried to steer away from OCR because in order to do data identification beyond the simplest possible mechanisms you have to do a lot of navigating and investigating of the document and if you are using OCR with its very raw data uncertainties, that navigation starts to become unfeasible. There are too many sources of error, you have to keep going back to the document to find something about it, to then go somewhere else in the document, to go somewhere else in the document, and if each of those steps has got OCR uncertainty, you pretty soon just fail. With our rules system originally we discovered that it was about 20 times harder. OCR is a lot better now, but it was 20 times harder to write rules for documents which had OCR uncertainty than for those that didn't.
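One way to picture why chained navigation breaks down under OCR uncertainty: if each lookup in a rule succeeds independently with probability p, a rule that chains n lookups succeeds with probability p to the power n. The sketch below is illustrative arithmetic only. The 0.95 per-step figure and the independence assumption are invented for the example, and the calculation is not meant to reproduce the "20 times harder" figure, which came from CloudTrade's own experience.

```python
def chain_success(p_step: float, steps: int) -> float:
    """Probability that every step in a chain of lookups succeeds,
    assuming each step succeeds independently with probability p_step."""
    return p_step ** steps


# Hypothetical per-step reliability of 95% on an OCR'd document.
for steps in (1, 5, 10, 20):
    print(steps, round(chain_success(0.95, steps), 3))

# With no uncertainty (p = 1.0), the chain length stops mattering.
print(chain_success(1.0, 20))
```

Under these assumptions a ten-step rule already fails roughly four times in ten, whereas the same rule over exact data never fails because of extraction.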

So we narrowed our scope to PDF data documents. And the sad thing about the PDF standard is that success in extracting data from PDFs depends not on the standard, but on whether or not Acrobat Reader succeeds. So what you find is there are loads of PDF documents out there which are totally outside of the PDF standard, but Acrobat Reader still reads them. And if Acrobat Reader still reads them, then you are expected to be able to read them as part of your process, and that's why it takes such a long time to iron out all these idiosyncrasies. It's very easy to do something that just sticks to the standard, but that falls way short of expectations.

Steve: OK, thank you for that. So we've got the ability, as I mentioned earlier, the unique ability, to extract the technical layers of a document rather than using the image and therefore running OCR. So we don't use OCR, but we have an application-generated document and we can, as we do, deliver 100% accuracy from the data that we extract.

But back to the subject here of the rules. So you gave an analogy that you and I were starting a new job, we've got some documents in front of us, and somebody says you need to extract this information that sits here in the top right, there in the bottom left, and you need to do this with the information, and you need to process it. Over years people are going to become very familiar with those documents, or maybe not years but even weeks, and therefore it becomes semi-automated in terms of how we process them. It still takes time, and technology can do it more accurately and a lot quicker. But as human beings, which they are, it's the third Wednesday of a month, and it's from this supplier and therefore, oh I apply this rule for that supplier, and, actually, what they meant to put on the document was this, because we know.

So how did your rules address that? Or does that impose other challenges and problems in terms of the downstream process?

Richard: The rules system is a logical system and the basic promise that we make in CloudTrade is that if you can articulate the rule, then we can implement it. It is built on a logical programming language, Prolog, and really does allow us to put any sort of logic that you like. The only thing we can't do is if you actually cannot articulate it. If the only way to get information out is some sort of artistic intuition then we're not going to be able to do it, but give us a rule and we will implement it and therefore automate it, and therefore do it at the speed of IT.
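As an illustration of what "articulating a rule" might look like, here is a small sketch based on Steve's third-Wednesday example. It is written in Python for readability, whereas Richard says the real system is built on Prolog, so this is not CloudTrade's rule syntax. The supplier name `ACME`, the field names, and the rule itself are invented for the example.

```python
from datetime import date


def is_third_wednesday(d: date) -> bool:
    """True when d is the third Wednesday of its month."""
    # weekday() == 2 is Wednesday; the third one always falls on day 15 to 21.
    return d.weekday() == 2 and 15 <= d.day <= 21


def apply_supplier_rule(supplier: str, received: date, fields: dict) -> dict:
    """Hypothetical articulated rule: documents from ACME received on the third
    Wednesday of the month carry the PO number in the 'reference' field."""
    result = dict(fields)  # leave the caller's data untouched
    if supplier == "ACME" and is_third_wednesday(received) and "reference" in result:
        result["po_number"] = result.pop("reference")
    return result


# 17 March 2021 was the third Wednesday of that month; 10 March was the second.
on_rule_day = apply_supplier_rule("ACME", date(2021, 3, 17), {"reference": "PO-778"})
other_day = apply_supplier_rule("ACME", date(2021, 3, 10), {"reference": "PO-778"})
```

The point of the sketch is that knowledge a person carries in their head ("because we know") becomes automatable once someone can state it as a condition and an action.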

Steve: Absolutely. And that takes a lot of the guesswork that typically happens in the process, and I'm sure that from a governance and compliance point of view, utilizing a service like CloudTrade, because A: 100% data accuracy, and B: using the rules that have been agreed for the downstream process means that as the executive of a business, I can now trust that data that's been delivered through this process, whereas before, and Richard you and I hear this all the time, "I have to touch every document" Why? "Because I don't trust the output I'm getting, and therefore I need a human being to address it."

And as we've just discussed, sometimes a human being has had a bad night's sleep, and was distracted doing something else and hit the wrong key, etc. And therefore you've got more work downstream to process that document. So what I'm hearing with CloudTrade and, again back to the opening question around "What drove you in your passion?" was we've got a unique ability in the technology stack you've developed to extract with 100% accuracy, and to apply those business rules that we've understood by somebody articulating those to us, so we know that's going to run consistently all the time.

And back to my opening question around the challenge that you saw in the market. You must have looked around to say, well, who else does this? Or is the process that CloudTrade has now truly unique in the market?

Richard: I think it always was unique. I've always by nature been an innovator rather than an imitator. That's just the way I am. I did look around a little bit, but you know, people tend to be imitators, which means what tends to happen in the market is that if there is no solution at present somebody will innovate and then everybody else will just copy. So you then get a little bit fooled by the idea that there might be 10 solutions all doing the same thing, and imagine that everyone's come to the same conclusion independently, and actually, no, they haven't. You know, one person came up with the idea and then all the other companies were put under a great deal of pressure not to go out on a limb, just literally do what the other people do, but maybe try and improve on it. But I've never been like that.

First of all, I can see that there's no great reason to respect what's out there in the industry, because like I said, most people would have just copied. And another thing about the problem, the identification problem, was that it's hard. It's not an easy one to solve, so in order to innovate you have to be in a certain position as a company where you are free to do so; you have to have that freedom, perhaps you have to be small, to be able to innovate means you haven't got people breathing down your neck asking you to be accountable for every second of the day. You experiment. So that really is what gave us the ability to produce really just the right solution.

Steve: Right.

Richard: I don't really think there is any other solution that I can think of. I know people are now trying to leapfrog over the whole system by bringing in neural networks and machine learning and so on, and I do not believe that that's going to succeed. I think that what we produced actually was what the first innovators should have done, but I think they saw that it was a difficult problem and shied away from it, and said, "I know what we will do, we'll just do the unidentified data capture bit and pass the identification side on to humans. We won't try to crack the problem of natural language processing," which is actually what you need to do for data identification.

Steve: Right. Excellent. So we have a unique capability to extract data off an application-generated document. We have the ability to capture rules that have been articulated to us and apply those consistently in our end-to-end process, ensuring that we do truly deliver straight-through processing of documents, which is fantastic. So that begs the question, Richard. No business can stand still. What does the future hold?

Richard: I thought you were going to allude to that. We've got a lot of experience now in writing rules to capture documents, particularly in the domain that we mainly operate in, which is, invoices and orders. And we are now bringing in this new product, which is an evolution of what we do called Grandalf, and the concept of Grandalf is, let's take that rules expertise and see if we can synthesize from that a set of rules, coupled together with analysis of the documents that we processed, and create a new application which basically, for some subset of documents, will go to the sense of the document and present a sort of wizard-style, Q&A-style interface to allow the rules to be written for them without going through the rules-writing process.

So the person who uses Grandalf submits this document in and if it fulfils certain criteria the Grandalf wizard then says, “OK I have identified certain things about this document, which means that I think I can write these rules automatically with help from you. So can you just confirm that this is the invoice number or this is the order number”, and, having done so, (and it may not necessarily have to ask this question if it's obvious anyway but just in case it isn't) it might say, “Can you confirm how I should extract this information in the future because I think it is to the right of this word, or below this word, or somewhere over here?” And as the sender of the document gives their answers, all of this is stored and it effectively becomes a set of rules, but a set of rules written by this assisted mechanism with an end user rather than by our rules writer.
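The assisted flow Richard describes could be pictured roughly as follows. This is a speculative sketch, not Grandalf's actual design: the `Word` and `Rule` types, the coarse layout grid, and the two relations (`right_of`, `below`) are assumptions made for illustration. The user confirms a value on a first document, the wizard proposes how that value relates to nearby words, and the chosen answer is stored as a rule that can be replayed on later documents from the same sender.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    text: str
    col: int  # column on a coarse layout grid (hypothetical)
    row: int  # row on the same grid


@dataclass(frozen=True)
class Rule:
    field: str
    label: str
    relation: str  # "right_of" or "below"


def propose_rules(words: List[Word], field: str, confirmed: Word) -> List[Rule]:
    """List the relations a wizard could offer for a user-confirmed value."""
    proposals = []
    for w in words:
        if w == confirmed:
            continue
        if w.row == confirmed.row and w.col < confirmed.col:
            proposals.append(Rule(field, w.text, "right_of"))
        elif w.col == confirmed.col and w.row < confirmed.row:
            proposals.append(Rule(field, w.text, "below"))
    return proposals


def apply_rule(words: List[Word], rule: Rule) -> Optional[str]:
    """Replay a stored rule on a later document; None means the rule did not fit."""
    label = next((w for w in words if w.text == rule.label), None)
    if label is None:
        return None
    if rule.relation == "right_of":
        hits = sorted((w for w in words if w.row == label.row and w.col > label.col),
                      key=lambda w: w.col)
    else:
        hits = sorted((w for w in words if w.col == label.col and w.row > label.row),
                      key=lambda w: w.row)
    return hits[0].text if hits else None


first_doc = [Word("Invoice", 0, 0), Word("INV-001", 1, 0),
             Word("Date", 0, 1), Word("2021-03-08", 1, 1)]
# The user confirms that "INV-001" is the invoice number.
options = propose_rules(first_doc, "invoice_number", first_doc[1])
chosen = options[0]  # the user accepts the first suggestion

next_doc = [Word("Invoice", 0, 0), Word("INV-002", 1, 0),
            Word("Date", 0, 1), Word("2021-04-12", 1, 1)]
extracted = apply_rule(next_doc, chosen)
```

When `apply_rule` returns `None`, or when none of the proposals matches what the user knows about the document, that would be the point at which the document goes to a human rules writer instead.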

Rules writers are still necessary, of course, for documents which aren't that easy to do, where the logic isn't so easily synthesized and particularly right now with line-level data. But we can grow that.

Steve: So, do you see this being effectively documents-on-the-fly, self-service rules writing?

Richard: Yes. We have to investigate exactly how we're going to position it in the market. Who's going to be doing the Grandalf UI work?

Steve: Right.

Richard: But we can start from that premise, actually: the document can come in, the sender can submit it, and either the sender or somebody within CloudTrade, or whoever it is, can look at it, go through the wizard, and if it turns out that Grandalf says “OK, is it one of these values?” and it's not, then the person says, “OK, this is a more complicated document”, or if it says - even if it gets the value right – “Well in future cases can I get the information in this manner?” and it gives you a list of choices, and the person in front of it says “No, I'm sorry, it's more complicated than that”, we can hit a little button to say, OK, now we've got to jump out onto our traditional rules writing and do it that way.

Steve: Really exciting, Richard. I'm just conscious of time today. Thank you so much for that and I think in summary, I often hear our happy customers saying this, that what we do is truly magic, and you've just given the answer that with Grandalf coming along, we've taken that to the next level.

Richard: I dislike using the word magic because I want to say to people it's perfectly explainable, and we're going to some lengths in our FAQ page on the website to make sure that people can see exactly what we do. There's nothing up our sleeves, you know, but yes, I get your point.

Steve: Absolutely. Wonderful. Thank you again and I will look forward to the next blog.
