Measuring the accuracy of data-capture solutions


In this blog, CloudTrade’s CEO, David Cocks, considers how we measure results. Our wide exposure to government and media analysis of Covid-testing methodology has inspired him to consider the similarities between evaluating the precision of medical tests and measuring the accuracy of data-capture solutions.

During the pandemic of the past year, we’ve all seen the usefulness of testing. However, it is important to understand how the efficacy of tests, and of any other similar type of process, is measured.

There are distinct analogies to be made between how we measure the accuracy of Covid tests (or any other medical test), and how we measure the effectiveness of a data-capture solution.

I would like to consider here how we measure the accuracy of data-capture solutions.

The terms of testing

Evening television announcements by the UK government and its medical advisers over the past year have familiarized us with the terms sensitivity and specificity when applied to medical tests. This was particularly the case with Covid testing, where:

  • sensitivity measures how well the test detects true positives (i.e. the proportion of genuinely positive cases that the test correctly identifies),
  • specificity measures how well the test identifies true negatives (i.e. the proportion of genuinely negative cases that the test correctly rules out).

In short, sensitivity and specificity are measures of true positives and true negatives respectively.
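
To make these two measures concrete, here is a minimal Python sketch (the counts are invented example figures, not real Covid-test data) showing how each is calculated from the outcomes of a test:

```python
# Minimal illustration: sensitivity and specificity from test outcomes.
# The example counts are invented, not real Covid-test data.

def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Proportion of genuinely positive cases that the test detects."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives: int, false_positives: int) -> float:
    """Proportion of genuinely negative cases that the test correctly rules out."""
    return true_negatives / (true_negatives + false_positives)

# Example: of 100 infected people the test flags 90 (10 are missed),
# and of 100 uninfected people it wrongly flags 5.
print(sensitivity(true_positives=90, false_negatives=10))  # 0.9
print(specificity(true_negatives=95, false_positives=5))   # 0.95
```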

When it comes to data capture, we turn the logic round a little bit and we talk about false positives and false negatives. These can be defined as follows:

  • a false positive is when a field is captured incorrectly (the wrong value is reported as if it were correct),
  • a false negative is when something that is actually present on the document is not captured at all.

The skill lies in designing a data-capture solution with the minimum number of false positives and false negatives, to ensure both accuracy and efficient document automation. However, as with sensitivity and specificity, improving one measure very often means compromising the other.
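
As an illustration of what those two error types look like in practice (the field names and values below are invented, and this is not how CloudTrade’s engine works internally), a capture result can be scored against a known-correct reference by counting both kinds of error:

```python
# Illustrative only: score a capture result against a known-correct reference.
# Field names and values are invented for the example.

expected = {"invoice_number": "INV-1001", "total": "250.00", "currency": "GBP"}
captured = {"invoice_number": "INV-1001", "total": "2500.00"}  # total wrong, currency missed

# False positive: a field was captured, but with the wrong value.
false_positives = sum(
    1 for field, value in captured.items() if expected.get(field) != value
)

# False negative: a field that is present on the document was not captured at all.
false_negatives = sum(1 for field in expected if field not in captured)

print(f"False positives: {false_positives}")  # 1 (wrong total)
print(f"False negatives: {false_negatives}")  # 1 (currency missed)
```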

Finding a balance

When it comes to medical testing you have to find the right tests with the right balance of sensitivity and specificity, based on the probabilities and consequences of reporting incorrect results. The same applies to data capture. In looking at the compromise between false positives and false negatives, you must consider the probabilities and consequences.

False positives are far worse than false negatives

When you think about this in data-capture terms, particularly for financial documents, false positives are a far worse outcome than false negatives. A false positive means you have picked up the wrong piece of information and passed it off as something else, whereas a false negative means you have simply failed to capture something. Missing data can generally be filled in by a human operator once its absence is noticed. So having something missing is far better than confidently reporting the wrong information.

Erring on the side of caution is unlikely to result in 100% accuracy

The way most data-capture solutions deal with this problem is to give preference to false negatives. You will often hear vendors say their systems are 80-90% accurate. That usually means they capture 80-90% of the data on the document while very rarely making an outright mistake: they err on the side of caution, preferring not to capture a field at all rather than capture the wrong thing.
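
To see why an ‘80-90% accurate’ claim usually describes coverage rather than correctness, consider this small worked example (the figures are invented):

```python
# Invented example figures: a document containing 100 capturable fields.
total_fields = 100
captured_fields = 85   # fields the system returns a value for
wrongly_captured = 1   # false positives among the captured fields

capture_rate = captured_fields / total_fields                                    # 0.85
correct_when_captured = (captured_fields - wrongly_captured) / captured_fields   # ~0.988
missed_fields = total_fields - captured_fields                                   # 15 false negatives

print(f"Capture rate: {capture_rate:.0%}")                     # 85%
print(f"Correct when captured: {correct_when_captured:.1%}")   # 98.8%
print(f"Fields left for a human to fill in: {missed_fields}")  # 15
```

In other words, a system described as ‘85% accurate’ may be almost never wrong about what it does capture, while still leaving a sizeable share of the document for a human to complete.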

How CloudTrade’s new product CIRA offers the optimal solution for data capture

At CloudTrade we set about resolving these contradictory objectives of capture versus caution with our new product, the CloudTrade Intelligent Rules Assistant (CIRA). With CIRA we combine the empirical knowledge we have accumulated over 10 years with machine learning capabilities to do the following:

  • find the widest possible set of candidate values for any particular field, together with the method used to capture each one,
  • allow the user to select which of these values is actually correct,
  • persist the capture method of the chosen value as a solution for future documents.

In this way, CIRA minimizes false negatives by identifying the widest possible set of results, and eliminates false positives by having the user select which of the possible answers is actually correct.
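
The sketch below gives a loose, hypothetical outline of that candidate-then-confirm workflow. All names and logic here are invented for illustration and do not reflect CIRA’s actual implementation or API:

```python
# Hypothetical sketch of a candidate-then-confirm capture workflow.
# Every name and rule below is invented for illustration; this is not CIRA code.

persisted_methods: dict[str, str] = {}  # field name -> capture method confirmed by a user

def propose_candidates(document: str, field: str) -> list[dict]:
    """Return the widest possible set of candidate values for a field,
    each paired with the method that found it (placeholder logic)."""
    return [
        {"value": "250.00", "method": "value next to the label 'Total due'"},
        {"value": "2500.00", "method": "largest amount on the page"},
    ]

def apply_method(document: str, method: str) -> str:
    """Re-run a previously confirmed capture method on a new document (stub)."""
    return "250.00"

def capture_field(document: str, field: str, ask_user) -> str:
    if field in persisted_methods:
        # A user has already confirmed how to find this field: reuse that method.
        return apply_method(document, persisted_methods[field])
    candidates = propose_candidates(document, field)  # cast a wide net: few false negatives
    chosen = ask_user(candidates)                     # the user rejects the false positives
    persisted_methods[field] = chosen["method"]       # persist the method for future documents
    return chosen["value"]

# Example use: stand in for the user by always picking the first candidate.
value = capture_field("invoice.pdf", "total", ask_user=lambda candidates: candidates[0])
print(value)  # "250.00"
```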

By combining this in-built intelligence and a convolutional neural network, trained on a large repository of previously processed documents, with simple user selection, CloudTrade has created the optimal solution for data capture.

Covid testing is designed with the limits of sensitivity and specificity in mind, so that an incorrect result can be resolved by further medical action: quick lateral flow tests are followed up, when appropriate, with the lengthier but more accurate PCR test. For data capture there really can be a solution that delivers the correct result first time. To see how we eliminate false results, watch a short demonstration of CIRA (formerly known as Grandalf) here.