Why OCR Struggles With Invoice Line Items

One of the most common problems with OCR invoice processing is handling line items. Many OCR tools can extract simple fields such as invoice number, supplier name and totals with reasonable accuracy. However, once invoices contain multiple line items the extraction becomes much more difficult.

Some OCR platforms offer line item extraction as an additional feature. Tools such as Dext include this capability as a higher tier option. In practice, though, line item extraction is often inconsistent and expensive, because invoice structure varies so widely between suppliers.

The core problem is that invoice layouts are rarely standardised. One supplier may use a clean table with clear column headers, while another may use a complex grid with merged cells, inconsistent spacing or unusual formatting. Even invoices from the same supplier can change over time, and a goods receipt for one order may look different from the receipt for the next. This variation makes it extremely difficult for traditional OCR systems to identify each row and column reliably.

Recent advances in AI have improved this area of document extraction. AI models are better at recognising patterns within documents and can often identify rows of data even when the layout is messy. However, these systems still lack context about what the business expects from each document. Because they infer the structure of the document automatically, they may hallucinate values or place information in the wrong column. Much like a person guessing the structure of a document without clear instructions, the system may make assumptions that turn out to be wrong.

This is one of the reasons why OCR invoice processing still requires manual review in many businesses. As explained in our article "Why OCR Invoice Processing Still Requires Manual Review", extracting data from documents is only one part of the process. The extracted information still needs to match the structure expected by the accounting or ERP system.

Harold approaches this problem differently. Instead of attempting to guess how every invoice should be structured, the platform allows users to train the system on their exact document formats. If your business regularly receives invoices with large numbers of line items, you can train Harold using those specific invoices.

For example, many businesses receive large Amazon invoices containing dozens of individual items. With Harold you can train the system using one of these invoices and define the exact column names you want to extract. Once the structure has been trained, the system remembers it and applies the same structure to future invoices from that supplier.

The advantage of this approach is consistency. Even if you train Harold on many different invoice formats from different suppliers, the output structure can remain exactly the same. Each invoice is mapped to the columns you defined during training. This ensures that the extracted data is ready for use in your accounting or ERP system.
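To make the idea of a fixed output structure concrete, here is a minimal Python sketch. It is not Harold's implementation or API; the supplier names, column headers and the normalise_row helper are all hypothetical, chosen only to illustrate how differently labelled supplier columns could be mapped onto one set of output columns.

```python
# Illustrative sketch only: the mappings and names below are assumptions,
# not taken from Harold or from any real supplier's invoice format.

# The single output structure every invoice should end up in.
OUTPUT_COLUMNS = ["description", "quantity", "unit_price"]

# One mapping per supplier: header as printed on the invoice -> output column.
SUPPLIER_COLUMN_MAPS = {
    "amazon": {
        "Item": "description",
        "Qty": "quantity",
        "Unit price": "unit_price",
    },
    "acme": {
        "Product description": "description",
        "Units": "quantity",
        "Price each": "unit_price",
    },
}


def normalise_row(supplier, raw_row):
    """Map one extracted line item onto the shared output columns."""
    mapping = SUPPLIER_COLUMN_MAPS[supplier]
    # Rename the headers we know about; ignore anything unmapped.
    mapped = {mapping[k]: v for k, v in raw_row.items() if k in mapping}
    # Always return every output column, using None for missing values.
    return {col: mapped.get(col) for col in OUTPUT_COLUMNS}


amazon_row = normalise_row(
    "amazon", {"Item": "USB cable", "Qty": "2", "Unit price": "4.99"}
)
acme_row = normalise_row(
    "acme", {"Product description": "Desk lamp", "Units": "1"}
)
```

In this sketch both rows come back with the same three keys, even though the two suppliers label their columns differently and the second row has no price. That sameness is what lets a downstream accounting or ERP import treat every supplier identically.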

Rules can also be applied to enrich the extracted information. For example, additional fields can be generated automatically to match the supplier codes, product codes or tax structures required by your ERP. These are the same kinds of decisions that administrators typically make when reviewing invoice data manually.
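As a rough illustration of rule-based enrichment, the Python sketch below applies a list of simple "when this condition holds, set these fields" rules to an extracted line item. The rule format, the codes such as SUP-001 and the apply_rules function are invented for this example and do not describe how Harold defines or stores rules.

```python
# Illustrative sketch only: rule shape, field names and codes are assumptions.

# Each rule pairs a condition with the extra fields to add when it matches.
RULES = [
    {
        "when": lambda row: row["supplier"] == "Amazon",
        "set": {"supplier_code": "SUP-001"},
    },
    {
        "when": lambda row: "cable" in row["description"].lower(),
        "set": {"product_code": "CBL", "tax_code": "STD"},
    },
]


def apply_rules(row, rules):
    """Return a copy of the row with fields added by every matching rule."""
    enriched = dict(row)  # leave the extracted row untouched
    for rule in rules:
        if rule["when"](enriched):
            enriched.update(rule["set"])
    return enriched


line_item = {"supplier": "Amazon", "description": "USB Cable", "quantity": "2"}
enriched_item = apply_rules(line_item, RULES)
```

Here the enriched item keeps its original fields and gains a supplier code, product code and tax code, which mirrors the kind of decision an administrator would otherwise make by hand for each row.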

Our goal with Harold is not to charge extra for capabilities like line item extraction or to restrict advanced functionality to higher pricing tiers. Instead, the aim is to give users the tools to train the system themselves and automate the repetitive decisions that normally happen during invoice processing, so that businesses can automate more of their document processing as they train the system.

When line item extraction works reliably and the output structure matches the needs of the business system, the manual review stage begins to disappear. Instead of correcting rows of invoice data, teams can focus only on the small number of exceptions that require attention.

This is how document automation moves beyond simple OCR. Instead of guessing how a document should be interpreted, the system learns the structure of the documents your business actually receives.

