What is DocuTrain? How Harold learns your documents

What is DocuTrain?

DocuTrain is the part of Harold where you teach it about a specific type of document from a specific supplier. It is not a template system. It is not a rules wizard. You upload real examples — PDFs, scanned invoices, whatever your supplier actually sends you — and Harold analyses them to understand the structure.

Once trained, Harold will extract the right data from every future document that matches that type — automatically, every time.

How it works

The training process has four steps.

Step 1 — Upload. You upload between one and ten example documents. More examples improve accuracy, especially when your supplier has slight layout variations across documents. Harold accepts PDFs, images, and scanned files.

Step 2 — Detect. Harold sends each document to its AI layer, which reads the document visually and identifies every field it can find: invoice numbers, dates, supplier names, line items, VAT amounts, totals, purchase order references, and anything else present. It scores each field by importance — how consistently it appears, how critical it is to a business process — and records where on the page each field tends to live.

Step 3 — Review. You see the full list of detected fields alongside sample values pulled from your actual documents. You can rename fields to match your own terminology, remove anything irrelevant, add fields Harold missed, and map each field to a standard Harold output key. This is the most important step — it is where your business logic gets encoded.

Step 4 — Save. Harold saves the trained schema permanently. From this point forward, any document of this type that arrives in Harold — via email inbox, upload, or API — will be extracted using this schema without you touching it again.
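
For readers who think in data structures, the review step above can be pictured roughly as follows. This is a minimal sketch, not Harold's actual implementation: the `DetectedField` class, the `review` function, the 0-to-1 importance scale, and the location labels are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class DetectedField:
    """One field found during the detect step (hypothetical shape)."""
    name: str                         # label as detected, e.g. "Inv No"
    importance: float                 # assumed 0..1 score
    location_hint: str                # assumed coarse position label
    sample_value: str                 # value pulled from an example document
    output_key: Optional[str] = None  # standard output key, set during review


def review(
    fields: Iterable[DetectedField],
    renames: Dict[str, str],
    removals: Set[str],
    additions: Iterable[DetectedField],
    key_map: Dict[str, str],
) -> List[DetectedField]:
    """Apply the four review actions: remove, rename, add, map."""
    kept = [f for f in fields if f.name not in removals]
    renamed = [replace(f, name=renames.get(f.name, f.name)) for f in kept]
    combined = renamed + list(additions)
    # Fields without an entry in key_map keep output_key=None.
    return [replace(f, output_key=key_map.get(f.name)) for f in combined]


detected = [
    DetectedField("Inv No", 0.95, "top-right", "INV-1001"),
    DetectedField("Page", 0.10, "bottom-center", "1 of 2"),
    DetectedField("Total", 0.90, "bottom-right", "120.00"),
]

schema = review(
    detected,
    renames={"Inv No": "Invoice Number"},
    removals={"Page"},
    additions=[DetectedField("PO Reference", 0.50, "top-left", "PO-77")],
    key_map={
        "Invoice Number": "invoice_number",
        "Total": "total",
        "PO Reference": "po_reference",
    },
)

for field in schema:
    print(field.name, "->", field.output_key)
```

In this picture, the saved schema is simply the reviewed field list, which is why the review step carries the business logic: nothing after it changes which fields are extracted or what they are called.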

What makes it different

Most document processing tools give you a field detection wizard where you click on regions of a PDF and draw boxes. DocuTrain does not work like that. Harold reads documents the way a person does: it understands context, not just coordinates. This means it handles layout changes in a supplier's documents gracefully. A supplier who slightly reformats their invoice from one year to the next will not break your extraction.

The multi-document merge is particularly important. When you train on five examples, Harold does not just learn from the first one. It cross-references all five, takes the most common location hint for each field, averages the importance scores, and only surfaces fields that appear consistently across your examples. You get a cleaner, more reliable schema.
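
The merge described above can be sketched in a few lines. This is an illustration under stated assumptions rather than Harold's real code: the `merge_detections` function, the input shape (one dictionary per example document), and the `min_share` threshold for what counts as "consistent" are all hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Tuple

# Each example document maps a field name to (importance, location_hint).
Detection = Dict[str, Tuple[float, str]]


def merge_detections(per_document: List[Detection], min_share: float = 0.6) -> dict:
    """Merge detections from several examples into one schema.

    min_share is an assumed cut-off: a field is kept only if it appears
    in at least that share of the example documents.
    """
    total = len(per_document)
    names = {name for doc in per_document for name in doc}
    merged = {}
    for name in sorted(names):
        hits = [doc[name] for doc in per_document if name in doc]
        if len(hits) / total < min_share:
            continue  # not consistent enough across the examples
        locations = Counter(location for _, location in hits)
        merged[name] = {
            "importance": mean(score for score, _ in hits),      # averaged score
            "location_hint": locations.most_common(1)[0][0],     # most common hint
            "seen_in": len(hits),
        }
    return merged


examples = [
    {"invoice_number": (0.9, "top-right"), "total": (0.8, "bottom-right")},
    {"invoice_number": (0.9, "top-right"), "total": (0.8, "bottom-right")},
    {"invoice_number": (1.0, "top-left"), "total": (0.8, "bottom-right"),
     "promo_code": (0.2, "footer")},
    {"invoice_number": (0.9, "top-right"), "total": (0.8, "bottom-right")},
    {"invoice_number": (0.8, "top-right")},
]

merged = merge_detections(examples)
print(sorted(merged))
```

Here `promo_code` appears in only one of the five examples, so it falls below the assumed threshold and is left out, while `invoice_number` keeps the location hint seen in four of the five documents.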

Ease of use

DocuTrain is designed to be operated by someone who knows their business documents — not a developer. If you can identify an invoice number on a page, you can complete a training session. The detect step is fully automated; the review step typically takes five to ten minutes for a new document type.

The main friction point is the first training session. Users sometimes upload too few examples (one or two), which makes the review list less reliable. We recommend a minimum of three examples and ideally five for document types with variable layouts.

Limitations to be aware of

DocuTrain works best on structured documents — invoices, purchase orders, delivery notes, credit notes, receipts. It is less suited to freeform documents like emails, contracts, or letters where there is no consistent field layout.

If a supplier sends dramatically different document formats for the same document type — different software versions, different country layouts — you may need to create separate training schemas for each variant.

Training is per document type and per supplier combination. A Purchase Invoice from Supplier A and a Purchase Invoice from Supplier B are two separate training sessions. This is intentional — it is what makes extraction reliable rather than generic.
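
One way to picture this scoping is a lookup keyed on the supplier and document type together. The sketch below is purely illustrative: `SchemaRegistry`, its methods, and the key normalisation are hypothetical and do not describe Harold's storage or API.

```python
from typing import Dict, Optional, Tuple

SchemaKey = Tuple[str, str]  # (supplier, document type)


class SchemaRegistry:
    """Hypothetical store holding one trained schema per supplier and type."""

    def __init__(self) -> None:
        self._schemas: Dict[SchemaKey, dict] = {}

    @staticmethod
    def _key(supplier: str, document_type: str) -> SchemaKey:
        # Normalise case and whitespace so lookups are forgiving (an assumption).
        return (supplier.strip().lower(), document_type.strip().lower())

    def save(self, supplier: str, document_type: str, schema: dict) -> None:
        self._schemas[self._key(supplier, document_type)] = schema

    def find(self, supplier: str, document_type: str) -> Optional[dict]:
        # Returns None when this supplier and type have not been trained.
        return self._schemas.get(self._key(supplier, document_type))


registry = SchemaRegistry()
registry.save("Supplier A", "Purchase Invoice", {"invoice_number": "top-right"})

print(registry.find("supplier a", "purchase invoice"))
print(registry.find("Supplier B", "Purchase Invoice"))
```

The second lookup returns nothing because Supplier B has not been trained yet, even though the document type is the same. A supplier with several distinct layouts could be handled in the same picture by adding a variant label to the key.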

The aim

DocuTrain exists because the alternative is manual data entry, forever. Every hour a member of your team spends copying numbers from PDFs into a spreadsheet is an hour of work that could be automated. DocuTrain is the one-time investment, made once per supplier and document type, that eliminates that work permanently. Train once. Extract forever.

