Extracting Invoices using Invoice Splitter Tutorial

Overview

When to use this?

A single file can sometimes contain more than one invoice. Whether they come from the same provider or from different ones, checking them by hand in order to enqueue them individually can be a hassle. This is where the Invoice Splitter API comes in.

The Invoice Splitter API takes in a multi-page file and returns, for each individual invoice it detects, the indexes of the pages that belong to it.

From there, you can cut your file into sub-invoices based on these indexes.

Note: this API is not to be confused with the Multi Receipts Detector API, which isolates receipts on the same page.

Initial Setup

The goal of this tutorial is to use the Invoice Splitter API to isolate invoices and then parse them with the Invoice OCR.

To do this, we'll be using the following dummy file:

The file is a simple aggregate of several invoices, made up for this tutorial. For best results, we encourage you to try your own!

Before uploading, be sure that you have taken the following into account:

  • Invoices are clear, unstained and properly unfolded
  • As few unnecessary extra pages as possible (e.g. terms & conditions)
  • Pages from any single invoice are all oriented in the same direction

The Invoice Splitter API is quite robust, but it will not correctly handle other types of documents if they are mixed into the same PDF file.

Subscribing to the Invoice Splitter API

For this tutorial, you'll need a valid, up-to-date API key, as well as valid subscriptions to both the Invoice Splitter API and the Invoice OCR.

If you aren't sure whether your subscription is enabled, go to the API page of the main interface and click on Utilities:

Then click on the Invoice splitter button. That's it: your subscription to the API should now be enabled.

To do the same for the Invoice OCR, simply head to the Document Catalog:

Then, like in the previous step, click on our product of interest (Invoice).

Note: be aware that you need to enable both subscriptions to get this whole tutorial working.

Calling the API

Time to get coding!

For this tutorial, we'll be using the official Mindee client library for Node.js.

Other programming languages are supported; check the list.

Let's create a directory to store our project and install the Mindee client:

mkdir invoice_splitter_tutorial
cd invoice_splitter_tutorial
npm install mindee
touch demo_invoice_splitter.js

To get started with the code, we'll head on over to the Documentation page by using the link on the left of the interface.

From there we'll click on your language (NODEJS), then Select an API key, and finally use the copy code button:

Using your IDE, open the demo_invoice_splitter.js file we created and paste this code. We'll use it as a base.

Make sure the API key is correctly filled in. If it isn't, simply grab one from the Mindee interface and put it into demo_invoice_splitter.js.

Find the line:

const inputSource = mindeeClient.docFromPath("/path/to/the/file.ext");

Replace the placeholder with the path to a real file on your drive. It is a good idea to test this file in the live interface first.

Now run the code:

node demo_invoice_splitter.js

The result should look something like this:

########
Document
########
:Mindee ID: 69eabee4-8f29-4e11-bb24-6a4ed965910a
:Filename: invoice_5p.pdf

Inference
#########
:Product: mindee/invoice_splitter_beta v1.0
:Rotation applied: No

Prediction
==========
:Invoice Page Groups:
  :Page indexes: 0
  :Page indexes: 1, 2, 3
  :Page indexes: 4

Page Predictions
================

Page 0
------
:Invoice Page Groups:

Page 1
------
:Invoice Page Groups:

Page 2
------
:Invoice Page Groups:

Page 3
------
:Invoice Page Groups:

Page 4
------
:Invoice Page Groups:

Image extraction code

The extraction is performed by calling the following method:

const { imageOperations } = require("mindee");

//...

const someResult = imageOperations.extractInvoices(myInputFile, myInferenceResult);

We'll add the method call to the response handling part of the script.
You should have something like this:


const { Client, imageOperations } = require("mindee");

const mindeeClient = new Client();
//... rest of the basic implementation

// Handle the response Promise
apiResponse.then((resp) => {
  const someResult = imageOperations.extractInvoices(myInputFile, resp.document.inference);
  someResult.then((extractionResp) => {
    //some code...
  });
});

This is not very pretty, though.

We can make this a bit more legible and usable by switching to async/await, like this:

const { Client, product, imageOperations } = require("mindee");
const { setTimeout } = require("node:timers/promises");

async function parseInvoices() {
  // fill in your API key or add it as an environment variable
  const mindeeClient = new Client();

  const invoiceSplitterFile = mindeeClient.docFromPath("path/to/your/file.ext");
  const resp = await mindeeClient.parse(product.InvoiceSplitterV1, invoiceSplitterFile);
  // extract the individual invoices from the original file, using the splitter's inference
  const invoices = await imageOperations.extractInvoices(invoiceSplitterFile, resp.document.inference);
}
parseInvoices();
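
The comment in the function above mentions supplying the API key through an environment variable. As a minimal sketch of that idea, the helper below reads the key and fails early with a readable message when it is missing. The variable name MINDEE_API_KEY and the resolveApiKey helper are assumptions made for illustration; they are not taken from this tutorial.

```javascript
// Assumption: the key is stored in an environment variable named
// MINDEE_API_KEY. Adjust the name to whatever your setup uses.
const API_KEY_VARIABLE = "MINDEE_API_KEY";

// Hypothetical helper (not part of the Mindee library): returns the trimmed
// key, or throws a descriptive error if the variable is missing or blank.
function resolveApiKey(env = process.env) {
  const apiKey = (env[API_KEY_VARIABLE] ?? "").trim();
  if (apiKey === "") {
    throw new Error(
      `Set the ${API_KEY_VARIABLE} environment variable before running the script.`
    );
  }
  return apiKey;
}
```

The returned value could then be handed to the client explicitly, for instance with new Client({ apiKey: resolveApiKey() }), so that a missing key is reported before any request is sent.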

Parsing the Extracted Files

Now one last piece is missing: what to do with our invoices? Well, parse them, of course!

The code for this part is mostly up to you, but for this tutorial we'll use a simple loop, placed inside parseInvoices() right after the call to extractInvoices, to parse the documents and then display the results:


for (const invoice of invoices) {
  const respInvoice = await mindeeClient.parse(product.InvoiceV4, invoice.asSource());
  console.log(respInvoice.document.toString());
  await setTimeout(1000); // wait between requests so as not to overload the server
}

This should print something like this:

########
Document
########
:Mindee ID: 53d616ab-b1c3-4dc8-8630-6bf5470b39e0
:Filename: invoice_p_0-0.pdf

Inference
#########
:Product: mindee/invoices v4.3
:Rotation applied: Yes

Prediction
==========
:Locale: fr; fr; EUR;
:Invoice Number: 0042004801351
:Reference Numbers:
:Purchase Date: 2020-02-17
:Due Date: 2020-02-17
:Total Net: 489.97
:Total Amount: 587.95
:Taxes:
  +---------------+--------+----------+---------------+
  | Base          | Code   | Rate (%) | Amount        |
  +===============+========+==========+===============+
  |               |        | 20.00    | 97.98         |
  +---------------+--------+----------+---------------+
:Supplier Payment Details: FR7640254025476501124705368;
:Supplier Name:
:Supplier Company Registrations:
:Supplier Address:
:Customer Name:
:Customer Company Registrations:
:Customer Address:
:Document Type: INVOICE
:Line Items:
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+
  | Description                          | Product code | Quantity | Tax Amount | Tax Rate (%) | Total Amount | Unit Price |
  +======================================+==============+==========+============+==============+==============+============+
  | S)BOIE 5X500 FEUILLES A4             |              |          |            |              | 2.63         |            |
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+
  | RAM 500F DCP BLANC A4 100G           |              |          |            |              | 0.98         |            |
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+
  | PQ 960 ETIQUETTES L METAL            |              |          |            |              | 4.07         |            |
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+
  | CARTOUCHE L NR BROTHER TN247BK       |              |          |            |              | 9.47         |            |
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+
  | PQ20 ETIQ ULTRA RESIS METAXXDC       |              |          |            |              | 4.31         |            |
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+
  | FO2 IMPRIM MULTIFONCT MFC-L3770CDW   |              |          |            |              | 120.00       |            |
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+

Page Predictions
================

Page 0
------
:Locale: fr; fr; EUR;
:Invoice Number: 0042004801351
:Reference Numbers:
:Purchase Date: 2020-02-17
:Due Date: 2020-02-17
:Total Net: 489.97
:Total Amount: 587.95
:Taxes:
  +---------------+--------+----------+---------------+
  | Base          | Code   | Rate (%) | Amount        |
  +===============+========+==========+===============+
  |               |        | 20.00    | 97.98         |
  +---------------+--------+----------+---------------+
:Supplier Payment Details: FR7640254025476501124705368;
:Supplier Name:
:Supplier Company Registrations:
:Supplier Address:
:Customer Name:
:Customer Company Registrations:
:Customer Address:
:Document Type: INVOICE
:Line Items:
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+
  | Description                          | Product code | Quantity | Tax Amount | Tax Rate (%) | Total Amount | Unit Price |
  +======================================+==============+==========+============+==============+==============+============+
  | S)BOIE 5X500 FEUILLES A4             |              |          |            |              | 2.63         |            |
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+
  | RAM 500F DCP BLANC A4 100G           |              |          |            |              | 0.98         |            |
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+
  | PQ 960 ETIQUETTES L METAL            |              |          |            |              | 4.07         |            |
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+
  | CARTOUCHE L NR BROTHER TN247BK       |              |          |            |              | 9.47         |            |
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+
  | PQ20 ETIQ ULTRA RESIS METAXXDC       |              |          |            |              | 4.31         |            |
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+
  | FO2 IMPRIM MULTIFONCT MFC-L3770CDW   |              |          |            |              | 120.00       |            |
  +--------------------------------------+--------------+----------+------------+--------------+--------------+------------+
...

Despite showing a filename, the extraction shouldn't create temporary files unless you explicitly ask it to, as everything is managed through file buffers.

A working implementation of this code can be found in the example/invoiceSplitterTutorial.js file on the Node.js repo.
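
The parsing loop waits one second between requests. If you need the same behavior in several places, it can be factored into a small helper. The sketch below is one possible way to do it: processSequentially is a hypothetical name, not a function of the Mindee library, and the worker is whatever asynchronous call you want to run on each item.

```javascript
const { setTimeout: sleep } = require("node:timers/promises");

// Hypothetical helper: runs `worker` on each item one at a time, pausing
// `delayMs` milliseconds between calls, and returns the results in order.
async function processSequentially(items, worker, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    if (i > 0) {
      await sleep(delayMs); // pause only between requests, not before the first
    }
    results.push(await worker(items[i], i));
  }
  return results;
}
```

In this tutorial, the worker could be (invoice) => mindeeClient.parse(product.InvoiceV4, invoice.asSource()), which would return all parsed responses as a list instead of printing them one by one.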

Custom Implementation

If you are using a non-supported language, or simply want to implement things your own way, here's the gist of how things work under the hood.

But first, if you are not yet familiar with other products, know that a response from the API contains a document field, which in turn contains the following attributes of interest: inference > prediction and inference > pages. pages is a list of objects, each with its own prediction attribute. Think of the inference's prediction as an aggregate of these.

Note: the implementation might differ depending on the language you use, but the structure will always follow that of the raw JSON response, whose schema you can find on this page.

Why are these predictions relevant? Because they contain our invoices' page groups. Each page group contains the list of pages belonging to a single invoice, where each page is simply an index from 0 to n, n being the last page index of your PDF.
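
As a rough illustration, here is how the page groups could be read from a raw response. This is a sketch under assumptions: the snake_case field names invoice_page_groups and page_indexes are inferred from the printed output shown earlier and should be checked against the actual schema, getInvoicePageGroups is a hypothetical helper, and the sample response is made up and trimmed down.

```javascript
// Assumption: `rawResponse` mirrors the raw JSON returned by the API, with
// snake_case field names (`invoice_page_groups`, `page_indexes`).
function getInvoicePageGroups(rawResponse) {
  const groups =
    rawResponse?.document?.inference?.prediction?.invoice_page_groups ?? [];
  // Keep only the page indexes of each group, as a list of lists.
  return groups.map((group) => [...group.page_indexes]);
}

// Made-up, trimmed-down response matching the page groups shown earlier.
const sampleResponse = {
  document: {
    inference: {
      prediction: {
        invoice_page_groups: [
          { page_indexes: [0] },
          { page_indexes: [1, 2, 3] },
          { page_indexes: [4] },
        ],
      },
    },
  },
};

// Logs three groups: [0], [1, 2, 3] and [4].
console.log(getInvoicePageGroups(sampleResponse));
```

Returning an empty list when the fields are missing is a design choice of this sketch; you may prefer to throw instead, so that an unexpected response does not go unnoticed.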

The next step is pretty straightforward: use the indexes of each page group to extract the matching pages from the original file, and then create a new PDF file containing those pages.
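
The cutting step can be sketched independently of any PDF library. In the example below, plain strings stand in for the pages of the original file; in a real implementation, each page would come from whichever PDF library you use, and each resulting group would be written out as a new PDF. splitByPageGroups is a hypothetical helper, and treating out-of-range or repeated indexes as errors is a choice made for this sketch, not a documented rule of the API.

```javascript
// Hypothetical helper: `pages` stands in for the pages of the original PDF
// (here, plain strings), and `pageGroups` is a list of page-index lists.
function splitByPageGroups(pages, pageGroups) {
  const seen = new Set();
  return pageGroups.map((indexes) =>
    indexes.map((index) => {
      if (!Number.isInteger(index) || index < 0 || index >= pages.length) {
        throw new RangeError(
          `Page index ${index} is outside 0..${pages.length - 1}`
        );
      }
      if (seen.has(index)) {
        throw new Error(`Page index ${index} appears in more than one group`);
      }
      seen.add(index);
      return pages[index];
    })
  );
}

const pages = ["page 0", "page 1", "page 2", "page 3", "page 4"];
const subInvoices = splitByPageGroups(pages, [[0], [1, 2, 3], [4]]);

console.log(subInvoices.length); // 3
console.log(subInvoices[1]); // the three pages of the second invoice
```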