Automatically Splitting Multi-page Invoices Using the Mindee Client Libraries

Note: The Node.js library implementation differs from that of our other supported languages. See the dedicated Node.js tutorial instead.

Overview

Just want the full script? Jump to the Full Script section.

The Invoice Splitter Auto-Extraction feature lets you process multi-page invoice files, automatically split them into individual invoices, and extract data from each one. This guide demonstrates how to accomplish this with the Mindee client libraries in Python, PHP, Ruby, C#, and Java.

When to Use This Feature

Use this feature when you have:

  • A single file containing multiple invoices
  • Invoices from the same or different providers in one document
  • The need to process each invoice individually without manual separation

Note: This API is distinct from the Multi Receipts Detector API, which isolates receipts within individual pages.

Prerequisites

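Before you begin, ensure you have a Mindee API key with access to both the Invoice Splitter and Invoice OCR APIs, and the Mindee client library installed for your language.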

Sample File

For this tutorial, we'll use the following sample multi-page invoice file:

Basic Setup

  1. Import the necessary classes from the Mindee library.
  2. Initialize the Mindee client with your API key.
  3. Create an input source from your file path.
from mindee import Client
from mindee.extraction.pdf_extractor import PdfExtractor
from mindee.input import PathInput
from mindee.product import InvoiceSplitterV1, InvoiceV4

mindee_client = Client(api_key="my-api-key")
# mindee_client = Client() # Optionally, set from env.

<?php

use Mindee\Client;
use Mindee\Extraction\PdfExtractor;
use Mindee\Input\PathInput;
use Mindee\Product\InvoiceSplitter\InvoiceSplitterV1;
use Mindee\Product\Invoice\InvoiceV4;

$mindeeClient = new Client("my-api-key-here");
// $mindeeClient = new Client(); // Optionally, use an environment variable.
$inputPath = "path/to/your/file.ext";
# frozen_string_literal: true

require 'mindee'

def invoice_splitter_auto_extraction(file_path)
  mindee_client = Mindee::Client.new(api_key: 'my-api-key')
  input_source = mindee_client.source_from_path(file_path)
  # ...
end
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Mindee;
using Mindee.Extraction;
using Mindee.Input;
using Mindee.Product.InvoiceSplitter;
using Mindee.Product.Invoice;

var apiKey = "my-api-key";
var mindeeClient = new MindeeClient(apiKey);
var myFilePath = "path/to/your/file.ext";
import com.mindee.MindeeClient;
import com.mindee.input.LocalInputSource;
import com.mindee.extraction.ExtractedPDF;
import com.mindee.extraction.PDFExtractor;
import com.mindee.parsing.common.AsyncPredictResponse;
import com.mindee.product.invoice.InvoiceV4;
import com.mindee.product.invoicesplitter.InvoiceSplitterV1;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class AutoInvoiceSplitterExtractionExample {
  private static final String API_KEY = "my-api-key";
  private static final MindeeClient mindeeClient = new MindeeClient(API_KEY);
  
  public static void main(String[] args) throws IOException, InterruptedException {
    String filePath = "/path/to/the/file.ext";
    invoiceSplitterAutoExtraction(filePath);
  }
  // ...
}
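
The commented-out `Client()` line in the Python snippet relies on the API key being available in the environment rather than hard-coded. As a minimal sketch of that idea, the helper below reads the key from an environment variable and fails early with a clear message if it is missing. The function `load_api_key` is hypothetical, and the variable name `MINDEE_API_KEY` is an assumption; check your library's documentation for the name it actually reads.

```python
import os


def load_api_key(env=None, var_name="MINDEE_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is missing.

    `var_name` defaults to an assumed variable name, not a confirmed one.
    """
    # Allow a plain dict to be passed in, which makes the helper easy to test.
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

You could then build the client with `Client(api_key=load_api_key())`, which keeps the key out of source control.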

Processing the Input

Check File Format and Page Count

  1. Check if the file is a PDF.
  2. If it's a PDF, check if it has multiple pages.
def parse_invoice(file_path):
    input_source = PathInput(file_path)

    if input_source.is_pdf() and input_source.count_doc_pages() > 1:
        parse_multi_page(input_source)
    else:
        parse_single_page(input_source)
<?php
  
function parseInvoice(string $filePath, Client $mindeeClient)
{
    $inputSource = new PathInput($filePath);

    if ($inputSource->isPdf() && $inputSource->countDocPages() > 1) {
        parseMultiPage($inputSource, $mindeeClient);
    } else {
        parseSinglePage($inputSource, $mindeeClient);
    }
}
def invoice_splitter_auto_extraction(file_path)
  mindee_client = Mindee::Client.new(api_key: 'my-api-key')
  input_source = mindee_client.source_from_path(file_path)

  if input_source.pdf? && input_source.count_pdf_pages > 1
    parse_multi_page(mindee_client, input_source)
  else
    parse_single_page(mindee_client, input_source)
  end
end
async Task InvoiceSplitterAutoExtraction(string filePath)
{
    var inputSource = new LocalInputSource(filePath);

    if (inputSource.IsPdf() && inputSource.GetPageCount() > 1)
    {
        await ParseMultiPage(inputSource);
    }
    else
    {
        await ParseSinglePage(inputSource);
    }
}
private static void invoiceSplitterAutoExtraction(String filePath) throws IOException, InterruptedException {
  LocalInputSource inputSource = new LocalInputSource(new File(filePath));

  if (inputSource.isPdf() && new PDFExtractor(inputSource).getPageCount() > 1) {
    parseMultiPage(inputSource);
  } else {
    parseSinglePage(inputSource);
  }
}
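
The library helpers shown in this section (`is_pdf()`, `isPdf()`, and so on) perform the PDF check for you. As a rough illustration of what such a check can involve, the following Python sketch tests whether a file begins with the `%PDF-` header. It is not the Mindee implementation: `looks_like_pdf` is a hypothetical helper, and it assumes the header sits at the very start of the file.

```python
from pathlib import Path

# Header bytes found at the start of PDF files.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(file_path):
    """Return True if the file starts with the PDF header bytes."""
    with Path(file_path).open("rb") as handle:
        return handle.read(len(PDF_MAGIC)) == PDF_MAGIC
```

Checking the content rather than the file extension avoids sending a mislabeled image down the multi-page branch.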

Process Multi-Page Documents

  1. Use the Invoice Splitter API to get page groups.
  2. Extract individual invoices using the page groups.
  3. Process each extracted invoice with the Invoice OCR API.
def parse_multi_page(input_source):
    pdf_extractor = PdfExtractor(input_source)
    invoice_splitter_response = mindee_client.enqueue_and_parse(
        InvoiceSplitterV1, input_source, close_file=False
    )
    page_groups = (
        invoice_splitter_response.document.inference.prediction.invoice_page_groups
    )
    extracted_pdfs = pdf_extractor.extract_invoices(page_groups, strict=False)

    for extracted_pdf in extracted_pdfs:
        # Optional: Save the files locally
        # extracted_pdf.write_to_file("output/path")

        invoice_result = mindee_client.parse(InvoiceV4, extracted_pdf.as_input_source())
        print(invoice_result.document)

<?php

function parseMultiPage(PathInput $inputSource, Client $mindeeClient)
{
    $pdfExtractor = new PdfExtractor($inputSource);
    $invoiceSplitterResponse = $mindeeClient->enqueueAndParse(
        InvoiceSplitterV1::class,
        $inputSource
    );
    $pageGroups = $invoiceSplitterResponse->document->inference->prediction->invoicePageGroups;
    $extractedPdfs = $pdfExtractor->extractInvoices($pageGroups);

    foreach ($extractedPdfs as $extractedPdf) {
        // Optional: Save the files locally
        // $extractedPdf->writeToFile("output/path");

        $invoiceResult = $mindeeClient->parse(
            InvoiceV4::class,
            $extractedPdf->asInputSource()
        );
        echo $invoiceResult->document;
    }
}
def parse_multi_page(mindee_client, input_source)
  pdf_extractor = Mindee::Extraction::PdfExtractor::PdfExtractor.new(input_source)
  invoice_splitter_response = mindee_client.enqueue_and_parse(
    input_source,
    Mindee::Product::InvoiceSplitter::InvoiceSplitterV1,
    close_file: false
  )
  page_groups = invoice_splitter_response.document.inference.prediction.invoice_page_groups
  extracted_pdfs = pdf_extractor.extract_invoices(page_groups, strict: false)

  extracted_pdfs.each do |extracted_pdf|
    # Optional: Save the files locally
    # extracted_pdf.write_to_file("output/path")

    invoice_result = mindee_client.parse(
      extracted_pdf.as_input_source,
      Mindee::Product::Invoice::InvoiceV4,
      close_file: false
    )
    puts invoice_result.document
  end
end
async Task ParseMultiPage(LocalInputSource inputSource)
{
    PdfExtractor extractor = new PdfExtractor(inputSource);
    var invoiceSplitterResponse = await mindeeClient.EnqueueAndParseAsync<InvoiceSplitterV1>(inputSource);
    List<ExtractedPdf> extractedPdfs = extractor.ExtractInvoices(
        invoiceSplitterResponse.Document.Inference.Prediction.PageGroups,
        false
    );

    foreach (var extractedPdf in extractedPdfs)
    {
        // Optional: Save the files locally
        // extractedPdf.WriteToFile("output/path");

        var invoiceResult = await mindeeClient.ParseAsync<InvoiceV4>(extractedPdf.AsInputSource());
        Console.WriteLine(invoiceResult.Document);
    }
}
private static void parseMultiPage(LocalInputSource inputSource) throws IOException, InterruptedException {
  PDFExtractor extractor = new PDFExtractor(inputSource);
  AsyncPredictResponse<InvoiceSplitterV1> invoiceSplitterResponse =
    mindeeClient.enqueueAndParse(InvoiceSplitterV1.class, inputSource);

  List<ExtractedPDF> extractedPdfs = extractor.extractInvoices(
    invoiceSplitterResponse.getDocumentObj().getInference().getPrediction().getInvoicePageGroups(),
    false
  );

  for (ExtractedPDF extractedPdf : extractedPdfs) {
    // Optional: Save the files locally
    // extractedPdf.writeToFile("output/path");

    AsyncPredictResponse<InvoiceV4> invoiceResult =
      mindeeClient.enqueueAndParse(InvoiceV4.class, extractedPdf.asInputSource());
    System.out.println(invoiceResult.getDocumentObj().toString());
  }
}
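
To make the role of the page groups concrete, here is a conceptual Python sketch of the splitting step. It assumes each page group boils down to a list of zero-based page indexes and uses plain strings in place of real PDF pages. `split_pages` is a hypothetical helper for illustration only; it does not describe how `PdfExtractor` works internally.

```python
def split_pages(pages, page_groups):
    """Group `pages` into one list per invoice, following `page_groups`.

    `page_groups` is a list of lists of zero-based page indexes (assumed shape).
    """
    invoices = []
    for group in page_groups:
        # Reject indexes that do not exist in the source document.
        for index in group:
            if not 0 <= index < len(pages):
                raise IndexError(f"page index {index} is out of range")
        invoices.append([pages[index] for index in group])
    return invoices


pages = ["p0", "p1", "p2", "p3"]
groups = [[0], [1, 2], [3]]
print(split_pages(pages, groups))
# prints [['p0'], ['p1', 'p2'], ['p3']]
```

Each inner list corresponds to one extracted invoice, which is then sent to the Invoice OCR API on its own.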

Process Single-Page Documents

For single-page documents or non-PDFs, process the document directly with the Invoice OCR API.

def parse_single_page(input_source):
    invoice_result = mindee_client.parse(InvoiceV4, input_source)
    print(invoice_result.document)

<?php

function parseSinglePage(PathInput $inputSource, Client $mindeeClient)
{
    $invoiceResult = $mindeeClient->parse(InvoiceV4::class, $inputSource);
    echo $invoiceResult->document;
}
def parse_single_page(mindee_client, input_source)
  invoice_result = mindee_client.parse(
    input_source,
    Mindee::Product::Invoice::InvoiceV4
  )
  puts invoice_result.document
end
async Task ParseSinglePage(LocalInputSource inputSource)
{
    var invoiceResult = await mindeeClient.ParseAsync<InvoiceV4>(inputSource);
    Console.WriteLine(invoiceResult.Document);
}
private static void parseSinglePage(LocalInputSource inputSource) throws IOException, InterruptedException {
  AsyncPredictResponse<InvoiceV4> invoiceResult = mindeeClient.enqueueAndParse(InvoiceV4.class, inputSource);
  System.out.println(invoiceResult.getDocumentObj().toString());
}

Example Output

After processing, you'll receive detailed information about each invoice. Here's a sample output:

######## Document ########
:Mindee ID: 409dc446-855a-43eb-9630-9d71dd72c5ba
:Filename: default_sample_001-001.pdf
Inference #########
:Product: mindee/invoices v4.6
:Rotation applied: Yes
Prediction ==========
:Locale: fr; fr; EUR;
:Invoice Number: 0042004801351
:Reference Numbers:
:Purchase Date: 2020-02-17
:Due Date:
:Total Net: 489.97
:Total Amount: 587.95
:Total Tax: 97.98
:Taxes:
+---------------+--------+----------+---------------+
| Base          | Code   | Rate (%) | Amount        |
+===============+========+==========+===============+
| 489.97        |        | 20.00    | 97.98         |
+---------------+--------+----------+---------------+
:Supplier Payment Details: FR7640254025476501124705368;
:Supplier Name:
:Supplier Company Registrations: Type: CF, Value: 72544370017
:Supplier Address:
:Supplier Phone Number: 0505444490
:Supplier Website:
:Supplier Email:
:Customer Name:
:Customer Company Registrations:
:Customer Address:
:Customer ID:
:Shipping Address:
:Billing Address:
:Document Type: INVOICE
:Line Items:
...
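
A quick way to sanity-check an extracted invoice is to confirm that the net total plus the tax total matches the total amount. The sketch below does this with the three values from the sample output. `totals_are_consistent` is a hypothetical helper, and the default tolerance of one cent is an arbitrary choice rather than a Mindee recommendation.

```python
from decimal import Decimal


def totals_are_consistent(total_net, total_tax, total_amount, tolerance="0.01"):
    """Return True if net + tax matches the total within `tolerance`."""
    # Decimal avoids the rounding surprises of binary floats with currency.
    difference = Decimal(total_net) + Decimal(total_tax) - Decimal(total_amount)
    return abs(difference) <= Decimal(tolerance)


# Values copied from the sample output: 489.97 + 97.98 = 587.95
print(totals_are_consistent("489.97", "97.98", "587.95"))  # prints True
```

A failed check does not necessarily mean the extraction is wrong, but it is a reasonable trigger for manual review.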

Full Script

from mindee import Client
from mindee.extraction.pdf_extractor import PdfExtractor
from mindee.input import PathInput
from mindee.product import InvoiceSplitterV1, InvoiceV4

mindee_client = Client(api_key="my-api-key")
# mindee_client = Client()  # Optionally, set from env.


def parse_invoice(file_path):
    input_source = PathInput(file_path)

    if input_source.is_pdf() and input_source.count_doc_pages() > 1:
        parse_multi_page(input_source)
    else:
        parse_single_page(input_source)


def parse_single_page(input_source):
    invoice_result = mindee_client.parse(InvoiceV4, input_source)
    print(invoice_result.document)


def parse_multi_page(input_source):
    pdf_extractor = PdfExtractor(input_source)
    invoice_splitter_response = mindee_client.enqueue_and_parse(
        InvoiceSplitterV1, input_source, close_file=False
    )
    page_groups = (
        invoice_splitter_response.document.inference.prediction.invoice_page_groups
    )
    extracted_pdfs = pdf_extractor.extract_invoices(page_groups, strict=False)

    for extracted_pdf in extracted_pdfs:
        # Optional: Save the files locally
        # extracted_pdf.write_to_file("output/path")

        invoice_result = mindee_client.parse(InvoiceV4, extracted_pdf.as_input_source())
        print(invoice_result.document)


if __name__ == "__main__":
    parse_invoice("path/to/my/file.ext")

<?php

use Mindee\Client;
use Mindee\Extraction\PdfExtractor;
use Mindee\Input\PathInput;
use Mindee\Product\Invoice\InvoiceV4;
use Mindee\Product\InvoiceSplitter\InvoiceSplitterV1;

function parseInvoice(string $filePath, Client $mindeeClient)
{
    $inputSource = new PathInput($filePath);

    if ($inputSource->isPdf() && $inputSource->countDocPages() > 1) {
        parseMultiPage($inputSource, $mindeeClient);
    } else {
        parseSinglePage($inputSource, $mindeeClient);
    }
}

function parseSinglePage(PathInput $inputSource, Client $mindeeClient)
{
    $invoiceResult = $mindeeClient->parse(InvoiceV4::class, $inputSource);
    echo $invoiceResult->document;
}

function parseMultiPage(PathInput $inputSource, Client $mindeeClient)
{
    $pdfExtractor = new PdfExtractor($inputSource);
    $invoiceSplitterResponse = $mindeeClient->enqueueAndParse(
        InvoiceSplitterV1::class,
        $inputSource
    );
    $pageGroups = $invoiceSplitterResponse->document->inference->prediction->invoicePageGroups;
    $extractedPdfs = $pdfExtractor->extractInvoices($pageGroups);

    foreach ($extractedPdfs as $extractedPdf) {
        // Optional: Save the files locally
        // $extractedPdf->writeToFile("output/path");

        $invoiceResult = $mindeeClient->parse(
            InvoiceV4::class,
            $extractedPdf->asInputSource()
        );
        echo $invoiceResult->document;
    }
}

$mindeeClient = new Client("my-api-key-here");
// $mindeeClient = new Client(); // Optionally, use an environment variable.
$inputPath = "path/to/your/file.ext";
parseInvoice($inputPath, $mindeeClient);

# frozen_string_literal: true

require 'mindee'

def invoice_splitter_auto_extraction(file_path)
  mindee_client = Mindee::Client.new(api_key: 'my-api-key')
  input_source = mindee_client.source_from_path(file_path)

  if input_source.pdf? && input_source.count_pdf_pages > 1
    parse_multi_page(mindee_client, input_source)
  else
    parse_single_page(mindee_client, input_source)
  end
end

def parse_single_page(mindee_client, input_source)
  invoice_result = mindee_client.parse(
    input_source,
    Mindee::Product::Invoice::InvoiceV4
  )
  puts invoice_result.document
end

def parse_multi_page(mindee_client, input_source)
  pdf_extractor = Mindee::Extraction::PdfExtractor::PdfExtractor.new(input_source)
  invoice_splitter_response = mindee_client.enqueue_and_parse(
    input_source,
    Mindee::Product::InvoiceSplitter::InvoiceSplitterV1,
    close_file: false
  )
  page_groups = invoice_splitter_response.document.inference.prediction.invoice_page_groups
  extracted_pdfs = pdf_extractor.extract_invoices(page_groups, strict: false)

  extracted_pdfs.each do |extracted_pdf|
    # Optional: Save the files locally
    # extracted_pdf.write_to_file("output/path")

    invoice_result = mindee_client.parse(
      extracted_pdf.as_input_source,
      Mindee::Product::Invoice::InvoiceV4,
      close_file: false
    )
    puts invoice_result.document
  end
end

my_file_path = '/path/to/the/file.ext'
invoice_splitter_auto_extraction(my_file_path)

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Mindee;
using Mindee.Extraction;
using Mindee.Input;
using Mindee.Product.InvoiceSplitter;
using Mindee.Product.Invoice;

var apiKey = "my-api-key";
var mindeeClient = new MindeeClient(apiKey);
var myFilePath = "path/to/your/file.ext";

await InvoiceSplitterAutoExtraction(myFilePath);

async Task InvoiceSplitterAutoExtraction(string filePath)
{
    var inputSource = new LocalInputSource(filePath);

    if (inputSource.IsPdf() && inputSource.GetPageCount() > 1)
    {
        await ParseMultiPage(inputSource);
    }
    else
    {
        await ParseSinglePage(inputSource);
    }
}

async Task ParseSinglePage(LocalInputSource inputSource)
{
    var invoiceResult = await mindeeClient.ParseAsync<InvoiceV4>(inputSource);
    Console.WriteLine(invoiceResult.Document);
}

async Task ParseMultiPage(LocalInputSource inputSource)
{
    PdfExtractor extractor = new PdfExtractor(inputSource);
    var invoiceSplitterResponse = await mindeeClient.EnqueueAndParseAsync<InvoiceSplitterV1>(inputSource);
    List<ExtractedPdf> extractedPdfs = extractor.ExtractInvoices(
        invoiceSplitterResponse.Document.Inference.Prediction.PageGroups,
        false
    );

    foreach (var extractedPdf in extractedPdfs)
    {
        // Optional: Save the files locally
        // extractedPdf.WriteToFile("output/path");

        var invoiceResult = await mindeeClient.ParseAsync<InvoiceV4>(extractedPdf.AsInputSource());
        Console.WriteLine(invoiceResult.Document);
    }
}

import com.mindee.MindeeClient;
import com.mindee.input.LocalInputSource;
import com.mindee.extraction.ExtractedPDF;
import com.mindee.extraction.PDFExtractor;
import com.mindee.parsing.common.AsyncPredictResponse;
import com.mindee.product.invoice.InvoiceV4;
import com.mindee.product.invoicesplitter.InvoiceSplitterV1;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class AutoInvoiceSplitterExtractionExample {
  private static final String API_KEY = "my-api-key";
  private static final MindeeClient mindeeClient = new MindeeClient(API_KEY);

  public static void main(String[] args) throws IOException, InterruptedException {
    String filePath = "/path/to/the/file.ext";
    invoiceSplitterAutoExtraction(filePath);
  }

  private static void invoiceSplitterAutoExtraction(String filePath) throws IOException, InterruptedException {
    LocalInputSource inputSource = new LocalInputSource(new File(filePath));

    if (inputSource.isPdf() && new PDFExtractor(inputSource).getPageCount() > 1) {
      parseMultiPage(inputSource);
    } else {
      parseSinglePage(inputSource);
    }
  }

  private static void parseSinglePage(LocalInputSource inputSource) throws IOException, InterruptedException {
    AsyncPredictResponse<InvoiceV4> invoiceResult = mindeeClient.enqueueAndParse(InvoiceV4.class, inputSource);
    System.out.println(invoiceResult.getDocumentObj().toString());
  }

  private static void parseMultiPage(LocalInputSource inputSource) throws IOException, InterruptedException {
    PDFExtractor extractor = new PDFExtractor(inputSource);
    AsyncPredictResponse<InvoiceSplitterV1> invoiceSplitterResponse =
      mindeeClient.enqueueAndParse(InvoiceSplitterV1.class, inputSource);

    List<ExtractedPDF> extractedPdfs = extractor.extractInvoices(
      invoiceSplitterResponse.getDocumentObj().getInference().getPrediction().getInvoicePageGroups(),
      false
    );

    for (ExtractedPDF extractedPdf : extractedPdfs) {
      // Optional: Save the files locally
      // extractedPdf.writeToFile("output/path");

      AsyncPredictResponse<InvoiceV4> invoiceResult =
        mindeeClient.enqueueAndParse(InvoiceV4.class, extractedPdf.asInputSource());
      System.out.println(invoiceResult.getDocumentObj().toString());
    }
  }
}

Best Practices

  • Handle potential errors and exceptions in your code.
  • Implement retry logic for API calls to handle temporary network issues.
  • Store extracted data securely and in compliance with relevant data protection regulations.
  • When uploading files, ensure that:
    • Invoices are clear, unstained, and properly unfolded
    • There are minimal extra pages (e.g., terms & conditions)
    • Pages from any single invoice are all oriented in the same direction
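
As one possible shape for the retry advice, the sketch below wraps any zero-argument callable and retries it with exponential backoff. `call_with_retries` is a hypothetical helper, not part of the Mindee library. The default `retry_on` exceptions are Python built-ins used as placeholders, since this guide does not specify which exceptions the client raises for transient failures.

```python
import time


def call_with_retries(
    func,
    attempts=3,
    base_delay=1.0,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Call `func()` and retry on the listed exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** (attempt - 1))
```

For example, the splitter call from this guide could be wrapped as `call_with_retries(lambda: mindee_client.enqueue_and_parse(InvoiceSplitterV1, input_source, close_file=False))`. Retrying only on errors you know to be transient avoids repeating calls that will never succeed, such as those made with an invalid API key.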

Troubleshooting

If you encounter issues:

  1. Verify your API key and subscription status for both the Invoice Splitter and Invoice OCR APIs.
  2. Check the input file format and ensure it's supported.
  3. Review the API response for any error messages.
  4. Consult the Mindee API documentation for more detailed information.
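
Some of these checks can be run locally before any API call is made. The sketch below reports basic problems with an input file: whether it exists, whether it is empty, and whether its extension is in a list you supply. `preflight_check` is a hypothetical helper, and the allowed extensions are left to the caller because this guide does not list the supported formats.

```python
from pathlib import Path


def preflight_check(file_path, allowed_extensions):
    """Return a list of problems found with the file; an empty list means none."""
    path = Path(file_path)
    problems = []
    if not path.is_file():
        # Nothing else can be checked if the file is not there.
        problems.append(f"{path} does not exist or is not a regular file")
        return problems
    if path.stat().st_size == 0:
        problems.append(f"{path} is empty")
    if path.suffix.lower() not in allowed_extensions:
        problems.append(f"extension {path.suffix!r} is not in the allowed list")
    return problems
```

Calling `preflight_check("path/to/my/file.ext", {".pdf"})` with a placeholder extension set returns human-readable messages you can log before deciding whether to upload.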