Automatically Splitting Multi-page Invoices Using the Mindee Client Libraries
The Node.js library implementation differs from those of our other supported languages; see the dedicated Node.js tutorial instead.
Overview
Just want the full script? Jump to the Full Script section.
The Invoice Splitter Auto-Extraction feature lets you process multi-page invoice files, automatically split them into individual invoices, and extract data from each one. This guide demonstrates how to do this with the Mindee client library in several programming languages.
When to Use This Feature
Use this feature when you have:
- A single file containing multiple invoices
- Invoices from the same or different providers in one document
- The need to process each invoice individually without manual separation
Note: This API is distinct from the Multi Receipts Detector API, which isolates receipts within individual pages.
Prerequisites
Before you begin, ensure you have:
- A valid and up-to-date Mindee API key
- An active subscription to the Invoice Splitter API
- An active subscription to the Invoice OCR API
- The Mindee client library installed for your programming language
Sample File
For this tutorial, we'll use the following sample multi-page invoice file:
Basic Setup
- Import the necessary classes from the Mindee library.
- Initialize the Mindee client with your API key.
- Create an input source from your file path.
import os
from mindee import Client
from mindee.extraction.pdf_extractor import PdfExtractor
from mindee.input import PathInput
from mindee.product import InvoiceSplitterV1, InvoiceV4
mindee_client = Client(api_key="my-api-key")
# mindee_client = Client() # Optionally, set from env.
<?php
use Mindee\Client;
use Mindee\Extraction\PdfExtractor;
use Mindee\Input\PathInput;
use Mindee\Product\InvoiceSplitter\InvoiceSplitterV1;
use Mindee\Product\Invoice\InvoiceV4;
$mindeeClient = new Client("my-api-key-here");
// $mindeeClient = new Client(); // Optionally, use an environment variable.
$inputPath = "path/to/your/file.ext";
# frozen_string_literal: true
require 'mindee'
def invoice_splitter_auto_extraction(file_path)
  mindee_client = Mindee::Client.new(api_key: 'my-api-key')
  input_source = mindee_client.source_from_path(file_path)
  # ...
end
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Mindee;
using Mindee.Extraction;
using Mindee.Input;
using Mindee.Product.InvoiceSplitter;
using Mindee.Product.Invoice;
var apiKey = "my-api-key";
var mindeeClient = new MindeeClient(apiKey);
var myFilePath = "path/to/your/file.ext";
import com.mindee.MindeeClient;
import com.mindee.input.LocalInputSource;
import com.mindee.extraction.ExtractedPDF;
import com.mindee.extraction.PDFExtractor;
import com.mindee.parsing.common.AsyncPredictResponse;
import com.mindee.product.invoice.InvoiceV4;
import com.mindee.product.invoicesplitter.InvoiceSplitterV1;
import java.io.File;
import java.io.IOException;
import java.util.List;
public class AutoInvoiceSplitterExtractionExample {
  private static final String API_KEY = "my-api-key";
  private static final MindeeClient mindeeClient = new MindeeClient(API_KEY);

  public static void main(String[] args) throws IOException, InterruptedException {
    String filePath = "/path/to/the/file.ext";
    invoiceSplitterAutoExtraction(filePath);
  }
  // ...
}
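The commented-out `Client()` variants in the Python and PHP snippets above rely on the key being available in the environment. A minimal sketch of resolving the key explicitly before constructing the client follows. It assumes the variable is named `MINDEE_API_KEY` (check the documentation for your library version), and `resolve_api_key` is a hypothetical helper, not part of the Mindee library.

```python
import os


def resolve_api_key(explicit_key=None, env_var="MINDEE_API_KEY"):
    """Return the API key to use, preferring an explicit value over the environment.

    Hypothetical helper; the environment variable name is an assumption.
    """
    key = (explicit_key or os.environ.get(env_var, "")).strip()
    if not key:
        raise ValueError(f"No API key provided and {env_var} is not set")
    return key
```

With this in place, the client could be created as `Client(api_key=resolve_api_key())`, which fails early with a clear message instead of at the first API call.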
Processing the Input
Check File Format and Page Count
- Check if the file is a PDF.
- If it's a PDF, check if it has multiple pages.
def parse_invoice(file_path):
    input_source = PathInput(file_path)
    if input_source.is_pdf() and input_source.count_doc_pages() > 1:
        parse_multi_page(input_source)
    else:
        parse_single_page(input_source)
<?php
function parseInvoice(string $filePath, Client $mindeeClient)
{
    $inputSource = new PathInput($filePath);
    if ($inputSource->isPdf() && $inputSource->countDocPages() > 1) {
        parseMultiPage($inputSource, $mindeeClient);
    } else {
        parseSinglePage($inputSource, $mindeeClient);
    }
}
def invoice_splitter_auto_extraction(file_path)
  mindee_client = Mindee::Client.new(api_key: 'my-api-key')
  input_source = mindee_client.source_from_path(file_path)
  if input_source.pdf? && input_source.count_pdf_pages > 1
    parse_multi_page(mindee_client, input_source)
  else
    parse_single_page(mindee_client, input_source)
  end
end
async Task InvoiceSplitterAutoExtraction(string filePath)
{
    var inputSource = new LocalInputSource(filePath);
    if (inputSource.IsPdf() && inputSource.GetPageCount() > 1)
    {
        await ParseMultiPage(inputSource);
    }
    else
    {
        await ParseSinglePage(inputSource);
    }
}
private static void invoiceSplitterAutoExtraction(String filePath) throws IOException, InterruptedException {
  LocalInputSource inputSource = new LocalInputSource(new File(filePath));
  if (inputSource.isPdf() && new PDFExtractor(inputSource).getPageCount() > 1) {
    parseMultiPage(inputSource);
  } else {
    parseSinglePage(inputSource);
  }
}
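The library helpers above hide how the format check works. As a rough illustration only, and not the Mindee implementation, a file can usually be recognized as a PDF by the `%PDF-` signature near its start. `looks_like_pdf` is a hypothetical helper, and scanning the first 1024 bytes is an assumption that tolerates files with a few leading junk bytes.

```python
def looks_like_pdf(file_path, scan_bytes=1024):
    """Return True if the PDF signature appears near the start of the file."""
    with open(file_path, "rb") as handle:
        return b"%PDF-" in handle.read(scan_bytes)
```

A check like this only inspects the header; it does not validate the rest of the file or count pages, which is why the snippets above delegate both tasks to the client library.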
Process Multi-Page Documents
- Use the Invoice Splitter API to get page groups.
- Extract individual invoices using the page groups.
- Process each extracted invoice with the Invoice OCR API.
def parse_multi_page(input_source):
    pdf_extractor = PdfExtractor(input_source)
    invoice_splitter_response = mindee_client.enqueue_and_parse(
        InvoiceSplitterV1, input_source, close_file=False
    )
    page_groups = (
        invoice_splitter_response.document.inference.prediction.invoice_page_groups
    )
    extracted_pdfs = pdf_extractor.extract_invoices(page_groups, strict=False)
    for extracted_pdf in extracted_pdfs:
        # Optional: Save the files locally
        # extracted_pdf.write_to_file("output/path")
        invoice_result = mindee_client.parse(InvoiceV4, extracted_pdf.as_input_source())
        print(invoice_result.document)
<?php
function parseMultiPage(PathInput $inputSource, Client $mindeeClient)
{
    // The client is passed in as a parameter, so no `global` declaration is needed.
    $pdfExtractor = new PdfExtractor($inputSource);
    $invoiceSplitterResponse = $mindeeClient->enqueueAndParse(
        InvoiceSplitterV1::class,
        $inputSource
    );
    $pageGroups = $invoiceSplitterResponse->document->inference->prediction->invoicePageGroups;
    $extractedPdfs = $pdfExtractor->extractInvoices($pageGroups);
    foreach ($extractedPdfs as $extractedPdf) {
        // Optional: Save the files locally
        // $extractedPdf->writeToFile("output/path");
        $invoiceResult = $mindeeClient->parse(
            InvoiceV4::class,
            $extractedPdf->asInputSource()
        );
        echo $invoiceResult->document;
    }
}
def parse_multi_page(mindee_client, input_source)
  pdf_extractor = Mindee::Extraction::PdfExtractor::PdfExtractor.new(input_source)
  invoice_splitter_response = mindee_client.enqueue_and_parse(
    input_source,
    Mindee::Product::InvoiceSplitter::InvoiceSplitterV1,
    close_file: false
  )
  page_groups = invoice_splitter_response.document.inference.prediction.invoice_page_groups
  extracted_pdfs = pdf_extractor.extract_invoices(page_groups, strict: false)
  extracted_pdfs.each do |extracted_pdf|
    # Optional: Save the files locally
    # extracted_pdf.write_to_file("output/path")
    invoice_result = mindee_client.parse(
      extracted_pdf.as_input_source,
      Mindee::Product::Invoice::InvoiceV4,
      close_file: false
    )
    puts invoice_result.document
  end
end
async Task ParseMultiPage(LocalInputSource inputSource)
{
    PdfExtractor extractor = new PdfExtractor(inputSource);
    var invoiceSplitterResponse = await mindeeClient.EnqueueAndParseAsync<InvoiceSplitterV1>(inputSource);
    List<ExtractedPdf> extractedPdfs = extractor.ExtractInvoices(
        invoiceSplitterResponse.Document.Inference.Prediction.PageGroups,
        false
    );
    foreach (var extractedPdf in extractedPdfs)
    {
        // Optional: Save the files locally
        // extractedPdf.WriteToFile("output/path");
        var invoiceResult = await mindeeClient.ParseAsync<InvoiceV4>(extractedPdf.AsInputSource());
        Console.WriteLine(invoiceResult.Document);
    }
}
private static void parseMultiPage(LocalInputSource inputSource) throws IOException, InterruptedException {
  PDFExtractor extractor = new PDFExtractor(inputSource);
  AsyncPredictResponse<InvoiceSplitterV1> invoiceSplitterResponse =
      mindeeClient.enqueueAndParse(InvoiceSplitterV1.class, inputSource);
  List<ExtractedPDF> extractedPdfs = extractor.extractInvoices(
      invoiceSplitterResponse.getDocumentObj().getInference().getPrediction().getInvoicePageGroups(),
      false
  );
  for (ExtractedPDF extractedPdf : extractedPdfs) {
    // Optional: Save the files locally
    // extractedPdf.writeToFile("output/path");
    AsyncPredictResponse<InvoiceV4> invoiceResult =
        mindeeClient.enqueueAndParse(InvoiceV4.class, extractedPdf.asInputSource());
    System.out.println(invoiceResult.getDocumentObj().toString());
  }
}
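To make the role of page groups concrete: each group returned by the Invoice Splitter identifies the pages that belong to one invoice, and the extractor builds one PDF per group. The sketch below uses plain Python lists in place of the library's objects to show that mapping. `split_pages` is a hypothetical helper, and zero-based page indexes are an assumption made for the illustration.

```python
def split_pages(pages, page_groups):
    """Group page contents by invoice.

    pages: list of page objects (any values), indexed from 0.
    page_groups: list of lists of zero-based page indexes, one list per invoice.
    """
    invoices = []
    for group in page_groups:
        for index in group:
            # Reject indexes that point outside the document.
            if not 0 <= index < len(pages):
                raise IndexError(f"Page index {index} is outside the document")
        invoices.append([pages[index] for index in group])
    return invoices


pages = ["p0", "p1", "p2", "p3", "p4"]
groups = [[0], [1, 2], [3, 4]]
print(split_pages(pages, groups))
# [['p0'], ['p1', 'p2'], ['p3', 'p4']]
```

Here a five-page file yields three invoices: one single-page invoice and two two-page invoices, each of which would then be sent to the Invoice OCR API on its own.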
Process Single-Page Documents
For single-page documents or non-PDFs, process the document directly with the Invoice OCR API.
def parse_single_page(input_source):
    invoice_result = mindee_client.parse(InvoiceV4, input_source)
    print(invoice_result.document)
<?php
function parseSinglePage(PathInput $inputSource, Client $mindeeClient)
{
    $invoiceResult = $mindeeClient->parse(InvoiceV4::class, $inputSource);
    echo $invoiceResult->document;
}
def parse_single_page(mindee_client, input_source)
  invoice_result = mindee_client.parse(
    input_source,
    Mindee::Product::Invoice::InvoiceV4
  )
  puts invoice_result.document
end
async Task ParseSinglePage(LocalInputSource inputSource)
{
    var invoiceResult = await mindeeClient.ParseAsync<InvoiceV4>(inputSource);
    Console.WriteLine(invoiceResult.Document);
}
private static void parseSinglePage(LocalInputSource inputSource) throws IOException, InterruptedException {
  AsyncPredictResponse<InvoiceV4> invoiceResult = mindeeClient.enqueueAndParse(InvoiceV4.class, inputSource);
  System.out.println(invoiceResult.getDocumentObj().toString());
}
Example Output
After processing, you'll receive detailed information about each invoice. Here's a sample output:
######## Document ########
:Mindee ID: 409dc446-855a-43eb-9630-9d71dd72c5ba
:Filename: default_sample_001-001.pdf
Inference #########
:Product: mindee/invoices v4.6
:Rotation applied: Yes
Prediction ==========
:Locale: fr; fr; EUR;
:Invoice Number: 0042004801351
:Reference Numbers:
:Purchase Date: 2020-02-17
:Due Date:
:Total Net: 489.97
:Total Amount: 587.95
:Total Tax: 97.98
:Taxes:
+---------------+--------+----------+---------------+
| Base | Code | Rate (%) | Amount |
+===============+========+==========+===============+
| 489.97 | | 20.00 | 97.98 |
+---------------+--------+----------+---------------+
:Supplier Payment Details: FR7640254025476501124705368;
:Supplier Name:
:Supplier Company Registrations: Type: CF, Value: 72544370017
:Supplier Address:
:Supplier Phone Number: 0505444490
:Supplier Website:
:Supplier Email:
:Customer Name:
:Customer Company Registrations:
:Customer Address:
:Customer ID:
:Shipping Address:
:Billing Address:
:Document Type: INVOICE
:Line Items:
...
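Extracted amounts can be sanity-checked before they are stored. The sketch below applies one such check to the sample values above: net plus tax should equal the total. `totals_are_consistent` is a hypothetical helper rather than a library function, and the one-cent tolerance is an assumption.

```python
from decimal import Decimal


def totals_are_consistent(total_net, total_tax, total_amount, tolerance="0.01"):
    """Check that net + tax matches the total within a small tolerance."""
    net = Decimal(str(total_net))
    tax = Decimal(str(total_tax))
    total = Decimal(str(total_amount))
    return abs(net + tax - total) <= Decimal(tolerance)


print(totals_are_consistent("489.97", "97.98", "587.95"))  # True
```

`Decimal` is used instead of floats so that the comparison is not affected by binary rounding of currency values.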
Full Script
from mindee import Client
from mindee.extraction.pdf_extractor import PdfExtractor
from mindee.input import PathInput
from mindee.product import InvoiceSplitterV1, InvoiceV4
mindee_client = Client(api_key="my-api-key")
# mindee_client = Client() # Optionally, set from env.


def parse_invoice(file_path):
    input_source = PathInput(file_path)
    if input_source.is_pdf() and input_source.count_doc_pages() > 1:
        parse_multi_page(input_source)
    else:
        parse_single_page(input_source)


def parse_single_page(input_source):
    invoice_result = mindee_client.parse(InvoiceV4, input_source)
    print(invoice_result.document)


def parse_multi_page(input_source):
    pdf_extractor = PdfExtractor(input_source)
    invoice_splitter_response = mindee_client.enqueue_and_parse(
        InvoiceSplitterV1, input_source, close_file=False
    )
    page_groups = (
        invoice_splitter_response.document.inference.prediction.invoice_page_groups
    )
    extracted_pdfs = pdf_extractor.extract_invoices(page_groups, strict=False)
    for extracted_pdf in extracted_pdfs:
        # Optional: Save the files locally
        # extracted_pdf.write_to_file("output/path")
        invoice_result = mindee_client.parse(InvoiceV4, extracted_pdf.as_input_source())
        print(invoice_result.document)


if __name__ == "__main__":
    parse_invoice("path/to/my/file.ext")
<?php
use Mindee\Client;
use Mindee\Extraction\PdfExtractor;
use Mindee\Input\PathInput;
use Mindee\Product\Invoice\InvoiceV4;
use Mindee\Product\InvoiceSplitter\InvoiceSplitterV1;

function parseInvoice(string $filePath, Client $mindeeClient)
{
    $inputSource = new PathInput($filePath);
    if ($inputSource->isPdf() && $inputSource->countDocPages() > 1) {
        parseMultiPage($inputSource, $mindeeClient);
    } else {
        parseSinglePage($inputSource, $mindeeClient);
    }
}

function parseSinglePage(PathInput $inputSource, Client $mindeeClient)
{
    $invoiceResult = $mindeeClient->parse(InvoiceV4::class, $inputSource);
    echo $invoiceResult->document;
}

function parseMultiPage(PathInput $inputSource, Client $mindeeClient)
{
    // The client is passed in as a parameter, so no `global` declaration is needed.
    $pdfExtractor = new PdfExtractor($inputSource);
    $invoiceSplitterResponse = $mindeeClient->enqueueAndParse(
        InvoiceSplitterV1::class,
        $inputSource
    );
    $pageGroups = $invoiceSplitterResponse->document->inference->prediction->invoicePageGroups;
    $extractedPdfs = $pdfExtractor->extractInvoices($pageGroups);
    foreach ($extractedPdfs as $extractedPdf) {
        // Optional: Save the files locally
        // $extractedPdf->writeToFile("output/path");
        $invoiceResult = $mindeeClient->parse(
            InvoiceV4::class,
            $extractedPdf->asInputSource()
        );
        echo $invoiceResult->document;
    }
}

$mindeeClient = new Client("my-api-key-here");
// $mindeeClient = new Client(); // Optionally, use an environment variable.
$inputPath = "path/to/your/file.ext";
parseInvoice($inputPath, $mindeeClient);
# frozen_string_literal: true
require 'mindee'

def invoice_splitter_auto_extraction(file_path)
  mindee_client = Mindee::Client.new(api_key: 'my-api-key')
  input_source = mindee_client.source_from_path(file_path)
  if input_source.pdf? && input_source.count_pdf_pages > 1
    parse_multi_page(mindee_client, input_source)
  else
    parse_single_page(mindee_client, input_source)
  end
end

def parse_single_page(mindee_client, input_source)
  invoice_result = mindee_client.parse(
    input_source,
    Mindee::Product::Invoice::InvoiceV4
  )
  puts invoice_result.document
end

def parse_multi_page(mindee_client, input_source)
  pdf_extractor = Mindee::Extraction::PdfExtractor::PdfExtractor.new(input_source)
  invoice_splitter_response = mindee_client.enqueue_and_parse(
    input_source,
    Mindee::Product::InvoiceSplitter::InvoiceSplitterV1,
    close_file: false
  )
  page_groups = invoice_splitter_response.document.inference.prediction.invoice_page_groups
  extracted_pdfs = pdf_extractor.extract_invoices(page_groups, strict: false)
  extracted_pdfs.each do |extracted_pdf|
    # Optional: Save the files locally
    # extracted_pdf.write_to_file("output/path")
    invoice_result = mindee_client.parse(
      extracted_pdf.as_input_source,
      Mindee::Product::Invoice::InvoiceV4,
      close_file: false
    )
    puts invoice_result.document
  end
end

my_file_path = '/path/to/the/file.ext'
invoice_splitter_auto_extraction(my_file_path)
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Mindee;
using Mindee.Extraction;
using Mindee.Input;
using Mindee.Product.InvoiceSplitter;
using Mindee.Product.Invoice;
var apiKey = "my-api-key";
var mindeeClient = new MindeeClient(apiKey);
var myFilePath = "path/to/your/file.ext";
await InvoiceSplitterAutoExtraction(myFilePath);

async Task InvoiceSplitterAutoExtraction(string filePath)
{
    var inputSource = new LocalInputSource(filePath);
    if (inputSource.IsPdf() && inputSource.GetPageCount() > 1)
    {
        await ParseMultiPage(inputSource);
    }
    else
    {
        await ParseSinglePage(inputSource);
    }
}

async Task ParseSinglePage(LocalInputSource inputSource)
{
    var invoiceResult = await mindeeClient.ParseAsync<InvoiceV4>(inputSource);
    Console.WriteLine(invoiceResult.Document);
}

async Task ParseMultiPage(LocalInputSource inputSource)
{
    PdfExtractor extractor = new PdfExtractor(inputSource);
    var invoiceSplitterResponse = await mindeeClient.EnqueueAndParseAsync<InvoiceSplitterV1>(inputSource);
    List<ExtractedPdf> extractedPdfs = extractor.ExtractInvoices(
        invoiceSplitterResponse.Document.Inference.Prediction.PageGroups,
        false
    );
    foreach (var extractedPdf in extractedPdfs)
    {
        // Optional: Save the files locally
        // extractedPdf.WriteToFile("output/path");
        var invoiceResult = await mindeeClient.ParseAsync<InvoiceV4>(extractedPdf.AsInputSource());
        Console.WriteLine(invoiceResult.Document);
    }
}
import com.mindee.MindeeClient;
import com.mindee.input.LocalInputSource;
import com.mindee.extraction.ExtractedPDF;
import com.mindee.extraction.PDFExtractor;
import com.mindee.parsing.common.AsyncPredictResponse;
import com.mindee.product.invoice.InvoiceV4;
import com.mindee.product.invoicesplitter.InvoiceSplitterV1;
import java.io.File;
import java.io.IOException;
import java.util.List;

public class AutoInvoiceSplitterExtractionExample {
  private static final String API_KEY = "my-api-key";
  private static final MindeeClient mindeeClient = new MindeeClient(API_KEY);

  public static void main(String[] args) throws IOException, InterruptedException {
    String filePath = "/path/to/the/file.ext";
    invoiceSplitterAutoExtraction(filePath);
  }

  private static void invoiceSplitterAutoExtraction(String filePath) throws IOException, InterruptedException {
    LocalInputSource inputSource = new LocalInputSource(new File(filePath));
    if (inputSource.isPdf() && new PDFExtractor(inputSource).getPageCount() > 1) {
      parseMultiPage(inputSource);
    } else {
      parseSinglePage(inputSource);
    }
  }

  private static void parseSinglePage(LocalInputSource inputSource) throws IOException, InterruptedException {
    AsyncPredictResponse<InvoiceV4> invoiceResult = mindeeClient.enqueueAndParse(InvoiceV4.class, inputSource);
    System.out.println(invoiceResult.getDocumentObj().toString());
  }

  private static void parseMultiPage(LocalInputSource inputSource) throws IOException, InterruptedException {
    PDFExtractor extractor = new PDFExtractor(inputSource);
    AsyncPredictResponse<InvoiceSplitterV1> invoiceSplitterResponse =
        mindeeClient.enqueueAndParse(InvoiceSplitterV1.class, inputSource);
    List<ExtractedPDF> extractedPdfs = extractor.extractInvoices(
        invoiceSplitterResponse.getDocumentObj().getInference().getPrediction().getInvoicePageGroups(),
        false
    );
    for (ExtractedPDF extractedPdf : extractedPdfs) {
      // Optional: Save the files locally
      // extractedPdf.writeToFile("output/path");
      AsyncPredictResponse<InvoiceV4> invoiceResult =
          mindeeClient.enqueueAndParse(InvoiceV4.class, extractedPdf.asInputSource());
      System.out.println(invoiceResult.getDocumentObj().toString());
    }
  }
}
Best Practices
- Handle potential errors and exceptions in your code.
- Implement retry logic for API calls to handle temporary network issues.
- Store extracted data securely and in compliance with relevant data protection regulations.
- When uploading files, ensure that:
  - Invoices are clear, unstained, and properly unfolded
  - There are minimal extra pages (e.g., terms & conditions)
  - Pages from any single invoice are all oriented in the same direction
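The retry advice above can be sketched as a small wrapper with exponential backoff. This is a generic pattern rather than a Mindee feature: `call_with_retries` and its parameters are hypothetical, and in practice `retry_on` should be narrowed to the transient exceptions your HTTP stack actually raises.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** (attempt - 1))
```

A call such as `mindee_client.parse(InvoiceV4, input_source)` from the snippets above could be wrapped as `call_with_retries(lambda: mindee_client.parse(InvoiceV4, input_source))`. The `sleep` parameter exists so that the delay can be replaced in tests.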
Troubleshooting
If you encounter issues:
- Verify your API key and your subscription status for both the Invoice Splitter and Invoice OCR APIs.
- Check the input file format and ensure it's supported.
- Review the API response for any error messages.
- Consult the Mindee API documentation for more detailed information.