Financial document OCR
Keep track of the changes and updates for the Financial document OCR API.
Version 1
⚡️ Features and Changes (April 25th, 2024)
- 🚀 Extended Latin alphabet support for invoices
We released new models for our generic text detection and recognition pipeline. This release improves overall performance on all fields and adds support for the following extended Latin alphabet characters: {'`', '¡', '¥', '¿', 'Á', 'Ã', 'Ä', 'Å', 'Æ', 'Ì', 'Í', 'Ð', 'Ñ', 'Ò', 'Ó', 'Õ', 'Ö', 'Ø', 'Ú', 'Ü', 'Ý', 'Þ', 'ß', 'á', 'ã', 'ä', 'å', 'æ', 'ì', 'í', 'ð', 'ñ', 'ò', 'ó', 'õ', 'ö', 'ø', 'ú', 'ü', 'ý', 'þ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Č', 'č', 'Ď', 'ď', 'Đ', 'đ', 'Ē', 'ē', 'Ė', 'ė', 'Ę', 'ę', 'Ě', 'ě', 'Ğ', 'ğ', 'Ģ', 'ģ', 'Ī', 'ī', 'Į', 'į', 'İ', 'ı', 'Ķ', 'ķ', 'Ĺ', 'ĺ', 'Ļ', 'ļ', 'Ľ', 'ľ', 'Ł', 'ł', 'Ń', 'ń', 'Ņ', 'ņ', 'Ň', 'ň', 'Ō', 'ō', 'Ő', 'ő', 'Ŕ', 'ŕ', 'Ŗ', 'ŗ', 'Ř', 'ř', 'Ś', 'ś', 'Ş', 'ş', 'Š', 'š', 'Ť', 'ť', 'Ū', 'ū', 'Ů', 'ů', 'Ű', 'ű', 'Ų', 'ų', 'Ź', 'ź', 'Ż', 'ż', 'Ž', 'ž', 'Ș', 'ș', 'Ț', 'ț', 'ẞ', '₿'}
- 🔥 Strong improvement on due_date and line_items for invoices
We have observed a reduction in error rates as follows:
  - 20% for due_date
  - 25% for line_items
- ✨ New fields for invoices
The API now extracts the following fields:
  - customer_id: The identifier of the customer in the supplier’s records. It can also refer to a client ID, a client or customer account number, etc.
  - supplier_phone_number: The phone number of the supplier
  - supplier_email: The email address of the supplier
  - supplier_website: The website URL of the supplier
- 🔥 General accuracy improvement
Thanks to the improvements made to our generic text detection and recognition algorithms, we measured a reduction in error rates on all fields, especially for supplier and customer information.
- ✨ New field for receipts
The locale field now contains the following subfields when the document sent to the endpoint is a receipt:
  - country: country code of the country where the receipt was issued (e.g. US)
  - value: concatenation of the language and country codes in ISO format (e.g. en-US)
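As an illustration of how the two locale subfields relate, here is a minimal Python sketch. It assumes the locale arrives as a plain dictionary with `value` and `country` keys; that shape and the helper name `split_locale` are hypothetical and are not taken from the API reference.

```python
from typing import Optional, Tuple


def split_locale(locale: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (language, country) from a locale dictionary.

    Assumption: `value` looks like "en-US" and `country` like "US",
    as in the examples given in the changelog entry.
    """
    value = locale.get("value") or ""
    # "en-US" -> language "en", country "US"
    language, _, country_from_value = value.partition("-")
    # Prefer the explicit country subfield, fall back to the one embedded in value
    country = locale.get("country") or country_from_value or None
    return (language or None, country)


receipt_locale = {"value": "en-US", "country": "US"}
print(split_locale(receipt_locale))
```

Falling back to the country embedded in `value` keeps the helper usable for documents where the `country` subfield is absent.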
⚡️ Features and Changes (March 11th, 2024)
- 🚀 Integration of company ID & logo database for invoices
We have integrated a company ID database and a vector database featuring millions of logos. This enhancement enables our R&D team to efficiently rectify issues with supplier names that are extracted incorrectly.
- 🔥 Strong improvement on invoices for supplier_name, customer_name, and invoice_number
We have observed a reduction in error rates as follows:
  - 20% for customer_name
  - 15% for supplier_name
  - 10% for invoice_number
The improvement in supplier_name was achieved by incorporating information from the databases. The customer_name algorithm now mirrors the supplier_name one. The invoice_number field now uses an NLP modality to boost its precision.
⚡️ Features and Changes (January 30th, 2024)
- 🚀 Integration of a proprietary language model in the algorithm pipeline: LiLT
LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding.
LiLT's design combines textual content with layout structure. This means it doesn't just read the text but also understands how the text is organized within the document. For instance, it recognizes headings, paragraphs, tables, and other structural elements, which is a crucial aspect of context awareness in document processing.
The integration of this new language model in our pipeline helps us achieve better accuracy and more flexibility when adding new supported fields.
- 🔥 Strong improvement on supplier_name, supplier_address, and supplier_company_registrations on invoices
The main focus of this release was to drastically improve supplier information extraction.
We measured a decrease in error rates of:
  - 42% for **supplier_name**
  - 10% for **supplier_address**
  - 10% for **supplier_company_registrations**
Moreover, the integration of LiLT offers more robustness across languages thanks to its language-independent component and will help us improve all other fields in the next releases.
- ✨ New field: total_tax on invoices
The API now extracts the total tax, returned as a number. It corresponds to the total tax explicitly written in the document.
- 🔥 General improvement for all fields on invoices
More training data was added to our training set, covering more geographies and more variability. We’ve measured an improvement in accuracy for all extracted fields.
⚡️ Features and Changes (September 1st, 2023)
- New feature: Raw Value is now available for both Supplier Name and Customer Name. The Raw Value is the name as extracted, without post-processing or formatting, so it can differ from the Value.
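To make the difference between the two concrete, the sketch below chooses which one to display and reports whether post-processing changed the name. The dictionary keys `value` and `raw_value` and the helper `describe_name` are assumptions made for illustration. The sample strings are invented as well.

```python
def describe_name(field: dict) -> dict:
    """Summarize a name field that carries both a Value and a Raw Value.

    Assumption: the field is a dict with optional "value" and "raw_value" keys.
    """
    value = field.get("value")
    raw = field.get("raw_value")
    return {
        # Prefer the post-processed value, fall back to the raw text
        "display": value if value is not None else raw,
        # Keep the text as it was extracted, for audit or matching purposes
        "as_extracted": raw,
        # True only when both exist and post-processing changed the text
        "was_normalized": value is not None and raw is not None and value != raw,
    }


supplier_name = {"value": "ACME CORPORATION", "raw_value": "Acme Corp."}
print(describe_name(supplier_name))
```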
⚡️ Features and Changes (May 23rd, 2023)
New extracted field:
- supplier_phone_number
Updated fields for receipts:
- supplier_address is now available for receipts
- supplier_company_registrations is now available for receipts
- line_items is now available for receipts, limited to the following subfields: description, unit_price, quantity, total_amount
⚡️ Feature: First Release (January 17th, 2023)
Extracted fields:
- total_amount
- total_net
- taxes
- supplier_address
- supplier_name
- payment_details (null for receipts)
- orientation
- locale (currency, language)
- invoice_number (null for receipts)
- reference_numbers (empty list for receipts)
- due_date
- document_type
- date
- customer_company_registration (null for receipts)
- customer_address (null for receipts)
- customer_name (null for receipts)
- supplier_company_registration (null for receipts)
- category
- subcategory
- time (null for invoices)
- tip (null for invoices)
- total_tax (sum of taxes for invoices)
- line_items (empty list for receipts): product_code, description, quantity, unit_price, total_amount, tax_amount, tax_rate
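The line_items subfields listed above could be modeled on the client side roughly as follows. This is a sketch under stated assumptions: the `LineItem` class, its `from_dict` helper, and the idea that each line item arrives as a flat dictionary are all hypothetical. The sample values are invented. Every subfield is optional so that one class can hold both invoice and receipt line items.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LineItem:
    """One extracted line item; every subfield may be missing."""
    product_code: Optional[str] = None
    description: Optional[str] = None
    quantity: Optional[float] = None
    unit_price: Optional[float] = None
    total_amount: Optional[float] = None
    tax_amount: Optional[float] = None
    tax_rate: Optional[float] = None

    @classmethod
    def from_dict(cls, raw: dict) -> "LineItem":
        # Keep only the keys that match a declared subfield, ignore the rest
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


# A receipt-style item that only carries four of the subfields
receipt_item = LineItem.from_dict(
    {"description": "Coffee", "quantity": 2, "unit_price": 3.5, "total_amount": 7.0}
)
print(receipt_item)
```

Ignoring unknown keys means the parser keeps working if later releases add subfields.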