Extracting Line Items Tutorial

Overview

The goal of this tutorial is to show a complete example of extracting line items (or tabular data) from a document.

To do this we'll be using the following dummy file:

example line items

Some things to note about the document, which are important for obtaining the best results:

  • Each column has a title, or column positions are fixed.
  • The document is perfectly straight, not skewed at all.
  • There are a fixed number of columns.
  • The text is clear and has a high contrast compared to the background.

This is because the general approach is to identify the columns first, then to break them down into lines and words. Column identification is therefore critical.

Note that this doesn't mean the approach described here won't work if these criteria don't match your document.
Results will probably not be ideal, but it's not an all-or-nothing deal.

For example, a receipt that always has the same 3 columns in the same general position, but without titles, will work.

We'll also need to determine our anchor column(s). This is the column used to determine the line positions. It needs to be present on every line/row. So in the example file above, the last two columns cannot be used, as there are some lines with empty values.

In typical cases, it will be a column containing a title, description, or code.
In this example we'll be using the "Expense Category" as our anchor column.
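
To make the anchor idea concrete, here is a toy sketch (purely illustrative, not Mindee's actual algorithm) of how an anchor column can drive line reconstruction: each anchor cell's vertical position defines a row, and cells from the other columns attach to whichever row they vertically line up with.

```python
# Toy sketch of anchor-based line reconstruction (illustrative only,
# not Mindee's internal algorithm). Each anchor cell's y position
# defines a row; cells from other columns attach to the matching row.

def build_rows(anchor_cells, other_columns, tolerance=0.01):
    """anchor_cells: list of (y_position, text) for the anchor column.
    other_columns: dict mapping column name -> list of (y_position, text)."""
    rows = []
    for y, text in anchor_cells:
        row = {"anchor": text}
        for name, cells in other_columns.items():
            # keep the cell whose y position lines up with this row, if any
            matches = [t for cy, t in cells if abs(cy - y) < tolerance]
            row[name] = matches[0] if matches else ""
        rows.append(row)
    return rows

# "Seeds" has an empty amount cell, as in the sample document:
anchors = [(0.10, "Hand Tools"), (0.15, "Seeds")]
amounts = {"year_actual": [(0.10, "5.95")]}
print(build_rows(anchors, amounts))
# [{'anchor': 'Hand Tools', 'year_actual': '5.95'}, {'anchor': 'Seeds', 'year_actual': ''}]
```

Because the anchor is present on every row, every row gets reconstructed, even when the other columns have gaps.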

Creating the Model

You'll need to create a new model using the API Builder (Studio).

On the API page of the main interface, click on Studio:

api studio

Then click on the Create a new API button:

new api

Next, enter in the basic information for your API and click Next:

api studio

OK, now the fun part!

We'll need to define the fields to extract from the document.

Since the idea is to work mainly with columns, we'll create a field for each column.

Choose the field type that is best adapted to your document, such as String, Amount, Date, etc.

In order to keep things simple for this tutorial, we'll just use a String field for all columns:

string field

The first column we'll create is for the "Expense Category", which in our case has only alphabetical characters.

We'll therefore check the It never contains numeric characters checkbox, then add the field:

alpha field

The remaining 3 columns contain only amounts.
While there is an Amount field type, it is better suited to using the API in a no-code situation.
When using code, we'll have more flexibility in parsing a string into floats or integers as needed.
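
For instance, since the columns come back as strings, a small helper (the name is ours, purely for illustration) can convert each cell into a float later in our code, treating empty cells as missing values:

```python
def parse_amount(raw: str):
    """Convert an extracted amount string to a float, or None if the cell is empty."""
    raw = raw.strip()
    if not raw:
        return None
    # strip thousands separators, in case your documents use them
    return float(raw.replace(",", ""))

print(parse_amount("350.90"))  # 350.9
print(parse_amount(""))        # None
```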

We do want to specify that the strings only contain numerical characters, as follows:

numeric field

We'll do this two more times for the remaining columns.

When all the columns are added, it should look something like this:

finished

Note: for this tutorial we're only extracting line items, but it's perfectly possible to extract other fields on the document, using the same model.

Click on the Create API button at the bottom of the screen to start training the model.

Training the Model

To access the training screen, click on the link in the center of the page that appears after having created the API:

train link

This will bring up the training interface, which we'll use to annotate the files.

Use the Add documents button to upload your training files.

You can send files one by one, but this gets tedious very quickly; it's much better to send a .zip archive containing your training files.

In this toy example, we'll only annotate 20 files, but in reality you'll need at least 60 files for any kind of real-world document containing line items, possibly many more depending on the complexity of the document. You can annotate up to 1000 files.

Validation Set: Make sure to keep at least 20 files that you do not train on; this will be important for the validation step later. Your validation set should be similar to your training set.
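
Holding back files is easiest to do up front; for example, a quick shuffle-and-split (a generic sketch, not part of the Mindee tooling):

```python
import random

def split_files(filenames, validation_count=20, seed=42):
    """Shuffle the file list, then hold back files for validation."""
    files = list(filenames)
    random.Random(seed).shuffle(files)
    return files[validation_count:], files[:validation_count]

# e.g. 80 scanned documents -> 60 for training, 20 for validation
train, validation = split_files([f"doc_{i}.pdf" for i in range(80)])
print(len(train), len(validation))  # 60 20
```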

Multi-Page PDFs: If your files are mainly multi-page PDFs, and in particular if the same columns are present on multiple pages of the same PDF, you'll need to split them into several files. In other words, each page in the PDF should be annotated as a separate file.

In any case, annotating this type of model consists of selecting all the words that belong to a particular column.

Hold the Ctrl key on your keyboard to select multiple words at a time. Do this for all 4 columns:

model train 1
model train 2

When all columns are selected, click on the Validate button to save your annotations:

model train 3

Repeat this for all the files uploaded.

Once you reach 20 files annotated, a new model training will start automatically.

You can check on this in the Models page, accessed using the link on the left side of the interface.

Model training time increases with the number of files annotated.

model train 4

When the model reaches 100% training completion, it's time to validate it.

Validating the Model

Every time a new model is trained, you'll want to validate the performance.

Use your validation set for this. You did keep at least 20 files for validating the model, right? ;-)

For this tutorial we're going to cheat and use one of the files we trained on for demonstrative purposes.

Go to the Live Interface page, accessed from the link on the left side of the interface.

Then upload a file; it will be processed automatically.

The result should look something like this:

validate 1

The main thing to check is that each column is properly extracted.
Here it's working perfectly (because we cheated), despite some of the column colors being very similar.

It will most likely take a few rounds of training and validating until results are satisfactory.

Once you're satisfied with the model's performance, it's time to get into the code side of things.

Calling the API

Time to whip out your favorite IDE and get coding!

For this tutorial we'll be using the official Mindee client library for Node.js/Python.

Other programming languages are supported, check the list.

Let's create a directory to store our project and install the Mindee client:

Node.js:

mkdir line_items_tutorial
cd line_items_tutorial
npm install mindee
touch demo.js

Python:

mkdir line_items_tutorial
cd line_items_tutorial
pip install mindee
touch demo.py

To get started with the code we'll head on over to the Documentation page by using the link on the left of the interface.

From there we'll click on your language (NODEJS or PYTHON), then Select an API key, and finally use the copy code button:

validate 1

Using your IDE, open the demo file we created, and paste in this code. We'll use it as a base.

Make sure the API key is correctly filled in; if not, simply grab one from the Mindee interface.

Find the line:

Node.js:

const inputSource = mindeeClient.docFromPath("/path/to/the/file.ext");

Python:

input_source = mindee_client.source_from_path("/path/to/the/file.ext")

Replace it with a real file path on your drive; this should be a file from your validation set.

Now run the code:

Node.js:

node demo.js

Python:

python3 demo.py

The result should look something like this:

########
Document
########
:Mindee ID: 4f5f93f0-28f5-4eb2-ac1f-bc65a37bc948
:Filename: line_items_sample.png

Inference
#########
:Product: ianare/line_items_tutorial v1.1
:Rotation applied: Yes

Prediction
==========
:category: Hand Tools Power Tools Tool Accessories Seeds Annuals Perennials Trees Soil Mulch Horse Manure Chicken Feed Building Supplies
:previous_year_actual: 55.25 350.90 12.00 20.00 25.25 102.00 28.00 125.00 35.45 50.00 200.50 57.60
:year_actual: 5.95 195.95 78.90 40.32 106.15 260.42 75.00 0.00 80.00 220.67 36.10
:year_projection: 0.00 200.00 50.00 45.00 100.00 200.00 75.00 15.00 75.00 250.00 60.00

Page Predictions
================

Page 0
------
:category: Hand Tools Power Tools Tool Accessories Seeds Annuals Perennials Trees Soil Mulch Horse Manure Chicken Feed Building Supplies
:previous_year_actual: 55.25 350.90 12.00 20.00 25.25 102.00 28.00 125.00 35.45 50.00 200.50 57.60
:year_actual: 5.95 195.95 78.90 40.32 106.15 260.42 75.00 0.00 80.00 220.67 36.10
:year_projection: 0.00 200.00 50.00 45.00 100.00 200.00 75.00 15.00 75.00 250.00 60.00

You'll notice that each column field has the correct contents.
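
It's worth noting why the raw column fields aren't enough on their own: the year_actual and year_projection columns have only 11 values for the 12 categories, because the empty "Seeds" cells simply aren't there. Naively zipping the lists together would misalign every row after the gap, as this toy demo shows:

```python
categories = ["Tool Accessories", "Seeds", "Annuals"]
year_actual = ["78.90", "40.32"]  # the empty "Seeds" cell is simply absent

# zip() silently pairs "Seeds" with the value that belongs to "Annuals":
print(list(zip(categories, year_actual)))
# [('Tool Accessories', '78.90'), ('Seeds', '40.32')]
```

This is why line reconstruction has to rely on word positions rather than list order.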

Once this is working properly, it's time to modify the code to reconstruct the line items into something more useful.

Line Reconstruction Code

We are now ready to convert our column fields into line items.

We'll need our anchor column(s) and the columns we want to extract; these are the field names as output in the previous section.

In our case the anchor column name is category.

The anchor columns don't necessarily have to be extracted, which can give some flexibility as needed.

In our case we do want to extract the anchor column, and so the column names are:
category, previous_year_actual, year_actual, year_projection.

The method to call is columnsToLineItems (columns_to_line_items in Python); it's available on both document-level and page-level objects.

We'll add the method call to the response handling part of the script.
You should have something like this:

Node.js:

// Handle the response Promise
apiResponse.then((resp) => {
  // print a string summary
  console.log(resp.document.toString());

  // handle line items
  const lineItems = resp.document.inference.prediction.columnsToLineItems(
    ["category",],
    ["category", "previous_year_actual", "year_actual", "year_projection"]
  );
  lineItems.forEach((line) => console.log(line));
});
Python:

# Handle the response
line_items = resp.document.inference.prediction.columns_to_line_items(
  ["category",],
  ["category", "previous_year_actual", "year_actual", "year_projection"]
)
for line_item in line_items:
  print(line_item)

This is not very pretty, though.

We can further develop our test script to reproduce the original table:

Node.js:

// Get field content as string
function getFieldContent(line, field) {
  if (line.fields.get(field) !== undefined) {
    return line.fields.get(field).content;
  }
  return "";
}

// Pretty print a line
function printLine(line) {
  const category = getFieldContent(line, "category");
  const previousYearActual = getFieldContent(line, "previous_year_actual");
  const yearActual = getFieldContent(line, "year_actual");
  const yearProjection = getFieldContent(line, "year_projection");

  const stringLine = category.padEnd(20)
    + previousYearActual.padEnd(10)
    + yearProjection.padEnd(10)
    + yearActual

  console.log(stringLine)
}

// Handle the response Promise
apiResponse.then((resp) => {
  // print a string summary
  console.log(resp.document.toString());

  // handle line items
  const lineItems = resp.document.inference.prediction.columnsToLineItems(
    ["category",],
    ["category", "previous_year_actual", "year_actual", "year_projection"]
  );
  lineItems.forEach((line) => printLine(line));
});
Python:

def get_field_content(line, field) -> str:
    if field in line.fields:
        return str(line.fields[field].content)
    return ""


def print_line(line) -> None:
    category = get_field_content(line, "category")
    previous_year_actual = get_field_content(line, "previous_year_actual")
    year_actual = get_field_content(line, "year_actual")
    year_projection = get_field_content(line, "year_projection")
    # here ljust() fills the rest of the given size with spaces
    string_line = (
        category.ljust(20, " ")
        + previous_year_actual.ljust(10, " ")
        + year_projection.ljust(10, " ")
        + year_actual
    )

    print(string_line)


line_items = resp.document.inference.prediction.columns_to_line_items(
    ["category"],
    ["category", "previous_year_actual", "year_actual", "year_projection"],
)

for line in line_items:
    print_line(line)

Which should print something like this:

Hand Tools          55.25     0.00      5.95
Power Tools         350.90    200.00    195.95
Tool Accessories    12.00     50.00     78.90
Seeds               20.00               
Annuals             25.25     45.00     40.32
Perennials          102.00    100.00    106.15
Trees               28.00     200.00    260.42
Soil                125.00    75.00     75.00
Mulch               35.45     15.00     0.00
Horse Manure        50.00     75.00     80.00
Chicken Feed        200.50    250.00    220.67
Building Supplies   57.60     60.00     36.10
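
From here, the reconstructed table can be exported however you like; for example, to CSV with the standard library. This is a sketch assuming the rows have already been collected as lists of strings, in the same column order as the pretty-printed output above:

```python
import csv

# Rows collected from the reconstructed line items, in the printed
# column order: category, previous_year_actual, year_projection, year_actual.
rows = [
    ["Hand Tools", "55.25", "0.00", "5.95"],
    ["Seeds", "20.00", "", ""],
]

with open("line_items.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["category", "previous_year_actual",
                     "year_projection", "year_actual"])
    writer.writerows(rows)
```

Empty cells simply become empty CSV fields, so downstream tools can tell "no value" apart from zero.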