What is Document Classification?
Workflows often involve document processing, and sometimes you need to classify those documents automatically in your software. For example, your users might upload many different kinds of documents in a single flow, or a single PDF containing several different documents. Depending on your use case, automating this can be tricky.
However, with Mindee, you can build an accurate document classification API that meets your particular requirements, have it up and running, and process millions of documents simultaneously.
Create Your Classification Document API
To create your classification document API, you need sample documents for each category you want to classify. For this example, let's consider these document classes:
- 1040 Forms
Note: Because this is a dummy example, we used random documents for the “other” class. In your own use case, it’s better to use real data from your flow that you’d consider “other”.
Log into your Mindee account. You'll land on the APIs hub page.
Click the Create a new API button on the left.
In the Set up your API section, fill in the required information: give the API a name, a description, and an optional cover image, then click Next.
Next, define your data model by adding the different classes to the classification field.
- Click on the Classification field
- In the popup, fill in the form with the classes defined earlier.
- Click the Add this classification field button. The data model is now set.
- You can now click the Start training your model button.
Train Your Document Classifier API
Your API is now deployed. Next, we'll train the model.
To do so, we’ll need data. Fifteen samples of each type should be enough to get very good performance, but you can train with more if you want to. Once your data is uploaded, annotating it should take no more than 10 minutes. The training interface looks like this:
- On the left side of the training interface, you can upload images, PDFs, or zip archives. If you have all the documents you want to classify in a folder on your laptop, zip it and drag and drop the archive onto the upload interface. This opens the data management pane, which lists the uploaded documents.
Note: You can mix PDFs and images; our backend takes care of this. Gathering your samples for training is actually the most boring part of the process.
- Each uploaded file appears automatically in the pane as it becomes ready for annotation. To make the annotation process easier, click the settings icon in the header at the top right of the screen and check the automatic data loading option.
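The zipping step above can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `zip_samples` and the list of accepted extensions are assumptions made for illustration (the tutorial only says "images, pdf, or zip archives"), so adjust them to your own data.

```python
import zipfile
from pathlib import Path

# Assumption: these extensions stand in for "images and PDFs".
# Check the upload interface for the formats it actually accepts.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png"}


def zip_samples(folder, archive_path):
    """Add every supported file under `folder` to a zip archive.

    Returns the number of files written, which is handy for checking
    that you gathered enough samples before uploading.
    """
    folder = Path(folder)
    count = 0
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
                # Store paths relative to the folder so the archive
                # does not leak your local directory layout.
                archive.write(path, path.relative_to(folder).as_posix())
                count += 1
    return count
```

For example, `zip_samples("training_docs", "training_docs.zip")` would produce an archive you can drag and drop onto the upload interface.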
Click on Your data set at the bottom left of the interface to view the uploaded data.
Select the first document you see in the list.
For each document, click the desired class on the right side of the interface.
Click Validate at the bottom right of the interface.
Repeat this process until you have annotated 20 documents, which is when the first model is trained.
A model is trained every 20 documents, and each one is automatically deployed on your API under a new version:
- V1.0 = no model
- V1.1 = 1st model (20 documents)
- V1.2 = 2nd model (40 documents)
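The versioning scheme listed above comes down to simple integer arithmetic. Here is a small sketch of it; the helper `api_version` is hypothetical, and it assumes the pattern continues in the same way beyond V1.2, which the list above does not state explicitly.

```python
DOCS_PER_MODEL = 20  # from the tutorial: a model is trained every 20 documents


def api_version(annotated_docs):
    """Return the version label expected after annotating `annotated_docs` documents.

    Assumption: the minor version increases by one for each completed
    batch of 20 documents, as in the V1.0 / V1.1 / V1.2 examples.
    """
    if annotated_docs < 0:
        raise ValueError("annotated_docs must be non-negative")
    models_trained = annotated_docs // DOCS_PER_MODEL
    return f"V1.{models_trained}"
```

With this sketch, 19 annotated documents still map to V1.0 (no model yet), 20 map to V1.1, and 45 map to V1.2.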
You get an email when a model is deployed. To find out how your model performs, contact support.
See Model training to learn more about how your models are trained.
Use the API
Your API is now ready to be used in your coding environment. Once your first model is deployed, you can test it right away with new data.
Click the Live interface button in the sidebar, then drag and drop a document. You should see something like this:
The latest version of your API (i.e., the latest trained model) is automatically set for the live interface. You can then click the Documentation button in the sidebar and follow the steps there to integrate your API into your application.
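As a rough idea of what the integration might look like, here is a hedged Python sketch. The URL pattern, the `Authorization: Token ...` header, and the `document` form field are assumptions based on how Mindee prediction endpoints are commonly called; the account name, endpoint name, and helper functions are hypothetical. Treat the Documentation page of your own API as the source of truth.

```python
def build_predict_request(account, endpoint, api_key, version="1"):
    """Build the URL and headers for a prediction call.

    Assumption: the endpoint follows the pattern
    https://api.mindee.net/v1/products/<account>/<endpoint>/v<version>/predict
    and authenticates with a token header. Verify both in your API's
    Documentation page before relying on them.
    """
    url = (
        "https://api.mindee.net/v1/products/"
        f"{account}/{endpoint}/v{version}/predict"
    )
    headers = {"Authorization": f"Token {api_key}"}
    return url, headers


def classify_document(path, account, endpoint, api_key):
    """Send one file to the API and return the parsed JSON response."""
    # Imported here so the helper above works without third-party packages.
    import requests

    url, headers = build_predict_request(account, endpoint, api_key)
    with open(path, "rb") as handle:
        # Assumption: the file is sent in a multipart field named "document".
        response = requests.post(url, headers=headers, files={"document": handle})
    response.raise_for_status()
    return response.json()
```

Calling `classify_document("form.pdf", "my-account", "doc_classifier", "my-api-key")` would return the raw JSON response; where the predicted class sits inside that response is described in your API's documentation.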