Extracting Invoice Data Made Easy: Introducing our Template Free Image-to-Text Interactive Model

Hackathon Finalist

As businesses continue to rely on digital documentation, the need for automated data extraction from images and PDFs has become increasingly important. That's why we're excited to introduce our image-to-text interactive model. Our solution offers a fast and accurate way to extract information from images and respond to product-specific queries.

How It Works

Our image-to-text interactive model runs several pre-processing steps to extract relevant information from images. It is trained to distinguish between text and tables, and it uses OCR and entity recognition to extract values from word cells and label them according to data fields. The resulting key-value pairs are then used for closed-domain Q/A training with a Hugging Face transformer model.
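
The OCR-then-label flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `ocr_text` assumes the `pytesseract` and `Pillow` packages (plus the Tesseract binary) are installed, and `extract_pairs` swaps the trained entity recognizer for a simple "Label: value" regular expression. Both function names and the sample invoice text are hypothetical.

```python
import re

# Matches lines shaped like "Label: value".
# Assumption: this regex is a stand-in for the trained entity-recognition step.
PAIR_PATTERN = re.compile(r"^\s*([A-Za-z][A-Za-z .#/]*?)\s*:\s*(.+?)\s*$")


def ocr_text(image_path):
    """Run Tesseract OCR on an image file and return the raw text."""
    # Imported lazily so the rest of the sketch runs without OCR installed.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def extract_pairs(text):
    """Turn OCR output into a dict of normalized field name -> value."""
    pairs = {}
    for line in text.splitlines():
        match = PAIR_PATTERN.match(line)
        if match:
            # "Invoice No" becomes "invoice_no" so keys are easy to look up.
            key = match.group(1).strip().lower().replace(" ", "_")
            pairs[key] = match.group(2)
    return pairs


# Hypothetical OCR output; lines without a "Label: value" shape are skipped.
sample = """ACME Supplies
Invoice No: INV-1042
Date: 2023-01-15
Total: 249.50
"""
print(extract_pairs(sample))
```

In practice the text would come from `ocr_text("invoice.png")` rather than a hard-coded string, and a real invoice would need the table handling and entity recognition that this regex does not attempt.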

The model is designed to answer queries related to specific dates, maximum amounts spent, and mathematical operations. For non-mathematical queries, the model uses an ensemble of rule-based keyword searching and a trained QA-based transformer model to retrieve relevant answers. For mathematical queries, the model passes the query through a pre-trained T5 transformer model.
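
The routing between the two query paths could look something like the sketch below. It is an assumption for illustration only: the cue-word list, the `route_query` name, and the two handler labels are hypothetical, and a simple keyword rule stands in for however the real system decides whether a query is mathematical.

```python
import re

# Hypothetical cue words suggesting a query needs arithmetic or aggregation.
MATH_CUES = {
    "sum", "average", "maximum", "minimum", "max", "min",
    "difference", "add", "subtract", "multiply", "divide", "plus", "minus",
}


def is_mathematical(query):
    """Return True if the query has a math cue word or a '+' / '*' between digits."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    # '-' and '/' are deliberately ignored because they also appear in dates.
    has_operator = re.search(r"\d\s*[+*]\s*\d", query) is not None
    return bool(words & MATH_CUES) or has_operator


def route_query(query):
    """Return the label of the handler that should answer the query."""
    return "math_model" if is_mathematical(query) else "qa_ensemble"


for q in ["What is the maximum amount spent?", "Who is the vendor?"]:
    print(q, "->", route_query(q))
```

Here `"math_model"` would correspond to the T5 path and `"qa_ensemble"` to the keyword search plus QA transformer path described above.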


Our solution is template-free and retrieves information in a Q/A format. It can answer questions about expenditures on a particular date or the maximum amount spent, and it is not pre-trained on invoice templates. The model generates key-value pairs and has a simple interface.


As we continue to improve our image-to-text interactive model, we see opportunities to add speech assistants and natural language processing for better user interaction. We are also considering building an automated spreadsheet maker and introducing APIs for batch processing. Additionally, we plan to monetize our APIs and services.


There are a few limitations to our model. We rely on third-party CPU and GPU providers, and the quality of the image provided by the user can affect accuracy. Currently, our model is only available in English.

Architecture in detail

The architecture of our solution involves several stages to process and extract information from the uploaded invoice image and respond to user queries in a fast and efficient way. Let's dive into each stage of the process in detail.

  1. User uploads the invoice image in the portal: The process begins with the user uploading the invoice image in the portal. The image can be in a format such as PNG, JPEG, or PDF.

  2. PyTesseract extracts the text: Once the image is uploaded, PyTesseract, a Python wrapper for the Tesseract OCR engine, extracts the text from the image. OCR stands for Optical Character Recognition, a technology that enables a computer to read the text in an image.

  3. Extraction of keywords and storing them as key-value pairs: After the text has been extracted, we use techniques like Named Entity Recognition (NER), Topic Modelling and YAKE to extract the important keywords from the text. These keywords are then stored as key-value pairs in our database. This process helps in identifying the relevant information from the invoice and making it easy to retrieve.

  4. Invoice processing is done: At this stage, all the necessary information from the invoice has been extracted and stored in the database as key-value pairs. The user has to wait until this process is complete.

  5. The user enters query: Once the processing of the invoice is complete, the user can enter the query in the query input field.

  6. Preprocessing of input text: The input query from the user is preprocessed with libraries such as spaCy and TextBlob to clean the text and remove any irrelevant information.

  7. Fetching relevant details from the database: Next, we fetch the relevant details and values from the database based on the keywords extracted from the user query. The key-value pairs stored in the database come in handy at this stage.

  8. Classifying the intent: We classify the user's intent based on the type of query asked. We divide the queries into two categories:

    a. Mathematical queries: If the query is mathematical in nature, we use the T5 model to perform the required math operations. T5 stands for Text-To-Text Transfer Transformer and is a state-of-the-art language model from Google.

    b. General queries: If the query is a general question, we use the BERT model to answer the query. BERT stands for Bidirectional Encoder Representations from Transformers and is a pre-trained language model that can be fine-tuned for a range of natural language processing tasks.

  9. Paraphrasing the answer: Lastly, we use GPT-3 to paraphrase the answer as a human-readable sentence and return it to the user. GPT-3 stands for Generative Pre-trained Transformer 3 and is an AI language model that can generate text, answer questions, and perform other language-related tasks.
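
Steps 6 and 7 can be illustrated with a small, self-contained sketch. It is not the project's implementation: a hand-written stop-word list stands in for the spaCy/TextBlob preprocessing, a plain dictionary stands in for the database, and the names `preprocess`, `fetch_relevant`, and the sample fields are hypothetical.

```python
import re

# Assumption: a tiny stop-word list replaces the real preprocessing libraries.
STOP_WORDS = {
    "what", "is", "the", "of", "on", "for", "a", "an",
    "was", "in", "my", "this", "who", "when",
}


def preprocess(query):
    """Lowercase the query, split it into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def fetch_relevant(query, store):
    """Return the stored pairs whose key shares at least one token with the query."""
    query_tokens = set(preprocess(query))
    return {
        key: value
        for key, value in store.items()
        if query_tokens & set(key.split("_"))
    }


# Hypothetical key-value pairs produced by the invoice-processing stage.
store = {
    "invoice_no": "INV-1042",
    "invoice_date": "2023-01-15",
    "total_amount": "249.50",
    "vendor_name": "ACME Supplies",
}
print(fetch_relevant("What is the total amount?", store))
```

The pairs returned here would then be handed to the intent-specific model in step 8, so the language model only sees the fields that matter for the query.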

In conclusion, the architecture of our solution spans multiple steps, from image processing to answering user queries in a human-readable format. The key-value pairs stored in our database make it easy to retrieve the relevant information from the invoice, and language models such as BERT, T5, and GPT-3 handle question answering, math operations, and paraphrasing.


Converting POC to MVP

We are continuously improving our image-to-text interactive model. In our next steps, we plan to move OCR from PyTesseract to keras-ocr for better accuracy. We also plan to build a proper frontend using React/TypeScript and a backend using Node.js/TypeScript. To handle heavy traffic, we plan to scale our server across multiple instances behind a load balancer such as Nginx. Upgrading our AWS EC2 instances should improve payload delivery, and hosting our ML model behind Amazon SageMaker endpoints should improve scalability. Finally, we plan to host our frontend on scalable AWS instances for faster page loads and better CDN quality.

In addition to our portal, we plan to release an API for batch processing, which companies can use to process multiple invoices at a time using predefined or instant questions. Third-party apps will also be able to integrate with our closed system to ensure their data security.
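
A batch interface of this kind might be shaped like the sketch below. Everything here is hypothetical, since the API has not been released: `answer_question` is a placeholder for the real extraction and Q/A pipeline, and the dictionary-based invoices stand in for already-processed key-value data.

```python
from concurrent.futures import ThreadPoolExecutor


def answer_question(invoice, question):
    """Placeholder for the real pipeline: treat the question as a field name."""
    return invoice.get(question, "not found")


def process_batch(invoices, questions, max_workers=4):
    """Ask every predefined question of every invoice; results keep input order."""
    def run(invoice):
        return {question: answer_question(invoice, question) for question in questions}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input invoices.
        return list(pool.map(run, invoices))


# Hypothetical processed invoices and a predefined question list.
invoices = [
    {"total_amount": "249.50", "invoice_date": "2023-01-15"},
    {"total_amount": "80.00"},
]
print(process_batch(invoices, ["total_amount", "invoice_date"]))
```

A thread pool is used here only because each invoice can be handled independently; a real service would more likely queue jobs across the multiple server instances mentioned above.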


Our image-to-text interactive model offers a faster and more accurate way to extract data from images and PDFs. We are continuously improving our model and exploring new opportunities to improve user interaction and scalability. With our model, businesses can automate data extraction, save time, and focus on their core operations.


The project is not open-sourced yet but will be soon. Stay tuned.

For more such content, follow me on GitHub. Thanks!

Follow my fellow project developer, Anurag Chakraborty

My profile links: linktr.ee/soumyajit.d