Case Study: Leveraging Multimodal OCR to Achieve 90% Reduction in Document Processing Time

Executive Summary

This case study presents a solution to the challenge of manual data entry in personal finance management. Using the Google Gemini multimodal Large Language Model (LLM), we developed an Optical Character Recognition (OCR) system that not only extracts transaction data but also categorizes it. The result is a reduction in document processing time of more than 90%, from 10 minutes to 10 seconds per document, with an accuracy rate of 95%.

Before and After Comparison

Input: Receipt Image

Output: Structured JSON

{
  "status": 200,
  "success": true,
  "message": "Success",
  "data": {
    "merchant": {
      "name": "Maketh",
      "address": "Jl. Danau Agung 2 Blok C18 No 15A, Kota Jakarta Utara, DKI Jakarta, 14350",
      "phone": "+6281291685555",
      "tax_id": null,
      "category": "cafe"
    },
    "transaction": {
      "date": "2024-11-06",
      "time": "16:09",
      "invoice_number": "9XU9Z6",
      "subtotal": 325000,
      "tax": 52000,
      "total": 377000,
      "payment_method": "BCA",
      "currency": "Rp"
    },
    "items": [
      {
        "description": "Hot Cappuccino",
        "quantity": 1,
        "unit_price": 45000,
        "total": 45000,
        "coa_category": "cafe"
      },
      {
        "description": "Yakiniku Don",
        "quantity": 1,
        "unit_price": 65000,
        "total": 65000,
        "coa_category": "cafe"
      },
      {
        "description": "Chicken Katsu Curry",
        "quantity": 1,
        "unit_price": 65000,
        "total": 65000,
        "coa_category": "cafe"
      },
      {
        "description": "Fish Matah",
        "quantity": 1,
        "unit_price": 65000,
        "total": 65000,
        "coa_category": "cafe"
      },
      {
        "description": "London Fog",
        "quantity": 1,
        "unit_price": 45000,
        "total": 45000,
        "coa_category": "cafe"
      },
      {
        "description": "Iced Tea",
        "quantity": 2,
        "unit_price": 20000,
        "total": 40000,
        "coa_category": "cafe"
      }
    ],
    "summary": {
      "coa_totals": {
        "cafe": 377000
      }
    }
  },
  "error": null,
  "metadata": {
    "has_logo": true,
    "receipt_type": "receipt",
    "image_quality": "good",
    "confidence_score": 0.95
  }
}
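
The amounts in a response like the one above can be cross-checked arithmetically. The following is a minimal sketch, assuming the field names shown in the sample; the function name check_receipt_totals is hypothetical and is not part of the system described here.

```python
from typing import Any


def check_receipt_totals(data: dict[str, Any]) -> list[str]:
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems: list[str] = []
    items = data["items"]
    txn = data["transaction"]

    # Each line total should equal quantity * unit_price.
    for item in items:
        expected = item["quantity"] * item["unit_price"]
        if item["total"] != expected:
            problems.append(
                f"{item['description']}: total {item['total']} != {expected}"
            )

    # The subtotal should equal the sum of the line totals.
    items_sum = sum(item["total"] for item in items)
    if items_sum != txn["subtotal"]:
        problems.append(f"subtotal {txn['subtotal']} != items sum {items_sum}")

    # The grand total should equal subtotal plus tax.
    if txn["subtotal"] + txn["tax"] != txn["total"]:
        problems.append(f"total {txn['total']} != subtotal + tax")

    return problems


# Values copied from the sample output above (only the fields the check needs).
sample = {
    "transaction": {"subtotal": 325000, "tax": 52000, "total": 377000},
    "items": [
        {"description": "Hot Cappuccino", "quantity": 1, "unit_price": 45000, "total": 45000},
        {"description": "Yakiniku Don", "quantity": 1, "unit_price": 65000, "total": 65000},
        {"description": "Chicken Katsu Curry", "quantity": 1, "unit_price": 65000, "total": 65000},
        {"description": "Fish Matah", "quantity": 1, "unit_price": 65000, "total": 65000},
        {"description": "London Fog", "quantity": 1, "unit_price": 45000, "total": 45000},
        {"description": "Iced Tea", "quantity": 2, "unit_price": 20000, "total": 40000},
    ],
}

print(check_receipt_totals(sample))  # prints []
```

An empty list means the line totals, subtotal, tax, and total agree with each other; any mismatch is reported as a message rather than raised as an exception, so a caller can decide whether to flag the receipt for review.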

Technical Architecture

Core AI Model: Google Gemini Multimodal LLM
Backend Framework: FastAPI (Python-based, asynchronous)
Server: Uvicorn (ASGI server implementation)
Programming Language: Python 3.9+
Image Processing: OpenCV and Pillow libraries
Data Serialization: JSON
API Documentation: Swagger UI (via FastAPI)

System Capabilities

Intelligent Amount Detection: Utilizes context-aware algorithms to accurately identify and extract transaction amounts.
Account Classification: Automatically categorizes transactions into appropriate account types (e.g., checking, savings, credit card).
Transaction Categorization: Employs machine learning to classify expenses into predefined or custom categories.
Credit/Debit Classification: Distinguishes between income and expenses based on transaction context.
High Accuracy: Achieves 95%+ accuracy in data extraction and categorization.
Real-time Processing: Provides near-instantaneous results, enabling immediate financial insights.
Multi-format Support: Processes various document types including receipts, invoices, and bank statements.
Multilingual Support: Processes documents in multiple languages.
Custom Classification Rules: Supports a customizable rule engine for user-defined categorization logic.
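
To illustrate what user-defined categorization logic could look like, here is a minimal sketch of a keyword-based rule list. The Rule class, the categorize function, and the example rules are all hypothetical; the rule format actually used by the system is not described in this case study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    keyword: str   # case-insensitive substring to look for in the description
    category: str  # category assigned when the keyword matches


def categorize(
    description: str, rules: list[Rule], default: Optional[str] = None
) -> Optional[str]:
    """Return the category of the first matching rule, else the default."""
    text = description.lower()
    for rule in rules:
        if rule.keyword.lower() in text:
            return rule.category
    return default


# Hypothetical user rules; the first match wins, so order expresses priority.
user_rules = [Rule("cappuccino", "coffee"), Rule("tea", "beverages")]

print(categorize("Hot Cappuccino", user_rules, default="cafe"))  # coffee
print(categorize("Yakiniku Don", user_rules, default="cafe"))    # cafe
```

Plain substring matching is deliberately simple and will over-match in some cases (for example, "tea" also occurs in "steak"); a real rule engine would likely need word boundaries or richer conditions.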

Technical Workflow: From Document to Structured Data

Our system transforms raw document images into structured financial data in three stages:

1. Document Preprocessing

The system optimizes the input image for analysis through:

Adaptive thresholding for enhanced contrast
Gaussian denoising to reduce image noise
Geometric transformations for deskewing (affine) and perspective correction (projective)
Resolution standardization to 300 DPI for consistent processing
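
As an illustration of the first step in the list above, the sketch below implements mean-based adaptive thresholding in plain Python on a tiny grayscale grid. It is a teaching example under stated assumptions, not the system's code: the function name, block size, and offset c are hypothetical, and a production pipeline would more likely call OpenCV's cv2.adaptiveThreshold, which handles image borders differently.

```python
def adaptive_threshold(
    image: list[list[int]], block: int = 3, c: int = 2
) -> list[list[int]]:
    """Binarize a grayscale image using the mean of each pixel's neighborhood.

    A pixel becomes 255 if it is brighter than (local mean - c), else 0.
    """
    height, width = len(image), len(image[0])
    radius = block // 2
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighborhood, clipped at the image borders
            # (a simplification; OpenCV pads the border instead).
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            result[y][x] = 255 if image[y][x] > local_mean - c else 0
    return result


# A bright background with one dark "ink" pixel in the middle.
page = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]

for row in adaptive_threshold(page):
    print(row)
# [255, 255, 255]
# [255, 0, 255]
# [255, 255, 255]
```

Because the threshold is computed per neighborhood rather than once for the whole image, the same rule keeps working when a receipt is unevenly lit.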

2. Intelligent Document Analysis

Gemini processes and understands documents through a comprehensive approach:

Visual Layout Analysis: Identifies document structure, key regions, and information hierarchy
Contextual Understanding: Uses natural language processing to comprehend the semantic meaning of text and its relationships
Intelligent Matching: Employs fuzzy matching algorithms to handle variations in merchant names and item descriptions
Financial Classification: Automatically assigns each transaction to a chart of accounts (COA) category and a wallet destination
Error Detection: Validates extracted data against expected patterns and flags potential inconsistencies
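
The intelligent matching step above can be approximated with standard string similarity. The sketch below uses Python's difflib module to map a slightly misread merchant name onto a known one. The merchant list, the match_merchant function, and the 0.8 cutoff are assumptions made for illustration; the case study does not state which matching algorithm the system uses.

```python
import difflib
from typing import Optional

# Hypothetical reference list of merchants already known to the user.
KNOWN_MERCHANTS = ["Maketh", "Warung Sederhana", "Toko Makmur"]


def match_merchant(
    raw_name: str, known: list[str], cutoff: float = 0.8
) -> Optional[str]:
    """Return the closest known merchant name, or None if none is similar enough."""
    # Compare in lowercase, but return the canonical spelling.
    lowered = {name.lower(): name for name in known}
    hits = difflib.get_close_matches(
        raw_name.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


print(match_merchant("Maketn", KNOWN_MERCHANTS))        # Maketh
print(match_merchant("Unknown Shop", KNOWN_MERCHANTS))  # None
```

A higher cutoff reduces false matches at the cost of missing heavier OCR errors, so the value would need tuning against real receipts.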

3. Structured Data Output

The system generates a standardized JSON output as shown in the Before and After Comparison section above.

Performance Metrics and Business Impact

Key Performance Indicators

Processing Time: Reduced from 10 minutes to 10 seconds per document
Accuracy Rate: 95% (compared to manual entry)
Error Reduction: 95% decrease in data entry errors
Throughput: Capacity increased from 6 to 60 documents per hour
Scalability: Linear scaling with added computational resources
Price: Estimated Rp 1,000 per 50 pages
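
The figures in this list can be related to one another with simple arithmetic. The short calculation below uses only the numbers stated above; the variable names are illustrative.

```python
manual_seconds = 10 * 60    # 10 minutes per document (from the list above)
automated_seconds = 10      # 10 seconds per document (from the list above)

time_reduction = 1 - automated_seconds / manual_seconds
throughput_gain = 60 / 6    # documents per hour, after vs. before
price_per_page = 1000 / 50  # Rp per page, from "Rp 1,000 per 50 pages"

print(f"Time reduction per document: {time_reduction:.1%}")   # 98.3%
print(f"Throughput gain: {throughput_gain:.0f}x")              # 10x
print(f"Estimated price per page: Rp {price_per_page:.0f}")    # Rp 20
```

The tenfold throughput gain corresponds to the 90% headline figure (each document takes one tenth of the time), while the per-document timings taken alone correspond to a reduction of about 98%.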

Business Benefits

90% reduction in manual data entry costs
Improved data consistency and reliability
Real-time financial insights enabling faster decision-making
Enhanced customer satisfaction through quicker processing
Freed up human resources for higher-value tasks

Conclusion

The implementation of our Multimodal OCR system, powered by Google Gemini LLM, has revolutionized the document processing workflow in personal finance management. The 90% reduction in processing time, coupled with high accuracy and automated classification, demonstrates the transformative impact of advanced AI technologies in financial document processing. This case study underscores the potential of multimodal AI models to solve complex, real-world challenges, paving the way for more efficient and accurate financial management systems.