AI Document Processing: Extracting and Structuring Data from Unstructured Documents

Most business data lives in unstructured documents: invoices, contracts, receipts, forms, and reports. AI-powered document processing can extract structured data from these documents with accuracy that rivals human data entry, at a fraction of the cost. Here's how to build a pipeline that turns messy documents into clean data.

The Document Processing Pipeline

A production pipeline moves each document from input to output through four processing stages:

Input → Preprocessing → OCR/Extraction → Parsing → Validation → Output

Each stage has specific optimization opportunities.
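
One way to picture the flow above is as a list of functions that each take a document and hand it on. The sketch below is only an illustration: the Document class, the stage functions, and the hard-coded OCR text are all hypothetical stand-ins for real OpenCV, OCR, and LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    """Carries a document and everything learned about it between stages."""
    source: str
    text: str = ""
    fields: dict = field(default_factory=dict)
    issues: list = field(default_factory=list)


Stage = Callable[[Document], Document]


def run_pipeline(doc: Document, stages: list[Stage]) -> Document:
    """Apply each stage in order, passing the document along."""
    for stage in stages:
        doc = stage(doc)
    return doc


# Placeholder stages (hypothetical): real ones would call an image library,
# an OCR engine, and an LLM.
def preprocess(doc: Document) -> Document:
    return doc


def ocr(doc: Document) -> Document:
    doc.text = "Invoice INV-001 Total 42.00"  # pretend OCR output
    return doc


def parse(doc: Document) -> Document:
    tokens = doc.text.split()
    doc.fields = {"invoice_number": tokens[1], "total_amount": float(tokens[3])}
    return doc


def validate(doc: Document) -> Document:
    if doc.fields.get("total_amount", 0) <= 0:
        doc.issues.append("total_amount must be positive")
    return doc


result = run_pipeline(Document(source="invoice.png"), [preprocess, ocr, parse, validate])
print(result.fields, result.issues)
```

Keeping every stage behind the same signature makes it easy to swap an OCR engine or add a validation rule without touching the rest of the pipeline.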

Preprocessing: Garbage In, Garbage Out

Document quality varies wildly. Preprocessing normalizes inputs before extraction:

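import cv2
import numpy as np

def preprocess_document(image_path):
    """Normalize a document image for OCR: deskew, denoise, then binarize."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Deskew - estimate rotation from the text (dark) pixels, not the background
    ink = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(ink > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # minAreaRect reports [-90, 0) before OpenCV 4.5.1 and (0, 90] after;
    # fold both into a small correction. Verify the sign on a known-skewed scan.
    if angle < -45:
        angle = -(90 + angle)
    elif angle > 45:
        angle = 90 - angle
    else:
        angle = -angle
    h, w = img.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                         borderMode=cv2.BORDER_REPLICATE)

    # Denoise while the image is still grayscale (h sets filter strength)
    img = cv2.fastNlMeansDenoising(img, h=30)

    # Adaptive thresholding for varying lighting
    img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 31, 2)

    return img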

Key preprocessing steps:

Deskewing: Even 2° of rotation can degrade OCR accuracy by 15-20%
Contrast enhancement: Use CLAHE for low-contrast scans
Noise reduction: Critical for faxed documents
Dilation/erosion: Connect broken characters in poor-quality scans
Page splitting: Detect and separate multi-page layouts

OCR: Choosing the Right Engine

| Engine | Accuracy | Speed | Structured Output | Cost |
|--------|----------|-------|-------------------|------|
| Tesseract 5 | 85-92% (clean) | Fast | Text + bounding boxes | Free |
| Azure Document Intelligence | 95-98% | Medium | Key-value, tables, checkboxes | $0.01-0.05/page |
| AWS Textract | 93-97% | Medium | Forms, tables, signatures | $0.015/page |
| Google Document AI | 94-98% | Medium | Entity extraction, classification | $0.015-0.065/page |
| LlamaParse (LlamaIndex) | 95-98% | Slow | Markdown, tables, images | Free tier + API |

Recommendation: Use Azure Document Intelligence or Google Document AI for production workloads where accuracy matters. Use Tesseract for internal tools or high-volume, low-value documents. LlamaParse is excellent for ingesting complex PDFs into LLM-friendly formats for RAG systems.
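
To make the cost column concrete, here is a rough back-of-the-envelope calculation. The per-page prices are copied from the table above; the 50,000 pages per month volume is a made-up figure for illustration, and real bills depend on the features used and on current vendor pricing. Tesseract has no license fee, but you still pay for the compute it runs on.

```python
# Per-page prices in USD as (low, high), copied from the comparison table.
PRICE_PER_PAGE = {
    "Tesseract 5": (0.0, 0.0),
    "Azure Document Intelligence": (0.01, 0.05),
    "AWS Textract": (0.015, 0.015),
    "Google Document AI": (0.015, 0.065),
}


def monthly_cost(engine: str, pages_per_month: int) -> tuple[float, float]:
    """Return the (low, high) monthly API cost for a given page volume."""
    low, high = PRICE_PER_PAGE[engine]
    return (round(low * pages_per_month, 2), round(high * pages_per_month, 2))


PAGES = 50_000  # hypothetical monthly volume
for engine in PRICE_PER_PAGE:
    low, high = monthly_cost(engine, PAGES)
    print(f"{engine}: ${low:,.2f} to ${high:,.2f} per month")
```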

LLM-Based Extraction

After OCR extracts raw text, LLMs extract structured data:

from datetime import date
from typing import Optional

from openai import OpenAI
from pydantic import BaseModel, Field

class LineItem(BaseModel):
    # Typed line items: a bare dict gives the model no schema to fill in
    description: str
    amount: float

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice number or ID")
    vendor_name: str = Field(description="Company issuing the invoice")
    vendor_address: Optional[str] = None
    invoice_date: date = Field(description="Date on the invoice")
    due_date: Optional[date] = None
    total_amount: float = Field(description="Total amount due")
    tax_amount: Optional[float] = None
    line_items: list[LineItem] = Field(description="Individual line items with descriptions and amounts")
    currency: str = Field(default="USD")

def extract_invoice_data(ocr_text):
    """Extract structured invoice data using an LLM."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract invoice data from the provided OCR text. Return as structured JSON."},
            {"role": "user", "content": ocr_text}
        ],
        response_format=Invoice
    )
    
    # parsed is an Invoice instance; it can be None if the model refuses
    return completion.choices[0].message.parsed

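Defining the output schema with Pydantic removes the need for regex-based extraction, which breaks whenever a layout changes. The LLM maps differently arranged invoices onto the same fields, and the schema rejects responses with missing required fields or wrong types.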

Handling Complex Documents

Tables: Tables are the hardest document element. Use specialized table extraction:

from unstructured.partition.auto import partition

# The unstructured library detects tables when using the hi_res strategy
elements = partition(filename="complex_report.pdf", strategy="hi_res")
tables = [el for el in elements if el.category == "Table"]
for table in tables:
    # text_as_html holds the table as an HTML string; it can be None if
    # table-structure inference is not enabled for this file type
    table_html = table.metadata.text_as_html
    print(table_html)
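
The text_as_html value is still just a string, so a further step is needed to get rows you can load into a database. The following is a minimal sketch using only the standard library. The TableRows class and html_table_to_records function are hypothetical helpers, and the sketch assumes a simple table whose first row is the header, with no merged cells or nested tables.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_records(html: str) -> list[dict]:
    """Treat the first row as the header and return one dict per data row."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


sample = (
    "<table><tr><th>Item</th><th>Amount</th></tr>"
    "<tr><td>Widget</td><td>19.99</td></tr></table>"
)
print(html_table_to_records(sample))
```

Cell values stay as strings here; converting amounts to numbers is left to the validation stage, where a failed conversion can be flagged rather than silently dropped.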

Multi-page forms: Maintain page context across extractions. Track page numbers in your output metadata to help validation.
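
As a sketch of what tracking page numbers can look like, the snippet below merges per-page extractions and records where each value came from. The PageField class, the merge_pages function, and the sample field names are all hypothetical; it assumes each page has already been extracted into a plain dict.

```python
from dataclasses import dataclass


@dataclass
class PageField:
    name: str
    value: str
    page: int  # 1-based page number where the value was found


def merge_pages(pages: list[dict]) -> tuple[dict, list[str]]:
    """Merge per-page extractions; keep the first value seen and report conflicts."""
    merged: dict[str, PageField] = {}
    conflicts: list[str] = []
    for page_number, fields in enumerate(pages, start=1):
        for name, value in fields.items():
            if name not in merged:
                merged[name] = PageField(name, value, page_number)
            elif merged[name].value != value:
                conflicts.append(
                    f"{name}: page {merged[name].page} says {merged[name].value!r}, "
                    f"page {page_number} says {value!r}"
                )
    return merged, conflicts


pages = [
    {"applicant_name": "J. Smith", "policy_number": "P-100"},
    {"policy_number": "P-108", "claim_amount": "250.00"},
]
merged, conflicts = merge_pages(pages)
print(merged["claim_amount"].page)
print(conflicts)
```

A conflict such as the two policy numbers above is a natural candidate for human review, and the page numbers tell the reviewer exactly where to look.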

Handwritten text: Vision-language models like GPT-4o and Claude 3.5 can read handwriting in scanned documents with surprising accuracy (85-95%):

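import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "claim_form.jpg" is a placeholder path for the scanned form
with open("claim_form.jpg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

handwriting_response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all handwritten information from this claim form, including dates, amounts, and signatures."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
            ]
        }
    ]
)
print(handwriting_response.choices[0].message.content)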

Validation and Confidence Scoring

Every extracted field should include a confidence score:

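from datetime import date
from pydantic import BaseModel

class ExtractionResult(BaseModel):
    field_name: str
    value: str
    confidence: float  # 0.0 to 1.0
    extraction_method: str  # "ocr", "llm", "vision", "regex", or "validation"
    alternates: list[str] = []  # other candidate values or reviewer notes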
    
def validate_invoice(invoice_data):
    """Apply business rules to validate extracted data."""
    issues = []
    
    if invoice_data.total_amount <= 0:
        issues.append(ExtractionResult(
            field_name="total_amount",
            value=str(invoice_data.total_amount),
            confidence=0.0,
            extraction_method="validation",
            alternates=[]
        ))
    
    if invoice_data.invoice_date > date.today():
        issues.append(ExtractionResult(
            field_name="invoice_date",
            value=str(invoice_data.invoice_date),
            confidence=0.3,
            extraction_method="validation",
            alternates=["Future date may be an error"]
        ))
    
    return issues

Route low-confidence extractions to human review automatically. Define confidence thresholds per field type.
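
A minimal sketch of per-field routing follows. The threshold values are invented for illustration and should be tuned against a sample of reviewed documents. To keep the example free of dependencies, results are plain dicts with the same field_name and confidence keys as ExtractionResult.

```python
# Hypothetical thresholds: money and dates get a stricter bar than names.
THRESHOLDS = {"total_amount": 0.95, "invoice_date": 0.90, "vendor_name": 0.80}
DEFAULT_THRESHOLD = 0.85


def route(results: list[dict]) -> dict:
    """Split results into auto-accepted fields and fields needing human review."""
    accepted, review = [], []
    for r in results:
        threshold = THRESHOLDS.get(r["field_name"], DEFAULT_THRESHOLD)
        if r["confidence"] >= threshold:
            accepted.append(r["field_name"])
        else:
            review.append(r["field_name"])
    return {"accepted": accepted, "review": review}


results = [
    {"field_name": "total_amount", "confidence": 0.91},
    {"field_name": "vendor_name", "confidence": 0.91},
    {"field_name": "currency", "confidence": 0.70},
]
print(route(results))
```

Note that the same 0.91 confidence passes for vendor_name but sends total_amount to review, which is the point of setting thresholds per field type.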

Performance and Scale

# Document processing pipeline architecture
services:
  preprocessing:  # FastAPI microservice
    image: doc-processor:latest
    scale: 3
    environment:
      PARALLEL_WORKERS: 8
      OCR_ENGINE: "azure"
  
  extraction-queue:  # RabbitMQ for buffering
    image: rabbitmq:3
  
  llm-extractor:
    image: llm-extractor:latest
    scale: 5
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      BATCH_SIZE: 10
  
  validation-service:
    image: validator:latest
    scale: 2
  
  human-review-queue:
    image: n8n:latest
    environment:
      SLACK_WEBHOOK: ${SLACK_WEBHOOK}

At scale, a queue-based architecture keeps LLM API rate limits from blocking the rest of the pipeline. Preprocessing and OCR run in parallel, while the LLM extraction workers pull documents from the queue in batches to use the API efficiently.
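
The worker side of that design can be sketched with the standard library alone. This is an in-process illustration, not a RabbitMQ client: queue.Queue stands in for the broker, RateLimited is a hypothetical stand-in for whatever error your LLM client raises on a rate limit, and the batch size of 10 mirrors BATCH_SIZE in the config above.

```python
import queue
import time


def drain_in_batches(q: queue.Queue, batch_size: int) -> list[list]:
    """Pull everything currently queued as batches of at most batch_size."""
    batches, batch = [], []
    while True:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)  # final partial batch
    return batches


class RateLimited(Exception):
    """Stand-in for the rate-limit error raised by an LLM client."""


def call_with_backoff(fn, batch, retries=3, base_delay=0.01, sleep=time.sleep):
    """Call fn(batch), doubling the wait after each rate-limit error."""
    for attempt in range(retries + 1):
        try:
            return fn(batch)
        except RateLimited:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


q = queue.Queue()
for i in range(25):
    q.put(f"doc-{i}")
batches = drain_in_batches(q, batch_size=10)
print([len(b) for b in batches])
```

Because documents wait in the queue while a worker backs off, a burst of uploads slows extraction down instead of failing it. In production the base delay would be on the order of seconds rather than the 0.01 used here.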

SoniNow builds end-to-end document processing pipelines that turn unstructured documents into clean, queryable data. Our AI automation services include OCR integration, LLM extraction, validation logic, and human review workflows.

Stop manually entering data from documents. Contact us to automate your document processing.
