AI Document Processing: Extracting and Structuring Data from Unstructured Documents

Invoices, contracts, receipts, forms, and reports—most business data lives in unstructured documents. AI-powered document processing can extract structured data from these documents with accuracy that rivals human data entry, at a fraction of the cost. Here's how to build a pipeline that turns messy documents into clean data.
The Document Processing Pipeline
A production pipeline moves each document from input to output through four processing stages:
Input → Preprocessing → OCR/Extraction → Parsing → Validation → Output
Each stage has specific optimization opportunities.
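As a rough sketch of how the stages chain together, the Python below treats each stage as a plain function and records which stage failed, if any. The names `PipelineResult`, `run_pipeline`, and the lambda stand-ins in `STAGES` are hypothetical placeholders for illustration, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class PipelineResult:
    """Carries the current payload plus a record of the stages that ran."""
    payload: Any
    completed: list = field(default_factory=list)
    error: Optional[str] = None


def run_pipeline(document: Any, stages: list) -> PipelineResult:
    """Run (name, function) stages in order, stopping at the first failure."""
    result = PipelineResult(payload=document)
    for name, stage in stages:
        try:
            result.payload = stage(result.payload)
        except Exception as exc:
            # Record the failing stage instead of crashing the whole run
            result.error = f"{name}: {exc}"
            break
        result.completed.append(name)
    return result


# Hypothetical stand-ins for the real stage implementations
STAGES = [
    ("preprocessing", lambda doc: doc.strip()),
    ("ocr_extraction", lambda doc: {"text": doc}),
    ("parsing", lambda data: {**data, "fields": data["text"].split()}),
    ("validation", lambda data: {**data, "valid": len(data["fields"]) > 0}),
]

result = run_pipeline("  INV-1 total 10  ", STAGES)
print(result.completed, result.error)
```

Keeping the stage name next to any error makes it easier to see where in the pipeline a given document type tends to fail.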
Preprocessing: Garbage In, Garbage Out
Document quality varies wildly. Preprocessing normalizes inputs before extraction:
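import cv2
import numpy as np

def preprocess_document(image_path):
    """Normalize document images for optimal OCR."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Deskew - correct page rotation.
    # Binarize (inverted) first so only text pixels feed the angle estimate;
    # on a raw grayscale page nearly every pixel is > 0.
    text_mask = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(text_mask > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # minAreaRect's angle range differs between OpenCV versions, so fold it
    # into [-45, 45]; confirm the rotation direction on a known-skewed sample.
    if angle < -45:
        angle = 90 + angle
    elif angle > 45:
        angle = angle - 90
    h, w = img.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                         borderMode=cv2.BORDER_REPLICATE)

    # Adaptive thresholding for varying lighting
    img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 31, 2)

    # Denoise
    img = cv2.fastNlMeansDenoising(img, h=30)
    return img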
Key preprocessing steps:
- Deskewing: Even 2° of rotation can degrade OCR accuracy by roughly 15-20%, depending on the engine and scan quality
- Contrast enhancement: Use CLAHE for low-contrast scans
- Noise reduction: Critical for faxed documents
- Dilation/erosion: Connect broken characters in poor-quality scans
- Page splitting: Detect and separate multi-page layouts
OCR: Choosing the Right Engine
| Engine | Accuracy | Speed | Structured Output | Cost |
|--------|----------|-------|-------------------|------|
| Tesseract 5 | 85-92% (clean) | Fast | Text + bounding boxes | Free |
| Azure Document Intelligence | 95-98% | Medium | Key-value, tables, checkboxes | $0.01-0.05/page |
| AWS Textract | 93-97% | Medium | Forms, tables, signatures | $0.015/page |
| Google Document AI | 94-98% | Medium | Entity extraction, classification | $0.015-0.065/page |
| LlamaParse (LlamaIndex) | 95-98% | Slow | Markdown, tables, images | Free tier + API |
Recommendation: Use Azure Document Intelligence or Google Document AI for production workloads where accuracy matters. Use Tesseract for internal tools or high-volume, low-value documents. LlamaParse is excellent for ingesting complex PDFs into LLM-friendly formats for RAG systems.
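To compare engines on cost, the per-page prices can be turned into a monthly estimate. The sketch below copies the price ranges from the table above, which may be out of date, so check current vendor pricing. LlamaParse is left out because the table gives no per-page figure for it. The function name `estimate_monthly_cost` and the 10,000-page volume are illustrative assumptions.

```python
# Per-page price ranges in USD, copied from the comparison table.
# These are assumptions for illustration; verify against current vendor pricing.
PRICE_PER_PAGE = {
    "Tesseract 5": (0.0, 0.0),
    "Azure Document Intelligence": (0.01, 0.05),
    "AWS Textract": (0.015, 0.015),
    "Google Document AI": (0.015, 0.065),
}


def estimate_monthly_cost(engine: str, pages_per_month: int) -> tuple:
    """Return the (low, high) monthly OCR spend in USD for one engine."""
    low, high = PRICE_PER_PAGE[engine]
    return round(low * pages_per_month, 2), round(high * pages_per_month, 2)


for engine in PRICE_PER_PAGE:
    low, high = estimate_monthly_cost(engine, 10_000)
    print(f"{engine}: ${low:,.2f} to ${high:,.2f} per month")
```

The estimate covers OCR fees only. Tesseract has no per-page fee but still carries hosting and maintenance costs that this calculation ignores.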
LLM-Based Extraction
After OCR extracts raw text, LLMs extract structured data:
from datetime import date
from typing import Optional

from openai import OpenAI
from pydantic import BaseModel, Field

class LineItem(BaseModel):
    # A typed model instead of a bare dict, so the output schema is fully specified
    description: str
    amount: float

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice number or ID")
    vendor_name: str = Field(description="Company issuing the invoice")
    vendor_address: Optional[str] = None
    invoice_date: date = Field(description="Date on the invoice")
    due_date: Optional[date] = None
    total_amount: float = Field(description="Total amount due")
    tax_amount: Optional[float] = None
    line_items: list[LineItem] = Field(description="Individual line items with descriptions and amounts")
    currency: str = Field(default="USD")

def extract_invoice_data(ocr_text):
    """Extract structured invoice data using an LLM."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract invoice data from the provided OCR text. Return as structured JSON."},
            {"role": "user", "content": ocr_text}
        ],
        response_format=Invoice
    )
    # .parsed is an Invoice instance validated against the schema above
    return completion.choices[0].message.parsed
Structured output via Pydantic eliminates the need for regex-based extraction that breaks when layouts change. The LLM handles layout variations naturally.
Handling Complex Documents
Tables: Tables are the hardest document element. Use specialized table extraction:
from unstructured.partition.auto import partition

# Unstructured library handles complex table extraction
elements = partition("complex_report.pdf", strategy="hi_res")
tables = [el for el in elements if el.category == "Table"]

for table in tables:
    # HTML representation of the table (may be None if no table structure was inferred)
    table_html = table.metadata.text_as_html
    print(table_html)
Multi-page forms: Maintain page context across extractions. Track page numbers in your output metadata to help validation.
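One way to keep page context is to attach a page number to every extracted value and merge the per-page results afterwards. The sketch below is a minimal illustration: `PageField` and `merge_pages` are hypothetical names, and the rule of keeping the first value seen and flagging later disagreements is an assumption rather than a fixed convention.

```python
from dataclasses import dataclass


@dataclass
class PageField:
    """One extracted value plus the page it came from."""
    name: str
    value: str
    page: int


def merge_pages(fields: list) -> dict:
    """Merge per-page extractions into one record per field name.

    Keeps the value from the earliest page, records every page where the
    field appeared, and flags fields whose value differs between pages.
    """
    merged = {}
    for f in sorted(fields, key=lambda item: item.page):
        entry = merged.setdefault(
            f.name, {"value": f.value, "pages": [], "conflict": False}
        )
        entry["pages"].append(f.page)
        if f.value != entry["value"]:
            entry["conflict"] = True  # same field, different value on a later page
    return merged


fields = [
    PageField("claim_id", "C-100", 2),
    PageField("claim_id", "C-100", 1),
    PageField("total", "50.00", 1),
    PageField("total", "55.00", 3),
]
for name, entry in merge_pages(fields).items():
    print(name, entry)
```

The `pages` list gives a reviewer the exact pages to open, and the `conflict` flag gives the validation stage a signal that a field needs a second look.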
Handwritten text: Vision-language models like GPT-4o and Claude 3.5 can read handwriting in scanned documents with surprisingly high accuracy (85-95%):
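import base64
from openai import OpenAI

client = OpenAI()

# "claim_form.jpg" is a placeholder path for the scanned form
with open("claim_form.jpg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

handwriting_response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all handwritten information from this claim form, including dates, amounts, and signatures."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
            ]
        }
    ]
)
print(handwriting_response.choices[0].message.content)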
Validation and Confidence Scoring
Every extracted field should include a confidence score:
from datetime import date
from pydantic import BaseModel

class ExtractionResult(BaseModel):
    field_name: str
    value: str
    confidence: float  # 0.0 to 1.0
    extraction_method: str  # "ocr", "llm", "vision", "regex", "validation"
    alternates: list[str] = []

def validate_invoice(invoice_data):
    """Apply business rules to validate extracted data."""
    issues = []

    # A zero or negative total is almost certainly an extraction error
    if invoice_data.total_amount <= 0:
        issues.append(ExtractionResult(
            field_name="total_amount",
            value=str(invoice_data.total_amount),
            confidence=0.0,
            extraction_method="validation",
            alternates=[]
        ))

    # A future invoice date is possible but suspicious, so lower the confidence
    if invoice_data.invoice_date > date.today():
        issues.append(ExtractionResult(
            field_name="invoice_date",
            value=str(invoice_data.invoice_date),
            confidence=0.3,
            extraction_method="validation",
            alternates=["Future date may be an error"]
        ))

    return issues
Route low-confidence extractions to human review automatically. Define confidence thresholds per field type.
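Per-field thresholds can be expressed as a small lookup table with a default. The sketch below is one possible shape. The threshold values, `FIELD_THRESHOLDS`, `needs_review`, and `route` are all illustrative assumptions and should be tuned against your own review data.

```python
# Hypothetical thresholds: stricter for monetary and identifying fields
FIELD_THRESHOLDS = {
    "total_amount": 0.95,
    "invoice_number": 0.90,
    "vendor_address": 0.70,
}
DEFAULT_THRESHOLD = 0.80


def needs_review(field_name: str, confidence: float) -> bool:
    """Return True when the confidence falls below the field's threshold."""
    threshold = FIELD_THRESHOLDS.get(field_name, DEFAULT_THRESHOLD)
    return confidence < threshold


def route(extractions: dict) -> dict:
    """Split field names into auto-accepted and human-review buckets."""
    routed = {"accepted": [], "review": []}
    for name, confidence in extractions.items():
        bucket = "review" if needs_review(name, confidence) else "accepted"
        routed[bucket].append(name)
    return routed


print(route({
    "total_amount": 0.93,
    "invoice_number": 0.97,
    "vendor_address": 0.75,
    "due_date": 0.60,
}))
```

A stricter threshold on amounts sends more documents to review but reduces the chance of a wrong total reaching downstream systems.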
Performance and Scale
# Document processing pipeline architecture
services:
  preprocessing:  # FastAPI microservice
    image: doc-processor:latest
    scale: 3
    environment:
      PARALLEL_WORKERS: 8
      OCR_ENGINE: "azure"
  extraction-queue:  # RabbitMQ for buffering
    image: rabbitmq:3
  llm-extractor:
    image: llm-extractor:latest
    scale: 5
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      BATCH_SIZE: 10
  validation-service:
    image: validator:latest
    scale: 2
  human-review-queue:
    image: n8nio/n8n:latest
    environment:
      SLACK_WEBHOOK: ${SLACK_WEBHOOK}
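At scale, a queue-based architecture keeps LLM API rate limits from blocking the rest of the pipeline. Preprocessing and OCR workers run in parallel across documents and push their output onto the queue; the LLM extraction workers pull documents off the queue in batches for efficient API usage.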
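The batching and rate-limit handling can be sketched with the standard library alone. The code below is a single-process illustration under stated assumptions: an in-memory `queue.Queue` stands in for RabbitMQ, and `RateLimited` is a hypothetical stand-in for whatever rate-limit error your API client raises.

```python
import queue
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the API client's rate-limit error."""


def drain_batch(q, batch_size):
    """Take up to batch_size documents from the queue without blocking."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


def call_with_backoff(fn, batch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(batch), retrying with exponential backoff on rate limits."""
    for attempt in range(max_retries + 1):
        try:
            return fn(batch)
        except RateLimited:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


def process_all(q, batch_size, extract_batch):
    """Drain the queue in batches and collect the extractor's results."""
    results = []
    while True:
        batch = drain_batch(q, batch_size)
        if not batch:
            break
        results.extend(call_with_backoff(extract_batch, batch))
    return results


# Usage with a fake extractor that just uppercases document IDs
q = queue.Queue()
for i in range(25):
    q.put(f"doc-{i}")
print(len(process_all(q, 10, lambda batch: [d.upper() for d in batch])))
```

Because documents wait in the queue while a worker backs off, a rate-limited extractor slows only its own stage while preprocessing and OCR keep running.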
SoniNow builds end-to-end document processing pipelines that turn unstructured documents into clean, queryable data. Our AI automation services include OCR integration, LLM extraction, validation logic, and human review workflows.
Stop manually entering data from documents. Contact us to automate your document processing.