2024 · Real Estate / Legal Tech

Document Intelligence for Real Estate

Where we learned that OCR is the real bottleneck, hybrid search is non-negotiable, and title verification is fundamentally a graph problem

AI · RAG · Document Analysis · Multi-Language · Hybrid Search · Neo4j · AWS · RAG Quality Recovery

What it is

A document intelligence system that validates property title cleanliness across 1,200 properties in six Indian metros — Bengaluru, Hyderabad, Chennai, Mumbai, Pune, and NCR. The system processed approximately 18,000 documents (roughly 55,000-65,000 pages) covering sale deeds, encumbrance certificates, revenue records, mortgage documents, and mutation records in English, Hindi, Kannada, Telugu, Tamil, and Marathi.

What makes it different

  • Multi-language processing as the norm. A single property's document set routinely contains content in two or three languages, sometimes mixed within a single page. Language detection runs per page, not per document.
  • Three-tier OCR. AWS Textract for 85% of pages, multimodal LLMs (Claude, Gemini) for degraded scans, human-in-the-loop for the remaining 12-13% — handwritten deeds from the 1970s-80s, damaged documents, and registrar stamps.
  • Hybrid search. pgvector for semantic retrieval combined with Elasticsearch BM25 for exact matches on survey numbers, registration numbers, and party names — merged through reciprocal rank fusion and cross-encoder re-ranking.
  • Title verification as a graph problem. Neo4j models the property ownership graph — parties, properties, transactions, and their relationships — making chain-of-title analysis dramatically cleaner than SQL workarounds.
  • Entity resolution across scripts. The same person appears as "Lakshmi Narayana," "ಲಕ್ಷ್ಮೀ ನಾರಾಯಣ," and "Laxmi Narayan" across documents. Phonetic matching tuned for Indian naming conventions, Jaro-Winkler similarity, and relationship qualifiers (S/o, W/o, D/o) resolve 90% of name pairs automatically.
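Per-page language detection can be sketched by counting Unicode script blocks, since the corpus's scripts (Devanagari for Hindi/Marathi, Kannada, Telugu, Tamil, Latin) occupy disjoint ranges. This is a minimal illustration, not the production detector; the `min_share` threshold and function names are assumptions.

```python
from collections import Counter

# Unicode block ranges for the scripts in this corpus. Devanagari covers
# both Hindi and Marathi; Latin covers English text and transliterations.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "telugu":     (0x0C00, 0x0C7F),
    "kannada":    (0x0C80, 0x0CFF),
    "tamil":      (0x0B80, 0x0BFF),
    "latin":      (0x0041, 0x007A),
}

def detect_scripts(page_text: str, min_share: float = 0.15) -> list[str]:
    """Return every script holding at least `min_share` of the letters on
    a page -- detection runs per page, because pages mix scripts."""
    counts: Counter = Counter()
    letters = 0
    for ch in page_text:
        if not ch.isalpha():
            continue
        letters += 1
        cp = ord(ch)
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[script] += 1
                break
    if letters == 0:
        return []
    return sorted(s for s, n in counts.items() if n / letters >= min_share)
```

A mixed English/Kannada line like `"Sale deed ಲಕ್ಷ್ಮೀ ನಾರಾಯಣ"` then routes to both an English and a Kannada pipeline, rather than being forced into whichever language dominates the document.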
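The fusion step of the hybrid search can be sketched in a few lines. Reciprocal rank fusion scores each document by summed reciprocal ranks across the pgvector and BM25 result lists; `k = 60` is the constant from the original RRF formulation. The cross-encoder re-ranking stage that follows is omitted here, and the chunk IDs are illustrative.

```python
def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1/(k + rank) for every
    document it returns; documents found by both retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists for one query:
semantic = ["chunk_12", "chunk_7", "chunk_33"]   # pgvector cosine ranking
bm25     = ["chunk_7", "chunk_91", "chunk_12"]   # exact survey-number hits
fused = rrf_merge([semantic, bm25])
```

`chunk_7` wins because both retrievers rank it highly, even though neither put it first — exactly the behavior wanted when a survey number must match exactly but surrounding context still matters.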
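The chain-of-title check the graph enables reduces to: walk transfers in date order and flag any transfer whose seller was not the owner established by the previous link. In Neo4j this is a variable-length path match over TRANSFERRED edges; the pure-Python sketch below shows the same logic over an in-memory edge list (names and the `Transfer` type are illustrative).

```python
from dataclasses import dataclass

@dataclass
class Transfer:
    """One TRANSFERRED edge in the ownership graph: seller -> buyer."""
    year: int
    seller: str
    buyer: str

def chain_breaks(transfers: list[Transfer], original_owner: str) -> list[Transfer]:
    """Walk transfers chronologically; collect every transfer whose seller
    does not match the owner established by the preceding link."""
    breaks = []
    owner = original_owner
    for t in sorted(transfers, key=lambda t: t.year):
        if t.seller != owner:
            breaks.append(t)  # gap or competing claim in the title
        owner = t.buyer
    return breaks

deed_chain = [
    Transfer(1978, "Ramaiah", "Lakshmi Narayana"),
    Transfer(1994, "Lakshmi Narayana", "Venkatesh"),
    Transfer(2011, "Suresh", "Priya"),  # Suresh never held title: a break
]
```

In SQL this check needs recursive CTEs and careful self-joins; as a graph traversal it is one pass over an ordered path, which is why the graph model pays off.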
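The string-similarity leg of entity resolution can be sketched as qualifier stripping plus Jaro-Winkler. This omits the phonetic-matching stage and compares whole normalized names, which is a simplification; the threshold and helper names are assumptions.

```python
import re

def normalize(name: str) -> str:
    """Lowercase and strip relationship qualifiers (S/o, W/o, D/o) and the
    relative's name that follows them, before comparing."""
    name = re.sub(r"\b[SWD]/o\.?\s+.*$", "", name, flags=re.I)
    return " ".join(name.lower().split())

def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):                      # count matches in window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    k = trans = 0                                   # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                trans += 1
            k += 1
    trans //= 2
    return (matches/len(s1) + matches/len(s2) + (matches - trans)/matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro plus a bonus for a shared prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

With this, "Laxmi Narayan" and "Lakshmi Narayana S/o Ramaiah" score above 0.9 after normalization — the kind of pair the system auto-resolves, leaving only low-scoring pairs for review.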

What we learned building it

  • OCR consumed more engineering effort than retrieval and generation combined. Scan quality degrades with document age. A 2020 sale deed is clean digital text; a 1990 deed is a photocopy of a photocopy.
  • Naive RAG chunking breaks immediately on legal documents. Sentence boundaries don't map to semantic boundaries. Table headers get lost across pages. Cross-references become dead links between chunks.
  • Pre-retrieval access control, not post-retrieval filtering. Property documents contain Aadhaar numbers, PAN details, bank account information. If the vector database returns a restricted chunk and the system filters it afterward, the retrieval pattern still leaks that restricted content exists.
  • The debug hierarchy: data preparation first, retrieval second, generation last. 60% of debugging time was spent upstream of the LLM — on OCR accuracy and chunking boundaries, not prompt engineering.
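The access-control point above can be made concrete with a toy in-memory retriever: the ACL predicate runs *before* similarity ranking, so restricted chunks never enter the candidate set and nothing about their existence leaks into ranks or scores. In the pgvector deployment this corresponds to a metadata `WHERE` clause in the same query as the distance `ORDER BY`; the chunk store and role names here are illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x*x for x in a)) * math.sqrt(sum(x*x for x in b)))

# Toy chunk store: embedding plus access-control metadata.
chunks = [
    {"id": "c1", "emb": [1.0, 0.0], "acl": {"analyst", "admin"}},
    {"id": "c2", "emb": [0.9, 0.1], "acl": {"admin"}},  # restricted chunk
    {"id": "c3", "emb": [0.0, 1.0], "acl": {"analyst", "admin"}},
]

def retrieve(query_emb: list[float], role: str, top_k: int = 2) -> list[str]:
    # Pre-retrieval filter: chunks the role cannot see are excluded BEFORE
    # ranking, so a restricted chunk can never displace a visible one.
    visible = [c for c in chunks if role in c["acl"]]
    visible.sort(key=lambda c: cosine(query_emb, c["emb"]), reverse=True)
    return [c["id"] for c in visible[:top_k]]
```

An analyst querying near `c2`'s embedding simply gets the next-best visible chunks; with post-retrieval filtering the same query would return one result fewer, revealing that something was withheld.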
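One mitigation for the chunking failures above is to split on legal structure rather than sentence counts, and to repeat the governing heading into every oversized slice so clause and table context survives chunk boundaries. A minimal sketch, assuming a deed layout with headings like WHEREAS and SCHEDULE A; the heading pattern is illustrative, not the production set.

```python
import re

# Clause-level headings common in Indian deeds; illustrative, not exhaustive.
HEADING = re.compile(r"^(SCHEDULE\s+[A-Z]|WHEREAS|CLAUSE\s+\d+)\b", re.M)

def chunk_deed(text: str, max_chars: int = 800) -> list[str]:
    """Split on legal structure first, then on size, re-prepending the
    active heading so later slices keep their context."""
    starts = [m.start() for m in HEADING.finditer(text)]
    bounds = [0] + starts + [len(text)]
    chunks = []
    for a, b in zip(bounds, bounds[1:]):
        section = text[a:b].strip()
        if not section:
            continue
        heading = section.splitlines()[0] if a in starts else "PREAMBLE"
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            for i in range(0, len(section), max_chars):
                piece = section[i:i + max_chars]
                if i > 0:  # repeat the governing heading on later slices
                    piece = heading + "\n" + piece
                chunks.append(piece)
    return chunks

deed = ("This deed of sale is made at Bengaluru\n"
        "WHEREAS the vendor holds absolute title\n"
        "SCHEDULE A\nAll that piece and parcel of land")
```

Chunk boundaries now fall between clauses, so a retrieval hit on "SCHEDULE A" carries its heading with it instead of arriving as orphaned table rows.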

Under the hood

  • Data stores: pgvector (PostgreSQL) for 200-300K chunks with composite indexes, Elasticsearch for BM25 keyword search, Neo4j for ownership graph, PostgreSQL for audit logs and metadata
  • Ingestion: S3 + Lambda (ClamAV scanning) → ECS Fargate (OCR, classification) → Step Functions (orchestration) → embeddings → vector store + graph
  • LLM stack: Claude 3.5 Sonnet (bulk structured extraction), Claude 3 Opus (complex legal reasoning), Gemini 1.5 Pro (selective cross-document consistency checks)
  • Language: AI4Bharat IndicTrans2 for Indic-to-English translation before embedding
  • Observability: CloudWatch + Langfuse + custom Grafana dashboards
  • Security: Pre-retrieval metadata filtering, full audit trail, encryption at rest and in transit

Read the full architecture deep-dive →