2024 · Real Estate / Legal Tech

Document Intelligence for Real Estate

Where we learned that OCR is the real bottleneck, hybrid search is non-negotiable, and title verification is fundamentally a graph problem

AI · RAG · Document Analysis · Multi-Language · Hybrid Search · Neo4j · AWS · RAG Quality Recovery

What it is

A document intelligence system that validates property title cleanliness across 1,200 properties in six Indian metros — Bengaluru, Hyderabad, Chennai, Mumbai, Pune, and NCR. The system processed approximately 18,000 documents (roughly 55,000-65,000 pages) covering sale deeds, encumbrance certificates, revenue records, mortgage documents, and mutation records in English, Hindi, Kannada, Telugu, Tamil, and Marathi.

What makes it different

  • Multi-language processing as the norm. A single property's document set routinely contains content in two or three languages, sometimes mixed within a single page. Language detection runs per page, not per document.
  • Three-tier OCR. AWS Textract for 85% of pages, multimodal LLMs (Claude, Gemini) for degraded scans, human-in-the-loop for the remaining 12-13% — handwritten deeds from the 1970s-80s, damaged documents, and registrar stamps.
  • Hybrid search. pgvector for semantic retrieval combined with Elasticsearch BM25 for exact matches on survey numbers, registration numbers, and party names — merged through reciprocal rank fusion and cross-encoder re-ranking.
  • Title verification as a graph problem. Neo4j models the property ownership graph — parties, properties, transactions, and their relationships — making chain-of-title analysis dramatically cleaner than SQL workarounds.
  • Entity resolution across scripts. The same person appears as "Lakshmi Narayana," "ಲಕ್ಷ್ಮೀ ನಾರಾಯಣ," and "Laxmi Narayan" across documents. Phonetic matching tuned for Indian naming conventions, Jaro-Winkler similarity, and relationship qualifiers (S/o, W/o, D/o) resolve 90% of name pairs automatically.
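Per-page language detection can be sketched by counting Unicode script blocks, since the corpus's scripts (Devanagari for Hindi/Marathi, Kannada, Telugu, Tamil, Latin) occupy disjoint ranges. This is a minimal illustration, not the production detector; the `min_share` threshold and function names are assumptions.

```python
from collections import Counter

# Unicode block ranges for the scripts in this corpus. Devanagari covers
# both Hindi and Marathi; Latin covers English text and transliterations.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "telugu":     (0x0C00, 0x0C7F),
    "kannada":    (0x0C80, 0x0CFF),
    "tamil":      (0x0B80, 0x0BFF),
    "latin":      (0x0041, 0x007A),
}

def detect_scripts(page_text: str, min_share: float = 0.15) -> list[str]:
    """Return every script holding at least `min_share` of the letters on
    a page -- detection runs per page, because pages mix scripts."""
    counts: Counter = Counter()
    letters = 0
    for ch in page_text:
        if not ch.isalpha():
            continue
        letters += 1
        cp = ord(ch)
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[script] += 1
                break
    if letters == 0:
        return []
    return sorted(s for s, n in counts.items() if n / letters >= min_share)
```

A mixed English/Kannada line like `"Sale deed ಲಕ್ಷ್ಮೀ ನಾರಾಯಣ"` then routes to both an English and a Kannada pipeline, rather than being forced into whichever language dominates the document.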
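The fusion step of the hybrid search can be sketched in a few lines. Reciprocal rank fusion scores each document by summed reciprocal ranks across the pgvector and BM25 result lists; `k = 60` is the constant from the original RRF formulation. The cross-encoder re-ranking stage that follows is omitted here, and the chunk IDs are illustrative.

```python
def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1/(k + rank) for every
    document it returns; documents found by both retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists for one query:
semantic = ["chunk_12", "chunk_7", "chunk_33"]   # pgvector cosine ranking
bm25     = ["chunk_7", "chunk_91", "chunk_12"]   # exact survey-number hits
fused = rrf_merge([semantic, bm25])
```

`chunk_7` wins because both retrievers rank it highly, even though neither put it first — exactly the behavior wanted when a survey number must match exactly but surrounding context still matters.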
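The chain-of-title check the graph enables reduces to: walk transfers in date order and flag any transfer whose seller was not the owner established by the previous link. In Neo4j this is a variable-length path match over TRANSFERRED edges; the pure-Python sketch below shows the same logic over an in-memory edge list (names and the `Transfer` type are illustrative).

```python
from dataclasses import dataclass

@dataclass
class Transfer:
    """One TRANSFERRED edge in the ownership graph: seller -> buyer."""
    year: int
    seller: str
    buyer: str

def chain_breaks(transfers: list[Transfer], original_owner: str) -> list[Transfer]:
    """Walk transfers chronologically; collect every transfer whose seller
    does not match the owner established by the preceding link."""
    breaks = []
    owner = original_owner
    for t in sorted(transfers, key=lambda t: t.year):
        if t.seller != owner:
            breaks.append(t)  # gap or competing claim in the title
        owner = t.buyer
    return breaks

deed_chain = [
    Transfer(1978, "Ramaiah", "Lakshmi Narayana"),
    Transfer(1994, "Lakshmi Narayana", "Venkatesh"),
    Transfer(2011, "Suresh", "Priya"),  # Suresh never held title: a break
]
```

In SQL this check needs recursive CTEs and careful self-joins; as a graph traversal it is one pass over an ordered path, which is why the graph model pays off.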
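The string-similarity leg of entity resolution can be sketched as qualifier stripping plus Jaro-Winkler. This omits the phonetic-matching stage and compares whole normalized names, which is a simplification; the threshold and helper names are assumptions.

```python
import re

def normalize(name: str) -> str:
    """Lowercase and strip relationship qualifiers (S/o, W/o, D/o) and the
    relative's name that follows them, before comparing."""
    name = re.sub(r"\b[SWD]/o\.?\s+.*$", "", name, flags=re.I)
    return " ".join(name.lower().split())

def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):                      # count matches in window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    k = trans = 0                                   # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                trans += 1
            k += 1
    trans //= 2
    return (matches/len(s1) + matches/len(s2) + (matches - trans)/matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro plus a bonus for a shared prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

With this, "Laxmi Narayan" and "Lakshmi Narayana S/o Ramaiah" score above 0.9 after normalization — the kind of pair the system auto-resolves, leaving only low-scoring pairs for review.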

What we learned building it

  • OCR consumed more engineering effort than retrieval and generation combined. Scan quality degrades with document age. A 2020 sale deed is clean digital text; a 1990 deed is a photocopy of a photocopy.
  • Naive RAG chunking breaks immediately on legal documents. Sentence boundaries don't map to semantic boundaries. Table headers get lost across pages. Cross-references become dead links between chunks.
  • Pre-retrieval access control, not post-retrieval filtering. Property documents contain Aadhaar numbers, PAN details, bank account information. If the vector database returns a restricted chunk and the system filters it afterward, the retrieval pattern still leaks that restricted content exists.
  • The debug hierarchy: data preparation first, retrieval second, generation last. 60% of debugging time was spent upstream of the LLM — on OCR accuracy and chunking boundaries, not prompt engineering.
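The access-control point above can be made concrete with a toy in-memory retriever: the ACL predicate runs *before* similarity ranking, so restricted chunks never enter the candidate set and nothing about their existence leaks into ranks or scores. In the pgvector deployment this corresponds to a metadata `WHERE` clause in the same query as the distance `ORDER BY`; the chunk store and role names here are illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x*x for x in a)) * math.sqrt(sum(x*x for x in b)))

# Toy chunk store: embedding plus access-control metadata.
chunks = [
    {"id": "c1", "emb": [1.0, 0.0], "acl": {"analyst", "admin"}},
    {"id": "c2", "emb": [0.9, 0.1], "acl": {"admin"}},  # restricted chunk
    {"id": "c3", "emb": [0.0, 1.0], "acl": {"analyst", "admin"}},
]

def retrieve(query_emb: list[float], role: str, top_k: int = 2) -> list[str]:
    # Pre-retrieval filter: chunks the role cannot see are excluded BEFORE
    # ranking, so a restricted chunk can never displace a visible one.
    visible = [c for c in chunks if role in c["acl"]]
    visible.sort(key=lambda c: cosine(query_emb, c["emb"]), reverse=True)
    return [c["id"] for c in visible[:top_k]]
```

An analyst querying near `c2`'s embedding simply gets the next-best visible chunks; with post-retrieval filtering the same query would return one result fewer, revealing that something was withheld.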
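One mitigation for the chunking failures above is to split on legal structure rather than sentence counts, and to repeat the governing heading into every oversized slice so clause and table context survives chunk boundaries. A minimal sketch, assuming a deed layout with headings like WHEREAS and SCHEDULE A; the heading pattern is illustrative, not the production set.

```python
import re

# Clause-level headings common in Indian deeds; illustrative, not exhaustive.
HEADING = re.compile(r"^(SCHEDULE\s+[A-Z]|WHEREAS|CLAUSE\s+\d+)\b", re.M)

def chunk_deed(text: str, max_chars: int = 800) -> list[str]:
    """Split on legal structure first, then on size, re-prepending the
    active heading so later slices keep their context."""
    starts = [m.start() for m in HEADING.finditer(text)]
    bounds = [0] + starts + [len(text)]
    chunks = []
    for a, b in zip(bounds, bounds[1:]):
        section = text[a:b].strip()
        if not section:
            continue
        heading = section.splitlines()[0] if a in starts else "PREAMBLE"
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            for i in range(0, len(section), max_chars):
                piece = section[i:i + max_chars]
                if i > 0:  # repeat the governing heading on later slices
                    piece = heading + "\n" + piece
                chunks.append(piece)
    return chunks

deed = ("This deed of sale is made at Bengaluru\n"
        "WHEREAS the vendor holds absolute title\n"
        "SCHEDULE A\nAll that piece and parcel of land")
```

Chunk boundaries now fall between clauses, so a retrieval hit on "SCHEDULE A" carries its heading with it instead of arriving as orphaned table rows.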

Under the hood

  • Data stores: pgvector (PostgreSQL) for 200-300K chunks with composite indexes, Elasticsearch for BM25 keyword search, Neo4j for ownership graph, PostgreSQL for audit logs and metadata
  • Ingestion: S3 + Lambda (ClamAV scanning) → ECS Fargate (OCR, classification) → Step Functions (orchestration) → embeddings → vector store + graph
  • LLM stack: Claude 3.5 Sonnet (bulk structured extraction), Claude 3 Opus (complex legal reasoning), Gemini 1.5 Pro (selective cross-document consistency checks)
  • Language: AI4Bharat IndicTrans2 for Indic-to-English translation before embedding
  • Observability: CloudWatch + Langfuse + custom Grafana dashboards
  • Security: Pre-retrieval metadata filtering, full audit trail, encryption at rest and in transit

Read the full architecture deep-dive →