Scanned PDF Extraction
Document AI supports scanned (image-only) PDFs through two pipeline types that produce the same structured output as the digital pipeline. The existing formatter and extractor work unchanged — only the parser differs.
VLM Pipeline (recommended)
The VLM pipeline sends page images to a vision-capable LLM that performs layout detection and OCR in a single call. No additional models or components needed — it reuses the same LLM used for extraction.
PDFProcessor
from doc_intelligence import PDFProcessor, ParseStrategy, ScannedPipelineType
processor = PDFProcessor(
    provider="openai",
    strategy=ParseStrategy.SCANNED,
    scanned_pipeline=ScannedPipelineType.VLM,
)
result = processor.extract("scanned_invoice.pdf", Invoice)
DocumentProcessor (full control)
from doc_intelligence import DocumentProcessor
from doc_intelligence.pdf.parser import PDFParser
from doc_intelligence.pdf.formatter import PDFFormatter
from doc_intelligence.pdf.extractor import PDFExtractor
from doc_intelligence.pdf.types import ParseStrategy, ScannedPipelineType
from doc_intelligence.llm import OpenAILLM
llm = OpenAILLM(model="gpt-4o")
processor = DocumentProcessor(
    parser=PDFParser(
        strategy=ParseStrategy.SCANNED,
        scanned_pipeline=ScannedPipelineType.VLM,
        llm=llm,
    ),
    formatter=PDFFormatter(),
    extractor=PDFExtractor(llm),
)
result = processor.extract("scanned_invoice.pdf", Invoice)
Batch size
Controls how many pages are sent per VLM call. Higher values reduce the number of API calls but increase per-call token usage.
processor = PDFProcessor(
    provider="openai",
    strategy=ParseStrategy.SCANNED,
    scanned_pipeline=ScannedPipelineType.VLM,
    vlm_batch_size=5,  # 5 pages per call
)
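To make the trade-off concrete, here is a minimal sketch (not the library's internal code) of how pages can be grouped into batches. A 12-page document with a batch size of 5 yields three VLM calls instead of twelve:

```python
def batch_pages(pages, batch_size):
    """Group pages into consecutive batches of at most batch_size."""
    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]

# A 12-page document split into batches of 5: sizes 5, 5, 2
batches = batch_pages(list(range(12)), batch_size=5)
```

Each batch becomes one API call, so the last (partial) batch still costs a full round trip.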
Two-Stage Pipeline (not yet implemented)
Warning
The two-stage pipeline (ScannedPipelineType.TWO_STAGE) is not yet
implemented. Selecting it will raise NotImplementedError. Use the
VLM pipeline above instead.
The two-stage pipeline will use separate layout detection and OCR models. You
will supply your own BaseLayoutDetector and BaseOCREngine implementations.
See the Custom OCR Components guide for the planned contracts.
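As a hypothetical sketch of what such implementations might look like (only the names BaseLayoutDetector and BaseOCREngine come from the library; the Region type, method names, and signatures below are illustrative stand-ins, not the actual API — see the Custom OCR Components guide for the planned contracts):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Region:
    kind: str    # "text", "table", "figure", ...
    bbox: tuple  # (x0, y0, x1, y1) in pixels

class BaseLayoutDetector(ABC):
    @abstractmethod
    def detect(self, page_image) -> list[Region]:
        """Segment a rasterised page into typed regions."""

class BaseOCREngine(ABC):
    @abstractmethod
    def read(self, page_image, region: Region) -> str:
        """Extract text from one region of the page image."""

class WholePageDetector(BaseLayoutDetector):
    # Trivial detector: treats the entire page as a single text region.
    def detect(self, page_image):
        h, w = len(page_image), len(page_image[0])
        return [Region(kind="text", bbox=(0, 0, w, h))]
```

The point of the split is that either stage can be swapped independently: a stronger layout model, or a domain-tuned OCR engine, without touching the other.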
Options
DPI
Higher DPI improves OCR accuracy on dense documents at the cost of more memory and
processing time. The default is 150.
processor = PDFProcessor(
    provider="openai",
    strategy=ParseStrategy.SCANNED,
    scanned_pipeline=ScannedPipelineType.VLM,
    dpi=300,
)
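To see what the DPI setting costs, consider a US-Letter page (8.5 × 11 in) rasterised as 8-bit RGB. Doubling the DPI quadruples the pixel count and memory. A rough back-of-the-envelope sketch:

```python
def raster_size(width_in, height_in, dpi, channels=3):
    """Approximate raster dimensions and memory for one page at a given DPI."""
    w, h = round(width_in * dpi), round(height_in * dpi)
    return w, h, w * h * channels  # bytes, assuming 8 bits per channel

w150, h150, b150 = raster_size(8.5, 11, 150)  # 1275 x 1650, ~6.3 MB
w300, h300, b300 = raster_size(8.5, 11, 300)  # 2550 x 3300, ~25 MB
```

Larger images also encode to larger base64 payloads, so DPI interacts with per-call token and request-size limits on the VLM side.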
How It Works
VLM Pipeline (Type 2)
- Render — each PDF page is rasterised to a numpy array using pypdfium2.
- Encode — page images are converted to base64 PNG data URLs.
- VLM call — pages are batched and sent to a vision-capable LLM with a prompt requesting structured layout + OCR JSON output.
- Parse — the JSON response is parsed into Page objects with typed ContentBlock items (TextBlock, TableBlock, etc.).
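The encode step above can be sketched with the standard library alone (the PNG bytes here are a placeholder; a real page image would come from the pypdfium2 render step):

```python
import base64

def to_data_url(png_bytes: bytes) -> str:
    """Wrap raw PNG bytes as a base64 data URL suitable for a vision LLM."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")

# Every PNG file starts with this fixed 8-byte signature; the zero padding
# stands in for real image data.
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
url = to_data_url(fake_png)
```

Data URLs are what most vision-capable chat APIs accept for inline images, which is why the pipeline encodes to PNG rather than passing raw arrays.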
Two-Stage Pipeline (Type 1) — planned
- Render — each PDF page will be rasterised to a numpy array using pypdfium2.
- Layout detection — your BaseLayoutDetector will segment the page image into typed regions (text, table, figure, …).
- OCR — your BaseOCREngine will read text from each region concurrently. Text regions become TextBlocks; table regions become TableBlocks.
The resulting PDFDocument is identical in structure to what the digital strategy of
PDFParser produces, so citation tracking and multi-pass extraction work without any changes.