Custom OCR Components (planned)
Warning
The two-stage scanned pipeline (ScannedPipelineType.TWO_STAGE) is not
yet implemented. The contracts below are defined, so you can write
implementations against them now, but selecting TWO_STAGE raises
NotImplementedError. Use the VLM pipeline for scanned PDF extraction today.
The scanned PDF pipeline will be fully pluggable. Implement the
BaseLayoutDetector and BaseOCREngine abstract base class contracts
to provide your own layout detection and OCR capabilities.
Contracts
BaseLayoutDetector
import numpy as np

from doc_intelligence import BaseLayoutDetector, LayoutRegion


class MyLayoutDetector(BaseLayoutDetector):
    def detect(self, page_image: np.ndarray) -> list[LayoutRegion]:
        """Segment a page image into typed regions.

        Args:
            page_image: HxWxC uint8 numpy array of the full page.

        Returns:
            List of LayoutRegion with pixel-coordinate bounding boxes,
            region type labels, and confidence scores.
        """
        ...
LayoutRegion fields:
| Field | Type | Description |
|---|---|---|
| bounding_box | BoundingBox | Pixel coordinates within the page image |
| region_type | str | Any label, e.g. "text", "table", "figure" |
| confidence | float | Detection confidence in [0, 1] |
Regions with region_type == "table" become TableBlocks in the output;
all other types become TextBlocks.
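The routing rule above, together with the pixel-coordinate contract, can be sketched without the library. The snippet below is a minimal illustration only: `Box` and `Region` are hypothetical stand-ins for BoundingBox and LayoutRegion (field names copied from this page), and `crop_bounds` and `block_kind` are helper names invented here, not part of doc_intelligence. Whether the real pipeline clamps out-of-bounds boxes is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for BoundingBox (pixel coordinates)."""
    x0: float
    top: float
    x1: float
    bottom: float


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for LayoutRegion."""
    bounding_box: Box
    region_type: str
    confidence: float


def crop_bounds(box: Box, page_height: int, page_width: int) -> tuple[int, int, int, int]:
    """Clamp a pixel box to the page and return integer (top, bottom, x0, x1)."""
    x0 = max(0, min(page_width, int(box.x0)))
    x1 = max(0, min(page_width, int(box.x1)))
    top = max(0, min(page_height, int(box.top)))
    bottom = max(0, min(page_height, int(box.bottom)))
    if x1 <= x0 or bottom <= top:
        raise ValueError(f"empty region after clamping: {box}")
    return top, bottom, x0, x1


def block_kind(region: Region) -> str:
    """'table' regions map to table blocks; every other label maps to text blocks."""
    return "TableBlock" if region.region_type == "table" else "TextBlock"


# Example page of 1100 x 850 pixels (height x width).
regions = [
    Region(Box(0, 0, 800, 200), "text", 0.98),
    Region(Box(50, 220, 900, 600), "table", 0.91),  # extends past the right edge
    Region(Box(0, 620, 800, 1000), "figure", 0.75),
]
plan = [(block_kind(r), crop_bounds(r.bounding_box, 1100, 850)) for r in regions]
```

With an HxWxC numpy page image, the returned bounds would slice a region as `page_image[top:bottom, x0:x1]`, which is the kind of cropped array that `BaseOCREngine.ocr` is documented to receive.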
BaseOCREngine
import numpy as np

from doc_intelligence import BaseOCREngine
from doc_intelligence.pdf.schemas import Line


class MyOCREngine(BaseOCREngine):
    def ocr(self, region_image: np.ndarray) -> list[Line]:
        """Read text from a single cropped region image.

        Args:
            region_image: HxWxC uint8 numpy array of a cropped page region.

        Returns:
            List of Line with text and bounding boxes normalized to [0, 1]
            relative to the region image dimensions.
        """
        ...
Line fields:
| Field | Type | Description |
|---|---|---|
| text | str | The recognized text |
| bounding_box | BoundingBox \| None | Normalized coordinates in [0, 1] |
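Because Line boxes are normalized to the cropped region while LayoutRegion boxes are in page pixels, the two coordinate systems differ. If you need to place OCR lines on the full page (for debugging overlays, for example), one plausible conversion is sketched below. `Box` is a hypothetical stand-in for BoundingBox and `line_box_to_page` is a helper name invented here; whether the library performs this conversion internally is not stated on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for BoundingBox."""
    x0: float
    top: float
    x1: float
    bottom: float


def line_box_to_page(line_box: Box, region_box: Box) -> Box:
    """Map a region-normalized [0, 1] box to page pixel coordinates."""
    width = region_box.x1 - region_box.x0
    height = region_box.bottom - region_box.top
    return Box(
        x0=region_box.x0 + line_box.x0 * width,
        top=region_box.top + line_box.top * height,
        x1=region_box.x0 + line_box.x1 * width,
        bottom=region_box.top + line_box.bottom * height,
    )


# A 400 x 200 pixel region whose top-left corner sits at (100, 200) on the page.
region = Box(x0=100, top=200, x1=500, bottom=400)
# A line covering the middle half horizontally and the lower half vertically.
line = Box(x0=0.25, top=0.5, x1=0.75, bottom=1.0)

page_box = line_box_to_page(line, region)
# page_box == Box(x0=200.0, top=300.0, x1=400.0, bottom=400.0)
```

A Line whose bounding_box is None has no position to convert, so callers would need to skip it before applying this mapping.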
Example: Custom Detector + Engine
import numpy as np

from doc_intelligence import (
    BaseLayoutDetector,
    BaseOCREngine,
    BoundingBox,
    LayoutRegion,
    ParseStrategy,
    PDFProcessor,
    ScannedPipelineType,
)
from doc_intelligence.schemas.core import Line


class MyLayoutDetector(BaseLayoutDetector):
    def detect(self, page_image: np.ndarray) -> list[LayoutRegion]:
        # Treat the entire page as a single text region (pixel coordinates)
        h, w = page_image.shape[:2]
        return [
            LayoutRegion(
                bounding_box=BoundingBox(x0=0, top=0, x1=w, bottom=h),
                region_type="text",
                confidence=1.0,
            )
        ]


class MyOCREngine(BaseOCREngine):
    def ocr(self, region_image: np.ndarray) -> list[Line]:
        # Call your OCR service here; boxes are normalized to the region
        return [
            Line(
                text="Hello, world!",
                bounding_box=BoundingBox(x0=0, top=0, x1=1, bottom=1),
            )
        ]


# Selecting TWO_STAGE currently raises NotImplementedError (see the warning above).
processor = PDFProcessor(
    provider="openai",
    strategy=ParseStrategy.SCANNED,
    scanned_pipeline=ScannedPipelineType.TWO_STAGE,
    layout_detector=MyLayoutDetector(),
    ocr_engine=MyOCREngine(),
    dpi=150,
)

# MySchema is your own extraction schema, defined elsewhere.
result = processor.extract("scanned.pdf", MySchema)
Using PDFParser Directly
You can also instantiate PDFParser directly and pass it to
DocumentProcessor alongside any formatter and extractor:
from doc_intelligence import DocumentProcessor
from doc_intelligence.llm import OpenAILLM
from doc_intelligence.pdf.extractor import PDFExtractor
from doc_intelligence.pdf.formatter import PDFFormatter
from doc_intelligence.pdf.parser import PDFParser
from doc_intelligence.pdf.types import ParseStrategy, ScannedPipelineType

llm = OpenAILLM()

# MyLayoutDetector and MyOCREngine are the classes from the previous example.
parser = PDFParser(
    strategy=ParseStrategy.SCANNED,
    scanned_pipeline=ScannedPipelineType.TWO_STAGE,
    layout_detector=MyLayoutDetector(),
    ocr_engine=MyOCREngine(),
    dpi=200,
)

processor = DocumentProcessor(
    parser=parser,
    formatter=PDFFormatter(),
    extractor=PDFExtractor(llm),
)