Architecture¶
Deep dive into Hyper-Extract's system design and data flow.
System Overview¶
graph LR
A[Input Text] --> B[Text Processor]
B --> C[LLM Extractor]
C --> D[Merger]
D --> E[Auto-Type Instance]
E --> F[Operations]
subgraph "Text Processor"
B1[Chunking] --> B2[Parallel Processing]
end
subgraph "LLM Extractor"
C1[Prompt] --> C2[LLM Call]
C2 --> C3[Structured Output]
end
subgraph "Operations"
F1[Search]
F2[Chat]
F3[Visualize]
F4[Save/Load]
end
Data Flow¶
1. Input Processing¶
# Input: Raw text
text = "Nikola Tesla was an inventor..."
# If text > chunk_size, split into chunks
chunks = [
"Nikola Tesla was an inventor...",
"He developed the AC system...",
# ...
]
Chunking Strategy: - Default size: 2048 characters - Overlap: 256 characters - Separators: Paragraphs, sentences, words
2. Extraction¶
# Each chunk processed in parallel
for chunk in chunks:
# LLM extracts structured data
data = llm.extract(chunk, schema=AutoTypeSchema)
Process: 1. Format prompt with chunk 2. Call LLM with structured output 3. Parse response into Pydantic model
3. Merging¶
Merge Operations: - Deduplicate entities - Combine relations - Resolve conflicts
4. Result¶
Component Details¶
Template Engine¶
Components: - Gallery — Template discovery and listing - Parser — YAML parsing and validation - Factory — Instance creation
Auto-Type Base Class¶
class BaseAutoType:
# Core functionality
def parse(text) -> AutoType
def feed_text(text) -> AutoType
# Query
def build_index()
def search(query) -> List[Item]
def chat(query) -> Response
# Persistence
def dump(path)
def load(path)
# Visualization
def show()
Method Registry¶
# Methods are registered at import
_METHOD_REGISTRY = {
"light_rag": {
"class": Light_RAG,
"type": "graph",
"description": "..."
},
# ...
}
Extension Points¶
Custom Templates¶
Create domain-specific templates:
# my_template.yaml
name: custom_extraction
type: graph
output:
entities:
fields:
- name: name
type: str
# ...
Custom Auto-Types¶
Extend base classes:
from hyperextract.types import AutoGraph
class MyCustomGraph(AutoGraph):
def _default_prompt(self):
return "Custom prompt..."
Custom Methods¶
Implement extraction algorithms:
from hyperextract.methods import register_method
class MyMethod:
def extract(self, text):
# Custom extraction logic
pass
register_method("my_method", MyMethod, "graph", "Description")
Performance Considerations¶
Chunking¶
| Document Size | Chunks | Processing Time |
|---|---|---|
| < 2KB | 1 | Fast |
| 2-10KB | 2-5 | Medium |
| 10-50KB | 5-25 | Slow |
| > 50KB | 25+ | Very Slow |
Parallelization¶
Memory Usage¶
Design Principles¶
- Type Safety — Pydantic schemas throughout
- Extensibility — Plugin architecture for methods
- Usability — Templates for common tasks
- Performance — Chunking and parallelization
- Interoperability — Standard formats (JSON, YAML)