Working with Auto-Types¶
Level 2 - Intermediate
This guide covers configured Auto-Type usage. Before reading, please complete Level 1: Using Templates and Level 1: Using Methods.
Learn how to configure Auto-Types directly to customize schemas, deduplication logic, and extraction behavior.
What Are Auto-Types?¶
Auto-Types are the core data structures in Hyper-Extract that extract, organize, and store structured knowledge from text. They provide:
- Type-safe schemas — Pydantic-based validation
- LLM-powered extraction — Automatic content processing
- Built-in operations — Search, merge, visualize
- Serialization — Save/load to disk
All Auto-Types inherit from BaseAutoType, so they share a common set of capabilities (e.g., parse, feed_text, build_index, search, chat, dump, load).
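For orientation, that shared surface can be sketched as a structural type. The `AutoTypeLike` protocol below is purely illustrative — the method names mirror the list above, not the library's actual class hierarchy:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class AutoTypeLike(Protocol):
    """Illustrative sketch of the interface shared by Auto-Types (hypothetical)."""

    def feed_text(self, text: str) -> None: ...
    def build_index(self) -> None: ...
    def search(self, query: str, top_k: int = 5) -> Any: ...
    def dump(self, path: str) -> None: ...
```

Any object exposing these methods satisfies the protocol, which is a convenient mental model for why the operations in this guide work the same way across graph, list, and temporal types.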
Three-Level Usage Architecture¶
Hyper-Extract provides three levels of control. This guide focuses on Level 2.
| Level | Approach | When to Use | Reference |
|---|---|---|---|
| Level 1 | Templates / Methods | Quick start, standard use cases | Using Templates, Using Methods |
| Level 2 | Configured Auto-Type | Custom schemas, same extraction logic | This guide |
| Level 3 | Fully custom methods | Complete control over extraction | Methods Concepts |
Level 2: Configured Auto-Type Usage¶
When you need custom schemas but don't want to implement full extraction logic from scratch, instantiate an Auto-Type directly and pass configuration parameters.
When to Use Level 2¶
- You need custom node/edge schemas
- You want to control deduplication logic
- Template output doesn't match your exact needs
- You're building reusable Python components
Complete Example: Custom Graph¶
```python
from hyperextract import AutoGraph
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from pydantic import BaseModel, Field


# Step 1: Define custom schemas
class Person(BaseModel):
    """Custom node schema"""
    name: str = Field(description="Person's full name")
    role: str = Field(description="Job title or role")
    expertise: list[str] = Field(default_factory=list, description="Areas of expertise")


class Collaboration(BaseModel):
    """Custom edge schema"""
    source: str = Field(description="First person's name")
    target: str = Field(description="Second person's name")
    project: str = Field(description="Project they worked on together")
    year: int = Field(description="Year of collaboration")


# Step 2: Configure LLM clients
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings(model="text-embedding-3-small")

# Step 3: Create configured AutoGraph
graph = AutoGraph[Person, Collaboration](
    node_schema=Person,
    edge_schema=Collaboration,
    # Define how to extract unique keys for deduplication
    node_key_extractor=lambda p: p.name,
    edge_key_extractor=lambda c: f"{c.source}-{c.target}-{c.project}",
    # Define how to extract node references from edges
    nodes_in_edge_extractor=lambda c: (c.source, c.target),
    # LLM clients
    llm_client=llm,
    embedder=embedder,
)

# Step 4: Extract
text = """
Dr. Sarah Chen and Dr. Michael Wang collaborated on the Climate AI project in 2023.
Sarah is a machine learning researcher with expertise in neural networks and climate modeling.
Michael specializes in data engineering and distributed systems.
"""
graph.feed_text(text)

# Step 5: Access results
print(f"Extracted {len(graph.nodes)} people")
for person in graph.nodes:
    print(f"- {person.name}: {person.role}")
    print(f"  Expertise: {', '.join(person.expertise)}")

print(f"\nExtracted {len(graph.edges)} collaborations")
for collab in graph.edges:
    print(f"- {collab.source} & {collab.target}: {collab.project} ({collab.year})")

# Step 6: Use built-in features
graph.build_index()

# Search
nodes, edges = graph.search("machine learning", top_k=2)
print(f"\nSearch found: {len(nodes)} people")

# Visualize
graph.show()
```
Key Configuration Parameters¶
| Parameter | Required | Description |
|---|---|---|
| `node_schema` / `edge_schema` | Yes | Pydantic model defining the structure |
| `node_key_extractor` | Yes | Function to extract a unique key from a node |
| `edge_key_extractor` | Yes | Function to extract a unique key from an edge |
| `nodes_in_edge_extractor` | Yes | Function to get `(source, target)` from an edge |
| `llm_client` | Yes | LangChain-compatible LLM client |
| `embedder` | Yes | LangChain-compatible embeddings client |
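The key extractors decide what counts as "the same" node or edge. A standalone sketch of the idea — plain dicts stand in for extracted Pydantic objects, and this is a conceptual model, not the library's implementation. With a `source-target-project` key, a repeated mention of the same project collapses into one edge, while the same pair of people on a different project stays distinct:

```python
# Conceptual model of edge deduplication via edge_key_extractor.
# Plain dicts stand in for extracted edge objects.
edge_key = lambda c: f"{c['source']}-{c['target']}-{c['project']}"

mentions = [
    {"source": "Sarah", "target": "Michael", "project": "Climate AI"},
    {"source": "Sarah", "target": "Michael", "project": "Climate AI"},    # duplicate mention
    {"source": "Sarah", "target": "Michael", "project": "Ocean Sensing"}, # different project
]

deduped = {}
for edge in mentions:
    deduped.setdefault(edge_key(edge), edge)  # first occurrence wins

print(len(deduped))  # 2 unique collaborations
```

If your key were only `source-target`, the two projects would be merged into one edge — so the key should include every field that distinguishes genuinely different relationships.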
Comparison: Template vs Configured Auto-Type¶
| Aspect | Template | Configured Auto-Type |
|---|---|---|
| Schema definition | YAML file | Python Pydantic classes |
| Extraction logic | Pre-built | Same pre-built logic |
| Deduplication | Pre-configured | You define key extractors |
| Language support | Built-in | You provide prompts |
| Reusability | Share YAML files | Package as Python module |
More Configuration Examples¶
Temporal Graph¶
```python
from hyperextract import AutoTemporalGraph
from pydantic import BaseModel, Field


class Event(BaseModel):
    """Node: A historical event"""
    name: str = Field(description="Event name")
    category: str = Field(description="Type of event")


class CausalLink(BaseModel):
    """Edge: Time-aware causal relationship"""
    source: str = Field(description="Earlier event")
    target: str = Field(description="Later event")
    relationship: str = Field(description="How they connect")
    time: str = Field(description="When the link occurred")


timeline = AutoTemporalGraph[Event, CausalLink](
    node_schema=Event,
    edge_schema=CausalLink,
    node_key_extractor=lambda e: e.name,
    edge_key_extractor=lambda link: f"{link.source}-{link.target}-{link.time}",
    nodes_in_edge_extractor=lambda link: (link.source, link.target),
    llm_client=llm,       # reuse the clients configured above
    embedder=embedder,
    # Temporal-specific: extract time from edge
    time_extractor=lambda link: link.time,
)

# historical_text: any text describing events over time (assumed defined)
timeline.feed_text(historical_text)
```
Common Operations¶
Checking if Empty¶
Clearing Data¶
Merging Instances¶
```python
# ka: any configured Auto-Type instance; text1/text2: two source documents

# Two separate extractions
result1 = ka.parse(text1)
result2 = ka.parse(text2)

# Merge into a new instance
combined = result1 + result2
```
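Conceptually, `+` behaves like a key-based union: entries from both instances are combined, and anything sharing a key (as defined by your key extractors) collapses to a single record. A standalone sketch of that model — not the library's actual merge code:

```python
def union_by_key(left, right, key):
    """Merge two record lists, keeping one record per key (first occurrence wins)."""
    merged = {}
    for record in left + right:
        merged.setdefault(key(record), record)
    return list(merged.values())


run1 = [{"name": "Sarah Chen"}, {"name": "Michael Wang"}]
run2 = [{"name": "Sarah Chen"}, {"name": "Priya Patel"}]  # Sarah appears in both

combined = union_by_key(run1, run2, key=lambda p: p["name"])
print(len(combined))  # 3 unique people
```

This is why well-chosen key extractors matter beyond a single extraction: they also determine how instances combine.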
Accessing Data¶
Property Access¶
```python
result = ka.parse(text)

# Direct property access (Pydantic model)
nodes = result.nodes
edges = result.edges

# Dictionary conversion
data_dict = result.data.model_dump()
```
JSON Export¶
```python
# Export to JSON (model_dump_json returns a JSON string directly)
json_data = result.data.model_dump_json(indent=2)

# Save to file
with open("output.json", "w") as f:
    f.write(json_data)
```
Working with Results¶
Iteration Patterns¶
```python
# Iterate nodes
for node in result.nodes:
    print(f"Name: {node.name}")
    print(f"Type: {node.type}")
    if hasattr(node, 'description'):
        print(f"Description: {node.description}")

# Iterate edges with filtering
for edge in result.edges:
    if edge.type == "worked_with":
        print(f"{edge.source} worked with {edge.target}")
```
Filtering¶
```python
# Filter nodes by type
people = [n for n in result.nodes if n.type == "person"]
organizations = [n for n in result.nodes if n.type == "organization"]

# Filter edges
inventions = [e for e in result.edges if "invent" in e.type.lower()]
```
Statistics¶
```python
from collections import Counter

# Basic counts
node_count = len(result.nodes)
edge_count = len(result.edges)

# Type distribution
node_types = Counter(n.type for n in result.nodes)
edge_types = Counter(e.type for e in result.edges)
print(f"Nodes: {node_types}")
print(f"Edges: {edge_types}")
```
Advanced Usage¶
Accessing the Schema¶
Raw Data Access¶
Type Checking¶
```python
from hyperextract import AutoGraph, AutoList

# Check instance type
if isinstance(result, AutoGraph):
    print("Graph extraction")
elif isinstance(result, AutoList):
    print("List extraction")
```
Best Practices¶
Level 2 (Configured Auto-Types)¶
- Design schemas carefully — Fields become LLM extraction targets
- Use clear Field descriptions — Help LLM understand expectations
- Choose good key extractors — Critical for deduplication
- Test with small samples — Iterate on schema design
General¶
- Check `hasattr` before optional fields — Schema fields may vary
- Handle empty results — Always check `empty()`
- Use `model_dump` for serialization — Proper JSON conversion
- Leverage type hints — IDE support for autocomplete
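The `hasattr` guard matters because different schemas expose different fields; `getattr` with a default is an equivalent one-liner. A small standalone illustration — the `Node` class is a stand-in for an extracted object whose schema lacks a `description` field:

```python
class Node:
    """Stand-in for an extracted node without a description field."""
    name = "Sarah Chen"


node = Node()

# Guard optional fields before access
if hasattr(node, "description"):
    print(node.description)

# Equivalent: getattr with a fallback default
description = getattr(node, "description", "(no description)")
print(description)  # (no description)
```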
See Also¶
Prerequisites:

- Using Templates — Level 1 basics
- Using Methods — Level 1 basics
- Auto-Types Concepts — What each type does

Next Steps:

- Creating Custom Templates — Package your configurations
- Incremental Updates — Add more documents
- Search and Chat — Use extracted knowledge
- Saving and Loading — Persist your results