Typical Methods
Direct extraction methods that process text without retrieval.
iText2KG

hyperextract.methods.typical.iText2KG
Bases: AutoGraph[NodeSchema, EdgeSchema]
iText2KG: A specialized AutoGraph for extracting high-quality triple-based KGs.
Features:

- Fixed schema (NodeSchema, EdgeSchema) optimized for triple extraction
- Customized prompts from the original iText2KG implementation
- Automatic deduplication and consistency checking via AutoGraph's OMem
- Two-stage extraction: entities first, then relationships
Example:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings()
kg = iText2KG(llm_client=llm, embedder=embedder)
kg.feed_text(text)
print(f"Extracted {len(kg.nodes)} entities and {len(kg.edges)} relationships.")
```
Functions
__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

Initialize iText2KG.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `llm_client` | `BaseChatModel` | Language model for extraction | *required* |
| `embedder` | `Embeddings` | Embedding model for vector indexing | *required* |
| `chunk_size` | `int` | Characters per chunk | `2048` |
| `chunk_overlap` | `int` | Overlapping characters between chunks | `256` |
| `verbose` | `bool` | Display detailed execution logs and progress information | `False` |
| `max_workers` | `int` | Max concurrent extraction workers | `10` |
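The `chunk_size` / `chunk_overlap` pair controls how input text is windowed before extraction. A minimal sketch of character-based chunking with overlap, purely illustrative; the splitter used internally by iText2KG may differ:

```python
def chunk_text(text: str, chunk_size: int = 2048, chunk_overlap: int = 256) -> list[str]:
    """Split text into fixed-size character windows, each overlapping the previous by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window already reaches the end of the text
    return chunks

# 5000 characters with the defaults yields 3 chunks of at most 2048 characters.
chunks = chunk_text("".join(str(i % 10) for i in range(5000)))
```

With the defaults, consecutive chunks share their boundary 256 characters, so a sentence cut at one window edge is still seen whole in the next window.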
iText2KG_Star

hyperextract.methods.typical.iText2KG_Star
Bases: AutoGraph[NodeSchema, EdgeSchema]
iText2KG_Star: A specialized AutoGraph for extracting high-quality triple-based KGs.
Features:
- One-stage extraction: Extracts edges directly, then derives nodes automatically
- Customized prompts from original iText2KG_Star implementation
- Semantic Deduplication: Includes match_nodes_and_update_edges using SemHash with embeddings
- Automatic date tracking: Observation date set to extraction time
- Nested node schema: Maintains rich semantic information (name + label)
Example:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings()
kg = iText2KG_Star(llm_client=llm, embedder=embedder)

# 1. Extract relationships and derive nodes
kg.feed_text("Elon Musk is the CEO of SpaceX. Musk leads Tesla.")
print(f"Before dedup - Nodes: {len(kg.nodes)}, Edges: {len(kg.edges)}")

# 2. Deduplicate: merges 'Elon Musk' and 'Musk'
kg.match_nodes_and_update_edges(threshold=0.85)
print(f"After dedup - Nodes: {len(kg.nodes)}, Edges: {len(kg.edges)}")
```
Attributes

observation_date (instance attribute)
Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, observation_date: str | None = None, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

Initialize iText2KG_Star.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `llm_client` | `BaseChatModel` | Language model for extraction | *required* |
| `embedder` | `Embeddings` | Embedding model for vector indexing | *required* |
| `observation_date` | `str \| None` | Date when the extraction was performed, like '1997-10-10' or '1997-10-10 23:59:59'. If None, uses current date and time. | `None` |
| `chunk_size` | `int` | Characters per chunk | `2048` |
| `chunk_overlap` | `int` | Overlapping characters between chunks | `256` |
| `max_workers` | `int` | Max concurrent extraction workers | `10` |
| `verbose` | `bool` | Display detailed execution logs and progress information | `False` |
_extract_data(text: str) -> AutoGraphSchema

Extract edges directly, then derive nodes (one-stage extraction).

Process:

1. Split text into chunks.
2. Batch extract edges directly from chunks (edges carry their node info).
3. Post-process: set the observation date to the current date and time.
4. Derive nodes from edges.
5. Merge all partial graphs into one global graph.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text. | *required* |

Returns:

| Type | Description |
|---|---|
| `AutoGraphSchema` | Extracted and validated graph. |
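In one-stage extraction, nodes are never extracted on their own: they are derived from the endpoints of the extracted edges. A toy sketch of that derivation step, using hypothetical tuple shapes rather than the library's internal types:

```python
# Hypothetical edge triples as (source_name, source_label, relation, target_name, target_label).
edges = [
    ("Elon Musk", "Person", "ceo_of", "SpaceX", "Company"),
    ("Musk", "Person", "leads", "Tesla", "Company"),
]

def derive_nodes(edges):
    """Collect the unique (name, label) endpoints appearing in any edge."""
    nodes = set()
    for src_name, src_label, _rel, tgt_name, tgt_label in edges:
        nodes.add((src_name, src_label))
        nodes.add((tgt_name, tgt_label))
    return sorted(nodes)

print(derive_nodes(edges))
# Four distinct endpoints: 'Elon Musk' and 'Musk' remain separate until deduplication.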
match_nodes_and_update_edges(threshold: float = 0.8) -> iText2KG_Star
Match nodes in the graph and update edges accordingly using SemHash with embeddings.
This method identifies and merges similar nodes based on semantic similarity using embeddings from the instance's embedder. It updates edges to reflect any changes in node identities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `threshold` | `float` | Similarity threshold for matching nodes (0.0 to 1.0). Defaults to 0.8. | `0.8` |

Returns:

| Type | Description |
|---|---|
| `iText2KG_Star` | The updated iText2KG_Star instance with matched nodes and updated edges. |
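The effect of threshold-based node matching can be sketched with a toy similarity function. The real method scores candidates with SemHash over the instance's embeddings; the word-set containment below is purely an illustrative stand-in:

```python
def similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity: lowercase word-set containment."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / min(len(wa), len(wb))

def match_nodes(nodes, edges, threshold=0.8):
    """Map each node to a canonical representative, then rewrite edges to use it."""
    canonical = {}
    for name in nodes:
        # Reuse an existing representative if it clears the threshold, else keep self.
        rep = next((r for r in set(canonical.values())
                    if similarity(name, r) >= threshold), name)
        canonical[name] = rep
    merged_edges = [(canonical[s], rel, canonical[t]) for s, rel, t in edges]
    return sorted(set(canonical.values())), merged_edges

nodes = ["Elon Musk", "Musk", "SpaceX", "Tesla"]
edges = [("Elon Musk", "ceo_of", "SpaceX"), ("Musk", "leads", "Tesla")]
merged_nodes, merged_edges = match_nodes(nodes, edges, threshold=0.85)
print(merged_nodes)  # ['Elon Musk', 'SpaceX', 'Tesla']
print(merged_edges)  # both edges now reference 'Elon Musk'
```

The key invariant carries over to the real method: every edge endpoint is rewritten to the surviving canonical node, so no edge is left dangling after a merge.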
KG_Gen

hyperextract.methods.typical.KG_Gen
Bases: AutoGraph[NodeSchema, EdgeSchema]
Knowledge Graph Generator: A specialized AutoGraph for extracting simple triple-based KGs.
Features:

- Fixed schema (NodeSchema, EdgeSchema) optimized for triple extraction
- Customized prompts from the original kg_gen implementation
- Automatic deduplication and consistency checking via AutoGraph's OMem
- Two-stage extraction: entities first, then relationships
Example:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings()
kg = KG_Gen(llm_client=llm, embedder=embedder)
kg.feed_text(text)
print(f"Extracted {len(kg.nodes)} entities and {len(kg.edges)} relationships.")
```
Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

Initialize KG_Gen.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `llm_client` | `BaseChatModel` | Language model for extraction | *required* |
| `embedder` | `Embeddings` | Embedding model for vector indexing | *required* |
| `chunk_size` | `int` | Characters per chunk | `2048` |
| `chunk_overlap` | `int` | Overlapping characters between chunks | `256` |
| `max_workers` | `int` | Max concurrent extraction workers | `10` |
| `verbose` | `bool` | Display detailed execution logs and progress information | `False` |
_deduplicate_graph(graph_data: AutoGraphSchema[NodeSchema, EdgeSchema], threshold: float = 0.9) -> AutoGraphSchema[NodeSchema, EdgeSchema]

Internal helper to apply SemHash deduplication on a graph data object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `graph_data` | `AutoGraphSchema[NodeSchema, EdgeSchema]` | The graph data object (nodes/edges) to process in-place. | *required* |
| `threshold` | `float` | SemHash similarity threshold (0.0 to 1.0). | `0.9` |

Returns:

| Type | Description |
|---|---|
| `AutoGraphSchema[NodeSchema, EdgeSchema]` | The modified graph_data object. |
deduplicate(threshold: float = 0.9) -> KG_Gen

Return a new KG_Gen instance with deduplicated entities and edges. Does not modify the current instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
threshold
|
float
|
Similarity threshold for SemHash (0.0 to 1.0). Higher means stricter. |
0.9
|
Returns:
| Type | Description |
|---|---|
KG_Gen
|
A new, deduplicated KG_Gen instance. |
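Because deduplicate returns a fresh instance, the original graph survives unchanged, which makes it cheap to compare thresholds side by side. A toy illustration of that copy-then-dedup pattern, using a hypothetical Graph class (not the library's) and case-insensitive equality as a stand-in for SemHash similarity:

```python
from dataclasses import dataclass, field

@dataclass
class Graph:
    """Hypothetical stand-in for a KG_Gen-style graph holding node names."""
    nodes: list = field(default_factory=list)

    def deduplicate(self, *, threshold: float = 0.9) -> "Graph":
        # Toy criterion: case-insensitive exact match stands in for
        # "SemHash similarity >= threshold"; first occurrence wins.
        seen, kept = set(), []
        for n in self.nodes:
            key = n.lower()
            if key not in seen:
                seen.add(key)
                kept.append(n)
        return Graph(nodes=kept)  # a new instance; self is untouched

g = Graph(nodes=["Tesla", "tesla", "SpaceX"])
g2 = g.deduplicate(threshold=0.9)
print(len(g.nodes), len(g2.nodes))  # 3 2
```

The same non-mutating contract applies to KG_Gen.deduplicate: compare `kg.deduplicate(threshold=0.8)` against `kg.deduplicate(threshold=0.95)` without ever touching `kg` itself.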
Atom

hyperextract.methods.typical.Atom
Bases: AutoGraph[NodeSchema, EdgeSchema]
Atom: A specialized AutoGraph for extracting high-quality triple-based KGs.
Features:
- Two-stage extraction: Extracts atomic facts first, then derives edges and nodes from facts
- Customized prompts from original Atom implementation
- Semantic Deduplication: Includes match_nodes_and_update_edges using SemHash with embeddings
- Temporal tracking: t_start, t_end, and t_obs fields capture relationship timing and extraction metadata
- Evidence attribution: atomic_facts field traces each extracted edge to source facts
- Nested node schema: Maintains rich semantic information (name + label)
Example:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings()
kg = Atom(llm_client=llm, embedder=embedder)

# 1. Extract: Facts -> Edges -> Nodes
kg.feed_text("Elon Musk is the CEO of SpaceX. Musk leads Tesla.")
print(f"Before dedup - Nodes: {len(kg.nodes)}, Edges: {len(kg.edges)}")

# 2. Deduplicate: merges 'Elon Musk' and 'Musk'
kg.match_nodes_and_update_edges(threshold=0.85)
print(f"After dedup - Nodes: {len(kg.nodes)}, Edges: {len(kg.edges)}")
```
Attributes

facts_per_chunk (instance attribute)

observation_time (instance attribute)
Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, observation_time: str | None = None, chunk_size: int = 2048, chunk_overlap: int = 256, facts_per_chunk: int = 10, max_workers: int = 10, verbose: bool = False)

Initialize Atom.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `llm_client` | `BaseChatModel` | Language model for extraction | *required* |
| `embedder` | `Embeddings` | Embedding model for vector indexing | *required* |
| `observation_time` | `str \| None` | Date when the extraction was performed, like '1997-10-10' or '1997-10-10 23:59:59'. If None, uses current date and time. | `None` |
| `chunk_size` | `int` | Characters per chunk | `2048` |
| `chunk_overlap` | `int` | Overlapping characters between chunks | `256` |
| `facts_per_chunk` | `int` | Max number of atomic facts to group into a single extraction batch | `10` |
| `max_workers` | `int` | Max concurrent extraction workers | `10` |
| `verbose` | `bool` | Display detailed execution logs and progress information | `False` |
_extract_data(text: str) -> AutoGraphSchema

Extract atomic facts first, then extract edges (two-stage extraction).

Process:

1. Split text into chunks.
2. Batch extract atomic facts from chunks.
3. Consolidate facts into a unified context.
4. Split the consolidated facts into chunks.
5. Batch extract edges from fact chunks.
6. Post-process: set the t_obs timestamp and derive nodes.
7. Merge all partial graphs into one global graph.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text. | *required* |

Returns:

| Type | Description |
|---|---|
| `AutoGraphSchema` | Extracted and validated graph. |
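The two-stage shape of the pipeline (facts first, edges second, nodes derived last) can be sketched with stub extractors. Both stubs below are hypothetical placeholders for what are LLM calls in the real implementation:

```python
def extract_facts(chunk: str) -> list:
    """Stub for stage 1: treat each sentence as one atomic fact (an LLM call in reality)."""
    return [s.strip() for s in chunk.split(".") if s.strip()]

def extract_edge(fact: str) -> tuple:
    """Stub for stage 2: naive subject-verb-object split (an LLM call in reality)."""
    words = fact.split()
    return (words[0], words[1], " ".join(words[2:]))

def two_stage_extract(text: str):
    facts = extract_facts(text)                                    # stage 1: atomic facts
    edges = [extract_edge(f) for f in facts]                       # stage 2: edges from facts
    nodes = sorted({e[0] for e in edges} | {e[2] for e in edges})  # derive nodes from endpoints
    return facts, edges, nodes

facts, edges, nodes = two_stage_extract("Musk leads Tesla. Musk founded SpaceX")
print(edges)  # [('Musk', 'leads', 'Tesla'), ('Musk', 'founded', 'SpaceX')]
```

The intermediate facts list is what powers Atom's evidence attribution: each edge can be traced back to the atomic fact it came from.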
match_nodes_and_update_edges(threshold: float = 0.8) -> Atom
Match nodes in the graph and update edges accordingly using SemHash with embeddings.
This method identifies and merges similar nodes based on semantic similarity using embeddings from the instance's embedder. It updates edges to reflect any changes in node identities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `threshold` | `float` | Similarity threshold for matching nodes (0.0 to 1.0). Defaults to 0.8. | `0.8` |

Returns:

| Type | Description |
|---|---|
| `Atom` | The updated Atom instance with matched nodes and updated edges. |