Typical Methods
Direct extraction methods that process text without retrieval.
iText2KG

hyperextract.methods.typical.iText2KG
Bases: AutoGraph[NodeSchema, EdgeSchema]
iText2KG: A specialized AutoGraph for extracting high-quality triple-based KGs.
Features:

- Fixed schema (NodeSchema, EdgeSchema) optimized for triple extraction
- Customized prompts from the original iText2KG implementation
- Automatic deduplication and consistency checking via AutoGraph's OMem
- Two-stage extraction: entities first, then relationships
Example:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings()
kg = iText2KG(llm_client=llm, embedder=embedder)
kg.feed_text(text)
print(f"Extracted {len(kg.nodes)} entities and {len(kg.edges)} relationships.")
```
Functions
__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

Initialize iText2KG.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `llm_client` | `BaseChatModel` | Language model for extraction | *required* |
| `embedder` | `Embeddings` | Embedding model for vector indexing | *required* |
| `chunk_size` | `int` | Characters per chunk | `2048` |
| `chunk_overlap` | `int` | Overlapping characters between chunks | `256` |
| `verbose` | `bool` | Display detailed execution logs and progress information | `False` |
| `max_workers` | `int` | Max concurrent extraction workers | `10` |
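The `chunk_size` / `chunk_overlap` pair controls how input text is windowed before extraction. A minimal sketch of character-based chunking with overlap, purely illustrative; the splitter used internally by iText2KG may differ:

```python
def chunk_text(text: str, chunk_size: int = 2048, chunk_overlap: int = 256) -> list[str]:
    """Split text into fixed-size character windows, each overlapping the previous by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window already reaches the end of the text
    return chunks

# 5000 characters with the defaults yields 3 chunks of at most 2048 characters.
chunks = chunk_text("".join(str(i % 10) for i in range(5000)))
```

With the defaults, consecutive chunks share their boundary 256 characters, so a sentence cut at one window edge is still seen whole in the next window.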
iText2KG_Star

hyperextract.methods.typical.iText2KG_Star
Bases: AutoGraph[NodeSchema, EdgeSchema]
iText2KG_Star: A specialized AutoGraph for extracting high-quality triple-based KGs.
Features:
- One-stage extraction: Extracts edges directly, then derives nodes automatically
- Customized prompts from original iText2KG_Star implementation
- Semantic Deduplication: Includes match_nodes_and_update_edges using SemHash with embeddings
- Automatic date tracking: Observation date set to extraction time
- Nested node schema: Maintains rich semantic information (name + label)
Example:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings()
kg = iText2KG_Star(llm_client=llm, embedder=embedder)

# 1. Extract relationships and derive nodes
kg.feed_text("Elon Musk is the CEO of SpaceX. Musk leads Tesla.")
print(f"Before dedup - Nodes: {len(kg.nodes)}, Edges: {len(kg.edges)}")

# 2. Deduplicate: merges 'Elon Musk' and 'Musk'
kg.match_nodes_and_update_edges(threshold=0.85)
print(f"After dedup - Nodes: {len(kg.nodes)}, Edges: {len(kg.edges)}")
```
Attributes

observation_date (instance attribute)
Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, observation_date: str | None = None, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

Initialize iText2KG_Star.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `llm_client` | `BaseChatModel` | Language model for extraction | *required* |
| `embedder` | `Embeddings` | Embedding model for vector indexing | *required* |
| `observation_date` | `str \| None` | Date when the extraction was performed, like '1997-10-10' or '1997-10-10 23:59:59'. If None, uses current date and time. | `None` |
| `chunk_size` | `int` | Characters per chunk | `2048` |
| `chunk_overlap` | `int` | Overlapping characters between chunks | `256` |
| `max_workers` | `int` | Max concurrent extraction workers | `10` |
| `verbose` | `bool` | Display detailed execution logs and progress information | `False` |
_extract_data(text: str) -> AutoGraphSchema

Extract edges directly, then derive nodes (one-stage extraction).

Process:

1. Split text into chunks.
2. Batch extract edges directly from chunks (edges carry their node info).
3. Post-process: set the observation date to the current date and time.
4. Derive nodes from edges.
5. Merge all partial graphs into one global graph.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text. | *required* |

Returns:

| Type | Description |
|---|---|
| `AutoGraphSchema` | Extracted and validated graph. |
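In one-stage extraction, nodes are never extracted on their own: they are derived from the endpoints of the extracted edges. A toy sketch of that derivation step, using hypothetical tuple shapes rather than the library's internal types:

```python
# Hypothetical edge triples as (source_name, source_label, relation, target_name, target_label).
edges = [
    ("Elon Musk", "Person", "ceo_of", "SpaceX", "Company"),
    ("Musk", "Person", "leads", "Tesla", "Company"),
]

def derive_nodes(edges):
    """Collect the unique (name, label) endpoints appearing in any edge."""
    nodes = set()
    for src_name, src_label, _rel, tgt_name, tgt_label in edges:
        nodes.add((src_name, src_label))
        nodes.add((tgt_name, tgt_label))
    return sorted(nodes)

print(derive_nodes(edges))
# Four distinct endpoints: 'Elon Musk' and 'Musk' remain separate until deduplication.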
match_nodes_and_update_edges(threshold: float = 0.8) -> iText2KG_Star
Match nodes in the graph and update edges accordingly using SemHash with embeddings.
This method identifies and merges similar nodes based on semantic similarity using embeddings from the instance's embedder. It updates edges to reflect any changes in node identities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `threshold` | `float` | Similarity threshold for matching nodes (0.0 to 1.0). Defaults to 0.8. | `0.8` |

Returns:

| Type | Description |
|---|---|
| `iText2KG_Star` | The updated iText2KG_Star instance with matched nodes and updated edges. |
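The effect of threshold-based node matching can be sketched with a toy similarity function. The real method scores candidates with SemHash over the instance's embeddings; the word-set containment below is purely an illustrative stand-in:

```python
def similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity: lowercase word-set containment."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / min(len(wa), len(wb))

def match_nodes(nodes, edges, threshold=0.8):
    """Map each node to a canonical representative, then rewrite edges to use it."""
    canonical = {}
    for name in nodes:
        # Reuse an existing representative if it clears the threshold, else keep self.
        rep = next((r for r in set(canonical.values())
                    if similarity(name, r) >= threshold), name)
        canonical[name] = rep
    merged_edges = [(canonical[s], rel, canonical[t]) for s, rel, t in edges]
    return sorted(set(canonical.values())), merged_edges

nodes = ["Elon Musk", "Musk", "SpaceX", "Tesla"]
edges = [("Elon Musk", "ceo_of", "SpaceX"), ("Musk", "leads", "Tesla")]
merged_nodes, merged_edges = match_nodes(nodes, edges, threshold=0.85)
print(merged_nodes)  # ['Elon Musk', 'SpaceX', 'Tesla']
print(merged_edges)  # both edges now reference 'Elon Musk'
```

The key invariant carries over to the real method: every edge endpoint is rewritten to the surviving canonical node, so no edge is left dangling after a merge.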
KG_Gen

hyperextract.methods.typical.KG_Gen
Bases: AutoGraph[NodeSchema, EdgeSchema]
Knowledge Graph Generator: A specialized AutoGraph for extracting simple triple-based KGs.
Features:

- Fixed schema (NodeSchema, EdgeSchema) optimized for triple extraction
- Customized prompts from the original kg_gen implementation
- Automatic deduplication and consistency checking via AutoGraph's OMem
- Two-stage extraction: entities first, then relationships
Example:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings()
kg = KG_Gen(llm_client=llm, embedder=embedder)
kg.feed_text(text)
print(f"Extracted {len(kg.nodes)} entities and {len(kg.edges)} relationships.")
```
Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

Initialize KG_Gen.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `llm_client` | `BaseChatModel` | Language model for extraction | *required* |
| `embedder` | `Embeddings` | Embedding model for vector indexing | *required* |
| `chunk_size` | `int` | Characters per chunk | `2048` |
| `chunk_overlap` | `int` | Overlapping characters between chunks | `256` |
| `max_workers` | `int` | Max concurrent extraction workers | `10` |
| `verbose` | `bool` | Display detailed execution logs and progress information | `False` |
_deduplicate_graph(graph_data: AutoGraphSchema[NodeSchema, EdgeSchema], threshold: float = 0.9) -> AutoGraphSchema[NodeSchema, EdgeSchema]

Internal helper to apply SemHash deduplication on a graph data object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `graph_data` | `AutoGraphSchema[NodeSchema, EdgeSchema]` | The graph data object (nodes/edges) to process in-place. | *required* |
| `threshold` | `float` | SemHash similarity threshold (0.0 to 1.0). | `0.9` |

Returns:

| Type | Description |
|---|---|
| `AutoGraphSchema[NodeSchema, EdgeSchema]` | The modified graph_data object. |
deduplicate(threshold: float = 0.9) -> KG_Gen

Return a new KG_Gen instance with deduplicated entities and edges. Does not modify the current instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
threshold
|
float
|
Similarity threshold for SemHash (0.0 to 1.0). Higher means stricter. |
0.9
|
Returns:
| Type | Description |
|---|---|
KG_Gen
|
A new, deduplicated KG_Gen instance. |
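Because deduplicate returns a fresh instance, the original graph survives unchanged, which makes it cheap to compare thresholds side by side. A toy illustration of that copy-then-dedup pattern, using a hypothetical Graph class (not the library's) and case-insensitive equality as a stand-in for SemHash similarity:

```python
from dataclasses import dataclass, field

@dataclass
class Graph:
    """Hypothetical stand-in for a KG_Gen-style graph holding node names."""
    nodes: list = field(default_factory=list)

    def deduplicate(self, *, threshold: float = 0.9) -> "Graph":
        # Toy criterion: case-insensitive exact match stands in for
        # "SemHash similarity >= threshold"; first occurrence wins.
        seen, kept = set(), []
        for n in self.nodes:
            key = n.lower()
            if key not in seen:
                seen.add(key)
                kept.append(n)
        return Graph(nodes=kept)  # a new instance; self is untouched

g = Graph(nodes=["Tesla", "tesla", "SpaceX"])
g2 = g.deduplicate(threshold=0.9)
print(len(g.nodes), len(g2.nodes))  # 3 2
```

The same non-mutating contract applies to KG_Gen.deduplicate: compare `kg.deduplicate(threshold=0.8)` against `kg.deduplicate(threshold=0.95)` without ever touching `kg` itself.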
Atom

hyperextract.methods.typical.Atom
Bases: AutoGraph[NodeSchema, EdgeSchema]
Atom: A specialized AutoGraph for extracting high-quality triple-based KGs.
Features:
- Two-stage extraction: Extracts atomic facts first, then derives edges and nodes from facts
- Customized prompts from original Atom implementation
- Semantic Deduplication: Includes match_nodes_and_update_edges using SemHash with embeddings
- Temporal tracking: t_start, t_end, and t_obs fields capture relationship timing and extraction metadata
- Evidence attribution: atomic_facts field traces each extracted edge to source facts
- Nested node schema: Maintains rich semantic information (name + label)
Example:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings()
kg = Atom(llm_client=llm, embedder=embedder)

# 1. Extract: Facts -> Edges -> Nodes
kg.feed_text("Elon Musk is the CEO of SpaceX. Musk leads Tesla.")
print(f"Before dedup - Nodes: {len(kg.nodes)}, Edges: {len(kg.edges)}")

# 2. Deduplicate: merges 'Elon Musk' and 'Musk'
kg.match_nodes_and_update_edges(threshold=0.85)
print(f"After dedup - Nodes: {len(kg.nodes)}, Edges: {len(kg.edges)}")
```
Attributes

facts_per_chunk (instance attribute)

observation_time (instance attribute)
Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, observation_time: str | None = None, chunk_size: int = 2048, chunk_overlap: int = 256, facts_per_chunk: int = 10, max_workers: int = 10, verbose: bool = False)

Initialize Atom.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `llm_client` | `BaseChatModel` | Language model for extraction | *required* |
| `embedder` | `Embeddings` | Embedding model for vector indexing | *required* |
| `observation_time` | `str \| None` | Date when the extraction was performed, like '1997-10-10' or '1997-10-10 23:59:59'. If None, uses current date and time. | `None` |
| `chunk_size` | `int` | Characters per chunk | `2048` |
| `chunk_overlap` | `int` | Overlapping characters between chunks | `256` |
| `facts_per_chunk` | `int` | Max number of atomic facts to group into a single extraction batch | `10` |
| `max_workers` | `int` | Max concurrent extraction workers | `10` |
| `verbose` | `bool` | Display detailed execution logs and progress information | `False` |
_extract_data(text: str) -> AutoGraphSchema

Extract atomic facts first, then extract edges (two-stage extraction).

Process:

1. Split text into chunks.
2. Batch extract atomic facts from chunks.
3. Consolidate facts into a unified context.
4. Split the consolidated facts into chunks.
5. Batch extract edges from fact chunks.
6. Post-process: set the t_obs timestamp and derive nodes.
7. Merge all partial graphs into one global graph.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text. | *required* |

Returns:

| Type | Description |
|---|---|
| `AutoGraphSchema` | Extracted and validated graph. |
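The two-stage shape of the pipeline (facts first, edges second, nodes derived last) can be sketched with stub extractors. Both stubs below are hypothetical placeholders for what are LLM calls in the real implementation:

```python
def extract_facts(chunk: str) -> list:
    """Stub for stage 1: treat each sentence as one atomic fact (an LLM call in reality)."""
    return [s.strip() for s in chunk.split(".") if s.strip()]

def extract_edge(fact: str) -> tuple:
    """Stub for stage 2: naive subject-verb-object split (an LLM call in reality)."""
    words = fact.split()
    return (words[0], words[1], " ".join(words[2:]))

def two_stage_extract(text: str):
    facts = extract_facts(text)                                    # stage 1: atomic facts
    edges = [extract_edge(f) for f in facts]                       # stage 2: edges from facts
    nodes = sorted({e[0] for e in edges} | {e[2] for e in edges})  # derive nodes from endpoints
    return facts, edges, nodes

facts, edges, nodes = two_stage_extract("Musk leads Tesla. Musk founded SpaceX")
print(edges)  # [('Musk', 'leads', 'Tesla'), ('Musk', 'founded', 'SpaceX')]
```

The intermediate facts list is what powers Atom's evidence attribution: each edge can be traced back to the atomic fact it came from.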
match_nodes_and_update_edges(threshold: float = 0.8) -> Atom
Match nodes in the graph and update edges accordingly using SemHash with embeddings.
This method identifies and merges similar nodes based on semantic similarity using embeddings from the instance's embedder. It updates edges to reflect any changes in node identities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `threshold` | `float` | Similarity threshold for matching nodes (0.0 to 1.0). Defaults to 0.8. | `0.8` |

Returns:

| Type | Description |
|---|---|
| `Atom` | The updated Atom instance with matched nodes and updated edges. |