Typical Methods

Direct extraction methods that process text without retrieval.


iText2KG

hyperextract.methods.typical.iText2KG

Bases: AutoGraph[NodeSchema, EdgeSchema]

iText2KG: A specialized AutoGraph for extracting high-quality triple-based KGs.

Features:

- Fixed schema (NodeSchema, EdgeSchema) optimized for triple extraction
- Customized prompts from the original iText2KG implementation
- Automatic deduplication and consistency checking via AutoGraph's OMem
- Two-stage extraction: entities first, then relationships

Example

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings()
kg = iText2KG(llm_client=llm, embedder=embedder)
kg.feed_text(text)
print(f"Extracted {len(kg.nodes)} entities and {len(kg.edges)} relationships.")
```

Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

Initialize iText2KG.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `llm_client` | `BaseChatModel` | Language model for extraction | *required* |
| `embedder` | `Embeddings` | Embedding model for vector indexing | *required* |
| `chunk_size` | `int` | Characters per chunk | `2048` |
| `chunk_overlap` | `int` | Overlapping characters between chunks | `256` |
| `max_workers` | `int` | Max concurrent extraction workers | `10` |
| `verbose` | `bool` | Display detailed execution logs and progress information | `False` |
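The `chunk_size` and `chunk_overlap` parameters describe a character-level sliding window over the input text. The sketch below illustrates that splitting strategy; it is an illustration only, and the library may delegate to a LangChain text splitter with different boundary handling:

```python
def split_into_chunks(text: str, chunk_size: int = 2048, chunk_overlap: int = 256) -> list[str]:
    """Split text into overlapping character windows.

    Illustrative only: shows how chunk_size and chunk_overlap interact,
    not the library's internal splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window already reached the end of the text
    return chunks

text_demo = "".join(str(i % 10) for i in range(5000))
chunks = split_into_chunks(text_demo, chunk_size=2048, chunk_overlap=256)
# 3 chunks; consecutive chunks share 256 characters.
```

Larger overlaps reduce the chance of an entity-relation pair being cut in half at a chunk boundary, at the cost of some duplicated extraction work.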

iText2KG_Star

hyperextract.methods.typical.iText2KG_Star

Bases: AutoGraph[NodeSchema, EdgeSchema]

iText2KG_Star: A specialized AutoGraph for extracting high-quality triple-based KGs.

Features:

- One-stage extraction: extracts edges directly, then derives nodes automatically
- Customized prompts from the original iText2KG_Star implementation
- Semantic deduplication: includes match_nodes_and_update_edges using SemHash with embeddings
- Automatic date tracking: observation date set to extraction time
- Nested node schema: maintains rich semantic information (name + label)

Example

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings()
kg = iText2KG_Star(llm_client=llm, embedder=embedder)

# 1. Extract relationships and derive nodes
kg.feed_text("Elon Musk is the CEO of SpaceX. Musk leads Tesla.")
print(f"Before dedup - Nodes: {len(kg.nodes)}, Edges: {len(kg.edges)}")

# 2. Deduplicate: merges 'Elon Musk' and 'Musk'
kg.match_nodes_and_update_edges(threshold=0.85)
print(f"After dedup - Nodes: {len(kg.nodes)}, Edges: {len(kg.edges)}")
```

Attributes

`observation_date` *instance attribute*

Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, observation_date: str | None = None, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

Initialize iText2KG_Star.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `llm_client` | `BaseChatModel` | Language model for extraction | *required* |
| `embedder` | `Embeddings` | Embedding model for vector indexing | *required* |
| `observation_date` | `str \| None` | Date when the extraction was performed, like `'1997-10-10'` or `'1997-10-10 23:59:59'`. If `None`, uses the current date and time. | `None` |
| `chunk_size` | `int` | Characters per chunk | `2048` |
| `chunk_overlap` | `int` | Overlapping characters between chunks | `256` |
| `max_workers` | `int` | Max concurrent extraction workers | `10` |
| `verbose` | `bool` | Display detailed execution logs and progress information | `False` |

_extract_data(text: str) -> AutoGraphSchema

Extract edges directly, then derive nodes (One-stage extraction).

Process:

1. Split text into chunks.
2. Batch extract edges directly from the chunks (edges contain node info).
3. Post-process: set the observation date to the current date and time.
4. Derive nodes from edges.
5. Merge all partial graphs into one global graph.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | Input text. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `AutoGraphSchema` | Extracted and validated graph. |
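Step 4 above ("derive nodes from edges") works because each extracted edge already carries its endpoint node info, so the node set is just the union of edge endpoints. A minimal sketch of that derivation, using simplified stand-ins for NodeSchema and EdgeSchema (the real schemas carry more fields):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    # Simplified stand-in for NodeSchema: name + label only
    name: str
    label: str

@dataclass
class Edge:
    # Simplified stand-in for EdgeSchema: endpoints embedded in the edge
    source: Node
    relation: str
    target: Node

def derive_nodes(edges: list[Edge]) -> list[Node]:
    """Collect the unique endpoint nodes of all edges, preserving first-seen order."""
    seen: dict[Node, None] = {}
    for edge in edges:
        seen.setdefault(edge.source)
        seen.setdefault(edge.target)
    return list(seen)

edges = [
    Edge(Node("Elon Musk", "Person"), "ceo_of", Node("SpaceX", "Company")),
    Edge(Node("Elon Musk", "Person"), "leads", Node("Tesla", "Company")),
]
nodes = derive_nodes(edges)  # 3 unique nodes from 4 endpoints
```

Because nodes are derived rather than extracted separately, the one-stage approach never produces orphan nodes that no edge references.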

match_nodes_and_update_edges(threshold: float = 0.8) -> iText2KG_Star

Match nodes in the graph and update edges accordingly using SemHash with embeddings.

This method identifies and merges similar nodes based on semantic similarity using embeddings from the instance's embedder. It updates edges to reflect any changes in node identities.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `threshold` | `float` | Similarity threshold for matching nodes (0.0 to 1.0). | `0.8` |

Returns:

| Type | Description |
| --- | --- |
| `iText2KG_Star` | The updated `iText2KG_Star` instance with matched nodes and updated edges. |
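Conceptually, node matching embeds each node name and merges pairs whose similarity clears the threshold. The sketch below illustrates that idea with hand-written 2-d vectors and a greedy merge; the actual implementation uses SemHash with the instance's embedder and a more scalable matching strategy:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_nodes(embeddings: dict[str, list[float]], threshold: float = 0.8) -> dict[str, str]:
    """Map each node name to a canonical name.

    Greedy illustration: a node is merged into the first earlier node it is
    similar enough to. Edges would then be rewritten through this mapping.
    """
    names = list(embeddings)
    canonical: dict[str, str] = {}
    for i, name in enumerate(names):
        canonical[name] = name
        for prior in names[:i]:
            if cosine(embeddings[name], embeddings[prior]) >= threshold:
                canonical[name] = canonical[prior]
                break
    return canonical

# Toy "embeddings": 'Musk' points almost the same way as 'Elon Musk'.
vectors = {
    "Elon Musk": [1.0, 0.05],
    "Musk": [1.0, 0.1],
    "SpaceX": [0.0, 1.0],
}
mapping = match_nodes(vectors, threshold=0.85)
# mapping["Musk"] == "Elon Musk"; "SpaceX" stays its own canonical node.
```

A higher threshold merges only near-identical names; a lower one merges more aggressively and risks conflating distinct entities.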


KG_Gen

hyperextract.methods.typical.KG_Gen

Bases: AutoGraph[NodeSchema, EdgeSchema]

Knowledge Graph Generator: A specialized AutoGraph for extracting simple triple-based KGs.

Features:

- Fixed schema (NodeSchema, EdgeSchema) optimized for triple extraction
- Customized prompts from the original kg_gen implementation
- Automatic deduplication and consistency checking via AutoGraph's OMem
- Two-stage extraction: entities first, then relationships

Example

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings()
kg = KG_Gen(llm_client=llm, embedder=embedder)
kg.feed_text(text)
print(f"Extracted {len(kg.nodes)} entities and {len(kg.edges)} relationships.")
```

Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

Initialize KG_Gen.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `llm_client` | `BaseChatModel` | Language model for extraction | *required* |
| `embedder` | `Embeddings` | Embedding model for vector indexing | *required* |
| `chunk_size` | `int` | Characters per chunk | `2048` |
| `chunk_overlap` | `int` | Overlapping characters between chunks | `256` |
| `max_workers` | `int` | Max concurrent extraction workers | `10` |
| `verbose` | `bool` | Display detailed execution logs and progress information | `False` |

_deduplicate_graph(graph_data: AutoGraphSchema[NodeSchema, EdgeSchema], threshold: float = 0.9) -> AutoGraphSchema[NodeSchema, EdgeSchema]

Internal helper to apply SemHash deduplication on a graph data object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `graph_data` | `AutoGraphSchema[NodeSchema, EdgeSchema]` | The graph data object (nodes/edges) to process in-place. | *required* |
| `threshold` | `float` | SemHash similarity threshold (0.0 to 1.0). | `0.9` |

Returns:

| Type | Description |
| --- | --- |
| `AutoGraphSchema[NodeSchema, EdgeSchema]` | The modified `graph_data` object. |

deduplicate(threshold: float = 0.9) -> KG_Gen

Return a NEW KG_Gen instance with deduplicated entities and edges. Does not modify the current instance.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `threshold` | `float` | Similarity threshold for SemHash (0.0 to 1.0). Higher means stricter. | `0.9` |

Returns:

| Type | Description |
| --- | --- |
| `KG_Gen` | A new, deduplicated `KG_Gen` instance. |

self_deduplicate(threshold: float = 0.9) -> KG_Gen

Deduplicate the current graph IN-PLACE using SemHash.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `threshold` | `float` | Similarity threshold (0.0 to 1.0). | `0.9` |

Returns:

| Type | Description |
| --- | --- |
| `KG_Gen` | `self` (modified in-place). |
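The difference between deduplicate and self_deduplicate is the copy-versus-mutate pattern sketched below on a hypothetical stand-in class (not the KG_Gen internals): one returns a modified deep copy and leaves the original untouched, the other mutates self and returns it for chaining.

```python
import copy

class TinyGraph:
    # Hypothetical stand-in for KG_Gen's dedup API surface.
    def __init__(self, nodes: list[str]):
        self.nodes = nodes

    def _dedup(self) -> None:
        # Placeholder for SemHash-based deduplication: drop exact repeats.
        self.nodes = list(dict.fromkeys(self.nodes))

    def deduplicate(self) -> "TinyGraph":
        """Return a NEW deduplicated instance; self is untouched."""
        clone = copy.deepcopy(self)
        clone._dedup()
        return clone

    def self_deduplicate(self) -> "TinyGraph":
        """Deduplicate IN-PLACE and return self for chaining."""
        self._dedup()
        return self

g = TinyGraph(["Musk", "Musk", "SpaceX"])
g2 = g.deduplicate()   # g still has 3 nodes, g2 has 2
g.self_deduplicate()   # now g has 2 nodes too
```

Prefer deduplicate when you want to compare the graph before and after, or try several thresholds; prefer self_deduplicate in pipelines where the pre-dedup graph is no longer needed.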


Atom

hyperextract.methods.typical.Atom

Bases: AutoGraph[NodeSchema, EdgeSchema]

Atom: A specialized AutoGraph for extracting high-quality triple-based KGs.

Features:

- Two-stage extraction: extracts atomic facts first, then derives edges and nodes from the facts
- Customized prompts from the original Atom implementation
- Semantic deduplication: includes match_nodes_and_update_edges using SemHash with embeddings
- Temporal tracking: t_start, t_end, and t_obs fields capture relationship timing and extraction metadata
- Evidence attribution: the atomic_facts field traces each extracted edge to its source facts
- Nested node schema: maintains rich semantic information (name + label)

Example

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings()
kg = Atom(llm_client=llm, embedder=embedder)

# 1. Extract: Facts -> Edges -> Nodes
kg.feed_text("Elon Musk is the CEO of SpaceX. Musk leads Tesla.")
print(f"Before dedup - Nodes: {len(kg.nodes)}, Edges: {len(kg.edges)}")

# 2. Deduplicate: merges 'Elon Musk' and 'Musk'
kg.match_nodes_and_update_edges(threshold=0.85)
print(f"After dedup - Nodes: {len(kg.nodes)}, Edges: {len(kg.edges)}")
```

Attributes

`facts_per_chunk` *instance attribute*

`observation_time` *instance attribute*

Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, observation_time: str | None = None, chunk_size: int = 2048, chunk_overlap: int = 256, facts_per_chunk: int = 10, max_workers: int = 10, verbose: bool = False)

Initialize Atom.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `llm_client` | `BaseChatModel` | Language model for extraction | *required* |
| `embedder` | `Embeddings` | Embedding model for vector indexing | *required* |
| `observation_time` | `str \| None` | Date when the extraction was performed, like `'1997-10-10'` or `'1997-10-10 23:59:59'`. If `None`, uses the current date and time. | `None` |
| `chunk_size` | `int` | Characters per chunk | `2048` |
| `chunk_overlap` | `int` | Overlapping characters between chunks | `256` |
| `facts_per_chunk` | `int` | Max number of atomic facts to group into a single extraction batch | `10` |
| `max_workers` | `int` | Max concurrent extraction workers | `10` |
| `verbose` | `bool` | Display detailed execution logs and progress information | `False` |

_extract_data(text: str) -> AutoGraphSchema

Extract atomic facts first, then extract edges (Two-stage extraction).

Process:

1. Split text into chunks.
2. Batch extract atomic facts from the chunks.
3. Consolidate facts into a unified context.
4. Split the consolidated facts into chunks.
5. Batch extract edges from the fact chunks.
6. Post-process: set the t_obs timestamp and derive nodes.
7. Merge all partial graphs into one global graph.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | Input text. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `AutoGraphSchema` | Extracted and validated graph. |
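Steps 3-5 above regroup the extracted atomic facts before the edge-extraction pass, with facts_per_chunk bounding each group's size. A minimal sketch of that batching step (illustrative only; the library's regrouping may also consolidate and re-split by characters):

```python
def batch_facts(facts: list[str], facts_per_chunk: int = 10) -> list[list[str]]:
    """Group atomic facts into batches of at most facts_per_chunk,
    preserving their original order."""
    return [
        facts[i:i + facts_per_chunk]
        for i in range(0, len(facts), facts_per_chunk)
    ]

facts = [f"fact {i}" for i in range(23)]
batches = batch_facts(facts, facts_per_chunk=10)  # batch sizes: 10, 10, 3
```

Smaller batches give the model less context per call (more calls, simpler prompts); larger batches let related facts be seen together when extracting edges.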

match_nodes_and_update_edges(threshold: float = 0.8) -> Atom

Match nodes in the graph and update edges accordingly using SemHash with embeddings.

This method identifies and merges similar nodes based on semantic similarity using embeddings from the instance's embedder. It updates edges to reflect any changes in node identities.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `threshold` | `float` | Similarity threshold for matching nodes (0.0 to 1.0). | `0.8` |

Returns:

| Type | Description |
| --- | --- |
| `Atom` | The updated `Atom` instance with matched nodes and updated edges. |