
RAG-Based Methods

Retrieval-Augmented Generation methods for large-document extraction.


LightRAG

hyperextract.methods.rag.Light_RAG

Bases: AutoGraph[NodeSchema, EdgeSchema]

Light-RAG: Standard Graph-based Retrieval-Augmented Generation

Extracts entity-relationship graphs (nodes and binary edges) from text documents. Optimized for standard Knowledge Graph construction and traversal.

Features:

- Two-stage extraction: entities first, then binary relationships.
- Custom key extractors: precise deduplication using name-based node keys and (Source, Target) keys for edges.
- Structured knowledge representation: Pydantic-based schemas.
- Specialized merging: custom LLM rules for merging duplicate entities and relationships.
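The key-extractor behavior can be pictured with a minimal sketch (illustrative only, not the library's implementation): nodes deduplicate on a normalized name key, binary edges on their (source, target) pair.

```python
# Illustrative sketch of the deduplication keys described above
# (not the library's actual implementation).

def node_key(node: dict) -> str:
    # Nodes are deduplicated by a normalized name key.
    return node["name"].strip().lower()

def edge_key(edge: dict) -> tuple:
    # Binary edges are deduplicated by their (source, target) pair.
    return (edge["source"].strip().lower(), edge["target"].strip().lower())

def merge_by_key(items, key_fn):
    # Duplicates would be merged via the custom LLM rules; here we
    # simply keep the first occurrence per key for illustration.
    merged = {}
    for item in items:
        merged.setdefault(key_fn(item), item)
    return list(merged.values())

nodes = [{"name": "Marie Curie"}, {"name": "marie curie "}, {"name": "Radium"}]
edges = [
    {"source": "Marie Curie", "target": "Radium"},
    {"source": "marie curie", "target": "Radium"},
]
unique_nodes = merge_by_key(nodes, node_key)   # 2 nodes survive
unique_edges = merge_by_key(edges, edge_key)   # 1 edge survives
```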

Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

Initialize Light_RAG engine.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| llm_client | BaseChatModel | LLM client for text generation. | required |
| embedder | Embeddings | Embedding model for text embeddings. | required |
| chunk_size | int | Size of text chunks for indexing. | 2048 |
| chunk_overlap | int | Overlap between text chunks for indexing. | 256 |
| max_workers | int | Maximum number of workers for indexing. | 10 |
| verbose | bool | Display detailed execution logs and progress information. | False |

GraphRAG

hyperextract.methods.rag.Graph_RAG

Bases: AutoGraph[NodeSchema, EdgeSchema]

Graph-RAG: Graph-based Retrieval-Augmented Generation

Extracts entity relationships (binary edges) and supports advanced GraphRAG features:

- Community Detection (Leiden/modularity)
- Community Reports (summarization)
- Global Search (map-reduce over summaries)

Implements the architecture of Microsoft GraphRAG / Nano-GraphRAG.
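The map-reduce pass over community summaries can be pictured as a toy sketch: a map step scores each community report for relevance, and a reduce step keeps the top-scoring ones as context. The word-overlap scoring here is a hypothetical stand-in; the real pipeline scores summaries with an LLM.

```python
# Toy sketch of map-reduce over community summaries (hypothetical
# word-overlap scoring; the real pipeline scores with an LLM).

def map_score(query: str, summary: str) -> int:
    # Map step: rate one community summary's relevance to the query.
    q = set(query.lower().split())
    return len(q & set(summary.lower().split()))

def global_search(query: str, reports: dict, top_n: int = 2) -> list:
    # Reduce step: keep the most relevant community ids as context.
    scored = sorted(reports.items(),
                    key=lambda kv: map_score(query, kv[1]),
                    reverse=True)
    return [cid for cid, _ in scored[:top_n]]

reports = {
    "c0": "community about radioactive elements and physics",
    "c1": "community about renaissance painting",
    "c2": "community about nuclear physics research",
}
context = global_search("history of physics research", reports)
```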

Attributes

community_reports: Dict[str, CommunityReport] = {} instance-attribute

_community_graph: Optional[Any] = None instance-attribute

_community_hierarchy: Dict[int, Dict[str, List[str]]] = {} instance-attribute

_node_to_community: Dict[str, Dict[str, Any]] = {} instance-attribute

Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

Initialize Graph_RAG engine.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| llm_client | BaseChatModel | LLM client for text generation. | required |
| embedder | Embeddings | Embedding model for text embeddings. | required |
| chunk_size | int | Size of text chunks for indexing. | 2048 |
| chunk_overlap | int | Overlap between text chunks for indexing. | 256 |
| max_workers | int | Maximum number of workers for indexing. | 10 |
| verbose | bool | Display detailed execution logs and progress information. | False |

dump(folder_path: str | Path) -> None

Saves graph state: internal data, community reports, and GraphML for visualization.

load(folder_path: str | Path) -> None

Loads graph state from directory.

_ensure_community_graph()

Lazily build _community_graph from nodes and edges when needed.

build_communities(level: int = 0)

Detects communities in the graph and generates reports for them. Uses Leiden algorithm (via graspologic).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| level | int | The hierarchical level for the Leiden algorithm. | 0 |
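As a rough intuition for what community detection produces, here is a stand-in that groups nodes into connected components via union-find. This is only an illustration of the output shape; the actual method uses the Leiden algorithm (via graspologic), which also splits dense components into finer communities.

```python
# Connected-components stand-in for community detection
# (illustrative; the actual method uses Leiden via graspologic).

def communities(nodes: list, edges: list) -> list:
    # Union-find over the edge list.
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=len, reverse=True)

comms = communities(
    nodes=["A", "B", "C", "D", "E"],
    edges=[("A", "B"), ("B", "C"), ("D", "E")],
)
# comms -> two communities: {"A", "B", "C"} and {"D", "E"}
```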

search(query: str, top_k_nodes: int = 3, top_k_edges: int = 3, top_k: int | None = None, use_community: bool = False) -> Tuple[List, List, Optional[Dict]]

Unified graph search interface with optional community enhancement.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| query | str | Search query string. | required |
| top_k_nodes | int | Number of node results to return. | 3 |
| top_k_edges | int | Number of edge results to return. | 3 |
| top_k | int \| None | If provided, sets both top_k_nodes and top_k_edges to this value. | None |
| use_community | bool | If True, enables community-aware search. Requires networkx and graspologic. | False |

Returns:

| Type | Description |
| --- | --- |
| Tuple[List, List, Optional[Dict]] | Tuple of (nodes, edges, community_context); community_context is None when use_community=False. |

_global_search_impl(query: str, top_k_nodes: int = 3, top_k_edges: int = 3) -> Tuple[List, List, Dict]

Internal implementation for community-enhanced search.

_get_community_context_for_query(query: str) -> Dict

Get community context related to the query.


HyperRAG

hyperextract.methods.rag.Hyper_RAG

Bases: AutoHypergraph[NodeSchema, EdgeSchema]

Hyper-RAG: Hypergraph-based Retrieval-Augmented Generation

Extracts multi-entity relationships (hyperedges) from text documents.

Features:

- Two-stage extraction: entities first, then low-order (binary) and high-order (n-ary) relationships.
- Custom key extractors: precise deduplication using name-based node keys and sorted participant tuples for edges.
- Hyperedge support: handles complex n-ary relationships connecting multiple entities simultaneously.
- Structured knowledge representation: Pydantic-based Node and Edge schemas with comprehensive attributes.
- Advanced indexing: optimized field-level indexing for efficient semantic search and retrieval.
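The sorted participant tuples used as hyperedge keys might look like this sketch (illustrative, not the actual code): the same set of participating entities maps to the same key regardless of the order in which they were extracted.

```python
# Sketch of hyperedge deduplication via sorted participant tuples
# (illustrative; not the library's implementation).

def hyperedge_key(participants: list) -> tuple:
    # Order-insensitive key: the same entity set yields the same key
    # no matter the extraction order.
    return tuple(sorted(p.strip().lower() for p in participants))

k1 = hyperedge_key(["Curie", "Radium", "Sorbonne"])
k2 = hyperedge_key(["Sorbonne", "curie", "Radium"])
# k1 == k2, so the two extractions deduplicate to one hyperedge
```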

Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

Initialize Hyper_RAG engine.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| llm_client | BaseChatModel | LLM client for text generation. | required |
| embedder | Embeddings | Embedding model for text embeddings. | required |
| chunk_size | int | Size of text chunks for indexing. | 2048 |
| chunk_overlap | int | Overlap between text chunks for indexing. | 256 |
| max_workers | int | Maximum number of workers for indexing. | 10 |
| verbose | bool | Display detailed execution logs and progress information. | False |

HyperGraphRAG

hyperextract.methods.rag.HyperGraph_RAG

Bases: AutoHypergraph[NodeSchema, EdgeSchema]

HyperGraphRAG extractor using semantic knowledge segments as hyperedges.

This class implements the HyperGraphRAG algorithm which models knowledge as a hypergraph where each hyperedge represents a complete "knowledge segment" (atomic semantic unit) that connects multiple related entities. Unlike traditional knowledge graphs that represent pairwise relationships, hypergraphs can naturally express n-ary relationships and maintain the original context and semantic integrity of information.

The extraction process simultaneously identifies both knowledge segments (hyperedges) and the entities involved within each segment, followed by deduplication and merging across multiple text chunks.

Schema Components:

  • NodeSchema: Represents an entity (node) extracted from the source text within a specific context.

    • name (str): The entity name, maintaining original language and capitalization.
    • type (str): Entity type such as 'person', 'organization', 'event', 'location', 'concept', etc.
    • description (str): Comprehensive description of the entity's attributes and activities within the knowledge segment.
    • key_score (int): Importance score (0-100) indicating the entity's significance in context.
  • EdgeSchema: Represents a knowledge segment (hyper-edge) that connects multiple entities through a specific context or event.

    • knowledge_segment (str): The actual text span (sentence or phrase) from source material capturing the relationship, event, or context connecting entities.
    • completeness_score (int): Quality score (0-10) indicating how complete and meaningful the segment is as a standalone knowledge unit.
    • related_entities (List[str]): Names of all entities involved in or referenced by this knowledge segment.
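The documented fields map onto schemas shaped roughly like this dataclass sketch (the library uses Pydantic models; field names and types follow the description above, and the sample values are illustrative):

```python
from dataclasses import dataclass, field
from typing import List

# Dataclass sketch of the documented schemas (the library uses
# Pydantic models; field names follow the description above).

@dataclass
class NodeSchema:
    name: str         # original language and capitalization preserved
    type: str         # 'person', 'organization', 'event', ...
    description: str  # attributes and activities within the segment
    key_score: int    # importance score, 0-100

@dataclass
class EdgeSchema:
    knowledge_segment: str   # text span capturing the relation/event
    completeness_score: int  # quality score, 0-10
    related_entities: List[str] = field(default_factory=list)

node = NodeSchema("Marie Curie", "person", "Physicist and chemist.", 95)
edge = EdgeSchema("Marie Curie discovered radium.", 9,
                  ["Marie Curie", "Radium"])
```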

Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

Initialize HyperGraph_RAG extraction engine.

Configures the hypergraph extraction pipeline with LLM-based knowledge segment and entity identification, along with semantic embeddings for vector-based retrieval.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| llm_client | BaseChatModel | Language model client for structured extraction. | required |
| embedder | Embeddings | Embedding model for semantic vector representation. | required |
| chunk_size | int | Text chunk size for processing, in tokens. | 2048 |
| chunk_overlap | int | Overlap between consecutive chunks to preserve context, in tokens. | 256 |
| max_workers | int | Maximum parallel workers for batch extraction. | 10 |
| verbose | bool | Whether to display detailed extraction logs. | False |

_extract_data(text: str) -> AutoHypergraphSchema[NodeSchema, EdgeSchema]

Extract hypergraph knowledge from text via knowledge segment identification.

Performs simultaneous extraction of knowledge segments (hyperedges) and associated entities from the input text. The extraction process:

1. Segments the text into knowledge-bearing units (sentences/clauses)
2. Identifies all entities referenced within each knowledge segment
3. Merges and deduplicates extracted components across multiple chunks
4. Prunes dangling edges (hyperedges without valid entity references)
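The pruning step can be pictured as follows: hyperedges whose entity references do not resolve against the extracted node set are dropped. This sketch drops an edge if any reference is unresolved; whether the real rule requires all or only some references to resolve is an implementation detail.

```python
# Sketch of dangling-edge pruning (illustrative; here an edge is
# dropped if any referenced entity is unknown).

def prune_dangling(edges: list, node_names: set) -> list:
    return [e for e in edges
            if all(name in node_names for name in e["related_entities"])]

nodes = {"Marie Curie", "Radium"}
edges = [
    {"knowledge_segment": "Curie discovered radium.",
     "related_entities": ["Marie Curie", "Radium"]},
    {"knowledge_segment": "Orphan segment.",
     "related_entities": ["Unknown Entity"]},
]
kept = prune_dangling(edges, nodes)
# kept contains only the first edge
```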

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | Source text for hypergraph extraction. | required |

Returns:

| Type | Description |
| --- | --- |
| AutoHypergraphSchema[NodeSchema, EdgeSchema] | Extracted hypergraph containing deduplicated and merged entities (nodes) and knowledge segments (hyperedges). |


CogRAG

hyperextract.methods.rag.Cog_RAG

Cognitive-Inspired Dual-Hypergraph RAG System.

Combines two hypergraph layers:

1. Theme Layer (Cog_RAG_ThemeLayer): Captures macro narratives and themes.
2. Detail Layer (Cog_RAG_DetailLayer): Captures micro entity relationships.

Attributes

metadata = {'created_at': datetime.now(), 'updated_at': datetime.now()} instance-attribute

theme_layer = Cog_RAG_ThemeLayer(llm_client=llm_client, embedder=embedder, chunk_size=chunk_size, chunk_overlap=chunk_overlap, max_workers=max_workers, verbose=verbose) instance-attribute

detail_layer = Cog_RAG_DetailLayer(llm_client=llm_client, embedder=embedder, chunk_size=chunk_size, chunk_overlap=chunk_overlap, max_workers=max_workers, verbose=verbose) instance-attribute

llm = llm_client instance-attribute

verbose = verbose instance-attribute

nodes property

Return combined unique nodes from both layers (simple aggregation).

edges property

Return all edges (Themes + Relations).

Functions

__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)

feed_text(text: str)

Feed text to both hypergraph layers.

build_index()

Build indices for both layers.

search(query: str, top_k_themes: int = 3, top_k_entities: int = 3)

Dual-layer search. Strategy:

1. Macro: Search for themes in the ThemeLayer (edge-first).
2. Micro: Search for entities in the DetailLayer (node-first).
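The dual-layer strategy might combine results like this toy sketch: themes ranked in the macro layer, entities ranked in the micro layer, each with its own top-k. Keyword overlap stands in for the layers' real vector search.

```python
# Toy dual-layer search: themes from the macro layer (edge-first),
# entities from the micro layer (node-first). Keyword overlap is a
# stand-in for the real vector search.

def overlap(query: str, text: str) -> int:
    return len(set(query.lower().split()) & set(text.lower().split()))

def dual_search(query, themes, entities,
                top_k_themes=3, top_k_entities=3):
    def rank(items, k):
        return sorted(items, key=lambda t: overlap(query, t),
                      reverse=True)[:k]
    return rank(themes, top_k_themes), rank(entities, top_k_entities)

themes = ["scientific discovery in physics", "art history"]
entities = ["radium element", "physics laboratory", "oil painting"]
top_themes, top_entities = dual_search(
    "physics discovery", themes, entities,
    top_k_themes=1, top_k_entities=2)
```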

chat(query: str, top_k_themes: int = 3, top_k_entities: int = 3)

Generate an answer using context from both layers.

dump(folder_path: str)

Save both extracted graphs and metadata.

load(folder_path: str)

Load extracted graphs and metadata.

show(node_label_extractor=None, edge_label_extractor=None)

Visualize the specified layer interactively.