RAG-based Methods¶
Retrieval-augmented generation methods for extracting from large documents.
LightRAG¶
hyperextract.methods.rag.Light_RAG
¶
Bases: AutoGraph[NodeSchema, EdgeSchema]
Light-RAG: Standard Graph-based Retrieval-Augmented Generation
Extracts entity-relationship graphs (nodes and binary edges) from text documents. Optimized for standard Knowledge Graph construction and traversal.
Features:

- Two-stage extraction: entities first, then binary relationships.
- Custom key extractors: precise deduplication using name-based node keys and (source, target) keys for edges.
- Structured knowledge representation: Pydantic-based schemas.
- Specialized merging: custom LLM rules for merging duplicate entities and relationships.
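The key-extractor idea behind the deduplication can be sketched as follows. This is an illustrative stand-in, not the hyperextract internals: the dict shapes, `node_key`/`edge_key` names, and the "keep the longer description" merge rule are all assumptions (Light_RAG merges duplicates with custom LLM rules instead).

```python
# Hypothetical sketch of key-based deduplication: nodes keyed by name,
# edges keyed by a (source, target) pair, so repeated mentions collapse.
def node_key(node: dict) -> str:
    return node["name"].strip().lower()

def edge_key(edge: dict) -> tuple:
    return (edge["source"].strip().lower(), edge["target"].strip().lower())

def deduplicate(items, key_fn):
    seen = {}
    for item in items:
        k = key_fn(item)
        if k in seen:
            # Stand-in merge rule: keep the longer description.
            if len(item.get("description", "")) > len(seen[k].get("description", "")):
                seen[k] = item
        else:
            seen[k] = item
    return list(seen.values())

nodes = [
    {"name": "Marie Curie", "description": "Physicist."},
    {"name": "marie curie", "description": "Physicist and chemist, two Nobel Prizes."},
]
unique_nodes = deduplicate(nodes, node_key)
```

Because the key normalizes case and whitespace, both mentions above resolve to one node.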
Functions¶
__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)
¶
Initialize Light_RAG engine.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `llm_client` | `BaseChatModel` | LLM client for text generation. | *required* |
| `embedder` | `Embeddings` | Embedding model for text embeddings. | *required* |
| `chunk_size` | `int` | Size of text chunks for indexing. | `2048` |
| `chunk_overlap` | `int` | Overlap between text chunks for indexing. | `256` |
| `max_workers` | `int` | Maximum number of workers for indexing. | `10` |
| `verbose` | `bool` | Display detailed execution logs and progress information. | `False` |
GraphRAG¶
hyperextract.methods.rag.Graph_RAG
¶
Bases: AutoGraph[NodeSchema, EdgeSchema]
Graph-RAG: Graph-based Retrieval-Augmented Generation
Extracts entity relationships (binary edges) and supports advanced GraphRAG features:

- Community detection (Leiden/modularity)
- Community reports (summarization)
- Global search (map-reduce over summaries)
Implements the architecture of Microsoft GraphRAG / Nano-GraphRAG.
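The map-reduce pattern behind global search can be sketched in miniature. This is a simplified, hypothetical stand-in: real GraphRAG asks the LLM to produce a partial answer per community report in the map step; here a keyword-overlap score plays that role, and the reduce step just keeps the best-scoring summaries as context.

```python
# Hypothetical sketch of map-reduce global search over community reports.
def map_step(query: str, report: str) -> tuple[float, str]:
    # Stand-in for an LLM partial answer: score report by query-term overlap.
    terms = set(query.lower().split())
    score = sum(1 for t in terms if t in report.lower()) / max(len(terms), 1)
    return score, report

def reduce_step(partials: list[tuple[float, str]], top_n: int = 2) -> str:
    # Keep the highest-scoring community summaries as final-answer context.
    best = sorted(partials, key=lambda p: p[0], reverse=True)[:top_n]
    return "\n".join(report for score, report in best if score > 0)

reports = [
    "Community 0: researchers collaborating on radioactivity.",
    "Community 1: publishers and journals in 19th-century France.",
    "Community 2: radioactivity experiments and laboratory equipment.",
]
context = reduce_step([map_step("radioactivity research", r) for r in reports])
```

The off-topic community (1) is filtered out, while the two relevant summaries survive into the reduced context.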
Attributes¶
community_reports: Dict[str, CommunityReport] = {}
instance-attribute
¶
_community_graph: Optional[Any] = None
instance-attribute
¶
_community_hierarchy: Dict[int, Dict[str, List[str]]] = {}
instance-attribute
¶
_node_to_community: Dict[str, Dict[str, Any]] = {}
instance-attribute
¶
Functions¶
__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)
¶
Initialize Graph_RAG engine.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `llm_client` | `BaseChatModel` | LLM client for text generation. | *required* |
| `embedder` | `Embeddings` | Embedding model for text embeddings. | *required* |
| `chunk_size` | `int` | Size of text chunks for indexing. | `2048` |
| `chunk_overlap` | `int` | Overlap between text chunks for indexing. | `256` |
| `max_workers` | `int` | Maximum number of workers for indexing. | `10` |
| `verbose` | `bool` | Display detailed execution logs and progress information. | `False` |
dump(folder_path: str | Path) -> None
¶
Saves graph state: internal data, community reports, and GraphML for visualization.
load(folder_path: str | Path) -> None
¶
Loads graph state from directory.
_ensure_community_graph()
¶
Lazily build _community_graph from nodes and edges when needed.
build_communities(level: int = 0)
¶
Detects communities in the graph and generates reports for them. Uses Leiden algorithm (via graspologic).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `level` | `int` | The hierarchical level for the Leiden algorithm. | `0` |
search(query: str, top_k_nodes: int = 3, top_k_edges: int = 3, top_k: int | None = None, use_community: bool = False) -> Tuple[List, List, Optional[Dict]]
¶
Unified graph search interface with optional community enhancement.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | Search query string. | *required* |
| `top_k_nodes` | `int` | Number of node results to return. | `3` |
| `top_k_edges` | `int` | Number of edge results to return. | `3` |
| `top_k` | `int \| None` | If provided, sets both `top_k_nodes` and `top_k_edges` to this value. | `None` |
| `use_community` | `bool` | If True, enables community-aware search. Requires networkx and graspologic. | `False` |
Returns:

| Type | Description |
|---|---|
| `Tuple[List, List, Optional[Dict]]` | Tuple of (nodes, edges, community_context). `community_context` is None when `use_community=False`. |
_global_search_impl(query: str, top_k_nodes: int = 3, top_k_edges: int = 3) -> Tuple[List, List, Dict]
¶
Internal implementation for community-enhanced search.
_get_community_context_for_query(query: str) -> Dict
¶
Get community context related to the query.
HyperRAG¶
hyperextract.methods.rag.Hyper_RAG
¶
Bases: AutoHypergraph[NodeSchema, EdgeSchema]
Hyper-RAG: Hypergraph-based Retrieval-Augmented Generation
Extracts multi-entity relationships (hyperedges) from text documents.
Features:

- Two-stage extraction: entities first, then low-order (binary) and high-order (n-ary) relationships.
- Custom key extractors: precise deduplication using name-based node keys and sorted participant tuples for edges.
- Hyperedge support: handles complex n-ary relationships connecting multiple entities simultaneously.
- Structured knowledge representation: Pydantic-based Node and Edge schemas with comprehensive attributes.
- Advanced indexing: optimized field-level indexing for efficient semantic search and retrieval.
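The sorted-participant-tuple key mentioned above generalizes the binary (source, target) key to n-ary hyperedges. A minimal sketch, with hypothetical data shapes:

```python
# Illustrative hyperedge key extractor: an n-ary edge is identified by the
# sorted tuple of its normalized participants, so the same relationship
# listed in a different order deduplicates to a single hyperedge.
def hyperedge_key(participants: list[str]) -> tuple:
    return tuple(sorted(p.strip().lower() for p in participants))

e1 = ["Alice", "Bob", "Carol"]
e2 = ["carol", "alice", "bob"]  # same 3-ary relationship, different order
same = hyperedge_key(e1) == hyperedge_key(e2)
```

Sorting makes the key order-independent, which a plain tuple of participants would not be.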
Functions¶
__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)
¶
Initialize Hyper_RAG engine.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `llm_client` | `BaseChatModel` | LLM client for text generation. | *required* |
| `embedder` | `Embeddings` | Embedding model for text embeddings. | *required* |
| `chunk_size` | `int` | Size of text chunks for indexing. | `2048` |
| `chunk_overlap` | `int` | Overlap between text chunks for indexing. | `256` |
| `max_workers` | `int` | Maximum number of workers for indexing. | `10` |
| `verbose` | `bool` | Display detailed execution logs and progress information. | `False` |
HyperGraphRAG¶
hyperextract.methods.rag.HyperGraph_RAG
¶
Bases: AutoHypergraph[NodeSchema, EdgeSchema]
HyperGraphRAG extractor using semantic knowledge segments as hyperedges.
This class implements the HyperGraphRAG algorithm which models knowledge as a hypergraph where each hyperedge represents a complete "knowledge segment" (atomic semantic unit) that connects multiple related entities. Unlike traditional knowledge graphs that represent pairwise relationships, hypergraphs can naturally express n-ary relationships and maintain the original context and semantic integrity of information.
The extraction process simultaneously identifies both knowledge segments (hyperedges) and the entities involved within each segment, followed by deduplication and merging across multiple text chunks.
Schema Components:

- NodeSchema: Represents an entity (node) extracted from the source text within a specific context.
  - `name` (str): The entity name, maintaining original language and capitalization.
  - `type` (str): Entity type such as 'person', 'organization', 'event', 'location', or 'concept'.
  - `description` (str): Comprehensive description of the entity's attributes and activities within the knowledge segment.
  - `key_score` (int): Importance score (0-100) indicating the entity's significance in context.
- EdgeSchema: Represents a knowledge segment (hyperedge) that connects multiple entities through a specific context or event.
  - `knowledge_segment` (str): The actual text span (sentence or phrase) from the source material capturing the relationship, event, or context connecting the entities.
  - `completeness_score` (int): Quality score (0-10) indicating how complete and meaningful the segment is as a standalone knowledge unit.
  - `related_entities` (List[str]): Names of all entities involved in or referenced by this knowledge segment.
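The documented fields can be sketched with stdlib dataclasses. The actual classes inside hyperextract are Pydantic models with validation; only the field names and types below follow the documentation, the rest is illustrative.

```python
from dataclasses import dataclass, field
from typing import List

# Stdlib sketch of the documented schema fields (the real classes are
# Pydantic models; field names mirror the Schema Components above).
@dataclass
class NodeSchema:
    name: str          # entity name, original language/capitalization
    type: str          # 'person', 'organization', 'event', ...
    description: str   # entity attributes and activities in context
    key_score: int     # importance score, 0-100

@dataclass
class EdgeSchema:
    knowledge_segment: str   # text span capturing the relationship/event
    completeness_score: int  # standalone-quality score, 0-10
    related_entities: List[str] = field(default_factory=list)

seg = EdgeSchema(
    knowledge_segment="Curie and Becquerel shared the 1903 Nobel Prize in Physics.",
    completeness_score=9,
    related_entities=["Marie Curie", "Henri Becquerel"],
)
```

A single segment connecting two entities like this is the atomic unit the extractor deduplicates and merges across chunks.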
Functions¶
__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)
¶
Initialize HyperGraph_RAG extraction engine.
Configures the hypergraph extraction pipeline with LLM-based knowledge segment and entity identification, along with semantic embeddings for vector-based retrieval.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `llm_client` | `BaseChatModel` | Language model client for structured extraction. | *required* |
| `embedder` | `Embeddings` | Embedding model for semantic vector representation. | *required* |
| `chunk_size` | `int` | Text chunk size for processing, in tokens. | `2048` |
| `chunk_overlap` | `int` | Overlap between consecutive chunks to preserve context, in tokens. | `256` |
| `max_workers` | `int` | Maximum parallel workers for batch extraction. | `10` |
| `verbose` | `bool` | Whether to display detailed extraction logs. | `False` |
_extract_data(text: str) -> AutoHypergraphSchema[NodeSchema, EdgeSchema]
¶
Extract hypergraph knowledge from text via knowledge segment identification.
Performs simultaneous extraction of knowledge segments (hyperedges) and associated entities from the input text. The extraction process:

1. Segments the text into knowledge-bearing units (sentences/clauses)
2. Identifies all entities referenced within each knowledge segment
3. Merges and deduplicates extracted components across multiple chunks
4. Prunes dangling edges (hyperedges without valid entity references)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Source text for hypergraph extraction. | *required* |
Returns:

| Type | Description |
|---|---|
| `AutoHypergraphSchema[NodeSchema, EdgeSchema]` | Extracted hypergraph containing deduplicated and merged entities (nodes) and knowledge segments (hyperedges). |
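The dangling-edge pruning step can be sketched as below. The data shapes and the pruning rule are assumptions: one plausible reading of "without valid entity references" is that an edge is dropped when any of its participants is not a known node (or when it has no participants at all); the real implementation may be more lenient.

```python
# Hypothetical sketch of dangling-hyperedge pruning: keep only edges whose
# participants all resolve to surviving nodes.
def prune_dangling(edges: list[dict], node_names: set[str]) -> list[dict]:
    return [e for e in edges
            if e["related_entities"]
            and all(n in node_names for n in e["related_entities"])]

nodes = {"Marie Curie", "Pierre Curie"}
edges = [
    {"knowledge_segment": "Marie and Pierre Curie shared a laboratory.",
     "related_entities": ["Marie Curie", "Pierre Curie"]},
    {"knowledge_segment": "A segment citing an unknown entity.",
     "related_entities": ["Marie Curie", "Irene Joliot-Curie"]},
]
kept = prune_dangling(edges, nodes)
```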
CogRAG¶
hyperextract.methods.rag.Cog_RAG
¶
Cognitive-Inspired Dual-Hypergraph RAG System.
Combines two hypergraph layers:

1. Theme Layer (Cog_RAG_ThemeLayer): Captures macro narratives and themes.
2. Detail Layer (Cog_RAG_DetailLayer): Captures micro entity relationships.
Attributes¶
metadata = {'created_at': datetime.now(), 'updated_at': datetime.now()}
instance-attribute
¶
theme_layer = Cog_RAG_ThemeLayer(llm_client=llm_client, embedder=embedder, chunk_size=chunk_size, chunk_overlap=chunk_overlap, max_workers=max_workers, verbose=verbose)
instance-attribute
¶
detail_layer = Cog_RAG_DetailLayer(llm_client=llm_client, embedder=embedder, chunk_size=chunk_size, chunk_overlap=chunk_overlap, max_workers=max_workers, verbose=verbose)
instance-attribute
¶
llm = llm_client
instance-attribute
¶
verbose = verbose
instance-attribute
¶
nodes
property
¶
Return combined unique nodes from both layers (simple aggregation).
edges
property
¶
Return all edges (Themes + Relations).
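The "simple aggregation" behind the `nodes` property can be sketched as below. The dict shapes and the first-occurrence-wins rule are assumptions; the point is only that both layers' nodes are unioned by key rather than merged semantically.

```python
# Hypothetical sketch of dual-layer node aggregation: union both layers'
# nodes keyed by name, keeping the first occurrence of each key.
def combined_nodes(theme_nodes: list[dict], detail_nodes: list[dict]) -> list[dict]:
    merged = {}
    for node in theme_nodes + detail_nodes:
        merged.setdefault(node["name"], node)
    return list(merged.values())

theme = [{"name": "Radioactivity", "layer": "theme"}]
detail = [{"name": "Radioactivity", "layer": "detail"},
          {"name": "Marie Curie", "layer": "detail"}]
all_nodes = combined_nodes(theme, detail)
```

A node present in both layers ("Radioactivity" here) appears once in the combined view.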
Functions¶
__init__(llm_client: BaseChatModel, embedder: Embeddings, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False)
¶
feed_text(text: str)
¶
Feed text to both hypergraph layers.
build_index()
¶
Build indices for both layers.
search(query: str, top_k_themes: int = 3, top_k_entities: int = 3)
¶
Dual-layer search. Strategy:

1. Macro: search for themes in the Theme Layer (edge-first).
2. Micro: search for entities in the Detail Layer (node-first).
chat(query: str, top_k_themes: int = 3, top_k_entities: int = 3)
¶
Generate an answer using context from both layers.
dump(folder_path: str)
¶
Save both extracted graphs and metadata.
load(folder_path: str)
¶
Load extracted graphs and metadata.
show(node_label_extractor=None, edge_label_extractor=None)
¶
Visualize the specified layer interactively.