Record Types¶
Used for recording and storing data; relationships between entities are not involved.
AutoModel¶
hyperextract.types.AutoModel
¶
Bases: BaseAutoType[T]
AutoModel - extracts a single structured object from text.
This pattern is designed for extracting exactly one structured object per document, regardless of document length. Suitable for document-level information like summaries, metadata, or aggregate statistics.
Key characteristics
- Extraction target: One unique structured object per document
- Merge strategy: Configurable via MergeStrategy enum (supports LLM-powered intelligent merging)
  - MERGE_FIELD: Non-null fields overwrite, lists append (simple field merge)
  - LLM.BALANCED: LLM intelligently synthesizes both versions (default)
  - LLM.PREFER_EXISTING: LLM synthesis but prioritizes original data
  - LLM.PREFER_INCOMING: LLM synthesis but prioritizes new data
- Indexing strategy: Each non-null field of the object is indexed independently
- Processing: Uses LangChain native batch processing for efficient multi-chunk handling
- Advanced merging: All chunk extractions are treated as the same object, triggering merge logic
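The MERGE_FIELD rule ("non-null fields overwrite, lists append") can be illustrated with a minimal sketch. It uses plain dictionaries instead of Pydantic models, and the helper name `merge_field` is hypothetical; it is not the merger that hyperextract or ontomem actually ships.

```python
from typing import Any, Dict


def merge_field(existing: Dict[str, Any], incoming: Dict[str, Any]) -> Dict[str, Any]:
    """Illustrative field merge: non-null incoming values overwrite, lists append."""
    merged = dict(existing)
    for name, value in incoming.items():
        if value is None:
            continue  # a null incoming field never erases existing data
        current = merged.get(name)
        if isinstance(current, list) and isinstance(value, list):
            merged[name] = current + value  # lists are concatenated, not replaced
        else:
            merged[name] = value  # any other non-null value overwrites
    return merged


# Two partial extractions of the same document-level object (made-up data).
chunk_1 = {"title": "Q3 Report", "author": None, "tags": ["finance"]}
chunk_2 = {"title": None, "author": "J. Doe", "tags": ["quarterly"]}

summary = merge_field(chunk_1, chunk_2)
print(summary)
```

Because every chunk is treated as the same object, folding this function over all chunk results yields one merged record. The LLM-based strategies replace the deterministic rule with a model call and are not sketched here.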
Attributes¶
_strategy_or_merger = strategy_or_merger
instance-attribute
¶
_constructor_kwargs = kwargs
instance-attribute
¶
_label_extractor = label_extractor
instance-attribute
¶
_constant_key = 'singleton'
instance-attribute
¶
_key_extractor = lambda x: self._constant_key
instance-attribute
¶
_merger = strategy_or_merger
instance-attribute
¶
data: T
property
¶
Returns all stored knowledge (read-only access).
Returns:
| Type | Description |
|---|---|
| T | The internal knowledge data as a Pydantic model instance. |
Functions¶
__init__(data_schema: Type[T], llm_client: BaseChatModel, embedder: Embeddings, *, strategy_or_merger: MergeStrategy | BaseMerger = MergeStrategy.LLM.BALANCED, prompt: str = '', label_extractor: Callable[[T], str] = None, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False, **kwargs)
¶
Initialize AutoModel with schema and configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_schema | Type[T] | Pydantic BaseModel subclass defining the object structure. | required |
| llm_client | BaseChatModel | Language model client for extraction. | required |
| embedder | Embeddings | Embedding model for vector indexing. | required |
| strategy_or_merger | MergeStrategy \| BaseMerger | Merge strategy for multi-chunk results: either a MergeStrategy enum value (e.g., MergeStrategy.MERGE_FIELD, MergeStrategy.LLM.BALANCED) or a custom BaseMerger instance. Defaults to MergeStrategy.LLM.BALANCED (the LLM intelligently synthesizes both versions). | MergeStrategy.LLM.BALANCED |
| prompt | str | Custom extraction prompt (defaults to generic prompt). | '' |
| label_extractor | Callable[[T], str] | Optional function to extract label from model instance for visualization. | None |
| chunk_size | int | Maximum characters per chunk for long texts. | 2048 |
| chunk_overlap | int | Overlapping characters between adjacent chunks. | 256 |
| max_workers | int | Maximum concurrent extraction tasks. | 10 |
| verbose | bool | Whether to log progress information. | False |
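How chunk_size and chunk_overlap interact can be sketched with a simple sliding window over characters. This is only an assumption about the general idea; the splitter hyperextract actually uses is not documented here, and `split_text` is a hypothetical helper.

```python
from typing import List


def split_text(text: str, chunk_size: int = 2048, chunk_overlap: int = 256) -> List[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk shares its first character with the end of the previous one, which is the role the overlap plays for text that straddles a chunk boundary.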
_create_empty_instance() -> AutoModel[T]
¶
Creates a new empty instance with the same configuration.
Overrides parent method to include AutoModel-specific parameters.
Returns:
| Type | Description |
|---|---|
| AutoModel[T] | New AutoModel instance. |
_default_prompt() -> str
¶
Returns the default extraction prompt for single-object extraction.
empty() -> bool
¶
Checks if the model is empty (no data stored).
Returns:
| Type | Description |
|---|---|
| bool | True if no data is stored, False otherwise. |
_init_data_state() -> None
¶
INIT/RESET: Initialize or reset to empty state (None). Called during init and when clear() is called.
_init_index_state() -> None
¶
Initialize vector index to empty state.
_set_data_state(data: T) -> None
¶
SET: Full reset. Replace with new data (e.g., load from disk). Called by parse() or load() where data IS the new state.
_update_data_state(incoming_data: T) -> None
¶
UPDATE: Incremental merge. Merge fields with field-level update strategy (called by feed()).
For AutoModel, an incremental update fills in missing fields; the first extraction wins.
merge_batch_data(data_list: List[T]) -> T
¶
Merge multiple extracted objects using configured strategy.
Leverages ontomem's merge strategies to intelligently combine results from multiple chunks. All extractions are treated as the same object (singleton) to trigger the merge logic.
Supported merge strategies:
- MERGE_FIELD: Non-null fields overwrite, lists append (simple field merge)
- LLM.BALANCED: LLM synthesizes both versions, balancing insights
- LLM.PREFER_EXISTING: LLM synthesis prioritizing original data
- LLM.PREFER_INCOMING: LLM synthesis prioritizing new data
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_list | List[T] | List of extracted data objects from batch processing to merge. | required |
Returns:
| Type | Description |
|---|---|
| T | A new merged knowledge object with intelligently combined fields. |
build_index()
¶
Builds vector index from all non-null fields in the data object.
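As a rough illustration of "each non-null field is indexed independently", the sketch below turns one object into one small field-value entry per populated field. It uses a standard-library dataclass as a stand-in for a Pydantic model; `ReportMeta` and `field_entries` are hypothetical names, and the real method additionally embeds each entry into a vector index.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ReportMeta:  # hypothetical schema standing in for a Pydantic model
    title: Optional[str] = None
    author: Optional[str] = None
    year: Optional[int] = None


def field_entries(obj: ReportMeta) -> List[Dict[str, Any]]:
    """Return one {field: value} entry per non-null field, in declaration order."""
    return [{name: value} for name, value in asdict(obj).items() if value is not None]


entries = field_entries(ReportMeta(title="Q3 Report", year=2024))
print(entries)
```

Entries of this shape match the "field-value dictionaries" that search() is documented to return.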
search(query: str, top_k: int = 3) -> List[Any]
¶
Searches all indexed fields using semantic similarity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | Search query string. | required |
| top_k | int | Number of results to return. | 3 |
Returns:
| Type | Description |
|---|---|
| List[Any] | List of relevant knowledge items (field-value dictionaries). |
dump_index(folder_path: str | Path) -> None
¶
Saves FAISS vector index to disk.
load_index(folder_path: str | Path) -> None
¶
Loads FAISS vector index from disk.
show(label_extractor: Callable[[T], str] = None, *, top_k: int = 3) -> None
¶
Visualize the model using OntoSight.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| label_extractor | Callable[[T], str] | Optional function to extract label from model instance for visualization. If not provided, uses the one from init. | None |
| top_k | int | Number of items to retrieve for chat callback (default: 3). | 3 |
__add__(other: Union[AutoModel, AutoList]) -> AutoList
¶
Operator overload for '+' to combine AutoModel instances into AutoList.
Supports multiple combination patterns:
- AutoModel + AutoModel → AutoList (create list from both items)
- AutoModel + AutoList → AutoList (prepend model to list)
This enables intuitive chain operations like: unit1 + unit2 + unit3
Usage:
    unit1 = AutoModel(PersonSchema, ...)
    unit2 = AutoModel(PersonSchema, ...)
    person_list = unit1 + unit2  # → AutoList[PersonSchema]
    # Chain operations
    unit3 = AutoModel(PersonSchema, ...)
    person_list = unit1 + unit2 + unit3  # → AutoList with 3 items
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | Union[AutoModel, AutoList] | Another AutoModel with the same data schema, or AutoList. | required |
Returns:
| Type | Description |
|---|---|
| AutoList | AutoList containing both objects as items. |
Raises:
| Type | Description |
|---|---|
| TypeError | If schemas don't match or invalid operand type. |
AutoList¶
hyperextract.types.AutoList
¶
Bases: BaseAutoType[AutoListSchema[ItemSchema]], Generic[ItemSchema]
AutoList - extracts a collection of objects from text.
This pattern extracts multiple independent objects from a document, suitable for extracting entities, events, references, or any collection of structured items.
Key characteristics
- Extraction target: A collection of structured objects
- Merge strategy: Append with basic deduplication (extensible by subclasses)
- Indexing strategy: Each item in the list is indexed independently
Comparison with AutoModel
- AutoModel: Extracts a single structured object (e.g., summary, metadata)
- AutoList: Extracts multiple independent objects (e.g., entity list, event list)
Attributes¶
item_list_schema: Type[AutoListSchema[ItemSchema]] = create_model(container_name, items=(List[item_schema], Field(default_factory=list, description='Item list')))
instance-attribute
¶
item_schema = item_schema
instance-attribute
¶
fields_for_index = fields_for_index
instance-attribute
¶
_item_label_extractor = item_label_extractor
instance-attribute
¶
items: List[ItemSchema]
property
¶
Returns the internal list of extracted items.
data: AutoListSchema
property
¶
Returns all stored knowledge (read-only access).
Returns:
| Type | Description |
|---|---|
| AutoListSchema | The internal knowledge data as AutoListSchema. |
Functions¶
__init__(item_schema: Type[ItemSchema], llm_client: BaseChatModel, embedder: Embeddings, *, prompt: str = '', item_label_extractor: Callable[[ItemSchema], str] = None, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False, fields_for_index: List[str] | None = None)
¶
Initialize AutoList with item schema and configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item_schema | Type[ItemSchema] | Pydantic BaseModel subclass for individual list items. | required |
| llm_client | BaseChatModel | Language model client for extraction. | required |
| embedder | Embeddings | Embedding model for vector indexing. | required |
| prompt | str | Custom extraction prompt (defaults to list-oriented prompt). | '' |
| item_label_extractor | Callable[[ItemSchema], str] | Optional function to extract label from item for visualization. | None |
| chunk_size | int | Maximum characters per chunk for long texts. | 2048 |
| chunk_overlap | int | Overlapping characters between adjacent chunks. | 256 |
| max_workers | int | Maximum concurrent extraction tasks. | 10 |
| verbose | bool | Whether to log progress information. | False |
| fields_for_index | List[str] \| None | Optional list of field names to include in vector index. If None, all fields are indexed. | None |
_default_prompt() -> str
¶
Returns the default extraction prompt for list-based extraction.
_create_empty_instance() -> AutoList[ItemSchema]
¶
Creates a new empty instance with the same configuration.
Overrides base class method to handle AutoList's item_schema parameter.
Returns:
| Type | Description |
|---|---|
| AutoList[ItemSchema] | A new AutoList instance with the same configuration. |
empty() -> bool
¶
Checks if the list is empty.
Returns:
| Type | Description |
|---|---|
| bool | True if no items are stored, False otherwise. |
_init_data_state() -> None
¶
INIT/RESET: Initialize or reset with empty schema. Called during init and when clear() is called.
_set_data_state(data: AutoListSchema) -> None
¶
SET: Full reset. Replace with new data (e.g., load from disk). Called by parse() or load() where data IS the new state.
_update_data_state(incoming_data: AutoListSchema) -> None
¶
UPDATE: Incremental merge. Append incoming items to current list (called by feed()).
For AutoList, incremental update means appending new items to existing list.
_init_index_state() -> None
¶
Initialize vector index to empty state.
merge_batch_data(data_list: List[AutoListSchema]) -> AutoListSchema
¶
Pure data merge method implementing list append strategy.
Merge strategy: Collects all items from all container objects and merges them into a single list. Used for aggregating extraction results from batch processing across multiple chunks. Subclasses can override this method to implement more sophisticated deduplication logic (e.g., AutoSet with custom key_extractor).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_list | List[AutoListSchema] | List of container objects from batch processing to merge. | required |
Returns:
| Type | Description |
|---|---|
| AutoListSchema | A new merged AutoListSchema object with combined items from all containers. |
build_index() -> None
¶
Builds independent vector index for each item in the list.
If fields_for_index is specified, only those fields are indexed. Otherwise, all fields are indexed.
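The effect of fields_for_index can be pictured as choosing which fields contribute to the text that gets embedded for each item. The sketch below is an assumption about that selection step only; `index_text` is a hypothetical helper, items are plain dictionaries, and no embedding or FAISS call is shown.

```python
from typing import Any, Dict, List, Optional


def index_text(item: Dict[str, Any], fields_for_index: Optional[List[str]] = None) -> str:
    """Build the text to embed for one item, optionally restricted to some fields."""
    names = fields_for_index if fields_for_index is not None else list(item)
    # Fields that are missing or null contribute nothing to the indexed text.
    return "\n".join(f"{name}: {item[name]}" for name in names if item.get(name) is not None)


person = {"name": "Ada Lovelace", "summary": "Mathematician", "internal_id": 17}

all_fields = index_text(person)
selected = index_text(person, fields_for_index=["name", "summary"])
print(selected)
```

Restricting the fields keeps identifiers and other noise out of the embedded text, which is the optimization the parameter is meant to offer on complex schemas.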
search(query: str, top_k: int = 3) -> List[ItemSchema]
¶
Searches items in the list using semantic similarity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | Search query string. | required |
| top_k | int | Number of results to return. | 3 |
Returns:
| Type | Description |
|---|---|
| List[ItemSchema] | List of relevant items. |
dump_index(folder_path: str | Path) -> None
¶
Saves FAISS vector index to disk.
load_index(folder_path: str | Path) -> None
¶
Loads FAISS vector index from disk.
show(item_label_extractor: Callable[[ItemSchema], str] = None, *, top_k_for_search: int = 3, top_k_for_chat: int = 3) -> None
¶
Visualize the list using OntoSight.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item_label_extractor | Callable[[ItemSchema], str] | Optional function to extract label from item for visualization. If not provided, uses the one from init. | None |
| top_k_for_search | int | Number of items to retrieve for search callback (default: 3). | 3 |
| top_k_for_chat | int | Number of items to retrieve for chat callback (default: 3). | 3 |
__len__() -> int
¶
Returns the number of elements in the list.
__getitem__(key: int | slice) -> ItemSchema | AutoList[ItemSchema]
¶
Support index access and slicing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | int \| slice | Integer index or slice object. | required |
Returns:
| Type | Description |
|---|---|
| ItemSchema | The item at that position, when key is an int. |
| AutoList[ItemSchema] | An AutoList of the selected items, when key is a slice. |
Raises:
| Type | Description |
|---|---|
| IndexError | If index is out of range. |
| TypeError | If key is neither int nor slice. |
Examples:
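The int-versus-slice dispatch described above follows the usual Python container protocol. A minimal sketch with a hypothetical `MiniList` class (not AutoList itself, which also carries schema, index, and metadata state):

```python
from typing import Generic, List, TypeVar, Union

T = TypeVar("T")


class MiniList(Generic[T]):
    """Tiny list wrapper showing int vs. slice handling in __getitem__."""

    def __init__(self, items: List[T]) -> None:
        self._items = list(items)

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, key: Union[int, slice]) -> Union[T, "MiniList[T]"]:
        if isinstance(key, slice):
            return MiniList(self._items[key])  # a slice yields a new wrapper
        if isinstance(key, int):
            return self._items[key]  # raises IndexError when out of range
        raise TypeError(f"indices must be int or slice, not {type(key).__name__}")


names = MiniList(["ada", "grace", "linus"])
first = names[0]
tail = names[1:]
print(first, len(tail))
```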
__setitem__(index: int, item: ItemSchema) -> None
¶
Support index assignment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int | Position to set (supports negative indexing). | required |
| item | ItemSchema | The item to set at that position. | required |
Raises:
| Type | Description |
|---|---|
| TypeError | If item schema doesn't match. |
| IndexError | If index is out of range. |
Side Effects
- Clears the vector index (needs rebuild)
- Updates metadata timestamp
__add__(other: BaseAutoType[ItemSchema]) -> AutoList[ItemSchema]
¶
Operator overload for '+' to combine knowledge instances.
Supports multiple combination patterns:
- AutoList + AutoList → AutoList (merge lists)
- AutoList + AutoModel → AutoList (append model to list)
This enables chain operations like: model1 + model2 + model3
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | BaseAutoType[ItemSchema] | Another AutoList or AutoModel with compatible schema. | required |
Returns:
| Type | Description |
|---|---|
| AutoList[ItemSchema] | New AutoList with combined items. |
Raises:
| Type | Description |
|---|---|
| TypeError | If schemas don't match or invalid operand type. |
__delitem__(index: int) -> None
¶
Support del operation for removing items by index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int | Position to delete (supports negative indexing). | required |
Raises:
| Type | Description |
|---|---|
| IndexError | If index is out of range. |
Side Effects
- Clears the vector index (needs rebuild)
- Updates metadata timestamp
__iter__() -> Iterator[ItemSchema]
¶
__contains__(item: ItemSchema) -> bool
¶
Support 'in' operator for membership testing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item | ItemSchema | The item to check for membership. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if item exists in the list, False otherwise. |
Comparison Logic
- Check if item's model_fields match any item's schema
- If schemas match, compare model_dump() equality
__repr__() -> str
¶
Return detailed string representation.
Returns:
| Type | Description |
|---|---|
| str | String in format: ClassNameItemSchema |
__str__() -> str
¶
append(item: ItemSchema) -> None
¶
Append a single item to the end of the list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item | ItemSchema | The item to append. | required |
Raises:
| Type | Description |
|---|---|
| TypeError | If item schema doesn't match item_schema. |
Side Effects
- Clears the vector index (needs rebuild)
- Updates metadata timestamp
extend(items: Iterable[ItemSchema] | AutoList[ItemSchema]) -> None
¶
Extend the list by appending multiple items.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| items | Iterable[ItemSchema] \| AutoList[ItemSchema] | Iterable of items to append: a list of items, another AutoList instance, or any iterable yielding items. | required |
Raises:
| Type | Description |
|---|---|
| TypeError | If any item's schema doesn't match item_schema. |
Side Effects
- Clears the vector index (needs rebuild)
- Updates metadata timestamp
insert(index: int, item: ItemSchema) -> None
¶
Insert an item at a specific position.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int | Position to insert at (supports negative indexing). | required |
| item | ItemSchema | The item to insert. | required |
Raises:
| Type | Description |
|---|---|
| TypeError | If item schema doesn't match item_schema. |
Examples:
>>> knowledge.insert(0, new_person) # Insert at beginning
>>> knowledge.insert(-1, new_person) # Insert before last
Side Effects
- Clears the vector index (needs rebuild)
- Updates metadata timestamp
remove(item: ItemSchema) -> None
¶
Remove the first occurrence of an item from the list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item | ItemSchema | The item to remove. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If item is not found in the list. |
Comparison Logic
Uses _items_equal() to find matching item.
Side Effects
- Clears the vector index (needs rebuild)
- Updates metadata timestamp
pop(index: int = -1) -> ItemSchema
¶
Remove and return an item at the given position.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int | Position to pop (default: -1, last item). | -1 |
Returns:
| Type | Description |
|---|---|
| ItemSchema | The removed item. |
Raises:
| Type | Description |
|---|---|
| IndexError | If list is empty or index is out of range. |
Side Effects
- Clears the vector index (needs rebuild)
- Updates metadata timestamp
index(item: ItemSchema, start: int = 0, stop: int | None = None) -> int
¶
Return the index of the first occurrence of item.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item | ItemSchema | The item to find. | required |
| start | int | Start searching from this position (default: 0). | 0 |
| stop | int \| None | Stop searching at this position (default: end of list). | None |
Returns:
| Type | Description |
|---|---|
| int | The index of the first matching item. |
Raises:
| Type | Description |
|---|---|
| ValueError | If item is not found in the specified range. |
Comparison Logic
Uses _items_equal() to match items.
count(item: ItemSchema) -> int
¶
copy() -> AutoList[ItemSchema]
¶
Create a deep copy of this AutoList instance.
Returns:
| Type | Description |
|---|---|
| AutoList[ItemSchema] | A new AutoList instance with copied items and metadata. |
Note
The vector index is not copied; it needs to be rebuilt if needed.
reverse() -> None
¶
sort(key: Callable[[ItemSchema], Any] | None = None, reverse: bool = False) -> None
¶
Sort the items in place.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | Callable[[ItemSchema], Any] \| None | Function to extract comparison key from each item. Must be provided, since items may not be directly comparable. | None |
| reverse | bool | If True, sort in descending order (default: False). | False |
Raises:
| Type | Description |
|---|---|
| TypeError | If key is not provided and items aren't comparable. |
Examples:
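Why a key function is effectively required can be shown with ordinary Python sorting. The sketch below uses a standard-library dataclass (`Person`, a hypothetical schema) and `sorted()` rather than AutoList.sort itself; structured items like these define no ordering, so sorting without a key raises TypeError.

```python
from dataclasses import dataclass


@dataclass
class Person:  # hypothetical item schema; defines no ordering of its own
    name: str
    age: int


people = [Person("Grace", 45), Person("Ada", 36), Person("Linus", 28)]

# A key function tells the sort which attribute to compare.
by_age = sorted(people, key=lambda p: p.age)
by_name_desc = sorted(people, key=lambda p: p.name, reverse=True)

# Without a key, Python has no way to order the objects.
try:
    sorted(people)
    key_was_required = False
except TypeError:
    key_was_required = True

print([p.name for p in by_age], key_was_required)
```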
Side Effects
- Rebuilds the vector index if it exists (to maintain consistency)
- Updates metadata timestamp
Note
Does not call clear_index() since elements aren't modified, but rebuilds index to maintain metadata order consistency.
_validate_item_schema(item: Any) -> None
¶
Validate that item's schema matches item_schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item | Any | The item to validate. | required |
Raises:
| Type | Description |
|---|---|
| TypeError | If schemas don't match, with detailed field difference. |
_items_equal(item1: BaseModel, item2: BaseModel) -> bool
¶
Check if two items are equal.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item1 | BaseModel | First item to compare. | required |
| item2 | BaseModel | Second item to compare. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if items are equal, False otherwise. |
Comparison Logic
- Check if both have the same model_fields (schema)
- If schemas match, compare model_dump() equality
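The two-step comparison above can be sketched with standard-library dataclasses, where `fields()` stands in for Pydantic's model_fields and `asdict()` for model_dump(). The function `items_equal` and the schemas are hypothetical, and the sketch compares field names only, which is an assumption about what "same schema" means here.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class Person:
    name: str
    age: int


@dataclass
class Employee:  # different class, same field names as Person
    name: str
    age: int


@dataclass
class City:
    name: str
    country: str


def items_equal(a, b) -> bool:
    """Step 1: same field names. Step 2: same dumped values."""
    if {f.name for f in fields(a)} != {f.name for f in fields(b)}:
        return False
    return asdict(a) == asdict(b)


same = items_equal(Person("Ada", 36), Employee("Ada", 36))
different_values = items_equal(Person("Ada", 36), Person("Ada", 37))
different_schema = items_equal(Person("Ada", 36), City("Ada", "UK"))
print(same, different_values, different_schema)
```

Under this rule equality is structural: two items of different classes compare equal when their fields and values coincide.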
AutoSet¶
hyperextract.types.AutoSet
¶
Bases: BaseAutoType[AutoSetSchema[ItemSchema]], Generic[ItemSchema]
AutoSet - extracts a unique collection of objects.
This pattern automatically deduplicates items based on a user-specified key extractor function. Provides flexible merge strategies including LLM-powered intelligent merging for handling duplicates.
Key characteristics
- Extraction target: A unique collection of structured objects
- Deduplication: Based on key_extractor function (user-specified)
- Merge strategy: Configurable via MergeStrategy enum:
  - KEEP_EXISTING: Preserve first (original) data, ignore updates
  - KEEP_INCOMING: Always use latest data, overwrite existing
  - MERGE_FIELD: Non-null fields overwrite, lists append
  - LLM.BALANCED: LLM intelligently synthesizes both versions (default)
  - LLM.PREFER_EXISTING: LLM synthesis but prioritizes original data
  - LLM.PREFER_INCOMING: LLM synthesis but prioritizes new data
  - LLM.CUSTOM_RULE: User-defined rules with dynamic context
- Internal storage: Dict for O(1) lookup and deduplication
- External interface: List (via items property)
- Set operations: union (|), intersection (&), difference (-), symmetric difference (^)
Comparison with AutoList
- AutoList: Allows duplicates, simple append merge
- AutoSet: Automatic deduplication, intelligent merge strategies
Example
>>> class KeywordSchema(BaseModel):
...     term: str
...     category: str | None = None
...     frequency: int | None = None
>>> keywords = AutoSet(
...     item_schema=KeywordSchema,
...     llm_client=llm,
...     embedder=embedder,
...     key_extractor=lambda x: x.term,
...     strategy_or_merger=MergeStrategy.MERGE_FIELD,
... )
>>> keywords.parse("Python is great. Python is powerful.")
>>> len(keywords)  # Only 1 item (deduplicated)
1
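The dictionary-backed deduplication listed under the key characteristics can be sketched without the library. Items are plain dictionaries, `dedupe` is a hypothetical helper, and only the two simplest strategies (KEEP_EXISTING and KEEP_INCOMING) are modeled; the field-merge and LLM strategies are out of scope for this illustration.

```python
from typing import Any, Callable, Dict, List

Item = Dict[str, Any]


def dedupe(items: List[Item], key_extractor: Callable[[Item], Any], keep: str = "existing") -> List[Item]:
    """Deduplicate by key using a dict, so each lookup is O(1) on average."""
    store: Dict[Any, Item] = {}
    for item in items:
        key = key_extractor(item)
        if key in store and keep == "existing":
            continue  # KEEP_EXISTING: the first item seen for a key wins
        store[key] = item  # new key, or KEEP_INCOMING: the latest item wins
    return list(store.values())  # exposed as a list, like the items property


extracted = [
    {"term": "Python", "frequency": 1},
    {"term": "Rust", "frequency": 1},
    {"term": "Python", "frequency": 2},
]

kept_first = dedupe(extracted, lambda x: x["term"], keep="existing")
kept_last = dedupe(extracted, lambda x: x["term"], keep="incoming")
print(kept_first, kept_last)
```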
Attributes¶
item_set_schema: Type[AutoSetSchema[ItemSchema]] = create_model(container_name, items=(List[item_schema], Field(default_factory=list, description='Set of unique items')))
instance-attribute
¶
item_schema = item_schema
instance-attribute
¶
fields_for_index = fields_for_index
instance-attribute
¶
_constructor_kwargs = kwargs
instance-attribute
¶
key_extractor = key_extractor
instance-attribute
¶
strategy_or_merger = strategy_or_merger
instance-attribute
¶
_merger = strategy_or_merger
instance-attribute
¶
_data_memory: OMem[ItemSchema] = OMem(memory_schema=item_schema, key_extractor=key_extractor, llm_client=llm_client, embedder=embedder, strategy_or_merger=(self._merger), verbose=verbose, fields_for_index=fields_for_index)
instance-attribute
¶
_item_label_extractor = item_label_extractor
instance-attribute
¶
data: AutoSetSchema[ItemSchema]
property
¶
Returns all stored knowledge (read-only access).
Returns:
| Type | Description |
|---|---|
| AutoSetSchema[ItemSchema] | The internal knowledge data as a Pydantic model instance. |
items: List[ItemSchema]
property
¶
Returns the internal items as a list (for external interface compatibility).
Returns:
| Type | Description |
|---|---|
| List[ItemSchema] | List of unique items. |
keys: List[Any]
property
¶
Returns all unique key values.
Returns:
| Type | Description |
|---|---|
| List[Any] | List of unique key values. |
Functions¶
__init__(item_schema: Type[ItemSchema], llm_client: BaseChatModel, embedder: Embeddings, key_extractor: Callable[[ItemSchema], Any], *, strategy_or_merger: MergeStrategy | BaseMerger = MergeStrategy.LLM.BALANCED, prompt: str = '', item_label_extractor: Callable[[ItemSchema], str] = None, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False, fields_for_index: List[str] | None = None, **kwargs: Any)
¶
Initialize AutoSet with key extractor and merge strategy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item_schema | Type[ItemSchema] | Pydantic BaseModel subclass for individual items. | required |
| llm_client | BaseChatModel | Language model client for extraction and merging. | required |
| embedder | Embeddings | Embedding model for vector indexing. | required |
| key_extractor | Callable[[ItemSchema], Any] | Function to extract unique key from an item (required). | required |
| strategy_or_merger | MergeStrategy \| BaseMerger | Merge strategy or pre-configured merger instance: either (1) a MergeStrategy enum value (e.g., MergeStrategy.LLM.BALANCED) or (2) a pre-configured BaseMerger instance (for full control). | MergeStrategy.LLM.BALANCED |
| prompt | str | Custom extraction prompt. | '' |
| item_label_extractor | Callable[[ItemSchema], str] | Optional function to extract label from item for visualization. | None |
| chunk_size | int | Maximum characters per chunk for long texts. | 2048 |
| chunk_overlap | int | Overlapping characters between adjacent chunks. | 256 |
| max_workers | int | Maximum concurrent extraction tasks. | 10 |
| verbose | bool | Whether to display detailed execution logs and progress information. | False |
| fields_for_index | List[str] \| None | Optional list of field names in item_schema to include in vector index. If None, all text fields are indexed by default. Useful for optimizing search on complex schemas. Example: ['name', 'summary'] (only index these fields). | None |
| **kwargs | Any | Additional arguments passed to create_merger() when strategy_or_merger is a MergeStrategy enum. Ignored if strategy_or_merger is a BaseMerger instance. | {} |
_create_empty_instance() -> AutoSet[ItemSchema]
¶
Creates a new empty instance with the same configuration.
Overrides parent method to include AutoSet-specific parameters.
Returns:
| Type | Description |
|---|---|
| AutoSet[ItemSchema] | New AutoSet instance with identical configuration. |
_default_prompt() -> str
¶
Returns the default extraction prompt for set-based extraction.
empty() -> bool
¶
Checks if the set is empty.
Returns:
| Type | Description |
|---|---|
| bool | True if no items are stored, False otherwise. |
_init_data_state() -> None
¶
INIT/RESET: Initialize or reset OMem as empty. Called during init and when clear() is called.
_init_index_state() -> None
¶
Initialize vector index to empty state.
_set_data_state(data: AutoSetSchema[ItemSchema]) -> None
¶
SET: Full Reset. Wipe OMem and refill from data (e.g., load from disk). Called by parse() or load() where data IS the new state.
_update_data_state(incoming_data: AutoSetSchema[ItemSchema]) -> None
¶
UPDATE: Incremental merge. Add to OMem efficiently (called by feed()).
Unlike the default behavior which uses merge_batch for full re-merge, AutoSet optimizes this by directly adding items to OMem, which handles deduplication and merging internally.
merge_batch_data(data_list: List[AutoSetSchema[ItemSchema]]) -> AutoSetSchema[ItemSchema]
¶
Merges multiple data containers with automatic deduplication.
Pure function: Does not modify internal state. Delegates to OMem's merge strategy for efficient deduplication and merging. All merge strategies are handled by the Merger implementation in OMem.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_list | List[AutoSetSchema[ItemSchema]] | List of container objects from batch processing to merge. | required |
Returns:
| Type | Description |
|---|---|
| AutoSetSchema[ItemSchema] | New merged AutoSetSchema with deduplicated items and resolved conflicts. |
build_index(force: bool = False) -> None
¶
Build/rebuild independent vector index for each item in the set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| force | bool | If True, forces rebuilding the index even if it already exists. | False |
search(query: str, top_k: int = 3) -> List[ItemSchema]
¶
Searches items in the set using semantic similarity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | Search query string. | required |
| top_k | int | Number of results to return. | 3 |
Returns:
| Type | Description |
|---|---|
| List[ItemSchema] | List of relevant items. |
dump_index(folder_path: str | Path) -> None
¶
Saves FAISS vector index to disk.
load_index(folder_path: str | Path) -> None
¶
Loads FAISS vector index from disk.
show(item_label_extractor: Callable[[ItemSchema], str] = None, *, top_k_for_search: int = 3, top_k_for_chat: int = 3) -> None
¶
Visualize the set using OntoSight.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item_label_extractor | Callable[[ItemSchema], str] | Optional function to extract label from item for visualization. If not provided, uses the one from init. | None |
| top_k_for_search | int | Number of items to retrieve for search callback (default: 3). | 3 |
| top_k_for_chat | int | Number of items to retrieve for chat callback (default: 3). | 3 |
__len__() -> int
¶
Returns the number of unique items in the set.
__contains__(key: Any) -> bool
¶
Checks if a unique key exists in the set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | Any | The unique key value to check. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if key exists, False otherwise. |
__repr__() -> str
¶
Returns a developer-friendly representation.
__str__() -> str
¶
Returns a user-friendly string representation.
__iter__() -> Iterator[ItemSchema]
¶
add(item: ItemSchema) -> None
¶
Adds a single item to the set with automatic deduplication.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item | ItemSchema | The item to add. | required |
remove(key: Any) -> Optional[ItemSchema]
¶
Removes an item by its unique key value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | Any | The unique key value to remove. | required |
Returns:
| Type | Description |
|---|---|
| Optional[ItemSchema] | The removed item, or None if not found. |
contains(key: Any) -> bool
¶
Checks if an item with the given key exists in the set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | Any | The unique key value to check. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if key exists, False otherwise. |
get(key: Any, default: Optional[ItemSchema] = None) -> Optional[ItemSchema]
¶
Gets an item by its unique key value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | Any | The unique key value to retrieve. | required |
| default | Optional[ItemSchema] | Default value if key not found. | None |
Returns:
| Type | Description |
|---|---|
| Optional[ItemSchema] | The item if found, otherwise default. |
update(items: List[ItemSchema]) -> None
¶
Batch adds multiple items.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| items | List[ItemSchema] | List of items to add. | required |
discard(key: Any) -> None
¶
Removes an item by its unique key value, silently ignoring if not found.
Unlike remove(), which returns the removed item (or None if the key is not found), this method returns nothing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | Any | The unique key value to remove. | required |
Side Effects
- Clears the vector index (needs rebuild)
- Updates metadata timestamp
pop() -> ItemSchema
¶
Removes and returns an arbitrary item from the set.
Returns:
| Type | Description |
|---|---|
| ItemSchema | The removed item. |
Raises:
| Type | Description |
|---|---|
| KeyError | If the set is empty. |
Side Effects
- Clears the vector index (needs rebuild)
- Updates metadata timestamp
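The behavior of discard() and pop(), including the side effects listed for both, might look like the following sketch. `IndexedSet`, `build_index`, `index`, and `updated_at` are hypothetical names for this illustration only; whether the real index is cleared when discard() finds nothing to remove is not stated in the reference, so the sketch assumes it is left alone in that case.

```python
import time
from typing import Any, Dict, List, Optional


class IndexedSet:
    """Hypothetical stand-in that tracks a derived index and a timestamp."""

    def __init__(self) -> None:
        self._items: Dict[Any, Any] = {}
        self.index: Optional[List[Any]] = None  # None means "needs rebuild"
        self.updated_at: float = time.time()

    def add(self, key: Any, item: Any) -> None:
        self._items[key] = item
        self._invalidate()

    def build_index(self) -> None:
        # Placeholder for an expensive index build (e.g. embedding every item).
        self.index = list(self._items)

    def _invalidate(self) -> None:
        # Side effects from the reference: clear the index, update the timestamp.
        self.index = None
        self.updated_at = time.time()

    def discard(self, key: Any) -> None:
        # Silently ignore missing keys (assumption: no side effects in that case).
        if key in self._items:
            del self._items[key]
            self._invalidate()

    def pop(self) -> Any:
        if not self._items:
            raise KeyError("pop from an empty set")
        _, item = self._items.popitem()
        self._invalidate()
        return item


s = IndexedSet()
s.add("a", {"name": "Ada"})
s.build_index()

s.discard("missing")               # no error, nothing removed
index_kept = s.index is not None

popped = s.pop()                   # removes the only item
index_cleared = s.index is None    # the index must be rebuilt before searching

try:
    s.pop()                        # the set is now empty
    raised = False
except KeyError:
    raised = True
```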
copy() -> AutoSet[ItemSchema]
Creates a deep copy of the set.
Returns:
| Type | Description |
|---|---|
| AutoSet[ItemSchema] | A new AutoSet instance with copies of all items. |
Examples:
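A minimal sketch of what a deep copy implies, using a hypothetical dict-backed `KeyedSet` rather than AutoSet itself: mutating an item in the copy leaves the original untouched.

```python
import copy
from typing import Any, Dict, Optional


class KeyedSet:
    """Hypothetical stand-in storing items in a dict keyed by their unique key."""

    def __init__(self, items: Optional[Dict[Any, Any]] = None) -> None:
        self._items: Dict[Any, Any] = dict(items or {})

    def get(self, key: Any) -> Any:
        return self._items.get(key)

    def copy(self) -> "KeyedSet":
        # deepcopy duplicates nested structures, not just the outer dict.
        return KeyedSet(copy.deepcopy(self._items))


original = KeyedSet({"ada": {"tags": ["math"]}})
clone = original.copy()

# Mutate a nested list inside the copy only.
clone.get("ada")["tags"].append("computing")

original_tags = original.get("ada")["tags"]
clone_tags = clone.get("ada")["tags"]
```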
__or__(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]
Union operation: set1 | set2.
__and__(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]
Intersection operation: set1 & set2.
__sub__(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]
Difference operation: set1 - set2.
__xor__(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]
Symmetric difference operation: set1 ^ set2.
Returns a new set containing items in either set but not in both.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | AutoSet[ItemSchema] | Another AutoSet instance. | required |
Returns:
| Type | Description |
|---|---|
| AutoSet[ItemSchema] | New AutoSet with symmetric difference of items. |
union(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]
Union operation (named method).
intersection(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]
Intersection operation (named method).
difference(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]
Difference operation (named method).
symmetric_difference(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]
Symmetric difference operation (named method).
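The operator and named-method forms can be sketched together as key-based set algebra. `KeyedSet` below is a hypothetical stand-in, and the rule that the left operand's item wins when both sets hold the same key is an assumption for illustration; the real class may resolve such collisions differently.

```python
from typing import Any, Dict, Optional, Set


class KeyedSet:
    """Hypothetical stand-in: set algebra decided purely by item keys."""

    def __init__(self, items: Optional[Dict[Any, Any]] = None) -> None:
        self._items: Dict[Any, Any] = dict(items or {})

    def keys(self) -> Set[Any]:
        return set(self._items)

    def __or__(self, other: "KeyedSet") -> "KeyedSet":
        merged = dict(other._items)
        merged.update(self._items)  # assumption: left operand wins on key clash
        return KeyedSet(merged)

    def __and__(self, other: "KeyedSet") -> "KeyedSet":
        return KeyedSet({k: v for k, v in self._items.items() if k in other._items})

    def __sub__(self, other: "KeyedSet") -> "KeyedSet":
        return KeyedSet({k: v for k, v in self._items.items() if k not in other._items})

    def __xor__(self, other: "KeyedSet") -> "KeyedSet":
        # Items in exactly one of the two sets.
        return (self - other) | (other - self)

    # Named methods delegate to the operators.
    def union(self, other: "KeyedSet") -> "KeyedSet":
        return self | other

    def intersection(self, other: "KeyedSet") -> "KeyedSet":
        return self & other

    def difference(self, other: "KeyedSet") -> "KeyedSet":
        return self - other

    def symmetric_difference(self, other: "KeyedSet") -> "KeyedSet":
        return self ^ other


a = KeyedSet({1: "a1", 2: "a2"})
b = KeyedSet({2: "b2", 3: "b3"})

union_keys = (a | b).keys()
common_keys = (a & b).keys()
only_a_keys = (a - b).keys()
either_keys = a.symmetric_difference(b).keys()
```

Every operation returns a new set and leaves both operands unchanged.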
__eq__(other: Any) -> bool
Equality comparison: set1 == set2.
Two sets are equal if they have the same schema and key set. Note: Does not compare item contents, only keys.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | Any | Another object to compare with. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if both sets have the same keys, False otherwise. |
Examples:
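A minimal sketch of key-only equality, assuming a hypothetical `KeyedSet` that stores its schema and a dict of items; it is not the AutoSet implementation. Two sets with the same schema and keys compare equal even when the stored items differ.

```python
from typing import Any, Dict


class KeyedSet:
    """Hypothetical stand-in comparing schema and keys, never item contents."""

    def __init__(self, schema: type, items: Dict[Any, Any]) -> None:
        self.schema = schema
        self._items = dict(items)

    def __eq__(self, other: Any) -> bool:
        if not isinstance(other, KeyedSet):
            return False
        return self.schema is other.schema and self._items.keys() == other._items.keys()

    def __ne__(self, other: Any) -> bool:
        return not self == other


a = KeyedSet(dict, {"ada": {"born": 1815}})
b = KeyedSet(dict, {"ada": {"born": 9999}})   # same key, different content
c = KeyedSet(dict, {"alan": {"born": 1912}})  # different key

same_keys_equal = a == b
different_keys_equal = a == c
other_type_equal = a == {"ada"}
```

Callers that need content equality have to compare the items themselves, for example by fetching each key from both sets.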
__ne__(other: Any) -> bool
Inequality comparison: set1 != set2.
Returns:
| Type | Description |
|---|---|
| bool | True if sets are not equal, False otherwise. |
__le__(other: AutoSet[ItemSchema]) -> bool
Subset comparison: set1 <= set2.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | AutoSet[ItemSchema] | Another AutoSet instance. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if self is a subset of other (all keys in self are in other). |
Raises:
| Type | Description |
|---|---|
| TypeError | If other is not an AutoSet or has a different schema. |
Examples:
__lt__(other: AutoSet[ItemSchema]) -> bool
Proper subset comparison: set1 < set2.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | AutoSet[ItemSchema] | Another AutoSet instance. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if self is a proper subset of other (subset and not equal). |
Examples:
__ge__(other: AutoSet[ItemSchema]) -> bool
Superset comparison: set1 >= set2.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | AutoSet[ItemSchema] | Another AutoSet instance. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if self is a superset of other (all keys in other are in self). |
Examples:
__gt__(other: AutoSet[ItemSchema]) -> bool
Proper superset comparison: set1 > set2.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | AutoSet[ItemSchema] | Another AutoSet instance. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if self is a proper superset of other (superset and not equal). |
Examples:
issubset(other: AutoSet[ItemSchema]) -> bool
Test whether every key in the set is in other.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | AutoSet[ItemSchema] | Another AutoSet instance. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if self is a subset of other. |
Examples:
issuperset(other: AutoSet[ItemSchema]) -> bool
Test whether every key in other is in the set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | AutoSet[ItemSchema] | Another AutoSet instance. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if self is a superset of other. |
Examples:
isdisjoint(other: AutoSet[ItemSchema]) -> bool
Test whether the set has no keys in common with other.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | AutoSet[ItemSchema] | Another AutoSet instance. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the two sets have no keys in common. |
Raises:
| Type | Description |
|---|---|
| TypeError | If other is not an AutoSet or has a different schema. |
Examples:
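A minimal sketch of the key-based relation tests, using a hypothetical `KeyedSet` that stores only a schema and a set of keys; `Person` and `City` are placeholder schemas. Applying the schema check to every comparison is an assumption, since the reference lists TypeError only for some of these methods.

```python
from typing import Any, Iterable


class Person:  # placeholder schema
    pass


class City:  # placeholder schema
    pass


class KeyedSet:
    """Hypothetical stand-in: relations are decided by keys within one schema."""

    def __init__(self, schema: type, keys: Iterable[Any]) -> None:
        self.schema = schema
        self._keys = set(keys)

    def _check(self, other: Any) -> None:
        if not isinstance(other, KeyedSet) or other.schema is not self.schema:
            raise TypeError("other must be a KeyedSet with the same schema")

    def __le__(self, other: "KeyedSet") -> bool:
        self._check(other)
        return self._keys <= other._keys

    def __lt__(self, other: "KeyedSet") -> bool:
        self._check(other)
        return self._keys < other._keys

    def __ge__(self, other: "KeyedSet") -> bool:
        self._check(other)
        return self._keys >= other._keys

    def __gt__(self, other: "KeyedSet") -> bool:
        self._check(other)
        return self._keys > other._keys

    def issubset(self, other: "KeyedSet") -> bool:
        return self <= other

    def issuperset(self, other: "KeyedSet") -> bool:
        return self >= other

    def isdisjoint(self, other: "KeyedSet") -> bool:
        self._check(other)
        return self._keys.isdisjoint(other._keys)


small = KeyedSet(Person, ["ada"])
big = KeyedSet(Person, ["ada", "alan"])
apart = KeyedSet(Person, ["grace"])
cities = KeyedSet(City, ["paris"])

subset = small <= big              # every key of small is in big
proper_of_self = small < small     # a set is never a proper subset of itself
superset = big.issuperset(small)
disjoint = small.isdisjoint(apart)

try:
    small.isdisjoint(cities)       # schemas differ
    mismatch_raises = False
except TypeError:
    mismatch_raises = True
```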