Record Types¶

Used to record and store data; these types do not model relationships between entities.

AutoModel¶

`hyperextract.types.AutoModel` ¶

Bases: BaseAutoType[T]

AutoModel - extracts a single structured object from text.

This pattern is designed for extracting exactly one structured object per document, regardless of document length. Suitable for document-level information like summaries, metadata, or aggregate statistics.

Key characteristics

Extraction target: One unique structured object per document
Merge strategy: Configurable via MergeStrategy enum (supports LLM-powered intelligent merging)
- MERGE_FIELD: Non-null fields overwrite, lists append (simple field merge)
- LLM.BALANCED: LLM intelligently synthesizes both versions (default)
- LLM.PREFER_EXISTING: LLM synthesis but prioritizes original data
- LLM.PREFER_INCOMING: LLM synthesis but prioritizes new data
Indexing strategy: Each non-null field of the object is indexed independently
Processing: Uses LangChain native batch processing for efficient multi-chunk handling
Advanced merging: All chunk extractions are treated as the same object, triggering merge logic
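The MERGE_FIELD rule above ("non-null fields overwrite, lists append") can be illustrated with a minimal sketch. This is not the library's merger: `merge_field` is a hypothetical helper that works on plain dictionaries, shown only to make the rule concrete.

```python
from typing import Any, Dict


def merge_field(existing: Dict[str, Any], incoming: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical field merge: non-null incoming values overwrite, lists append."""
    merged = dict(existing)
    for name, value in incoming.items():
        if value is None:
            continue  # a null incoming field never erases existing data
        current = merged.get(name)
        if isinstance(current, list) and isinstance(value, list):
            merged[name] = current + value  # lists are concatenated
        else:
            merged[name] = value  # other non-null values overwrite
    return merged


# Two partial extractions of the same document-level object.
chunk_1 = {"title": "Annual Report", "author": None, "tags": ["finance"]}
chunk_2 = {"title": None, "author": "J. Doe", "tags": ["2024"]}
merged = merge_field(chunk_1, chunk_2)
print(merged)
```

The LLM-based strategies replace this mechanical rule with a model call, so their results cannot be reproduced by a pure function like this one.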

Attributes¶

`_strategy_or_merger = strategy_or_merger` `instance-attribute` ¶

`_constructor_kwargs = kwargs` `instance-attribute` ¶

`_label_extractor = label_extractor` `instance-attribute` ¶

`_constant_key = 'singleton'` `instance-attribute` ¶

`_key_extractor = lambda x: self._constant_key` `instance-attribute` ¶

`_merger = strategy_or_merger` `instance-attribute` ¶

`data: T` `property` ¶

Returns all stored knowledge (read-only access).

Returns:

Type	Description
`T`	The internal knowledge data as a Pydantic model instance.

Functions¶

`__init__(data_schema: Type[T], llm_client: BaseChatModel, embedder: Embeddings, *, strategy_or_merger: MergeStrategy | BaseMerger = MergeStrategy.LLM.BALANCED, prompt: str = '', label_extractor: Callable[[T], str] = None, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False, **kwargs)` ¶

Initialize AutoModel with schema and configuration.

Parameters:

Name	Type	Description	Default
`data_schema`	`Type[T]`	Pydantic BaseModel subclass defining the object structure.	required
`llm_client`	`BaseChatModel`	Language model client for extraction.	required
`embedder`	`Embeddings`	Embedding model for vector indexing.	required
`strategy_or_merger`	`MergeStrategy \| BaseMerger`	Merge strategy for multi-chunk results: either a MergeStrategy enum value (e.g., MergeStrategy.MERGE_FIELD, MergeStrategy.LLM.BALANCED) or a custom BaseMerger instance. With the default, MergeStrategy.LLM.BALANCED, the LLM synthesizes both versions.	`BALANCED`
`prompt`	`str`	Custom extraction prompt (defaults to generic prompt).	`''`
`label_extractor`	`Callable[[T], str]`	Optional function to extract label from model instance for visualization.	`None`
`chunk_size`	`int`	Maximum characters per chunk for long texts.	`2048`
`chunk_overlap`	`int`	Overlapping characters between adjacent chunks.	`256`
`max_workers`	`int`	Maximum concurrent extraction tasks.	`10`
`verbose`	`bool`	Whether to log progress information.	`False`
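To show how `chunk_size` and `chunk_overlap` interact, here is a minimal character-window sketch. It assumes a plain sliding window; the library may split text differently (for example with a LangChain text splitter), and `split_with_overlap` is a hypothetical name.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 2048, chunk_overlap: int = 256) -> List[str]:
    """Hypothetical splitter: fixed-size character windows that share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk.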

`_create_empty_instance() -> AutoModel[T]` ¶

Creates a new empty instance with the same configuration.

Overrides parent method to include AutoModel-specific parameters.

Returns:

Type	Description
`AutoModel[T]`	New AutoModel instance.

`_default_prompt() -> str` ¶

Returns the default extraction prompt for single-object extraction.

`empty() -> bool` ¶

Checks if the model is empty (no data stored).

Returns:

Type	Description
`bool`	True if no data is stored, False otherwise.

`_init_data_state() -> None` ¶

INIT/RESET: Initialize or reset to empty state (None). Called during __init__ and when clear() is called.

`_init_index_state() -> None` ¶

Initialize vector index to empty state.

`_set_data_state(data: T) -> None` ¶

SET: Full reset. Replace with new data (e.g., load from disk). Called by parse() or load() where data IS the new state.

`_update_data_state(incoming_data: T) -> None` ¶

UPDATE: Incremental merge. Merge fields with field-level update strategy (called by feed()).

For AutoModel, an incremental update fills in missing fields only; for fields that already hold a value, the first extraction wins.
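The "fill missing fields, first extraction wins" behavior can be sketched as follows. The helper `fill_missing` is hypothetical and works on dictionaries instead of Pydantic models.

```python
from typing import Any, Dict


def fill_missing(existing: Dict[str, Any], incoming: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical incremental update: only empty fields accept incoming values."""
    merged = dict(existing)
    for name, value in incoming.items():
        if merged.get(name) is None and value is not None:
            merged[name] = value  # fill the gap; populated fields are left alone
    return merged


first = {"title": "Annual Report", "year": None}
second = {"title": "Yearly Report", "year": 2024}
updated = fill_missing(first, second)
print(updated)
```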

`merge_batch_data(data_list: List[T]) -> T` ¶

Merge multiple extracted objects using configured strategy.

Leverages ontomem's merge strategies to intelligently combine results from multiple chunks. All extractions are treated as the same object (singleton) to trigger the merge logic.

Supported merge strategies:
- MERGE_FIELD: Non-null fields overwrite, lists append (simple field merge)
- LLM.BALANCED: LLM synthesizes both versions, balancing insights
- LLM.PREFER_EXISTING: LLM synthesis prioritizing original data
- LLM.PREFER_INCOMING: LLM synthesis prioritizing new data

Parameters:

Name	Type	Description	Default
`data_list`	`List[T]`	List of extracted data objects from batch processing to merge.	required

Returns:

Type	Description
`T`	A new merged knowledge object with intelligently combined fields.

`build_index()` ¶

Builds vector index from all non-null fields in the data object.
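As a rough illustration of per-field indexing, the sketch below turns each non-null field into its own text document, which is what an embedder would then receive. It uses a standard-library dataclass as a stand-in for a Pydantic model; `ReportMeta` and `field_documents` are hypothetical names, and the actual document format used by the library is not specified here.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional, Tuple


@dataclass
class ReportMeta:
    """Stand-in for a Pydantic schema with optional fields."""
    title: Optional[str] = None
    author: Optional[str] = None
    summary: Optional[str] = None


def field_documents(obj: ReportMeta) -> List[Tuple[str, str]]:
    """Return one (field name, text) pair per non-null field."""
    return [
        (name, f"{name}: {value}")
        for name, value in asdict(obj).items()
        if value is not None  # null fields are skipped entirely
    ]


docs = field_documents(ReportMeta(title="Annual Report", summary="Revenue grew."))
print(docs)
```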

`search(query: str, top_k: int = 3) -> List[Any]` ¶

Searches all indexed fields using semantic similarity.

Parameters:

Name	Type	Description	Default
`query`	`str`	Search query string.	required
`top_k`	`int`	Number of results to return.	`3`

Returns:

Type	Description
`List[Any]`	List of relevant knowledge items (field-value dictionaries).

`dump_index(folder_path: str | Path) -> None` ¶

Saves FAISS vector index to disk.

`load_index(folder_path: str | Path) -> None` ¶

Loads FAISS vector index from disk.

`show(label_extractor: Callable[[T], str] = None, *, top_k: int = 3) -> None` ¶

Visualize the model using OntoSight.

Parameters:

Name	Type	Description	Default
`label_extractor`	`Callable[[T], str]`	Optional function to extract label from model instance for visualization. If not provided, uses the one from __init__.	`None`
`top_k`	`int`	Number of items to retrieve for chat callback (default: 3).	`3`

`__add__(other: Union[AutoModel, AutoList]) -> AutoList` ¶

Operator overload for '+' to combine AutoModel instances into AutoList.

Supports multiple combination patterns:
- AutoModel + AutoModel → AutoList (create list from both items)
- AutoModel + AutoList → AutoList (prepend model to list)

This enables intuitive chain operations like: unit1 + unit2 + unit3

Usage

>>> unit1 = AutoModel(PersonSchema, ...)
>>> unit2 = AutoModel(PersonSchema, ...)
>>> person_list = unit1 + unit2  # → AutoList[PersonSchema]

>>> # Chain operations
>>> unit3 = AutoModel(PersonSchema, ...)
>>> person_list = unit1 + unit2 + unit3  # → AutoList with 3 items

Parameters:

Name	Type	Description	Default
`other`	`Union[AutoModel, AutoList]`	Another AutoModel with the same data schema, or AutoList.	required

Returns:

Type	Description
`AutoList`	AutoList containing both objects as items.

Raises:

Type	Description
`TypeError`	If schemas don't match or invalid operand type.
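The `+` combination patterns can be mimicked with two small stand-in classes. `ToyModel` and `ToyList` are hypothetical and hold plain dictionaries; they only show how returning a list type from `__add__` makes chaining work, and they skip the schema check the real classes perform.

```python
from typing import List


class ToyList:
    """Stand-in for a list container."""

    def __init__(self, items: List[dict]):
        self.items = list(items)

    def __add__(self, other):
        if isinstance(other, ToyList):
            return ToyList(self.items + other.items)  # merge lists
        if isinstance(other, ToyModel):
            return ToyList(self.items + [other.data])  # append model
        return NotImplemented  # Python then raises TypeError


class ToyModel:
    """Stand-in for a single-object container."""

    def __init__(self, data: dict):
        self.data = data

    def __add__(self, other):
        if isinstance(other, ToyModel):
            return ToyList([self.data, other.data])  # list from both items
        if isinstance(other, ToyList):
            return ToyList([self.data] + other.items)  # prepend model
        return NotImplemented


unit1, unit2, unit3 = ToyModel({"name": "A"}), ToyModel({"name": "B"}), ToyModel({"name": "C"})
chained = unit1 + unit2 + unit3  # evaluated as (unit1 + unit2) + unit3
print([item["name"] for item in chained.items])
```

Because `unit1 + unit2` already yields a list container, the second `+` is handled by the list's own `__add__`, which is why both classes need the operator.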

AutoList¶

`hyperextract.types.AutoList` ¶

Bases: BaseAutoType[AutoListSchema[ItemSchema]], Generic[ItemSchema]

AutoList - extracts a collection of objects from text.

This pattern extracts multiple independent objects from a document, suitable for extracting entities, events, references, or any collection of structured items.

Key characteristics

Extraction target: A collection of structured objects
Merge strategy: Append with basic deduplication (extensible by subclasses)
Indexing strategy: Each item in the list is indexed independently

Comparison with AutoModel

AutoModel: Extracts a single structured object (e.g., summary, metadata)
AutoList: Extracts multiple independent objects (e.g., entity list, event list)

Attributes¶

`item_list_schema: Type[AutoListSchema[ItemSchema]] = create_model(container_name, items=(List[item_schema], Field(default_factory=list, description='Item list')))` `instance-attribute` ¶

`item_schema = item_schema` `instance-attribute` ¶

`fields_for_index = fields_for_index` `instance-attribute` ¶

`_item_label_extractor = item_label_extractor` `instance-attribute` ¶

`items: List[ItemSchema]` `property` ¶

Returns the internal list of extracted items.

`data: AutoListSchema` `property` ¶

Returns all stored knowledge (read-only access).

Returns:

Type	Description
`AutoListSchema`	The internal knowledge data as AutoListSchema.

Functions¶

`__init__(item_schema: Type[ItemSchema], llm_client: BaseChatModel, embedder: Embeddings, *, prompt: str = '', item_label_extractor: Callable[[ItemSchema], str] = None, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False, fields_for_index: List[str] | None = None)` ¶

Initialize AutoList with item schema and configuration.

Parameters:

Name	Type	Description	Default
`item_schema`	`Type[ItemSchema]`	Pydantic BaseModel subclass for individual list items.	required
`llm_client`	`BaseChatModel`	Language model client for extraction.	required
`embedder`	`Embeddings`	Embedding model for vector indexing.	required
`prompt`	`str`	Custom extraction prompt (defaults to list-oriented prompt).	`''`
`item_label_extractor`	`Callable[[ItemSchema], str]`	Optional function to extract label from item for visualization.	`None`
`chunk_size`	`int`	Maximum characters per chunk for long texts.	`2048`
`chunk_overlap`	`int`	Overlapping characters between adjacent chunks.	`256`
`max_workers`	`int`	Maximum concurrent extraction tasks.	`10`
`verbose`	`bool`	Whether to log progress information.	`False`
`fields_for_index`	`List[str] \| None`	Optional list of field names to include in vector index. If None, all fields are indexed.	`None`

`_default_prompt() -> str` ¶

Returns the default extraction prompt for list-based extraction.

`_create_empty_instance() -> AutoList[ItemSchema]` ¶

Creates a new empty instance with the same configuration.

Overrides base class method to handle AutoList's item_schema parameter.

Returns:

Type	Description
`AutoList[ItemSchema]`	A new AutoList instance with the same configuration.

`empty() -> bool` ¶

Checks if the list is empty.

Returns:

Type	Description
`bool`	True if no items are stored, False otherwise.

`_init_data_state() -> None` ¶

INIT/RESET: Initialize or reset with empty schema. Called during __init__ and when clear() is called.

`_set_data_state(data: AutoListSchema) -> None` ¶

SET: Full reset. Replace with new data (e.g., load from disk). Called by parse() or load() where data IS the new state.

`_update_data_state(incoming_data: AutoListSchema) -> None` ¶

UPDATE: Incremental merge. Append incoming items to current list (called by feed()).

For AutoList, incremental update means appending new items to existing list.

`_init_index_state() -> None` ¶

Initialize vector index to empty state.

`merge_batch_data(data_list: List[AutoListSchema]) -> AutoListSchema` ¶

Pure data merge method implementing list append strategy.

Merge strategy: Collects all items from all container objects and merges them into a single list. Used for aggregating extraction results from batch processing across multiple chunks. Subclasses can override this method to implement more sophisticated deduplication logic (e.g., AutoSet with custom key_extractor).

Parameters:

Name	Type	Description	Default
`data_list`	`List[AutoListSchema]`	List of container objects from batch processing to merge.	required

Returns:

Type	Description
`AutoListSchema`	A new merged AutoListSchema object with combined items from all containers.
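The append strategy amounts to concatenating the `items` of every container into a new one. A minimal sketch, using a dataclass `Container` as a hypothetical stand-in for AutoListSchema; it performs no deduplication.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Container:
    """Stand-in for a per-chunk container with an items list."""
    items: List[str] = field(default_factory=list)


def merge_containers(containers: List[Container]) -> Container:
    """Collect items from all containers into a new one; inputs are untouched."""
    merged = Container()
    for container in containers:
        merged.items.extend(container.items)
    return merged


per_chunk = [Container(["a", "b"]), Container(["b", "c"])]
combined = merge_containers(per_chunk)
print(combined.items)
```

Note that `"b"` appears twice in the result: plain appending keeps duplicates, which is the gap a key-based subclass is meant to close.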

`build_index() -> None` ¶

Builds independent vector index for each item in the list.

If fields_for_index is specified, only those fields are indexed. Otherwise, all fields are indexed.
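The effect of `fields_for_index` can be pictured as a filter on which fields contribute to an item's indexed text. The sketch below is an assumption about the general idea, not the library's document format; `item_text` is a hypothetical helper over plain dictionaries.

```python
from typing import Any, Dict, List, Optional


def item_text(item: Dict[str, Any], fields_for_index: Optional[List[str]] = None) -> str:
    """Build the text to embed for one item, optionally restricted to some fields."""
    names = fields_for_index if fields_for_index is not None else list(item)
    return "\n".join(
        f"{name}: {item[name]}" for name in names if item.get(name) is not None
    )


person = {"name": "Alice", "age": 30, "bio": "Works on compilers."}
print(item_text(person))                               # all fields
print(item_text(person, fields_for_index=["name", "bio"]))  # selected fields only
```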

`search(query: str, top_k: int = 3) -> List[ItemSchema]` ¶

Searches items in the list using semantic similarity.

Parameters:

Name	Type	Description	Default
`query`	`str`	Search query string.	required
`top_k`	`int`	Number of results to return.	`3`

Returns:

Type	Description
`List[ItemSchema]`	List of relevant items.

`dump_index(folder_path: str | Path) -> None` ¶

Saves FAISS vector index to disk.

`load_index(folder_path: str | Path) -> None` ¶

Loads FAISS vector index from disk.

`show(item_label_extractor: Callable[[ItemSchema], str] = None, *, top_k_for_search: int = 3, top_k_for_chat: int = 3) -> None` ¶

Visualize the list using OntoSight.

Parameters:

Name	Type	Description	Default
`item_label_extractor`	`Callable[[ItemSchema], str]`	Optional function to extract label from item for visualization. If not provided, uses the one from __init__.	`None`
`top_k_for_search`	`int`	Number of items to retrieve for search callback (default: 3).	`3`
`top_k_for_chat`	`int`	Number of items to retrieve for chat callback (default: 3).	`3`

`__len__() -> int` ¶

Returns the number of elements in the list.

`__getitem__(key: int | slice) -> ItemSchema | AutoList[ItemSchema]` ¶

Support index access and slicing.

Parameters:

Name	Type	Description	Default
`key`	`int \| slice`	Integer index or slice object.	required

Returns:

Type	Description
`ItemSchema \| AutoList[ItemSchema]`	For integer index: Returns the Item at that position
`ItemSchema \| AutoList[ItemSchema]`	For slice: Returns a new AutoList instance with sliced items

Raises:

Type	Description
`IndexError`	If index is out of range.
`TypeError`	If key is neither int nor slice.

Examples:

>>> knowledge[0]           # First item
>>> knowledge[-1]          # Last item
>>> knowledge[1:3]         # New AutoList with items [1:3]
>>> knowledge[:5]          # First 5 items as new instance
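The int-versus-slice dispatch described above follows a common Python pattern, sketched here with a hypothetical `SliceableList` class that wraps a plain list.

```python
from typing import Any, List


class SliceableList:
    """Stand-in container: integer keys return an item, slices return a new container."""

    def __init__(self, items: List[Any]):
        self._items = list(items)

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._items[key]  # raises IndexError when out of range
        if isinstance(key, slice):
            return SliceableList(self._items[key])  # new instance, same type
        raise TypeError(f"indices must be int or slice, not {type(key).__name__}")


knowledge = SliceableList(["a", "b", "c", "d"])
print(knowledge[0], knowledge[-1], len(knowledge[1:3]))
```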

`__setitem__(index: int, item: ItemSchema) -> None` ¶

Support index assignment.

Parameters:

Name	Type	Description	Default
`index`	`int`	Position to set (supports negative indexing).	required
`item`	`ItemSchema`	The item to set at that position.	required

Raises:

Type	Description
`TypeError`	If item schema doesn't match.
`IndexError`	If index is out of range.

Examples:

>>> knowledge[0] = new_item
>>> knowledge[-1] = updated_item

Side Effects

Clears the vector index (needs rebuild)
Updates metadata timestamp
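The two side effects listed above (index cleared, timestamp updated) can be sketched with a hypothetical `IndexedList` class. The index here is a plain dictionary standing in for a vector index, and the attribute names are assumptions.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


class IndexedList:
    """Stand-in container whose mutations invalidate a derived index."""

    def __init__(self, items: List[Any]):
        self._items = list(items)
        self._index: Optional[Dict[int, str]] = None  # None means "not built"
        self.updated_at: Optional[datetime] = None

    def build_index(self) -> None:
        # Placeholder for embedding each item into a vector store.
        self._index = {pos: repr(item) for pos, item in enumerate(self._items)}

    def __setitem__(self, index: int, item: Any) -> None:
        self._items[index] = item  # raises IndexError when out of range
        self._index = None  # stale after the change, so it must be rebuilt
        self.updated_at = datetime.now(timezone.utc)


knowledge = IndexedList(["a", "b"])
knowledge.build_index()
knowledge[0] = "z"
print(knowledge._index is None)  # the caller has to call build_index() again
```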

`__add__(other: BaseAutoType[ItemSchema]) -> AutoList[ItemSchema]` ¶

Operator overload for '+' to combine knowledge instances.

Supports multiple combination patterns:
- AutoList + AutoList → AutoList (merge lists)
- AutoList + AutoModel → AutoList (append model to list)

This enables chain operations like: model1 + model2 + model3

Parameters:

Name	Type	Description	Default
`other`	`BaseAutoType[ItemSchema]`	Another AutoList or AutoModel with compatible schema.	required

Returns:

Type	Description
`AutoList[ItemSchema]`	New AutoList with combined items.

Raises:

Type	Description
`TypeError`	If schemas don't match or invalid operand type.

`__delitem__(index: int) -> None` ¶

Support del operation for removing items by index.

Parameters:

Name	Type	Description	Default
`index`	`int`	Position to delete (supports negative indexing).	required

Raises:

Type	Description
`IndexError`	If index is out of range.

Examples:

>>> del knowledge[0]      # Delete first item
>>> del knowledge[-1]     # Delete last item

Side Effects

Clears the vector index (needs rebuild)
Updates metadata timestamp

`__iter__() -> Iterator[ItemSchema]` ¶

Support iteration over items.

Returns:

Type	Description
`Iterator[ItemSchema]`	Iterator over items in the list.

Examples:

>>> for item in knowledge:
...     print(item.name)

>>> items_list = list(knowledge)
>>> names = [item.name for item in knowledge]

`__contains__(item: ItemSchema) -> bool` ¶

Support 'in' operator for membership testing.

Parameters:

Name	Type	Description	Default
`item`	`ItemSchema`	The item to check for membership.	required

Returns:

Type	Description
`bool`	True if item exists in the list, False otherwise.

Comparison Logic

Check if item's model_fields match any item's schema
If schemas match, compare model_dump() equality

Examples:

>>> if person in knowledge:
...     print("Person already exists")

`__repr__() -> str` ¶

Return detailed string representation.

Returns:

Type	Description
`str`	String in the format ClassName[ItemSchema](N items).

Examples:

>>> repr(knowledge)
'AutoList[PersonSchema](5 items)'

`__str__() -> str` ¶

Return human-readable string representation.

Returns:

Type	Description
`str`	Brief description of the knowledge instance.

Examples:

>>> str(knowledge)
'AutoList with 5 PersonSchema items'

`append(item: ItemSchema) -> None` ¶

Append a single item to the end of the list.

Parameters:

Name	Type	Description	Default
`item`	`ItemSchema`	The item to append.	required

Raises:

Type	Description
`TypeError`	If item schema doesn't match item_schema.

Examples:

>>> knowledge.append(PersonSchema(name="Alice", age=30))

Side Effects

Clears the vector index (needs rebuild)
Updates metadata timestamp

`extend(items: Iterable[ItemSchema] | AutoList[ItemSchema]) -> None` ¶

Extend the list by appending multiple items.

Parameters:

Name	Type	Description	Default
`items`	`Iterable[ItemSchema] \| AutoList[ItemSchema]`	Items to append: a list of items, another AutoList instance, or any iterable yielding items.	required

Raises:

Type	Description
`TypeError`	If any item's schema doesn't match item_schema.

Examples:

>>> knowledge.extend([person1, person2, person3])
>>> knowledge.extend(other_knowledge)

Side Effects

Clears the vector index (needs rebuild)
Updates metadata timestamp

`insert(index: int, item: ItemSchema) -> None` ¶

Insert an item at a specific position.

Parameters:

Name	Type	Description	Default
`index`	`int`	Position to insert at (supports negative indexing).	required
`item`	`ItemSchema`	The item to insert.	required

Raises:

Type	Description
`TypeError`	If item schema doesn't match item_schema.

Examples:

>>> knowledge.insert(0, new_person)    # Insert at beginning
>>> knowledge.insert(-1, new_person)   # Insert before last

Side Effects

Clears the vector index (needs rebuild)
Updates metadata timestamp

`remove(item: ItemSchema) -> None` ¶

Remove the first occurrence of an item from the list.

Parameters:

Name	Type	Description	Default
`item`	`ItemSchema`	The item to remove.	required

Raises:

Type	Description
`ValueError`	If item is not found in the list.

Comparison Logic

Uses _items_equal() to find matching item.

Examples:

>>> knowledge.remove(person)

Side Effects

Clears the vector index (needs rebuild)
Updates metadata timestamp

`pop(index: int = -1) -> ItemSchema` ¶

Remove and return an item at the given position.

Parameters:

Name	Type	Description	Default
`index`	`int`	Position to pop (default: -1, last item).	`-1`

Returns:

Type	Description
`ItemSchema`	The removed item.

Raises:

Type	Description
`IndexError`	If list is empty or index is out of range.

Examples:

>>> last_item = knowledge.pop()
>>> first_item = knowledge.pop(0)

Side Effects

Clears the vector index (needs rebuild)
Updates metadata timestamp

`index(item: ItemSchema, start: int = 0, stop: int | None = None) -> int` ¶

Return the index of the first occurrence of item.

Parameters:

Name	Type	Description	Default
`item`	`ItemSchema`	The item to find.	required
`start`	`int`	Start searching from this position (default: 0).	`0`
`stop`	`int \| None`	Stop searching at this position (default: end of list).	`None`

Returns:

Type	Description
`int`	The index of the first matching item.

Raises:

Type	Description
`ValueError`	If item is not found in the specified range.

Comparison Logic

Uses _items_equal() to match items.

Examples:

>>> idx = knowledge.index(person)
>>> idx = knowledge.index(person, 5, 10)  # Search in range [5:10]

`count(item: ItemSchema) -> int` ¶

Return the number of times item appears in the list.

Parameters:

Name	Type	Description	Default
`item`	`ItemSchema`	The item to count.	required

Returns:

Type	Description
`int`	Number of occurrences.

Comparison Logic

Uses _items_equal() to match items.

Examples:

>>> count = knowledge.count(person)

`copy() -> AutoList[ItemSchema]` ¶

Create a deep copy of this AutoList instance.

Returns:

Type	Description
`AutoList[ItemSchema]`	A new AutoList instance with copied items and metadata.

Note

The vector index is not copied; it needs to be rebuilt if needed.

Examples:

>>> backup = knowledge.copy()
>>> backup.append(new_item)  # Original unchanged

`reverse() -> None` ¶

Reverse the items in place.

Examples:

>>> knowledge.reverse()

Side Effects

Rebuilds the vector index if it exists (to maintain consistency)
Updates metadata timestamp

Note

Does not call clear_index() since elements aren't modified, but rebuilds index to maintain metadata order consistency.

`sort(key: Callable[[ItemSchema], Any] | None = None, reverse: bool = False) -> None` ¶

Sort the items in place.

Parameters:

Name	Type	Description	Default
`key`	`Callable[[ItemSchema], Any] \| None`	Function to extract a comparison key from each item. In practice it must be provided, since items may not be directly comparable.	`None`
`reverse`	`bool`	If True, sort in descending order (default: False).	`False`

Raises:

Type	Description
`TypeError`	If key is not provided and items aren't comparable.

Examples:

>>> knowledge.sort(key=lambda x: x.name)
>>> knowledge.sort(key=lambda x: x.age, reverse=True)

Side Effects

Rebuilds the vector index if it exists (to maintain consistency)
Updates metadata timestamp

Note

Does not call clear_index() since elements aren't modified, but rebuilds index to maintain metadata order consistency.

`_validate_item_schema(item: Any) -> None` ¶

Validate that item's schema matches item_schema.

Parameters:

Name	Type	Description	Default
`item`	`Any`	The item to validate.	required

Raises:

Type	Description
`TypeError`	If schemas don't match, with detailed field difference.

`_items_equal(item1: BaseModel, item2: BaseModel) -> bool` ¶

Check if two items are equal.

Parameters:

Name	Type	Description	Default
`item1`	`BaseModel`	First item to compare.	required
`item2`	`BaseModel`	Second item to compare.	required

Returns:

Type	Description
`bool`	True if items are equal, False otherwise.

Comparison Logic

Check if both have the same model_fields (schema)
If schemas match, compare model_dump() equality
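The two-step comparison can be sketched with standard-library dataclasses standing in for Pydantic models: `fields()` plays the role of `model_fields` and `asdict()` the role of `model_dump()`. The function `items_equal` and the sample classes are hypothetical.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class Person:
    name: str
    age: int


@dataclass
class City:
    name: str


def items_equal(item1, item2) -> bool:
    """Equal only if both items have the same field names and the same values."""
    names1 = [f.name for f in fields(item1)]
    names2 = [f.name for f in fields(item2)]
    if names1 != names2:
        return False  # different schemas can never match
    return asdict(item1) == asdict(item2)


print(items_equal(Person("Alice", 30), Person("Alice", 30)))
print(items_equal(Person("Alice", 30), City("Alice")))
```

Comparing dumped values instead of object identity means two separately extracted items with identical content count as the same item for `remove()`, `index()`, and `count()`.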

AutoSet¶

`hyperextract.types.AutoSet` ¶

Bases: BaseAutoType[AutoSetSchema[ItemSchema]], Generic[ItemSchema]

AutoSet - extracts a unique collection of objects.

This pattern automatically deduplicates items based on a user-specified key extractor function. Provides flexible merge strategies including LLM-powered intelligent merging for handling duplicates.

Key characteristics

Extraction target: A unique collection of structured objects
Deduplication: Based on key_extractor function (user-specified)
Merge strategy: Configurable via MergeStrategy enum:
- KEEP_EXISTING: Preserve first (original) data, ignore updates
- KEEP_INCOMING: Always use latest data, overwrite existing
- MERGE_FIELD: Non-null fields overwrite, lists append
- LLM.BALANCED: LLM intelligently synthesizes both versions (default)
- LLM.PREFER_EXISTING: LLM synthesis but prioritizes original data
- LLM.PREFER_INCOMING: LLM synthesis but prioritizes new data
- LLM.CUSTOM_RULE: User-defined rules with dynamic context
Internal storage: Dict for O(1) lookup and deduplication
External interface: List (via items property)
Set operations: union (|), intersection (&), difference (-)
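The combination of dict-based storage, a key extractor, and a per-key merge rule can be sketched as below. `KeyedSet` is a hypothetical stand-in that implements only the two non-LLM "keep" rules, selected by plain strings instead of the MergeStrategy enum.

```python
from typing import Any, Callable, Dict, List


class KeyedSet:
    """Stand-in deduplicating container: a dict inside, a list outside."""

    STRATEGIES = ("keep_existing", "keep_incoming")

    def __init__(self, key_extractor: Callable[[dict], Any], strategy: str = "keep_existing"):
        if strategy not in self.STRATEGIES:
            raise ValueError(f"unknown strategy: {strategy}")
        self.key_extractor = key_extractor
        self.strategy = strategy
        self._by_key: Dict[Any, dict] = {}  # O(1) lookup by unique key

    def add(self, item: dict) -> None:
        key = self.key_extractor(item)
        # New keys are always stored; duplicates only replace under keep_incoming.
        if key not in self._by_key or self.strategy == "keep_incoming":
            self._by_key[key] = item

    @property
    def items(self) -> List[dict]:
        return list(self._by_key.values())

    def __len__(self) -> int:
        return len(self._by_key)


keywords = KeyedSet(key_extractor=lambda item: item["term"])
keywords.add({"term": "Python", "category": "language"})
keywords.add({"term": "Python", "category": None})
print(len(keywords), keywords.items)
```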

Comparison with AutoList

AutoList: Allows duplicates, simple append merge
AutoSet: Automatic deduplication, intelligent merge strategies

Example

>>> class KeywordSchema(BaseModel):
...     term: str
...     category: str | None = None
...     frequency: int | None = None

>>> keywords = AutoSet(
...     item_schema=KeywordSchema,
...     llm_client=llm,
...     embedder=embedder,
...     key_extractor=lambda x: x.term,
...     strategy_or_merger=MergeStrategy.MERGE_FIELD,
... )
>>> keywords.parse("Python is great. Python is powerful.")
>>> len(keywords)  # Only 1 item (deduplicated)
1

Attributes¶

`item_set_schema: Type[AutoSetSchema[ItemSchema]] = create_model(container_name, items=(List[item_schema], Field(default_factory=list, description='Set of unique items')))` `instance-attribute` ¶

`item_schema = item_schema` `instance-attribute` ¶

`fields_for_index = fields_for_index` `instance-attribute` ¶

`_constructor_kwargs = kwargs` `instance-attribute` ¶

`key_extractor = key_extractor` `instance-attribute` ¶

`strategy_or_merger = strategy_or_merger` `instance-attribute` ¶

`_merger = strategy_or_merger` `instance-attribute` ¶

`_data_memory: OMem[ItemSchema] = OMem(memory_schema=item_schema, key_extractor=key_extractor, llm_client=llm_client, embedder=embedder, strategy_or_merger=(self._merger), verbose=verbose, fields_for_index=fields_for_index)` `instance-attribute` ¶

`_item_label_extractor = item_label_extractor` `instance-attribute` ¶

`data: AutoSetSchema[ItemSchema]` `property` ¶

Returns all stored knowledge (read-only access).

Returns:

Type	Description
`AutoSetSchema[ItemSchema]`	The internal knowledge data as a Pydantic model instance.

`items: List[ItemSchema]` `property` ¶

Returns the internal items as a list (for external interface compatibility).

Returns:

Type	Description
`List[ItemSchema]`	List of unique items.

`keys: List[Any]` `property` ¶

Returns all unique key values.

Returns:

Type	Description
`List[Any]`	List of unique key values.

Functions¶

`__init__(item_schema: Type[ItemSchema], llm_client: BaseChatModel, embedder: Embeddings, key_extractor: Callable[[ItemSchema], Any], *, strategy_or_merger: MergeStrategy | BaseMerger = MergeStrategy.LLM.BALANCED, prompt: str = '', item_label_extractor: Callable[[ItemSchema], str] = None, chunk_size: int = 2048, chunk_overlap: int = 256, max_workers: int = 10, verbose: bool = False, fields_for_index: List[str] | None = None, **kwargs: Any)` ¶

Initialize AutoSet with key extractor and merge strategy.

Parameters:

Name	Type	Description	Default
`item_schema`	`Type[ItemSchema]`	Pydantic BaseModel subclass for individual items.	required
`llm_client`	`BaseChatModel`	Language model client for extraction and merging.	required
`embedder`	`Embeddings`	Embedding model for vector indexing.	required
`key_extractor`	`Callable[[ItemSchema], Any]`	Function to extract unique key from an item (required).	required
`strategy_or_merger`	`MergeStrategy \| BaseMerger`	Merge strategy or pre-configured merger instance: either a MergeStrategy enum value (e.g., MergeStrategy.LLM.BALANCED) or a pre-configured BaseMerger instance (for full control).	`BALANCED`
`prompt`	`str`	Custom extraction prompt.	`''`
`item_label_extractor`	`Callable[[ItemSchema], str]`	Optional function to extract label from item for visualization.	`None`
`chunk_size`	`int`	Maximum characters per chunk for long texts.	`2048`
`chunk_overlap`	`int`	Overlapping characters between adjacent chunks.	`256`
`max_workers`	`int`	Maximum concurrent extraction tasks.	`10`
`verbose`	`bool`	Whether to display detailed execution logs and progress information.	`False`
`fields_for_index`	`List[str] \| None`	Optional list of field names in item_schema to include in vector index. If None, all text fields are indexed by default. Useful for optimizing search on complex schemas. Example: ['name', 'summary'] (only index these fields)	`None`
`**kwargs`	`Any`	Additional arguments passed to create_merger() when strategy_or_merger is a MergeStrategy enum. Ignored if strategy_or_merger is a BaseMerger instance.	`{}`

`_create_empty_instance() -> AutoSet[ItemSchema]` ¶

Creates a new empty instance with the same configuration.

Overrides parent method to include AutoSet-specific parameters.

Returns:

Type	Description
`AutoSet[ItemSchema]`	New AutoSet instance with identical configuration.

`_default_prompt() -> str` ¶

Returns the default extraction prompt for set-based extraction.

`empty() -> bool` ¶

Checks if the set is empty.

Returns:

Type	Description
`bool`	True if no items are stored, False otherwise.

`_init_data_state() -> None` ¶

INIT/RESET: Initialize or reset OMem as empty. Called during __init__ and when clear() is called.

`_init_index_state() -> None` ¶

Initialize vector index to empty state.

`_set_data_state(data: AutoSetSchema[ItemSchema]) -> None` ¶

SET: Full Reset. Wipe OMem and refill from data (e.g., load from disk). Called by parse() or load() where data IS the new state.

`_update_data_state(incoming_data: AutoSetSchema[ItemSchema]) -> None` ¶

UPDATE: Incremental merge. Add to OMem efficiently (called by feed()).

Unlike the default behavior which uses merge_batch for full re-merge, AutoSet optimizes this by directly adding items to OMem, which handles deduplication and merging internally.

`merge_batch_data(data_list: List[AutoSetSchema[ItemSchema]]) -> AutoSetSchema[ItemSchema]` ¶

Merges multiple data containers with automatic deduplication.

Pure function: Does not modify internal state. Delegates to OMem's merge strategy for efficient deduplication and merging. All merge strategies are handled by the Merger implementation in OMem.

Parameters:

Name	Type	Description	Default
`data_list`	`List[AutoSetSchema[ItemSchema]]`	List of container objects from batch processing to merge.	required

Returns:

Type	Description
`AutoSetSchema[ItemSchema]`	New merged AutoSetSchema with deduplicated items and resolved conflicts.

`build_index(force: bool = False) -> None` ¶

Build/rebuild independent vector index for each item in the set.

Parameters:

Name	Type	Description	Default
`force`	`bool`	If True, forces rebuilding the index even if it already exists.	`False`

`search(query: str, top_k: int = 3) -> List[ItemSchema]` ¶

Searches items in the set using semantic similarity.

Parameters:

Name	Type	Description	Default
`query`	`str`	Search query string.	required
`top_k`	`int`	Number of results to return.	`3`

Returns:

Type	Description
`List[ItemSchema]`	List of relevant items.

`dump_index(folder_path: str | Path) -> None` ¶

Saves FAISS vector index to disk.

`load_index(folder_path: str | Path) -> None` ¶

Loads FAISS vector index from disk.

`show(item_label_extractor: Callable[[ItemSchema], str] = None, *, top_k_for_search: int = 3, top_k_for_chat: int = 3) -> None` ¶

Visualize the set using OntoSight.

Parameters:

Name	Type	Description	Default
`item_label_extractor`	`Callable[[ItemSchema], str]`	Optional function to extract label from item for visualization. If not provided, uses the one from __init__.	`None`
`top_k_for_search`	`int`	Number of items to retrieve for search callback (default: 3).	`3`
`top_k_for_chat`	`int`	Number of items to retrieve for chat callback (default: 3).	`3`

`__len__() -> int` ¶

Returns the number of unique items in the set.

`__contains__(key: Any) -> bool` ¶

Checks if a unique key exists in the set.

Parameters:

Name	Type	Description	Default
`key`	`Any`	The unique key value to check.	required

Returns:

Type	Description
`bool`	True if key exists, False otherwise.

`__repr__() -> str` ¶

Returns a developer-friendly representation.

`__str__() -> str` ¶

Returns a user-friendly string representation.

`__iter__() -> Iterator[ItemSchema]` ¶

Enables iteration over all items in the set.

Yields:

Type	Description
`ItemSchema`	Iterator over all unique items.

Examples:

>>> for skill in skills:
...     print(skill.name)
>>> names = [s.name for s in skills]

`add(item: ItemSchema) -> None` ¶

Adds a single item to the set with automatic deduplication.

Parameters:

Name	Type	Description	Default
`item`	`ItemSchema`	The item to add.	required

`remove(key: Any) -> Optional[ItemSchema]` ¶

Removes an item by its unique key value.

Parameters:

Name	Type	Description	Default
`key`	`Any`	The unique key value to remove.	required

Returns:

Type	Description
`Optional[ItemSchema]`	The removed item, or None if not found.

`contains(key: Any) -> bool` ¶

Checks if an item with the given key exists in the set.

Parameters:

Name	Type	Description	Default
`key`	`Any`	The unique key value to check.	required

Returns:

Type	Description
`bool`	True if key exists, False otherwise.

`get(key: Any, default: Optional[ItemSchema] = None) -> Optional[ItemSchema]` ¶

Gets an item by its unique key value.

Parameters:

Name	Type	Description	Default
`key`	`Any`	The unique key value to retrieve.	required
`default`	`Optional[ItemSchema]`	Default value if key not found.	`None`

Returns:

Type	Description
`Optional[ItemSchema]`	The item if found, otherwise default.

`update(items: List[ItemSchema]) -> None` ¶

Batch adds multiple items.

Parameters:

Name	Type	Description	Default
`items`	`List[ItemSchema]`	List of items to add.	required
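The key-based methods above (add, remove, contains, get, update) can be pictured as operations on a mapping from each item's unique key to the item. The following is a minimal sketch of that idea, not the AutoSet implementation: `KeyedStore`, `Skill`, and the "newest item wins" rule are assumptions made for illustration. In AutoSet itself, what happens when two items share a key depends on the configured merge strategy.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Optional


@dataclass
class Skill:
    """Hypothetical item schema used only in this sketch."""
    name: str
    level: str = ""


class KeyedStore:
    """Hypothetical stand-in for a key-deduplicated container (not AutoSet)."""

    def __init__(self, key_extractor: Callable[[Any], Any]) -> None:
        self._key_extractor = key_extractor
        self._items: Dict[Any, Any] = {}

    def add(self, item: Any) -> None:
        # Items that share a key collapse into one entry. Assumption: the
        # newer item simply replaces the older one.
        self._items[self._key_extractor(item)] = item

    def update(self, items: Iterable[Any]) -> None:
        # Batch add: each item goes through the same deduplication path.
        for item in items:
            self.add(item)

    def contains(self, key: Any) -> bool:
        return key in self._items

    def get(self, key: Any, default: Optional[Any] = None) -> Optional[Any]:
        return self._items.get(key, default)

    def remove(self, key: Any) -> Optional[Any]:
        # Returns the removed item, or None when the key is absent.
        return self._items.pop(key, None)

    def __len__(self) -> int:
        return len(self._items)


store = KeyedStore(key_extractor=lambda skill: skill.name)
store.update([Skill("Python", "basic"), Skill("Rust"), Skill("Python", "expert")])
```

Three items go in, but only two keys exist, so the store ends up with two entries.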

`discard(key: Any) -> None` ¶

Removes an item by its unique key value, silently ignoring if not found.

No error is raised if the key does not exist. Unlike remove(), this method returns nothing.

Parameters:

Name	Type	Description	Default
`key`	`Any`	The unique key value to remove.	required

Examples:

>>> skills.discard("Python")  # No error if not found

Side Effects

Clears the vector index (needs rebuild)
Updates metadata timestamp
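The side effects listed above follow a common pattern: a search index is derived from the items, so any mutation invalidates it until it is rebuilt. A minimal sketch of that pattern follows. `IndexedStore`, `index_ready`, and the sorted-key "index" are hypothetical stand-ins, and invalidating on every call (even when the key is absent) is an assumption rather than documented behaviour.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


class IndexedStore:
    """Hypothetical container whose search index is derived from its items."""

    def __init__(self, items: Dict[Any, Any]) -> None:
        self._items = dict(items)
        self._index: Optional[List[Any]] = None  # stand-in for a vector index
        self.updated_at = datetime.now(timezone.utc)

    def build_index(self) -> None:
        # Placeholder for embedding the items and building a real index.
        self._index = sorted(self._items)

    @property
    def index_ready(self) -> bool:
        return self._index is not None

    def discard(self, key: Any) -> None:
        # Missing keys are ignored silently.
        self._items.pop(key, None)
        # Assumption: the index is dropped and the timestamp refreshed
        # on every call, so the next search needs a rebuild first.
        self._index = None
        self.updated_at = datetime.now(timezone.utc)


store = IndexedStore({"Python": 1, "Rust": 2})
store.build_index()
store.discard("Python")
if not store.index_ready:
    store.build_index()  # rebuild before searching again
```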

`pop() -> ItemSchema` ¶

Removes and returns an arbitrary item from the set.

Returns:

Type	Description
`ItemSchema`	The removed item.

Raises:

Type	Description
`KeyError`	If the set is empty.

Examples:

>>> skill = skills.pop()
>>> print(f"Removed: {skill.name}")

Side Effects

Clears the vector index (needs rebuild)
Updates metadata timestamp

`copy() -> AutoSet[ItemSchema]` ¶

Creates a deep copy of the set.

Returns:

Type	Description
`AutoSet[ItemSchema]`	A new AutoSet instance with copies of all items.

Examples:

>>> backup = skills.copy()
>>> backup.add(new_skill)
>>> # Original skills unchanged
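What "deep copy" buys you is independence at the item level as well as the container level. The sketch below illustrates this with the standard library's `copy.deepcopy` on a plain dict standing in for the set; `Skill` and its `tags` field are hypothetical.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Skill:
    """Hypothetical item schema used only in this sketch."""
    name: str
    tags: List[str] = field(default_factory=list)


original: Dict[str, Skill] = {"Python": Skill("Python", ["language"])}
backup = copy.deepcopy(original)

# Changes to the copy, including changes inside a copied item,
# do not reach the original.
backup["Rust"] = Skill("Rust")
backup["Python"].tags.append("scripting")
```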

`__or__(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]` ¶

Union operation: set1 | set2.

Returns a new set containing all items from both sets.

Parameters:

Name	Type	Description	Default
`other`	`AutoSet[ItemSchema]`	Another AutoSet instance.	required

Returns:

Type	Description
`AutoSet[ItemSchema]`	New AutoSet with union of items.

`__and__(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]` ¶

Intersection operation: set1 & set2.

Returns a new set containing only items present in both sets.

Parameters:

Name	Type	Description	Default
`other`	`AutoSet[ItemSchema]`	Another AutoSet instance.	required

Returns:

Type	Description
`AutoSet[ItemSchema]`	New AutoSet with intersection of items.

`__sub__(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]` ¶

Difference operation: set1 - set2.

Returns a new set containing items in self but not in other.

Parameters:

Name	Type	Description	Default
`other`	`AutoSet[ItemSchema]`	Another AutoSet instance.	required

Returns:

Type	Description
`AutoSet[ItemSchema]`	New AutoSet with difference of items.

`__xor__(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]` ¶

Symmetric difference operation: set1 ^ set2.

Returns a new set containing items in either set but not in both.

Parameters:

Name	Type	Description	Default
`other`	`AutoSet[ItemSchema]`	Another AutoSet instance.	required

Returns:

Type	Description
`AutoSet[ItemSchema]`	New AutoSet with symmetric difference of items.
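The four operators above act on unique keys. A minimal sketch with plain dicts (key to item) shows the resulting key sets. The helper names are hypothetical, and keeping the left-hand item when a key occurs on both sides is an assumption; the source does not say which item a union or intersection retains.

```python
from typing import Any, Dict

Items = Dict[Any, Any]


def union(a: Items, b: Items) -> Items:
    merged = dict(b)
    merged.update(a)  # assumption: on a shared key, keep the item from `a`
    return merged


def intersection(a: Items, b: Items) -> Items:
    return {key: item for key, item in a.items() if key in b}


def difference(a: Items, b: Items) -> Items:
    return {key: item for key, item in a.items() if key not in b}


def symmetric_difference(a: Items, b: Items) -> Items:
    result = difference(a, b)
    result.update(difference(b, a))
    return result


left = {"Python": "left-python", "Rust": "left-rust"}
right = {"Rust": "right-rust", "Go": "right-go"}

print(sorted(union(left, right)))                 # ['Go', 'Python', 'Rust']
print(sorted(intersection(left, right)))          # ['Rust']
print(sorted(difference(left, right)))            # ['Python']
print(sorted(symmetric_difference(left, right)))  # ['Go', 'Python']
```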

`union(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]` ¶

Named-method form of the union operation (set1 | set2).

`intersection(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]` ¶

Named-method form of the intersection operation (set1 & set2).

`difference(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]` ¶

Named-method form of the difference operation (set1 - set2).

`symmetric_difference(other: AutoSet[ItemSchema]) -> AutoSet[ItemSchema]` ¶

Named-method form of the symmetric difference operation (set1 ^ set2).

`__eq__(other: Any) -> bool` ¶

Equality comparison: set1 == set2.

Two sets are equal if they share the same schema and the same set of keys. Item contents are not compared.

Parameters:

Name	Type	Description	Default
`other`	`Any`	Another object to compare with.	required

Returns:

Type	Description
`bool`	True if both sets have the same keys, False otherwise.

Examples:

>>> skills1 == skills2  # True if same keys
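Because only keys are compared, two sets whose items differ in content can still compare equal. The sketch below shows the consequence with a hypothetical `KeyComparedSet` class; it omits the schema check described above and is not the AutoSet implementation.

```python
from typing import Any, Dict


class KeyComparedSet:
    """Hypothetical container whose equality looks at keys only."""

    def __init__(self, items: Dict[Any, Any]) -> None:
        self._items = dict(items)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, KeyComparedSet):
            return NotImplemented
        # Compare the key sets; the stored items are deliberately ignored.
        return self._items.keys() == other._items.keys()


a = KeyComparedSet({"Python": {"level": "basic"}})
b = KeyComparedSet({"Python": {"level": "expert"}})
c = KeyComparedSet({"Rust": {"level": "basic"}})

print(a == b)  # True: same keys, different item contents
print(a == c)  # False: different keys
```

If content-level equality matters, compare the items themselves in addition to using `==`.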

`__ne__(other: Any) -> bool` ¶

Inequality comparison: set1 != set2.

Returns:

Type	Description
`bool`	True if sets are not equal, False otherwise.

`__le__(other: AutoSet[ItemSchema]) -> bool` ¶

Subset comparison: set1 <= set2.

Parameters:

Name	Type	Description	Default
`other`	`AutoSet[ItemSchema]`	Another AutoSet instance.	required

Returns:

Type	Description
`bool`	True if self is a subset of other (all keys in self are in other).

Raises:

Type	Description
`TypeError`	If other is not an AutoSet or has a different schema.

Examples:

>>> skills1 <= skills2  # True if skills1 is subset of skills2

`__lt__(other: AutoSet[ItemSchema]) -> bool` ¶

Proper subset comparison: set1 < set2.

Parameters:

Name	Type	Description	Default
`other`	`AutoSet[ItemSchema]`	Another AutoSet instance.	required

Returns:

Type	Description
`bool`	True if self is a proper subset of other (subset and not equal).

Examples:

>>> skills1 < skills2  # True if skills1 is proper subset

`__ge__(other: AutoSet[ItemSchema]) -> bool` ¶

Superset comparison: set1 >= set2.

Parameters:

Name	Type	Description	Default
`other`	`AutoSet[ItemSchema]`	Another AutoSet instance.	required

Returns:

Type	Description
`bool`	True if self is a superset of other (all keys in other are in self).

Examples:

>>> skills1 >= skills2  # True if skills1 is superset of skills2

`__gt__(other: AutoSet[ItemSchema]) -> bool` ¶

Proper superset comparison: set1 > set2.

Parameters:

Name	Type	Description	Default
`other`	`AutoSet[ItemSchema]`	Another AutoSet instance.	required

Returns:

Type	Description
`bool`	True if self is a proper superset of other (superset and not equal).

Examples:

>>> skills1 > skills2  # True if skills1 is proper superset

`issubset(other: AutoSet[ItemSchema]) -> bool` ¶

Test whether every key in the set is in other.

Parameters:

Name	Type	Description	Default
`other`	`AutoSet[ItemSchema]`	Another AutoSet instance.	required

Returns:

Type	Description
`bool`	True if self is a subset of other.

Examples:

>>> skills1.issubset(skills2)

`issuperset(other: AutoSet[ItemSchema]) -> bool` ¶

Test whether every key in other is in the set.

Parameters:

Name	Type	Description	Default
`other`	`AutoSet[ItemSchema]`	Another AutoSet instance.	required

Returns:

Type	Description
`bool`	True if self is a superset of other.

Examples:

>>> skills1.issuperset(skills2)

`isdisjoint(other: AutoSet[ItemSchema]) -> bool` ¶

Test whether the set has no keys in common with other.

Parameters:

Name	Type	Description	Default
`other`	`AutoSet[ItemSchema]`	Another AutoSet instance.	required

Returns:

Type	Description
`bool`	True if the two sets have no keys in common.

Raises:

Type	Description
`TypeError`	If other is not an AutoSet or has a different schema.

Examples:

>>> skills1.isdisjoint(skills2)  # True if no common skills
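The comparison operators and the issubset, issuperset, and isdisjoint methods are all described in terms of keys, which matches the behaviour of Python's built-in `set`. As a rough illustration, the sketch below applies the same tests to plain key sets; the variable names and key values are made up for the example.

```python
from typing import Set

# Hypothetical key sets, standing in for the keys of three AutoSet instances.
skills1: Set[str] = {"Python"}
skills2: Set[str] = {"Python", "Rust"}
skills3: Set[str] = {"Go"}

subset = skills1 <= skills2            # every key of skills1 is in skills2
proper_subset = skills1 < skills2      # subset, and the two are not equal
superset = skills2 >= skills1          # every key of skills1 is in skills2
proper_superset = skills2 > skills1    # superset, and the two are not equal
disjoint = skills1.isdisjoint(skills3) # no key in common

# A set is a subset of itself, but never a proper subset of itself.
reflexive = (skills1 <= skills1, skills1 < skills1)
```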
