Skip to content

Frequently Asked Questions

Common questions about Hyper-Extract.


General

What is Hyper-Extract?

Hyper-Extract is an LLM-powered knowledge extraction framework that transforms unstructured text into structured knowledge graphs, lists, models, and more.

What can I use it for?

  • Research paper analysis
  • Knowledge base construction
  • Document processing
  • Information extraction
  • Question-answering systems

Is it free?

The software is open-source (Apache-2.0). You need to provide your own OpenAI API key for LLM calls.


Installation

What are the requirements?

  • Python 3.11+
  • OpenAI API key

How do I install it?

pip install hyperextract

Installation fails with "No module named 'hyperextract'"

Try:

pip install --upgrade hyperextract

Or use a virtual environment:

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install hyperextract


Configuration

Where do I set my API key?

Option 1: CLI

he config init -k YOUR_API_KEY

Option 2: Environment variable

export OPENAI_API_KEY=your-api-key

Option 3: .env file

OPENAI_API_KEY=your-api-key

Can I use a different LLM provider?

Yes, set the base URL:

he config set llm.base_url https://your-provider.com/v1

Which models are supported?

  • OpenAI models (gpt-4o, gpt-4o-mini, etc.)
  • Any OpenAI-compatible API

Usage

Which template should I use?

See the How to Choose guide or use:

he list template

How do I process a PDF?

Convert to text first:

pdftotext document.pdf document.txt
he parse document.txt -t general/graph -l en

Can I process multiple documents?

Option 1: Feed incrementally

he parse doc1.md -t general/graph -o ./ka/ -l en
he feed ./ka/ doc2.md
he feed ./ka/ doc3.md

Option 2: Process directory

he parse ./docs/ -t general/graph -o ./ka/ -l en

How do I extract in Chinese?

he parse doc.md -t general/biography_graph -l zh

Performance

Why is extraction slow?

  • Long documents are chunked and processed in parallel
  • Each chunk requires an LLM call
  • Consider using --no-index during batch processing

How can I speed it up?

  1. Use smaller chunk sizes
  2. Reduce max_workers if hitting rate limits
  3. Process documents in parallel (manually)

Memory issues with large documents?

Process in smaller batches:

for batch in chunks(documents, 5):
    for doc in batch:
        ka.feed_text(doc)
    ka.dump("./checkpoint/")


Results

Where is my data stored?

./output/
├── data.json      # Extracted knowledge
├── metadata.json  # Extraction info
└── index/         # Search index

How do I visualize results?

he show ./output/

Or in Python:

# Build index for interactive search/chat in visualization
result.build_index()

result.show()

Interactive Visualization

Can I export to other formats?

import json

# To JSON
json_data = result.data.model_dump_json()

# To dict
data_dict = result.data.model_dump()

Troubleshooting

"API key not found"

Run:

he config init -k YOUR_API_KEY

"Template not found"

List available templates:

he list template

"Index not found" error

Build the index:

he build-index ./output/

Search returns no results

Try: - Different search terms - Increase top_k: he search ./ka/ "query" -n 10 - Check if index is built: he info ./ka/


Advanced

Can I create custom templates?

Yes! See Custom Templates.

Can I use my own extraction method?

Yes, implement and register:

from hyperextract.methods import register_method

class MyMethod:
    def extract(self, text):
        # Your logic
        pass

register_method("my_method", MyMethod, "graph", "Description")

How do I integrate with my application?

from hyperextract import Template

class MyApp:
    def __init__(self):
        self.ka = Template.create("general/graph", "en")

    def process_document(self, text):
        return self.ka.parse(text)

Getting More Help