Skip to content

Step 1: Extract Knowledge

Parse a research paper and extract structured concepts.


Goal

Extract entities, relationships, and concepts from a research paper into a structured knowledge graph.


Preparation

1. Get a Research Paper

Download a paper or use your own. For this tutorial, we'll use a sample:

# Download a sample paper (or use your own)
curl -o paper.md https://arxiv.org/abs/1706.03762  # Attention Is All You Need

2. Convert to Text (if needed)

If you have a PDF:

pdftotext paper.pdf paper.md

Extract Using CLI

Basic Extraction

he parse paper.md -t general/concept_graph -o ./paper_kb/ -l en

What this does: - Reads the paper - Extracts concepts and their relationships - Saves to ./paper_kb/

Verify Extraction

he info ./paper_kb/

Expected output:

Knowledge Abstract Info

Path          ./paper_kb/
Template      general/concept_graph
Language      en
Nodes         25
Edges         32
Index         Built

Visualize

he show ./paper_kb/

Extract Using Python

Script

"""Step 1: Extract knowledge from research paper."""

from dotenv import load_dotenv
load_dotenv()

from hyperextract import Template
from pathlib import Path

# Configuration
PAPER_FILE = "paper.md"
OUTPUT_DIR = "./paper_kb/"

def main():
    # Create template
    print("Creating concept extraction template...")
    ka = Template.create("general/concept_graph", language="en")

    # Read paper
    print(f"Reading: {PAPER_FILE}")
    text = Path(PAPER_FILE).read_text(encoding="utf-8")

    # Extract knowledge
    print("Extracting concepts and relationships...")
    result = ka.parse(text)

    # Display results
    print(f"\nExtraction Complete:")
    print(f"  Nodes: {len(result.nodes)}")
    print(f"  Edges: {len(result.edges)}")

    # Show sample nodes
    print("\nSample Concepts:")
    for node in result.nodes[:5]:
        print(f"  - {node.name} ({node.type})")

    # Save
    print(f"\nSaving to: {OUTPUT_DIR}")
    result.dump(OUTPUT_DIR)

    # Build index for next step
    print("Building search index...")
    result.build_index()
    result.dump(OUTPUT_DIR)

    print("\n✓ Step 1 complete!")
    print(f"  Knowledge base: {OUTPUT_DIR}")
    print(f"\nNext: Run 'python step2_search.py'")

if __name__ == "__main__":
    main()

Run

python step1_extract.py

Understanding the Output

What Was Extracted?

The concept graph template extracts:

Entities: - Concepts (models, algorithms, techniques) - Authors - Datasets - Metrics

Relations: - uses — Concept uses another - improves_upon — Improvement relationships - evaluated_on — Evaluation datasets - achieves — Results/metrics

Example Output

# Entities
[
    {"name": "Transformer", "type": "model"},
    {"name": "Attention Mechanism", "type": "concept"},
    {"name": "BLEU Score", "type": "metric"}
]

# Relations
[
    {"source": "Transformer", "target": "Attention Mechanism", "type": "uses"},
    {"source": "Transformer", "target": "BLEU Score", "type": "achieves"}
]

Troubleshooting

"No entities extracted"

  • Check paper is not empty: wc -l paper.md
  • Try different template: general/graph
  • Check language setting matches document

"Extraction is slow"

  • Long papers are chunked automatically
  • Each chunk requires an LLM call
  • Consider using --no-index and building later

Next Step

Step 2: Semantic Search