Provider System¶
Hyper-Extract supports three ways to connect to LLMs: OpenAI, Alibaba Bailian, and local vLLM. All use the same create_client() interface — only the first line changes.
Verified Model Compatibility¶
Cloud API¶
| Platform | Model | json_schema Support |
Compatible | Notes |
|---|---|---|---|---|
| OpenAI | gpt-4o / gpt-4o-mini / gpt-5 | ✅ | ✅ | Native support, recommended |
| Alibaba Bailian | qwen-plus / qwen-turbo / qwen3.6-plus / deepseek-r1 | ✅ | ✅ | Works out of the box |
| Alibaba Bailian | qwen-max / deepseek-v3 | ❌ | ❌ | Only json_object; json_schema not supported |
Bailian users: Both qwen-max and deepseek-v3 do not support
json_schema. If you hitmessages must contain the word 'json'or get non-JSON output, switch to qwen-plus, qwen-turbo, or deepseek-r1. See Issue #24.
Local Deployment¶
| Service | Model | Quantization | VRAM | Verified |
|---|---|---|---|---|
| LLM | Qwen3.5-9B | GPTQ-Marlin 4bit | ~8GB | ✅ AutoList / AutoGraph |
| Embedding | BAAI/bge-m3 | None | ~2GB | ✅ Semantic search |
Cloud Embedding¶
| Platform | Model | Dimensions | Verified |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 | ✅ |
| Alibaba Bailian | text-embedding-v4 | 1024 | ✅ |
Quick Start by Platform¶
from hyperextract import create_client
# OpenAI
llm, emb = create_client("openai", api_key="sk-xxx")
# Bailian
llm, emb = create_client("bailian", api_key="sk-xxx")
# Local vLLM
llm, emb = create_client(
llm="vllm:Qwen3.5-9B@http://localhost:8000/v1",
embedder="vllm:bge-m3@http://localhost:8001/v1",
api_key="dummy",
)
Full configuration options: Provider Configuration Guide.
vLLM Deployment¶
Start LLM Service¶
vllm serve /path/to/qwen3.5-9b-gptq-marlin \
--served-model-name Qwen/Qwen3.5-9B \
--trust-remote-code \
--quantization gptq_marlin \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--port 8000 \
--api-key dummy
Start Embedding Service¶
Docker¶
docker run --runtime nvidia --gpus all \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model Qwen/Qwen3.5-9B \
--trust-remote-code
Recommendations¶
Use non-thinking models for local vLLM deployment.
When deploying locally, thinking models (e.g. Qwen3.5 thinking mode) output <think>...</think> tags, which conflict with constrained decoding's requirement for JSON from the first token. Disable thinking mode when deploying Qwen3.5-9B:
DeepSeek-R1 via Bailian is verified to work — Bailian filters out <think> tags on the backend, so AutoGraph and with_structured_output work correctly. If accessing DeepSeek-R1 through other channels, thinking output may still cause issues.
Prefer GPTQ-Marlin over AWQ for quantization — AWQ has compatibility issues with vLLM 0.21.0.