
📖 Unified Embedding API Documentation

Complete API reference for the Unified Embedding API v3.0.0.

Features: Dense Embeddings, Sparse Embeddings, and Document Reranking


🌐 Base URL

https://fahmiaziz-api-embedding.hf.space

For local development:

http://localhost:7860
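
Keeping the base URL configurable lets the same client code target either the hosted Space or a local server. The helper below is only a sketch of one way to do that: `endpoint_url` and the `EMBEDDING_API_URL` environment variable are hypothetical names chosen for illustration, not part of the API.

```python
import os
from typing import Optional

HOSTED_BASE_URL = "https://fahmiaziz-api-embedding.hf.space"
LOCAL_BASE_URL = "http://localhost:7860"


def endpoint_url(path: str, base_url: Optional[str] = None) -> str:
    """Join a base URL and an endpoint path with exactly one slash between them."""
    # EMBEDDING_API_URL is a hypothetical override, not something the API defines.
    base = base_url or os.environ.get("EMBEDDING_API_URL", HOSTED_BASE_URL)
    return base.rstrip("/") + "/" + path.lstrip("/")
```

For example, `endpoint_url("/health", LOCAL_BASE_URL)` gives `http://localhost:7860/health`.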

🔑 Authentication

No authentication is currently required.


📊 Endpoints Overview

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/v1/embeddings/embed | POST | Generate document embeddings |
| /api/v1/embeddings/query | POST | Generate query embeddings |
| /api/v1/rerank | POST | Rerank documents by relevance |
| /api/v1/models | GET | List available models |
| /api/v1/models/{model_id} | GET | Get model information |
| /health | GET | Health check |
| / | GET | API information |

🚀 Embedding Endpoints

1. Generate Document Embeddings

POST /api/v1/embeddings/embed

Generate embeddings for document texts. Supports both single and batch processing.

Request Body

{
  "texts": ["string"],           // Required: List of texts (1-100 items)
  "model_id": "string",          // Required: Model identifier
  "prompt": "string",            // Optional: Instruction prompt
  "options": {                   // Optional: Embedding parameters
    "normalize_embeddings": true,
    "batch_size": 32,
    "max_length": 512,
    "show_progress_bar": false
  }
}

Parameters

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| texts | array[string] | ✅ Yes | List of texts to embed (min: 1, max: 100) |
| model_id | string | ✅ Yes | Model identifier (e.g., "qwen3-0.6b") |
| prompt | string | ❌ No | Instruction prompt for the model |
| options | object | ❌ No | Additional embedding parameters |

Options Parameters

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| normalize_embeddings | boolean | false | L2 normalize output embeddings |
| batch_size | integer | 32 | Processing batch size (1-256) |
| max_length | integer | 512 | Maximum sequence length (1-8192) |
| show_progress_bar | boolean | false | Display progress during encoding |
| precision | string | float32 | Output precision ("float32", "int8", or "binary") |
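
The documented ranges can be checked on the client before a request is sent, which avoids a round trip that would end in a 422. The following is a minimal sketch under that assumption; `build_embed_payload` is a hypothetical helper, and the server remains the authority on what it accepts.

```python
from typing import Any, Dict, List, Optional

ALLOWED_PRECISIONS = {"float32", "int8", "binary"}


def build_embed_payload(
    texts: List[str],
    model_id: str,
    prompt: Optional[str] = None,
    **options: Any,
) -> Dict[str, Any]:
    """Build a request body for /embed or /query, checking the documented ranges."""
    if not 1 <= len(texts) <= 100:
        raise ValueError("texts must contain between 1 and 100 items")
    if not 1 <= options.get("batch_size", 32) <= 256:
        raise ValueError("batch_size must be between 1 and 256")
    if not 1 <= options.get("max_length", 512) <= 8192:
        raise ValueError("max_length must be between 1 and 8192")
    if options.get("precision", "float32") not in ALLOWED_PRECISIONS:
        raise ValueError(f"precision must be one of {sorted(ALLOWED_PRECISIONS)}")

    payload: Dict[str, Any] = {"texts": list(texts), "model_id": model_id}
    if prompt is not None:
        payload["prompt"] = prompt
    if options:
        # Only send options the caller set explicitly; the server applies defaults.
        payload["options"] = dict(options)
    return payload
```

The result can be passed directly as `json=` to `requests.post`.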

Response - Single Text (Dense)

{
  "embedding": [0.123, -0.456, 0.789, ...],
  "dimension": 768,
  "model_id": "qwen3-0.6b",
  "processing_time": 0.0523
}

Response - Batch (Dense)

{
  "embeddings": [
    [0.123, -0.456, ...],
    [0.234, 0.567, ...],
    [0.345, -0.678, ...]
  ],
  "dimension": 768,
  "count": 3,
  "model_id": "qwen3-0.6b",
  "processing_time": 0.1245
}

Response - Single Text (Sparse)

{
  "sparse_embedding": {
    "text": "Hello world",
    "indices": [10, 25, 42, 100],
    "values": [0.85, 0.62, 0.91, 0.73]
  },
  "model_id": "splade-pp-v2",
  "processing_time": 0.0421
}

Response - Batch (Sparse)

{
  "embeddings": [
    {
      "text": "First doc",
      "indices": [10, 25, 42],
      "values": [0.85, 0.62, 0.91]
    },
    {
      "text": "Second doc",
      "indices": [15, 30, 50],
      "values": [0.73, 0.88, 0.65]
    }
  ],
  "count": 2,
  "model_id": "splade-pp-v2",
  "processing_time": 0.0892
}
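
Because the response shape depends on both the model type and the number of input texts, client code may want to normalize all four shapes into one list. The function below is a sketch based only on the field names shown in the examples above; `extract_embeddings` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def extract_embeddings(data: Dict[str, Any]) -> List[Any]:
    """Return embeddings as a list, whichever documented shape was returned."""
    if "embeddings" in data:        # batch response (dense or sparse)
        return list(data["embeddings"])
    if "embedding" in data:         # single dense response
        return [data["embedding"]]
    if "sparse_embedding" in data:  # single sparse response
        return [data["sparse_embedding"]]
    raise KeyError("response contains no embedding field")
```

Dense entries come back as lists of floats, while sparse entries are objects with `indices` and `values`, so callers still need to know which model type they requested.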

Examples

Single Text (Dense Model):

curl -X 'POST' \
  'https://fahmiaziz-api-embedding.hf.space/api/v1/embeddings/embed' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "texts": ["What is artificial intelligence?"],
  "model_id": "qwen3-0.6b"
}'

Single Text (Sparse Model):

curl -X 'POST' \
  'https://fahmiaziz-api-embedding.hf.space/api/v1/embeddings/embed' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "texts": ["Hello world"],
  "model_id": "splade-pp-v2"
}'

Batch (with Options):

curl -X 'POST' \
  'https://fahmiaziz-api-embedding.hf.space/api/v1/embeddings/embed' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "texts": [
    "First document to embed",
    "Second document to embed",
    "Third document to embed"
  ],
  "model_id": "qwen3-0.6b",
  "options": {
    "normalize_embeddings": true,
    "batch_size": 32
  }
}'

Python Example:

import requests

url = "https://fahmiaziz-api-embedding.hf.space/api/v1/embeddings/embed"

payload = {
    "texts": ["Hello world"],
    "model_id": "qwen3-0.6b"
}

response = requests.post(url, json=payload)
data = response.json()

print(f"Embedding dimension: {data['dimension']}")
print(f"Processing time: {data['processing_time']:.3f}s")

2. Generate Query Embeddings

POST /api/v1/embeddings/query

Generate embeddings optimized for search queries. Some models differentiate between query and document embeddings.

Request Body

Same as the /embed endpoint.

{
  "texts": ["string"],
  "model_id": "string",
  "prompt": "string",
  "options": {}
}

Response

Same format as the /embed endpoint.

Examples

Single Query:

curl -X 'POST' \
  'https://fahmiaziz-api-embedding.hf.space/api/v1/embeddings/query' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "texts": ["What is machine learning?"],
  "model_id": "qwen3-0.6b",
  "prompt": "Represent this query for retrieval",
  "options": {
    "normalize_embeddings": true
  }
}'

Batch Queries:

curl -X 'POST' \
  'https://fahmiaziz-api-embedding.hf.space/api/v1/embeddings/query' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "texts": [
    "First query",
    "Second query",
    "Third query"
  ],
  "model_id": "qwen3-0.6b"
}'

Python Example:

import requests

url = "https://fahmiaziz-api-embedding.hf.space/api/v1/embeddings/query"

payload = {
    "texts": ["What is AI?"],
    "model_id": "qwen3-0.6b",
    "options": {
        "normalize_embeddings": True
    }
}

response = requests.post(url, json=payload)
embedding = response.json()["embedding"]

3. Rerank Documents

POST /api/v1/rerank

Rerank documents based on their relevance to a query using CrossEncoder models.

Request Body

{
  "query": "string",             // Required: Search query
  "documents": ["string"],       // Required: List of documents (min: 1)
  "model_id": "string",          // Required: Reranking model identifier
  "top_k": integer               // Required: Number of top results to return
}

Parameters

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| query | string | ✅ Yes | Search query text |
| documents | array[string] | ✅ Yes | List of documents to rerank (min: 1) |
| model_id | string | ✅ Yes | Reranking model identifier |
| top_k | integer | ✅ Yes | Maximum number of results to return |

Response

{
  "model_id": "jina-reranker-v3",
  "processing_time": 0.56,
  "query": "Python for data science",
  "results": [
    {
      "index": 0,
      "score": 0.95,
      "text": "Python is great for data science"
    },
    {
      "index": 2,
      "score": 0.73,
      "text": "R is also used in data science"
    }
  ]
}

Response Fields

| Field | Type | Description |
|-------|------|-------------|
| model_id | string | Model identifier used |
| processing_time | float | Processing time in seconds |
| query | string | Original search query |
| results | array | Reranked documents with scores |
| results[].index | integer | Original index in input documents |
| results[].score | float | Relevance score (0-1, normalized) |
| results[].text | string | Document text |
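
Since `results[].index` refers to the position in the submitted `documents` list, it can be used to join reranked results back to local records (IDs, metadata) that were never sent to the API. A short sketch, with `ranked_documents` as a hypothetical helper name:

```python
from typing import Any, Dict, List, Tuple


def ranked_documents(
    documents: List[str], response: Dict[str, Any]
) -> List[Tuple[int, float, str]]:
    """Return (original_index, score, text) tuples in the order the API returned them."""
    ranked = []
    for item in response["results"]:
        idx = item["index"]
        # Look the text up in the caller's own list so local data stays aligned.
        ranked.append((idx, item["score"], documents[idx]))
    return ranked
```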

Examples

Basic Reranking:

curl -X 'POST' \
  'https://fahmiaziz-api-embedding.hf.space/api/v1/rerank' \
  -H 'Content-Type: application/json' \
  -d '{
  "query": "Python for data science",
  "documents": [
    "Python is great for data science",
    "Java is used for enterprise applications",
    "R is also used in data science",
    "JavaScript is for web development"
  ],
  "model_id": "jina-reranker-v3",
  "top_k": 2
}'

Python Example:

import requests

url = "https://fahmiaziz-api-embedding.hf.space/api/v1/rerank"

payload = {
    "query": "best programming language for beginners",
    "documents": [
        "Python is beginner-friendly with simple syntax",
        "C++ is powerful but complex for beginners",
        "JavaScript is essential for web development",
        "Rust offers memory safety but steep learning curve"
    ],
    "model_id": "jina-reranker-v3",
    "top_k": 2
}

response = requests.post(url, json=payload)
data = response.json()

print(f"Top result: {data['results'][0]['text']}")
print(f"Score: {data['results'][0]['score']:.3f}")

JavaScript Example:

const url = "https://fahmiaziz-api-embedding.hf.space/api/v1/rerank";

const response = await fetch(url, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    query: "AI applications",
    documents: [
      "Computer vision for image recognition",
      "Recipe for chocolate cake",
      "Natural language processing for chatbots",
      "Travel guide to Paris"
    ],
    model_id: "jina-reranker-v3",
    top_k: 2
  })
});

const { results } = await response.json();
console.log("Top results:", results);

🤖 Model Management

4. List Available Models

GET /api/v1/models

Get a list of all available embedding models.

Response

{
  "models": [
    {
      "id": "qwen3-0.6b",
      "name": "Qwen/Qwen3-Embedding-0.6B",
      "type": "embeddings",
      "loaded": true,
      "repository": "https://huggingface.co/Qwen/Qwen3-Embedding-0.6B"
    },
    {
      "id": "splade-pp-v2",
      "name": "prithivida/Splade_PP_en_v2",
      "type": "sparse-embeddings",
      "loaded": true,
      "repository": "https://huggingface.co/prithivida/Splade_PP_en_v2"
    }
  ],
  "total": 2
}

Example

curl -X 'GET' \
  'https://fahmiaziz-api-embedding.hf.space/api/v1/models' \
  -H 'accept: application/json'
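
A client can use the listing to pick a model of the right type before calling an embedding endpoint. The sketch below groups loaded model IDs by their `type` field, assuming the response shape shown above; `loaded_models_by_type` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Any, Dict, List


def loaded_models_by_type(listing: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each model type to the IDs of the models reported as loaded."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for model in listing.get("models", []):
        if model.get("loaded"):
            grouped[model["type"]].append(model["id"])
    return dict(grouped)
```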

5. Get Model Information

GET /api/v1/models/{model_id}

Get detailed information about a specific model.

Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| model_id | string | ✅ Yes | Model identifier |

Response

{
  "id": "qwen3-0.6b",
  "name": "Qwen/Qwen3-Embedding-0.6B",
  "type": "embeddings",
  "loaded": true,
  "repository": "https://huggingface.co/Qwen/Qwen3-Embedding-0.6B"
}

Example

curl -X 'GET' \
  'https://fahmiaziz-api-embedding.hf.space/api/v1/models/qwen3-0.6b' \
  -H 'accept: application/json'

🏥 System Endpoints

6. Health Check

GET /health

Check API health status.

Response

{
  "status": "ok",
  "total_models": 2,
  "loaded_models": 2,
  "startup_complete": true
}

Example

curl -X 'GET' \
  'https://fahmiaziz-api-embedding.hf.space/health' \
  -H 'accept: application/json'
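
Because models are loaded at startup, a client may want to wait until the health check reports readiness before sending work. The sketch below polls until `status` is "ok" and `startup_complete` is true. `wait_until_ready` is a hypothetical helper, and the HTTP call is passed in as a callable so the example does not depend on any particular client library.

```python
import time
from typing import Any, Callable, Dict


def wait_until_ready(
    fetch_health: Callable[[], Dict[str, Any]],
    timeout: float = 60.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a health callable until the server reports it has finished starting."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            health = fetch_health()
            if health.get("status") == "ok" and health.get("startup_complete"):
                return True
        except Exception:
            pass  # treat connection errors as "not ready yet"
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

With `requests`, the callable could be `lambda: requests.get(base_url + "/health", timeout=5).json()`.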

7. API Information

GET /

Get basic API information.

Response

{
  "message": "Unified Embedding API - Dense & Sparse Embeddings",
  "version": "3.0.0",
  "docs_url": "/docs"
}

❌ Error Responses

All errors follow this format:

{
  "detail": "Error message description"
}

HTTP Status Codes

| Code | Description |
|------|-------------|
| 200 | Success |
| 400 | Bad Request - Invalid input |
| 404 | Not Found - Model not found |
| 422 | Unprocessable Entity - Validation error |
| 500 | Internal Server Error |
| 503 | Service Unavailable - Server not ready |

Common Errors

Model Not Found (404):

{
  "detail": "Model 'unknown-model' not found in configuration"
}

Validation Error (422):

{
  "detail": [
    {
      "loc": ["body", "texts"],
      "msg": "texts list cannot be empty",
      "type": "value_error"
    }
  ]
}

Batch Too Large (422):

{
  "detail": "Batch size (150) exceeds maximum (100)"
}
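
As the examples show, `detail` is sometimes a plain string and sometimes a list of validation errors, so client code that logs errors has to handle both. A small sketch, based only on the shapes above, with `error_message` as a hypothetical helper name:

```python
from typing import Any, Dict


def error_message(body: Dict[str, Any]) -> str:
    """Flatten `detail`, which is either a string or a list of validation errors."""
    detail = body.get("detail", "Unknown error")
    if isinstance(detail, str):
        return detail
    parts = []
    for err in detail:
        # "loc" is a path such as ["body", "texts"]; join it into "body.texts".
        location = ".".join(str(p) for p in err.get("loc", []))
        parts.append(f"{location}: {err.get('msg', '')}")
    return "; ".join(parts)
```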

📦 Available Models

Dense Embedding Models

| Model ID | Name | Dimension | Description |
|----------|------|-----------|-------------|
| qwen3-0.6b | Qwen/Qwen3-Embedding-0.6B | 768 | Efficient multilingual embeddings |

Sparse Embedding Models

| Model ID | Name | Type | Description |
|----------|------|------|-------------|
| splade-pp-v2 | prithivida/Splade_PP_en_v2 | Sparse | SPLADE++ English v2 |
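
Sparse embeddings are returned as parallel `indices` and `values` arrays, and a common way to compare two of them is a dot product over the indices they share. The sketch below assumes that scoring scheme; the API itself does not prescribe one, and `sparse_to_dict` and `sparse_dot` are hypothetical helper names.

```python
from typing import Any, Dict


def sparse_to_dict(sparse: Dict[str, Any]) -> Dict[int, float]:
    """Turn the parallel indices/values arrays into an index -> weight mapping."""
    return dict(zip(sparse["indices"], sparse["values"]))


def sparse_dot(a: Dict[str, Any], b: Dict[str, Any]) -> float:
    """Dot product of two sparse embeddings; only shared indices contribute."""
    wa, wb = sparse_to_dict(a), sparse_to_dict(b)
    if len(wa) > len(wb):
        wa, wb = wb, wa  # iterate over the shorter vector
    return sum(weight * wb[idx] for idx, weight in wa.items() if idx in wb)
```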

Reranking Models

| Model ID | Name | Type | Description |
|----------|------|------|-------------|
| jina-reranker-v3 | jinaai/jina-reranker-v3-base-en | CrossEncoder | High-quality reranking (English) |
| bge-v2-m3 | BAAI/bge-reranker-v2-m3 | CrossEncoder | Multilingual reranking |

🔧 Rate Limits

Current Limits:

  • Max text length: 8,192 characters
  • Max batch size: 100 texts per request
  • No rate limiting (subject to server resources)
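
Corpora larger than 100 texts have to be split across several requests. One simple approach, sketched here with a hypothetical `chunked` helper, is to slice the input into consecutive batches no larger than the documented maximum:

```python
from typing import Iterator, List

MAX_BATCH = 100  # documented maximum number of texts per request


def chunked(texts: List[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield texts[start:start + size]
```

Each yielded batch can be sent as the `texts` field of one request, and the returned embeddings concatenated in order.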

💡 Best Practices

1. Batch Processing

Always batch multiple texts together for better performance:

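import requests

url = "https://fahmiaziz-api-embedding.hf.space/api/v1/embeddings/embed"
texts = ["First document", "Second document", "Third document"]

# ❌ Bad - one HTTP round trip per text
for text in texts:
    response = requests.post(url, json={"texts": [text], "model_id": "qwen3-0.6b"})

# ✅ Good - a single batch request (up to 100 texts)
response = requests.post(url, json={"texts": texts, "model_id": "qwen3-0.6b"})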

2. Normalize Embeddings for Similarity

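normalize_embeddings defaults to false. When you compare embeddings by cosine similarity, request normalized vectors so that a plain dot product gives the cosine score: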

payload = {
    "texts": ["text"],
    "model_id": "qwen3-0.6b",
    "options": {"normalize_embeddings": True}
}
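
The reason normalization helps is that for unit-length vectors the dot product and the cosine similarity are the same number, so the cheaper dot product can be used at search time. The following pure-Python sketch illustrates this; the helper names are chosen for the example and are not part of the API.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

For any two non-zero vectors `a` and `b`, `dot(l2_normalize(a), l2_normalize(b))` matches `cosine(a, b)` up to floating-point error.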

3. Model Selection

  • Dense models (qwen3-0.6b): Best for semantic similarity
  • Sparse models (splade-pp-v2): Best for keyword matching + semantic
  • Rerank models (jina-reranker-v3): Best for re-scoring top candidates

4. Two-Stage Retrieval (Recommended for RAG)

# embed_query, vector_search, and rerank are placeholders for your own helpers
# Stage 1: Fast retrieval with embeddings (top 100)
query_embedding = embed_query(query)
candidates = vector_search(query_embedding, top_k=100)

# Stage 2: Precise reranking (top 10)
reranked = rerank(
    query=query,
    documents=[c["text"] for c in candidates],
    model_id="jina-reranker-v3",
    top_k=10
)
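
The two stages can be wired together without committing to a particular vector store. In the sketch below, `vector_search` is a brute-force stand-in for a real vector database, and `embed_query` and `rerank` are injected callables that would wrap the /api/v1/embeddings/query and /api/v1/rerank endpoints. All three are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def vector_search(query_vec: Vector, index: List[Dict], top_k: int) -> List[Dict]:
    """Brute-force search: rank stored items by dot product with the query."""
    scored = sorted(
        index,
        key=lambda item: sum(q * d for q, d in zip(query_vec, item["embedding"])),
        reverse=True,
    )
    return scored[:top_k]


def two_stage_search(
    query: str,
    index: List[Dict],
    embed_query: Callable[[str], Vector],
    rerank: Callable[[str, List[str], int], List[Dict]],
    recall_k: int = 100,
    final_k: int = 10,
) -> List[Dict]:
    """Stage 1 narrows the corpus by vector similarity; stage 2 reranks the survivors."""
    candidates = vector_search(embed_query(query), index, recall_k)
    return rerank(query, [c["text"] for c in candidates], final_k)
```

Each item in `index` is assumed to be a dict with `text` and `embedding` keys. Passing the two API calls in as functions keeps the pipeline easy to test with fakes.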

5. Error Handling

Always handle errors gracefully:

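import requests

try:
    response = requests.post(url, json=payload, timeout=30)
    response.raise_for_status()
    data = response.json()
except requests.exceptions.HTTPError as e:
    # The server answered with a 4xx/5xx status; the body carries a "detail" field
    print(f"HTTP error: {e} - {e.response.text}")
except requests.exceptions.RequestException as e:
    # Connection problems, timeouts, and other transport failures
    print(f"Request failed: {e}")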
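
A 503 means the server is not ready yet, so it is often worth retrying rather than failing immediately. The sketch below retries with exponential backoff. `post_with_retry` is a hypothetical helper, the retry policy is an assumption rather than documented server behavior, and the request itself is passed in as a callable returning `(status_code, body)`.

```python
import time
from typing import Any, Callable, Tuple


def post_with_retry(
    send: Callable[[], Tuple[int, Any]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call `send` until it returns a status other than 503 or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503 or attempt == max_attempts - 1:
            return status, body
        sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts
    raise RuntimeError("unreachable")
```

With `requests`, `send` could be a small function that posts the payload and returns `(response.status_code, response.json())`.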

🐛 Troubleshooting

Empty Response

  • Check that the texts field is not empty
  • Verify that the model_id exists (see GET /api/v1/models)

Slow Performance

  • Use batch requests instead of multiple single requests
  • Reduce batch_size in options if you run into memory issues
  • Check that the model is preloaded (the first request is slower)

Connection Errors

  • Verify that the base URL is correct
  • Check network connectivity
  • Ensure the server is running (check the /health endpoint)

📞 Support


🔄 Changelog

v3.0.0 (Current)

  • ✨ Added reranking endpoint (/api/v1/rerank)
  • ✨ Support for CrossEncoder models
  • ✨ Unified batch-only response format
  • ✨ Flexible kwargs support
  • ✨ In-memory caching
  • ✨ Improved error handling
  • ✨ Comprehensive documentation
  • 🐛 Fixed type hint errors in RerankModel
  • 🐛 Fixed duplicate parameter errors in rerank endpoint

Last Updated: 2025-11-02