---
title: Api Embedding
emoji: 🐠
colorFrom: green
colorTo: purple
sdk: docker
pinned: false
---

# Unified Embedding API

**🧩 A self-hosted embedding service for dense, sparse, and reranking models with an OpenAI-compatible API.**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.115+-green.svg)](https://fastapi.tiangolo.com/)
[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow)](https://huggingface.co/spaces/fahmiaziz/api-embedding)

---

## Overview

The Unified Embedding API is a modular, self-hosted solution designed to simplify the development and management of embedding models for Retrieval-Augmented Generation (RAG) and semantic search applications. Built on FastAPI and Sentence Transformers, this API provides a unified interface for dense embeddings, sparse embeddings (SPLADE), and document reranking through CrossEncoder models.

**Key Differentiation:** Unlike traditional embedding services that require separate infrastructure for each model type, this API consolidates all embedding operations into a single, configurable service with OpenAI-compatible responses.

### Project Motivation

During the development of RAG and agentic systems for production environments and portfolio projects, several operational challenges emerged:

1. **Development Environment Overhead:** Each experiment required setting up isolated environments with PyTorch, Transformers, and associated dependencies (often 5-10GB per environment)
2. **Model Experimentation Costs:** Testing different models for optimal precision, MRR, and recall metrics necessitated downloading multiple model versions, consuming significant disk space and compute resources
3. **Hardware Limitations:** Running models locally on CPU-only machines frequently resulted in thermal throttling and system instability

**Solution Approach:** After evaluating Hugging Face's Text Embeddings Inference (TEI), the need for a more flexible, configuration-driven solution became apparent. This project addresses these challenges by:

- Providing a single API endpoint that can serve multiple model types
- Enabling model switching through configuration files without code changes
- Leveraging Hugging Face Spaces for free, serverless hosting
- Maintaining compatibility with OpenAI's client libraries for seamless integration

---

## Technical Motivation

### Architecture Decisions

#### 1. Framework Selection: SentenceTransformers + FastAPI

**SentenceTransformers** was chosen as the core embedding library for several technical reasons:

- **Unified Model Interface:** Provides consistent APIs across diverse model architectures (BERT, RoBERTa, SPLADE, CrossEncoders)
- **Model Ecosystem:** Direct compatibility with 5,000+ pre-trained models on Hugging Face Hub

**FastAPI** serves as the web framework due to:

- **Async-First Architecture:** Non-blocking I/O operations critical for handling concurrent embedding requests
- **Automatic API Documentation:** OpenAPI/Swagger generation reduces documentation overhead
- **Type Safety:** Pydantic integration ensures request validation at the schema level

#### 2. Hosting Strategy: Hugging Face Spaces

Deploying on Hugging Face Spaces provides several operational advantages:

- Zero infrastructure cost for CPU-based workloads (2 vCPU, 16GB RAM)
- Eliminates need for dedicated VPS or cloud compute instances
- No egress fees for model weight downloads from HF Hub
- Built-in CI/CD through git-based deployments
- Easy transition to paid GPU instances for larger models
- Native support for Docker-based deployments

---

## Features

### Core Capabilities

- **Multi-Model Support:** Serve dense embeddings (transformers), sparse embeddings (SPLADE), and reranking models (CrossEncoders) from a single API
- **OpenAI Compatibility:** Drop-in replacement for OpenAI's embedding API with client library support
- **Configuration-Driven:** Switch models through YAML configuration without code modifications
- **Batch Processing:** Accepts a single text or a list of texts in one request
- **Type Safety:** Full Pydantic validation with OpenAPI schema generation
- **Async Operations:** Non-blocking request handling with FastAPI's async/await

---

## Architecture

### System Components

```
┌─────────────────────────────────────────────────────────┐
│                     FastAPI Server                      │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │ Embeddings │  │  Reranking │  │   Models   │         │
│  │  Endpoint  │  │  Endpoint  │  │  Endpoint  │         │
│  └────────────┘  └────────────┘  └────────────┘         │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│                   Model Manager                         │
│  • Configuration Loading                                │
│  • Model Lifecycle Management                           │
│  • Thread-Safe Model Access                             │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│              Embedding Implementations                  │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐     │
│  │    Dense     │ │    Sparse    │ │   Reranking  │     │
│  │(Transformer) │ │   (SPLADE)   │ │(CrossEncoder)│     │
│  └──────────────┘ └──────────────┘ └──────────────┘     │
└─────────────────────────────────────────────────────────┘
```
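
The Model Manager layer in the diagram is responsible for lifecycle management and thread-safe access. A minimal sketch of that idea follows, assuming lazy loading behind a lock. The `ModelManager` class shown here, its `loaders` argument, and its method names are hypothetical illustrations and are not taken from `src/core/manager.py`.

```python
import threading
from typing import Any, Callable, Dict, List


class ModelManager:
    """Lazily loads models by ID and guards the cache with a lock (illustrative only)."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]]) -> None:
        # `loaders` maps a model ID to a zero-argument factory (hypothetical design).
        self._loaders = loaders
        self._models: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, model_id: str) -> Any:
        """Return the model for `model_id`, loading it on first use."""
        if model_id not in self._loaders:
            raise KeyError(f"Unknown model: {model_id}")
        # Holding the lock ensures concurrent requests never load the same model twice.
        with self._lock:
            if model_id not in self._models:
                self._models[model_id] = self._loaders[model_id]()
            return self._models[model_id]

    def loaded(self) -> List[str]:
        """Return the IDs of models that are currently in memory."""
        with self._lock:
            return sorted(self._models)
```

A single coarse lock keeps the sketch simple; it also means one slow model load blocks requests for other models, which a per-model lock would avoid.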

### Project Structure

```
unified-embedding-api/
├── src/
│   ├── api/                    # API layer
│   │   ├── dependencies.py     # Dependency injection
│   │   └── routes/
│   │       ├── embeddings.py   # Dense/sparse endpoints
│   │       ├── model_list.py   # Model management
│   │       ├── health.py       # Health checks
│   │       └── rerank.py       # Reranking endpoint
│   ├── core/                   # Business logic
│   │   ├── base.py             # Abstract base classes
│   │   ├── config.py           # Configuration models
│   │   ├── exceptions.py       # Custom exceptions     
│   │   └── manager.py          # Model lifecycle management
│   ├── models/                 # Domain models
│   │   ├── embeddings/
│   │   │   ├── dense.py        # Dense embedding implementation
│   │   │   ├── sparse.py       # Sparse embedding implementation
│   │   │   └── rank.py         # Reranking implementation
│   │   └── schemas/
│   │       ├── common.py       # Shared schemas
│   │       ├── requests.py     # Request models
│   │       └── responses.py    # Response models
│   ├── config/
│   │   ├── settings.py         # Application settings
│   │   └── models.yaml         # Model configuration
│   └── utils/
│       ├── logger.py           # Logging configuration
│       └── validators.py       # Validation of kwargs, tokens, etc.
├── app.py                      # Application entry point
├── requirements.txt            # Development dependencies
└── Dockerfile                  # Container definition
```

---

## Quick Start

### Deployment on Hugging Face Spaces

**Prerequisites:**
- Hugging Face account
- Git installed locally

**Steps:**

1. **Duplicate Space**
   - Navigate to [fahmiaziz/api-embedding](https://huggingface.co/spaces/fahmiaziz/api-embedding)
   - Click the three-dot menu → "Duplicate this Space"

2. **Configure Environment**
   - In Space settings, add `HF_TOKEN` as a repository secret (for private model access)
   - Ensure Space visibility is set to "Public"

3. **Clone Repository**
   ```bash
   git clone https://huggingface.co/spaces/YOUR_USERNAME/api-embedding
   cd api-embedding
   ```

4. **Configure Models**
   Edit `src/config/models.yaml`:
   ```yaml
   models:
     custom-model:
       name: "organization/model-name"
       type: "embeddings"  # Options: embeddings, sparse-embeddings, rerank
   ```

5. **Deploy Changes**
   ```bash
   git add src/config/models.yaml
   git commit -m "Configure custom models"
   git push
   ```

6. **Access API** 
   - Click **⋯** → **Embed this Space** → copy **Direct URL**
   - Base URL: `https://YOUR_USERNAME-api-embedding.hf.space`
   - Documentation: `https://YOUR_USERNAME-api-embedding.hf.space/docs`

### Local Development (NOT RECOMMENDED)

**System Requirements:**
- Python 3.10+
- 8GB RAM minimum
- 10GB+ disk space

**Setup:**

```bash
# Clone repository
git clone https://github.com/fahmiaziz98/unified-embedding-api.git
cd unified-embedding-api

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Start server
python app.py
```

The server will be available at `http://localhost:7860`.

### Docker Deployment

```bash
# Build image
docker build -t unified-embedding-api .

# Run container
docker run -p 7860:7860 unified-embedding-api
```

---

## Usage

### Native API (requests)

```python
import requests

BASE_URL = "https://fahmiaziz-api-embedding.hf.space/api/v1"

# Generate embeddings
response = requests.post(
    f"{BASE_URL}/embeddings",
    json={
        "input": "Natural language processing",
        "model": "qwen3-0.6b"
    }
)

data = response.json()
embedding = data["data"][0]["embedding"]
print(f"Embedding dimensions: {len(embedding)}")
```

### OpenAI Client Integration

The API implements OpenAI's embedding API specification, enabling direct integration with OpenAI's Python client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://fahmiaziz-api-embedding.hf.space/api/v1",
    api_key="not-required"  # Placeholder required by client
)

# Single text embedding
response = client.embeddings.create(
    input="Text to embed",
    model="qwen3-0.6b"
)

embedding_vector = response.data[0].embedding
```

**Async Operations:**

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://fahmiaziz-api-embedding.hf.space/api/v1",
    api_key="not-required"
)

async def generate_embeddings(texts: list[str]):
    response = await client.embeddings.create(
        input=texts,
        model="qwen3-0.6b"
    )
    return [item.embedding for item in response.data]

# From synchronous code, drive the coroutine with asyncio.run();
# inside an existing event loop, use `await generate_embeddings(...)` instead
embeddings = asyncio.run(generate_embeddings(["text1", "text2"]))
```

### Document Reranking

```python
import requests

BASE_URL = "https://fahmiaziz-api-embedding.hf.space/api/v1"

response = requests.post(
    f"{BASE_URL}/rerank",
    json={
        "query": "machine learning frameworks",
        "documents": [
            "TensorFlow is a comprehensive ML platform",
            "React is a JavaScript UI library",
            "PyTorch provides flexible neural networks"
        ],
        "model": "bge-v2-m3",
        "top_k": 2
    }
)

results = response.json()["results"]
for result in results:
    print(f"Score: {result['score']:.3f} - {result['text']}")
```
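
Reranking is typically applied to a short candidate list produced by a cheaper first stage. The sketch below shows one way to wire the two stages together on the client side, assuming document vectors were already obtained from the embeddings endpoint. The helper names `cosine`, `first_stage`, and `build_rerank_payload` are hypothetical; only the payload fields (`query`, `documents`, `model`, `top_k`) come from the request shown above.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def first_stage(query_vec: Sequence[float],
                doc_vecs: Sequence[Sequence[float]],
                k: int) -> List[int]:
    """Return the indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]


def build_rerank_payload(query: str, documents: Sequence[str],
                         candidates: Sequence[int], model: str,
                         top_k: int) -> Dict[str, Any]:
    """Build a /rerank request body containing only the first-stage candidates."""
    return {
        "query": query,
        "documents": [documents[i] for i in candidates],
        "model": model,
        "top_k": top_k,
    }
```

The resulting dictionary can be passed as the `json=` argument of the `requests.post` call shown earlier in this section.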

---

## API Reference

### Endpoints

| Endpoint | Method | Description | OpenAI Compatible |
|----------|--------|-------------|-------------------|
| `/api/v1/embeddings` | POST | Generate embeddings | Yes |
| `/api/v1/embed_sparse` | POST | Generate sparse embeddings | No |
| `/api/v1/rerank` | POST | Rerank documents | No |
| `/api/v1/models` | GET | List available models | Partial |
| `/health` | GET | Health check | No |
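
The two GET endpoints are convenient for a quick smoke test after deployment. The following is a small standard-library sketch; `SPACE_URL` is a placeholder to replace with your own Space's direct URL, and the helper names are hypothetical. The response bodies are printed as-is because their exact shape is not documented here.

```python
import json
import urllib.request
from typing import Any

# Assumption: replace with the direct URL of your own Space.
SPACE_URL = "https://YOUR_USERNAME-api-embedding.hf.space"


def endpoint_url(path: str, base: str = SPACE_URL) -> str:
    """Join the base URL and an endpoint path without doubling slashes."""
    return f"{base.rstrip('/')}/{path.lstrip('/')}"


def get_json(path: str, timeout: float = 30.0) -> Any:
    """Send a GET request to an endpoint and decode the JSON body."""
    with urllib.request.urlopen(endpoint_url(path), timeout=timeout) as resp:
        return json.load(resp)


def smoke_test() -> None:
    """Print the raw health and model-list responses."""
    print(get_json("/health"))
    print(get_json("/api/v1/models"))
```

Call `smoke_test()` once the Space has finished building; a sleeping Space may need a few moments to wake up before it answers.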

### Request Format

**Embeddings (OpenAI-compatible):** `input` accepts a single string or a list of strings.
```json
{
  "input": ["text1", "text2"],
  "model": "model-identifier",
  "encoding_format": "float"
}
```

**Sparse Embeddings:** `input` accepts a single string or a list of strings.
```json
{
  "input": ["text1", "text2"],
  "model": "splade-model-id"
}
```

**Reranking:**
```json
{
  "query": "search query",
  "documents": ["doc1", "doc2"],
  "model": "reranker-id",
  "top_k": 10
}
```

### Response Format

**Standard Embedding Response** (embedding vector truncated for readability):
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.123, -0.456],
      "index": 0
    }
  ],
  "model": "qwen3-0.6b",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}
```
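
When calling the API without the OpenAI client, the response above has to be unpacked by hand. A minimal sketch, assuming the payload follows the structure shown; the function names `extract_embeddings` and `total_tokens` are hypothetical helpers, not part of the project.

```python
from typing import Any, Dict, List


def extract_embeddings(payload: Dict[str, Any]) -> List[List[float]]:
    """Return the embedding vectors ordered by their `index` field."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def total_tokens(payload: Dict[str, Any]) -> int:
    """Return the reported token usage, or 0 if the field is absent."""
    return payload.get("usage", {}).get("total_tokens", 0)
```

Sorting by `index` keeps each vector aligned with the position of its text in the request, even if the items arrive in a different order.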

---

## Configuration

### Model Configuration
The default configuration is optimized for a **2 vCPU / 16GB RAM** CPU instance. See the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for model recommendations and memory usage figures.

Edit `src/config/models.yaml` to add or modify models:

```yaml
models:
  # Dense embedding model
  custom-dense:
    name: "sentence-transformers/all-MiniLM-L6-v2"
    type: "embeddings"

  # Sparse embedding model
  custom-sparse:
    name: "prithivida/Splade_PP_en_v1"
    type: "sparse-embeddings"

  # Reranking model
  custom-reranker:
    name: "BAAI/bge-reranker-base"
    type: "rerank"
```

**Model Type Reference:**

| Type | Description | Use Case |
|------|-------------|----------|
| `embeddings` | Dense vector embeddings | Semantic search, similarity |
| `sparse-embeddings` | Sparse vectors (SPLADE) | Keyword + semantic hybrid |
| `rerank` | CrossEncoder scoring | Precision reranking |
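
A typo in the `type` field is easy to make and only surfaces when the Space rebuilds. The sketch below checks a configuration before pushing, assuming the YAML has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`). The `validate_models` function is a hypothetical helper and does not reflect how the project validates its configuration.

```python
from typing import Any, Dict, List

# The three model types listed in the table above.
ALLOWED_TYPES = {"embeddings", "sparse-embeddings", "rerank"}


def validate_models(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config looks valid."""
    models = config.get("models")
    if not isinstance(models, dict) or not models:
        return ["'models' must be a non-empty mapping"]

    problems: List[str] = []
    for model_id, entry in models.items():
        if not isinstance(entry, dict):
            problems.append(f"{model_id}: entry must be a mapping")
            continue
        if not entry.get("name"):
            problems.append(f"{model_id}: missing 'name'")
        if entry.get("type") not in ALLOWED_TYPES:
            problems.append(f"{model_id}: unknown type {entry.get('type')!r}")
    return problems
```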

⚠️ If you plan to use larger models such as `Qwen3-Embedding-8B`, upgrade your Space hardware first.

### Application Settings

Configure through the `src/config/settings.py` file:

```bash
# Application
APP_NAME="Unified Embedding API"
VERSION="3.0.0"

# Server
HOST=0.0.0.0
PORT=7860  # Default port for Hugging Face Docker Spaces; do not change
WORKERS=1

# Models
MODEL_CONFIG_PATH=src/config/models.yaml
PRELOAD_MODELS=true
DEVICE=cpu

# Logging
LOG_LEVEL=INFO
```
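
For readers who want to override these values through environment variables, the following standard-library sketch shows one possible pattern. It is an assumption for illustration only: the `Settings` dataclass and its `from_env` method are hypothetical and may differ from what `src/config/settings.py` actually does. The field names and defaults mirror the listing above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings such as 'true' or '1'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    host: str = "0.0.0.0"
    port: int = 7860
    workers: int = 1
    model_config_path: str = "src/config/models.yaml"
    preload_models: bool = True
    device: str = "cpu"
    log_level: str = "INFO"

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Settings":
        """Build settings from environment variables, falling back to defaults."""
        env = os.environ if env is None else env
        d = cls()
        return cls(
            host=env.get("HOST", d.host),
            port=int(env.get("PORT", d.port)),
            workers=int(env.get("WORKERS", d.workers)),
            model_config_path=env.get("MODEL_CONFIG_PATH", d.model_config_path),
            preload_models=_as_bool(env.get("PRELOAD_MODELS", str(d.preload_models))),
            device=env.get("DEVICE", d.device),
            log_level=env.get("LOG_LEVEL", d.log_level),
        )
```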

---

## Performance Optimization

### Recommended Practices

1. **Batch Processing**
   - Always send multiple texts in a single request when possible
   - A batch size of 16-32 is a reasonable starting point for balancing throughput and latency; tune it for your model and hardware

2. **Normalization**
   - Enable `normalize_embeddings` for cosine similarity operations
   - Unit-length vectors let cosine similarity be computed as a plain dot product, which reduces downstream computation in vector databases

3. **Model Selection**
   - Dense models: Best for semantic similarity
   - Sparse models: Better for keyword matching + semantics
   - Reranking: Use as second-stage after initial retrieval
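
The batching and normalization practices above can be illustrated on the client side with a few lines of plain Python. This is a sketch with hypothetical helper names (`batched`, `l2_normalize`, `dot`); the default batch size of 32 is simply the upper end of the range suggested above.

```python
import math
from typing import Iterator, List, Sequence


def batched(texts: Sequence[str], size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` texts, one chunk per request."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))
```

Each chunk from `batched` can be sent as the `input` list of one embeddings request, and vectors that are normalized once at indexing time can then be compared with `dot` alone.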

---

## Migration from OpenAI

Replace OpenAI embedding calls with minimal code changes:

**Before (OpenAI):**
```python
from openai import OpenAI
client = OpenAI(api_key="sk-...")

response = client.embeddings.create(
    input="Hello world",
    model="text-embedding-3-small"
)
```

**After (Self-hosted):**
```python
from openai import OpenAI
client = OpenAI(
    base_url="https://your-space.hf.space/api/v1",
    api_key="not-required"
)

response = client.embeddings.create(
    input="Hello world",
    model="qwen3-0.6b"  # Your configured model
)
```

**Compatibility Matrix:**

| Feature | Supported | Notes |
|---------|-----------|-------|
| `input` (string) | ✓ | Converted to list internally |
| `input` (list) | ✓ | Batch processing |
| `model` parameter | ✓ | Use configured model IDs |
| `encoding_format` | Partial | Always returns float |
| `dimensions` | ✗ | Returns model's native dimensions |
| `user` parameter | ✗ | Ignored |

---

## ⚠️ **Note:** This is a development API.

For production deployment, use a dedicated inference server such as **Hugging Face TEI**, or host the service on a cloud provider such as **AWS** or **GCP**.

---

## Contributing

Contributions are welcome. Please follow these guidelines:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

---

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.

---

## References

- **Sentence Transformers Documentation:** https://www.sbert.net/
- **FastAPI Documentation:** https://fastapi.tiangolo.com/
- **OpenAI API Specification:** https://platform.openai.com/docs/api-reference/embeddings
- **MTEB Benchmark:** https://huggingface.co/spaces/mteb/leaderboard
- **Hugging Face Spaces:** https://huggingface.co/docs/hub/spaces

---

## Support

- **Issues:** [GitHub Issues](https://github.com/fahmiaziz98/unified-embedding-api/issues)
- **Discussions:** [GitHub Discussions](https://github.com/fahmiaziz98/unified-embedding-api/discussions)
- **Live Demo:** [Hugging Face Space](https://huggingface.co/spaces/fahmiaziz/api-embedding)

---

**Maintained by:** [Fahmi Aziz](https://github.com/fahmiaziz98)  
**Project Status:** Active Development

> ✨ "Unify your embeddings. Simplify your AI stack."