metadata
tags:
  - sentence-transformers
  - embeddings
  - litert
  - tflite
  - edge
  - on-device
license: mit
base_model: intfloat/multilingual-e5-small
pipeline_tag: feature-extraction

multilingual-e5-small - LiteRT

This is a LiteRT (formerly TensorFlow Lite) conversion of intfloat/multilingual-e5-small for efficient on-device inference.

Model Details

| Property | Value |
|---|---|
| Original Model | intfloat/multilingual-e5-small |
| Format | LiteRT (.tflite) |
| File Size | 449.0 MB |
| Task | Multilingual Sentence Embeddings (100 languages) |
| Max Sequence Length | 512 |
| Output Dimension | 384 |
| Pooling Mode | Mean Pooling |
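
For readers unfamiliar with the pooling mode listed above, the following is a minimal sketch of what masked mean pooling computes. It is illustrative only: the function name `mean_pool` and the toy arrays are assumptions, not code from the conversion, and the Quick Start suggests the converted model already returns the pooled 384-dimensional vector.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over real (non-padding) positions.

    token_embeddings: (seq_len, hidden) float array
    attention_mask:   (seq_len,) array, 1 for real tokens and 0 for padding
    """
    weights = attention_mask.astype(token_embeddings.dtype)[:, None]  # (seq_len, 1)
    summed = (token_embeddings * weights).sum(axis=0)                 # (hidden,)
    count = max(float(weights.sum()), 1e-9)                           # guard against an all-padding input
    return summed / count

# Toy input: 4 positions, 3 hidden dimensions, last two positions are padding.
toy_tokens = np.array([[1.0, 2.0, 3.0],
                       [3.0, 4.0, 5.0],
                       [9.0, 9.0, 9.0],
                       [9.0, 9.0, 9.0]], dtype=np.float32)
toy_mask = np.array([1, 1, 0, 0])

pooled = mean_pool(toy_tokens, toy_mask)
print(pooled)  # [2. 3. 4.]
```

The padded rows do not affect the result, which is why padding every input to 512 tokens should not change the embedding.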

Performance

Benchmarked on AMD CPU (WSL2):

| Metric | Value |
|---|---|
| Inference Latency | 91.9 ms |
| Throughput | 10.9 inferences/sec |
| Cosine Similarity vs Original | 1.0000 ✅ |
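
The latency and throughput figures are consistent with one forward pass per measurement (1000 / 91.9 ≈ 10.9 per second). The benchmarking script is not part of this card, so the following is only a sketch of how such numbers might be measured on your own hardware. The `benchmark` helper and the `dummy_workload` stand-in are hypothetical names; swap in a call to the model to time real inference.

```python
import time
from statistics import mean

def benchmark(fn, warmup: int = 3, runs: int = 20):
    """Return (mean latency in ms, calls per second) for a zero-argument callable."""
    for _ in range(warmup):              # warm-up calls are not timed
        fn()
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    latency_ms = mean(timings_ms)
    return latency_ms, 1000.0 / latency_ms

# Stand-in workload (assumption); replace with e.g.
# lambda: get_embedding("query: hello") to time the LiteRT model.
def dummy_workload():
    sum(i * i for i in range(10_000))

latency_ms, per_second = benchmark(dummy_workload)
print(f"{latency_ms:.2f} ms per call, {per_second:.1f} calls/sec")

# Sanity check on the reported figures: 91.9 ms per call
print(round(1000.0 / 91.9, 1))  # 10.9
```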

Quick Start

import numpy as np
from ai_edge_litert.interpreter import Interpreter
from transformers import AutoTokenizer

# Load model and tokenizer
interpreter = Interpreter(model_path="intfloat_multilingual-e5-small.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")

def get_embedding(text: str) -> np.ndarray:
    """Get sentence embedding for input text."""
    encoded = tokenizer(
        text,
        padding="max_length",
        max_length=512,
        truncation=True,
        return_tensors="np"
    )

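    # Inputs are assumed to be ordered (input_ids, attention_mask); check
    # input_details[i]["name"] if your converted model orders them differently.
    # Cast to the dtype the interpreter reports rather than hard-coding it.
    interpreter.set_tensor(input_details[0]["index"], encoded["input_ids"].astype(input_details[0]["dtype"]))
    interpreter.set_tensor(input_details[1]["index"], encoded["attention_mask"].astype(input_details[1]["dtype"]))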
    interpreter.invoke()

    return interpreter.get_tensor(output_details[0]["index"])[0]

# Example (E5 models expect a "query: " or "passage: " prefix on every input)
embedding = get_embedding("query: Hello, world!")
print(f"Embedding shape: {embedding.shape}")  # (384,)

Files

  • intfloat_multilingual-e5-small.tflite - The LiteRT model file

Conversion Details

  • Conversion Tool: ai-edge-torch
  • Conversion Date: 2026-01-12
  • Source Framework: PyTorch → LiteRT
  • Validation: Cosine similarity 1.0000 vs original
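
The validation script itself is not included here. As a rough sketch, a cosine-similarity check between the original and converted outputs could look like the following. The `reference` and `converted` arrays are random placeholders standing in for the PyTorch and LiteRT embeddings of the same input, not real model outputs.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors (1.0 means identical direction)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
reference = rng.normal(size=384)                           # placeholder: original model output
converted = reference + rng.normal(scale=1e-6, size=384)   # placeholder: converted model output

score = cosine_similarity(reference, converted)
print(f"{score:.4f}")  # 1.0000
```

A score that rounds to 1.0000 indicates the two vectors point in essentially the same direction; it does not by itself guarantee identical magnitudes.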

Intended Use

  • Mobile Applications: On-device semantic search, RAG systems
  • Edge Devices: IoT, embedded systems, Raspberry Pi
  • Offline Processing: Privacy-preserving inference
  • Low-latency Applications: Real-time processing
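
To illustrate the semantic search use case, here is a minimal sketch of ranking passages against a query by cosine similarity. The `top_k` helper and the 3-dimensional toy vectors are assumptions for illustration; in practice the vectors would be 384-dimensional embeddings of "query: ..." and "passage: ..." texts produced by the model.

```python
import numpy as np

def top_k(query_vec: np.ndarray, passage_vecs: np.ndarray, k: int = 3):
    """Return [(passage index, cosine score)] for the k best-matching passages."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q                        # cosine similarity per passage
    order = np.argsort(-scores)[:k]       # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for real embeddings.
passages = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.7, 0.7, 0.0]])
query = np.array([1.0, 0.1, 0.0])

results = top_k(query, passages, k=2)
print([i for i, _ in results])  # [0, 2]
```

Normalizing once and using a single matrix product keeps the search cheap enough for small on-device corpora; larger collections would typically need an approximate nearest-neighbor index.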

Limitations

  • Fixed sequence length: inputs are padded or truncated to 512 tokens
  • CPU inference (GPU delegate requires setup)
  • Tokenizer is not bundled; load it separately from the original model
  • Float32 precision

License

This model inherits the MIT license from the original model, intfloat/multilingual-e5-small.

Citation

@article{wang2024multilingual,
    title={Multilingual E5 Text Embeddings: A Technical Report},
    author={Wang, Liang and Yang, Nan and Huang, Xiaolong and others},
    journal={arXiv preprint arXiv:2402.05672},
    year={2024}
}

Acknowledgments


Converted by Bombek1