PIXIE-Glyph-v1.0

PIXIE-Glyph-v1.0 is a jina-clip-based text–image embedding model trained for Korean and English multimodal retrieval, developed by TelePIX Co., Ltd. PIXIE stands for TelePIX Intelligent Embedding, representing TelePIX’s high-performance embedding technology. This model is optimized for text-to-image retrieval on visually rich content—especially figures and tables in academic papers. In addition to strong retrieval quality, PIXIE-Glyph-v1.0 is designed for practical deployment, offering fast retrieval latency and dimension-flexible embeddings via Matryoshka Representation Learning (MRL).

Model Description

  • Model Type: Sentence Transformer
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 (default), truncatable down to 64 with MRL
  • Similarity Function: Cosine Similarity
  • Language: Multilingual — optimized for high performance in Korean and English
  • Domain Specialization: Academic figure/table retrieval
  • Model Size: 0.9B parameters (Safetensors, BF16)
  • License: cc-by-nc-4.0

PIXIE-Glyph-v1.0 supports Matryoshka Representation Learning (MRL), enabling users to truncate the embedding dimensionality (e.g., 1024 → 512 → 256 → 128 → 64) to reduce storage, bandwidth, and similarity-compute cost, while preserving strong retrieval performance.
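When loading the model with Sentence Transformers, passing truncate_dim handles this automatically. As a purely numerical illustration of what MRL truncation does (random unit vectors stand in for real model output here), a minimal sketch:

```python
import numpy as np

def truncate_and_renormalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` MRL dimensions and re-normalize to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Stand-in for model output: 4 unit-normalized 1024-d embeddings.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_renormalize(full, 256)  # 4x storage/compute savings
scores = small @ small.T                     # cosine similarity on truncated vectors
```

Because the truncated vectors are re-normalized, downstream cosine-similarity code works unchanged at any chosen dimension.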

Quality Benchmarks

PIXIE-Glyph-v1.0 delivers strong text-to-image retrieval performance and is particularly effective for academic figure/table search and visual document retrieval. The table below reports:

  • nDCG@5 (higher is better), measuring ranking quality against ground-truth relevance.
  • Latency (ms/query) measured on SciCap with H100 (single GPU) and batch=128. Latency here denotes the average per-query retrieval time over a 20,000-item corpus under the evaluation setup described.
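For reference, nDCG@5 with binary relevance (one correct image per query) can be computed as follows; this is a minimal sketch of the standard metric, not the exact evaluation script used for the table:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """nDCG@k: DCG of the system ranking divided by the ideal DCG."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Binary relevance: the single correct image was retrieved at rank 2.
print(ndcg_at_k([0, 1, 0, 0, 0], k=5))  # -> ~0.631
```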

Benchmark Overview and Dataset Descriptions

Model Name                            | # Params | Latency (ms/query) | SciCap (ko,en) | KoViDoRe (v1) | EnViDoRe (v1,v2)
telepix/PIXIE-Glyph-v1.0              | 0.9B     |   1.60             | 0.3957         | 0.1962        | 0.4688
nomic-ai/colnomic-embed-multimodal-3b | 3B       | 503.00             | 0.4758         | 0.7755        | 0.8097
vidore/colSmol-500M                   | 0.5B     |  59.45             | 0.2045         | 0.1549        | 0.7042
vidore/colSmol-256M                   | 0.3B     |  40.25             | 0.1701         | 0.1244        | 0.6491
jinaai/jina-clip-v2                   | 0.9B     |   1.60             | 0.2491         | 0.1311        | 0.4544

To help interpret the evaluation results above, we summarize the intent and characteristics of each benchmark.

SciCap (ko, en)

SciCap is constructed from the SCICAP figure–caption dataset introduced in “SciCap: Generating Captions for Scientific Figures”. For bilingual evaluation, the English captions were additionally translated into Korean, yielding (text caption, figure image) pairs for 10k Korean and 10k English queries.

KoViDoRe (v1)

KoViDoRe is a Korean Visual Document Retrieval benchmark designed to evaluate retrieval systems on real-world Korean document images across multiple settings and corpora.

EnViDoRe (v1, v2)

EnViDoRe refers to the English subsets of ViDoRe (v1 and v2), the Visual Document Retrieval benchmark suite for evaluating document retrieval systems on visually rich documents across tasks, domains, and languages.

Direct Use (Text to Image Retrieval)

import os
from PIL import Image
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer, util

model_name = 'telepix/PIXIE-Glyph-v1.0'
image_dir = 'images'
os.makedirs(image_dir, exist_ok=True)

truncate_dim = 1024
model = SentenceTransformer(
    model_name,
    trust_remote_code=True,
    truncate_dim=truncate_dim,
)

target_filenames = [
    'attention_visualizations.png',
    'scaled_dot_product_attention.png',
    'the_transformer_model_architecture.png',
    'variations_on_the_transformer.png',
    'beautiful_sunset.png',
]

loaded_images = []
valid_image_paths = []

for filename in target_filenames:
    repo_path = f"{image_dir}/{filename}"
    local_path = os.path.join(image_dir, filename)

    try:
        hf_hub_download(
            repo_id=model_name,
            filename=repo_path,
            local_dir=".",
        )
        img = Image.open(local_path).convert("RGB")
        loaded_images.append(img)
        valid_image_paths.append(local_path)

    except Exception as e:
        print(f"Failed to process ({filename}): {e}")

queries = [
    '트랜스포머 전체 아키텍처 구조 찾아줘',           # "Find the overall Transformer architecture diagram"
    'attention이 어떻게 이뤄지는지 예시 자료 있나?',  # "Any example material showing how attention works?"
    'how performance changes depending on the model parameters.',
]

image_embeddings = model.encode(loaded_images, normalize_embeddings=True)
query_embeddings = model.encode(queries, prompt_name='retrieval.query', normalize_embeddings=True)

results = util.cos_sim(query_embeddings, image_embeddings)

print("\n" + "="*50)
for i, query in enumerate(queries):
    print(f"Query {i+1}: '{query}'")

    k = min(5, len(valid_image_paths))

    if k == 0:
        print("  - No images available to search.")
        continue

    scores, indices = results[i].topk(k=k)

    for score, idx in zip(scores, indices):
        print(f"  - [Score: {score:.4f}] {valid_image_paths[idx]}")
    print("-" * 50)
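For larger corpora, such as the 20,000-image setup behind the latency numbers above, the embeddings are typically pre-computed once, and retrieval reduces to a matrix product because all vectors are unit-normalized. A minimal sketch, with random stand-ins for the real embeddings:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for pre-computed, unit-normalized corpus/query embeddings.
corpus = rng.normal(size=(20_000, 1024)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
queries = rng.normal(size=(3, 1024)).astype(np.float32)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

# Cosine similarity equals the dot product for normalized vectors.
scores = queries @ corpus.T                # shape (3, 20000)
top5 = np.argsort(-scores, axis=1)[:, :5]  # indices of the 5 best images per query
```

For production workloads the same normalized embeddings can be dropped into an approximate-nearest-neighbor index (e.g. FAISS inner-product search) without any change to the similarity definition.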

License

The PIXIE-Glyph-v1.0 model is licensed under CC BY-NC 4.0.

Citation

@misc{TelePIX-PIXIE-Glyph-v1.0,
  title={PIXIE-Glyph-v1.0},
  author={TelePIX AI Research Team and Bongmin Kim},
  year={2026},
  url={https://huggingface.co/telepix/PIXIE-Glyph-v1.0}
}

Contact

If you have any suggestions or questions about the PIXIE, please reach out to the authors at bmkim@telepix.net.
