---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- lifecycle-assessment
- climate-change
- carbon-emission
- sustainability
pipeline_tag: sentence-similarity
library_name: sentence-transformers
base_model: Qwen/Qwen3-Embedding-0.6B
license: mit
---

# lca-qwen3-embedding

Domain embedding model for lifecycle assessment (LCA) retrieval. It encodes sentences and short passages into 1024-d L2-normalized embeddings for semantic search, similarity scoring, and clustering.

## Background

Generic embedding models work well in open domains, but professional LCA retrieval often involves long, structured records (e.g., geography/technology/time fields) and domain-specific terminology. This model is trained to better align embeddings with LCA retrieval queries and documents.

## Results (our evaluation setup)

On an internal evaluation derived from TianGong LCA records (converted from the Tidas structured format into retrieval-friendly text), this model improved over the base `Qwen3-Embedding-0.6B` on both ranking quality and tail coverage:

- vs. base `Qwen3-Embedding-0.6B`: **NDCG@10 +31.2%**, **Recall@10 +25.7%**, **MRR@10 +33.5%**, **Recall@100 +11.5%**

Evaluation scale (this experiment):

- Train: 17,037 query-doc pairs
- Eval: 1,893 queries / 3,786 corpus docs / 1,893 qrels

### Model comparisons

Key metrics (@10):

| Model | NDCG@10 | Recall@10 | MRR@10 | MAP@10 |
| --- | ---: | ---: | ---: | ---: |
| `Qwen3-Embedding-0.6B` (base) | 0.5808 | 0.7200 | 0.5367 | 0.5367 |
| `lca-qwen3-embedding` (this model) | **0.7623** | **0.9049** | **0.7163** | **0.7163** |
| `codestral-embed-2505` | 0.6628 | 0.8045 | 0.6180 | 0.6180 |
| `qwen3-embedding-8b` | 0.5905 | 0.7369 | 0.5442 | 0.5442 |
| `qwen3-embedding-4b` | 0.5836 | 0.7290 | 0.5377 | 0.5377 |
| `bge-m3` | 0.5839 | 0.7264 | 0.5388 | 0.5388 |

Tail coverage (@100):

| Model | NDCG@100 | Recall@100 |
| --- | ---: | ---: |
| `Qwen3-Embedding-0.6B` (base) | 0.6171 | 0.8922 |
| `lca-qwen3-embedding` (this model) | **0.7826** | **0.9947** |
| `codestral-embed-2505` | 0.6872 | 0.9171 |
| `qwen3-embedding-8b` | 0.6258 | 0.9033 |
| `qwen3-embedding-4b` | 0.6164 | 0.8822 |
| `bge-m3` | 0.6156 | 0.8743 |

Protocol note: embeddings are L2-normalized; retrieval uses inner product (equivalent to cosine similarity under L2 normalization) with top-100 candidates.
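For concreteness, a minimal sketch of this protocol; the `queries`, `corpus`, and `qrels` below are hypothetical toy data standing in for the internal eval set, not part of it:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BIaoo/lca-qwen3-embedding")

# Hypothetical toy data standing in for the internal TianGong eval set.
queries = {"q1": "hard coal electricity production, at power plant"}
corpus = {
    "d1": "Electricity, hard coal, at power plant; geography: DE; time: 2020.",
    "d2": "Municipal solid waste incineration with energy recovery.",
}
qrels = {"q1": {"d1"}}  # relevant corpus ids per query

q_ids, q_texts = zip(*queries.items())
d_ids, d_texts = zip(*corpus.items())

# Embeddings are L2-normalized, so inner product equals cosine similarity.
Q = model.encode(list(q_texts), prompt_name="query", normalize_embeddings=True)
D = model.encode(list(d_texts), normalize_embeddings=True)

scores = Q @ D.T                                # shape: (n_queries, n_docs)
ranked = np.argsort(-scores, axis=1)[:, :100]   # top-100 candidates per query

# Recall@10 over the toy data.
k = 10
recall_at_k = np.mean([
    len({d_ids[j] for j in ranked[i, :k]} & qrels[qid]) / len(qrels[qid])
    for i, qid in enumerate(q_ids)
])
print(f"Recall@{k}: {recall_at_k:.4f}")
```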
## Model details (from the exported config)

- **Backbone**: Qwen3 (`model_type=qwen3`; config architecture `Qwen3ForCausalLM`), `hidden_size=1024`, `num_hidden_layers=28`
- **Max sequence length**: `1024`
- **Embedding dimension**: `1024`
- **Pooling**: last-token pooling (`pooling_mode_lasttoken=true`, `include_prompt=true`)
- **Normalization**: L2 normalize
- **Similarity**: cosine
- **Prompts**: a `query` prompt is defined; the `document` prompt is empty

Module stack:

```
Transformer -> Pooling(last_token, include_prompt=true) -> Normalize
```

## Usage (SentenceTransformers)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BIaoo/lca-qwen3-embedding")  # replace with your HF repo id if forked/renamed
```

Retrieval example (encode queries and documents separately; apply the built-in query prompt):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BIaoo/lca-qwen3-embedding")  # replace with your HF repo id if forked/renamed

queries = ["wood residue gasification heat recovery"]
docs = ["Report describing small-scale biomass CHP units used for district heating."]

# Apply the built-in query instruction prefix; documents use no prompt.
q = model.encode(queries, prompt_name="query", normalize_embeddings=True)
d = model.encode(docs, normalize_embeddings=True)

scores = q @ d.T  # inner product == cosine similarity (embeddings are normalized)
print(scores)
```

Notes:

- Use `prompt_name="query"` to apply the query instruction prefix from `config_sentence_transformers.json`.
- The document-side prompt is empty; encoding documents with `encode(docs, ...)` is typically sufficient.

## Intended use

- Semantic search and reranking for LCA process/flow descriptions and metadata-rich technical text
- Similarity scoring for deduplication / clustering of LCA-related passages (see the sketch at the end of this card)

## Limitations

- Trained and evaluated primarily on English technical/LCA text; performance may degrade in other languages or domains.
- Evaluation numbers are from a specific internal setup; validate on your own data before production use.

## Files

- `config.json`: Qwen3 model config
- `config_sentence_transformers.json`, `modules.json`, `sentence_bert_config.json`: SentenceTransformers configs (prompts, modules, max length)
- `model.safetensors`: weights
- `tokenizer.*`, `vocab.json`, `merges.txt`: tokenizer assets
- `1_Pooling/`, `2_Normalize/`: pooling / normalization modules
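As a sketch of the deduplication / clustering use case listed under Intended use, using `sentence_transformers.util.community_detection`; the passages and the `0.9` threshold are illustrative assumptions, not tuned values:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import community_detection

model = SentenceTransformer("BIaoo/lca-qwen3-embedding")

# Illustrative passages; a high threshold groups near-duplicates together.
passages = [
    "Electricity, hard coal, at power plant; geography: DE.",
    "Hard coal electricity production at power plant, Germany.",
    "Transport, freight, lorry 16-32 metric ton, EURO6.",
]

emb = model.encode(passages, normalize_embeddings=True, convert_to_tensor=True)

# Returns groups of indices whose pairwise cosine similarity exceeds the threshold.
clusters = community_detection(emb, threshold=0.9, min_community_size=2)
for cluster in clusters:
    print([passages[i] for i in cluster])
```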