| --- |
| language: |
| - zh |
| - en |
| - de |
| - fr |
| license: mit |
| pipeline_tag: feature-extraction |
| library_name: transformers |
| tags: |
| - embeddings |
| - lora |
| - sociology |
| - retrieval |
| - feature-extraction |
| - sentence-transformers |
| --- |
| |
| # THETA: Textual Hybrid Embedding-based Topic Analysis |
|
|
| ## Model Description |
|
|
| THETA is a domain-specific embedding model fine-tuned using LoRA on top of Qwen3-Embedding models (0.6B and 4B). |
| It is designed to generate dense vector representations for texts in the sociology and social science domain. |
|
|
| The model is suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG). |
|
|
| **Base Models:** |
| - Qwen3-Embedding-0.6B |
| - Qwen3-Embedding-4B |
|
|
| **Fine-tuning Methods:** |
| - Unsupervised: SimCSE (contrastive learning) |
| - Supervised: Label-guided contrastive learning with LoRA |
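The two objectives listed above can be illustrated with a small NumPy sketch. It is only a sketch under stated assumptions, not the training code used for THETA: the function names, the temperature of 0.05, and the SupCon-style formulation of the label-guided loss are all hypothetical choices made for illustration. In SimCSE, the two views of a sentence come from encoding the same text twice with different dropout masks; here they are simulated with random noise.

```python
import numpy as np


def _normalize(x):
    """Scale each row to unit L2 norm so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def _log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def simcse_loss(view_a, view_b, temperature=0.05):
    """Unsupervised SimCSE (InfoNCE): row i of view_a should match row i of view_b."""
    sims = _normalize(view_a) @ _normalize(view_b).T / temperature
    log_probs = _log_softmax(sims)
    return float(-np.mean(np.diag(log_probs)))


def label_guided_loss(embeddings, labels, temperature=0.05):
    """SupCon-style loss (assumed formulation): samples sharing a label are positives."""
    labels = np.asarray(labels)
    z = _normalize(embeddings)
    self_mask = np.eye(len(labels), dtype=bool)
    # Exclude each sample's similarity with itself from the softmax.
    sims = np.where(self_mask, -np.inf, z @ z.T / temperature)
    log_probs = _log_softmax(sims)
    positives = (labels[:, None] == labels[None, :]) & ~self_mask
    has_pos = positives.any(axis=1)  # skip anchors with no same-label partner
    pos_log_prob = np.where(positives, log_probs, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / positives.sum(axis=1)[has_pos]
    return float(per_anchor.mean())


# Toy usage with synthetic vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 16))
noisy_view = batch + 0.01 * rng.normal(size=batch.shape)  # simulated dropout noise
print(simcse_loss(batch, noisy_view))
print(label_guided_loss(batch, [0, 0, 1, 1, 2, 2, 3, 3]))
```

Both losses fall as matching pairs move closer together than non-matching pairs, which is the behaviour a contrastive objective is meant to reward.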
|
|
| ## Intended Use |
|
|
| This model is intended for: |
| - Text embedding generation |
| - Semantic similarity computation |
| - Document retrieval |
| - Downstream NLP tasks requiring dense representations |
|
|
| It is **not** designed for text generation or decision-making in high-risk scenarios. |
|
|
| ## Model Architecture |
|
|
| - Base model: Qwen3-Embedding (0.6B / 4B) |
| - Fine-tuning method: LoRA (Low-Rank Adaptation) |
| - Output: Fixed-length dense embeddings (1024-dim for 0.6B, 2560-dim for 4B) |
| - Framework: Transformers (PyTorch) |
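To make the LoRA entry above concrete, the following NumPy sketch shows the low-rank update on a single linear layer: the pretrained weight stays frozen and only two small matrices are trained. The class name, rank, and scaling factor are hypothetical; the card does not state the hyperparameters used for THETA.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                # frozen base weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))  # trainable, small random init
        self.B = np.zeros((out_dim, rank))                  # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)

    def merged_weight(self):
        """Fold the adapter into a single matrix for inference."""
        return self.weight + self.scale * (self.B @ self.A)

    def trainable_parameters(self):
        return self.A.size + self.B.size


# Toy usage: a 64 -> 64 layer with a rank-4 adapter (sizes chosen arbitrarily).
layer = LoRALinear(np.random.default_rng(1).normal(size=(64, 64)), rank=4, alpha=8)
print(layer.trainable_parameters(), "trainable vs", layer.weight.size, "frozen")
```

Because `B` starts at zero, the adapted layer initially reproduces the base model exactly, and training only moves it away from that starting point as far as the data requires.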
|
|
| ## Repository Structure |
|
|
| ``` |
| CodeSoulco/THETA/ |
| ├── embeddings/ |
| │   ├── 0.6B/ |
| │   │   ├── zero_shot/ |
| │   │   ├── supervised/ |
| │   │   └── unsupervised/ |
| │   └── 4B/ |
| │       ├── zero_shot/ |
| │       ├── supervised/ |
| │       └── unsupervised/ |
| └── lora/ |
|     ├── 0.6B/ |
|     │   ├── supervised/ |
|     │   └── unsupervised/ |
|     ├── 4B/ |
|     │   ├── supervised/ |
|     │   └── unsupervised/ |
|     └── logs/ |
| ``` |
|
|
| ## Training Details |
|
|
| - Fine-tuning method: LoRA |
| - Training domain: Sociology and social science texts |
| - Datasets: germanCoal, FCPB, socialTwitter, hatespeech, mental_health |
| - Objective: Improve domain-specific semantic representation |
| - Hardware: Two NVIDIA GPUs |
| |
| ## How to Use |
| |
| ### Load LoRA Adapter |
| |
| ```python |
| from transformers import AutoTokenizer, AutoModel |
| from peft import PeftModel |
| import torch |
| |
| # Load base model |
| base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True) |
| tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True) |
|
|
| # Load LoRA adapter from this repo |
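| model = PeftModel.from_pretrained( |
|     base_model, |
|     "CodeSoulco/THETA", |
|     # Adapter path: <folder>/<model size>/<training mode>/<dataset>. The repository |
|     # tree lists the adapter folder as "lora/", so adjust the prefix if loading fails. |
|     subfolder="lora_weights/0.6B/unsupervised/germanCoal", |
| ) |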
| |
| # Generate embeddings |
| # Sample input: "The relationship between social structure and individual behavior" |
| text = "社会结构与个体行为之间的关系" |
| inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512) |
|
|
| with torch.no_grad(): |
|     outputs = model(**inputs) |
| |
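| # Qwen3-Embedding is a decoder-only model without a CLS token, so pool the last token |
| # (assuming the adapter keeps the base model's last-token pooling). Index -1 is the |
| # final real token here because a single, unpadded input is used. |
| embeddings = outputs.last_hidden_state[:, -1, :] |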
| ``` |
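When several texts are embedded in one batch, padding shifts the position of each sequence's final token, so a fixed index no longer works. The sketch below follows the last-token pooling strategy documented for the Qwen3-Embedding base models and adds L2 normalization so that dot products equal cosine similarities. It assumes the LoRA adapters keep the base model's pooling; the `embed` helper and its parameters are hypothetical names introduced for illustration.

```python
import torch
import torch.nn.functional as F


def last_token_pool(last_hidden_state, attention_mask):
    """Return the hidden state of the final non-padding token in each row."""
    # If every row ends in a real token, the batch is left-padded (or unpadded).
    left_padded = bool(attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padded:
        return last_hidden_state[:, -1]
    # Right-padded: locate the last real token of each row individually.
    last_index = attention_mask.sum(dim=1) - 1
    rows = torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device)
    return last_hidden_state[rows, last_index]


def embed(texts, model, tokenizer, max_length=512):
    """Embed a list of strings; `model` and `tokenizer` are loaded as shown above."""
    inputs = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True, max_length=max_length
    )
    with torch.no_grad():
        outputs = model(**inputs)
    pooled = last_token_pool(outputs.last_hidden_state, inputs["attention_mask"])
    return F.normalize(pooled, p=2, dim=1)
```

With normalized outputs, `embed(a, model, tokenizer) @ embed(b, model, tokenizer).T` gives a cosine-similarity matrix between two lists of texts.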
| |
| ### Load Pre-computed Embeddings |
| |
| ```python |
| import numpy as np |
| |
| embeddings = np.load("embeddings/0.6B/zero_shot/germanCoal_zero_shot_embeddings.npy") |
| ``` |
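Once an embedding matrix is loaded, semantic search reduces to a cosine-similarity ranking. The sketch below assumes the `.npy` file holds a 2-D array of shape `(n_documents, dim)`, which this card does not state explicitly; `top_k_similar` is a hypothetical helper, and a random matrix stands in for the loaded file so the example runs without downloading anything.

```python
import numpy as np


def top_k_similar(query, corpus, k=5):
    """Return indices and cosine scores of the k corpus rows closest to `query`."""
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = corpus_norm @ query_norm
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]


# Stand-in for: corpus = np.load("embeddings/0.6B/zero_shot/germanCoal_zero_shot_embeddings.npy")
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 64))  # 100 documents, arbitrary dimension for the demo

# Use one document as the query; it should rank itself first.
indices, scores = top_k_similar(corpus[3], corpus, k=3)
print(indices, scores)
```

For larger collections, the same ranking can be delegated to an approximate nearest-neighbour index; the brute-force version is usually sufficient for a few hundred thousand vectors.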
| |
| ## Limitations |
| |
| - The model is fine-tuned for a specific domain and may not generalize well to unrelated topics. |
| - Performance depends on input text length and quality; the usage example truncates inputs to 512 tokens, so content beyond that limit is not embedded. |
| - The model does not generate text and should not be used for generative tasks. |
| |
| ## License |
| |
| This model is released under the MIT License. |
| |
| ## Citation |
| |
| If you use this model in your research, please cite: |
| |
| ```bibtex |
| @misc{theta2026, |
| title={THETA: Textual Hybrid Embedding-based Topic Analysis}, |
| author={CodeSoul}, |
| year={2026}, |
| publisher={Hugging Face}, |
| url={https://huggingface.co/CodeSoulco/THETA} |
| } |
| ``` |
| |