lxxx0304 committed on
Commit 304730b · verified · 1 parent: c5ae63d

Upload README.md with huggingface_hub

Files changed (1): README.md (+140 / -3)
README.md CHANGED
@@ -1,3 +1,140 @@
- ---
- license: mit
- ---

---
language:
- zh
- en
- de
- fr
license: mit
pipeline_tag: feature-extraction
library_name: transformers
tags:
- embeddings
- lora
- sociology
- retrieval
- feature-extraction
- sentence-transformers
---

# THETA: Domain-Specific Embedding Model for Sociology

## Model Description

THETA is a domain-specific embedding model fine-tuned with LoRA on top of the Qwen3-Embedding models (0.6B and 4B). It produces dense vector representations for texts in sociology and the broader social sciences.

The model is suited to tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG); a short ranking sketch follows.
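
As a quick illustration of the semantic-search use case, the sketch below ranks document vectors against a query vector by cosine similarity. The random vectors are stand-ins; in practice they would come from the model as shown in the How to Use section.

```python
# Illustrative only: random vectors stand in for THETA embeddings
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=1024)          # query embedding (0.6B model: 1024-dim)
docs = rng.normal(size=(5, 1024))      # five document embeddings

# Cosine similarity = dot product of L2-normalized vectors
query = query / np.linalg.norm(query)
docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
scores = docs @ query

print(np.argsort(-scores))             # document indices, best match first
```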

**Base Models:**
- Qwen3-Embedding-0.6B
- Qwen3-Embedding-4B

**Fine-tuning Methods:**
- Unsupervised: SimCSE (contrastive learning; sketched below)
- Supervised: Label-guided contrastive learning with LoRA

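For readers unfamiliar with SimCSE, the following is a minimal sketch of its unsupervised objective: the same batch is encoded twice, different dropout masks make the two views differ, and other in-batch examples serve as negatives. The temperature and the `encode` callable are illustrative assumptions, not THETA's actual training configuration.

```python
import torch
import torch.nn.functional as F

def simcse_loss(batch_inputs, encode, temperature=0.05):
    """Unsupervised SimCSE (InfoNCE). `encode` maps tokenized inputs to
    [batch, dim] embeddings; dropout must be active (model.train()) so
    the two forward passes produce different views of each text."""
    z1 = F.normalize(encode(batch_inputs), dim=-1)  # view 1
    z2 = F.normalize(encode(batch_inputs), dim=-1)  # view 2 (new dropout mask)
    sim = z1 @ z2.T / temperature                   # pairwise cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)             # diagonal pairs are positives
```
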
## Intended Use

This model is intended for:
- Text embedding generation
- Semantic similarity computation
- Document retrieval
- Downstream NLP tasks requiring dense representations

It is **not** designed for text generation or decision-making in high-risk scenarios.

## Model Architecture

- Base model: Qwen3-Embedding (0.6B / 4B)
- Fine-tuning method: LoRA (Low-Rank Adaptation)
- Output: fixed-length dense embeddings (1024-dim for 0.6B, 2560-dim for 4B)
- Framework: Transformers (PyTorch)

## Repository Structure

```
CodeSoulco/THETA/
β”œβ”€β”€ embeddings/
β”‚   β”œβ”€β”€ 0.6B/
β”‚   β”‚   β”œβ”€β”€ supervised/
β”‚   β”‚   β”œβ”€β”€ unsupervised/
β”‚   β”‚   └── zero_shot/
β”‚   └── 4B/
β”‚       └── supervised/
└── lora_weights/
    β”œβ”€β”€ 0.6B/
    β”‚   β”œβ”€β”€ supervised/ (socialTwitter, hatespeech, mental_health)
    β”‚   └── unsupervised/ (germanCoal, FCPB)
    └── 4B/
        └── supervised/ (socialTwitter, hatespeech)
```
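
To fetch only one adapter or embedding folder instead of the whole repository, `huggingface_hub`'s `snapshot_download` accepts an `allow_patterns` filter; the pattern below targets one of the folders listed above.

```python
from huggingface_hub import snapshot_download

# Download just the 0.6B socialTwitter adapter folder
local_dir = snapshot_download(
    repo_id="CodeSoulco/THETA",
    allow_patterns=["lora_weights/0.6B/supervised/socialTwitter/*"],
)
print(local_dir)  # local cache path containing the requested files
```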

## Training Details

- Fine-tuning method: LoRA (see the configuration sketch below)
- Training domain: sociology and social science texts
- Datasets: germanCoal, FCPB, socialTwitter, hatespeech, mental_health
- Objective: improve domain-specific semantic representation
- Hardware: dual NVIDIA GPUs
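
The exact LoRA hyperparameters are not documented here, so the following is only a hypothetical configuration sketch with `peft`; the rank, alpha, dropout, and target modules are placeholder assumptions rather than THETA's training values.

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

# Placeholder hyperparameters -- NOT the values used to train THETA
lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=16,                                 # low-rank update dimension
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the LoRA matrices train
```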

## How to Use

### Load LoRA Adapter

```python
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch
import torch.nn.functional as F

# Load base model
base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

# Load LoRA adapter from this repo
model = PeftModel.from_pretrained(
    base_model,
    "CodeSoulco/THETA",
    subfolder="lora_weights/0.6B/unsupervised/germanCoal",
)

# Generate embeddings
text = "η€ΎδΌšη»“ζž„δΈŽδΈͺδ½“θ‘ŒδΈΊδΉ‹ι—΄ηš„ε…³η³»"  # "the relationship between social structure and individual behavior"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Qwen3-Embedding is a causal model with no [CLS] token: pool the last
# non-padding token, then L2-normalize
last = inputs["attention_mask"].sum(dim=1) - 1
embeddings = outputs.last_hidden_state[torch.arange(last.size(0)), last]
embeddings = F.normalize(embeddings, p=2, dim=-1)
```
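
For repeated inference, the adapter can optionally be folded into the base weights; `merge_and_unload` is standard `peft` and removes the LoRA indirection at run time.

```python
# Optional: merge the LoRA weights into the base model for faster inference
merged_model = model.merge_and_unload()
```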

### Load Pre-computed Embeddings

```python
import numpy as np

# One embedding vector per input text; assumes a local checkout of the repo files
embeddings = np.load("embeddings/0.6B/zero_shot/germanCoal_zero_shot_embeddings.npy")
```
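
Without a local checkout, the same file can be pulled directly from the Hub with `hf_hub_download` (the path comes from the repository structure above):

```python
from huggingface_hub import hf_hub_download
import numpy as np

# Fetch a single pre-computed embedding file into the local Hub cache
path = hf_hub_download(
    repo_id="CodeSoulco/THETA",
    filename="embeddings/0.6B/zero_shot/germanCoal_zero_shot_embeddings.npy",
)
embeddings = np.load(path)
```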

## Limitations

- The model is fine-tuned for a specific domain and may not generalize well to unrelated topics.
- Performance depends on input text length and quality.
- The model does not generate text and should not be used for generative tasks.

## License

This model is released under the MIT License.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{theta2026,
  title={THETA: Domain-Specific Embedding Model for Sociology},
  author={CodeSoul},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/CodeSoulco/THETA}
}
```