---
language:
- zh
- en
- de
- fr
license: mit
pipeline_tag: feature-extraction
library_name: transformers
tags:
- embeddings
- lora
- sociology
- retrieval
- feature-extraction
- sentence-transformers
---

# THETA: Domain-Specific Embedding Model for Sociology

## Model Description

THETA is a domain-specific embedding model fine-tuned using LoRA on top of the Qwen3-Embedding models (0.6B and 4B).
It is designed to generate dense vector representations for texts in the sociology and social science domain.

The model is suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG).

**Base Models:**
- Qwen3-Embedding-0.6B
- Qwen3-Embedding-4B

**Fine-tuning Methods:**
- Unsupervised: SimCSE (contrastive learning)
- Supervised: Label-guided contrastive learning with LoRA
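
The unsupervised objective can be illustrated with a minimal, framework-free sketch of the in-batch contrastive (InfoNCE) loss that SimCSE uses. This is not THETA's training code: the function names, the toy 2-D vectors, and the temperature of 0.05 (the default reported for SimCSE) are assumptions for illustration, since this card does not state the actual hyperparameters.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(view_a, view_b, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss in the style of SimCSE.

    view_a[i] and view_b[i] are two embeddings of the same sentence
    (for example two forward passes with different dropout masks);
    every view_b[j] with j != i serves as a negative for view_a[i].
    The temperature value is an assumption, not a THETA setting.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        # Negative log-probability of the positive pair (index i).
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings (hypothetical values, not model outputs).
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # positives match their anchors
shuffled = [[0.1, 0.9], [0.9, 0.1]]  # positives are swapped

loss_aligned = info_nce_loss(view_a, aligned)
loss_shuffled = info_nce_loss(view_a, shuffled)
```

In this toy setup the aligned pairs give a much lower loss than the shuffled ones, which is the signal that pulls two views of the same sentence together and pushes other sentences in the batch apart.
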

## Intended Use

This model is intended for:
- Text embedding generation
- Semantic similarity computation
- Document retrieval
- Downstream NLP tasks requiring dense representations

It is **not** designed for text generation or decision-making in high-risk scenarios.
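
As a rough illustration of the similarity and retrieval use cases listed above, the sketch below ranks documents by cosine similarity to a query. The helper names (`l2_normalize`, `top_k`) and the 3-D vectors are hypothetical stand-ins; real embeddings from the model would have far more dimensions.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def top_k(query, documents, k=2):
    """Return (index, score) pairs for the k documents most similar to query."""
    q = l2_normalize(query)
    scores = []
    for index, doc in enumerate(documents):
        d = l2_normalize(doc)
        scores.append((index, sum(a * b for a, b in zip(q, d))))
    # Highest cosine similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-D embeddings standing in for real model outputs.
documents = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.1],
]
query = [1.0, 0.2, 0.0]

ranking = top_k(query, documents, k=2)
ranked_indices = [index for index, _ in ranking]
```

Normalizing once and storing unit-length vectors is a common design choice for larger collections, because each comparison then reduces to a single dot product.
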

## Model Architecture

- Base model: Qwen3-Embedding (0.6B / 4B)
- Fine-tuning method: LoRA (Low-Rank Adaptation)
- Output: Fixed-length dense embeddings (1024-dim for 0.6B, 2560-dim for 4B)
- Framework: Transformers (PyTorch)
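
To make the LoRA entry above concrete, here is a small sketch of the low-rank update `y = W x + (alpha / r) * B (A x)`, where `W` stays frozen and only `A` and `B` are trained. The matrix sizes, the rank of 8, and the 1024 x 1024 projection are hypothetical; this card does not state THETA's actual rank, scaling factor, or target modules.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x) with W frozen."""
    r = len(A)  # rank = number of rows in A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def lora_trainable_params(d_in, d_out, r):
    """A is r x d_in and B is d_out x r, giving r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

# Tiny example: 2 x 2 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]    # r x d_in
B = [[0.5], [0.0]]  # d_out x r
y = lora_forward(W, A, B, x=[2.0, 3.0], alpha=2.0)

# Hypothetical 1024 x 1024 projection adapted with rank 8.
full_params = 1024 * 1024
adapter_params = lora_trainable_params(1024, 1024, r=8)
fraction = adapter_params / full_params
```

For this hypothetical layer the adapter trains 16,384 weights instead of 1,048,576, roughly 1.6% of the full matrix, which is why the adapters can be distributed separately from the base model.
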

## Repository Structure

```
CodeSoulco/THETA/
├── embeddings/
│   ├── 0.6B/
│   │   ├── supervised/
│   │   ├── unsupervised/
│   │   └── zero_shot/
│   └── 4B/
│       └── supervised/
└── lora_weights/
    ├── 0.6B/
    │   ├── supervised/    (socialTwitter, hatespeech, mental_health)
    │   └── unsupervised/  (germanCoal, FCPB)
    └── 4B/
        └── supervised/    (socialTwitter, hatespeech)
```
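
The layout above can be turned into a small lookup that builds the `subfolder` argument for an adapter and rejects combinations that are not listed. This is a sketch under one assumption: each dataset name in parentheses is its own subfolder, as the path `lora_weights/0.6B/unsupervised/germanCoal` in the usage example suggests. The names `AVAILABLE_ADAPTERS` and `adapter_subfolder` are hypothetical.

```python
# Adapter layout as listed in the repository tree.
AVAILABLE_ADAPTERS = {
    ("0.6B", "supervised"): {"socialTwitter", "hatespeech", "mental_health"},
    ("0.6B", "unsupervised"): {"germanCoal", "FCPB"},
    ("4B", "supervised"): {"socialTwitter", "hatespeech"},
}

def adapter_subfolder(size, method, dataset):
    """Return the repo subfolder for one LoRA adapter, or raise if not listed."""
    datasets = AVAILABLE_ADAPTERS.get((size, method))
    if datasets is None or dataset not in datasets:
        raise ValueError(f"No adapter listed for {size}/{method}/{dataset}")
    # Assumed layout: lora_weights/<size>/<method>/<dataset>
    return f"lora_weights/{size}/{method}/{dataset}"

subfolder = adapter_subfolder("0.6B", "unsupervised", "germanCoal")
```

Failing early on an unlisted combination, such as an unsupervised 4B adapter, gives a clearer error than a failed download.
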

## Training Details

- Fine-tuning method: LoRA
- Training domain: Sociology and social science texts
- Datasets: germanCoal, FCPB, socialTwitter, hatespeech, mental_health
- Objective: Improve domain-specific semantic representation
- Hardware: Dual NVIDIA GPUs

## How to Use

### Load LoRA Adapter

```python
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch

# Load base model
base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

# Load LoRA adapter from this repo
model = PeftModel.from_pretrained(
    base_model,
    "CodeSoulco/THETA",
    subfolder="lora_weights/0.6B/unsupervised/germanCoal"
)

# Generate embeddings
# Example input: "The relationship between social structure and individual behavior"
text = "社会结构与个体行为之间的关系"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Qwen3-Embedding is a decoder-only model without a CLS token, so pool the
# last token (as the base model does) rather than position 0. For a single
# unpadded input the last position is index -1.
embeddings = outputs.last_hidden_state[:, -1, :]
```

### Load Pre-computed Embeddings

```python
import numpy as np

# The path is relative to a local copy of this repository
embeddings = np.load("embeddings/0.6B/zero_shot/germanCoal_zero_shot_embeddings.npy")
```

## Limitations

- The model is fine-tuned for a specific domain and may not generalize well to unrelated topics.
- Performance depends on input text length and quality.
- The model does not generate text and should not be used for generative tasks.

## License

This model is released under the MIT License.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{theta2026,
  title={THETA: Domain-Specific Embedding Model for Sociology},
  author={CodeSoul},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/CodeSoulco/THETA}
}
```