---
language: 
  - zh
  - en
  - de
  - fr
license: mit
pipeline_tag: feature-extraction
library_name: transformers
tags:
  - embeddings
  - lora
  - sociology
  - retrieval
  - feature-extraction
  - sentence-transformers
---

# THETA: Domain-Specific Embedding Model for Sociology

## Model Description

THETA is a domain-specific embedding model fine-tuned using LoRA on top of Qwen3-Embedding models (0.6B and 4B).  
It is designed to generate dense vector representations for texts in the sociology and social science domain.

The model is suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG).

**Base Models:**
- Qwen3-Embedding-0.6B
- Qwen3-Embedding-4B

**Fine-tuning Methods:**
- Unsupervised: SimCSE (contrastive learning)
- Supervised: Label-guided contrastive learning with LoRA
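The unsupervised SimCSE objective treats two embeddings of the same (or a related) sentence as a positive pair and the rest of the batch as negatives. A minimal sketch of the in-batch InfoNCE loss behind this setup (the temperature value and function name are illustrative, not the repository's actual training configuration):

```python
import torch
import torch.nn.functional as F

def simcse_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: emb_a[i] and emb_b[i] are two views of sentence i."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    sim = emb_a @ emb_b.T / temperature   # (B, B) cosine-similarity logits
    labels = torch.arange(sim.size(0))    # positives sit on the diagonal
    return F.cross_entropy(sim, labels)
```

In the supervised variant, the same loss shape applies, but positives are chosen by shared labels rather than by augmented views.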

## Intended Use

This model is intended for:
- Text embedding generation
- Semantic similarity computation
- Document retrieval
- Downstream NLP tasks requiring dense representations

It is **not** designed for text generation or decision-making in high-risk scenarios.

## Model Architecture

- Base model: Qwen3-Embedding (0.6B / 4B)
- Fine-tuning method: LoRA (Low-Rank Adaptation)
- Output: Fixed-length dense embeddings (896-dim for 0.6B, 2560-dim for 4B)
- Framework: Transformers (PyTorch)

## Repository Structure

```
CodeSoulco/THETA/
├── embeddings/
│   ├── 0.6B/
│   │   ├── supervised/
│   │   ├── unsupervised/
│   │   └── zero_shot/
│   └── 4B/
│       └── supervised/
└── lora_weights/
    ├── 0.6B/
    │   ├── supervised/ (socialTwitter, hatespeech, mental_health)
    │   └── unsupervised/ (germanCoal, FCPB)
    └── 4B/
        └── supervised/ (socialTwitter, hatespeech)
```

## Training Details

- Fine-tuning method: LoRA
- Training domain: Sociology and social science texts
- Datasets: germanCoal, FCPB, socialTwitter, hatespeech, mental_health
- Objective: Improve domain-specific semantic representation
- Hardware: Dual NVIDIA GPUs

## How to Use

### Load LoRA Adapter

```python
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch

# Load base model
base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

# Load LoRA adapter from this repo
model = PeftModel.from_pretrained(
    base_model, 
    "CodeSoulco/THETA", 
    subfolder="lora_weights/0.6B/unsupervised/germanCoal"
)

# Generate embeddings
text = "社会结构与个体行为之间的关系"  # "The relationship between social structure and individual behavior"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Qwen3-Embedding uses last-token pooling (it has no CLS token)
embeddings = outputs.last_hidden_state[:, -1, :]
```
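The snippet above encodes a single text, so the last hidden state is simply the final position. When encoding a padded batch, the last *non-padding* token must be selected per sequence, following the last-token pooling convention documented for Qwen3-Embedding. A minimal sketch assuming right padding (the helper name is illustrative):

```python
import torch

def last_token_pool(last_hidden_states: torch.Tensor,
                    attention_mask: torch.Tensor) -> torch.Tensor:
    # Index of the last non-padding token per sequence (assumes right padding)
    seq_lens = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(last_hidden_states.size(0))
    return last_hidden_states[batch_idx, seq_lens]
```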

### Load Pre-computed Embeddings

```python
import numpy as np

embeddings = np.load("embeddings/0.6B/zero_shot/germanCoal_zero_shot_embeddings.npy")
```
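The pre-computed matrices can be used directly for retrieval. A minimal cosine-similarity search sketch over such an array (the `top_k` helper is illustrative, not part of this repository):

```python
import numpy as np

def top_k(query_vec: np.ndarray, corpus: np.ndarray, k: int = 3):
    # Cosine similarity = dot product of L2-normalized vectors
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    idx = np.argsort(-scores)[:k]  # indices of the k best matches
    return idx, scores[idx]
```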

## Limitations

- The model is fine-tuned for a specific domain and may not generalize well to unrelated topics.
- Performance depends on input text length and quality.
- The model does not generate text and should not be used for generative tasks.

## License

This model is released under the MIT License.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{theta2026,
  title={THETA: Domain-Specific Embedding Model for Sociology},
  author={CodeSoul},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/CodeSoulco/THETA}
}
```