lxxx0304 committed on
Commit 304730b · verified · 1 parent: c5ae63d

Upload README.md with huggingface_hub

Files changed (1): README.md (+140 / -3)
README.md CHANGED
@@ -1,3 +1,140 @@
- ---
- license: mit
- ---

---
language:
- zh
- en
- de
- fr
license: mit
pipeline_tag: feature-extraction
library_name: transformers
tags:
- embeddings
- lora
- sociology
- retrieval
- feature-extraction
- sentence-transformers
---

# THETA: Domain-Specific Embedding Model for Sociology

## Model Description

THETA is a domain-specific embedding model fine-tuned with LoRA on top of the Qwen3-Embedding models (0.6B and 4B). It produces dense vector representations for texts in sociology and the broader social sciences.

The model is suited to tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG); a short ranking sketch follows.
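
As a quick illustration of the semantic-search use case, the sketch below ranks document vectors against a query vector by cosine similarity. The random vectors are stand-ins; in practice they would come from the model as shown in the How to Use section.

```python
# Illustrative only: random vectors stand in for THETA embeddings
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=1024)          # query embedding (0.6B model: 1024-dim)
docs = rng.normal(size=(5, 1024))      # five document embeddings

# Cosine similarity = dot product of L2-normalized vectors
query = query / np.linalg.norm(query)
docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
scores = docs @ query

print(np.argsort(-scores))             # document indices, best match first
```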

**Base Models:**
- Qwen3-Embedding-0.6B
- Qwen3-Embedding-4B

**Fine-tuning Methods:**
- Unsupervised: SimCSE (contrastive learning; sketched below)
- Supervised: Label-guided contrastive learning with LoRA

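For readers unfamiliar with SimCSE, the following is a minimal sketch of its unsupervised objective: the same batch is encoded twice, different dropout masks make the two views differ, and other in-batch examples serve as negatives. The temperature and the `encode` callable are illustrative assumptions, not THETA's actual training configuration.

```python
import torch
import torch.nn.functional as F

def simcse_loss(batch_inputs, encode, temperature=0.05):
    """Unsupervised SimCSE (InfoNCE). `encode` maps tokenized inputs to
    [batch, dim] embeddings; dropout must be active (model.train()) so
    the two forward passes produce different views of each text."""
    z1 = F.normalize(encode(batch_inputs), dim=-1)  # view 1
    z2 = F.normalize(encode(batch_inputs), dim=-1)  # view 2 (new dropout mask)
    sim = z1 @ z2.T / temperature                   # pairwise cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)             # diagonal pairs are positives
```
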
## Intended Use

This model is intended for:
- Text embedding generation
- Semantic similarity computation
- Document retrieval
- Downstream NLP tasks requiring dense representations

It is **not** designed for text generation or decision-making in high-risk scenarios.

## Model Architecture

- Base model: Qwen3-Embedding (0.6B / 4B)
- Fine-tuning method: LoRA (Low-Rank Adaptation)
- Output: fixed-length dense embeddings (1024-dim for 0.6B, 2560-dim for 4B)
- Framework: Transformers (PyTorch)

## Repository Structure

```
CodeSoulco/THETA/
β”œβ”€β”€ embeddings/
β”‚   β”œβ”€β”€ 0.6B/
β”‚   β”‚   β”œβ”€β”€ supervised/
β”‚   β”‚   β”œβ”€β”€ unsupervised/
β”‚   β”‚   └── zero_shot/
β”‚   └── 4B/
β”‚       └── supervised/
└── lora_weights/
    β”œβ”€β”€ 0.6B/
    β”‚   β”œβ”€β”€ supervised/ (socialTwitter, hatespeech, mental_health)
    β”‚   └── unsupervised/ (germanCoal, FCPB)
    └── 4B/
        └── supervised/ (socialTwitter, hatespeech)
```
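
To fetch only one adapter or embedding folder instead of the whole repository, `huggingface_hub`'s `snapshot_download` accepts an `allow_patterns` filter; the pattern below targets one of the folders listed above.

```python
from huggingface_hub import snapshot_download

# Download just the 0.6B socialTwitter adapter folder
local_dir = snapshot_download(
    repo_id="CodeSoulco/THETA",
    allow_patterns=["lora_weights/0.6B/supervised/socialTwitter/*"],
)
print(local_dir)  # local cache path containing the requested files
```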

## Training Details

- Fine-tuning method: LoRA (see the configuration sketch below)
- Training domain: sociology and social science texts
- Datasets: germanCoal, FCPB, socialTwitter, hatespeech, mental_health
- Objective: improve domain-specific semantic representation
- Hardware: dual NVIDIA GPUs
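
The exact LoRA hyperparameters are not documented here, so the following is only a hypothetical configuration sketch with `peft`; the rank, alpha, dropout, and target modules are placeholder assumptions rather than THETA's training values.

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

# Placeholder hyperparameters -- NOT the values used to train THETA
lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=16,                                 # low-rank update dimension
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the LoRA matrices train
```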

## How to Use

### Load LoRA Adapter

```python
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch
import torch.nn.functional as F

# Load base model
base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

# Load LoRA adapter from this repo
model = PeftModel.from_pretrained(
    base_model,
    "CodeSoulco/THETA",
    subfolder="lora_weights/0.6B/unsupervised/germanCoal",
)

# Generate embeddings
text = "η€ΎδΌšη»“ζž„δΈŽδΈͺδ½“θ‘ŒδΈΊδΉ‹ι—΄ηš„ε…³η³»"  # "the relationship between social structure and individual behavior"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Qwen3-Embedding is a causal model with no [CLS] token: pool the last
# non-padding token, then L2-normalize
last = inputs["attention_mask"].sum(dim=1) - 1
embeddings = outputs.last_hidden_state[torch.arange(last.size(0)), last]
embeddings = F.normalize(embeddings, p=2, dim=-1)
```
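
For repeated inference, the adapter can optionally be folded into the base weights; `merge_and_unload` is standard `peft` and removes the LoRA indirection at run time.

```python
# Optional: merge the LoRA weights into the base model for faster inference
merged_model = model.merge_and_unload()
```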

### Load Pre-computed Embeddings

```python
import numpy as np

# One embedding vector per input text; assumes a local checkout of the repo files
embeddings = np.load("embeddings/0.6B/zero_shot/germanCoal_zero_shot_embeddings.npy")
```
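
Without a local checkout, the same file can be pulled directly from the Hub with `hf_hub_download` (the path comes from the repository structure above):

```python
from huggingface_hub import hf_hub_download
import numpy as np

# Fetch a single pre-computed embedding file into the local Hub cache
path = hf_hub_download(
    repo_id="CodeSoulco/THETA",
    filename="embeddings/0.6B/zero_shot/germanCoal_zero_shot_embeddings.npy",
)
embeddings = np.load(path)
```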

## Limitations

- The model is fine-tuned for a specific domain and may not generalize well to unrelated topics.
- Performance depends on input text length and quality.
- The model does not generate text and should not be used for generative tasks.

## License

This model is released under the MIT License.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{theta2026,
  title={THETA: Domain-Specific Embedding Model for Sociology},
  author={CodeSoul},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/CodeSoulco/THETA}
}
```