lxxx0304 committed on
Commit 027ab17 Β· verified Β· 1 Parent(s): 9203c2c

Update README.md

Files changed (1)
  1. README.md +35 -58
README.md CHANGED
@@ -20,71 +20,58 @@ tags:
 
 ## Model Description
 
-THETA is a domain-specific embedding model fine-tuned using LoRA on top of Qwen3-Embedding models (0.6B and 4B).
-It is designed to generate dense vector representations for texts in the sociology and social science domain.
 
 The model is suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG).
 
 **Base Models:**
-- Qwen3-Embedding-0.6B
-- Qwen3-Embedding-4B
 
 **Fine-tuning Methods:**
-- Unsupervised: SimCSE (contrastive learning)
-- Supervised: Label-guided contrastive learning with LoRA
 
 ## Intended Use
 
-This model is intended for:
-- Text embedding generation
-- Semantic similarity computation
-- Document retrieval
-- Downstream NLP tasks requiring dense representations
 
 It is **not** designed for text generation or decision-making in high-risk scenarios.
 
 ## Model Architecture
 
-- Base model: Qwen3-Embedding (0.6B / 4B)
-- Fine-tuning method: LoRA (Low-Rank Adaptation)
-- Output: Fixed-length dense embeddings (896-dim for 0.6B, 2560-dim for 4B)
-- Framework: Transformers (PyTorch)
 
 ## Repository Structure
 
 ```
 CodeSoulco/THETA/
-β”œβ”€β”€ embeddings/
-β”‚   β”œβ”€β”€ 0.6B/
-β”‚   β”‚   β”œβ”€β”€ zero_shot/
-β”‚   β”‚   β”œβ”€β”€ supervised/
-β”‚   β”‚   └── unsupervised/
-β”‚   └── 4B/
-β”‚       β”œβ”€β”€ zero_shot/
-β”‚       β”œβ”€β”€ supervised/
-β”‚       └── unsupervised/
-└── lora/
-    β”œβ”€β”€ 0.6B/
-    β”‚   β”œβ”€β”€ supervised/
-    β”‚   └── unsupervised/
-    β”œβ”€β”€ 4B/
-    β”‚   β”œβ”€β”€ supervised/
-    β”‚   └── unsupervised/
-    └── logs/
 ```
 
 ## Training Details
 
-- Fine-tuning method: LoRA
-- Training domain: Sociology and social science texts
-- Datasets: germanCoal, FCPB, socialTwitter, hatespeech, mental_health
-- Objective: Improve domain-specific semantic representation
-- Hardware: Dual NVIDIA GPU
 
 ## How to Use
 
-### Load LoRA Adapter
-
 ```python
 from transformers import AutoTokenizer, AutoModel
 from peft import PeftModel
@@ -94,15 +81,15 @@ import torch
 
 base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
 tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
 
-# Load LoRA adapter from this repo
 model = PeftModel.from_pretrained(
-    base_model,
-    "CodeSoulco/THETA",
-    subfolder="lora_weights/0.6B/unsupervised/germanCoal"
 )
 
 # Generate embeddings
-text = "The relationship between social structure and individual behavior"
 inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
 
 with torch.no_grad():
@@ -111,31 +98,21 @@ with torch.no_grad():
 
 embeddings = outputs.last_hidden_state[:, 0, :]  # CLS token
 ```
 
-### Load Pre-computed Embeddings
-
-```python
-import numpy as np
-
-embeddings = np.load("embeddings/0.6B/zero_shot/germanCoal_zero_shot_embeddings.npy")
-```
-
 ## Limitations
 
-- The model is fine-tuned for a specific domain and may not generalize well to unrelated topics.
 - Performance depends on input text length and quality.
-- The model does not generate text and should not be used for generative tasks.
 
 ## License
 
-This model is released under the MIT License.
 
 ## Citation
 
-If you use this model in your research, please cite:
-
 ```bibtex
 @misc{theta2026,
-  title={THETA: Textual Hybrid Embedding–based Topic Analysis},
   author={CodeSoul},
   year={2026},
   publisher={Hugging Face},
 
 
 ## Model Description
 
+THETA is a domain-specific embedding model fine-tuned using LoRA on top of Qwen3-Embedding models (0.6B and 4B). It is designed to generate dense vector representations for texts in the sociology and social science domain.
 
 The model is suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG).
 
 **Base Models:**
+- [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
+- [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)
 
 **Fine-tuning Methods:**
+- **Unsupervised:** SimCSE (contrastive learning)
+- **Supervised:** Label-guided contrastive learning with LoRA
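
As a concrete illustration of the unsupervised objective: SimCSE encodes each sentence twice (dropout provides the augmentation) and trains with an InfoNCE loss over in-batch negatives. A minimal NumPy sketch of that loss follows; the temperature value is an assumption, and the actual THETA training hyperparameters are not stated here:

```python
import numpy as np

def simcse_infonce_loss(z1: np.ndarray, z2: np.ndarray, tau: float = 0.05) -> float:
    """InfoNCE loss where z1[i] and z2[i] are two dropout-augmented
    encodings of the same sentence; other rows act as in-batch negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = (z1 @ z2.T) / tau  # (n, n) scaled cosine similarities
    # Row-wise log-softmax; the positive pairs sit on the diagonal.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

When the two encodings of each sentence align perfectly the loss approaches zero; for unrelated encodings it grows toward log(batch size).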
 
 ## Intended Use
 
+This model is intended for text embedding generation, semantic similarity computation, document retrieval, and downstream NLP tasks requiring dense representations.
 
 It is **not** designed for text generation or decision-making in high-risk scenarios.
 
 ## Model Architecture
 
+| Component | Detail |
+|---|---|
+| Base model | Qwen3-Embedding (0.6B / 4B) |
+| Fine-tuning | LoRA (Low-Rank Adaptation) |
+| Output dimension | 896 (0.6B) / 2560 (4B) |
+| Framework | Transformers (PyTorch) |
 
 ## Repository Structure
 
 ```
 CodeSoulco/THETA/
+β”œβ”€β”€ 0.6B/
+β”‚   β”œβ”€β”€ supervised/
+β”‚   └── unsupervised/
+β”œβ”€β”€ 4B/
+β”‚   β”œβ”€β”€ supervised/
+β”‚   └── unsupervised/
+└── logs/
 ```
 
+Pre-computed embeddings are available in a separate dataset repository: [CodeSoulco/THETA-embeddings](https://huggingface.co/datasets/CodeSoulco/THETA-embeddings)
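
Those pre-computed embeddings are standard NumPy `.npy` arrays. A self-contained sketch of the save/load round-trip such files go through; the 3Γ—896 shape and the file name here are illustrative placeholders, not actual contents of that dataset repo:

```python
import os
import tempfile
import numpy as np

# Illustrative embedding matrix: 3 texts, 896-dim vectors (the 0.6B output size).
emb = np.random.rand(3, 896).astype(np.float32)

# Save and reload as .npy, the format the dataset repo uses.
path = os.path.join(tempfile.mkdtemp(), "demo_embeddings.npy")
np.save(path, emb)
loaded = np.load(path)
print(loaded.shape)  # (3, 896)
```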
+
 ## Training Details
 
+- **Fine-tuning method:** LoRA
+- **Training domain:** Sociology and social science texts
+- **Datasets:** germanCoal, FCPB, socialTwitter, hatespeech, mental_health
+- **Objective:** Improve domain-specific semantic representation
+- **Hardware:** Dual NVIDIA GPUs
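
For orientation, a LoRA setup with `peft` generally looks like the following config fragment. The rank, alpha, dropout, and target modules below are placeholder values, not the hyperparameters actually used for THETA (which are not published here):

```python
from peft import LoraConfig

# All values are illustrative placeholders, not THETA's actual settings.
lora_config = LoraConfig(
    r=8,                      # adapter rank (assumed)
    lora_alpha=16,            # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    task_type="FEATURE_EXTRACTION",
)
```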
 
 ## How to Use
 
 ```python
 from transformers import AutoTokenizer, AutoModel
 from peft import PeftModel
 import torch
 
 base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
 tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
 
+# Load LoRA adapter
 model = PeftModel.from_pretrained(
+    base_model,
+    "CodeSoulco/THETA",
+    subfolder="0.6B/unsupervised/germanCoal"
 )
 
 # Generate embeddings
+text = "Social structure and individual behavior"
 inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
 
 with torch.no_grad():
     outputs = model(**inputs)
 
 embeddings = outputs.last_hidden_state[:, 0, :]  # CLS token
 ```
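
Embeddings produced this way can then be compared with cosine similarity. A minimal sketch; the 4-dimensional vectors are illustrative stand-ins for real 896- or 2560-dimensional THETA embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for two text embeddings.
a = np.array([1.0, 0.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0, 0.0])
print(round(cosine_similarity(a, b), 4))  # 0.7071
```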
 
 ## Limitations
 
+- Fine-tuned for the sociology/social science domain; may not generalize well to unrelated topics.
 - Performance depends on input text length and quality.
+- Does not generate text and should not be used for generative tasks.
 
 ## License
 
+This model is released under the **MIT License**.
 
 ## Citation
 
 ```bibtex
 @misc{theta2026,
+  title={THETA: Textual Hybrid Embedding--based Topic Analysis},
   author={CodeSoul},
   year={2026},
   publisher={Hugging Face},