## Model Description

THETA is a domain-specific embedding model fine-tuned using LoRA on top of Qwen3-Embedding models (0.6B and 4B). It is designed to generate dense vector representations for texts in the sociology and social science domain.

The model is suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG).

**Base Models:**
- [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
- [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)

**Fine-tuning Methods:**
- **Unsupervised:** SimCSE (contrastive learning; see the sketch below)
- **Supervised:** Label-guided contrastive learning with LoRA
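
As a rough illustration of the unsupervised objective, the sketch below implements a SimCSE-style InfoNCE loss: each text is encoded twice, the two dropout-perturbed views of the same text are pulled together, and the other texts in the batch serve as negatives. This is a minimal sketch of the general technique, not THETA's training code; the temperature and pooling choices are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def simcse_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                temperature: float = 0.05) -> torch.Tensor:
    """SimCSE-style InfoNCE loss over a batch of paired embeddings.

    emb_a, emb_b: (batch, dim) embeddings of the same texts under two
    dropout masks (illustrative; THETA's exact setup is not documented here).
    """
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    # Cosine-similarity matrix between the two views: (batch, batch)
    sim = emb_a @ emb_b.T / temperature
    # The positive for row i is column i; all other columns are negatives.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```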

## Intended Use

This model is intended for text embedding generation, semantic similarity computation, document retrieval, and downstream NLP tasks requiring dense representations.

It is **not** designed for text generation or decision-making in high-risk scenarios.

## Model Architecture

| Component | Detail |
|---|---|
| Base model | Qwen3-Embedding (0.6B / 4B) |
| Fine-tuning | LoRA (Low-Rank Adaptation) |
| Output dimension | 1024 (0.6B) / 2560 (4B) |
| Framework | Transformers (PyTorch) |
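
The embedding width equals the base encoder's hidden size, so the dimensions can be sanity-checked directly from the base model configs; a quick sketch:

```python
from transformers import AutoConfig

# Hidden size of the base encoder = dimension of the output embeddings.
for name in ("Qwen/Qwen3-Embedding-0.6B", "Qwen/Qwen3-Embedding-4B"):
    cfg = AutoConfig.from_pretrained(name, trust_remote_code=True)
    print(name, cfg.hidden_size)
```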

## Repository Structure

```
CodeSoulco/THETA/
├── 0.6B/
│   ├── supervised/
│   └── unsupervised/
├── 4B/
│   ├── supervised/
│   └── unsupervised/
└── logs/
```

Pre-computed embeddings are available in a separate dataset repo: [CodeSoulco/THETA-embeddings](https://huggingface.co/datasets/CodeSoulco/THETA-embeddings).
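
To use those files directly, something like the following should work. The file layout inside the dataset repo is not documented here, so the `filename` path (taken from an earlier revision of this card) is an assumption to adjust:

```python
import numpy as np
from huggingface_hub import hf_hub_download

# Path is assumed from an earlier revision of this card; adjust it to the
# actual layout of the THETA-embeddings dataset repo.
path = hf_hub_download(
    repo_id="CodeSoulco/THETA-embeddings",
    filename="0.6B/zero_shot/germanCoal_zero_shot_embeddings.npy",
    repo_type="dataset",
)
embeddings = np.load(path)  # shape: (num_texts, embedding_dim)
print(embeddings.shape)
```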

## Training Details

- **Fine-tuning method:** LoRA (a representative configuration is sketched below)
- **Training domain:** Sociology and social science texts
- **Datasets:** germanCoal, FCPB, socialTwitter, hatespeech, mental_health
- **Objective:** Improve domain-specific semantic representation
- **Hardware:** Dual NVIDIA GPUs
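
The card does not list the LoRA hyperparameters, so the following PEFT setup is only a hypothetical illustration of how such an adapter is typically configured; rank, alpha, dropout, and target modules are assumptions, not THETA's actual values:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

# Hypothetical hyperparameters; THETA's actual settings are not documented here.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```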

## How to Use

```python
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch

# Load the base embedding model and tokenizer
base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "CodeSoulco/THETA",
    subfolder="0.6B/unsupervised/germanCoal"
)

# Generate embeddings
text = "Social structure and individual behavior"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state[:, 0, :]  # CLS token
```
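
For the semantic-similarity use case, embeddings produced as above can be compared directly. A short continuation of the snippet (the second sentence is just an illustrative input):

```python
# Embed a second text the same way and compare.
text2 = "Collective norms shape individual decision-making"
inputs2 = tokenizer(text2, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs2 = model(**inputs2)

embeddings2 = outputs2.last_hidden_state[:, 0, :]

# Cosine similarity between the two sentence embeddings.
similarity = torch.nn.functional.cosine_similarity(embeddings, embeddings2)
print(similarity.item())
```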

## Limitations

- Fine-tuned for the sociology/social science domain; may not generalize well to unrelated topics.
- Performance depends on input text length and quality.
- Does not generate text and should not be used for generative tasks.

## License

This model is released under the **MIT License**.

## Citation

```bibtex
@misc{theta2026,
  title={THETA: Textual Hybrid Embedding-based Topic Analysis},
  author={CodeSoul},
  year={2026},
  publisher={Hugging Face},