metadata

language:
  - zh
  - en
  - de
  - fr
license: mit
pipeline_tag: feature-extraction
library_name: transformers
tags:
  - embeddings
  - lora
  - sociology
  - retrieval
  - feature-extraction
  - sentence-transformers

THETA: Domain-Specific Embedding Model for Sociology

Model Description

THETA is a domain-specific embedding model fine-tuned using LoRA on top of Qwen3-Embedding models (0.6B and 4B).
It is designed to generate dense vector representations for texts in the sociology and social science domain.

The model is suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG).
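As a rough illustration of the similarity and search use cases, the sketch below ranks documents against a query by cosine similarity. The function name `cosine_similarity_matrix` and the toy 3-dimensional vectors are assumptions for illustration only; real THETA embeddings are much higher-dimensional.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return pairwise cosine similarities between rows of a and rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Toy vectors standing in for real embeddings (illustrative only).
queries = np.array([[1.0, 0.0, 0.0]])
documents = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])

scores = cosine_similarity_matrix(queries, documents)
ranking = np.argsort(-scores[0])  # document indices, most similar first
```

Because the vectors are normalized before the dot product, the score depends only on direction, not on vector length.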

Base Models:

Qwen3-Embedding-0.6B
Qwen3-Embedding-4B

Fine-tuning Methods:

Unsupervised: SimCSE (contrastive learning)
Supervised: Label-guided contrastive learning with LoRA
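For readers unfamiliar with contrastive objectives, here is a minimal NumPy sketch of the in-batch loss commonly used with SimCSE. It is not THETA's training code: the function name `info_nce_loss`, the temperature of 0.05, and the random inputs are assumptions. In unsupervised SimCSE the two views are the same sentence encoded twice with different dropout masks; here small noise plays that role.

```python
import numpy as np

def info_nce_loss(view_a: np.ndarray, view_b: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch contrastive loss: row i of view_a should match row i of view_b."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature                      # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # positives sit on the diagonal

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))

# Two slightly different views of the same items: low loss expected.
aligned_loss = info_nce_loss(anchors, anchors + 0.01 * rng.normal(size=(4, 8)))
# Mismatched pairs: the true partner is off the diagonal, so the loss is higher.
shuffled_loss = info_nce_loss(anchors, anchors[::-1])
```

A label-guided variant would presumably treat texts sharing a label as positives instead of relying only on the diagonal, but the README does not specify the exact formulation.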

Intended Use

This model is intended for:

Text embedding generation
Semantic similarity computation
Document retrieval
Downstream NLP tasks requiring dense representations

It is not designed for text generation or decision-making in high-risk scenarios.

Model Architecture

Base model: Qwen3-Embedding (0.6B / 4B)
Fine-tuning method: LoRA (Low-Rank Adaptation)
Output: Fixed-length dense embeddings (1024-dim for 0.6B, 2560-dim for 4B)
Framework: Transformers (PyTorch)
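To make the LoRA idea concrete, the following sketch shows the low-rank update on a single linear layer in plain NumPy. The sizes, rank, and scaling value are illustrative assumptions, not THETA's actual hyperparameters, and `lora_forward` is a hypothetical helper rather than a PEFT function.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 64, 64, 4, 8    # illustrative sizes, not THETA's settings

W = rng.normal(size=(d_out, d_in))         # frozen base weight
A = rng.normal(size=(rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                # trainable up-projection, zero at initialization

def lora_forward(x, W, A, B, alpha, rank):
    """Base projection plus the scaled low-rank update (alpha / rank) * B @ A."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
y_init = lora_forward(x, W, A, B, alpha, rank)   # equals x @ W.T because B is zero

# Random stand-in for a trained B; the update can then be merged into W for inference.
B_trained = rng.normal(size=(d_out, rank)) * 0.01
W_merged = W + (alpha / rank) * B_trained @ A
y_adapter = lora_forward(x, W, A, B_trained, alpha, rank)
y_merged = x @ W_merged.T

trainable_params = A.size + B.size   # 512
full_params = W.size                 # 4096
```

Only `A` and `B` are trained, which is why the adapter weights in this repository can be distributed separately from the much larger base model.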

Repository Structure

CodeSoulco/THETA/
├── embeddings/
│   ├── 0.6B/
│   │   ├── supervised/
│   │   ├── unsupervised/
│   │   └── zero_shot/
│   └── 4B/
│       └── supervised/
└── lora_weights/
    ├── 0.6B/
    │   ├── supervised/ (socialTwitter, hatespeech, mental_health)
    │   └── unsupervised/ (germanCoal, FCPB)
    └── 4B/
        └── supervised/ (socialTwitter, hatespeech)
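The tree above can be turned into a small lookup that builds the `subfolder` argument for an adapter. This is a convenience sketch: `ADAPTERS` and `adapter_subfolder` are hypothetical names, and it assumes each dataset listed in parentheses is a subdirectory of its folder, as the `germanCoal` path in the usage example suggests.

```python
# Adapter datasets per (model size, training mode), copied from the tree above.
ADAPTERS = {
    ("0.6B", "supervised"): {"socialTwitter", "hatespeech", "mental_health"},
    ("0.6B", "unsupervised"): {"germanCoal", "FCPB"},
    ("4B", "supervised"): {"socialTwitter", "hatespeech"},
}

def adapter_subfolder(size: str, mode: str, dataset: str) -> str:
    """Return the repo-relative adapter path, or raise if the combination is not listed."""
    available = ADAPTERS.get((size, mode), set())
    if dataset not in available:
        raise ValueError(f"no adapter listed for {size}/{mode}/{dataset}")
    return f"lora_weights/{size}/{mode}/{dataset}"

subfolder = adapter_subfolder("0.6B", "unsupervised", "germanCoal")
```

Failing early on an unlisted combination avoids a less obvious download error later.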

Training Details

Fine-tuning method: LoRA
Training domain: Sociology and social science texts
Datasets: germanCoal, FCPB, socialTwitter, hatespeech, mental_health
Objective: Improve domain-specific semantic representation
Hardware: Two NVIDIA GPUs

How to Use

Load LoRA Adapter

from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch

# Load base model
base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

# Load LoRA adapter from this repo
model = PeftModel.from_pretrained(
    base_model, 
    "CodeSoulco/THETA", 
    subfolder="lora_weights/0.6B/unsupervised/germanCoal"
)

# Generate embeddings
# Example input: "The relationship between social structure and individual behavior"
text = "社会结构与个体行为之间的关系"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

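# Qwen3-Embedding is a decoder-only model with no CLS token, and position 0 of a
# causal model sees only the first token. Its reference usage pools the last token;
# with a single unpadded input that is position -1.
embeddings = outputs.last_hidden_state[:, -1, :]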
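When several texts are embedded in one batch, padding changes which position holds each sequence's final token. The base Qwen3-Embedding model card pools the last non-padding token and L2-normalizes the result. The sketch below shows that pooling in NumPy on toy arrays; `last_token_pool` here is an illustrative re-implementation under those assumptions, not code from this repository. In practice the inputs would be `outputs.last_hidden_state` and `inputs["attention_mask"]` converted to arrays.

```python
import numpy as np

def last_token_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of each sequence's last non-padding token."""
    seq_len = attention_mask.shape[1]
    # Scan the reversed mask for the first 1; works for left or right padding.
    last_idx = seq_len - 1 - np.argmax(attention_mask[:, ::-1], axis=1)
    return hidden[np.arange(hidden.shape[0]), last_idx]

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy batch: 2 sequences, 3 positions, hidden size 2 (illustrative values).
hidden = np.arange(12, dtype=float).reshape(2, 3, 2)
mask = np.array([[1, 1, 0],    # right-padded: last real token at position 1
                 [1, 1, 1]])   # no padding: last real token at position 2

pooled = last_token_pool(hidden, mask)
normalized = l2_normalize(pooled)
```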

Load Pre-computed Embeddings

import numpy as np

embeddings = np.load("embeddings/0.6B/zero_shot/germanCoal_zero_shot_embeddings.npy")
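For large files it can help to memory-map the array and check its shape before use. The sketch below assumes each `.npy` file stores a 2-D array of shape `(num_texts, dim)`, which the README does not state explicitly; `load_embeddings` is a hypothetical helper, and the demo uses a synthetic file rather than one from the repository.

```python
import os
import tempfile
from typing import Optional

import numpy as np

def load_embeddings(path: str, expected_dim: Optional[int] = None) -> np.ndarray:
    """Memory-map a .npy file and check that it looks like an embedding matrix."""
    emb = np.load(path, mmap_mode="r")   # reads rows from disk on demand
    if emb.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {emb.shape}")
    if expected_dim is not None and emb.shape[1] != expected_dim:
        raise ValueError(f"expected dimension {expected_dim}, got {emb.shape[1]}")
    return emb

# Round trip with a synthetic matrix standing in for a real file from the repo.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_embeddings.npy")
    np.save(demo_path, np.ones((5, 8), dtype=np.float32))
    demo = load_embeddings(demo_path, expected_dim=8)
    demo_shape = demo.shape
    first_row = np.array(demo[0])   # copy one row into memory
    del demo                        # release the memory map before cleanup
```

Passing `expected_dim` catches the case where embeddings from the 0.6B and 4B models are mixed up, since the two produce vectors of different lengths.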

Limitations

The model is fine-tuned for a specific domain and may not generalize well to unrelated topics.
Performance depends on input text length and quality; the usage example truncates inputs to 512 tokens, so content beyond that limit is ignored.
The model does not generate text and should not be used for generative tasks.

License

This model is released under the MIT License.

Citation

If you use this model in your research, please cite:

@misc{theta2026,
  title={THETA: Domain-Specific Embedding Model for Sociology},
  author={CodeSoul},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/CodeSoulco/THETA}
}