
TreeLSTM - DEEPSEEK - Classification (5 classes)

Toxicity prediction model trained on the DEEPSEEK dataset.

| Property  | Value                       |
|-----------|-----------------------------|
| Model     | TreeLSTM                    |
| Task      | Classification (5 classes)  |
| Dataset   | DEEPSEEK                    |
| Framework | PyTorch / PyTorch Lightning |

Class: TreeLSTMModel

from src.models.lstm import TreeLSTMModel

TreeLSTMModel(
    vocab: dict,                  # Token-to-index mapping
    hidden_dim: int = 256,        # Hidden dimension
    num_classes: int = 1,         # 1 = regression, 2+ = classification
    dropout: float = 0.3,         # Dropout probability
    lr: float = 0.001,            # Learning rate
    loss_fn: str = 'auto'         # 'auto' selects the loss from num_classes
)

Methods

| Method | Description |
|---|---|
| `forward(batch)` | Processes a batch of tree graphs; returns logits. |
| `compute_loss(logits, targets)` | Computes the loss based on `num_classes`. |
| `load_from_checkpoint(path)` | Loads the model from a checkpoint. |
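Since `loss_fn` defaults to `'auto'` and `compute_loss` is documented as depending on `num_classes`, the loss is presumably selected from the output arity. A minimal sketch of how such dispatch could work — the selection logic below is an assumption, not taken from the ToxicThesis source:

```python
def resolve_loss_name(num_classes: int, loss_fn: str = "auto") -> str:
    """Pick a loss by output arity, mirroring the documented
    '1 = regression, 2+ = classification' convention (assumed behavior)."""
    if loss_fn != "auto":
        return loss_fn          # explicit override wins
    if num_classes == 1:
        return "mse"            # single output -> regression
    return "cross_entropy"      # 2+ outputs -> classification

# For this 5-class card, the 'auto' choice would be cross-entropy.
print(resolve_loss_name(5))  # cross_entropy
```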

Tree Input Format

The model expects constituency trees converted to a PyTorch Geometric batch with:

  • x: Node features (token embeddings)
  • edge_index: Parent-child edges
  • batch: Batch assignment for each node
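To make the three fields concrete, here is a sketch that flattens a tiny tree into node features and parent-child edges. The nested-tuple tree format is purely illustrative — it is not the actual ToxicThesis representation:

```python
# Sketch: flatten a constituency tree into the arrays a PyG batch needs.
# tree = (token, [children]) is an assumed toy format for illustration.

def flatten_tree(tree, vocab, parent=None, nodes=None, edges=None):
    """Return (node_token_ids, parent_child_edge_pairs) via preorder walk."""
    if nodes is None:
        nodes, edges = [], []
    token, children = tree
    my_id = len(nodes)
    nodes.append(vocab.get(token, vocab.get("<unk>", 0)))
    if parent is not None:
        edges.append((parent, my_id))  # parent -> child edge
    for child in children:
        flatten_tree(child, vocab, my_id, nodes, edges)
    return nodes, edges

vocab = {"<unk>": 0, "S": 1, "NP": 2, "VP": 3}
x, edge_index = flatten_tree(("S", [("NP", []), ("VP", [])]), vocab)
# x == [1, 2, 3]; edge_index == [(0, 1), (0, 2)]
# In real use these become torch tensors (edge_index transposed to shape
# [2, num_edges]), with `batch` a vector of zeros for a single graph.
```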

Usage with ToxicThesis (Recommended)

# 1. Clone ToxicThesis repository
# git clone https://github.com/simo-corbo/ToxicThesis
# cd ToxicThesis && pip install -r requirements.txt

from huggingface_hub import snapshot_download
import torch
import pickle

# 2. Download model files
model_dir = snapshot_download(
    repo_id="simocorbo/toxicthesis-deepseek-tree-classification-5",
    allow_patterns=["checkpoints/*", "*.pkl"]
)

# 3. Load vocabulary
with open(f"{model_dir}/vocab_stanza_hybrid.pkl", 'rb') as f:
    vocab = pickle.load(f)

# 4. Import and load model from ToxicThesis
from src.models.lstm import TreeLSTMModel

checkpoint = torch.load(f"{model_dir}/checkpoints/best.pt", map_location='cpu')
hparams = checkpoint.get('hyper_parameters', {})

model = TreeLSTMModel(
    vocab=vocab,
    hidden_dim=hparams.get('hidden_dim', 256),
    num_classes=hparams.get('num_classes', 5)
)
model.load_state_dict(checkpoint.get('state_dict', checkpoint), strict=False)
model.eval()

# 5. For inference, use the preprocessing pipeline to convert text to tree format
from src.preprocessing.tree_utils import text_to_tree_batch

# Parse and convert text to tree batch
tree_batch = text_to_tree_batch("Your text here", vocab)

with torch.no_grad():
    logits = model(tree_batch)
    if model.num_classes == 1:
        score = torch.sigmoid(logits).item()
        print(f"Score: {score}")
    else:
        probs = torch.softmax(logits, dim=-1)
        print(f"Probabilities: {probs}")
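The vocabulary loaded in step 3 is a plain token-to-index dict. A small sketch of out-of-vocabulary handling when encoding tokens — the `"<unk>"` key name is an assumption about `vocab_stanza_hybrid.pkl`, not a documented fact:

```python
# Sketch: look up tokens in the pickled vocabulary, falling back to an
# assumed "<unk>" entry (index 0 if that key is also absent).

def encode_tokens(tokens, vocab, unk_token="<unk>"):
    """Map tokens to vocabulary indices, with an OOV fallback."""
    unk_id = vocab.get(unk_token, 0)
    return [vocab.get(tok, unk_id) for tok in tokens]

toy_vocab = {"<unk>": 0, "this": 5, "is": 7}
print(encode_tokens(["this", "is", "toxicity"], toy_vocab))  # [5, 7, 0]
```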

Note on Standalone Usage

TreeLSTM inference requires constituency parsing and tree-to-graph conversion. For standalone usage you would have to reimplement this preprocessing pipeline yourself, so we recommend running the model through ToxicThesis directly.

Score Interpretation

| Output | Range | Meaning |
|---|---|---|
| `probabilities` | `List[float]` | Probability distribution over the 5 classes. |
| `class` | 0 to 4 | Predicted class (argmax of the probabilities). |

Classes: 5 toxicity levels, where higher class index = more toxic.
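A short sketch of turning the 5-way probability vector into a prediction. The level names are illustrative assumptions — the card only specifies that a higher class index means more toxic:

```python
# Sketch: map the probability vector to (class index, label, confidence).
# LEVELS is a hypothetical naming; only the ordering is defined by the card.
LEVELS = ["non-toxic", "mild", "moderate", "severe", "extreme"]

def interpret(probs):
    cls = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return cls, LEVELS[cls], probs[cls]

cls, label, conf = interpret([0.05, 0.10, 0.15, 0.50, 0.20])
print(cls, label, conf)  # 3 severe 0.5
```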

Files

| File | Description |
|---|---|
| `checkpoints/best.pt` | Model checkpoint (best validation loss) |
| `hparams.yaml` | Hyperparameters used for training |
| `train.csv` | Training metrics per epoch |
| `val.csv` | Validation metrics per epoch |
| `vocab_stanza_hybrid.pkl` | Vocabulary (for tree-based models) |

Installation

# Clone ToxicThesis for full model implementations
git clone https://github.com/simo-corbo/ToxicThesis
cd ToxicThesis
pip install -r requirements.txt

# Or install dependencies directly
pip install torch transformers huggingface_hub fasttext-wheel stanza

Citation

@software{toxicthesis2025,
  title={ToxicThesis},
  author={Corbo, Simone},
  year={2025},
  url={https://github.com/simo-corbo/ToxicThesis}
}