---
license: mit
language:
- en
base_model: distilbert/distilbert-base-uncased
library_name: transformers
tags:
- distilbert
- bert
- text-classification
- commission-detection
- social-media
pipeline_tag: text-classification
datasets:
- custom
model-index:
- name: distilbert-commissions
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Scraped Social Media Profiles (Bluesky & Twitter)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.9506
      verified: false
    - name: Precision
      type: precision
      value: 0.9513
      verified: false
    - name: Recall
      type: recall
      value: 0.9506
      verified: false
    - name: F1 Score
      type: f1
      value: 0.9508
      verified: false
---

# DistilBERT Commission Detection Model

## Model Description

This is a fine-tuned DistilBERT model for detecting commission-related content in social media profiles and posts. The model classifies text to identify whether an artist's profile, bio, or post content shows they are open or closed for commissions, or whether the text is unclear.

## Model Details

### Model Architecture

- **Base Model**: [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
- **Model Type**: Text Classification
- **Language**: English
- **License**: MIT

### Training Data

- **Sources**: Profile names, bios, and posts manually scraped from Bluesky and Twitter, with classifications crowdsourced from furry volunteers through a custom browser extension built specifically for this dataset
- **Dataset**: A custom dataset of roughly 1,000 user-classified rows, augmented with an equal amount of synthetic data to strengthen pattern recognition

## Performance

| Metric | Value |
|-----------|--------|
| Accuracy | 95.06% |
| Precision | 95.13% |
| Recall | 95.06% |
| F1 Score | 95.08% |

*Note: These metrics are not independently verified.*

## Usage

I recommend a high softmax temperature at inference time to soften the model's confidence; I use values between 1.5 and 3.0. Dividing the logits by a positive temperature does not change the predicted label, only the confidence scores, so this mainly helps when thresholding borderline predictions.

```python
# Example inference

from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch

# Load model and tokenizer
model_name = 'zohfur/distilbert-commissions'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Example inputs
example_sentences = [
    "Commissions are currently closed.",
    "Check my bio for commission status.",
    "C*mms 0pen on p-site",
    "DM for comms",
    "Taking art requests, dm me",
    "comm completed for personmcperson, thank you <3",
    "open for trades",
    "Comms are not open",
    "Comms form will be open soon, please check back later",
    "~ Furry artist - 25 y.o - he/him - c*mms 0pen: 2/5 - bots dni ~"
]

# Map label integers back to strings
label_map = {0: 'open', 1: 'closed', 2: 'unclear'}

def predict_with_temperature(model, tokenizer, sentences, temperature=1.5):
    # Tokenize the batch and move everything to the available device
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    encoded_input = {key: value.to(device) for key, value in encoded_input.items()}
    model.to(device)
    model.eval()

    # Make predictions with temperature scaling
    with torch.no_grad():
        outputs = model(**encoded_input)
        logits = outputs.logits / temperature  # Apply temperature scaling
        probabilities = torch.softmax(logits, dim=1)

    # Extract predicted classes and confidence scores
    predicted_class_indices = torch.argmax(probabilities, dim=1)
    confidences = torch.max(probabilities, dim=1).values

    # Convert tensors to plain Python values and prepare results
    predictions = {
        'sentences': sentences,
        'labels': [label_map[idx.item()] for idx in predicted_class_indices],
        'confidences': [score.item() for score in confidences]
    }

    return predictions

def print_predictions(predictions):
    """Print formatted predictions with confidence scores."""
    print("\nClassification Results:")
    print("=" * 50)
    for i, (sentence, label, confidence) in enumerate(zip(
        predictions['sentences'],
        predictions['labels'],
        predictions['confidences']
    ), 1):
        print(f"\n{i}. Sentence: '{sentence}'")
        print(f"   Predicted Label: {label}")
        print(f"   Confidence Score: {confidence:.4f}")

# Make predictions with temperature scaling
predictions = predict_with_temperature(model, tokenizer, example_sentences, temperature=1.5)

# Print results
print_predictions(predictions)
```
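
If you don't need temperature scaling, the standard `transformers` pipeline API is a quicker way to get predictions. A minimal sketch; note that the label names in the output depend on the `id2label` mapping in the model's config, so if it prints generic `LABEL_0`-style ids, apply the `label_map` above:

```python
from transformers import pipeline

# Load the model through the high-level text-classification pipeline
classifier = pipeline("text-classification", model="zohfur/distilbert-commissions")

# Single prediction without temperature scaling
print(classifier("Commissions are currently closed."))
```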

## Limitations and Biases

### Limitations

- **Language**: Only trained on English text
- **False Positives**: Requires a high inference temperature to avoid overconfident false positives (particularly around the words "open" and "closed")
- **Platform Bias**: Trained on Bluesky and Twitter/X data, so it may not perform as well on other platforms such as FurAffinity or Instagram

## Training Details

### Training Procedure

- **Base Model**: DistilBERT base uncased
- **Fine-tuning**: Fine-tuned with Hugging Face's `Trainer`, evaluated with `Trainer` and `sklearn.metrics`
- **Optimization**: Weights & Biases hyperparameter sweep using the Bayesian (`bayes`) search method to maximize the F1 score; a sketch of this setup follows the list

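The original training script isn't published; the sketch below shows a plausible minimal `Trainer` + `sklearn.metrics` setup under the details above. The dataset contents, hyperparameter values, and sweep ranges are illustrative placeholders, not the actual ones used.

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy stand-in for the real dataset (~1,000 crowdsourced rows
# plus an equal amount of synthetic data)
raw = Dataset.from_dict({
    "text": ["comms open!", "commissions closed", "art by me"],
    "label": [0, 1, 2],  # 0 = open, 1 = closed, 2 = unclear
})

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=3
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding=True)

dataset = raw.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # Weighted averaging reproduces the accuracy == recall pattern
    # seen in the reported metrics
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}

args = TrainingArguments(
    output_dir="distilbert-commissions",
    num_train_epochs=3,    # illustrative; real values came from the sweep
    learning_rate=5e-5,    # illustrative
    report_to="wandb",     # requires wandb; logs metrics for the sweep
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset, eval_dataset=dataset,
                  compute_metrics=compute_metrics)
trainer.train()

# A wandb sweep driving this script would use a config along these lines:
# {"method": "bayes",
#  "metric": {"name": "eval/f1", "goal": "maximize"},
#  "parameters": {"learning_rate": {"min": 1e-5, "max": 5e-5}}}
```
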
### Data Preprocessing

- Classifications uploaded voluntarily by users of the crowdsourcing extension
- Problematic Unicode characters cleaned from the dataset
- Labels encoded as integers for classification
- Class weights computed inversely proportional to class frequencies to offset class imbalance (see the sketch below)

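The weighting code isn't included in this card; a common way to compute inverse-frequency class weights and feed them into the training loss is sketched below. The `WeightedTrainer` subclass and the example labels are illustrative, not taken from the original training script.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight
from transformers import Trainer

labels = np.array([0, 1, 2, 0, 2, 2])  # illustrative training labels

# "balanced" weights are inversely proportional to class frequencies:
# weight_c = n_samples / (n_classes * count_c)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(labels), y=labels)
class_weights = torch.tensor(weights, dtype=torch.float)

class WeightedTrainer(Trainer):
    """Trainer whose loss penalizes mistakes on rare classes more heavily."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(
            weight=class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```
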
## Model Card Authors

All credit to the original author, Zohfur. The base model is attributed to [distilbert](https://huggingface.co/distilbert).

## Model Card Contact

For questions or concerns about this model, please contact: [ben@zohfur.dog](mailto:ben@zohfur.dog)