---
license: apache-2.0
datasets:
- lmsys/toxic-chat
metrics:
- perplexity
---

# Model Card for tci_plus

This model is `facebook/bart-large` fine-tuned on non-toxic inputs from the `lmsys/toxic-chat` dataset.

## Model Details

This model is not intended for plain inference, even though it is unlikely to generate toxic content.
It is instead intended as a "utility model" for detecting and fixing toxic content, since its token probability distributions will likely differ from those of comparable models not trained/fine-tuned on non-toxic data.

Its name, `tci_plus`, refers to the _G+_ model in [Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts](https://aclanthology.org/2023.acl-short.21.pdf).

It can be used within TrustyAI's `TMaRCo` tool for detoxifying text; see https://github.com/trustyai-explainability/trustyai-detoxify/.

### Model Description

- **Developed by:** tteofili
- **Shared by:** tteofili
- **License:** Apache-2.0
- **Fine-tuned from model:** `facebook/bart-large`

## Uses

This model is intended to be used as a "utility model" for detecting and fixing toxic content, since its token probability distributions will likely differ from those of comparable models not trained/fine-tuned on non-toxic data.

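The sketch below illustrates this idea outside of TMaRCo: it compares the per-token probabilities assigned by the base `facebook/bart-large` model and by this fine-tuned expert, so that tokens the expert finds much less likely stand out. It is a minimal sketch, not the MaRCo algorithm itself; the example sentence is reused from the usage snippet further down.

```python
# Minimal sketch (not the TMaRCo implementation): compare per-token
# probabilities under the base model and under this fine-tuned expert.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
base = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
expert = BartForConditionalGeneration.from_pretrained("tteofili/tci_plus")

text = "white men can't jump"  # example input from the usage snippet below
inputs = tokenizer(text, return_tensors="pt")
labels = inputs["input_ids"]

with torch.no_grad():
    base_logits = base(**inputs, labels=labels).logits
    expert_logits = expert(**inputs, labels=labels).logits

# probability each model assigns to every observed token
base_probs = torch.softmax(base_logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
expert_probs = torch.softmax(expert_logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)

# tokens whose probability drops sharply under the expert are candidates for rewriting
for token, b, e in zip(tokenizer.convert_ids_to_tokens(labels[0].tolist()), base_probs[0], expert_probs[0]):
    print(f"{token:>12}  base={b.item():.3f}  expert={e.item():.3f}")
```
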
## Bias, Risks, and Limitations

This model is fine-tuned on non-toxic inputs from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset, so it is unlikely to produce toxic content; nevertheless, it should only be used in combination with other models for the purpose of detecting and fixing toxic content.

## How to Get Started with the Model

Use the code below to get started with the model for text detoxification.

```python
from trustyai.detoxify import TMaRCo

# relative weights applied to the loaded (anti-)expert models
tmarco = TMaRCo(expert_weights=[-1, 3])
# load the anti-expert (tci_minus) and expert (tci_plus) models
tmarco.load_models(["tteofili/tci_minus", "tteofili/tci_plus"])
# rewrite the input text into a less toxic variant
tmarco.rephrase(["white men can't jump"])
```

## Training Details

This model has been trained on non-toxic inputs from the `lmsys/toxic-chat` dataset.

### Training Data

Training data comes from the non-toxic portion of the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset.

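For reference, the snippet below shows one way the non-toxic inputs could be selected from the dataset. It is illustrative only: the dataset configuration name is an assumption, while the column names follow the `user_input` and `toxicity` fields referenced in the training code below.

```python
# Illustrative only: select the non-toxic user inputs from lmsys/toxic-chat.
# The configuration name "toxicchat0124" is an assumption.
from datasets import load_dataset

dataset = load_dataset("lmsys/toxic-chat", "toxicchat0124", split="train")
non_toxic_inputs = [row["user_input"] for row in dataset if row["toxicity"] == 0]
print(f"{len(non_toxic_inputs)} non-toxic training inputs")
```
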
### Training Procedure

This model has been fine-tuned with the following code:

```python
from trustyai.detoxify import TMaRCo

tmarco = TMaRCo()  # instantiate TMaRCo before training

dataset_name = 'lmsys/toxic-chat'
data_dir = ''
perc = 100
td_columns = ['model_output', 'user_input', 'human_annotation', 'conv_id', 'jailbreaking', 'openai_moderation',
              'toxicity']

target_feature = 'toxicity'
content_feature = 'user_input'
model_prefix = 'toxic_chat_input_'
tmarco.train_models(perc=perc, dataset_name=dataset_name, expert_feature=target_feature, model_prefix=model_prefix,
                    data_dir=data_dir, content_feature=content_feature, td_columns=td_columns)
```

#### Training Hyperparameters

This model has been trained with the following hyperparameters:

```python
from transformers import TrainingArguments

# hyperparameters used for fine-tuning; other arguments are omitted here
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01
)
```

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Test data comes from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset.

#### Metrics

The model was evaluated using the perplexity metric.

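As an illustration of how such a perplexity figure can be reproduced (the exact evaluation script is not part of this card), the sketch below computes perplexity of the fine-tuned model over non-toxic test inputs; the dataset configuration name and the use of the test split are assumptions.

```python
# Illustrative perplexity computation, not the exact evaluation used for this card.
# Assumptions: dataset configuration name and restriction to non-toxic test inputs.
import math
import torch
from datasets import load_dataset
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("tteofili/tci_plus")
model.eval()

test_set = load_dataset("lmsys/toxic-chat", "toxicchat0124", split="test")
texts = [row["user_input"] for row in test_set if row["toxicity"] == 0]

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        out = model(**enc, labels=enc["input_ids"])  # loss is mean negative log-likelihood per token
        n_tokens = enc["input_ids"].numel()
        total_nll += out.loss.item() * n_tokens
        total_tokens += n_tokens

print(f"perplexity: {math.exp(total_nll / total_tokens):.2f}")
```
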
### Results

Perplexity: 1.08