Synthetic Agent Risk Classifier (TF‑IDF + Logistic Regression)

This repository contains a simple text classification model used in the
Ethical AI Control Panel course project.

The model predicts a coarse ethical risk level for short English descriptions of AI agents or automation workflows, using three classes:

  • low_risk
  • medium_risk
  • high_risk

It is implemented as a scikit‑learn pipeline consisting of:

  • TfidfVectorizer with 1–2 gram features, and
  • LogisticRegression for multiclass classification.

The model is intended as a proof‑of‑concept for teaching and prototyping only, not as a production‑ready safety classifier.

Intended Use and Limitations

Intended Use

  • Given a short description of an AI agent or workflow (e.g., “an autonomous agent that logs into my bank account and pays bills”), the model outputs one of three coarse risk labels.
  • It is used as one component in a larger “Ethical AI Control Panel” interface that also includes rule‑based checks and human‑in‑the‑loop recommendations.

Example high‑level use in Python:

import joblib

clf = joblib.load("risk_classifier_tfidf_logreg.joblib")
prompt = "Build an AI agent that drafts emails and asks me to approve before sending them."
label = clf.predict([prompt])[0]
probs = clf.predict_proba([prompt])[0]

print(label)  # e.g., 'low_risk'
print(dict(zip(clf.classes_, probs)))
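For the human‑in‑the‑loop part of the control panel, the class probabilities can be used to decide when to defer to a person. The sketch below is illustrative only: the `route_prediction` helper, the 0.8 threshold, and the `needs_human_review` label are assumptions, not part of the released model.

```python
# Hypothetical routing logic for a human-in-the-loop gate.
# Thresholds and label names are illustrative assumptions.
def route_prediction(probs: dict[str, float], threshold: float = 0.8) -> str:
    """Return the top label, or defer to a human when the model is
    unsure or when the top label is high_risk."""
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    if label == "high_risk" or confidence < threshold:
        return "needs_human_review"
    return label

# A confident low_risk prediction passes through unchanged.
print(route_prediction({"low_risk": 0.95, "medium_risk": 0.04, "high_risk": 0.01}))
```

In practice the `probs` dict would come from `dict(zip(clf.classes_, clf.predict_proba([prompt])[0]))` as shown above.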

Limitations

The model is trained only on a small synthetic dataset (≈600 samples) generated from templates.

Because the templates are highly separable across classes, the model achieves very high validation accuracy (~100%), which should not be interpreted as real‑world performance.

It is not robust to out‑of‑distribution inputs and must not be used for real compliance, safety, or moderation tasks.

Training Data

The model is trained on the following dataset:

Dataset: EricCRX/ethical-ai-control-panel-synthetic-risk

Dataset details:

  • ~600 synthetic prompts describing AI agents and automations
  • 3 labels: low_risk, medium_risk, high_risk
  • All data is generated from hand‑written templates, with no real user data or PII.

See the dataset card for more details on how the data was generated.

Training Procedure

  • Split: 80% training / 20% validation, stratified by label.
  • Text preprocessing: scikit‑learn TfidfVectorizer with:
      • ngram_range=(1, 2)
      • min_df=2
      • max_df=0.9
  • Classifier: LogisticRegression(max_iter=200)
  • All other settings are scikit‑learn defaults.
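The training recipe above can be sketched end to end. The template strings and actions below are stand‑ins, not the real generation templates from the dataset; only the vectorizer and classifier settings come from this card.

```python
# Sketch of the training recipe: TF-IDF (1-2 grams) + logistic regression
# on a tiny template-generated stand-in dataset. Template text is
# illustrative, not the actual dataset's templates.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

templates = {
    "low_risk": "an agent that {} and asks me to approve each step",
    "medium_risk": "an agent that {} with limited human oversight",
    "high_risk": "an agent that {} without any human review",
}
actions = ["drafts emails", "pays bills", "posts updates",
           "moves files", "replies to customers", "schedules meetings"]

texts, labels = [], []
for label, template in templates.items():
    for action in actions:
        texts.append(template.format(action))
        labels.append(label)

# 80/20 split, stratified by label, as described above.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.9)),
    ("clf", LogisticRegression(max_iter=200)),
])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_val, y_val))
```

The fitted pipeline can then be saved with `joblib.dump(pipeline, "risk_classifier_tfidf_logreg.joblib")` for use as shown in the earlier example.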

Evaluation

On the synthetic validation set (which is highly linearly separable), the model achieves approximately:

  • Accuracy: ~1.00
  • Macro F1: ~1.00

These numbers mainly reflect the simplicity of the synthetic dataset and do not indicate real‑world performance.
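For reference, the two reported metrics are standard scikit‑learn measures; the snippet below shows how they are computed, using made‑up validation predictions rather than the model's actual outputs.

```python
# Accuracy and macro-averaged F1 on hypothetical validation predictions
# (these labels are illustrative, not the model's real outputs).
from sklearn.metrics import accuracy_score, f1_score

y_true = ["low_risk", "medium_risk", "high_risk", "low_risk", "high_risk"]
y_pred = ["low_risk", "medium_risk", "high_risk", "low_risk", "medium_risk"]

acc = accuracy_score(y_true, y_pred)
# Macro F1 averages per-class F1 scores, weighting each class equally.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"accuracy={acc:.2f}, macro F1={macro_f1:.2f}")
```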

Ethical Considerations

The model is trained on deliberately constructed synthetic data that includes clearly unsafe behaviors (e.g., breaking into bank accounts, impersonating executives) as fictional examples of “high risk”.

It is designed for educational purposes in a course on responsible AI and should not be used to make real‑world safety decisions.

Any practical deployment of a similar system would require:

  • a much more diverse and realistic dataset,
  • domain expert review, and
  • strong human‑in‑the‑loop safeguards.
