# Synthetic Agent Risk Classifier (TF‑IDF + Logistic Regression)
This repository contains a simple text classification model used in the
Ethical AI Control Panel course project.
The model predicts a coarse ethical risk level for short English descriptions of AI agents or automation workflows, using three classes:
`low_risk`, `medium_risk`, and `high_risk`.
It is implemented as a scikit‑learn pipeline consisting of:

- `TfidfVectorizer` with 1–2 gram (unigram and bigram) features, and
- `LogisticRegression` for multiclass classification.
The model is intended as a proof‑of‑concept for teaching and prototyping only, not as a production‑ready safety classifier.
## Intended Use and Limitations
### Intended Use
- Given a short description of an AI agent or workflow (e.g., “an autonomous agent that logs into my bank account and pays bills”), the model outputs one of three coarse risk labels.
- It is used as one component in a larger “Ethical AI Control Panel” interface that also includes rule‑based checks and human‑in‑the‑loop recommendations.
Example high‑level use in Python:

```python
import joblib

# Load the serialized scikit-learn pipeline (TF-IDF + logistic regression).
clf = joblib.load("risk_classifier_tfidf_logreg.joblib")

prompt = "Build an AI agent that drafts emails and asks me to approve before sending them."
label = clf.predict([prompt])[0]
probs = clf.predict_proba([prompt])[0]

print(label)  # e.g., 'low_risk'
print(dict(zip(clf.classes_, probs)))
```
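Because the classifier is meant to sit alongside rule‑based checks and human‑in‑the‑loop recommendations, a deployment wrapper might escalate uncertain or high‑risk predictions to a reviewer rather than acting on them directly. The sketch below is illustrative only; `REVIEW_THRESHOLD` and `needs_human_review` are invented names, not part of this repository.

```python
# Hypothetical confidence gate: escalate high-risk or low-confidence
# predictions to a human reviewer instead of trusting the model.
REVIEW_THRESHOLD = 0.8  # illustrative value, not tuned


def needs_human_review(label: str, probs: dict) -> bool:
    """Return True when a prediction should be escalated to a human."""
    top_prob = max(probs.values())
    return label == "high_risk" or top_prob < REVIEW_THRESHOLD


probs = {"low_risk": 0.55, "medium_risk": 0.30, "high_risk": 0.15}
print(needs_human_review("low_risk", probs))  # True: top probability below 0.8
```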
### Limitations
- The model is trained only on a small synthetic dataset (≈600 samples) generated from templates.
- Because the templates are highly separable across classes, the model achieves very high validation accuracy (~100%), which should not be interpreted as real‑world performance.
- It is not robust to out‑of‑distribution inputs and must not be used for real compliance, safety, or moderation tasks.
## Training Data
The model is trained on the dataset `EricCRX/ethical-ai-control-panel-synthetic-risk`, which contains:

- ~600 synthetic prompts describing AI agents and automations
- 3 labels: `low_risk`, `medium_risk`, `high_risk`
- data generated entirely from hand‑written templates, with no real user data or PII
See the dataset card for more details on how the data was generated.
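As a rough illustration of the template approach, a generator like the following could produce labeled prompts. The templates and task fillers below are invented for this sketch; the actual templates are documented in the dataset card.

```python
import random

# Invented example templates per label; the real dataset uses its own
# hand-written templates (see the dataset card).
TEMPLATES = {
    "low_risk": ["an agent that {task} and asks me to approve each step"],
    "medium_risk": ["an agent that {task} without asking for confirmation"],
    "high_risk": ["an agent that {task} using stolen credentials"],
}
TASKS = ["drafts emails", "pays bills", "posts on social media"]


def generate(n_per_label: int, seed: int = 0) -> list:
    """Generate (text, label) pairs by filling templates with tasks."""
    rng = random.Random(seed)
    rows = []
    for label, templates in TEMPLATES.items():
        for _ in range(n_per_label):
            template = rng.choice(templates)
            rows.append((template.format(task=rng.choice(TASKS)), label))
    return rows


print(len(generate(200)))  # 600 rows: 200 per label
```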
## Training Procedure
- Split: 80% training / 20% validation, stratified by label.
- Text preprocessing: scikit‑learn `TfidfVectorizer` with `ngram_range=(1, 2)`, `min_df=2`, `max_df=0.9`.
- Classifier: `LogisticRegression(max_iter=200)`.
- All other settings are scikit‑learn defaults.
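The setup above can be sketched as a scikit‑learn pipeline. The toy texts and variable names below are placeholders; the real training script loads the synthetic dataset instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for the synthetic (text, label) pairs.
texts = ["drafts emails with my approval", "pays bills automatically",
         "logs into accounts it does not own", "summarizes articles for me"] * 10
labels = ["low_risk", "medium_risk", "high_risk", "low_risk"] * 10

# 80/20 split, stratified by label, as described above.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.9)),
    ("logreg", LogisticRegression(max_iter=200)),
])
clf.fit(X_train, y_train)
print(clf.score(X_val, y_val))
```

The fitted pipeline can then be serialized with `joblib.dump(clf, "risk_classifier_tfidf_logreg.joblib")` for use as shown earlier.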
## Evaluation
On the synthetic validation set (which is highly linearly separable), the model achieves approximately:
- Accuracy: ~1.00
- Macro F1: ~1.00
These numbers mainly reflect the simplicity of the synthetic dataset and do not indicate real‑world performance.
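Both metrics come straight from `sklearn.metrics`; the labels and predictions below are made up purely to show the calls, not taken from the actual validation set.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up validation labels and predictions, for illustration only.
y_val = ["low_risk", "medium_risk", "high_risk", "low_risk"]
y_pred = ["low_risk", "medium_risk", "high_risk", "low_risk"]

print(accuracy_score(y_val, y_pred))             # 1.0
print(f1_score(y_val, y_pred, average="macro"))  # 1.0
```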
## Ethical Considerations
The model is trained on deliberately constructed synthetic data that includes clearly unsafe behaviors (e.g., breaking into bank accounts, impersonating executives) as fictional examples of “high risk”.
It is designed for educational purposes in a course on responsible AI and should not be used to make real‑world safety decisions.
Any practical deployment of a similar system would require:
- a much more diverse and realistic dataset,
- domain expert review, and
- strong human‑in‑the‑loop safeguards.