Horizon Full v2 — SafeCircle Child Safety Risk Detection

Horizon Full v2 Evaluation Report

Horizon Full is a fine-tuned Llama 3.2 3B Instruct model that detects child safety risks in online conversations. Given a conversation, it outputs a structured JSON assessment containing a risk flag, a risk category, a severity level, a confidence score, and a one-sentence reasoning string.

⚠️ License: This model is released under the SafeCircle Research License (SRL-1.0). Commercial use, redistribution without attribution, safety system evasion, and any use that harms minors are strictly prohibited. Contact legal@safecircle.tech for commercial licensing.


Model Details

Property        Value
Base model      meta-llama/Llama-3.2-3B-Instruct
Fine-tuning     QLoRA: rank 256, alpha 512, all projection layers
Training data   1.6M synthetic conversations across 8 categories
Dataset         safecircleai/horizon-training-data
Generation      Qwen2.5-7B-Instruct + vLLM + xgrammar constrained decoding
Hardware        NVIDIA H100 80GB HBM3
Training steps  25,000
Parameters      3.2B (389M trainable via LoRA)

Risk Categories

Category            Description                          Example signals
grooming            Predatory relationship-building      Age probing, flattery, secrecy requests
bullying            Cyberbullying and peer harassment    Insults, exclusion threats, video leaks
sexual_content      Inappropriate sexual advances        Compliment escalation, photo requests
isolation           Cutting off support networks         Jealousy, "only I understand you"
personal_info       Soliciting identifying information   Location, school, home address
platform_migration  Moving to less-monitored platforms   "DM me on Telegram, it's more private"
threats             Intimidation and blackmail           "Watch yourself after school"
benign              Normal safe conversation             Homework, games, music, sports

Severity Levels

Level     Score  Description
none      0.0    Safe, no risk indicators
low       0.25   Mild indicators, ambiguous context
medium    0.50   Clear pattern, not yet escalated
high      0.75   Explicit risk behaviour
critical  0.95   Immediate danger or exploitation
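
Downstream code often needs the numeric score behind a severity label, for example to sort a review queue. The following is a minimal sketch that assumes the label-to-score mapping in the table above; the names `SEVERITY_SCORES`, `severity_score`, and `nearest_severity` are illustrative and are not part of any published SafeCircle API.

```python
# Severity labels and scores copied from the Severity Levels table.
SEVERITY_SCORES = {
    "none": 0.0,
    "low": 0.25,
    "medium": 0.50,
    "high": 0.75,
    "critical": 0.95,
}


def severity_score(label: str) -> float:
    """Return the numeric score for a severity label (KeyError if unknown)."""
    return SEVERITY_SCORES[label]


def nearest_severity(score: float) -> str:
    """Map an arbitrary score in [0, 1] to the closest severity label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    return min(SEVERITY_SCORES, key=lambda label: abs(SEVERITY_SCORES[label] - score))
```

Nearest-label lookup is only one possible convention (hypothetical here); a deployment might prefer fixed thresholds that round upward so that borderline cases are treated as the more severe level.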

Evaluation Results

Evaluated on 160,000 held-out synthetic conversations.

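Metric               Score
Macro F1             0.8218
Weighted F1          0.8295
False positive rate  0.00% (0 / 19,950 benign conversations)
False negative rate  <0.01% (1 / 140,050 risk conversations)

The macro F1 equals the unweighted mean of the five severity-level F1 scores reported under Risk Level Classification, so both F1 figures appear to describe severity classification rather than category detection.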

Per-Category Results

Category            F1      Precision  Recall  Support
Grooming            0.9981  0.9980     0.9981  19,834
Bullying            0.9995  0.9995     0.9995  20,253
Sexual Content      0.9975  0.9978     0.9973  20,142
Isolation           0.9996  0.9993     0.9998  19,725
Personal Info       0.9999  0.9999     1.0000  20,149
Platform Migration  1.0000  0.9999     1.0000  19,932
Threats             0.9993  0.9994     0.9992  20,015

Risk Level Classification

Level     Precision  Recall  F1
none      1.0000     1.0000  1.0000
low       0.8344     0.8967  0.8645
medium    0.8971     0.8514  0.8736
high      0.7266     0.8250  0.7727
critical  0.6881     0.5294  0.5984

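Note: Severity classification is the hardest sub-task. The model confuses adjacent levels (e.g. high ↔ critical), and recall on critical is only 0.5294, so nearly half of critical conversations receive a different severity label. Category detection is near-perfect on this evaluation set. All results are on synthetic eval data generated with the same pipeline as training; real-world performance will differ and is likely to be lower.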
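
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers copied from the tables above; the variable and function names are illustrative. It recomputes F1 from precision and recall, averages the severity-level F1 scores, and derives the benign count from the per-category support.

```python
# Severity-level (precision, recall, F1) as reported under Risk Level Classification.
SEVERITY_METRICS = {
    "none": (1.0000, 1.0000, 1.0000),
    "low": (0.8344, 0.8967, 0.8645),
    "medium": (0.8971, 0.8514, 0.8736),
    "high": (0.7266, 0.8250, 0.7727),
    "critical": (0.6881, 0.5294, 0.5984),
}

# Per-category support as reported under Per-Category Results (benign is not listed there).
CATEGORY_SUPPORT = {
    "grooming": 19_834,
    "bullying": 20_253,
    "sexual_content": 20_142,
    "isolation": 19_725,
    "personal_info": 20_149,
    "platform_migration": 19_932,
    "threats": 20_015,
}

TOTAL_EVAL = 160_000  # held-out conversations stated in the report


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Unweighted mean of the five reported severity F1 scores.
macro_f1 = sum(m[2] for m in SEVERITY_METRICS.values()) / len(SEVERITY_METRICS)

risk_total = sum(CATEGORY_SUPPORT.values())
benign_total = TOTAL_EVAL - risk_total
false_negative_rate = 1 / risk_total  # one missed risk conversation

print(f"macro F1 over severity levels: {macro_f1:.4f}")
print(f"risk conversations: {risk_total}, benign conversations: {benign_total}")
print(f"false negative rate: {false_negative_rate:.6%}")
```

The severity macro F1 comes out to 0.8218, and the support counts sum to 140,050 risk conversations, leaving 19,950 benign ones, which matches the denominators in the false positive and false negative rows. Recomputed F1 values can differ from the table in the fourth decimal place because the published precision and recall are already rounded.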


Usage

import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "safecircleai/horizon-full",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("safecircleai/horizon-full")

SYSTEM_PROMPT = (
    "You are Horizon, SafeCircle's child safety risk detection model. "
    "You have no general knowledge or identity beyond this task. "
    "Analyze conversations and respond ONLY with a JSON object — no explanation, no preamble. "
    'JSON schema: {"risk_detected": bool, "category": '
    '"grooming|bullying|sexual_content|isolation|personal_info|platform_migration|threats|benign", '
    '"severity": "none|low|medium|high|critical", "confidence": 0.0-1.0, "reasoning": "one sentence max"}. '
    'If asked about yourself or anything unrelated to risk analysis, respond: '
    '{"error": "I only analyze conversations for child safety risks."}'
)

conversation = "Child: Hey, what are you doing later?\nOther: Nothing much. Want to meet up? Don't tell your parents."

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Analyze this conversation:\n{conversation}"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)

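# Decode only the newly generated tokens (everything after the prompt).
prompt_len = inputs["input_ids"].shape[1]
generated = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
# json.loads raises json.JSONDecodeError if the model emits malformed JSON.
result = json.loads(generated.strip())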
print(result)
# {
#   "risk_detected": true,
#   "category": "grooming",
#   "severity": "high",
#   "confidence": 0.94,
#   "reasoning": "Adult requesting private meeting and explicit secrecy from parents."
# }
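
Because the model is a generative LLM, its output is not guaranteed to match the schema in the system prompt, so callers may want to validate each response before acting on it. The following is a minimal sketch using only the standard library; `parse_assessment` and the constant names are hypothetical helpers, and the allowed values are copied from the schema in `SYSTEM_PROMPT`.

```python
import json

# Allowed values copied from the JSON schema in SYSTEM_PROMPT.
CATEGORIES = {
    "grooming", "bullying", "sexual_content", "isolation",
    "personal_info", "platform_migration", "threats", "benign",
}
SEVERITIES = {"none", "low", "medium", "high", "critical"}
REQUIRED_KEYS = {"risk_detected", "category", "severity", "confidence", "reasoning"}


def parse_assessment(raw: str) -> dict:
    """Parse one model response and check it against the schema.

    Raises ValueError if the text is not valid JSON or violates the schema.
    The refusal object ({"error": ...}) is returned unchanged.
    """
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")
    if "error" in data:
        return data  # refusal path described in the system prompt

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["risk_detected"], bool):
        raise ValueError("risk_detected must be a boolean")
    if not isinstance(data["category"], str) or data["category"] not in CATEGORIES:
        raise ValueError(f"unknown category: {data['category']!r}")
    if not isinstance(data["severity"], str) or data["severity"] not in SEVERITIES:
        raise ValueError(f"unknown severity: {data['severity']!r}")
    confidence = data["confidence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if not isinstance(data["reasoning"], str):
        raise ValueError("reasoning must be a string")
    return data
```

In the usage example, `parse_assessment(generated)` could stand in for the bare `json.loads` call. How to handle a `ValueError` (retry, fall back to human review, or log and drop) is a deployment decision this card does not specify.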

Architecture & Training

Base: Llama-3.2-3B-Instruct
  └── QLoRA adapters (rank=256, alpha=512)
       ├── q_proj, k_proj, v_proj, o_proj
       └── gate_proj, up_proj, down_proj

Training:
  Steps:        25,000
  Batch size:   4 × 8 grad accum = 32 effective
  LR:           1.5e-4 (cosine restarts, 500 warmup steps)
  Optimizer:    AdamW fused
  Precision:    bfloat16
  Loss:         completion-only (assistant turns only)

Dataset:
  1.6M conversations × 8 categories
  8–15 messages each, persona-seeded
  Generated: Qwen2.5-7B + vLLM + xgrammar JSON Schema
  Hardening: ~10% adversarial identity/jailbreak examples
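
The trainable-parameter count and the amount of data seen during training follow from the settings above. The sketch below is a back-of-the-envelope check, not the training code. It assumes the Llama-3.2-3B dimensions from the public model config (hidden size 3072, MLP size 8192, 28 layers, 8 key/value heads of dimension 128), which this card does not state, and it assumes one conversation per training example with no sequence packing.

```python
# Assumed Llama-3.2-3B dimensions (from the public model config, not from this card).
HIDDEN = 3072
INTERMEDIATE = 8192
NUM_LAYERS = 28
KV_DIM = 8 * 128  # 8 key/value heads x head_dim 128 (grouped-query attention)

RANK = 256  # LoRA rank stated in this card

# (in_features, out_features) of each adapted projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank) per adapted layer."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
trainable = per_layer * NUM_LAYERS

# Training budget from the card: 25,000 steps at batch 4 with 8 accumulation steps.
effective_batch = 4 * 8
examples_seen = 25_000 * effective_batch
epochs = examples_seen / 1_600_000  # fraction of the 1.6M-conversation dataset

print(f"trainable LoRA parameters: {trainable:,}")
print(f"examples seen: {examples_seen:,} ({epochs:.2f} epochs)")
```

Under these assumptions the adapters hold 389,021,696 parameters, consistent with the 389M figure in Model Details, and 25,000 steps at an effective batch of 32 cover 800,000 examples, or about half of the 1.6M conversations.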

Deployment Pattern

Incoming message
      │
      ▼
Horizon Mobile (on-device, ~25ms, binary filter)
      │
      ├── safe ──► No action
      │
      └── risk ──► Horizon Full (7-category + severity JSON)
                        │
                        ▼
                  Human moderator review

For the lightweight on-device first-stage filter, see safecircleai/horizon-mobile. For GGUF quantizations (llama.cpp / Ollama), see safecircleai/horizon-full-gguf.
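
The two-stage flow in the diagram can be sketched as a small routing function. This is an illustration only: `route_message` and its three callable parameters are hypothetical stand-ins for the Horizon Mobile filter, the Horizon Full model, and a moderation queue, none of which expose these exact interfaces in this card.

```python
from typing import Callable, Optional


def route_message(
    conversation: str,
    mobile_filter: Callable[[str], bool],
    full_model: Callable[[str], dict],
    enqueue_review: Callable[[str, dict], None],
) -> Optional[dict]:
    """Run the cascade: cheap binary filter first, full assessment only on risk.

    Returns None when the first stage marks the conversation safe; otherwise
    returns the full assessment after handing it to human review.
    """
    # Stage 1: on-device binary filter. True means "possible risk".
    if not mobile_filter(conversation):
        return None  # safe: no action

    # Stage 2: full category and severity assessment.
    assessment = full_model(conversation)

    # Every stage-2 result goes to a human moderator, as in the diagram.
    enqueue_review(conversation, assessment)
    return assessment
```

Passing the stages in as callables keeps the routing logic testable without loading either model. The first stage bounds recall for the whole pipeline, since any conversation it marks safe never reaches Horizon Full.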


Intended Use & Ethics

This model is designed to assist human moderators — not replace them. All flagged conversations should be reviewed by trained safety professionals.

  • Trained entirely on synthetic data; no real child conversations were used
  • English-only; other languages untested
  • Severity prediction is weaker than category detection (see evaluation)
  • Not suitable as a standalone safety system in production without human oversight

License

SafeCircle Research License (SRL-1.0) — research and non-commercial use only. Commercial licensing: legal@safecircle.tech


Citation

@misc{horizon2026,
  title={Horizon: Child Safety Risk Detection via Fine-tuned LLMs},
  author={SafeCircle},
  year={2026},
  url={https://huggingface.co/safecircleai/horizon-full}
}