Horizon Full v2 — SafeCircle Child Safety Risk Detection

Horizon Full is a fine-tuned Llama 3.2 3B Instruct model that detects child safety risks in online conversations. Given a conversation, it outputs a structured JSON assessment: risk category, severity level, confidence score, and a one-sentence reasoning.

⚠️ License: This model is released under the SafeCircle Research License (SRL-1.0). Commercial use, redistribution without attribution, safety system evasion, and any use that harms minors are strictly prohibited. Contact legal@safecircle.tech for commercial licensing.

Model Details

Property	Value
Base model	meta-llama/Llama-3.2-3B-Instruct
Fine-tuning	QLoRA — rank 256, alpha 512, all projection layers
Training data	1.6M synthetic conversations across 8 categories
Dataset	safecircleai/horizon-training-data
Generation	Qwen2.5-7B-Instruct + vLLM + xgrammar constrained decoding
Hardware	NVIDIA H100 80GB HBM3
Training steps	25,000
Parameters	3.2B (389M trainable via LoRA)
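
The 389M trainable figure can be sanity-checked from the adapter settings. A rank-r LoRA adapter on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) parameters. The sketch below assumes the layer shapes of the public Llama 3.2 3B configuration (hidden size 3072, MLP size 8192, 28 decoder layers, 8 key/value heads of dimension 128). Those shapes are not stated in this card and should be checked against the base model's config.json.

```python
# Rough LoRA parameter count for rank-256 adapters on all projection layers.
# Layer shapes are ASSUMED from the public Llama 3.2 3B config, not from this card.
HIDDEN = 3072         # assumed hidden_size
INTERMEDIATE = 8192   # assumed intermediate_size
KV_DIM = 8 * 128      # assumed num_key_value_heads * head_dim
NUM_LAYERS = 28       # assumed num_hidden_layers
RANK = 256            # from the model card

# (in_features, out_features) of each adapted projection in one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")  # 389,021,696
```

Under these assumptions the count rounds to the 389M reported in the table.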

Risk Categories

Category	Description	Example Signal
`grooming`	Predatory relationship-building	Age probing, flattery, secrecy requests
`bullying`	Cyberbullying and peer harassment	Insults, exclusion threats, video leaks
`sexual_content`	Inappropriate sexual advances	Compliment escalation, photo requests
`isolation`	Cutting off support networks	Jealousy, "only I understand you"
`personal_info`	Soliciting identifying information	Location, school, home address
`platform_migration`	Moving to less-monitored platforms	"DM me on Telegram, it's more private"
`threats`	Intimidation and blackmail	"Watch yourself after school"
`benign`	Normal safe conversation	Homework, games, music, sports

Severity Levels

Level	Score	Description
`none`	0.0	Safe — no risk indicators
`low`	0.25	Mild indicators, ambiguous context
`medium`	0.50	Clear pattern, not yet escalated
`high`	0.75	Explicit risk behaviour
`critical`	0.95	Immediate danger or exploitation
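
Downstream code may want the numeric score for a predicted label. A minimal sketch follows, with the label-to-score mapping copied from the table above. The `needs_escalation` helper and its default threshold are hypothetical and only illustrate one possible policy.

```python
# Severity labels and scores copied from the Severity Levels table.
SEVERITY_SCORES = {
    "none": 0.0,
    "low": 0.25,
    "medium": 0.50,
    "high": 0.75,
    "critical": 0.95,
}


def severity_score(level: str) -> float:
    """Return the numeric score for a severity label (KeyError if unknown)."""
    return SEVERITY_SCORES[level]


def needs_escalation(level: str, threshold: float = 0.75) -> bool:
    """Hypothetical policy: escalate when the score reaches the threshold."""
    return severity_score(level) >= threshold


print(severity_score("medium"))    # 0.5
print(needs_escalation("high"))    # True
print(needs_escalation("low"))     # False
```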

Evaluation Results

Evaluated on 160,000 held-out synthetic conversations.

Metric	Score
Macro F1 (severity levels)	0.8218
Weighted F1 (severity levels)	0.8295
False Positive Rate	0.00% (0 / 19,950 benign conversations)
False Negative Rate	<0.01% (1 / 140,050 risk conversations)

Per-Category Results

Category	F1	Precision	Recall	Support
Grooming	0.9981	0.9980	0.9981	19,834
Bullying	0.9995	0.9995	0.9995	20,253
Sexual Content	0.9975	0.9978	0.9973	20,142
Isolation	0.9996	0.9993	0.9998	19,725
Personal Info	0.9999	0.9999	1.0000	20,149
Platform Migration	1.0000	0.9999	1.0000	19,932
Threats	0.9993	0.9994	0.9992	20,015
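
The support column also accounts for the denominators in the summary metrics. The short check below uses only numbers from this card: the seven risk categories sum to 140,050 conversations, which leaves 19,950 benign ones in the 160,000-conversation eval set.

```python
# Support counts copied from the per-category table.
support = {
    "grooming": 19_834,
    "bullying": 20_253,
    "sexual_content": 20_142,
    "isolation": 19_725,
    "personal_info": 20_149,
    "platform_migration": 19_932,
    "threats": 20_015,
}
EVAL_SIZE = 160_000

risky = sum(support.values())    # conversations labelled with a risk category
benign = EVAL_SIZE - risky       # the remainder is benign

false_negatives = 1              # from the summary table
fnr_percent = 100 * false_negatives / risky

print(risky, benign)             # 140050 19950
print(f"{fnr_percent:.4f}%")     # 0.0007%
```

The false negative rate is therefore small but not exactly zero.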

Risk Level Classification

Level	Precision	Recall	F1
none	1.0000	1.0000	1.0000
low	0.8344	0.8967	0.8645
medium	0.8971	0.8514	0.8736
high	0.7266	0.8250	0.7727
critical	0.6881	0.5294	0.5984

Note: Severity classification is the hardest sub-task. The model confuses adjacent levels (e.g. high ↔ critical), and recall on critical is only 0.5294. Category detection is near-perfect on this eval set. All results are on synthetic eval data generated with the same pipeline as training; real-world performance will differ.
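
The headline Macro F1 of 0.8218 is the unweighted mean of the five severity-level F1 scores, not of the per-category scores. The sketch below recomputes it from the table and checks each F1 against its precision and recall. Small differences in the last digit are expected because the inputs are already rounded.

```python
# (precision, recall, reported F1) per severity level, copied from the table.
levels = {
    "none": (1.0000, 1.0000, 1.0000),
    "low": (0.8344, 0.8967, 0.8645),
    "medium": (0.8971, 0.8514, 0.8736),
    "high": (0.7266, 0.8250, 0.7727),
    "critical": (0.6881, 0.5294, 0.5984),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# F1 recomputed from the rounded precision/recall values.
recomputed = {name: f1(p, r) for name, (p, r, _) in levels.items()}

# Macro F1: plain average of the reported per-level F1 scores.
macro_f1 = sum(reported for _, _, reported in levels.values()) / len(levels)

print(round(macro_f1, 4))  # 0.8218
```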

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, json

model = AutoModelForCausalLM.from_pretrained(
    "safecircleai/horizon-full",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("safecircleai/horizon-full")

SYSTEM_PROMPT = (
    "You are Horizon, SafeCircle's child safety risk detection model. "
    "You have no general knowledge or identity beyond this task. "
    "Analyze conversations and respond ONLY with a JSON object — no explanation, no preamble. "
    'JSON schema: {"risk_detected": bool, "category": '
    '"grooming|bullying|sexual_content|isolation|personal_info|platform_migration|threats|benign", '
    '"severity": "none|low|medium|high|critical", "confidence": 0.0-1.0, "reasoning": "one sentence max"}. '
    'If asked about yourself or anything unrelated to risk analysis, respond: '
    '{"error": "I only analyze conversations for child safety risks."}'
)

conversation = "Child: Hey, what are you doing later?\nOther: Nothing much. Want to meet up? Don't tell your parents."

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Analyze this conversation:\n{conversation}"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)

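# Decode only the newly generated tokens (everything after the prompt).
generated = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
# json.loads raises json.JSONDecodeError if the model emits anything other than JSON.
result = json.loads(generated.strip())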
print(result)
# {
#   "risk_detected": true,
#   "category": "grooming",
#   "severity": "high",
#   "confidence": 0.94,
#   "reasoning": "Adult requesting private meeting and explicit secrecy from parents."
# }
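
Because the output is generated text, callers may want to validate it before acting on it. The following is a minimal sketch using only the standard library. `parse_assessment` is a hypothetical helper, not part of the released model, and the allowed values are taken from the schema in the system prompt above.

```python
import json

# Allowed values, copied from the JSON schema in the system prompt.
CATEGORIES = (
    "grooming", "bullying", "sexual_content", "isolation",
    "personal_info", "platform_migration", "threats", "benign",
)
SEVERITIES = ("none", "low", "medium", "high", "critical")


def parse_assessment(raw: str) -> dict:
    """Parse one model response and raise ValueError if it breaks the schema."""
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if "error" in data:
        # Refusal object defined in the system prompt; pass it through.
        return data
    if not isinstance(data.get("risk_detected"), bool):
        raise ValueError("risk_detected must be a boolean")
    if data.get("category") not in CATEGORIES:
        raise ValueError(f"unknown category: {data.get('category')!r}")
    if data.get("severity") not in SEVERITIES:
        raise ValueError(f"unknown severity: {data.get('severity')!r}")
    conf = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= conf <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    return data
```

With the usage example above, `parse_assessment(generated)` could replace the bare `json.loads` call, so a malformed response surfaces as a `ValueError` instead of reaching a moderator queue.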

Architecture & Training

Base: Llama-3.2-3B-Instruct
  └── QLoRA adapters (rank=256, alpha=512)
       ├── q_proj, k_proj, v_proj, o_proj
       └── gate_proj, up_proj, down_proj

Training:
  Steps:        25,000
  Batch size:   4 × 8 grad accum = 32 effective
  LR:           1.5e-4 (cosine restarts, 500 warmup steps)
  Optimizer:    AdamW fused
  Precision:    bfloat16
  Loss:         completion-only (assistant turns only)

Dataset:
  1.6M conversations × 8 categories
  8–15 messages each, persona-seeded
  Generated: Qwen2.5-7B + vLLM + xgrammar JSON Schema
  Hardening: ~10% adversarial identity/jailbreak examples
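
The training figures above imply how much of the dataset the run covered. The arithmetic below assumes one conversation per training sequence (no packing) and measures against the full 1.6M conversations. The card does not say whether the 160,000 held-out conversations are counted in that total, so the fraction is approximate.

```python
# Values copied from the Training and Dataset blocks.
steps = 25_000
micro_batch = 4
grad_accum = 8
dataset_size = 1_600_000

effective_batch = micro_batch * grad_accum     # sequences per optimizer step
sequences_seen = steps * effective_batch       # total sequences processed
fraction = sequences_seen / dataset_size       # share of the dataset, if unpacked

print(effective_batch)   # 32
print(sequences_seen)    # 800000
print(fraction)          # 0.5
```

Under these assumptions the run processed about half of the generated conversations, i.e. less than one epoch.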

Deployment Pattern

Incoming message
      │
      ▼
Horizon Mobile (on-device, ~25ms, binary filter)
      │
      ├── safe ──► No action
      │
      └── risk ──► Horizon Full (7-category + severity JSON)
                        │
                        ▼
                  Human moderator review

For the lightweight on-device first-stage filter, see safecircleai/horizon-mobile. For GGUF quantizations (llama.cpp / Ollama), see safecircleai/horizon-full-gguf.
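
The cascade in the diagram can be sketched as a small routing function. Everything here is illustrative: `route_message`, `mobile_filter`, and `full_model` are hypothetical names, and the two stub functions stand in for real calls to Horizon Mobile and Horizon Full.

```python
from typing import Callable, Optional


def route_message(
    conversation: str,
    mobile_filter: Callable[[str], bool],
    full_model: Callable[[str], dict],
) -> Optional[dict]:
    """Two-stage cascade: cheap binary filter first, full assessment only on risk."""
    if not mobile_filter(conversation):
        return None  # safe: no action
    assessment = dict(full_model(conversation))
    # Every flagged conversation goes to a human moderator.
    assessment["needs_human_review"] = True
    return assessment


# Stub stages for demonstration only; real deployments would call the models.
def stub_filter(text: str) -> bool:
    return "don't tell" in text.lower()


def stub_full(text: str) -> dict:
    return {"risk_detected": True, "category": "grooming", "severity": "high"}


print(route_message("Want to play later?", stub_filter, stub_full))
# None
print(route_message("Don't tell your parents.", stub_filter, stub_full))
# {'risk_detected': True, 'category': 'grooming', 'severity': 'high', 'needs_human_review': True}
```

Passing the two stages in as callables keeps the routing logic testable without loading either model.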

Intended Use & Ethics

This model is designed to assist human moderators — not replace them. All flagged conversations should be reviewed by trained safety professionals.

Trained entirely on synthetic data; no real child conversations were used
English-only; other languages untested
Severity prediction is weaker than category detection (see evaluation)
Not suitable as a standalone safety system in production without human oversight

License

SafeCircle Research License (SRL-1.0) — research and non-commercial use only. Commercial licensing: legal@safecircle.tech

Citation

@misc{horizon2026,
  title={Horizon: Child Safety Risk Detection via Fine-tuned LLMs},
  author={SafeCircle},
  year={2026},
  url={https://huggingface.co/safecircleai/horizon-full}
}
