---
title: UMSR Autonomous Trainer
emoji: 🔄
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 6.7.0
app_file: app.py
pinned: false
---

# UMSR Autonomous Trainer

This Space runs autonomous training cycles for the UMSR Reasoner.

The trainer defaults to teacher-student self-distillation, using `NorthernTribe-Research/UMSR-Reasoner-7B` as both student and teacher (`UMSR_DISTILL_ENABLED=true`). Student updates are QLoRA-style (`UMSR_USE_4BIT=true`, `UMSR_LORA_ENABLED=true`), which keeps 7B runs feasible on common Space GPUs.
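The distillation objective itself lives in the trainer code; as a rough, pure-Python sketch of the temperature-scaled knowledge-distillation blend these settings control (function and variable names here are illustrative, not the trainer's API):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def combined_loss(ce, kd, ce_weight, kd_weight):
    """Blend hard-label cross-entropy with the distillation term,
    mirroring the UMSR_CE_WEIGHT_*/UMSR_KD_WEIGHT_* pairs."""
    return ce_weight * ce + kd_weight * kd
```

When the student matches the teacher exactly, the KD term vanishes and the combined loss reduces to weighted cross-entropy alone.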

By default, external models are blocked (`UMSR_ENFORCE_INHOUSE_MODELS=true`) to keep runs independent of third-party base checkpoints.

During each run, the worker emits real-time training telemetry:

- `live_progress.json` for current step/epoch/loss/KD state
- `live_events.jsonl` for append-only training events
- `system_snapshot.json` for the native-runtime OS/binary preflight
- `state/runtime_audit.jsonl` for scheduler/control/run lifecycle audit events
- runtime preflight logs covering CUDA availability, CUDA device count, and the primary GPU name

The Space dashboard reads these files directly for near real-time updates.
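A minimal sketch of how a dashboard might poll these files (the filenames match the list above; the helper names and directory layout are assumptions):

```python
import json
from pathlib import Path

def read_progress(run_dir):
    """Return the latest live_progress.json snapshot, or None if the
    worker has not written one yet."""
    path = Path(run_dir) / "live_progress.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())

def tail_events(run_dir, limit=20):
    """Return the last `limit` events from the append-only JSONL log.
    Re-reading the tail each poll keeps the reader stateless."""
    path = Path(run_dir) / "live_events.jsonl"
    if not path.exists():
        return []
    lines = path.read_text().splitlines()
    return [json.loads(line) for line in lines[-limit:] if line.strip()]
```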

Checkpoint continuity is enabled by default:

- every cycle keeps multiple checkpoints (`UMSR_SAVE_TOTAL_LIMIT`)
- new cycles auto-resume from the latest retained checkpoint across prior runs when `UMSR_RESUME_FROM_CHECKPOINT=auto`
- stale in-memory runtime state is recovered on restart so autonomous scheduling can continue safely
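Auto-resume amounts to picking the highest-step checkpoint directory; a sketch assuming the `checkpoint-<step>` naming convention used by Hugging Face Trainer:

```python
import re
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the Path of the highest-step checkpoint-<N> directory
    under output_dir, or None if no checkpoint is retained."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare numerically, not lexically: checkpoint-100 > checkpoint-25.
    return max(candidates)[1]
```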

## Native Trainer Mode

The Space can run in native runtime mode with machine-level preflight checks:

- `UMSR_NATIVE_TRAINER_MODE=true` enables runtime OS and binary validation
- `UMSR_REQUIRED_BINS=bash,python3,git,curl` defines the required system tools
- `UMSR_NATIVE_STRICT_MODE=true` fails fast when required binaries are missing

When native mode is enabled, each run records a system snapshot and exposes it in Runtime Details.
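The preflight check boils down to verifying each required binary is on `PATH` and recording a small OS snapshot; a minimal sketch (the snapshot keys are illustrative, not the exact `system_snapshot.json` schema):

```python
import platform
import shutil

def preflight(required_bins, strict=False):
    """Check required system binaries and collect an OS snapshot.
    In strict mode, missing binaries fail fast with an exception."""
    missing = [b for b in required_bins if shutil.which(b) is None]
    snapshot = {
        "os": platform.system(),
        "release": platform.release(),
        "missing_bins": missing,
    }
    if strict and missing:
        raise RuntimeError(f"missing required binaries: {missing}")
    return snapshot
```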

## 24/7 Autonomous Mode

To keep the trainer running continuously:

1. Use upgraded Space hardware.
2. Set the Space sleep time to -1 (never sleep).
3. Set the secret `HF_TOKEN` with write access to the model repo.
4. Set:
   - `UMSR_AUTONOMOUS_ENABLED=true`
   - `UMSR_AUTO_START=true`
   - `UMSR_CONTINUOUS_MODE=true`
   - `UMSR_NEVER_STOP_MODE=true`
   - `UMSR_FORCE_AUTONOMOUS_ON_BOOT=true`
   - `UMSR_CYCLE_COOLDOWN_SECONDS=30`
   - `UMSR_LOOP_POLL_SECONDS=15`
   - `UMSR_SCHEDULER_WATCHDOG_SECONDS=30`

If `UMSR_CONTINUOUS_MODE=false`, the trainer falls back to interval scheduling using `UMSR_INTERVAL_MINUTES`.
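The continuous-versus-interval decision can be sketched as a single function of the environment (the defaults below mirror the Default Runtime Variables; the function name is illustrative):

```python
import os

def next_delay_seconds(env=None):
    """How long the scheduler waits before the next cycle: a short
    cooldown in continuous mode, otherwise the configured interval."""
    env = os.environ if env is None else env
    continuous = env.get("UMSR_CONTINUOUS_MODE", "true").lower() == "true"
    if continuous:
        return int(env.get("UMSR_CYCLE_COOLDOWN_SECONDS", "30"))
    return int(env.get("UMSR_INTERVAL_MINUTES", "360")) * 60
```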

Failure recovery defaults are enabled:

- exponential retry backoff after failures
- scheduler self-healing if the control loop exits unexpectedly
- autonomous mode auto-restored by the watchdog when never-stop mode is enabled
- manual counter reset from the UI (Reset Failures)
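Exponential backoff with the documented base and cap (`UMSR_FAILURE_BACKOFF_BASE_SECONDS`, `UMSR_FAILURE_BACKOFF_MAX_SECONDS`) can be sketched as; the exact doubling formula is an assumption:

```python
def backoff_seconds(failure_count, base=60, cap=3600):
    """Exponential retry backoff: base * 2^(failures - 1), capped.
    Defaults match UMSR_FAILURE_BACKOFF_BASE_SECONDS=60 and
    UMSR_FAILURE_BACKOFF_MAX_SECONDS=3600."""
    if failure_count <= 0:
        return 0
    return min(base * 2 ** (failure_count - 1), cap)
```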

## Default Runtime Variables

- `UMSR_DATASET_ID=NorthernTribe-Research/UMSR-v1`
- `UMSR_MODEL_REPO_ID=NorthernTribe-Research/UMSR-Reasoner-7B`
- `UMSR_BASE_MODEL=NorthernTribe-Research/UMSR-Reasoner-7B`
- `UMSR_TEACHER_MODEL=NorthernTribe-Research/UMSR-Reasoner-7B`
- `UMSR_MODEL_DTYPE=bfloat16`
- `UMSR_TEACHER_DTYPE=bfloat16`
- `UMSR_ATTN_IMPLEMENTATION=`
- `UMSR_DISTILL_ENABLED=true`
- `UMSR_ENFORCE_INHOUSE_MODELS=true`
- `UMSR_USE_4BIT=true`
- `UMSR_USE_4BIT_TEACHER=true`
- `UMSR_LORA_ENABLED=true`
- `UMSR_LORA_R=32`
- `UMSR_LORA_ALPHA=64`
- `UMSR_LORA_DROPOUT=0.05`
- `UMSR_LORA_TARGET_MODULES=q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`
- `UMSR_MAX_TRAIN_SAMPLES=256`
- `UMSR_MAX_EVAL_SAMPLES=64`
- `UMSR_NUM_TRAIN_EPOCHS=1`
- `UMSR_LEARNING_RATE=1e-4`
- `UMSR_WEIGHT_DECAY=0.0`
- `UMSR_WARMUP_RATIO=0.03`
- `UMSR_WARMUP_STEPS=0`
- `UMSR_BATCH_SIZE=1`
- `UMSR_EVAL_BATCH_SIZE=1`
- `UMSR_GRAD_ACCUM=1`
- `UMSR_LOGGING_STEPS=1`
- `UMSR_EVAL_STEPS=25`
- `UMSR_SAVE_STEPS=25`
- `UMSR_SAVE_TOTAL_LIMIT=4`
- `UMSR_TEMPERATURE_START=2.5`
- `UMSR_TEMPERATURE_END=1.2`
- `UMSR_CE_WEIGHT_START=0.35`
- `UMSR_CE_WEIGHT_END=0.5`
- `UMSR_KD_WEIGHT_START=0.65`
- `UMSR_KD_WEIGHT_END=0.5`
- `UMSR_INTERVAL_MINUTES=360`
- `UMSR_CONTINUOUS_MODE=true`
- `UMSR_CYCLE_COOLDOWN_SECONDS=30`
- `UMSR_FAILURE_PAUSE_THRESHOLD=8`
- `UMSR_FAILURE_BACKOFF_BASE_SECONDS=60`
- `UMSR_FAILURE_BACKOFF_MAX_SECONDS=3600`
- `UMSR_LOOP_POLL_SECONDS=15`
- `UMSR_NEVER_STOP_MODE=true`
- `UMSR_FORCE_AUTONOMOUS_ON_BOOT=true`
- `UMSR_SCHEDULER_WATCHDOG_SECONDS=30`
- `UMSR_UI_REFRESH_SECONDS=5`
- `UMSR_LOG_REFRESH_SECONDS=0.4`
- `UMSR_CLIENT_POLL_MS=400`
- `UMSR_USE_TIMER_REFRESH=false`
- `UMSR_AUDIT_MAX_BYTES=8000000`
- `UMSR_AUDIT_RETAIN_LINES=50000`
- `UMSR_NATIVE_TRAINER_MODE=true`
- `UMSR_NATIVE_STRICT_MODE=false`
- `UMSR_REQUIRED_BINS=bash,python3,git,curl`
- `UMSR_GRADIENT_CHECKPOINTING=true`
- `UMSR_RESUME_FROM_CHECKPOINT=auto`
- `UMSR_TIE_WORD_EMBEDDINGS=false`
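The paired `*_START`/`*_END` values (temperature, CE weight, KD weight) imply a per-step schedule. A minimal sketch, assuming linear interpolation over training progress (the actual schedule shape in `app.py` may differ):

```python
def anneal(start, end, step, total_steps):
    """Linearly interpolate a schedule value from `start` at step 0
    to `end` at `total_steps`, clamped to that range."""
    if total_steps <= 0:
        return end
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac
```

For example, with `UMSR_TEMPERATURE_START=2.5` and `UMSR_TEMPERATURE_END=1.2`, the distillation temperature falls from 2.5 toward 1.2 as training progresses.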

## Notes

- `scripts/publish_space_hf.py` defaults to the GPU hardware `t4-small`.
- True 24/7 operation still requires a paid upgraded runtime with `sleep_time=-1`.
- Spaces run in managed Linux containers; for a fully custom Ubuntu host profile, use a Docker Space configuration with paid hardware.
- If the configured LoRA targets are not present in the loaded model, the trainer auto-detects compatible targets (for example, GPT-style `c_attn,c_proj,c_fc`).
- Runtime Logs are line-streamed with client polling by default; enable `UMSR_USE_TIMER_REFRESH=true` only if your runtime requires server timer ticks.
- UI updates are automatic; no manual refresh is required during normal operation.
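The LoRA target auto-detection mentioned in the notes can be sketched as matching configured target names against the model's leaf module names, then falling back to known families; the function name and fallback table below are illustrative, not the trainer's actual API:

```python
def detect_lora_targets(module_names, preferred, fallbacks):
    """Return the configured LoRA targets if the model contains them,
    otherwise the first fallback family with any matching modules."""
    # Compare against leaf names, e.g. "transformer.h.0.attn.c_attn" -> "c_attn".
    leaves = {name.rsplit(".", 1)[-1] for name in module_names}
    present = [t for t in preferred if t in leaves]
    if present:
        return present
    for family in fallbacks:
        hits = [t for t in family if t in leaves]
        if hits:
            return hits
    return []
```

With a GPT-style module list, the LLaMA-style defaults (`q_proj`, `k_proj`, ...) find no matches, and detection falls through to the `c_attn,c_proj,c_fc` family instead.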