---
title: UMSR Autonomous Trainer
emoji: 🔄
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 6.7.0
app_file: app.py
pinned: false
---
# UMSR Autonomous Trainer
This Space runs autonomous training cycles for UMSR Reasoner using:

- Dataset: https://huggingface.co/datasets/NorthernTribe-Research/UMSR-v1
- Target model repo: configurable via `UMSR_MODEL_REPO_ID`
- Main model tree root: https://huggingface.co/NorthernTribe-Research/UMSR-Reasoner-7B
- Quantized profiles:
The trainer defaults to teacher-student self-distillation using `NorthernTribe-Research/UMSR-Reasoner-7B` as both student and teacher (`UMSR_DISTILL_ENABLED=true`) with QLoRA-style student updates (`UMSR_USE_4BIT=true`, `UMSR_LORA_ENABLED=true`), so 7B runs remain feasible on common Space GPUs.

By default, external models are blocked (`UMSR_ENFORCE_INHOUSE_MODELS=true`) to keep runs independent from third-party base checkpoints.
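The distillation defaults below (`UMSR_TEMPERATURE_*`, `UMSR_CE_WEIGHT_*`, `UMSR_KD_WEIGHT_*`) describe start/end values annealed over a run. A minimal sketch of such a schedule, assuming linear interpolation between the endpoints (the function name and the exact curve are illustrative, not the Space's actual implementation):

```python
def kd_schedule(step, total_steps, temp_start=2.5, temp_end=1.2,
                ce_start=0.35, ce_end=0.5):
    """Anneal distillation temperature and loss weights over training.

    Defaults mirror UMSR_TEMPERATURE_START/END and UMSR_CE_WEIGHT_START/END;
    linear interpolation is an assumption about the schedule shape.
    """
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    temp = temp_start + (temp_end - temp_start) * frac
    ce_w = ce_start + (ce_end - ce_start) * frac
    kd_w = 1.0 - ce_w  # KD weight complements CE (0.65 -> 0.5 by default)
    return temp, ce_w, kd_w
```

The combined objective would then blend `ce_w * CE + kd_w * T^2 * KL(student/T || teacher/T)`, consistent with the complementary default weights (CE 0.35 → 0.5 against KD 0.65 → 0.5).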
During each run, the worker emits real-time training telemetry:
- `live_progress.json` for current step/epoch/loss/KD state
- `live_events.jsonl` for append-only training events
- `system_snapshot.json` for native runtime OS/binary preflight
- `state/runtime_audit.jsonl` for scheduler/control/run lifecycle audit events
- runtime preflight logs for `CUDA available`, `CUDA device count`, and primary GPU name
The Space dashboard reads these files directly for near real-time updates.
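An external monitor can consume these files with plain JSON parsing. A minimal sketch, assuming both files live in a single run directory (the helper name and field names are hypothetical):

```python
import json
from pathlib import Path

def read_telemetry(run_dir):
    """Read worker telemetry: one JSON snapshot plus an append-only JSONL log.

    File names follow the README; the parsing shown here is a sketch, not
    the Space dashboard's actual reader.
    """
    run_dir = Path(run_dir)
    progress_file = run_dir / "live_progress.json"
    events_file = run_dir / "live_events.jsonl"

    progress = {}
    if progress_file.exists():
        progress = json.loads(progress_file.read_text())

    events = []
    if events_file.exists():
        # JSONL: one event object per non-empty line, append-only
        for line in events_file.read_text().splitlines():
            if line.strip():
                events.append(json.loads(line))
    return progress, events
```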
Checkpoint continuity is enabled by default:
- every cycle keeps multiple checkpoints (`UMSR_SAVE_TOTAL_LIMIT`)
- new cycles auto-resume from the latest retained checkpoint across prior runs when `UMSR_RESUME_FROM_CHECKPOINT=auto`
- stale in-memory runtime state is recovered on restart so autonomous scheduling can continue safely
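Auto-resume amounts to picking the highest-numbered retained checkpoint. A sketch assuming the Hugging Face Trainer's `checkpoint-<step>` directory convention (the helper itself is illustrative):

```python
import re
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the newest retained `checkpoint-<step>` directory, or None.

    Approximates UMSR_RESUME_FROM_CHECKPOINT=auto behaviour; directory
    naming follows the Hugging Face Trainer convention.
    """
    candidates = []
    for p in Path(output_dir).glob("checkpoint-*"):
        m = re.fullmatch(r"checkpoint-(\d+)", p.name)
        if p.is_dir() and m:
            candidates.append((int(m.group(1)), p))
    if not candidates:
        return None
    return max(candidates)[1]  # highest global step wins
```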
## Native Trainer Mode
The Space can run in native runtime mode with machine-level preflight checks:
- `UMSR_NATIVE_TRAINER_MODE=true` enables runtime OS and binary validation
- `UMSR_REQUIRED_BINS=bash,python3,git,curl` defines required system tools
- `UMSR_NATIVE_STRICT_MODE=true` fails fast when required binaries are missing
When native mode is enabled, each run records a system snapshot and exposes it in Runtime Details.
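This kind of preflight can be approximated with standard-library checks. A sketch, where the snapshot schema is illustrative rather than the Space's actual `system_snapshot.json` format:

```python
import platform
import shutil

def preflight(required_bins=("bash", "python3", "git", "curl"), strict=False):
    """Machine-level preflight: record OS info and verify required binaries.

    Mirrors UMSR_REQUIRED_BINS / UMSR_NATIVE_STRICT_MODE semantics;
    the snapshot keys here are assumptions.
    """
    missing = [b for b in required_bins if shutil.which(b) is None]
    snapshot = {
        "os": platform.system(),
        "release": platform.release(),
        "required_bins": list(required_bins),
        "missing_bins": missing,
    }
    if strict and missing:
        # Strict mode fails fast, matching UMSR_NATIVE_STRICT_MODE=true
        raise RuntimeError(f"missing required binaries: {missing}")
    return snapshot
```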
## 24/7 Autonomous Mode
To keep the trainer running continuously:
- Use upgraded Space hardware.
- Configure sleep time to `-1` (no sleep).
- Set secret `HF_TOKEN` with write access to the model repo.
- Set:
  - `UMSR_AUTONOMOUS_ENABLED=true`
  - `UMSR_AUTO_START=true`
  - `UMSR_CONTINUOUS_MODE=true`
  - `UMSR_NEVER_STOP_MODE=true`
  - `UMSR_FORCE_AUTONOMOUS_ON_BOOT=true`
  - `UMSR_CYCLE_COOLDOWN_SECONDS=30`
  - `UMSR_LOOP_POLL_SECONDS=15`
  - `UMSR_SCHEDULER_WATCHDOG_SECONDS=30`
If `UMSR_CONTINUOUS_MODE=false`, the trainer falls back to interval scheduling using `UMSR_INTERVAL_MINUTES`.
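The two scheduling modes reduce to a simple wait computation between cycles; an illustrative sketch (the function is hypothetical and omits the scheduler's other state):

```python
def next_run_delay(continuous, interval_minutes=360, cooldown_seconds=30):
    """Seconds to wait before starting the next training cycle.

    Continuous mode chains cycles with only UMSR_CYCLE_COOLDOWN_SECONDS
    between them; otherwise scheduling falls back to UMSR_INTERVAL_MINUTES.
    """
    if continuous:
        return cooldown_seconds
    return interval_minutes * 60
```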
Failure recovery defaults are enabled:
- exponential retry backoff after failures
- scheduler self-healing if control loop exits unexpectedly
- autonomous mode auto-restored by watchdog when never-stop mode is enabled
- manual counter reset from the UI (`Reset Failures`)
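The retry backoff can be sketched as doubling from `UMSR_FAILURE_BACKOFF_BASE_SECONDS` up to `UMSR_FAILURE_BACKOFF_MAX_SECONDS`; the doubling-per-failure growth curve is an assumption about the actual implementation:

```python
def failure_backoff(consecutive_failures, base=60, cap=3600):
    """Exponential retry backoff in seconds after repeated cycle failures.

    Defaults match UMSR_FAILURE_BACKOFF_BASE_SECONDS=60 and
    UMSR_FAILURE_BACKOFF_MAX_SECONDS=3600; the exact curve is assumed.
    """
    if consecutive_failures <= 0:
        return 0
    # 60s, 120s, 240s, ... capped at one hour
    return min(base * 2 ** (consecutive_failures - 1), cap)
```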
## Default Runtime Variables
- `UMSR_DATASET_ID=NorthernTribe-Research/UMSR-v1`
- `UMSR_MODEL_REPO_ID=NorthernTribe-Research/UMSR-Reasoner-7B`
- `UMSR_BASE_MODEL=NorthernTribe-Research/UMSR-Reasoner-7B`
- `UMSR_TEACHER_MODEL=NorthernTribe-Research/UMSR-Reasoner-7B`
- `UMSR_MODEL_DTYPE=bfloat16`
- `UMSR_TEACHER_DTYPE=bfloat16`
- `UMSR_ATTN_IMPLEMENTATION=`
- `UMSR_DISTILL_ENABLED=true`
- `UMSR_ENFORCE_INHOUSE_MODELS=true`
- `UMSR_USE_4BIT=true`
- `UMSR_USE_4BIT_TEACHER=true`
- `UMSR_LORA_ENABLED=true`
- `UMSR_LORA_R=32`
- `UMSR_LORA_ALPHA=64`
- `UMSR_LORA_DROPOUT=0.05`
- `UMSR_LORA_TARGET_MODULES=q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`
- `UMSR_MAX_TRAIN_SAMPLES=256`
- `UMSR_MAX_EVAL_SAMPLES=64`
- `UMSR_NUM_TRAIN_EPOCHS=1`
- `UMSR_LEARNING_RATE=1e-4`
- `UMSR_WEIGHT_DECAY=0.0`
- `UMSR_WARMUP_RATIO=0.03`
- `UMSR_WARMUP_STEPS=0`
- `UMSR_BATCH_SIZE=1`
- `UMSR_EVAL_BATCH_SIZE=1`
- `UMSR_GRAD_ACCUM=1`
- `UMSR_LOGGING_STEPS=1`
- `UMSR_EVAL_STEPS=25`
- `UMSR_SAVE_STEPS=25`
- `UMSR_SAVE_TOTAL_LIMIT=4`
- `UMSR_TEMPERATURE_START=2.5`
- `UMSR_TEMPERATURE_END=1.2`
- `UMSR_CE_WEIGHT_START=0.35`
- `UMSR_CE_WEIGHT_END=0.5`
- `UMSR_KD_WEIGHT_START=0.65`
- `UMSR_KD_WEIGHT_END=0.5`
- `UMSR_INTERVAL_MINUTES=360`
- `UMSR_CONTINUOUS_MODE=true`
- `UMSR_CYCLE_COOLDOWN_SECONDS=30`
- `UMSR_FAILURE_PAUSE_THRESHOLD=8`
- `UMSR_FAILURE_BACKOFF_BASE_SECONDS=60`
- `UMSR_FAILURE_BACKOFF_MAX_SECONDS=3600`
- `UMSR_LOOP_POLL_SECONDS=15`
- `UMSR_NEVER_STOP_MODE=true`
- `UMSR_FORCE_AUTONOMOUS_ON_BOOT=true`
- `UMSR_SCHEDULER_WATCHDOG_SECONDS=30`
- `UMSR_UI_REFRESH_SECONDS=5`
- `UMSR_LOG_REFRESH_SECONDS=0.4`
- `UMSR_CLIENT_POLL_MS=400`
- `UMSR_USE_TIMER_REFRESH=false`
- `UMSR_AUDIT_MAX_BYTES=8000000`
- `UMSR_AUDIT_RETAIN_LINES=50000`
- `UMSR_NATIVE_TRAINER_MODE=true`
- `UMSR_NATIVE_STRICT_MODE=false`
- `UMSR_REQUIRED_BINS=bash,python3,git,curl`
- `UMSR_GRADIENT_CHECKPOINTING=true`
- `UMSR_RESUME_FROM_CHECKPOINT=auto`
- `UMSR_TIE_WORD_EMBEDDINGS=false`
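Variables like these are typically read with small coercion helpers; a sketch (the helper names are illustrative, not the Space's actual code):

```python
import os

def env_flag(name, default=False):
    """Parse a boolean UMSR_* variable; accepted truthy forms are assumed."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}

def env_int(name, default):
    """Parse an integer UMSR_* variable, falling back to the default."""
    raw = os.environ.get(name)
    return int(raw) if raw not in (None, "") else default
```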
## Notes
- `scripts/publish_space_hf.py` defaults to GPU hardware `t4-small`.
- True 24/7 operation still requires paid upgraded runtime with `sleep_time=-1`.
- Spaces run in managed Linux containers; for a fully custom Ubuntu host profile, use Docker Space configuration with paid hardware.
- If configured LoRA targets are not present in the loaded model, the trainer auto-detects compatible targets (for example GPT-style `c_attn`, `c_proj`, `c_fc`).
- Runtime Logs are line-streamed with client polling by default; enable `UMSR_USE_TIMER_REFRESH=true` only if your runtime requires server timer ticks.
- UI updates are automatic; no manual refresh action is required during normal operation.