---
title: UMSR Autonomous Trainer
emoji: 🔄
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 6.7.0
app_file: app.py
pinned: false
---

# UMSR Autonomous Trainer

This Space runs autonomous training cycles for the UMSR Reasoner.

The trainer defaults to teacher-student self-distillation, using `NorthernTribe-Research/UMSR-Reasoner-7B` as both student and teacher (`UMSR_DISTILL_ENABLED=true`). Student updates are QLoRA-style (`UMSR_USE_4BIT=true`, `UMSR_LORA_ENABLED=true`), which keeps 7B runs feasible on common Space GPUs.
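The distillation objective itself lives in the trainer code; as a rough, pure-Python sketch of the temperature-scaled knowledge-distillation blend these settings control (function and variable names here are illustrative, not the trainer's API):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def combined_loss(ce, kd, ce_weight, kd_weight):
    """Blend hard-label cross-entropy with the distillation term,
    mirroring the UMSR_CE_WEIGHT_*/UMSR_KD_WEIGHT_* pairs."""
    return ce_weight * ce + kd_weight * kd
```

When the student matches the teacher exactly, the KD term vanishes and the combined loss reduces to weighted cross-entropy alone.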

By default, external models are blocked (`UMSR_ENFORCE_INHOUSE_MODELS=true`) to keep runs independent of third-party base checkpoints.

During each run, the worker emits real-time training telemetry:

- `live_progress.json` for current step/epoch/loss/KD state
- `live_events.jsonl` for append-only training events
- `system_snapshot.json` for the native-runtime OS/binary preflight
- `state/runtime_audit.jsonl` for scheduler/control/run lifecycle audit events
- runtime preflight logs covering CUDA availability, CUDA device count, and the primary GPU name

The Space dashboard reads these files directly for near real-time updates.
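A minimal sketch of how a dashboard might poll these files (the filenames match the list above; the helper names and directory layout are assumptions):

```python
import json
from pathlib import Path

def read_progress(run_dir):
    """Return the latest live_progress.json snapshot, or None if the
    worker has not written one yet."""
    path = Path(run_dir) / "live_progress.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())

def tail_events(run_dir, limit=20):
    """Return the last `limit` events from the append-only JSONL log.
    Re-reading the tail each poll keeps the reader stateless."""
    path = Path(run_dir) / "live_events.jsonl"
    if not path.exists():
        return []
    lines = path.read_text().splitlines()
    return [json.loads(line) for line in lines[-limit:] if line.strip()]
```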

Checkpoint continuity is enabled by default:

- every cycle keeps multiple checkpoints (`UMSR_SAVE_TOTAL_LIMIT`)
- new cycles auto-resume from the latest retained checkpoint across prior runs when `UMSR_RESUME_FROM_CHECKPOINT=auto`
- stale in-memory runtime state is recovered on restart so autonomous scheduling can continue safely
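Auto-resume amounts to picking the highest-step checkpoint directory; a sketch assuming the `checkpoint-<step>` naming convention used by Hugging Face Trainer:

```python
import re
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the Path of the highest-step checkpoint-<N> directory
    under output_dir, or None if no checkpoint is retained."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare numerically, not lexically: checkpoint-100 > checkpoint-25.
    return max(candidates)[1]
```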

## Native Trainer Mode

The Space can run in native runtime mode with machine-level preflight checks:

- `UMSR_NATIVE_TRAINER_MODE=true` enables runtime OS and binary validation
- `UMSR_REQUIRED_BINS=bash,python3,git,curl` defines the required system tools
- `UMSR_NATIVE_STRICT_MODE=true` fails fast when required binaries are missing

When native mode is enabled, each run records a system snapshot and exposes it in Runtime Details.
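The preflight check boils down to verifying each required binary is on `PATH` and recording a small OS snapshot; a minimal sketch (the snapshot keys are illustrative, not the exact `system_snapshot.json` schema):

```python
import platform
import shutil

def preflight(required_bins, strict=False):
    """Check required system binaries and collect an OS snapshot.
    In strict mode, missing binaries fail fast with an exception."""
    missing = [b for b in required_bins if shutil.which(b) is None]
    snapshot = {
        "os": platform.system(),
        "release": platform.release(),
        "missing_bins": missing,
    }
    if strict and missing:
        raise RuntimeError(f"missing required binaries: {missing}")
    return snapshot
```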

## 24/7 Autonomous Mode

To keep the trainer running continuously:

1. Use upgraded Space hardware.
2. Set the Space sleep time to -1 (never sleep).
3. Set the secret `HF_TOKEN` with write access to the model repo.
4. Set:
   - `UMSR_AUTONOMOUS_ENABLED=true`
   - `UMSR_AUTO_START=true`
   - `UMSR_CONTINUOUS_MODE=true`
   - `UMSR_NEVER_STOP_MODE=true`
   - `UMSR_FORCE_AUTONOMOUS_ON_BOOT=true`
   - `UMSR_CYCLE_COOLDOWN_SECONDS=30`
   - `UMSR_LOOP_POLL_SECONDS=15`
   - `UMSR_SCHEDULER_WATCHDOG_SECONDS=30`

If `UMSR_CONTINUOUS_MODE=false`, the trainer falls back to interval scheduling using `UMSR_INTERVAL_MINUTES`.
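The continuous-versus-interval decision can be sketched as a single function of the environment (the defaults below mirror the Default Runtime Variables; the function name is illustrative):

```python
import os

def next_delay_seconds(env=None):
    """How long the scheduler waits before the next cycle: a short
    cooldown in continuous mode, otherwise the configured interval."""
    env = os.environ if env is None else env
    continuous = env.get("UMSR_CONTINUOUS_MODE", "true").lower() == "true"
    if continuous:
        return int(env.get("UMSR_CYCLE_COOLDOWN_SECONDS", "30"))
    return int(env.get("UMSR_INTERVAL_MINUTES", "360")) * 60
```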

Failure recovery defaults are enabled:

- exponential retry backoff after failures
- scheduler self-healing if the control loop exits unexpectedly
- autonomous mode auto-restored by the watchdog when never-stop mode is enabled
- manual counter reset from the UI (Reset Failures)
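Exponential backoff with the documented base and cap (`UMSR_FAILURE_BACKOFF_BASE_SECONDS`, `UMSR_FAILURE_BACKOFF_MAX_SECONDS`) can be sketched as; the exact doubling formula is an assumption:

```python
def backoff_seconds(failure_count, base=60, cap=3600):
    """Exponential retry backoff: base * 2^(failures - 1), capped.
    Defaults match UMSR_FAILURE_BACKOFF_BASE_SECONDS=60 and
    UMSR_FAILURE_BACKOFF_MAX_SECONDS=3600."""
    if failure_count <= 0:
        return 0
    return min(base * 2 ** (failure_count - 1), cap)
```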

## Default Runtime Variables

- `UMSR_DATASET_ID=NorthernTribe-Research/UMSR-v1`
- `UMSR_MODEL_REPO_ID=NorthernTribe-Research/UMSR-Reasoner-7B`
- `UMSR_BASE_MODEL=NorthernTribe-Research/UMSR-Reasoner-7B`
- `UMSR_TEACHER_MODEL=NorthernTribe-Research/UMSR-Reasoner-7B`
- `UMSR_MODEL_DTYPE=bfloat16`
- `UMSR_TEACHER_DTYPE=bfloat16`
- `UMSR_ATTN_IMPLEMENTATION=`
- `UMSR_DISTILL_ENABLED=true`
- `UMSR_ENFORCE_INHOUSE_MODELS=true`
- `UMSR_USE_4BIT=true`
- `UMSR_USE_4BIT_TEACHER=true`
- `UMSR_LORA_ENABLED=true`
- `UMSR_LORA_R=32`
- `UMSR_LORA_ALPHA=64`
- `UMSR_LORA_DROPOUT=0.05`
- `UMSR_LORA_TARGET_MODULES=q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`
- `UMSR_MAX_TRAIN_SAMPLES=256`
- `UMSR_MAX_EVAL_SAMPLES=64`
- `UMSR_NUM_TRAIN_EPOCHS=1`
- `UMSR_LEARNING_RATE=1e-4`
- `UMSR_WEIGHT_DECAY=0.0`
- `UMSR_WARMUP_RATIO=0.03`
- `UMSR_WARMUP_STEPS=0`
- `UMSR_BATCH_SIZE=1`
- `UMSR_EVAL_BATCH_SIZE=1`
- `UMSR_GRAD_ACCUM=1`
- `UMSR_LOGGING_STEPS=1`
- `UMSR_EVAL_STEPS=25`
- `UMSR_SAVE_STEPS=25`
- `UMSR_SAVE_TOTAL_LIMIT=4`
- `UMSR_TEMPERATURE_START=2.5`
- `UMSR_TEMPERATURE_END=1.2`
- `UMSR_CE_WEIGHT_START=0.35`
- `UMSR_CE_WEIGHT_END=0.5`
- `UMSR_KD_WEIGHT_START=0.65`
- `UMSR_KD_WEIGHT_END=0.5`
- `UMSR_INTERVAL_MINUTES=360`
- `UMSR_CONTINUOUS_MODE=true`
- `UMSR_CYCLE_COOLDOWN_SECONDS=30`
- `UMSR_FAILURE_PAUSE_THRESHOLD=8`
- `UMSR_FAILURE_BACKOFF_BASE_SECONDS=60`
- `UMSR_FAILURE_BACKOFF_MAX_SECONDS=3600`
- `UMSR_LOOP_POLL_SECONDS=15`
- `UMSR_NEVER_STOP_MODE=true`
- `UMSR_FORCE_AUTONOMOUS_ON_BOOT=true`
- `UMSR_SCHEDULER_WATCHDOG_SECONDS=30`
- `UMSR_UI_REFRESH_SECONDS=5`
- `UMSR_LOG_REFRESH_SECONDS=0.4`
- `UMSR_CLIENT_POLL_MS=400`
- `UMSR_USE_TIMER_REFRESH=false`
- `UMSR_AUDIT_MAX_BYTES=8000000`
- `UMSR_AUDIT_RETAIN_LINES=50000`
- `UMSR_NATIVE_TRAINER_MODE=true`
- `UMSR_NATIVE_STRICT_MODE=false`
- `UMSR_REQUIRED_BINS=bash,python3,git,curl`
- `UMSR_GRADIENT_CHECKPOINTING=true`
- `UMSR_RESUME_FROM_CHECKPOINT=auto`
- `UMSR_TIE_WORD_EMBEDDINGS=false`
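The paired `*_START`/`*_END` values (temperature, CE weight, KD weight) imply a per-step schedule. A minimal sketch, assuming linear interpolation over training progress (the actual schedule shape in `app.py` may differ):

```python
def anneal(start, end, step, total_steps):
    """Linearly interpolate a schedule value from `start` at step 0
    to `end` at `total_steps`, clamped to that range."""
    if total_steps <= 0:
        return end
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac
```

For example, with `UMSR_TEMPERATURE_START=2.5` and `UMSR_TEMPERATURE_END=1.2`, the distillation temperature falls from 2.5 toward 1.2 as training progresses.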

## Notes

- `scripts/publish_space_hf.py` defaults to the GPU hardware `t4-small`.
- True 24/7 operation still requires a paid upgraded runtime with `sleep_time=-1`.
- Spaces run in managed Linux containers; for a fully custom Ubuntu host profile, use a Docker Space configuration with paid hardware.
- If the configured LoRA targets are not present in the loaded model, the trainer auto-detects compatible targets (for example, GPT-style `c_attn,c_proj,c_fc`).
- Runtime Logs are line-streamed with client polling by default; enable `UMSR_USE_TIMER_REFRESH=true` only if your runtime requires server timer ticks.
- UI updates are automatic; no manual refresh is required during normal operation.
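The LoRA target auto-detection mentioned in the notes can be sketched as matching configured target names against the model's leaf module names, then falling back to known families; the function name and fallback table below are illustrative, not the trainer's actual API:

```python
def detect_lora_targets(module_names, preferred, fallbacks):
    """Return the configured LoRA targets if the model contains them,
    otherwise the first fallback family with any matching modules."""
    # Compare against leaf names, e.g. "transformer.h.0.attn.c_attn" -> "c_attn".
    leaves = {name.rsplit(".", 1)[-1] for name in module_names}
    present = [t for t in preferred if t in leaves]
    if present:
        return present
    for family in fallbacks:
        hits = [t for t in family if t in leaves]
        if hits:
            return hits
    return []
```

With a GPT-style module list, the LLaMA-style defaults (`q_proj`, `k_proj`, ...) find no matches, and detection falls through to the `c_attn,c_proj,c_fc` family instead.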