# OktoScript Validation Rules Complete reference for validation rules and constraints in OktoScript. --- ## File Structure Validation ### Required Files 1. **okt.yaml** (in project root) - Must exist - Must be valid YAML - Must contain `project` field 2. **Dataset Files** - All paths specified in DATASET block must exist - Files must be readable - Format must match declared format 3. **Model Files** (if using local paths) - Base model path must exist (if local) - Checkpoint paths must exist (if resuming) --- ## Field Validation ### PROJECT Block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | PROJECT | string | ✅ Yes | 1-100 chars, no special chars: `{}[]:"` | ### ENV Block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | accelerator | enum | ❌ No | Must be: `auto`, `cpu`, `gpu`, `tpu` | | min_memory | string | ❌ No | Must be: `"4GB"`, `"8GB"`, `"16GB"`, `"32GB"`, `"64GB"` (quoted, GB suffix required) | | precision | enum | ❌ No | Must be: `auto`, `fp16`, `fp32`, `bf16` | | backend | enum | ❌ No | Must be: `auto`, `oktoseek` | | install_missing | boolean | ❌ No | Must be: `true` or `false` (lowercase) | | platform | enum | ❌ No | Must be: `windows`, `linux`, `mac`, `any` | | network | enum | ❌ No | Must be: `online`, `offline`, `required` | **ENV Validation Rules:** 1. **Memory format validation:** - Must use `GB` suffix (e.g., `"8GB"`, not `"8"` or `"8 GB"`) - Only values: `"4GB"`, `"8GB"`, `"16GB"`, `"32GB"`, `"64GB"` are allowed - Must be quoted string 2. **Accelerator and memory compatibility:** - If `accelerator = "gpu"` and `min_memory < "8GB"` → **warning** (GPU training typically requires at least 8GB RAM) - If `accelerator = "tpu"` → `min_memory` should be at least `"16GB"` (recommended) 3. **Network and export compatibility:** - If `network = "offline"` → export formats like `onnx` or `gguf` are allowed (pre-downloaded models) - If `network = "required"` → engine must verify internet connectivity before proceeding 4. **Backend preferences:** - If `backend = "oktoseek"` → preferred default for OktoSeek ecosystem - If `backend = "auto"` → engine selects best available backend 5. **Auto-installation:** - If `install_missing = true` → engine must attempt auto-setup of missing dependencies - If `install_missing = false` → engine must fail with clear error if dependencies are missing 6. **Default values:** - If ENV block is missing, defaults to: ```okt ENV { accelerator: "auto" min_memory: "8GB" backend: "auto" } ``` 7. **Platform validation:** - If `platform = "windows"` → engine must verify Windows OS - If `platform = "linux"` → engine must verify Linux OS - If `platform = "mac"` → engine must verify macOS - If `platform = "any"` → no platform check required ### DATASET Block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | train | path | ✅ Yes | File/dir must exist, readable | | validation | path | ❌ No | File/dir must exist if specified | | test | path | ❌ No | File/dir must exist if specified | | format | enum | ❌ No | Must be: jsonl, csv, txt, parquet, image+caption, qa, instruction, multimodal | | type | enum | ❌ No | Must be: classification, generation, qa, chat, vision, regression | | language | enum | ❌ No | Must be: en, pt, es, fr, multilingual | | augmentation | array | ❌ No | Each item must be valid augmentation type | | dataset_percent | number | ❌ No | Must be 1-100 (v1.1+) | | mix_datasets | array | ❌ No | Array of {path, weight} objects (v1.1+) | | sampling | enum | ❌ No | Must be: weighted, random (v1.1+) | | shuffle | boolean | ❌ No | true or false (v1.1+) | ### MODEL Block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | base | string | ✅ Yes | Valid model identifier or path | | architecture | enum | ❌ No | Must be: transformer, cnn, rnn, diffusion, vision-transformer, bert, gpt, t5 | | parameters | string | ❌ No | Format: number + (K\|M\|B), e.g., "120M" | | context_window | number | ❌ No | Must be power of 2: 128, 256, 512, 1024, 2048, 4096, 8192 | | precision | enum | ❌ No | Must be: fp32, fp16, int8, int4 | | inherit | string | ❌ No | Must reference existing model name | ### TRAIN Block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | epochs | number | ✅ Yes | > 0 and <= 1000 | | batch_size | number | ✅ Yes | > 0 and <= 1024 | | learning_rate | decimal | ❌ No | > 0 and <= 1.0 | | optimizer | enum | ❌ No | Must be: adam, adamw, sgd, rmsprop, adafactor, lamb | | scheduler | enum | ❌ No | Must be: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, step | | device | enum | ✅ Yes | Must be: cpu, cuda, mps, auto | | gradient_accumulation | number | ❌ No | >= 1 | | early_stopping | boolean | ❌ No | true or false | | checkpoint_steps | number | ❌ No | > 0 | | checkpoint_path | path | ❌ No | Directory must exist if specified | | resume_from_checkpoint | path | ❌ No | Checkpoint must exist if specified | | loss | enum | ❌ No | Must be: cross_entropy, mse, mae, bce, focal, huber, kl_divergence | | weight_decay | decimal | ❌ No | >= 0 and <= 1.0 | | gradient_clip | decimal | ❌ No | > 0 | | warmup_steps | number | ❌ No | >= 0 | | save_strategy | enum | ❌ No | Must be: steps, epoch, no | ### METRICS Block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | Built-in metrics | identifier | ❌ No | Must be valid metric name | | custom | string | ❌ No | Custom metric identifier | **Metric-task compatibility:** - `accuracy`, `precision`, `recall`, `f1`, `confusion_matrix`: Only for classification - `perplexity`: Only for language models - `bleu`, `rouge`: Only for generation/translation - `mae`, `mse`, `rmse`: Only for regression ### EXPORT Block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | format | array | ✅ Yes | Each item must be: gguf, onnx, okm, safetensors, tflite | | path | path | ✅ Yes | Directory must exist or be creatable | | quantization | enum | ❌ No | Must be: int8, int4, fp16, fp32 | | optimize_for | enum | ❌ No | Must be: speed, size, accuracy | **Format-specific requirements:** - `gguf`: Requires quantization - `tflite`: Only for mobile-compatible architectures ### FT_LORA Block (v1.1+) | Field | Type | Required | Constraints | |-------|------|----------|-------------| | base_model | string | ✅ Yes | Valid model identifier or path | | train_dataset | path | ✅ Yes | File/dir must exist if specified | | lora_rank | number | ✅ Yes | > 0 and <= 256 | | lora_alpha | number | ✅ Yes | > 0 | | dataset_percent | number | ❌ No | 1-100 | | mix_datasets | array | ❌ No | Array of {path, weight}, total weights = 100 | | epochs | number | ❌ No | > 0 and <= 1000 | | batch_size | number | ❌ No | > 0 and <= 1024 | | learning_rate | decimal | ❌ No | > 0 and <= 1.0 | | device | enum | ❌ No | Must be: cpu, cuda, mps, auto | | target_modules | array | ❌ No | Array of module names | **Validation Rules:** - If `mix_datasets` is specified, it overrides `train_dataset` - Total weights in `mix_datasets` must equal exactly 100 - `lora_rank` typically: 4, 8, 16, 32 - `lora_alpha` typically: 16, 32, 64 - Cannot use both `TRAIN` and `FT_LORA` in same file ### MODEL Block — ADAPTER Sub-block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | type | enum | ✅ Yes | Must be: lora, qlora, adapter, peft | | path | path | ✅ Yes | Must exist and be valid adapter path | | rank | number | ❌ No | > 0, typically 4, 8, 16, 32, 64 | | alpha | number | ❌ No | > 0, typically 16, 32, 64 | **Validation Rules:** - If ADAPTER is defined, it is applied after base model is loaded - Adapter path must exist and be readable - ADAPTER is optional within MODEL block ### INFERENCE Block (Expanded) | Field | Type | Required | Constraints | |-------|------|----------|-------------| | mode | enum | ✅ Yes | Must be: chat, intent, translate, classify, custom | | format | string | ❌ No | Template string with {input}, {context}, {labels} | | exit_command | string | ❌ No | Command to exit chat mode | | params | object | ❌ No | Inference parameters object | | CONTROL | block | ❌ No | Nested CONTROL block for inference | **INFERENCE params:** - `max_length`: > 0 and <= 8192 - `temperature`: >= 0.0 and <= 2.0 - `top_p`: > 0.0 and <= 1.0 - `beams`: >= 1 - `do_sample`: boolean (true/false) - `top_k`: >= 0 (0 = disabled) - `repetition_penalty`: > 0.0 and <= 2.0 **Validation Rules:** - IF INFERENCE exists THEN MODEL is required - Format string must contain at least {input} for most modes - CONTROL within INFERENCE can only use: RETRY, REGENERATE, REPLACE ### CONTROL Block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | IF | condition | ❌ No | Conditional logic | | WHEN | condition | ❌ No | Event-based conditional | | EVERY | number + steps | ❌ No | Periodic actions | | SET | assignment | ❌ No | Set parameter value | | STOP | action | ❌ No | Stop operation | | LOG | metric/string | ❌ No | Log value or message | | SAVE | target | ❌ No | Save model/checkpoint | | RETRY | action | ❌ No | Retry inference | | REGENERATE | action | ❌ No | Regenerate output | | STOP_TRAINING | action | ❌ No | Stop training | | DECREASE | parameter + BY + value | ❌ No | Decrease parameter | | INCREASE | parameter + BY + value | ❌ No | Increase parameter | | on_step_end | block | ❌ No | Hook executed at step end | | on_epoch_end | block | ❌ No | Hook executed at epoch end | | on_memory_low | block | ❌ No | Hook executed when memory low | | on_nan | block | ❌ No | Hook executed on NaN | | on_plateau | block | ❌ No | Hook executed on loss plateau | | validate_every | number | ❌ No | Validate every N steps | **Validation Rules:** - IF CONTROL used THEN must contain at least one of: IF | WHEN | EVERY | on_step_end | on_epoch_end - Boolean values accepted = true | false - Allowed CONTROL keywords = IF | WHEN | EVERY | SET | STOP | LOG | SAVE | RETRY | REGENERATE | STOP_TRAINING | DECREASE | INCREASE - validate_every must receive integer - DECREASE LR requires numeric value - Conditions must use valid comparison operators: >, <, >=, <=, ==, != ### MONITOR Block (v1.1+) | Field | Type | Required | Constraints | |-------|------|----------|-------------| | metrics | array | ❌ No | Array of metric names | | notify_if | object | ❌ No | Conditions for notifications | | log_to | path | ❌ No | Path to log file | | level | enum | ❌ No | Must be: basic, full | | log_system | array | ❌ No | Array of system metric names | | log_speed | array | ❌ No | Array of speed metric names | | refresh_interval | string | ❌ No | Format: number + "s" or "ms", >= 1s | | export_to | path | ❌ No | Directory must exist or be creatable | | dashboard | boolean | ❌ No | true or false | **System Metrics:** - `gpu_memory_used`, `gpu_memory_free`, `gpu_usage`, `gpu_temperature`: Only if CUDA available - `temperature`: Only if hardware supports it **Validation Rules:** - GPU metrics only validated if CUDA is available - `refresh_interval` must be >= 1s - `MONITOR` extends `METRICS` and `LOGGING`, does not replace them - `notify_if` conditions must use valid comparison operators - Supported metrics: loss, accuracy, val_loss, val_accuracy, gpu_usage, ram_usage, throughput, latency, confidence, hallucination_score, and all custom metrics ### GUARD Block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | prevent | object | ❌ No | Array of prevention types | | on_violation | object | ❌ No | Action on violation | **Prevention types:** - `hallucination`, `toxicity`, `bias`, `data_leak`, `unsafe_code` **Validation Rules:** - GUARD.on_violation can only be STOP or ALERT or REPLACE or LOG - Prevention types must be valid enum values ### BEHAVIOR Block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | personality | enum | ❌ No | Must be: professional, friendly, assistant, casual, formal, creative | | verbosity | enum | ❌ No | Must be: low, medium, high | | language | enum | ❌ No | Must be: en, pt-BR, es, fr, de, it, ja, zh, multilingual | | avoid | array | ❌ No | Array of strings to avoid | | fallback | string | ❌ No | Fallback message | **Validation Rules:** - All enum values must match allowed values - fallback must be a non-empty string if provided ### EXPLORER Block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | try | object | ✅ Yes | Parameter combinations to test | | max_tests | number | ❌ No | Must be <= 50 | | pick_best_by | string | ❌ No | Must be valid metric name | **Validation Rules:** - EXPLORER.max_tests must be <= 50 - pick_best_by must be a valid metric (e.g., "val_loss", "accuracy") - try object must contain at least one parameter array - Parameter arrays must contain valid values for their type ### STABILITY Block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | stop_if_nan | boolean | ❌ No | true or false | | stop_if_diverges | boolean | ❌ No | true or false | | min_improvement | decimal | ❌ No | Must be float >= 0 | **Validation Rules:** - STABILITY.min_improvement must be float - Boolean values must be true or false (lowercase) ### DEPLOY Block | Field | Type | Required | Constraints | |-------|------|----------|-------------| | target | enum | ✅ Yes | Must be: local, cloud, edge, api, android, ios, web, desktop | | endpoint | string | ❌ No | Required if target is "api" | | requires_auth | boolean | ❌ No | true or false | | port | number | ❌ No | Required if target is "api", must be 1024-65535 | | max_concurrent_requests | number | ❌ No | > 0 | --- ## Dependency Validation ### Model Inheritance - If `inherit` is specified, parent model must be defined - Circular inheritance is not allowed - Inheritance chain depth limited to 10 levels ### Checkpoint Resume - If `resume_from_checkpoint` is specified: - Checkpoint directory must exist - Checkpoint must contain valid model files - Checkpoint must be compatible with current model architecture ### Export Compatibility - Model architecture must support export format - Quantization required for certain formats (gguf) - Mobile formats (tflite, okm) require compatible architectures --- ## Runtime Validation ### Dataset Validation **File existence:** - All dataset paths must exist - Files must be readable - Directories must be accessible **Format validation:** - JSONL: Each line must be valid JSON - CSV: Must have header row, consistent columns - Image+caption: Directory must contain image files and captions **Size limits:** - Maximum file size: 10GB per file - Maximum total dataset size: 100GB - Minimum examples: 10 for training **Dataset Mixing (v1.1+):** - If `mix_datasets` is specified, `train` is ignored - All paths in `mix_datasets` must exist - Total weights must equal exactly 100 - `dataset_percent` applies to the mixed dataset - `sampling: "weighted"` uses weights, `"random"` ignores them ### Model Validation **Base model:** - If local path: Must exist and be valid model directory - If HuggingFace: Must be downloadable - If URL: Must be accessible **Architecture compatibility:** - Model architecture must match dataset type - Vision models require image datasets - Language models require text datasets ### Training Validation **Hardware requirements:** - GPU required if `device: "cuda"` and `gpu: true` - Sufficient VRAM for batch size - Sufficient disk space for checkpoints **Memory validation:** - Batch size must fit in available memory - Effective batch size (batch_size × gradient_accumulation) validated --- ## Error Codes | Code | Error | Solution | |------|-------|----------| | V001 | Dataset file not found | Check file path, use absolute or relative path | | V002 | Invalid optimizer | Use one of: adam, adamw, sgd, rmsprop, adafactor, lamb | | V003 | Invalid scheduler | Use one of: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, step | | V004 | Model base not found | Verify model path or HuggingFace model name | | V005 | Checkpoint not found | Check checkpoint path or remove resume_from_checkpoint | | V006 | Insufficient memory | Reduce batch_size or enable gradient_accumulation | | V007 | Invalid metric for task | Use appropriate metrics for task type | | V008 | Invalid export format | Check format compatibility with model architecture | | V009 | Circular inheritance | Remove circular model inheritance chain | | V010 | Invalid field value | Check field constraints and allowed values | | V011 | Dataset mixing weights invalid | Total weights in mix_datasets must equal 100 | | V012 | FT_LORA and TRAIN conflict | Cannot use both TRAIN and FT_LORA in same file | | V013 | Version declaration invalid | Version must be "1.0" or "1.1" | | V014 | GPU metrics unavailable | GPU metrics requested but CUDA not available | | V015 | CONTROL block empty | CONTROL must contain at least one directive | | V016 | Invalid CONTROL keyword | Use only allowed CONTROL keywords | | V017 | EXPLORER max_tests too high | max_tests must be <= 50 | | V018 | Invalid boolean value | Boolean must be true or false (lowercase) | | V019 | INFERENCE without MODEL | INFERENCE block requires MODEL block | | V020 | Invalid adapter type | ADAPTER type must be: lora, qlora, adapter, peft | | V021 | GUARD violation action invalid | on_violation must be: STOP, ALERT, REPLACE, or LOG | --- ## Validation Commands ### CLI Validation ```bash # Validate syntax and structure okto validate train.okt # Validate with detailed output okto validate train.okt --verbose # Validate dataset only okto validate train.okt --dataset-only # Validate model only okto validate train.okt --model-only ``` ### IDE Validation OktoSeek IDE automatically validates: - Real-time syntax checking - Field completion suggestions - Error highlighting - Warning messages --- ## Best Practices 1. **Always validate before training** ```bash okto validate train.okt ``` 2. **Check dataset format** - Use `okto validate --dataset-only` to verify dataset structure 3. **Verify model compatibility** - Ensure model architecture matches dataset type - Check export format compatibility 4. **Test with small dataset first** - Use subset of data for initial validation - Verify pipeline works before full training 5. **Monitor resource usage** - Check available GPU memory - Verify disk space for checkpoints - Monitor training progress --- **For more information, see:** - [Grammar Specification](./docs/grammar.md) - [Getting Started Guide](./docs/GETTING_STARTED.md) - [Troubleshooting](./docs/grammar.md#troubleshooting)