# OktoScript Validation Rules

Complete reference for validation rules and constraints in OktoScript.

---

## File Structure Validation

### Required Files

1. **okt.yaml** (in project root)
   - Must exist
   - Must be valid YAML
   - Must contain `project` field
2. **Dataset Files**
   - All paths specified in DATASET block must exist
   - Files must be readable
   - Format must match declared format
3. **Model Files** (if using local paths)
   - Base model path must exist (if local)
   - Checkpoint paths must exist (if resuming)

---
## Field Validation

### PROJECT Block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| PROJECT | string | Yes | 1-100 chars, no special chars: `{}[]:"` |
### ENV Block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| accelerator | enum | No | Must be: `auto`, `cpu`, `gpu`, `tpu` |
| min_memory | string | No | Must be: `"4GB"`, `"8GB"`, `"16GB"`, `"32GB"`, `"64GB"` (quoted, GB suffix required) |
| precision | enum | No | Must be: `auto`, `fp16`, `fp32`, `bf16` |
| backend | enum | No | Must be: `auto`, `oktoseek` |
| install_missing | boolean | No | Must be: `true` or `false` (lowercase) |
| platform | enum | No | Must be: `windows`, `linux`, `mac`, `any` |
| network | enum | No | Must be: `online`, `offline`, `required` |

**ENV Validation Rules:**

1. **Memory format validation:**
   - Must use `GB` suffix (e.g., `"8GB"`, not `"8"` or `"8 GB"`)
   - Only values: `"4GB"`, `"8GB"`, `"16GB"`, `"32GB"`, `"64GB"` are allowed
   - Must be quoted string
2. **Accelerator and memory compatibility:**
   - If `accelerator = "gpu"` and `min_memory < "8GB"` → **warning** (GPU training typically requires at least 8GB RAM)
   - If `accelerator = "tpu"` → `min_memory` should be at least `"16GB"` (recommended)
3. **Network and export compatibility:**
   - If `network = "offline"` → export formats like `onnx` or `gguf` are allowed (pre-downloaded models)
   - If `network = "required"` → engine must verify internet connectivity before proceeding
4. **Backend preferences:**
   - If `backend = "oktoseek"` → preferred default for OktoSeek ecosystem
   - If `backend = "auto"` → engine selects best available backend
5. **Auto-installation:**
   - If `install_missing = true` → engine must attempt auto-setup of missing dependencies
   - If `install_missing = false` → engine must fail with clear error if dependencies are missing
6. **Default values:**
   - If ENV block is missing, defaults to:
     ```okt
     ENV {
       accelerator: "auto"
       min_memory: "8GB"
       backend: "auto"
     }
     ```
7. **Platform validation:**
   - If `platform = "windows"` → engine must verify Windows OS
   - If `platform = "linux"` → engine must verify Linux OS
   - If `platform = "mac"` → engine must verify macOS
   - If `platform = "any"` → no platform check required
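
The memory and accelerator rules above can be sketched as a small check. This is a minimal illustration, not the engine's implementation: the function name `check_env`, the dict representation of the ENV block, and applying the `"8GB"` default to a missing field (rather than only to a missing block) are all assumptions.

```python
ALLOWED_MIN_MEMORY = ("4GB", "8GB", "16GB", "32GB", "64GB")


def check_env(env: dict) -> tuple:
    """Return (errors, warnings) for an ENV block given as a plain dict (assumed shape)."""
    errors, warnings = [], []
    # Assumption: a missing min_memory falls back to the documented block default.
    mem = env.get("min_memory", "8GB")
    if mem not in ALLOWED_MIN_MEMORY:
        allowed = ", ".join(ALLOWED_MIN_MEMORY)
        errors.append(f"min_memory must be one of {allowed}; got {mem!r}")
        return errors, warnings
    gigabytes = int(mem[:-2])  # strip the "GB" suffix
    accelerator = env.get("accelerator", "auto")
    if accelerator == "gpu" and gigabytes < 8:
        warnings.append("GPU training typically requires at least 8GB RAM")
    if accelerator == "tpu" and gigabytes < 16:
        warnings.append("min_memory of at least 16GB is recommended for TPU")
    return errors, warnings
```

Values such as `"8"` or `"8 GB"` fail the membership test and are reported as errors, while the GPU and TPU rules only produce warnings, matching the wording of the rules.
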

### DATASET Block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| train | path | Yes | File/dir must exist, readable |
| validation | path | No | File/dir must exist if specified |
| test | path | No | File/dir must exist if specified |
| format | enum | No | Must be: jsonl, csv, txt, parquet, image+caption, qa, instruction, multimodal |
| type | enum | No | Must be: classification, generation, qa, chat, vision, regression |
| language | enum | No | Must be: en, pt, es, fr, multilingual |
| augmentation | array | No | Each item must be valid augmentation type |
| dataset_percent | number | No | Must be 1-100 (v1.1+) |
| mix_datasets | array | No | Array of {path, weight} objects (v1.1+) |
| sampling | enum | No | Must be: weighted, random (v1.1+) |
| shuffle | boolean | No | true or false (v1.1+) |

### MODEL Block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| base | string | Yes | Valid model identifier or path |
| architecture | enum | No | Must be: transformer, cnn, rnn, diffusion, vision-transformer, bert, gpt, t5 |
| parameters | string | No | Format: number + (K\|M\|B), e.g., "120M" |
| context_window | number | No | Must be power of 2: 128, 256, 512, 1024, 2048, 4096, 8192 |
| precision | enum | No | Must be: fp32, fp16, int8, int4 |
| inherit | string | No | Must reference existing model name |
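
As an illustration of the `parameters` and `context_window` constraints, the following sketch uses a regular expression and a fixed set. Two assumptions are made here: that the "number" part may include a decimal fraction (for example `"1.5B"`), and that the listed context window sizes are exhaustive. The helper names are hypothetical.

```python
import re

# Assumption: "number" allows an optional decimal part, followed by K, M, or B.
PARAMETERS_RE = re.compile(r"\d+(\.\d+)?[KMB]")

# Assumption: only the powers of two listed in the table are accepted.
ALLOWED_CONTEXT_WINDOWS = {128, 256, 512, 1024, 2048, 4096, 8192}


def is_valid_parameters(value: str) -> bool:
    """True if value looks like '120M', '7B', or '350K'."""
    return PARAMETERS_RE.fullmatch(value) is not None


def is_valid_context_window(value: int) -> bool:
    """True if value is one of the listed context window sizes."""
    return value in ALLOWED_CONTEXT_WINDOWS
```
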

### TRAIN Block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| epochs | number | Yes | > 0 and <= 1000 |
| batch_size | number | Yes | > 0 and <= 1024 |
| learning_rate | decimal | No | > 0 and <= 1.0 |
| optimizer | enum | No | Must be: adam, adamw, sgd, rmsprop, adafactor, lamb |
| scheduler | enum | No | Must be: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, step |
| device | enum | Yes | Must be: cpu, cuda, mps, auto |
| gradient_accumulation | number | No | >= 1 |
| early_stopping | boolean | No | true or false |
| checkpoint_steps | number | No | > 0 |
| checkpoint_path | path | No | Directory must exist if specified |
| resume_from_checkpoint | path | No | Checkpoint must exist if specified |
| loss | enum | No | Must be: cross_entropy, mse, mae, bce, focal, huber, kl_divergence |
| weight_decay | decimal | No | >= 0 and <= 1.0 |
| gradient_clip | decimal | No | > 0 |
| warmup_steps | number | No | >= 0 |
| save_strategy | enum | No | Must be: steps, epoch, no |

### METRICS Block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| Built-in metrics | identifier | No | Must be valid metric name |
| custom | string | No | Custom metric identifier |

**Metric-task compatibility:**

- `accuracy`, `precision`, `recall`, `f1`, `confusion_matrix`: Only for classification
- `perplexity`: Only for language models
- `bleu`, `rouge`: Only for generation/translation
- `mae`, `mse`, `rmse`: Only for regression
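
One way to picture the compatibility rules is a lookup from metric name to the task types that accept it. The sketch below is illustrative only: mapping "language models" to the `generation` and `chat` dataset types, and "generation/translation" to `generation`, is an assumption, since the document does not tie these rules to specific `type` values. Unknown (custom) metrics are passed through unchecked.

```python
# Assumed mapping from built-in metric to compatible DATASET `type` values.
METRIC_TASKS = {
    "accuracy": {"classification"},
    "precision": {"classification"},
    "recall": {"classification"},
    "f1": {"classification"},
    "confusion_matrix": {"classification"},
    "perplexity": {"generation", "chat"},  # assumption: "language models"
    "bleu": {"generation"},                # assumption: "generation/translation"
    "rouge": {"generation"},
    "mae": {"regression"},
    "mse": {"regression"},
    "rmse": {"regression"},
}


def incompatible_metrics(metrics: list, task: str) -> list:
    """Return the built-in metrics that do not apply to the given task type."""
    return [m for m in metrics if m in METRIC_TASKS and task not in METRIC_TASKS[m]]
```

A non-empty result would correspond to error V007 (invalid metric for task) in the error code table.
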

### EXPORT Block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| format | array | Yes | Each item must be: gguf, onnx, okm, safetensors, tflite |
| path | path | Yes | Directory must exist or be creatable |
| quantization | enum | No | Must be: int8, int4, fp16, fp32 |
| optimize_for | enum | No | Must be: speed, size, accuracy |

**Format-specific requirements:**

- `gguf`: Requires quantization
- `tflite`: Only for mobile-compatible architectures

### FT_LORA Block (v1.1+)

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| base_model | string | Yes | Valid model identifier or path |
| train_dataset | path | Yes | File/dir must exist if specified |
| lora_rank | number | Yes | > 0 and <= 256 |
| lora_alpha | number | Yes | > 0 |
| dataset_percent | number | No | 1-100 |
| mix_datasets | array | No | Array of {path, weight}, total weights = 100 |
| epochs | number | No | > 0 and <= 1000 |
| batch_size | number | No | > 0 and <= 1024 |
| learning_rate | decimal | No | > 0 and <= 1.0 |
| device | enum | No | Must be: cpu, cuda, mps, auto |
| target_modules | array | No | Array of module names |

**Validation Rules:**

- If `mix_datasets` is specified, it overrides `train_dataset`
- Total weights in `mix_datasets` must equal exactly 100
- `lora_rank` typically: 4, 8, 16, 32
- `lora_alpha` typically: 16, 32, 64
- Cannot use both `TRAIN` and `FT_LORA` in same file
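
The weight-sum and block-conflict rules can be sketched as follows. The representation of a parsed file as a dict of block names, and the function name `check_ft_lora`, are assumptions made for illustration; the error codes are taken from the error code table in this document.

```python
def check_ft_lora(blocks: dict) -> list:
    """Return error messages for the FT_LORA rules, given parsed blocks as dicts (assumed shape)."""
    errors = []
    if "TRAIN" in blocks and "FT_LORA" in blocks:
        errors.append("V012: cannot use both TRAIN and FT_LORA in same file")
    ft = blocks.get("FT_LORA", {})
    rank = ft.get("lora_rank")
    if rank is not None and not 0 < rank <= 256:
        errors.append("V010: lora_rank must be > 0 and <= 256")
    mix = ft.get("mix_datasets")
    if mix is not None:
        total = sum(item["weight"] for item in mix)
        if total != 100:
            errors.append(f"V011: mix_datasets weights sum to {total}, expected 100")
    return errors
```

For example, two datasets weighted 70 and 30 pass, while 70 and 20 are rejected with V011.
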

### MODEL Block: ADAPTER Sub-block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| type | enum | Yes | Must be: lora, qlora, adapter, peft |
| path | path | Yes | Must exist and be valid adapter path |
| rank | number | No | > 0, typically 4, 8, 16, 32, 64 |
| alpha | number | No | > 0, typically 16, 32, 64 |

**Validation Rules:**

- If ADAPTER is defined, it is applied after base model is loaded
- Adapter path must exist and be readable
- ADAPTER is optional within MODEL block

### INFERENCE Block (Expanded)

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| mode | enum | Yes | Must be: chat, intent, translate, classify, custom |
| format | string | No | Template string with {input}, {context}, {labels} |
| exit_command | string | No | Command to exit chat mode |
| params | object | No | Inference parameters object |
| CONTROL | block | No | Nested CONTROL block for inference |

**INFERENCE params:**

- `max_length`: > 0 and <= 8192
- `temperature`: >= 0.0 and <= 2.0
- `top_p`: > 0.0 and <= 1.0
- `beams`: >= 1
- `do_sample`: boolean (true/false)
- `top_k`: >= 0 (0 = disabled)
- `repetition_penalty`: > 0.0 and <= 2.0

**Validation Rules:**

- IF INFERENCE exists THEN MODEL is required
- Format string must contain at least {input} for most modes
- CONTROL within INFERENCE can only use: RETRY, REGENERATE, REPLACE
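
The parameter ranges listed above translate directly into a table of predicates. This is a minimal sketch with a hypothetical function name; it assumes `params` has already been parsed into a dict of Python values.

```python
# One range predicate per numeric parameter, as listed under "INFERENCE params".
PARAM_RULES = {
    "max_length": lambda v: 0 < v <= 8192,
    "temperature": lambda v: 0.0 <= v <= 2.0,
    "top_p": lambda v: 0.0 < v <= 1.0,
    "beams": lambda v: v >= 1,
    "top_k": lambda v: v >= 0,  # 0 disables top-k sampling
    "repetition_penalty": lambda v: 0.0 < v <= 2.0,
}


def invalid_inference_params(params: dict) -> list:
    """Return the names of parameters whose values fall outside the documented ranges."""
    bad = [name for name, ok in PARAM_RULES.items()
           if name in params and not ok(params[name])]
    if "do_sample" in params and not isinstance(params["do_sample"], bool):
        bad.append("do_sample")
    return bad
```
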

### CONTROL Block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| IF | condition | No | Conditional logic |
| WHEN | condition | No | Event-based conditional |
| EVERY | number + steps | No | Periodic actions |
| SET | assignment | No | Set parameter value |
| STOP | action | No | Stop operation |
| LOG | metric/string | No | Log value or message |
| SAVE | target | No | Save model/checkpoint |
| RETRY | action | No | Retry inference |
| REGENERATE | action | No | Regenerate output |
| STOP_TRAINING | action | No | Stop training |
| DECREASE | parameter + BY + value | No | Decrease parameter |
| INCREASE | parameter + BY + value | No | Increase parameter |
| on_step_end | block | No | Hook executed at step end |
| on_epoch_end | block | No | Hook executed at epoch end |
| on_memory_low | block | No | Hook executed when memory low |
| on_nan | block | No | Hook executed on NaN |
| on_plateau | block | No | Hook executed on loss plateau |
| validate_every | number | No | Validate every N steps |

**Validation Rules:**

- IF CONTROL used THEN must contain at least one of: IF | WHEN | EVERY | on_step_end | on_epoch_end
- Boolean values accepted = true | false
- Allowed CONTROL keywords = IF | WHEN | EVERY | SET | STOP | LOG | SAVE | RETRY | REGENERATE | STOP_TRAINING | DECREASE | INCREASE
- validate_every must receive integer
- DECREASE LR requires numeric value
- Conditions must use valid comparison operators: >, <, >=, <=, ==, !=
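
A rough sketch of the first and third rules follows. It assumes the validator receives the top-level directive names of a CONTROL block as a list of strings, and that uppercase names are keywords while lowercase names are hooks or fields; neither assumption comes from the OktoScript grammar itself.

```python
REQUIRED_ANY = {"IF", "WHEN", "EVERY", "on_step_end", "on_epoch_end"}
ALLOWED_KEYWORDS = {
    "IF", "WHEN", "EVERY", "SET", "STOP", "LOG", "SAVE", "RETRY",
    "REGENERATE", "STOP_TRAINING", "DECREASE", "INCREASE",
}


def check_control(names: list) -> list:
    """Return error messages for a CONTROL block given its directive names (assumed input)."""
    errors = []
    if not REQUIRED_ANY.intersection(names):
        errors.append("CONTROL needs at least one of: " + ", ".join(sorted(REQUIRED_ANY)))
    for name in names:
        # Assumption: uppercase names are keywords; hooks and fields are lowercase.
        if name.isupper() and name not in ALLOWED_KEYWORDS:
            errors.append(f"invalid CONTROL keyword: {name}")
    return errors
```
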

### MONITOR Block (v1.1+)

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| metrics | array | No | Array of metric names |
| notify_if | object | No | Conditions for notifications |
| log_to | path | No | Path to log file |
| level | enum | No | Must be: basic, full |
| log_system | array | No | Array of system metric names |
| log_speed | array | No | Array of speed metric names |
| refresh_interval | string | No | Format: number + "s" or "ms", >= 1s |
| export_to | path | No | Directory must exist or be creatable |
| dashboard | boolean | No | true or false |

**System Metrics:**

- `gpu_memory_used`, `gpu_memory_free`, `gpu_usage`, `gpu_temperature`: Only if CUDA available
- `temperature`: Only if hardware supports it

**Validation Rules:**

- GPU metrics only validated if CUDA is available
- `refresh_interval` must be >= 1s
- `MONITOR` extends `METRICS` and `LOGGING`, does not replace them
- `notify_if` conditions must use valid comparison operators
- Supported metrics: loss, accuracy, val_loss, val_accuracy, gpu_usage, ram_usage, throughput, latency, confidence, hallucination_score, and all custom metrics
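
The `refresh_interval` format (a number followed by `s` or `ms`, at least one second) can be checked by normalizing to milliseconds. This is a sketch under the assumption that only whole numbers are accepted; the function name is hypothetical.

```python
import re

# Assumption: the numeric part is a whole number; "ms" is tried before "s".
INTERVAL_RE = re.compile(r"(\d+)(ms|s)")


def refresh_interval_ms(value: str) -> int:
    """Parse a refresh_interval such as '2s' or '1500ms' and return milliseconds."""
    match = INTERVAL_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"refresh_interval must be number + 's' or 'ms': {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    millis = amount * 1000 if unit == "s" else amount
    if millis < 1000:
        raise ValueError("refresh_interval must be >= 1s")
    return millis
```
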

### GUARD Block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| prevent | object | No | Array of prevention types |
| on_violation | object | No | Action on violation |

**Prevention types:**

- `hallucination`, `toxicity`, `bias`, `data_leak`, `unsafe_code`

**Validation Rules:**

- GUARD.on_violation can only be STOP or ALERT or REPLACE or LOG
- Prevention types must be valid enum values

### BEHAVIOR Block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| personality | enum | No | Must be: professional, friendly, assistant, casual, formal, creative |
| verbosity | enum | No | Must be: low, medium, high |
| language | enum | No | Must be: en, pt-BR, es, fr, de, it, ja, zh, multilingual |
| avoid | array | No | Array of strings to avoid |
| fallback | string | No | Fallback message |

**Validation Rules:**

- All enum values must match allowed values
- fallback must be a non-empty string if provided

### EXPLORER Block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| try | object | Yes | Parameter combinations to test |
| max_tests | number | No | Must be <= 50 |
| pick_best_by | string | No | Must be valid metric name |

**Validation Rules:**

- EXPLORER.max_tests must be <= 50
- pick_best_by must be a valid metric (e.g., "val_loss", "accuracy")
- try object must contain at least one parameter array
- Parameter arrays must contain valid values for their type
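
To make the `try` and `max_tests` rules concrete, the sketch below validates both and enumerates candidate trials. How the engine actually chooses combinations is not stated in this document; the grid enumeration, the truncation to `max_tests`, and the default of 50 are assumptions made purely for illustration.

```python
from itertools import islice, product


def explorer_trials(try_params: dict, max_tests: int = 50) -> list:
    """Validate an EXPLORER `try` object and list up to max_tests parameter combinations."""
    if max_tests > 50:
        raise ValueError("V017: max_tests must be <= 50")
    if not try_params:
        raise ValueError("try must contain at least one parameter array")
    for name, values in try_params.items():
        if not isinstance(values, list) or not values:
            raise ValueError(f"{name} must be a non-empty array")
    names = list(try_params)
    # Assumption: trials are the Cartesian product, cut off at max_tests.
    combos = product(*(try_params[name] for name in names))
    return [dict(zip(names, combo)) for combo in islice(combos, max_tests)]
```

With two learning rates and two batch sizes there are four combinations, so `max_tests=3` would keep only the first three in this sketch.
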

### STABILITY Block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| stop_if_nan | boolean | No | true or false |
| stop_if_diverges | boolean | No | true or false |
| min_improvement | decimal | No | Must be float >= 0 |

**Validation Rules:**

- STABILITY.min_improvement must be float
- Boolean values must be true or false (lowercase)

### DEPLOY Block

| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| target | enum | Yes | Must be: local, cloud, edge, api, android, ios, web, desktop |
| endpoint | string | No | Required if target is "api" |
| requires_auth | boolean | No | true or false |
| port | number | No | Required if target is "api", must be 1024-65535 |
| max_concurrent_requests | number | No | > 0 |
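
The DEPLOY table contains two conditional requirements (`endpoint` and `port` become mandatory when `target` is `"api"`). A minimal sketch of that logic, assuming the block is available as a dict and using a hypothetical function name:

```python
TARGETS = {"local", "cloud", "edge", "api", "android", "ios", "web", "desktop"}


def check_deploy(deploy: dict) -> list:
    """Return error messages for a DEPLOY block given as a plain dict (assumed shape)."""
    errors = []
    target = deploy.get("target")
    if target not in TARGETS:
        errors.append(f"invalid target: {target!r}")
    if target == "api":
        for field in ("endpoint", "port"):
            if field not in deploy:
                errors.append(f"{field} is required when target is 'api'")
    port = deploy.get("port")
    if port is not None and not 1024 <= port <= 65535:
        errors.append("port must be between 1024 and 65535")
    return errors
```
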

---

## Dependency Validation

### Model Inheritance

- If `inherit` is specified, parent model must be defined
- Circular inheritance is not allowed
- Inheritance chain depth limited to 10 levels
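
The three inheritance rules amount to walking the `inherit` chain while tracking visited models. The sketch below assumes a mapping from each defined model name to its parent (or `None`), and interprets "10 levels" as at most 10 `inherit` hops; both are assumptions, as is the function name.

```python
from typing import Dict, Optional


def inheritance_error(model: str, parents: Dict[str, Optional[str]],
                      max_depth: int = 10) -> Optional[str]:
    """Walk the inherit chain from `model`; return an error message or None if valid."""
    seen = [model]
    current = model
    while parents.get(current) is not None:
        parent = parents[current]
        if parent not in parents:
            return f"parent model {parent!r} is not defined"
        if parent in seen:
            return "V009: circular inheritance: " + " -> ".join(seen + [parent])
        seen.append(parent)
        # Assumption: depth counts inherit hops from the starting model.
        if len(seen) - 1 > max_depth:
            return f"inheritance chain depth exceeds {max_depth} levels"
        current = parent
    return None
```
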

### Checkpoint Resume

- If `resume_from_checkpoint` is specified:
  - Checkpoint directory must exist
  - Checkpoint must contain valid model files
  - Checkpoint must be compatible with current model architecture

### Export Compatibility

- Model architecture must support export format
- Quantization required for certain formats (gguf)
- Mobile formats (tflite, okm) require compatible architectures

---

## Runtime Validation

### Dataset Validation

**File existence:**

- All dataset paths must exist
- Files must be readable
- Directories must be accessible

**Format validation:**

- JSONL: Each line must be valid JSON
- CSV: Must have header row, consistent columns
- Image+caption: Directory must contain image files and captions
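
As an example of format validation, the JSONL rule can be checked line by line with the standard `json` module. This sketch takes the file content as a string and treats blank lines as invalid, which is a strict reading of "each line must be valid JSON" and may differ from the engine's behavior.

```python
import json


def invalid_jsonl_lines(text: str) -> list:
    """Return the 1-based numbers of lines that are not valid JSON."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError:
            # Blank lines also land here (strict interpretation of the rule).
            bad.append(number)
    return bad
```
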

**Size limits:**

- Maximum file size: 10GB per file
- Maximum total dataset size: 100GB
- Minimum examples: 10 for training

**Dataset Mixing (v1.1+):**

- If `mix_datasets` is specified, `train` is ignored
- All paths in `mix_datasets` must exist
- Total weights must equal exactly 100
- `dataset_percent` applies to the mixed dataset
- `sampling: "weighted"` uses weights, `"random"` ignores them

### Model Validation

**Base model:**

- If local path: Must exist and be valid model directory
- If HuggingFace: Must be downloadable
- If URL: Must be accessible

**Architecture compatibility:**

- Model architecture must match dataset type
- Vision models require image datasets
- Language models require text datasets

### Training Validation

**Hardware requirements:**

- GPU required if `device: "cuda"` and `gpu: true`
- Sufficient VRAM for batch size
- Sufficient disk space for checkpoints

**Memory validation:**

- Batch size must fit in available memory
- Effective batch size (batch_size × gradient_accumulation) validated
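
The effective batch size is the product of `batch_size` and `gradient_accumulation`. A short sketch, using the range constraints from the TRAIN table and a hypothetical function name; treating a missing `gradient_accumulation` as 1 is an assumption:

```python
def effective_batch_size(batch_size: int, gradient_accumulation: int = 1) -> int:
    """Return batch_size * gradient_accumulation after checking the TRAIN constraints."""
    if not 0 < batch_size <= 1024:
        raise ValueError("batch_size must be > 0 and <= 1024")
    if gradient_accumulation < 1:
        raise ValueError("gradient_accumulation must be >= 1")
    return batch_size * gradient_accumulation
```

For instance, `batch_size` 16 with `gradient_accumulation` 4 gives an effective batch size of 64, while only 16 examples need to fit in memory at once.
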

---

## Error Codes

| Code | Error | Solution |
|------|-------|----------|
| V001 | Dataset file not found | Check file path, use absolute or relative path |
| V002 | Invalid optimizer | Use one of: adam, adamw, sgd, rmsprop, adafactor, lamb |
| V003 | Invalid scheduler | Use one of: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, step |
| V004 | Model base not found | Verify model path or HuggingFace model name |
| V005 | Checkpoint not found | Check checkpoint path or remove resume_from_checkpoint |
| V006 | Insufficient memory | Reduce batch_size or enable gradient_accumulation |
| V007 | Invalid metric for task | Use appropriate metrics for task type |
| V008 | Invalid export format | Check format compatibility with model architecture |
| V009 | Circular inheritance | Remove circular model inheritance chain |
| V010 | Invalid field value | Check field constraints and allowed values |
| V011 | Dataset mixing weights invalid | Total weights in mix_datasets must equal 100 |
| V012 | FT_LORA and TRAIN conflict | Cannot use both TRAIN and FT_LORA in same file |
| V013 | Version declaration invalid | Version must be "1.0" or "1.1" |
| V014 | GPU metrics unavailable | GPU metrics requested but CUDA not available |
| V015 | CONTROL block empty | CONTROL must contain at least one directive |
| V016 | Invalid CONTROL keyword | Use only allowed CONTROL keywords |
| V017 | EXPLORER max_tests too high | max_tests must be <= 50 |
| V018 | Invalid boolean value | Boolean must be true or false (lowercase) |
| V019 | INFERENCE without MODEL | INFERENCE block requires MODEL block |
| V020 | Invalid adapter type | ADAPTER type must be: lora, qlora, adapter, peft |
| V021 | GUARD violation action invalid | on_violation must be: STOP, ALERT, REPLACE, or LOG |

---

## Validation Commands

### CLI Validation

```bash
# Validate syntax and structure
okto validate train.okt

# Validate with detailed output
okto validate train.okt --verbose

# Validate dataset only
okto validate train.okt --dataset-only

# Validate model only
okto validate train.okt --model-only
```

### IDE Validation

OktoSeek IDE automatically validates:

- Real-time syntax checking
- Field completion suggestions
- Error highlighting
- Warning messages

---

## Best Practices

1. **Always validate before training**
   ```bash
   okto validate train.okt
   ```
2. **Check dataset format**
   - Use `okto validate --dataset-only` to verify dataset structure
3. **Verify model compatibility**
   - Ensure model architecture matches dataset type
   - Check export format compatibility
4. **Test with small dataset first**
   - Use subset of data for initial validation
   - Verify pipeline works before full training
5. **Monitor resource usage**
   - Check available GPU memory
   - Verify disk space for checkpoints
   - Monitor training progress

---

**For more information, see:**

- [Grammar Specification](./docs/grammar.md)
- [Getting Started Guide](./docs/GETTING_STARTED.md)
- [Troubleshooting](./docs/grammar.md#troubleshooting)