# OktoScript Validation Rules
Complete reference for validation rules and constraints in OktoScript.
---
## File Structure Validation
### Required Files
1. **okt.yaml** (in project root)
   - Must exist
   - Must be valid YAML
   - Must contain `project` field
2. **Dataset Files**
   - All paths specified in DATASET block must exist
   - Files must be readable
   - Format must match declared format
3. **Model Files** (if using local paths)
   - Base model path must exist (if local)
   - Checkpoint paths must exist (if resuming)
---
## Field Validation
### PROJECT Block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| PROJECT | string | ✅ Yes | 1-100 chars, no special chars: `{}[]:"` |
### ENV Block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| accelerator | enum | ❌ No | Must be: `auto`, `cpu`, `gpu`, `tpu` |
| min_memory | string | ❌ No | Must be: `"4GB"`, `"8GB"`, `"16GB"`, `"32GB"`, `"64GB"` (quoted, GB suffix required) |
| precision | enum | ❌ No | Must be: `auto`, `fp16`, `fp32`, `bf16` |
| backend | enum | ❌ No | Must be: `auto`, `oktoseek` |
| install_missing | boolean | ❌ No | Must be: `true` or `false` (lowercase) |
| platform | enum | ❌ No | Must be: `windows`, `linux`, `mac`, `any` |
| network | enum | ❌ No | Must be: `online`, `offline`, `required` |
**ENV Validation Rules:**
1. **Memory format validation:**
   - Must use `GB` suffix (e.g., `"8GB"`, not `"8"` or `"8 GB"`)
   - Only values: `"4GB"`, `"8GB"`, `"16GB"`, `"32GB"`, `"64GB"` are allowed
   - Must be quoted string
2. **Accelerator and memory compatibility:**
   - If `accelerator = "gpu"` and `min_memory < "8GB"` → **warning** (GPU training typically requires at least 8GB RAM)
   - If `accelerator = "tpu"` → `min_memory` should be at least `"16GB"` (recommended)
3. **Network and export compatibility:**
   - If `network = "offline"` → export formats like `onnx` or `gguf` are allowed (pre-downloaded models)
   - If `network = "required"` → engine must verify internet connectivity before proceeding
4. **Backend preferences:**
   - If `backend = "oktoseek"` → preferred default for OktoSeek ecosystem
   - If `backend = "auto"` → engine selects best available backend
5. **Auto-installation:**
   - If `install_missing = true` → engine must attempt auto-setup of missing dependencies
   - If `install_missing = false` → engine must fail with clear error if dependencies are missing
6. **Default values:**
   - If ENV block is missing, defaults to:
     ```okt
     ENV {
       accelerator: "auto"
       min_memory: "8GB"
       backend: "auto"
     }
     ```
7. **Platform validation:**
   - If `platform = "windows"` → engine must verify Windows OS
   - If `platform = "linux"` → engine must verify Linux OS
   - If `platform = "mac"` → engine must verify macOS
   - If `platform = "any"` → no platform check required
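
Rules 1, 2, and 6 above can be sketched in a few lines of Python. This is only an illustration, not the engine's implementation: the function names, the dict-based representation of the ENV block, and the choice to merge a partial block over the defaults are all assumptions.

```python
from typing import Optional

# Allowed values and defaults are copied from the ENV rules above;
# the function names and the dict representation are hypothetical.
ALLOWED_MIN_MEMORY = {"4GB", "8GB", "16GB", "32GB", "64GB"}
ENV_DEFAULTS = {"accelerator": "auto", "min_memory": "8GB", "backend": "auto"}


def memory_gb(value: str) -> int:
    """Return the size in GB of an allowed min_memory string, e.g. "8GB" -> 8."""
    if value not in ALLOWED_MIN_MEMORY:
        allowed = ", ".join(sorted(ALLOWED_MIN_MEMORY, key=lambda v: int(v[:-2])))
        raise ValueError(f"min_memory must be one of {allowed}; got {value!r}")
    return int(value[:-2])


def check_env(env: Optional[dict] = None) -> list:
    """Validate an ENV mapping; return warnings, raise ValueError on hard errors."""
    # Assumption: a missing block uses the documented defaults, and a partial
    # block is merged over them field by field.
    merged = {**ENV_DEFAULTS, **(env or {})}
    size = memory_gb(merged["min_memory"])
    warnings = []
    if merged["accelerator"] == "gpu" and size < 8:
        warnings.append("GPU training typically requires at least 8GB RAM")
    if merged["accelerator"] == "tpu" and size < 16:
        warnings.append("TPU: min_memory of at least 16GB is recommended")
    return warnings
```

Calling `check_env({"accelerator": "gpu", "min_memory": "4GB"})` returns the GPU warning, while a malformed value such as `"8 GB"` raises `ValueError`.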
### DATASET Block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| train | path | ✅ Yes | File/dir must exist, readable |
| validation | path | ❌ No | File/dir must exist if specified |
| test | path | ❌ No | File/dir must exist if specified |
| format | enum | ❌ No | Must be: jsonl, csv, txt, parquet, image+caption, qa, instruction, multimodal |
| type | enum | ❌ No | Must be: classification, generation, qa, chat, vision, regression |
| language | enum | ❌ No | Must be: en, pt, es, fr, multilingual |
| augmentation | array | ❌ No | Each item must be valid augmentation type |
| dataset_percent | number | ❌ No | Must be 1-100 (v1.1+) |
| mix_datasets | array | ❌ No | Array of {path, weight} objects (v1.1+) |
| sampling | enum | ❌ No | Must be: weighted, random (v1.1+) |
| shuffle | boolean | ❌ No | true or false (v1.1+) |
### MODEL Block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| base | string | ✅ Yes | Valid model identifier or path |
| architecture | enum | ❌ No | Must be: transformer, cnn, rnn, diffusion, vision-transformer, bert, gpt, t5 |
| parameters | string | ❌ No | Format: number + (K\|M\|B), e.g., "120M" |
| context_window | number | ❌ No | Must be power of 2: 128, 256, 512, 1024, 2048, 4096, 8192 |
| precision | enum | ❌ No | Must be: fp32, fp16, int8, int4 |
| inherit | string | ❌ No | Must reference existing model name |
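
The `parameters` and `context_window` constraints lend themselves to a short check. The sketch below is illustrative: the helper names are hypothetical, and accepting a decimal part in `parameters` (such as "1.5B") is an assumption, since the table only shows the integer form "120M".

```python
import re

# number + (K|M|B); the optional decimal part is an assumption.
PARAMETERS_RE = re.compile(r"\d+(\.\d+)?[KMB]")

# Values listed in the MODEL table.
ALLOWED_CONTEXT_WINDOWS = {128, 256, 512, 1024, 2048, 4096, 8192}


def is_valid_parameters(value: str) -> bool:
    """True if value looks like "120M": digits followed by K, M, or B."""
    return PARAMETERS_RE.fullmatch(value) is not None


def is_valid_context_window(value: int) -> bool:
    """True if value is one of the listed powers of 2."""
    return value in ALLOWED_CONTEXT_WINDOWS
```

The suffix match is case-sensitive here, so `"120m"` is rejected; whether the real validator is that strict is not stated in this document.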
### TRAIN Block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| epochs | number | ✅ Yes | > 0 and <= 1000 |
| batch_size | number | ✅ Yes | > 0 and <= 1024 |
| learning_rate | decimal | ❌ No | > 0 and <= 1.0 |
| optimizer | enum | ❌ No | Must be: adam, adamw, sgd, rmsprop, adafactor, lamb |
| scheduler | enum | ❌ No | Must be: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, step |
| device | enum | ✅ Yes | Must be: cpu, cuda, mps, auto |
| gradient_accumulation | number | ❌ No | >= 1 |
| early_stopping | boolean | ❌ No | true or false |
| checkpoint_steps | number | ❌ No | > 0 |
| checkpoint_path | path | ❌ No | Directory must exist if specified |
| resume_from_checkpoint | path | ❌ No | Checkpoint must exist if specified |
| loss | enum | ❌ No | Must be: cross_entropy, mse, mae, bce, focal, huber, kl_divergence |
| weight_decay | decimal | ❌ No | >= 0 and <= 1.0 |
| gradient_clip | decimal | ❌ No | > 0 |
| warmup_steps | number | ❌ No | >= 0 |
| save_strategy | enum | ❌ No | Must be: steps, epoch, no |
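
The numeric rows of the TRAIN table can be expressed as a small table-driven check. This is a minimal sketch under stated assumptions: `check_train`, the dict representation of the block, and the error message wording are hypothetical; only the bounds and required fields come from the table.

```python
REQUIRED = ("epochs", "batch_size", "device")
DEVICES = {"cpu", "cuda", "mps", "auto"}

# field: (lower bound, lower bound inclusive?, upper bound or None)
RANGES = {
    "epochs": (0, False, 1000),
    "batch_size": (0, False, 1024),
    "learning_rate": (0, False, 1.0),
    "weight_decay": (0, True, 1.0),
    "gradient_accumulation": (1, True, None),
    "checkpoint_steps": (0, False, None),
    "gradient_clip": (0, False, None),
    "warmup_steps": (0, True, None),
}


def check_train(train: dict) -> list:
    """Return a list of error messages for a TRAIN block given as a dict."""
    errors = [f"{name} is required" for name in REQUIRED if name not in train]
    for name, (low, inclusive, high) in RANGES.items():
        if name not in train:
            continue
        value = train[name]
        above_low = value >= low if inclusive else value > low
        below_high = high is None or value <= high
        if not (above_low and below_high):
            errors.append(f"{name} out of range: {value}")
    if "device" in train and train["device"] not in DEVICES:
        errors.append(f"device must be one of {sorted(DEVICES)}")
    return errors
```

Keeping the bounds in one dictionary means a new numeric field needs only one extra entry rather than another `if` branch.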
### METRICS Block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| Built-in metrics | identifier | ❌ No | Must be valid metric name |
| custom | string | ❌ No | Custom metric identifier |
**Metric-task compatibility:**
- `accuracy`, `precision`, `recall`, `f1`, `confusion_matrix`: Only for classification
- `perplexity`: Only for language models
- `bleu`, `rouge`: Only for generation/translation
- `mae`, `mse`, `rmse`: Only for regression
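
As an illustration, the compatibility list above can be encoded as a lookup table. The task labels used here (`"classification"`, `"language_model"`, `"generation"`, `"translation"`, `"regression"`) are hypothetical names chosen for this sketch, as is the function name; unknown metrics are assumed to be custom and are not checked.

```python
# Metric -> set of task labels it is valid for (labels are assumptions).
METRIC_TASKS = {
    **dict.fromkeys(
        ["accuracy", "precision", "recall", "f1", "confusion_matrix"],
        {"classification"},
    ),
    "perplexity": {"language_model"},
    **dict.fromkeys(["bleu", "rouge"], {"generation", "translation"}),
    **dict.fromkeys(["mae", "mse", "rmse"], {"regression"}),
}


def incompatible_metrics(metrics: list, task: str) -> list:
    """Return the built-in metrics that are not valid for the given task."""
    return [
        name
        for name in metrics
        if name in METRIC_TASKS and task not in METRIC_TASKS[name]
    ]
```

A non-empty result would correspond to error V007 (invalid metric for task) in the error table.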
### EXPORT Block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| format | array | ✅ Yes | Each item must be: gguf, onnx, okm, safetensors, tflite |
| path | path | ✅ Yes | Directory must exist or be creatable |
| quantization | enum | ❌ No | Must be: int8, int4, fp16, fp32 |
| optimize_for | enum | ❌ No | Must be: speed, size, accuracy |
**Format-specific requirements:**
- `gguf`: Requires quantization
- `tflite`: Only for mobile-compatible architectures
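
The cross-field rule that `gguf` requires quantization is easy to miss because `quantization` is optional on its own. A minimal sketch of that check, assuming the EXPORT block is available as a dict (the function name and messages are hypothetical, and the `tflite` architecture rule is not covered here):

```python
EXPORT_FORMATS = {"gguf", "onnx", "okm", "safetensors", "tflite"}
QUANTIZATIONS = {"int8", "int4", "fp16", "fp32"}


def check_export(export: dict) -> list:
    """Return error messages for an EXPORT block given as a dict."""
    errors = []
    formats = export.get("format") or []
    if not formats:
        errors.append("format is required")
    for fmt in formats:
        if fmt not in EXPORT_FORMATS:
            errors.append(f"invalid export format: {fmt}")
    if "path" not in export:
        errors.append("path is required")
    quantization = export.get("quantization")
    if quantization is not None and quantization not in QUANTIZATIONS:
        errors.append(f"invalid quantization: {quantization}")
    # Format-specific requirement from the list above.
    if "gguf" in formats and quantization is None:
        errors.append("gguf requires quantization")
    return errors
```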
### FT_LORA Block (v1.1+)
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| base_model | string | ✅ Yes | Valid model identifier or path |
| train_dataset | path | ✅ Yes | File/dir must exist if specified |
| lora_rank | number | ✅ Yes | > 0 and <= 256 |
| lora_alpha | number | ✅ Yes | > 0 |
| dataset_percent | number | ❌ No | 1-100 |
| mix_datasets | array | ❌ No | Array of {path, weight}, total weights = 100 |
| epochs | number | ❌ No | > 0 and <= 1000 |
| batch_size | number | ❌ No | > 0 and <= 1024 |
| learning_rate | decimal | ❌ No | > 0 and <= 1.0 |
| device | enum | ❌ No | Must be: cpu, cuda, mps, auto |
| target_modules | array | ❌ No | Array of module names |
**Validation Rules:**
- If `mix_datasets` is specified, it overrides `train_dataset`
- Total weights in `mix_datasets` must equal exactly 100
- `lora_rank` typically: 4, 8, 16, 32
- `lora_alpha` typically: 16, 32, 64
- Cannot use both `TRAIN` and `FT_LORA` in same file
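
The block-level conflict and the LoRA bounds above could be checked roughly as follows. This sketch assumes the parsed file is a dict keyed by block name, which is a hypothetical representation; the `V012` code is taken from the error table, the other messages are illustrative.

```python
FT_LORA_REQUIRED = ("base_model", "train_dataset", "lora_rank", "lora_alpha")


def check_ft_lora(blocks: dict) -> list:
    """Return error messages for the FT_LORA rules, given all parsed blocks."""
    errors = []
    if "TRAIN" in blocks and "FT_LORA" in blocks:
        errors.append("V012: cannot use both TRAIN and FT_LORA in same file")
    lora = blocks.get("FT_LORA")
    if lora is None:
        return errors
    for name in FT_LORA_REQUIRED:
        if name not in lora:
            errors.append(f"{name} is required")
    rank = lora.get("lora_rank")
    if rank is not None and not (0 < rank <= 256):
        errors.append(f"lora_rank out of range: {rank}")
    alpha = lora.get("lora_alpha")
    if alpha is not None and alpha <= 0:
        errors.append(f"lora_alpha must be > 0, got {alpha}")
    return errors
```

The "typical" values for `lora_rank` and `lora_alpha` are guidance rather than constraints, so the sketch does not enforce them.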
### MODEL Block - ADAPTER Sub-block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| type | enum | ✅ Yes | Must be: lora, qlora, adapter, peft |
| path | path | ✅ Yes | Must exist and be valid adapter path |
| rank | number | ❌ No | > 0, typically 4, 8, 16, 32, 64 |
| alpha | number | ❌ No | > 0, typically 16, 32, 64 |
**Validation Rules:**
- If ADAPTER is defined, it is applied after base model is loaded
- Adapter path must exist and be readable
- ADAPTER is optional within MODEL block
### INFERENCE Block (Expanded)
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| mode | enum | ✅ Yes | Must be: chat, intent, translate, classify, custom |
| format | string | ❌ No | Template string with {input}, {context}, {labels} |
| exit_command | string | ❌ No | Command to exit chat mode |
| params | object | ❌ No | Inference parameters object |
| CONTROL | block | ❌ No | Nested CONTROL block for inference |
**INFERENCE params:**
- `max_length`: > 0 and <= 8192
- `temperature`: >= 0.0 and <= 2.0
- `top_p`: > 0.0 and <= 1.0
- `beams`: >= 1
- `do_sample`: boolean (true/false)
- `top_k`: >= 0 (0 = disabled)
- `repetition_penalty`: > 0.0 and <= 2.0
**Validation Rules:**
- If an INFERENCE block exists, a MODEL block is required
- Format string must contain at least {input} for most modes
- CONTROL within INFERENCE can only use: RETRY, REGENERATE, REPLACE
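
The `format` template rule can be checked with Python's standard `string.Formatter`, which extracts placeholder names from a brace-style template. This is a sketch only: the function name is hypothetical, rejecting placeholders other than `{input}`, `{context}`, and `{labels}` is an assumption, and because the rule says "most modes", a real validator might downgrade the missing-`{input}` message to a warning for some modes.

```python
from string import Formatter

# Placeholders named in the INFERENCE table.
ALLOWED_PLACEHOLDERS = {"input", "context", "labels"}


def check_format_template(template: str) -> list:
    """Return messages about placeholders used in an INFERENCE format string."""
    # Formatter().parse yields (literal, field_name, format_spec, conversion);
    # field_name is None for trailing literal text.
    fields = {name for _, name, _, _ in Formatter().parse(template) if name}
    messages = []
    unknown = sorted(fields - ALLOWED_PLACEHOLDERS)
    if unknown:
        messages.append(f"unknown placeholders: {unknown}")
    if "input" not in fields:
        messages.append("template should contain {input}")
    return messages
```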
### CONTROL Block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| IF | condition | ❌ No | Conditional logic |
| WHEN | condition | ❌ No | Event-based conditional |
| EVERY | number + steps | ❌ No | Periodic actions |
| SET | assignment | ❌ No | Set parameter value |
| STOP | action | ❌ No | Stop operation |
| LOG | metric/string | ❌ No | Log value or message |
| SAVE | target | ❌ No | Save model/checkpoint |
| RETRY | action | ❌ No | Retry inference |
| REGENERATE | action | ❌ No | Regenerate output |
| STOP_TRAINING | action | ❌ No | Stop training |
| DECREASE | parameter + BY + value | ❌ No | Decrease parameter |
| INCREASE | parameter + BY + value | ❌ No | Increase parameter |
| on_step_end | block | ❌ No | Hook executed at step end |
| on_epoch_end | block | ❌ No | Hook executed at epoch end |
| on_memory_low | block | ❌ No | Hook executed when memory low |
| on_nan | block | ❌ No | Hook executed on NaN |
| on_plateau | block | ❌ No | Hook executed on loss plateau |
| validate_every | number | ❌ No | Validate every N steps |
**Validation Rules:**
- If a CONTROL block is used, it must contain at least one of: IF, WHEN, EVERY, on_step_end, on_epoch_end
- Boolean values must be `true` or `false`
- Allowed CONTROL keywords: IF, WHEN, EVERY, SET, STOP, LOG, SAVE, RETRY, REGENERATE, STOP_TRAINING, DECREASE, INCREASE
- `validate_every` must be an integer
- `DECREASE LR` requires a numeric value
- Conditions must use valid comparison operators: >, <, >=, <=, ==, !=
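
A rough sketch of the CONTROL rules, assuming the block has already been parsed into a list of directive names and a list of comparison operators used in its conditions. That representation and the function name are hypothetical; the `V015` and `V016` codes come from the error table.

```python
TRIGGERS = {"IF", "WHEN", "EVERY", "on_step_end", "on_epoch_end"}
KEYWORDS = {
    "IF", "WHEN", "EVERY", "SET", "STOP", "LOG", "SAVE", "RETRY",
    "REGENERATE", "STOP_TRAINING", "DECREASE", "INCREASE",
}
# Hooks and fields listed in the CONTROL table in addition to the keywords.
HOOKS_AND_FIELDS = {
    "on_step_end", "on_epoch_end", "on_memory_low", "on_nan", "on_plateau",
    "validate_every",
}
OPERATORS = {">", "<", ">=", "<=", "==", "!="}


def check_control(directives: list, operators: tuple = ()) -> list:
    """Return error messages for a CONTROL block's directive names and operators."""
    errors = []
    if not TRIGGERS.intersection(directives):
        errors.append("V015: CONTROL needs IF, WHEN, EVERY, on_step_end, or on_epoch_end")
    for name in directives:
        if name not in KEYWORDS and name not in HOOKS_AND_FIELDS:
            errors.append(f"V016: invalid CONTROL keyword {name!r}")
    for op in operators:
        if op not in OPERATORS:
            errors.append(f"invalid comparison operator {op!r}")
    return errors
```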
### MONITOR Block (v1.1+)
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| metrics | array | ❌ No | Array of metric names |
| notify_if | object | ❌ No | Conditions for notifications |
| log_to | path | ❌ No | Path to log file |
| level | enum | ❌ No | Must be: basic, full |
| log_system | array | ❌ No | Array of system metric names |
| log_speed | array | ❌ No | Array of speed metric names |
| refresh_interval | string | ❌ No | Format: number + "s" or "ms", >= 1s |
| export_to | path | ❌ No | Directory must exist or be creatable |
| dashboard | boolean | ❌ No | true or false |
**System Metrics:**
- `gpu_memory_used`, `gpu_memory_free`, `gpu_usage`, `gpu_temperature`: Only if CUDA available
- `temperature`: Only if hardware supports it
**Validation Rules:**
- GPU metrics only validated if CUDA is available
- `refresh_interval` must be >= 1s
- `MONITOR` extends `METRICS` and `LOGGING`, does not replace them
- `notify_if` conditions must use valid comparison operators
- Supported metrics: loss, accuracy, val_loss, val_accuracy, gpu_usage, ram_usage, throughput, latency, confidence, hallucination_score, and all custom metrics
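
The `refresh_interval` rule combines a format check with a minimum value, so a value such as `"500ms"` is well-formed but still invalid. A minimal sketch, assuming integer amounts only (the function name is hypothetical):

```python
import re

# number + "s" or "ms"; integer amounts only is an assumption.
INTERVAL_RE = re.compile(r"(\d+)(ms|s)")


def refresh_interval_seconds(value: str) -> float:
    """Parse a refresh_interval string and return it in seconds (must be >= 1s)."""
    match = INTERVAL_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"refresh_interval must be number + 's' or 'ms', got {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    seconds = amount / 1000 if unit == "ms" else float(amount)
    if seconds < 1:
        raise ValueError(f"refresh_interval must be >= 1s, got {value!r}")
    return seconds
```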
### GUARD Block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| prevent | object | ❌ No | Array of prevention types |
| on_violation | object | ❌ No | Action on violation |
**Prevention types:**
- `hallucination`, `toxicity`, `bias`, `data_leak`, `unsafe_code`
**Validation Rules:**
- GUARD.on_violation must be one of: STOP, ALERT, REPLACE, LOG
- Prevention types must be valid enum values
### BEHAVIOR Block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| personality | enum | ❌ No | Must be: professional, friendly, assistant, casual, formal, creative |
| verbosity | enum | ❌ No | Must be: low, medium, high |
| language | enum | ❌ No | Must be: en, pt-BR, es, fr, de, it, ja, zh, multilingual |
| avoid | array | ❌ No | Array of strings to avoid |
| fallback | string | ❌ No | Fallback message |
**Validation Rules:**
- All enum values must match allowed values
- fallback must be a non-empty string if provided
### EXPLORER Block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| try | object | ✅ Yes | Parameter combinations to test |
| max_tests | number | ❌ No | Must be <= 50 |
| pick_best_by | string | ❌ No | Must be valid metric name |
**Validation Rules:**
- EXPLORER.max_tests must be <= 50
- pick_best_by must be a valid metric (e.g., "val_loss", "accuracy")
- try object must contain at least one parameter array
- Parameter arrays must contain valid values for their type
### STABILITY Block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| stop_if_nan | boolean | ❌ No | true or false |
| stop_if_diverges | boolean | ❌ No | true or false |
| min_improvement | decimal | ❌ No | Must be float >= 0 |
**Validation Rules:**
- STABILITY.min_improvement must be float
- Boolean values must be true or false (lowercase)
### DEPLOY Block
| Field | Type | Required | Constraints |
|-------|------|----------|-------------|
| target | enum | ✅ Yes | Must be: local, cloud, edge, api, android, ios, web, desktop |
| endpoint | string | ❌ No | Required if target is "api" |
| requires_auth | boolean | ❌ No | true or false |
| port | number | ❌ No | Required if target is "api", must be 1024-65535 |
| max_concurrent_requests | number | ❌ No | > 0 |
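
The DEPLOY table contains two conditionally required fields: `endpoint` and `port` are optional in general but required when `target` is `"api"`. A sketch of that logic, assuming a dict representation and hypothetical message wording:

```python
TARGETS = {"local", "cloud", "edge", "api", "android", "ios", "web", "desktop"}


def check_deploy(deploy: dict) -> list:
    """Return error messages for a DEPLOY block given as a dict."""
    errors = []
    target = deploy.get("target")
    if target not in TARGETS:
        errors.append(f"invalid or missing target: {target!r}")
    if target == "api":
        for name in ("endpoint", "port"):
            if name not in deploy:
                errors.append(f"{name} is required when target is api")
    port = deploy.get("port")
    if port is not None and not (1024 <= port <= 65535):
        errors.append(f"port must be 1024-65535, got {port}")
    limit = deploy.get("max_concurrent_requests")
    if limit is not None and limit <= 0:
        errors.append("max_concurrent_requests must be > 0")
    return errors
```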
---
## Dependency Validation
### Model Inheritance
- If `inherit` is specified, parent model must be defined
- Circular inheritance is not allowed
- Inheritance chain depth limited to 10 levels
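
The three inheritance rules can be checked in one walk up the chain. The sketch below assumes models are available as a mapping from model name to parent name (or `None`), which is a hypothetical representation, and it interprets "10 levels" as at most 10 `inherit` hops, which this document does not define precisely.

```python
MAX_INHERITANCE_DEPTH = 10  # interpreted here as the number of inherit hops


def inheritance_chain(parents: dict, name: str) -> list:
    """Return [name, parent, grandparent, ...] or raise ValueError.

    parents maps each defined model name to its parent name, or None.
    """
    chain = [name]
    while True:
        current = chain[-1]
        if current not in parents:
            raise ValueError(f"model {current!r} is not defined")
        parent = parents[current]
        if parent is None:
            return chain
        if parent in chain:
            cycle = " -> ".join(chain + [parent])
            raise ValueError(f"V009: circular inheritance: {cycle}")
        chain.append(parent)
        if len(chain) - 1 > MAX_INHERITANCE_DEPTH:
            raise ValueError("inheritance chain deeper than 10 levels")
```

For example, `inheritance_chain({"child": "base", "base": None}, "child")` returns `["child", "base"]`, while a mapping in which two models inherit from each other raises the `V009` error.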
### Checkpoint Resume
- If `resume_from_checkpoint` is specified:
  - Checkpoint directory must exist
  - Checkpoint must contain valid model files
  - Checkpoint must be compatible with current model architecture
### Export Compatibility
- Model architecture must support export format
- Quantization required for certain formats (gguf)
- Mobile formats (tflite, okm) require compatible architectures
---
## Runtime Validation
### Dataset Validation
**File existence:**
- All dataset paths must exist
- Files must be readable
- Directories must be accessible
**Format validation:**
- JSONL: Each line must be valid JSON
- CSV: Must have header row, consistent columns
- Image+caption: Directory must contain image files and captions
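
The JSONL and CSV rules above can be approximated with the Python standard library. This is a sketch operating on text already read into memory, with hypothetical function names. The CSV check cannot tell whether the first row is really a header, so it only verifies that a first row exists and that every later row has the same number of columns.

```python
import csv
import io
import json


def invalid_jsonl_lines(text: str) -> list:
    """Return the 1-based numbers of lines that are not valid JSON."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError:
            # Blank lines also land here, since an empty string is not JSON.
            bad.append(number)
    return bad


def csv_is_consistent(text: str) -> bool:
    """True if there is a first (header) row and all rows share its width."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or not rows[0]:
        return False
    width = len(rows[0])
    return all(len(row) == width for row in rows[1:])
```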
**Size limits:**
- Maximum file size: 10GB per file
- Maximum total dataset size: 100GB
- Minimum examples: 10 for training
**Dataset Mixing (v1.1+):**
- If `mix_datasets` is specified, `train` is ignored
- All paths in `mix_datasets` must exist
- Total weights must equal exactly 100
- `dataset_percent` applies to the mixed dataset
- `sampling: "weighted"` uses weights, `"random"` ignores them
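
To make the mixing rules concrete, the sketch below validates the weight total and then picks a source dataset per example. It is an interpretation rather than the engine's behavior: the function names are hypothetical, weights are assumed to be integers so the exact-100 comparison is safe, and `"random"` is read as a uniform choice among the paths.

```python
import random


def check_mix_weights(mix: list) -> None:
    """Raise ValueError unless the weights in mix_datasets total exactly 100."""
    total = sum(item["weight"] for item in mix)
    if total != 100:
        raise ValueError(f"V011: total weights must equal 100, got {total}")


def pick_dataset(mix: list, sampling: str, rng: random.Random) -> str:
    """Choose the dataset path the next example is drawn from."""
    paths = [item["path"] for item in mix]
    if sampling == "weighted":
        weights = [item["weight"] for item in mix]
        return rng.choices(paths, weights=weights, k=1)[0]
    if sampling == "random":
        # Weights are ignored; every path is equally likely (assumption).
        return rng.choice(paths)
    raise ValueError(f"sampling must be 'weighted' or 'random', got {sampling!r}")
```

Passing an explicit `random.Random(seed)` keeps the selection reproducible across runs, which is convenient when comparing experiments.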
### Model Validation
**Base model:**
- If local path: Must exist and be valid model directory
- If HuggingFace: Must be downloadable
- If URL: Must be accessible
**Architecture compatibility:**
- Model architecture must match dataset type
- Vision models require image datasets
- Language models require text datasets
### Training Validation
**Hardware requirements:**
- GPU required if `device: "cuda"` and `gpu: true`
- Sufficient VRAM for batch size
- Sufficient disk space for checkpoints
**Memory validation:**
- Batch size must fit in available memory
- Effective batch size (batch_size × gradient_accumulation) validated
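
The effective batch size is the product of the two TRAIN fields, so a small per-step batch with gradient accumulation still trains as if it were a larger batch. A minimal sketch, with a hypothetical function name and the bounds taken from the TRAIN table:

```python
def effective_batch_size(batch_size: int, gradient_accumulation: int = 1) -> int:
    """Return batch_size * gradient_accumulation after checking the TRAIN bounds."""
    if not (0 < batch_size <= 1024):
        raise ValueError(f"batch_size must be > 0 and <= 1024, got {batch_size}")
    if gradient_accumulation < 1:
        raise ValueError(f"gradient_accumulation must be >= 1, got {gradient_accumulation}")
    return batch_size * gradient_accumulation
```

With `batch_size = 8` and `gradient_accumulation = 4`, the effective batch size is 32, while only 8 examples need to fit in memory at once.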
---
## Error Codes
| Code | Error | Solution |
|------|-------|----------|
| V001 | Dataset file not found | Check file path, use absolute or relative path |
| V002 | Invalid optimizer | Use one of: adam, adamw, sgd, rmsprop, adafactor, lamb |
| V003 | Invalid scheduler | Use one of: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, step |
| V004 | Model base not found | Verify model path or HuggingFace model name |
| V005 | Checkpoint not found | Check checkpoint path or remove resume_from_checkpoint |
| V006 | Insufficient memory | Reduce batch_size or enable gradient_accumulation |
| V007 | Invalid metric for task | Use appropriate metrics for task type |
| V008 | Invalid export format | Check format compatibility with model architecture |
| V009 | Circular inheritance | Remove circular model inheritance chain |
| V010 | Invalid field value | Check field constraints and allowed values |
| V011 | Dataset mixing weights invalid | Total weights in mix_datasets must equal 100 |
| V012 | FT_LORA and TRAIN conflict | Cannot use both TRAIN and FT_LORA in same file |
| V013 | Version declaration invalid | Version must be "1.0" or "1.1" |
| V014 | GPU metrics unavailable | GPU metrics requested but CUDA not available |
| V015 | CONTROL block empty | CONTROL must contain at least one directive |
| V016 | Invalid CONTROL keyword | Use only allowed CONTROL keywords |
| V017 | EXPLORER max_tests too high | max_tests must be <= 50 |
| V018 | Invalid boolean value | Boolean must be true or false (lowercase) |
| V019 | INFERENCE without MODEL | INFERENCE block requires MODEL block |
| V020 | Invalid adapter type | ADAPTER type must be: lora, qlora, adapter, peft |
| V021 | GUARD violation action invalid | on_violation must be: STOP, ALERT, REPLACE, or LOG |
---
## Validation Commands
### CLI Validation
```bash
# Validate syntax and structure
okto validate train.okt
# Validate with detailed output
okto validate train.okt --verbose
# Validate dataset only
okto validate train.okt --dataset-only
# Validate model only
okto validate train.okt --model-only
```
### IDE Validation
OktoSeek IDE validates files automatically and provides:
- Real-time syntax checking
- Field completion suggestions
- Error highlighting
- Warning messages
---
## Best Practices
1. **Always validate before training**
   ```bash
   okto validate train.okt
   ```
2. **Check dataset format**
   - Use `okto validate --dataset-only` to verify dataset structure
3. **Verify model compatibility**
   - Ensure model architecture matches dataset type
   - Check export format compatibility
4. **Test with small dataset first**
   - Use subset of data for initial validation
   - Verify pipeline works before full training
5. **Monitor resource usage**
   - Check available GPU memory
   - Verify disk space for checkpoints
   - Monitor training progress
---
**For more information, see:**
- [Grammar Specification](./docs/grammar.md)
- [Getting Started Guide](./docs/GETTING_STARTED.md)
- [Troubleshooting](./docs/grammar.md#troubleshooting)