	# OktoScript Validation Rules

	Complete reference for validation rules and constraints in OktoScript.

	---

	## File Structure Validation

	### Required Files

	1. okt.yaml (in project root)
	- Must exist
	- Must be valid YAML
	- Must contain `project` field

	2. Dataset Files
	- All paths specified in DATASET block must exist
	- Files must be readable
	- Format must match declared format

	3. Model Files (if using local paths)
	- Base model path must exist (if local)
	- Checkpoint paths must exist (if resuming)

	---

	## Field Validation

	### PROJECT Block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| PROJECT \| string \| ✅ Yes \| 1-100 chars, no special chars: `{}[]:"` \|

	### ENV Block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| accelerator \| enum \| ❌ No \| Must be: `auto`, `cpu`, `gpu`, `tpu` \|
	\| min_memory \| string \| ❌ No \| Must be: `"4GB"`, `"8GB"`, `"16GB"`, `"32GB"`, `"64GB"` (quoted, GB suffix required) \|
	\| precision \| enum \| ❌ No \| Must be: `auto`, `fp16`, `fp32`, `bf16` \|
	\| backend \| enum \| ❌ No \| Must be: `auto`, `oktoseek` \|
	\| install_missing \| boolean \| ❌ No \| Must be: `true` or `false` (lowercase) \|
	\| platform \| enum \| ❌ No \| Must be: `windows`, `linux`, `mac`, `any` \|
	\| network \| enum \| ❌ No \| Must be: `online`, `offline`, `required` \|

	ENV Validation Rules:

	1. Memory format validation:
	- Must use `GB` suffix (e.g., `"8GB"`, not `"8"` or `"8 GB"`)
	- Only values: `"4GB"`, `"8GB"`, `"16GB"`, `"32GB"`, `"64GB"` are allowed
	- Must be quoted string

	2. Accelerator and memory compatibility:
	- If `accelerator = "gpu"` and `min_memory < "8GB"` → warning (GPU training typically requires at least 8GB RAM)
	- If `accelerator = "tpu"` → `min_memory` should be at least `"16GB"` (recommended)

	3. Network and export compatibility:
	- If `network = "offline"` → export formats such as `onnx` or `gguf` remain allowed, provided the models are already downloaded
	- If `network = "required"` → engine must verify internet connectivity before proceeding

	4. Backend preferences:
	- If `backend = "oktoseek"` → preferred default for OktoSeek ecosystem
	- If `backend = "auto"` → engine selects best available backend

	5. Auto-installation:
	- If `install_missing = true` → engine must attempt auto-setup of missing dependencies
	- If `install_missing = false` → engine must fail with clear error if dependencies are missing

	6. Default values:
	- If ENV block is missing, defaults to:
	```okt
	ENV {
	accelerator: "auto"
	min_memory: "8GB"
	backend: "auto"
	}
	```

	7. Platform validation:
	- If `platform = "windows"` → engine must verify Windows OS
	- If `platform = "linux"` → engine must verify Linux OS
	- If `platform = "mac"` → engine must verify macOS
	- If `platform = "any"` → no platform check required
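
The ENV rules above can be sketched as a small checker. This is a minimal Python illustration, not the OktoSeek engine's implementation: `validate_env`, `memory_gb`, and the dictionary-shaped input are hypothetical names chosen for the example, and it assumes the ENV block has already been parsed into a dictionary.

```python
# Hypothetical sketch of the ENV rules; not the OktoSeek engine's code.
ALLOWED_MEMORY = ("4GB", "8GB", "16GB", "32GB", "64GB")
ENV_DEFAULTS = {"accelerator": "auto", "min_memory": "8GB", "backend": "auto"}
ENV_ENUMS = {
    "accelerator": ("auto", "cpu", "gpu", "tpu"),
    "precision": ("auto", "fp16", "fp32", "bf16"),
    "backend": ("auto", "oktoseek"),
    "platform": ("windows", "linux", "mac", "any"),
    "network": ("online", "offline", "required"),
}


def memory_gb(value):
    """Convert an allowed min_memory string such as "8GB" to an integer."""
    return int(value[:-2])


def validate_env(env=None):
    """Return (resolved_env, errors, warnings) for a parsed ENV block.

    Defaults are applied only when the whole block is missing (env is None).
    """
    resolved = dict(ENV_DEFAULTS) if env is None else dict(env)
    errors, warnings = [], []

    for field, allowed in ENV_ENUMS.items():
        if field in resolved and resolved[field] not in allowed:
            errors.append(f"{field} must be one of: {', '.join(allowed)}")

    if "install_missing" in resolved and not isinstance(resolved["install_missing"], bool):
        errors.append("install_missing must be true or false")

    memory = resolved.get("min_memory")
    if memory is not None and memory not in ALLOWED_MEMORY:
        errors.append(f"min_memory must be one of: {', '.join(ALLOWED_MEMORY)}")
        memory = None  # skip the compatibility checks for an invalid value

    if memory is not None:
        accelerator = resolved.get("accelerator")
        if accelerator == "gpu" and memory_gb(memory) < 8:
            warnings.append("GPU training typically requires at least 8GB RAM")
        if accelerator == "tpu" and memory_gb(memory) < 16:
            warnings.append("min_memory of at least 16GB is recommended for TPU")

    return resolved, errors, warnings
```

Keeping errors and warnings in separate lists mirrors the distinction the rules draw between hard constraints (allowed values) and recommendations (accelerator and memory compatibility).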

	### DATASET Block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| train \| path \| ✅ Yes \| File/dir must exist, readable \|
	\| validation \| path \| ❌ No \| File/dir must exist if specified \|
	\| test \| path \| ❌ No \| File/dir must exist if specified \|
	\| format \| enum \| ❌ No \| Must be: jsonl, csv, txt, parquet, image+caption, qa, instruction, multimodal \|
	\| type \| enum \| ❌ No \| Must be: classification, generation, qa, chat, vision, regression \|
	\| language \| enum \| ❌ No \| Must be: en, pt, es, fr, multilingual \|
	\| augmentation \| array \| ❌ No \| Each item must be valid augmentation type \|
	\| dataset_percent \| number \| ❌ No \| Must be 1-100 (v1.1+) \|
	\| mix_datasets \| array \| ❌ No \| Array of {path, weight} objects (v1.1+) \|
	\| sampling \| enum \| ❌ No \| Must be: weighted, random (v1.1+) \|
	\| shuffle \| boolean \| ❌ No \| true or false (v1.1+) \|

	### MODEL Block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| base \| string \| ✅ Yes \| Valid model identifier or path \|
	\| architecture \| enum \| ❌ No \| Must be: transformer, cnn, rnn, diffusion, vision-transformer, bert, gpt, t5 \|
	\| parameters \| string \| ❌ No \| Format: number + (K\\|M\\|B), e.g., "120M" \|
	\| context_window \| number \| ❌ No \| Must be power of 2: 128, 256, 512, 1024, 2048, 4096, 8192 \|
	\| precision \| enum \| ❌ No \| Must be: fp32, fp16, int8, int4 \|
	\| inherit \| string \| ❌ No \| Must reference existing model name \|

	### TRAIN Block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| epochs \| number \| ✅ Yes \| > 0 and <= 1000 \|
	\| batch_size \| number \| ✅ Yes \| > 0 and <= 1024 \|
	\| learning_rate \| decimal \| ❌ No \| > 0 and <= 1.0 \|
	\| optimizer \| enum \| ❌ No \| Must be: adam, adamw, sgd, rmsprop, adafactor, lamb \|
	\| scheduler \| enum \| ❌ No \| Must be: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, step \|
	\| device \| enum \| ✅ Yes \| Must be: cpu, cuda, mps, auto \|
	\| gradient_accumulation \| number \| ❌ No \| >= 1 \|
	\| early_stopping \| boolean \| ❌ No \| true or false \|
	\| checkpoint_steps \| number \| ❌ No \| > 0 \|
	\| checkpoint_path \| path \| ❌ No \| Directory must exist if specified \|
	\| resume_from_checkpoint \| path \| ❌ No \| Checkpoint must exist if specified \|
	\| loss \| enum \| ❌ No \| Must be: cross_entropy, mse, mae, bce, focal, huber, kl_divergence \|
	\| weight_decay \| decimal \| ❌ No \| >= 0 and <= 1.0 \|
	\| gradient_clip \| decimal \| ❌ No \| > 0 \|
	\| warmup_steps \| number \| ❌ No \| >= 0 \|
	\| save_strategy \| enum \| ❌ No \| Must be: steps, epoch, no \|

	### METRICS Block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| Built-in metrics \| identifier \| ❌ No \| Must be valid metric name \|
	\| custom \| string \| ❌ No \| Custom metric identifier \|

	Metric-task compatibility:
	- `accuracy`, `precision`, `recall`, `f1`, `confusion_matrix`: Only for classification
	- `perplexity`: Only for language models
	- `bleu`, `rouge`: Only for generation/translation
	- `mae`, `mse`, `rmse`: Only for regression
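
The compatibility list above amounts to a lookup from metric name to permitted task. A minimal Python sketch follows, assuming hypothetical task labels (`language_model`, `translation`) for the categories that are not part of the DATASET `type` enum. Names missing from the table, such as custom metrics, are passed through unchecked.

```python
# Task labels are illustrative; "language_model" and "translation" are
# assumptions of this sketch, not documented enum values.
METRIC_TASKS = {
    "accuracy": {"classification"},
    "precision": {"classification"},
    "recall": {"classification"},
    "f1": {"classification"},
    "confusion_matrix": {"classification"},
    "perplexity": {"language_model"},
    "bleu": {"generation", "translation"},
    "rouge": {"generation", "translation"},
    "mae": {"regression"},
    "mse": {"regression"},
    "rmse": {"regression"},
}


def incompatible_metrics(metrics, task):
    """Return the metrics that the compatibility list rules out for `task`."""
    return [m for m in metrics if m in METRIC_TASKS and task not in METRIC_TASKS[m]]
```

A non-empty result would correspond to error V007 (invalid metric for task).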

	### EXPORT Block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| format \| array \| ✅ Yes \| Each item must be: gguf, onnx, okm, safetensors, tflite \|
	\| path \| path \| ✅ Yes \| Directory must exist or be creatable \|
	\| quantization \| enum \| ❌ No \| Must be: int8, int4, fp16, fp32 \|
	\| optimize_for \| enum \| ❌ No \| Must be: speed, size, accuracy \|

	Format-specific requirements:
	- `gguf`: Requires quantization
	- `tflite`: Only for mobile-compatible architectures
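
A rough Python sketch of the EXPORT checks, including the `gguf` quantization requirement, is shown below. `validate_export` is a hypothetical helper operating on an already-parsed block, and treating an empty `format` array as missing is an assumption. The `tflite` architecture rule is omitted because the list of mobile-compatible architectures is not given here.

```python
EXPORT_FORMATS = ("gguf", "onnx", "okm", "safetensors", "tflite")
QUANTIZATIONS = ("int8", "int4", "fp16", "fp32")


def validate_export(export):
    """Return a list of error messages for a parsed EXPORT block."""
    errors = []

    formats = export.get("format")
    if not isinstance(formats, list) or not formats:
        # Assumption: an empty array counts as a missing required field.
        errors.append("format is required and must be an array")
        formats = []
    for fmt in formats:
        if fmt not in EXPORT_FORMATS:
            errors.append(f"unsupported export format: {fmt}")

    if "path" not in export:
        errors.append("path is required")

    quantization = export.get("quantization")
    if quantization is not None and quantization not in QUANTIZATIONS:
        errors.append(f"quantization must be one of: {', '.join(QUANTIZATIONS)}")

    # Format-specific rule: gguf requires quantization.
    if "gguf" in formats and quantization is None:
        errors.append("gguf export requires quantization")

    return errors
```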

	### FT_LORA Block (v1.1+)

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| base_model \| string \| ✅ Yes \| Valid model identifier or path \|
	\| train_dataset \| path \| ✅ Yes \| File/dir must exist (overridden by `mix_datasets` when specified) \|
	\| lora_rank \| number \| ✅ Yes \| > 0 and <= 256 \|
	\| lora_alpha \| number \| ✅ Yes \| > 0 \|
	\| dataset_percent \| number \| ❌ No \| 1-100 \|
	\| mix_datasets \| array \| ❌ No \| Array of {path, weight}, total weights = 100 \|
	\| epochs \| number \| ❌ No \| > 0 and <= 1000 \|
	\| batch_size \| number \| ❌ No \| > 0 and <= 1024 \|
	\| learning_rate \| decimal \| ❌ No \| > 0 and <= 1.0 \|
	\| device \| enum \| ❌ No \| Must be: cpu, cuda, mps, auto \|
	\| target_modules \| array \| ❌ No \| Array of module names \|

	Validation Rules:
	- If `mix_datasets` is specified, it overrides `train_dataset`
	- Total weights in `mix_datasets` must equal exactly 100
	- Typical `lora_rank` values: 4, 8, 16, 32
	- Typical `lora_alpha` values: 16, 32, 64
	- `TRAIN` and `FT_LORA` cannot both be used in the same file
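
The weight-sum, override, and block-conflict rules above could be checked along these lines. This is a hedged Python sketch with hypothetical helper names. The small floating-point tolerance is an assumption, since the rule itself says "exactly 100".

```python
import math


def mix_weights_valid(mix_datasets):
    """Check that the weights of all {path, weight} entries total 100."""
    total = sum(entry["weight"] for entry in mix_datasets)
    # Tolerance guards against float rounding such as 33.3 + 33.3 + 33.4.
    return math.isclose(total, 100.0, abs_tol=1e-9)


def training_sources(ft_lora):
    """Return the dataset paths actually used: mix_datasets overrides train_dataset."""
    mix = ft_lora.get("mix_datasets")
    if mix:
        return [entry["path"] for entry in mix]
    return [ft_lora["train_dataset"]]


def blocks_conflict(block_names):
    """Return True when a file declares both TRAIN and FT_LORA."""
    return {"TRAIN", "FT_LORA"} <= set(block_names)
```

In terms of the error table, a failed weight check corresponds to V011 and a block conflict to V012.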

	### MODEL Block — ADAPTER Sub-block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| type \| enum \| ✅ Yes \| Must be: lora, qlora, adapter, peft \|
	\| path \| path \| ✅ Yes \| Must exist and be valid adapter path \|
	\| rank \| number \| ❌ No \| > 0, typically 4, 8, 16, 32, 64 \|
	\| alpha \| number \| ❌ No \| > 0, typically 16, 32, 64 \|

	Validation Rules:
	- If ADAPTER is defined, it is applied after base model is loaded
	- Adapter path must exist and be readable
	- ADAPTER is optional within MODEL block

	### INFERENCE Block (Expanded)

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| mode \| enum \| ✅ Yes \| Must be: chat, intent, translate, classify, custom \|
	\| format \| string \| ❌ No \| Template string with {input}, {context}, {labels} \|
	\| exit_command \| string \| ❌ No \| Command to exit chat mode \|
	\| params \| object \| ❌ No \| Inference parameters object \|
	\| CONTROL \| block \| ❌ No \| Nested CONTROL block for inference \|

	INFERENCE params:
	- `max_length`: > 0 and <= 8192
	- `temperature`: >= 0.0 and <= 2.0
	- `top_p`: > 0.0 and <= 1.0
	- `beams`: >= 1
	- `do_sample`: boolean (true/false)
	- `top_k`: >= 0 (0 = disabled)
	- `repetition_penalty`: > 0.0 and <= 2.0

	Validation Rules:
	- IF INFERENCE exists THEN MODEL is required
	- Format string must contain at least {input} for most modes
	- CONTROL within INFERENCE can only use: RETRY, REGENERATE, REPLACE
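
Because several INFERENCE params mix inclusive and exclusive bounds, a table-driven check helps keep them straight. The following Python sketch is illustrative only: `validate_inference_params` and `format_has_input` are hypothetical names, and unknown parameter names are simply skipped, which is an assumption.

```python
# name: (lower bound, lower bound inclusive?, upper bound or None)
PARAM_RANGES = {
    "max_length": (0, False, 8192),
    "temperature": (0.0, True, 2.0),
    "top_p": (0.0, False, 1.0),
    "beams": (1, True, None),
    "top_k": (0, True, None),
    "repetition_penalty": (0.0, False, 2.0),
}


def validate_inference_params(params):
    """Return error messages for a parsed INFERENCE params object."""
    errors = []
    for name, value in params.items():
        if name == "do_sample":
            if not isinstance(value, bool):
                errors.append("do_sample must be true or false")
            continue
        if name not in PARAM_RANGES:
            continue  # assumption: unknown names are handled elsewhere
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{name} must be a number")
            continue
        lower, inclusive, upper = PARAM_RANGES[name]
        below = value < lower if inclusive else value <= lower
        above = upper is not None and value > upper
        if below or above:
            errors.append(f"{name} is out of range")
    return errors


def format_has_input(template):
    """Check that a format template contains the {input} placeholder."""
    return "{input}" in template
```

Note that `temperature` accepts 0.0 while `top_p` does not, which is exactly the kind of boundary case the inclusive flag captures.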

	### CONTROL Block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| IF \| condition \| ❌ No \| Conditional logic \|
	\| WHEN \| condition \| ❌ No \| Event-based conditional \|
	\| EVERY \| number + steps \| ❌ No \| Periodic actions \|
	\| SET \| assignment \| ❌ No \| Set parameter value \|
	\| STOP \| action \| ❌ No \| Stop operation \|
	\| LOG \| metric/string \| ❌ No \| Log value or message \|
	\| SAVE \| target \| ❌ No \| Save model/checkpoint \|
	\| RETRY \| action \| ❌ No \| Retry inference \|
	\| REGENERATE \| action \| ❌ No \| Regenerate output \|
	\| STOP_TRAINING \| action \| ❌ No \| Stop training \|
	\| DECREASE \| parameter + BY + value \| ❌ No \| Decrease parameter \|
	\| INCREASE \| parameter + BY + value \| ❌ No \| Increase parameter \|
	\| on_step_end \| block \| ❌ No \| Hook executed at step end \|
	\| on_epoch_end \| block \| ❌ No \| Hook executed at epoch end \|
	\| on_memory_low \| block \| ❌ No \| Hook executed when memory low \|
	\| on_nan \| block \| ❌ No \| Hook executed on NaN \|
	\| on_plateau \| block \| ❌ No \| Hook executed on loss plateau \|
	\| validate_every \| number \| ❌ No \| Validate every N steps \|

	Validation Rules:
	- If a CONTROL block is used, it must contain at least one of: `IF`, `WHEN`, `EVERY`, `on_step_end`, `on_epoch_end`
	- Accepted boolean values: `true`, `false`
	- Allowed CONTROL keywords: `IF`, `WHEN`, `EVERY`, `SET`, `STOP`, `LOG`, `SAVE`, `RETRY`, `REGENERATE`, `STOP_TRAINING`, `DECREASE`, `INCREASE`
	- `validate_every` must be an integer
	- `DECREASE LR` requires a numeric value
	- Conditions must use a valid comparison operator: `>`, `<`, `>=`, `<=`, `==`, `!=`
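
To make the operator and trigger rules concrete, here is a small Python sketch. It assumes conditions take the simple shape `<name> <operator> <value>` (for example `loss > 2.0`). That shape is an assumption for illustration, since the full condition grammar is not defined in this section. `parse_condition` and `has_trigger` are hypothetical names.

```python
import re

# Directives that satisfy the "at least one of" rule for CONTROL blocks.
TRIGGERS = {"IF", "WHEN", "EVERY", "on_step_end", "on_epoch_end"}

# Two-character operators are listed first so ">=" is not read as ">".
# The value may not start with an operator character, which rejects ">>".
CONDITION_RE = re.compile(r"\s*(\w+)\s*(>=|<=|==|!=|>|<)\s*([^\s<>=!]\S*)\s*")


def parse_condition(text):
    """Return (name, operator, value) or None if the condition is malformed."""
    match = CONDITION_RE.fullmatch(text)
    return match.groups() if match else None


def has_trigger(directives):
    """Check that a CONTROL block contains at least one trigger directive."""
    return any(directive in TRIGGERS for directive in directives)
```

For example, `parse_condition("loss > 2.0")` yields the three parts of the comparison, while an unsupported operator such as `=>` yields `None`.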

	### MONITOR Block (v1.1+)

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| metrics \| array \| ❌ No \| Array of metric names \|
	\| notify_if \| object \| ❌ No \| Conditions for notifications \|
	\| log_to \| path \| ❌ No \| Path to log file \|
	\| level \| enum \| ❌ No \| Must be: basic, full \|
	\| log_system \| array \| ❌ No \| Array of system metric names \|
	\| log_speed \| array \| ❌ No \| Array of speed metric names \|
	\| refresh_interval \| string \| ❌ No \| Format: number + "s" or "ms", >= 1s \|
	\| export_to \| path \| ❌ No \| Directory must exist or be creatable \|
	\| dashboard \| boolean \| ❌ No \| true or false \|

	System Metrics:
	- `gpu_memory_used`, `gpu_memory_free`, `gpu_usage`, `gpu_temperature`: Only if CUDA available
	- `temperature`: Only if hardware supports it

	Validation Rules:
	- GPU metrics only validated if CUDA is available
	- `refresh_interval` must be >= 1s
	- `MONITOR` extends `METRICS` and `LOGGING`, does not replace them
	- `notify_if` conditions must use valid comparison operators
	- Supported metrics: loss, accuracy, val_loss, val_accuracy, gpu_usage, ram_usage, throughput, latency, confidence, hallucination_score, and all custom metrics
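
The `refresh_interval` rule combines a format check (number plus `s` or `ms`) with a minimum of one second. A minimal Python sketch, with a hypothetical function name and the assumption that decimal values such as `"1.5s"` are permitted:

```python
import re

# Number followed by "ms" or "s"; accepting decimals is an assumption.
INTERVAL_RE = re.compile(r"(\d+(?:\.\d+)?)(ms|s)")


def refresh_interval_seconds(value):
    """Parse a refresh_interval string and return it in seconds.

    Raises ValueError if the format is wrong or the interval is under 1s.
    """
    match = INTERVAL_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid refresh_interval format: {value!r}")
    number, unit = match.groups()
    seconds = float(number) / 1000 if unit == "ms" else float(number)
    if seconds < 1:
        raise ValueError("refresh_interval must be >= 1s")
    return seconds
```

Under this reading, `"1500ms"` is valid (1.5 seconds) while `"500ms"` is rejected even though its format is correct.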

	### GUARD Block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| prevent \| object \| ❌ No \| Array of prevention types \|
	\| on_violation \| object \| ❌ No \| Action on violation \|

	Prevention types:
	- `hallucination`, `toxicity`, `bias`, `data_leak`, `unsafe_code`

	Validation Rules:
	- `GUARD.on_violation` must be one of: `STOP`, `ALERT`, `REPLACE`, `LOG`
	- Prevention types must be valid enum values

	### BEHAVIOR Block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| personality \| enum \| ❌ No \| Must be: professional, friendly, assistant, casual, formal, creative \|
	\| verbosity \| enum \| ❌ No \| Must be: low, medium, high \|
	\| language \| enum \| ❌ No \| Must be: en, pt-BR, es, fr, de, it, ja, zh, multilingual \|
	\| avoid \| array \| ❌ No \| Array of strings to avoid \|
	\| fallback \| string \| ❌ No \| Fallback message \|

	Validation Rules:
	- All enum values must match allowed values
	- fallback must be a non-empty string if provided

	### EXPLORER Block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| try \| object \| ✅ Yes \| Parameter combinations to test \|
	\| max_tests \| number \| ❌ No \| Must be <= 50 \|
	\| pick_best_by \| string \| ❌ No \| Must be valid metric name \|

	Validation Rules:
	- EXPLORER.max_tests must be <= 50
	- pick_best_by must be a valid metric (e.g., "val_loss", "accuracy")
	- try object must contain at least one parameter array
	- Parameter arrays must contain valid values for their type

	### STABILITY Block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| stop_if_nan \| boolean \| ❌ No \| true or false \|
	\| stop_if_diverges \| boolean \| ❌ No \| true or false \|
	\| min_improvement \| decimal \| ❌ No \| Must be float >= 0 \|

	Validation Rules:
	- STABILITY.min_improvement must be float
	- Boolean values must be true or false (lowercase)

	### DEPLOY Block

	\| Field \| Type \| Required \| Constraints \|
	\|-------\|------\|----------\|-------------\|
	\| target \| enum \| ✅ Yes \| Must be: local, cloud, edge, api, android, ios, web, desktop \|
	\| endpoint \| string \| ❌ No \| Required if target is "api" \|
	\| requires_auth \| boolean \| ❌ No \| true or false \|
	\| port \| number \| ❌ No \| Required if target is "api", must be 1024-65535 \|
	\| max_concurrent_requests \| number \| ❌ No \| > 0 \|
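
The DEPLOY table makes `endpoint` and `port` conditionally required when the target is `api`. One way to express that is sketched below in Python. `validate_deploy` is a hypothetical helper working on a parsed block, not part of the OktoScript toolchain.

```python
DEPLOY_TARGETS = ("local", "cloud", "edge", "api", "android", "ios", "web", "desktop")


def validate_deploy(deploy):
    """Return error messages for a parsed DEPLOY block."""
    errors = []

    target = deploy.get("target")
    if target not in DEPLOY_TARGETS:
        errors.append(f"target must be one of: {', '.join(DEPLOY_TARGETS)}")

    # endpoint and port become required only for the "api" target.
    if target == "api":
        for field in ("endpoint", "port"):
            if field not in deploy:
                errors.append(f"{field} is required when target is api")

    port = deploy.get("port")
    if port is not None:
        is_int = isinstance(port, int) and not isinstance(port, bool)
        if not is_int or not 1024 <= port <= 65535:
            errors.append("port must be an integer between 1024 and 65535")

    return errors
```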

	---

	## Dependency Validation

	### Model Inheritance

	- If `inherit` is specified, parent model must be defined
	- Circular inheritance is not allowed
	- Inheritance chain depth limited to 10 levels
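
The three inheritance rules can be enforced in a single walk up the parent chain. The Python sketch below is one possible approach under stated assumptions: models are represented as a hypothetical `parents` mapping from model name to parent name (or `None`), and depth is counted as the number of `inherit` links, which is an interpretation of "10 levels".

```python
MAX_INHERITANCE_DEPTH = 10


def inheritance_chain(name, parents):
    """Return [name, parent, grandparent, ...] or raise ValueError.

    `parents` maps each defined model name to its parent name, or None.
    """
    chain = [name]
    current = name
    while parents.get(current) is not None:
        parent = parents[current]
        if parent not in parents:
            raise ValueError(f"parent model {parent!r} is not defined")
        if parent in chain:
            raise ValueError(f"V009: circular inheritance involving {parent!r}")
        chain.append(parent)
        # len(chain) - 1 is the number of inherit links followed so far.
        if len(chain) - 1 > MAX_INHERITANCE_DEPTH:
            raise ValueError("inheritance chain is deeper than 10 levels")
        current = parent
    return chain
```

Tracking the visited names in `chain` is what allows the cycle check to terminate instead of looping forever on a circular definition.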

	### Checkpoint Resume

	If `resume_from_checkpoint` is specified:
	- Checkpoint directory must exist
	- Checkpoint must contain valid model files
	- Checkpoint must be compatible with current model architecture

	### Export Compatibility

	- Model architecture must support export format
	- Quantization required for certain formats (gguf)
	- Mobile formats (tflite, okm) require compatible architectures

	---

	## Runtime Validation

	### Dataset Validation

	File existence:
	- All dataset paths must exist
	- Files must be readable
	- Directories must be accessible

	Format validation:
	- JSONL: Each line must be valid JSON
	- CSV: Must have header row, consistent columns
	- Image+caption: Directory must contain image files and captions
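
The JSONL and CSV checks above are straightforward to approximate with the Python standard library. The sketch below is an illustration rather than the engine's validator. The function names are hypothetical, blank JSONL lines are reported as invalid (an assumption), and a CSV parser cannot tell whether the first row is really a header, so only its presence and the column counts are checked.

```python
import csv
import io
import json


def jsonl_errors(lines):
    """Return the 1-based numbers of lines that are not valid JSON."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
    return bad


def csv_errors(text_stream):
    """Return messages for a missing header row or inconsistent column counts."""
    reader = csv.reader(text_stream)
    header = next(reader, None)
    if header is None:
        return ["missing header row"]
    errors = []
    for row in reader:
        if len(row) != len(header):
            errors.append(
                f"line {reader.line_num}: expected {len(header)} columns, found {len(row)}"
            )
    return errors


# Usage with in-memory data; a real check would pass an open file instead.
print(jsonl_errors(['{"text": "hi"}', "not json"]))
print(csv_errors(io.StringIO("text,label\nhello,1\nbroken\n")))
```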

	Size limits:
	- Maximum file size: 10GB per file
	- Maximum total dataset size: 100GB
	- Minimum examples: 10 for training

	Dataset Mixing (v1.1+):
	- If `mix_datasets` is specified, `train` is ignored
	- All paths in `mix_datasets` must exist
	- Total weights must equal exactly 100
	- `dataset_percent` applies to the mixed dataset
	- `sampling: "weighted"` uses weights, `"random"` ignores them
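
The difference between the two `sampling` modes can be pictured with a short Python sketch. It assumes, purely for illustration, that sampling decides which source dataset each example is drawn from. The actual engine behavior is not specified here, and `pick_source` is a hypothetical name.

```python
import random


def pick_source(mix_datasets, sampling="weighted", rng=random):
    """Choose the path of the dataset to draw the next example from."""
    paths = [entry["path"] for entry in mix_datasets]
    if sampling == "weighted":
        # Sources are chosen in proportion to their weights.
        weights = [entry["weight"] for entry in mix_datasets]
        return rng.choices(paths, weights=weights, k=1)[0]
    if sampling == "random":
        # Weights are ignored: every source is equally likely.
        return rng.choice(paths)
    raise ValueError("sampling must be 'weighted' or 'random'")


mix = [
    {"path": "a.jsonl", "weight": 70},
    {"path": "b.jsonl", "weight": 30},
]
rng = random.Random(0)  # seeded so repeated runs give the same picks
picks = [pick_source(mix, "weighted", rng) for _ in range(1000)]
print(picks.count("a.jsonl") / len(picks))  # fraction drawn from a.jsonl
```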

	### Model Validation

	Base model:
	- If local path: Must exist and be valid model directory
	- If HuggingFace: Must be downloadable
	- If URL: Must be accessible

	Architecture compatibility:
	- Model architecture must match dataset type
	- Vision models require image datasets
	- Language models require text datasets

	### Training Validation

	Hardware requirements:
	- GPU required if `device: "cuda"` and `gpu: true`
	- Sufficient VRAM for batch size
	- Sufficient disk space for checkpoints

	Memory validation:
	- Batch size must fit in available memory
	- The effective batch size (`batch_size` × `gradient_accumulation`) is also validated
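
As a worked example of the effective batch size, the Python sketch below multiplies the two TRAIN fields after applying their documented ranges. `effective_batch_size` is a hypothetical helper, and defaulting `gradient_accumulation` to 1 when it is omitted is an assumption.

```python
def effective_batch_size(batch_size, gradient_accumulation=1):
    """Return batch_size × gradient_accumulation after range checks."""
    if not 0 < batch_size <= 1024:
        raise ValueError("batch_size must be > 0 and <= 1024")
    if gradient_accumulation < 1:
        raise ValueError("gradient_accumulation must be >= 1")
    return batch_size * gradient_accumulation


# A batch size of 8 accumulated over 4 steps behaves like a batch of 32.
print(effective_batch_size(8, 4))
```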

	---

	## Error Codes

	\| Code \| Error \| Solution \|
	\|------\|-------\|----------\|
	\| V001 \| Dataset file not found \| Check file path, use absolute or relative path \|
	\| V002 \| Invalid optimizer \| Use one of: adam, adamw, sgd, rmsprop, adafactor, lamb \|
	\| V003 \| Invalid scheduler \| Use one of: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, step \|
	\| V004 \| Model base not found \| Verify model path or HuggingFace model name \|
	\| V005 \| Checkpoint not found \| Check checkpoint path or remove resume_from_checkpoint \|
	\| V006 \| Insufficient memory \| Reduce batch_size or enable gradient_accumulation \|
	\| V007 \| Invalid metric for task \| Use appropriate metrics for task type \|
	\| V008 \| Invalid export format \| Check format compatibility with model architecture \|
	\| V009 \| Circular inheritance \| Remove circular model inheritance chain \|
	\| V010 \| Invalid field value \| Check field constraints and allowed values \|
	\| V011 \| Dataset mixing weights invalid \| Total weights in mix_datasets must equal 100 \|
	\| V012 \| FT_LORA and TRAIN conflict \| Cannot use both TRAIN and FT_LORA in same file \|
	\| V013 \| Version declaration invalid \| Version must be "1.0" or "1.1" \|
	\| V014 \| GPU metrics unavailable \| GPU metrics requested but CUDA not available \|
	\| V015 \| CONTROL block empty \| CONTROL must contain at least one directive \|
	\| V016 \| Invalid CONTROL keyword \| Use only allowed CONTROL keywords \|
	\| V017 \| EXPLORER max_tests too high \| max_tests must be <= 50 \|
	\| V018 \| Invalid boolean value \| Boolean must be true or false (lowercase) \|
	\| V019 \| INFERENCE without MODEL \| INFERENCE block requires MODEL block \|
	\| V020 \| Invalid adapter type \| ADAPTER type must be: lora, qlora, adapter, peft \|
	\| V021 \| GUARD violation action invalid \| on_violation must be: STOP, ALERT, REPLACE, or LOG \|

	---

	## Validation Commands

	### CLI Validation

	```bash
	# Validate syntax and structure
	okto validate train.okt

	# Validate with detailed output
	okto validate train.okt --verbose

	# Validate dataset only
	okto validate train.okt --dataset-only

	# Validate model only
	okto validate train.okt --model-only
	```

	### IDE Validation

	OktoSeek IDE validates files automatically and provides:
	- Real-time syntax checking
	- Field completion suggestions
	- Error highlighting
	- Warning messages

	---

	## Best Practices

	1. Always validate before training
	```bash
	okto validate train.okt
	```

	2. Check dataset format
	- Use `okto validate --dataset-only` to verify dataset structure

	3. Verify model compatibility
	- Ensure model architecture matches dataset type
	- Check export format compatibility

	4. Test with small dataset first
	- Use subset of data for initial validation
	- Verify pipeline works before full training

	5. Monitor resource usage
	- Check available GPU memory
	- Verify disk space for checkpoints
	- Monitor training progress

	---

	For more information, see:
	- [Grammar Specification](./docs/grammar.md)
	- [Getting Started Guide](./docs/GETTING_STARTED.md)
	- [Troubleshooting](./docs/grammar.md#troubleshooting)