# OktoScript – Frequently Asked Questions (FAQ)
Common questions and answers about OktoScript, a domain-specific language for AI training, evaluation, and deployment.
---
## 1. Even if FT_LORA already points to a base model and dataset, why must I still declare the MODEL and DATASET blocks?
**Answer:**
In OktoScript, `MODEL` and `DATASET` blocks define the **global context** of your project. They represent the default base configuration for the entire pipeline.
The `FT_LORA` block does not replace them; it only defines **how** the fine-tuning is performed. This explicit separation makes scripts clearer, more organized, and avoids hidden assumptions.
**Benefits of explicit declaration:**
- ✅ **Readability** - Anyone can understand the project structure at a glance
- ✅ **Debugging** - Clear separation of concerns makes troubleshooting easier
- ✅ **Reproducibility** - All configuration is visible and version-controlled
- ✅ **Documentation** - The script serves as self-documenting code
**Example:**
```okt
MODEL {
base: "oktoseek/base-llm-7b" # Global model context
}
DATASET {
train: "dataset/main.jsonl" # Global dataset context
}
FT_LORA {
base_model: "oktoseek/base-llm-7b" # Explicit for LoRA
train_dataset: "dataset/main.jsonl" # Explicit for LoRA
lora_rank: 8
}
```
This design follows the principle: **explicit is better than implicit**, especially in AI pipelines where assumptions can lead to costly mistakes.
---
## 2. If I already use FT_LORA, why is the TRAIN block still mandatory?
**Answer:**
`FT_LORA` defines **what kind of training** happens (LoRA adapters), but `TRAIN` defines **how the training loop is executed** (optimizer, batch size, device, etc.).
**Think of it this way:**
- `TRAIN` = The engine (how training runs)
- `FT_LORA` = The driving mode (what gets trained)
**The TRAIN block controls:**
- Optimizer (adam, adamw, sgd, etc.)
- Batch size and gradient accumulation
- Device selection (cpu, cuda, mps)
- Learning rate and scheduler
- Training strategy (early stopping, checkpoints)
**Example:**
```okt
TRAIN {
epochs: 5
batch_size: 4
optimizer: "adamw"
learning_rate: 0.00003
device: "cuda"
}
FT_LORA {
lora_rank: 8
lora_alpha: 32
target_modules: ["q_proj", "v_proj"]
}
```
Both blocks are required because they serve different purposes in the declarative DSL structure.
---
## 3. How do I define the final output of my model in OktoScript?
**Answer:**
The final output is always defined in the `EXPORT` block, regardless of whether you use `TRAIN` or `FT_LORA`.
**For standard training:**
```okt
EXPORT {
format: ["gguf", "onnx", "okm"]
path: "./export/"
}
```
**For LoRA fine-tuning:**
```okt
EXPORT {
format: ["safetensors", "okm"]
path: "./export/lora_patch/"
}
```
**What gets exported:**
- With `TRAIN`: Full model weights in specified formats
- With `FT_LORA`: LoRA adapter weights (safetensors) + optional merged model (okm)
The `EXPORT` block controls:
- ✅ Adapter generation (LoRA patches via safetensors)
- ✅ OktoSeek package generation (okm format)
- ✅ Cross-platform formats (onnx, gguf)
- ✅ Quantization settings
**Key point:** Export responsibility is clearly separated from training logic, keeping the DSL clean and modular.
---
## 4. What is the difference between FT_LORA and TRAIN blocks?
**Answer:**
| Block | Role | Purpose |
|-------|------|---------|
| `TRAIN` | Training loop configuration | Defines **how** training runs (optimizer, batch size, device) |
| `FT_LORA` | LoRA adapter configuration | Defines **what** gets trained (LoRA rank, alpha, target modules) |
**Important:** `FT_LORA` is **not** a replacement for `TRAIN`; it's an **extension** that modifies how training is applied to the model.
**When to use each:**
- **Use `TRAIN` alone:** Full fine-tuning of all model parameters
- **Use `TRAIN` + `FT_LORA`:** Efficient fine-tuning with LoRA adapters (recommended for large models)
**Example:**
```okt
# Full fine-tuning
TRAIN {
epochs: 10
batch_size: 32
device: "cuda"
}
# LoRA fine-tuning (more efficient)
TRAIN {
epochs: 5
batch_size: 4
device: "cuda"
}
FT_LORA {
lora_rank: 8
lora_alpha: 32
}
```
This separation keeps the language modular and scalable.
---
## 5. Do I need to repeat the base model inside FT_LORA if it is already declared in MODEL?
**Answer:**
**Yes, by design.** OktoScript prefers explicit declarations over implicit inference.
Even though the engine could technically infer the model from `MODEL`, keeping `base_model` inside `FT_LORA`:
- ✅ **Avoids ambiguity** - No guessing which model is used
- ✅ **Makes scripts self-contained** - Each block is independent
- ✅ **Improves readability** - Clear at a glance what's happening
- ✅ **Helps during audits** - Easier to review and validate
**Example:**
```okt
MODEL {
base: "oktoseek/base-llm-7b" # Global context
}
FT_LORA {
base_model: "oktoseek/base-llm-7b" # Explicit for LoRA
lora_rank: 8
}
```
**This is an intentional design decision** to favor clarity and safety over convenience. In AI pipelines, explicit is safer than implicit.
---
## 6. What happens if I use both DATASET.train and mix_datasets at the same time?
**Answer:**
**Simple rule:** `mix_datasets` **overrides** `DATASET.train` when present.
**Priority order:**
1. `mix_datasets` in `FT_LORA` (highest priority)
2. `mix_datasets` in `DATASET` block
3. `DATASET.train` (default, lowest priority)
**Example:**
```okt
DATASET {
train: "dataset/main.jsonl" # Default dataset
}
FT_LORA {
mix_datasets: [
{ path: "dataset/a.jsonl", weight: 70 },
{ path: "dataset/b.jsonl", weight: 30 }
]
# This mix_datasets overrides DATASET.train
}
```
**Why this design?**
- Allows flexibility without breaking the main structure
- Enables dataset-specific configurations per training method
- Maintains backward compatibility with v1.0
**Best practice:** Use `DATASET.train` for the default, and `mix_datasets` when you need weighted mixing.
---
## 7. Does OktoScript replace Python?
**Answer:**
**No.** OktoScript does **not** replace Python. Instead, it replaces the **complex configuration boilerplate** typically written in Python.
**The relationship:**
- **Python** = Coding and programming (general-purpose language)
- **OktoScript** = Configuration of AI pipelines (domain-specific language)
**Think of it this way:**
```
Python (Engine) ← OktoScript (Configuration Layer) ← User
```
OktoScript sits **above** Python as a declarative layer, while Python powers the OktoEngine underneath.
**What OktoScript replaces:**
- ❌ Hundreds of lines of Python configuration code
- ❌ Complex YAML files with unclear structure
- ❌ Repetitive training scripts
**What Python still does:**
- ✅ Powers the OktoEngine
- ✅ Executes the training loop
- ✅ Handles low-level operations
- ✅ Provides hooks for custom logic
**Analogy:** OktoScript is to Python what Docker Compose is to Docker: a declarative configuration layer that simplifies complex operations.
---
## 8. Can I use multiple datasets with different weights?
**Answer:**
**Yes!** This is one of the key features of OktoScript v1.1.
**Syntax:**
```okt
DATASET {
mix_datasets: [
{ path: "dataset/general.jsonl", weight: 60 },
{ path: "dataset/technical.jsonl", weight: 30 },
{ path: "dataset/creative.jsonl", weight: 10 }
]
sampling: "weighted"
shuffle: true
}
```
**Benefits:**
- ✅ **Balanced training** - Control dataset proportions
- ✅ **Domain blending** - Combine different data sources
- ✅ **Bias reduction** - Weight underrepresented data
- ✅ **Dataset prioritization** - Emphasize important data
**Rules:**
- Total weights must equal **exactly 100**
- `sampling: "weighted"` uses weights for sampling
- `sampling: "random"` ignores weights (uniform sampling)
- `shuffle: true` shuffles datasets before mixing
**Use case example:**
```okt
# Mix general conversations (60%) with technical Q&A (30%) and creative writing (10%)
mix_datasets: [
{ path: "dataset/conversations.jsonl", weight: 60 },
{ path: "dataset/technical_qa.jsonl", weight: 30 },
{ path: "dataset/creative.jsonl", weight: 10 }
]
```
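Conceptually, `sampling: "weighted"` means each training example is drawn from one of the mixed datasets with probability proportional to its weight. Below is a minimal Python sketch of that idea using the mix above; it illustrates the sampling semantics only and is not OktoEngine's actual implementation:
```python
import random

# The mix from the example above (weights must total exactly 100).
mix = [
    ("dataset/conversations.jsonl", 60),
    ("dataset/technical_qa.jsonl", 30),
    ("dataset/creative.jsonl", 10),
]
paths = [path for path, _ in mix]
weights = [weight for _, weight in mix]
assert sum(weights) == 100

def next_source() -> str:
    """Pick which dataset the next training example comes from."""
    return random.choices(paths, weights=weights, k=1)[0]

# Over many draws, ~60% come from conversations, ~30% technical, ~10% creative.
print([next_source() for _ in range(5)])
```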
---
## 9. What is the difference between EXPORT: safetensors and EXPORT: okm?
**Answer:**
| Format | Purpose | Use Case |
|--------|---------|----------|
| `safetensors` | Standard PyTorch weights format | LoRA adapters, model weights, HuggingFace compatibility |
| `okm` | OktoSeek optimized package | OktoSeek IDE, Flutter SDK, mobile apps, exclusive tools |
| `onnx` | Universal inference format | Production deployment, cross-platform compatibility |
| `gguf` | Local inference format | Ollama, Llama.cpp, local deployment |
**For LoRA fine-tuning:**
- `safetensors` → Saves only the LoRA adapter patch (small file, ~10-100MB)
- `okm` → Saves a full OktoSeek model package (includes adapter + metadata)
**Example:**
```okt
FT_LORA {
lora_rank: 8
}
EXPORT {
format: ["safetensors", "okm"]
path: "./export/"
}
```
**Output:**
- `./export/adapter.safetensors` - LoRA adapter (for HuggingFace/PyTorch)
- `./export/model.okm` - OktoSeek package (for OktoSeek ecosystem)
**Why both?**
- `safetensors` for compatibility with standard ML tools
- `okm` for optimized OktoSeek ecosystem integration
---
## 10. Is OktoScript a programming language or a DSL?
**Answer:**
**OktoScript is a Domain-Specific Language (DSL).**
**What it is NOT:**
- ❌ A general-purpose programming language
- ❌ A scripting language with loops and variables
- ❌ A replacement for Python or JavaScript
**What it IS:**
- ✅ A declarative configuration language
- ✅ Purpose-built for AI pipelines
- ✅ Domain-specific (focused on AI training/deployment)
**Key characteristics:**
- **Declarative** - You describe **what** you want, not **how** to do it
- **No control flow** - No loops, conditionals, or functions
- **Block-based** - Configuration organized in semantic blocks
- **Type-safe** - Validated against grammar specification
**Why call it a DSL?**
- ✅ Technically accurate
- ✅ Increases professional credibility
- ✅ Sets correct expectations
- ✅ Distinguishes from general-purpose languages
**Analogy:** OktoScript is to AI pipelines what SQL is to databases: a specialized language for a specific domain.
---
## 11. What happens internally when I write FT_LORA?
**Answer:**
When you use `FT_LORA`, the OktoEngine performs these steps:
**1. Model Loading:**
- Loads the base model specified in `base_model`
- Initializes model architecture
**2. LoRA Adapter Injection:**
- Freezes the main model layers
- Adds LoRA adapters to selected modules (e.g., `q_proj`, `v_proj`)
- Adapters are pairs of low-rank matrices (dimension set by `lora_rank`, scaled by `lora_alpha`)
**3. Training:**
- Trains **only** the LoRA adapter weights
- Main model weights remain frozen
- Uses optimizer and settings from `TRAIN` block
**4. Export:**
- Saves adapter weights via `EXPORT` block
- Optionally merges adapter into base model (if specified)
**Benefits:**
- ✅ **Reduced GPU usage** - Up to 90% less VRAM
- ✅ **Faster training** - Only small adapters are updated
- ✅ **Smaller files** - Adapter weights are tiny (~10-100MB)
- ✅ **Specialization** - Multiple adapters for different tasks
- ✅ **Flexibility** - Combine adapters at inference time
**Example flow:**
```
Base Model (7B params, frozen)
        ↓
+ LoRA Adapters (rank 8, alpha 32; only a small fraction of the base parameters)
        ↓
Training (only adapters updated)
        ↓
Export adapter.safetensors (~50MB)
```
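For intuition, here is a minimal PyTorch sketch of the LoRA technique itself. This illustrates the general method, not OktoEngine's internals; the 4096-wide projection layer is a hypothetical stand-in for a module like `q_proj` or `v_proj`:
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # step 2: freeze the main weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the scaled low-rank correction.
        return self.base(x) + self.lora_b(self.lora_a(x)) * self.scaling

# Hypothetical stand-in for one attention projection in a 7B model.
layer = LoRALinear(nn.Linear(4096, 4096), rank=8, alpha=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * (4096 * 8) = 65,536 trainable vs ~16.8M frozen
```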
---
## 12. Why is explicit declaration required instead of auto-inference?
**Answer:**
**Because transparency is better than hidden assumptions**, especially in AI pipelines.
**Problems with auto-inference:**
- ❌ Hidden assumptions can lead to silent mistakes
- ❌ Difficult to debug when things go wrong
- ❌ Unclear what the system is actually doing
- ❌ Harder to audit and review
**Benefits of explicit declaration:**
- ✅ **Self-documenting** - Scripts explain themselves
- ✅ **Auditable** - Easy to review and validate
- ✅ **Beginner-friendly** - Clear what's happening
- ✅ **Safe** - No hidden behavior or assumptions
**Example of explicit vs implicit:**
```okt
# Explicit (OktoScript style)
MODEL {
base: "oktoseek/base-llm-7b"
}
FT_LORA {
base_model: "oktoseek/base-llm-7b" # Explicit, even if redundant
}
# Implicit (what we avoid)
FT_LORA {
# base_model inferred from MODEL block - NOT in OktoScript
}
```
**Philosophy:** In AI, explicit is safer than implicit. A few extra lines of configuration prevent costly mistakes.
---
## 13. Can I run LoRA without EXPORT?
**Answer:**
**Technically yes, but it's not recommended.**
**What happens without EXPORT:**
- ✅ Training completes successfully
- ✅ Adapter weights are trained
- ❌ Adapter weights are **not saved**
- ❌ Training becomes useless after process ends
**Best practice:**
```okt
FT_LORA {
lora_rank: 8
lora_alpha: 32
}
EXPORT {
format: ["safetensors", "okm"]
path: "./export/"
}
```
**Why always include EXPORT:**
- ✅ Preserves your work
- ✅ Enables model reuse
- ✅ Allows deployment
- ✅ Supports version control
**Exception:** If you're only testing or debugging, you might skip EXPORT temporarily, but always add it before production training.
---
## 14. What if I want to merge a LoRA adapter into the final model later?
**Answer:**
**Current support (v1.1):**
You can merge LoRA adapters using OktoEngine's internal tools or Python hooks:
**Option 1: Using Hooks (Current)**
```okt
HOOKS {
after_train: "scripts/merge_lora.py"
}
```
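As an illustration only, a merge hook such as `scripts/merge_lora.py` could be built on the Hugging Face `peft` library. The sketch below is hedged throughout: the `after_train` entry point mirrors the `HOOKS` examples in this FAQ, and the paths (including an adapter directory containing its `adapter_config.json`) are assumptions, not OktoEngine's actual code:
```python
# scripts/merge_lora.py - hypothetical after_train hook (sketch, not OktoEngine code)
from peft import PeftModel
from transformers import AutoModelForCausalLM

def after_train(config, dataset, model):
    # Load the frozen base model, then apply the trained adapter on top of it.
    # PeftModel expects a directory holding the adapter weights and config.
    base = AutoModelForCausalLM.from_pretrained("./models/base-model")
    adapted = PeftModel.from_pretrained(base, "./export/lora_patch")

    # Fold the low-rank update into the base weights and drop the adapter.
    merged = adapted.merge_and_unload()
    merged.save_pretrained("./export/merged-model")
    return config
```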
**Option 2: Manual merge with OktoEngine CLI**
```bash
okto_merge --adapter ./export/adapter.safetensors \
--base ./models/base-model \
--output ./export/merged-model
```
**Future support (v2.0+):**
A dedicated `MERGE` block is planned:
```okt
MERGE {
source: "export/adapter.safetensors"
target: "models/base-model"
output: "export/merged-model"
format: ["okm", "onnx"]
}
```
**Why merge?**
- ✅ Single model file (no separate adapter needed)
- ✅ Faster inference (no adapter loading)
- ✅ Easier deployment (one file instead of two)
- ✅ Better compatibility (works with standard tools)
**When to merge:**
- After training is complete
- Before deployment
- When you want a standalone model
---
## 15. Why choose OktoScript over YAML or Python scripts?
**Answer:**
**OktoScript is purpose-built for AI pipelines**, while YAML and Python are generic tools.
**Comparison:**
| Feature | OktoScript | YAML | Python |
|---------|------------|------|--------|
| **Purpose** | AI pipelines | Generic config | General programming |
| **Readability** | ✅ Block-based, semantic | ⚠️ Flat, no structure | ❌ Code complexity |
| **Validation** | ✅ Grammar-enforced | ⚠️ Manual validation | ❌ Runtime errors |
| **Type Safety** | ✅ Built-in | ❌ No types | ⚠️ Runtime checking |
| **AI-Specific** | ✅ LoRA, RAG, monitoring | ❌ Generic | ⚠️ Requires libraries |
| **Learning Curve** | ✅ Simple blocks | ⚠️ Syntax learning | ❌ Programming required |
| **IDE Support** | ✅ OktoSeek IDE | ⚠️ Generic editors | ✅ IDEs available |
**Key advantages of OktoScript:**
1. **Purpose-built for AI**
- Native support for LoRA, RAG, monitoring
- AI-specific blocks and concepts
- Optimized for ML workflows
2. **Human-oriented**
- Readable by non-programmers
- Self-documenting structure
- Clear semantic blocks
3. **Less error-prone**
- Grammar validation
- Type checking
- Constraint enforcement
4. **Integrated ecosystem**
- OktoSeek IDE support
- OktoEngine integration
- Flutter SDK compatibility
5. **Single config file**
- Everything in one `.okt` file
- No scattered configuration
- Version control friendly
**Example comparison:**
**YAML (generic):**
```yaml
model:
base: "oktoseek/base"
train:
epochs: 5
batch_size: 32
# No validation, no structure, unclear relationships
```
**Python (complex):**
```python
from transformers import Trainer, TrainingArguments
# 100+ lines of code
# Complex error handling
# Hard to read and maintain
```
**OktoScript (focused):**
```okt
MODEL {
base: "oktoseek/base"
}
TRAIN {
epochs: 5
batch_size: 32
}
# Clear, validated, self-documenting
```
**Bottom line:** OktoScript is to AI pipelines what Docker Compose is to containers: a declarative DSL that simplifies complex operations.
---
## 16. How does OktoScript handle model versioning and checkpoints?
**Answer:**
OktoScript uses the `runs/` directory structure for automatic versioning and checkpoint management.
**Structure:**
```
runs/
└── my-model/
    ├── checkpoint-100/
    │   └── model.safetensors
    ├── checkpoint-200/
    │   └── model.safetensors
    ├── tokenizer.json
    ├── training_logs.json
    └── metrics.json
```
**Checkpoint configuration:**
```okt
TRAIN {
epochs: 10
checkpoint_steps: 100 # Save every 100 steps
checkpoint_path: "./checkpoints"
}
```
**Resume from checkpoint:**
```okt
TRAIN {
resume_from_checkpoint: "./checkpoints/checkpoint-500"
epochs: 10
}
```
**Benefits:**
- ✅ Automatic versioning by run name
- ✅ Step-based checkpointing
- ✅ Easy resume from any checkpoint
- ✅ Training logs and metrics per run
**Best practice:** Use descriptive project names in `PROJECT` block to organize runs.
---
## 17. Can I use custom Python code with OktoScript?
**Answer:**
**Yes!** OktoScript supports custom Python code through the `HOOKS` block.
**Available hooks:**
```okt
HOOKS {
before_train: "scripts/preprocess.py"
after_train: "scripts/postprocess.py"
before_epoch: "scripts/custom_early_stop.py"
after_epoch: "scripts/log_custom_metrics.py"
on_checkpoint: "scripts/backup_checkpoint.sh"
custom_metric: "scripts/toxicity_calculator.py"
}
```
**Hook script interface:**
```python
# scripts/preprocess.py
def before_train(config, dataset, model):
# Custom preprocessing
# Modify config if needed
return config
# scripts/after_epoch.py
def after_epoch(epoch, metrics, model_state):
# Custom logging, early stopping logic
# Return True to stop training
return False
```
**Use cases:**
- Custom data preprocessing
- Custom metrics calculation
- Custom early stopping logic
- External API integration
- Custom logging
**Key point:** OktoScript handles the configuration, Python handles the custom logic. Best of both worlds.
---
## 18. What happens if I specify conflicting configurations?
**Answer:**
OktoScript has **clear priority rules** to handle conflicts:
**Priority order (highest to lowest):**
1. Block-specific overrides (e.g., `mix_datasets` in `FT_LORA`)
2. Block-level settings (e.g., `FT_LORA` over `TRAIN` for LoRA)
3. Global settings (e.g., `DATASET.train`)
**Example conflicts and resolution:**
**Conflict 1: Dataset specification**
```okt
DATASET {
train: "dataset/a.jsonl" # Lower priority
}
FT_LORA {
mix_datasets: [...] # Higher priority - overrides DATASET.train
}
```
**Resolution:** `mix_datasets` is used, `DATASET.train` is ignored.
**Conflict 2: TRAIN vs FT_LORA**
```okt
TRAIN {
epochs: 10
}
FT_LORA {
epochs: 5 # This is used for LoRA training
}
```
**Resolution:** `FT_LORA.epochs` is used, but `TRAIN` optimizer/device settings still apply.
**Validation:**
- OktoEngine validates configurations before training
- Conflicts are reported with clear error messages
- Use `okto validate` to check before training
---
## 19. How do I debug an OktoScript file?
**Answer:**
**Step 1: Validate syntax**
```bash
okto validate train.okt
```
**Step 2: Check logs**
```okt
LOGGING {
save_logs: true
log_level: "debug" # Enable debug logging
log_every: 1
}
```
**Step 3: Use MONITOR for system diagnostics**
```okt
MONITOR {
level: "full"
log_system: ["gpu_memory_used", "cpu_usage", "temperature"]
dashboard: true # Real-time visualization
}
```
**Step 4: Check validation errors**
Common errors and solutions:
- `Dataset file not found` → Check file paths
- `Invalid optimizer` → Use allowed values (adam, adamw, sgd, etc.)
- `Model base not found` → Verify model path or HuggingFace name
- `Dataset mixing weights invalid` → Total must equal 100
**Step 5: Use system diagnostics**
```bash
okto_doctor # Shows GPU, CUDA, RAM, drivers
```
**Best practices:**
- Always validate before training
- Start with `log_level: "debug"`
- Use `MONITOR` dashboard for real-time insights
- Check `runs/*/training_logs.json` for detailed logs
---
## 20. Is OktoScript production-ready?
**Answer:**
**Yes, OktoScript v1.1 is production-ready** for AI training and deployment pipelines.
**Production features:**
- ✅ **Stable grammar** - Well-defined and validated
- ✅ **Error handling** - Comprehensive validation
- ✅ **Monitoring** - System and training telemetry
- ✅ **Export formats** - Production-ready formats (ONNX, GGUF, OKM)
- ✅ **Deployment** - API, mobile, edge targets
- ✅ **Security** - Model encryption and watermarking
- ✅ **Logging** - Comprehensive logging and metrics
**Production checklist:**
```okt
PROJECT "ProductionModel"
VERSION "1.0"
# ... configuration ...
SECURITY {
encrypt_model: true
watermark: true
}
MONITOR {
level: "full"
dashboard: true
}
EXPORT {
format: ["onnx", "okm"] # Production formats
optimize_for: "speed"
}
DEPLOY {
target: "api"
requires_auth: true
max_concurrent_requests: 100
}
```
**Used by:**
- OktoSeek IDE (production)
- Research institutions
- AI development teams
- Educational platforms
**Version stability:**
- v1.0: Stable, production-ready
- v1.1: Backward compatible, adds LoRA and monitoring
---
## Need More Help?
- 📖 [Complete Grammar Specification](./grammar.md)
- 🚀 [Getting Started Guide](./GETTING_STARTED.md)
- ✅ [Validation Rules](../VALIDATION_RULES.md)
- 💡 [Examples](../examples/)
- 🐛 [Troubleshooting](./grammar.md#troubleshooting)
**Still have questions?** Open an issue on [GitHub](https://github.com/oktoseek/oktoscript/issues) or contact **service@oktoseek.com**.
---
**OktoScript** is developed and maintained by **OktoSeek AI**.