# OktoScript – Frequently Asked Questions (FAQ)
Common questions and answers about OktoScript, a domain-specific language for AI training, evaluation, and deployment.
---
## 1. Even if FT_LORA already points to a base model and dataset, why must I still declare the MODEL and DATASET blocks?
**Answer:**
In OktoScript, `MODEL` and `DATASET` blocks define the **global context** of your project. They represent the default base configuration for the entire pipeline.
The `FT_LORA` block does not replace them; it only defines **how** the fine-tuning is performed. This explicit separation makes scripts clearer, more organized, and avoids hidden assumptions.
**Benefits of explicit declaration:**
- ✅ **Readability** - Anyone can understand the project structure at a glance
- ✅ **Debugging** - Clear separation of concerns makes troubleshooting easier
- ✅ **Reproducibility** - All configuration is visible and version-controlled
- ✅ **Documentation** - The script serves as self-documenting code
**Example:**
```okt
MODEL {
base: "oktoseek/base-llm-7b" # Global model context
}
DATASET {
train: "dataset/main.jsonl" # Global dataset context
}
FT_LORA {
base_model: "oktoseek/base-llm-7b" # Explicit for LoRA
train_dataset: "dataset/main.jsonl" # Explicit for LoRA
lora_rank: 8
}
```
This design follows the principle: **explicit is better than implicit**, especially in AI pipelines where assumptions can lead to costly mistakes.
---
## 2. If I already use FT_LORA, why is the TRAIN block still mandatory?
**Answer:**
`FT_LORA` defines **what kind of training** happens (LoRA adapters), but `TRAIN` defines **how the training loop is executed** (optimizer, batch size, device, etc.).
**Think of it this way:**
- `TRAIN` = The engine (how training runs)
- `FT_LORA` = The driving mode (what gets trained)
**The TRAIN block controls:**
- Optimizer (adam, adamw, sgd, etc.)
- Batch size and gradient accumulation
- Device selection (cpu, cuda, mps)
- Learning rate and scheduler
- Training strategy (early stopping, checkpoints)
**Example:**
```okt
TRAIN {
epochs: 5
batch_size: 4
optimizer: "adamw"
learning_rate: 0.00003
device: "cuda"
}
FT_LORA {
lora_rank: 8
lora_alpha: 32
target_modules: ["q_proj", "v_proj"]
}
```
Both blocks are required because they serve different purposes in the declarative DSL structure.
---
## 3. How do I define the final output of my model in OktoScript?
**Answer:**
The final output is always defined in the `EXPORT` block, regardless of whether you use `TRAIN` or `FT_LORA`.
**For standard training:**
```okt
EXPORT {
format: ["gguf", "onnx", "okm"]
path: "./export/"
}
```
**For LoRA fine-tuning:**
```okt
EXPORT {
format: ["safetensors", "okm"]
path: "./export/lora_patch/"
}
```
**What gets exported:**
- With `TRAIN`: Full model weights in specified formats
- With `FT_LORA`: LoRA adapter weights (safetensors) + optional merged model (okm)
The `EXPORT` block controls:
- ✅ Adapter generation (LoRA patches via safetensors)
- ✅ OktoSeek package generation (okm format)
- ✅ Cross-platform formats (onnx, gguf)
- ✅ Quantization settings
**Key point:** Export responsibility is clearly separated from training logic, keeping the DSL clean and modular.
---
## 4. What is the difference between FT_LORA and TRAIN blocks?
**Answer:**
| Block | Role | Purpose |
|-------|------|---------|
| `TRAIN` | Training loop configuration | Defines **how** training runs (optimizer, batch size, device) |
| `FT_LORA` | LoRA adapter configuration | Defines **what** gets trained (LoRA rank, alpha, target modules) |
**Important:** `FT_LORA` is **not** a replacement for `TRAIN`; it's an **extension** that modifies how training is applied to the model.
**When to use each:**
- **Use `TRAIN` alone:** Full fine-tuning of all model parameters
- **Use `TRAIN` + `FT_LORA`:** Efficient fine-tuning with LoRA adapters (recommended for large models)
**Example:**
```okt
# Full fine-tuning
TRAIN {
epochs: 10
batch_size: 32
device: "cuda"
}
# LoRA fine-tuning (more efficient)
TRAIN {
epochs: 5
batch_size: 4
device: "cuda"
}
FT_LORA {
lora_rank: 8
lora_alpha: 32
}
```
This separation keeps the language modular and scalable.
---
## 5. Do I need to repeat the base model inside FT_LORA if it is already declared in MODEL?
**Answer:**
**Yes, by design.** OktoScript prefers explicit declarations over implicit inference.
Even though the engine could technically infer the model from `MODEL`, keeping `base_model` inside `FT_LORA`:
- ✅ **Avoids ambiguity** - No guessing which model is used
- ✅ **Makes scripts self-contained** - Each block is independent
- ✅ **Improves readability** - Clear at a glance what's happening
- ✅ **Helps during audits** - Easier to review and validate
**Example:**
```okt
MODEL {
base: "oktoseek/base-llm-7b" # Global context
}
FT_LORA {
base_model: "oktoseek/base-llm-7b" # Explicit for LoRA
lora_rank: 8
}
```
**This is an intentional design decision** to favor clarity and safety over convenience. In AI pipelines, explicit is safer than implicit.
---
## 6. What happens if I use both DATASET.train and mix_datasets at the same time?
**Answer:**
**Simple rule:** `mix_datasets` **overrides** `DATASET.train` when present.
**Priority order:**
1. `mix_datasets` in `FT_LORA` (highest priority)
2. `mix_datasets` in `DATASET` block
3. `DATASET.train` (default, lowest priority)
**Example:**
```okt
DATASET {
train: "dataset/main.jsonl" # Default dataset
}
FT_LORA {
mix_datasets: [
{ path: "dataset/a.jsonl", weight: 70 },
{ path: "dataset/b.jsonl", weight: 30 }
]
# This mix_datasets overrides DATASET.train
}
```
**Why this design?**
- Allows flexibility without breaking the main structure
- Enables dataset-specific configurations per training method
- Maintains backward compatibility with v1.0
**Best practice:** Use `DATASET.train` for the default, and `mix_datasets` when you need weighted mixing.
---
## 7. Does OktoScript replace Python?
**Answer:**
**No.** OktoScript does **not** replace Python. Instead, it replaces the **complex configuration boilerplate** typically written in Python.
**The relationship:**
- **Python** = Coding and programming (general-purpose language)
- **OktoScript** = Configuration of AI pipelines (domain-specific language)
**Think of it this way:**
```
Python (Engine) ← OktoScript (Configuration Layer) ← User
```
OktoScript sits **above** Python as a declarative layer, while Python powers the OktoEngine underneath.
**What OktoScript replaces:**
- ❌ Hundreds of lines of Python configuration code
- ❌ Complex YAML files with unclear structure
- ❌ Repetitive training scripts
**What Python still does:**
- ✅ Powers the OktoEngine
- ✅ Executes the training loop
- ✅ Handles low-level operations
- ✅ Provides hooks for custom logic
**Analogy:** OktoScript is to Python what Docker Compose is to Docker: a declarative configuration layer that simplifies complex operations.
---
## 8. Can I use multiple datasets with different weights?
**Answer:**
**Yes!** This is one of the key features of OktoScript v1.1.
**Syntax:**
```okt
DATASET {
mix_datasets: [
{ path: "dataset/general.jsonl", weight: 60 },
{ path: "dataset/technical.jsonl", weight: 30 },
{ path: "dataset/creative.jsonl", weight: 10 }
]
sampling: "weighted"
shuffle: true
}
```
**Benefits:**
- ✅ **Balanced training** - Control dataset proportions
- ✅ **Domain blending** - Combine different data sources
- ✅ **Bias reduction** - Weight underrepresented data
- ✅ **Dataset prioritization** - Emphasize important data
**Rules:**
- Total weights must equal **exactly 100**
- `sampling: "weighted"` uses weights for sampling
- `sampling: "random"` ignores weights (uniform sampling)
- `shuffle: true` shuffles datasets before mixing
**Use case example:**
```okt
# Mix general conversations (60%) with technical Q&A (30%) and creative writing (10%)
mix_datasets: [
{ path: "dataset/conversations.jsonl", weight: 60 },
{ path: "dataset/technical_qa.jsonl", weight: 30 },
{ path: "dataset/creative.jsonl", weight: 10 }
]
```
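Conceptually, `sampling: "weighted"` means each training example is drawn from one of the mixed datasets with probability proportional to its weight. Below is a minimal Python sketch of that idea using the mix above; it illustrates the sampling semantics only and is not OktoEngine's actual implementation:
```python
import random

# The mix from the example above (weights must total exactly 100).
mix = [
    ("dataset/conversations.jsonl", 60),
    ("dataset/technical_qa.jsonl", 30),
    ("dataset/creative.jsonl", 10),
]
paths = [path for path, _ in mix]
weights = [weight for _, weight in mix]
assert sum(weights) == 100

def next_source() -> str:
    """Pick which dataset the next training example comes from."""
    return random.choices(paths, weights=weights, k=1)[0]

# Over many draws, ~60% come from conversations, ~30% technical, ~10% creative.
print([next_source() for _ in range(5)])
```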
---
## 9. What is the difference between EXPORT: safetensors and EXPORT: okm?
**Answer:**
| Format | Purpose | Use Case |
|--------|---------|----------|
| `safetensors` | Standard PyTorch weights format | LoRA adapters, model weights, HuggingFace compatibility |
| `okm` | OktoSeek optimized package | OktoSeek IDE, Flutter SDK, mobile apps, exclusive tools |
| `onnx` | Universal inference format | Production deployment, cross-platform compatibility |
| `gguf` | Local inference format | Ollama, Llama.cpp, local deployment |
**For LoRA fine-tuning:**
- `safetensors` → Saves only the LoRA adapter patch (small file, ~10-100MB)
- `okm` → Saves a full OktoSeek model package (includes adapter + metadata)
**Example:**
```okt
FT_LORA {
lora_rank: 8
}
EXPORT {
format: ["safetensors", "okm"]
path: "./export/"
}
```
**Output:**
- `./export/adapter.safetensors` - LoRA adapter (for HuggingFace/PyTorch)
- `./export/model.okm` - OktoSeek package (for OktoSeek ecosystem)
**Why both?**
- `safetensors` for compatibility with standard ML tools
- `okm` for optimized OktoSeek ecosystem integration
---
## 10. Is OktoScript a programming language or a DSL?
**Answer:**
**OktoScript is a Domain-Specific Language (DSL).**
**What it is NOT:**
- ❌ A general-purpose programming language
- ❌ A scripting language with loops and variables
- ❌ A replacement for Python or JavaScript
**What it IS:**
- ✅ A declarative configuration language
- ✅ Purpose-built for AI pipelines
- ✅ Domain-specific (focused on AI training/deployment)
**Key characteristics:**
- **Declarative** - You describe **what** you want, not **how** to do it
- **No control flow** - No loops, conditionals, or functions
- **Block-based** - Configuration organized in semantic blocks
- **Type-safe** - Validated against grammar specification
**Why call it a DSL?**
- ✅ Technically accurate
- ✅ Increases professional credibility
- ✅ Sets correct expectations
- ✅ Distinguishes from general-purpose languages
**Analogy:** OktoScript is to AI pipelines what SQL is to databases: a specialized language for a specific domain.
---
## 11. What happens internally when I write FT_LORA?
**Answer:**
When you use `FT_LORA`, the OktoEngine performs these steps:
**1. Model Loading:**
- Loads the base model specified in `base_model`
- Initializes model architecture
**2. LoRA Adapter Injection:**
- Freezes the main model layers
- Adds LoRA adapters to selected modules (e.g., `q_proj`, `v_proj`)
- Adapters are pairs of low-rank matrices (dimension set by `lora_rank`, scaled by `lora_alpha`)
**3. Training:**
- Trains **only** the LoRA adapter weights
- Main model weights remain frozen
- Uses optimizer and settings from `TRAIN` block
**4. Export:**
- Saves adapter weights via `EXPORT` block
- Optionally merges adapter into base model (if specified)
**Benefits:**
- ✅ **Reduced GPU usage** - Up to 90% less VRAM
- ✅ **Faster training** - Only small adapters are updated
- ✅ **Smaller files** - Adapter weights are tiny (~10-100MB)
- ✅ **Specialization** - Multiple adapters for different tasks
- ✅ **Flexibility** - Combine adapters at inference time
**Example flow:**
```
Base Model (7B params, frozen)
        ↓
+ LoRA Adapters (rank 8, alpha 32; only a small fraction of the base parameters)
        ↓
Training (only adapters updated)
        ↓
Export adapter.safetensors (~50MB)
```
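For intuition, here is a minimal PyTorch sketch of the LoRA technique itself. This illustrates the general method, not OktoEngine's internals; the 4096-wide projection layer is a hypothetical stand-in for a module like `q_proj` or `v_proj`:
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # step 2: freeze the main weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the scaled low-rank correction.
        return self.base(x) + self.lora_b(self.lora_a(x)) * self.scaling

# Hypothetical stand-in for one attention projection in a 7B model.
layer = LoRALinear(nn.Linear(4096, 4096), rank=8, alpha=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * (4096 * 8) = 65,536 trainable vs ~16.8M frozen
```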
---
## 12. Why is explicit declaration required instead of auto-inference?
**Answer:**
**Because transparency is better than hidden assumptions**, especially in AI pipelines.
**Problems with auto-inference:**
- ❌ Hidden assumptions can lead to silent mistakes
- ❌ Difficult to debug when things go wrong
- ❌ Unclear what the system is actually doing
- ❌ Harder to audit and review
**Benefits of explicit declaration:**
- ✅ **Self-documenting** - Scripts explain themselves
- ✅ **Auditable** - Easy to review and validate
- ✅ **Beginner-friendly** - Clear what's happening
- ✅ **Safe** - No hidden behavior or assumptions
**Example of explicit vs implicit:**
```okt
# Explicit (OktoScript style)
MODEL {
base: "oktoseek/base-llm-7b"
}
FT_LORA {
base_model: "oktoseek/base-llm-7b" # Explicit, even if redundant
}
# Implicit (what we avoid)
FT_LORA {
# base_model inferred from MODEL block - NOT in OktoScript
}
```
**Philosophy:** In AI, explicit is safer than implicit. A few extra lines of configuration prevent costly mistakes.
---
## 13. Can I run LoRA without EXPORT?
**Answer:**
**Technically yes, but it's not recommended.**
**What happens without EXPORT:**
- ✅ Training completes successfully
- ✅ Adapter weights are trained
- ❌ Adapter weights are **not saved**
- ❌ Training becomes useless after process ends
**Best practice:**
```okt
FT_LORA {
lora_rank: 8
lora_alpha: 32
}
EXPORT {
format: ["safetensors", "okm"]
path: "./export/"
}
```
**Why always include EXPORT:**
- ✅ Preserves your work
- ✅ Enables model reuse
- ✅ Allows deployment
- ✅ Supports version control
**Exception:** If you're only testing or debugging, you might skip EXPORT temporarily, but always add it before production training.
---
## 14. What if I want to merge a LoRA adapter into the final model later?
**Answer:**
**Current support (v1.1):**
You can merge LoRA adapters using OktoEngine's internal tools or Python hooks:
**Option 1: Using Hooks (Current)**
```okt
HOOKS {
after_train: "scripts/merge_lora.py"
}
```
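As an illustration only, a merge hook such as `scripts/merge_lora.py` could be built on the Hugging Face `peft` library. The sketch below is hedged throughout: the `after_train` entry point mirrors the `HOOKS` examples in this FAQ, and the paths (including an adapter directory containing its `adapter_config.json`) are assumptions, not OktoEngine's actual code:
```python
# scripts/merge_lora.py - hypothetical after_train hook (sketch, not OktoEngine code)
from peft import PeftModel
from transformers import AutoModelForCausalLM

def after_train(config, dataset, model):
    # Load the frozen base model, then apply the trained adapter on top of it.
    # PeftModel expects a directory holding the adapter weights and config.
    base = AutoModelForCausalLM.from_pretrained("./models/base-model")
    adapted = PeftModel.from_pretrained(base, "./export/lora_patch")

    # Fold the low-rank update into the base weights and drop the adapter.
    merged = adapted.merge_and_unload()
    merged.save_pretrained("./export/merged-model")
    return config
```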
**Option 2: Manual merge with OktoEngine CLI**
```bash
okto_merge --adapter ./export/adapter.safetensors \
--base ./models/base-model \
--output ./export/merged-model
```
**Future support (v2.0+):**
A dedicated `MERGE` block is planned:
```okt
MERGE {
source: "export/adapter.safetensors"
target: "models/base-model"
output: "export/merged-model"
format: ["okm", "onnx"]
}
```
**Why merge?**
- ✅ Single model file (no separate adapter needed)
- ✅ Faster inference (no adapter loading)
- ✅ Easier deployment (one file instead of two)
- ✅ Better compatibility (works with standard tools)
**When to merge:**
- After training is complete
- Before deployment
- When you want a standalone model
---
## 15. Why choose OktoScript over YAML or Python scripts?
**Answer:**
**OktoScript is purpose-built for AI pipelines**, while YAML and Python are generic tools.
**Comparison:**
| Feature | OktoScript | YAML | Python |
|---------|------------|------|--------|
| **Purpose** | AI pipelines | Generic config | General programming |
| **Readability** | ✅ Block-based, semantic | ⚠️ Flat, no structure | ❌ Code complexity |
| **Validation** | ✅ Grammar-enforced | ⚠️ Manual validation | ❌ Runtime errors |
| **Type Safety** | ✅ Built-in | ❌ No types | ⚠️ Runtime checking |
| **AI-Specific** | ✅ LoRA, RAG, monitoring | ❌ Generic | ⚠️ Requires libraries |
| **Learning Curve** | ✅ Simple blocks | ⚠️ Syntax learning | ❌ Programming required |
| **IDE Support** | ✅ OktoSeek IDE | ⚠️ Generic editors | ✅ IDEs available |
**Key advantages of OktoScript:**
1. **Purpose-built for AI**
- Native support for LoRA, RAG, monitoring
- AI-specific blocks and concepts
- Optimized for ML workflows
2. **Human-oriented**
- Readable by non-programmers
- Self-documenting structure
- Clear semantic blocks
3. **Less error-prone**
- Grammar validation
- Type checking
- Constraint enforcement
4. **Integrated ecosystem**
- OktoSeek IDE support
- OktoEngine integration
- Flutter SDK compatibility
5. **Single config file**
- Everything in one `.okt` file
- No scattered configuration
- Version control friendly
**Example comparison:**
**YAML (generic):**
```yaml
model:
base: "oktoseek/base"
train:
epochs: 5
batch_size: 32
# No validation, no structure, unclear relationships
```
**Python (complex):**
```python
from transformers import Trainer, TrainingArguments
# 100+ lines of code
# Complex error handling
# Hard to read and maintain
```
**OktoScript (focused):**
```okt
MODEL {
base: "oktoseek/base"
}
TRAIN {
epochs: 5
batch_size: 32
}
# Clear, validated, self-documenting
```
**Bottom line:** OktoScript is to AI pipelines what Docker Compose is to containers: a declarative DSL that simplifies complex operations.
---
## 16. How does OktoScript handle model versioning and checkpoints?
**Answer:**
OktoScript uses the `runs/` directory structure for automatic versioning and checkpoint management.
**Structure:**
```
runs/
└── my-model/
    ├── checkpoint-100/
    │   └── model.safetensors
    ├── checkpoint-200/
    │   └── model.safetensors
    ├── tokenizer.json
    ├── training_logs.json
    └── metrics.json
```
**Checkpoint configuration:**
```okt
TRAIN {
epochs: 10
checkpoint_steps: 100 # Save every 100 steps
checkpoint_path: "./checkpoints"
}
```
**Resume from checkpoint:**
```okt
TRAIN {
resume_from_checkpoint: "./checkpoints/checkpoint-500"
epochs: 10
}
```
**Benefits:**
- ✅ Automatic versioning by run name
- ✅ Step-based checkpointing
- ✅ Easy resume from any checkpoint
- ✅ Training logs and metrics per run
**Best practice:** Use descriptive project names in `PROJECT` block to organize runs.
---
## 17. Can I use custom Python code with OktoScript?
**Answer:**
**Yes!** OktoScript supports custom Python code through the `HOOKS` block.
**Available hooks:**
```okt
HOOKS {
before_train: "scripts/preprocess.py"
after_train: "scripts/postprocess.py"
before_epoch: "scripts/custom_early_stop.py"
after_epoch: "scripts/log_custom_metrics.py"
on_checkpoint: "scripts/backup_checkpoint.sh"
custom_metric: "scripts/toxicity_calculator.py"
}
```
**Hook script interface:**
```python
# scripts/preprocess.py
def before_train(config, dataset, model):
# Custom preprocessing
# Modify config if needed
return config
# scripts/after_epoch.py
def after_epoch(epoch, metrics, model_state):
# Custom logging, early stopping logic
# Return True to stop training
return False
```
**Use cases:**
- Custom data preprocessing
- Custom metrics calculation
- Custom early stopping logic
- External API integration
- Custom logging
**Key point:** OktoScript handles the configuration, Python handles the custom logic. Best of both worlds.
---
## 18. What happens if I specify conflicting configurations?
**Answer:**
OktoScript has **clear priority rules** to handle conflicts:
**Priority order (highest to lowest):**
1. Block-specific overrides (e.g., `mix_datasets` in `FT_LORA`)
2. Block-level settings (e.g., `FT_LORA` over `TRAIN` for LoRA)
3. Global settings (e.g., `DATASET.train`)
**Example conflicts and resolution:**
**Conflict 1: Dataset specification**
```okt
DATASET {
train: "dataset/a.jsonl" # Lower priority
}
FT_LORA {
mix_datasets: [...] # Higher priority - overrides DATASET.train
}
```
**Resolution:** `mix_datasets` is used, `DATASET.train` is ignored.
**Conflict 2: TRAIN vs FT_LORA**
```okt
TRAIN {
epochs: 10
}
FT_LORA {
epochs: 5 # This is used for LoRA training
}
```
**Resolution:** `FT_LORA.epochs` is used, but `TRAIN` optimizer/device settings still apply.
**Validation:**
- OktoEngine validates configurations before training
- Conflicts are reported with clear error messages
- Use `okto validate` to check before training
---
## 19. How do I debug an OktoScript file?
**Answer:**
**Step 1: Validate syntax**
```bash
okto validate train.okt
```
**Step 2: Check logs**
```okt
LOGGING {
save_logs: true
log_level: "debug" # Enable debug logging
log_every: 1
}
```
**Step 3: Use MONITOR for system diagnostics**
```okt
MONITOR {
level: "full"
log_system: ["gpu_memory_used", "cpu_usage", "temperature"]
dashboard: true # Real-time visualization
}
```
**Step 4: Check validation errors**
Common errors and solutions:
- `Dataset file not found` → Check file paths
- `Invalid optimizer` → Use allowed values (adam, adamw, sgd, etc.)
- `Model base not found` → Verify model path or HuggingFace name
- `Dataset mixing weights invalid` → Total must equal 100
**Step 5: Use system diagnostics**
```bash
okto_doctor # Shows GPU, CUDA, RAM, drivers
```
**Best practices:**
- Always validate before training
- Start with `log_level: "debug"`
- Use `MONITOR` dashboard for real-time insights
- Check `runs/*/training_logs.json` for detailed logs
---
## 20. Is OktoScript production-ready?
**Answer:**
**Yes, OktoScript v1.1 is production-ready** for AI training and deployment pipelines.
**Production features:**
- ✅ **Stable grammar** - Well-defined and validated
- ✅ **Error handling** - Comprehensive validation
- ✅ **Monitoring** - System and training telemetry
- ✅ **Export formats** - Production-ready formats (ONNX, GGUF, OKM)
- ✅ **Deployment** - API, mobile, edge targets
- ✅ **Security** - Model encryption and watermarking
- ✅ **Logging** - Comprehensive logging and metrics
**Production checklist:**
```okt
PROJECT "ProductionModel"
VERSION "1.0"
# ... configuration ...
SECURITY {
encrypt_model: true
watermark: true
}
MONITOR {
level: "full"
dashboard: true
}
EXPORT {
format: ["onnx", "okm"] # Production formats
optimize_for: "speed"
}
DEPLOY {
target: "api"
requires_auth: true
max_concurrent_requests: 100
}
```
**Used by:**
- OktoSeek IDE (production)
- Research institutions
- AI development teams
- Educational platforms
**Version stability:**
- v1.0: Stable, production-ready
- v1.1: Backward compatible, adds LoRA and monitoring
---
## Need More Help?
- 📖 [Complete Grammar Specification](./grammar.md)
- 🚀 [Getting Started Guide](./GETTING_STARTED.md)
- ✅ [Validation Rules](../VALIDATION_RULES.md)
- 💡 [Examples](../examples/)
- 🐛 [Troubleshooting](./grammar.md#troubleshooting)
**Still have questions?** Open an issue on [GitHub](https://github.com/oktoseek/oktoscript/issues) or contact **service@oktoseek.com**.
---
**OktoScript** is developed and maintained by **OktoSeek AI**.