# OktoScript – Frequently Asked Questions (FAQ)

Common questions and answers about OktoScript, a domain-specific language for AI training, evaluation, and deployment.

---

## 1. Even if FT_LORA already points to a base model and dataset, why must I still declare the MODEL and DATASET blocks?



**Answer:**



In OktoScript, `MODEL` and `DATASET` blocks define the **global context** of your project. They represent the default base configuration for the entire pipeline.



The `FT_LORA` block does not replace them; it only defines **how** the fine-tuning is performed. This explicit separation makes scripts clearer, more organized, and avoids hidden assumptions.

**Benefits of explicit declaration:**
- ✅ **Readability** - Anyone can understand the project structure at a glance
- ✅ **Debugging** - Clear separation of concerns makes troubleshooting easier
- ✅ **Reproducibility** - All configuration is visible and version-controlled
- ✅ **Documentation** - The script serves as self-documenting code

**Example:**
```okt
MODEL {
    base: "oktoseek/base-llm-7b"  # Global model context
}

DATASET {
    train: "dataset/main.jsonl"   # Global dataset context
}

FT_LORA {
    base_model: "oktoseek/base-llm-7b"  # Explicit for LoRA
    train_dataset: "dataset/main.jsonl"  # Explicit for LoRA
    lora_rank: 8
}
```

This design follows the principle: **explicit is better than implicit**, especially in AI pipelines where assumptions can lead to costly mistakes.

---

## 2. If I already use FT_LORA, why is the TRAIN block still mandatory?



**Answer:**



`FT_LORA` defines **what kind of training** happens (LoRA adapters), but `TRAIN` defines **how the training loop is executed** (optimizer, batch size, device, etc.).

**Think of it this way:**
- `TRAIN` = The engine (how training runs)
- `FT_LORA` = The driving mode (what gets trained)

**The TRAIN block controls:**
- Optimizer (adam, adamw, sgd, etc.)
- Batch size and gradient accumulation
- Device selection (cpu, cuda, mps)
- Learning rate and scheduler
- Training strategy (early stopping, checkpoints)

**Example:**
```okt
TRAIN {
    epochs: 5
    batch_size: 4
    optimizer: "adamw"
    learning_rate: 0.00003
    device: "cuda"
}

FT_LORA {
    lora_rank: 8
    lora_alpha: 32
    target_modules: ["q_proj", "v_proj"]
}
```

Both blocks are required because they serve different purposes in the declarative DSL structure.

---

## 3. How do I define the final output of my model in OktoScript?

**Answer:**

The final output is always defined in the `EXPORT` block, regardless of whether you use `TRAIN` or `FT_LORA`.

**For standard training:**
```okt
EXPORT {
    format: ["gguf", "onnx", "okm"]
    path: "./export/"
}
```

**For LoRA fine-tuning:**
```okt
EXPORT {
    format: ["safetensors", "okm"]
    path: "./export/lora_patch/"
}
```

**What gets exported:**
- With `TRAIN`: Full model weights in specified formats
- With `FT_LORA`: LoRA adapter weights (safetensors) + optional merged model (okm)

The `EXPORT` block controls:
- ✅ Adapter generation (LoRA patches via safetensors)
- ✅ OktoSeek package generation (okm format)
- ✅ Cross-platform formats (onnx, gguf)
- ✅ Quantization settings

**Key point:** Export responsibility is clearly separated from training logic, keeping the DSL clean and modular.

---

## 4. What is the difference between FT_LORA and TRAIN blocks?



**Answer:**



| Block | Role | Purpose |
|-------|------|---------|
| `TRAIN` | Training loop configuration | Defines **how** training runs (optimizer, batch size, device) |
| `FT_LORA` | LoRA adapter configuration | Defines **what** gets trained (LoRA rank, alpha, target modules) |

**Important:** `FT_LORA` is **not** a replacement for `TRAIN`; it's an **extension** that modifies how training is applied to the model.

**When to use each:**
- **Use `TRAIN` alone:** Full fine-tuning of all model parameters
- **Use `TRAIN` + `FT_LORA`:** Efficient fine-tuning with LoRA adapters (recommended for large models)



**Example:**

```okt
# Full fine-tuning
TRAIN {
    epochs: 10
    batch_size: 32
    device: "cuda"
}

# LoRA fine-tuning (more efficient)
TRAIN {
    epochs: 5
    batch_size: 4
    device: "cuda"
}

FT_LORA {
    lora_rank: 8
    lora_alpha: 32
}
```



This separation keeps the language modular and scalable.



---



## 5. Do I need to repeat the base model inside FT_LORA if it is already declared in MODEL?



**Answer:**



**Yes, by design.** OktoScript prefers explicit declarations over implicit inference.



Even though the engine could technically infer the model from `MODEL`, keeping `base_model` inside `FT_LORA`:



- ✅ **Avoids ambiguity** - No guessing which model is used
- ✅ **Makes scripts self-contained** - Each block is independent
- ✅ **Improves readability** - Clear at a glance what's happening
- ✅ **Helps during audits** - Easier to review and validate



**Example:**

```okt
MODEL {
    base: "oktoseek/base-llm-7b"  # Global context
}

FT_LORA {
    base_model: "oktoseek/base-llm-7b"  # Explicit for LoRA
    lora_rank: 8
}
```



**This is an intentional design decision** to favor clarity and safety over convenience. In AI pipelines, explicit is safer than implicit.

---

## 6. What happens if I use both DATASET.train and mix_datasets at the same time?



**Answer:**



**Simple rule:** `mix_datasets` **overrides** `DATASET.train` when present.

**Priority order:**
1. `mix_datasets` in `FT_LORA` (highest priority)
2. `mix_datasets` in `DATASET` block
3. `DATASET.train` (default, lowest priority)
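
To make the rule concrete, here is a rough Python sketch of that resolution order. It is illustrative only, not OktoEngine source code:

```python
def resolve_train_data(dataset_block: dict, ft_lora_block: dict):
    """Rough sketch of the priority order above (illustrative, not OktoEngine source)."""
    if ft_lora_block.get("mix_datasets"):      # 1. mix_datasets in FT_LORA wins
        return ft_lora_block["mix_datasets"]
    if dataset_block.get("mix_datasets"):      # 2. then mix_datasets in DATASET
        return dataset_block["mix_datasets"]
    return dataset_block.get("train")          # 3. fall back to DATASET.train
```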

**Example:**
```okt
DATASET {
    train: "dataset/main.jsonl"  # Default dataset
}

FT_LORA {
    mix_datasets: [
        { path: "dataset/a.jsonl", weight: 70 },
        { path: "dataset/b.jsonl", weight: 30 }
    ]
    # This mix_datasets overrides DATASET.train
}
```

**Why this design?**
- Allows flexibility without breaking the main structure
- Enables dataset-specific configurations per training method
- Maintains backward compatibility with v1.0

**Best practice:** Use `DATASET.train` for the default, and `mix_datasets` when you need weighted mixing.

---

## 7. Does OktoScript replace Python?

**Answer:**

**No.** OktoScript does **not** replace Python. Instead, it replaces the **complex configuration boilerplate** typically written in Python.

**The relationship:**
- **Python** = Coding and programming (general-purpose language)
- **OktoScript** = Configuration of AI pipelines (domain-specific language)

**Think of it this way:**
```
Python (Engine) ← OktoScript (Configuration Layer) ← User
```

OktoScript sits **above** Python as a declarative layer, while Python powers the OktoEngine underneath.

**What OktoScript replaces:**
- ❌ Hundreds of lines of Python configuration code
- ❌ Complex YAML files with unclear structure
- ❌ Repetitive training scripts

**What Python still does:**
- ✅ Powers the OktoEngine
- ✅ Executes the training loop
- ✅ Handles low-level operations
- ✅ Provides hooks for custom logic

**Analogy:** OktoScript is to Python what Docker Compose is to Docker: a declarative configuration layer that simplifies complex operations.

---

## 8. Can I use multiple datasets with different weights?

**Answer:**

**Yes!** This is one of the key features of OktoScript v1.1.

**Syntax:**
```okt
DATASET {
    mix_datasets: [
        { path: "dataset/general.jsonl", weight: 60 },
        { path: "dataset/technical.jsonl", weight: 30 },
        { path: "dataset/creative.jsonl", weight: 10 }
    ]
    sampling: "weighted"
    shuffle: true
}
```

**Benefits:**
- ✅ **Balanced training** - Control dataset proportions
- ✅ **Domain blending** - Combine different data sources
- ✅ **Bias reduction** - Weight underrepresented data
- ✅ **Dataset prioritization** - Emphasize important data

**Rules:**
- Total weights must equal **exactly 100**
- `sampling: "weighted"` uses weights for sampling
- `sampling: "random"` ignores weights (uniform sampling)
- `shuffle: true` shuffles datasets before mixing
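
To make the semantics of `sampling: "weighted"` concrete, here is a minimal Python sketch of weighted sampling over JSONL files. It only illustrates the idea and is not how OktoEngine is implemented:

```python
import json
import random

def weighted_examples(mix, steps, seed=0):
    """Yield `steps` examples, drawing from each dataset in proportion to its weight."""
    rng = random.Random(seed)
    datasets, weights = [], []
    for entry in mix:
        with open(entry["path"], encoding="utf-8") as f:
            datasets.append([json.loads(line) for line in f])
        weights.append(entry["weight"])
    assert sum(weights) == 100, "dataset mixing weights must total exactly 100"
    for _ in range(steps):
        source = rng.choices(datasets, weights=weights, k=1)[0]
        yield rng.choice(source)

# Example: a 70/30 mix of two files
# for ex in weighted_examples([{"path": "dataset/a.jsonl", "weight": 70},
#                              {"path": "dataset/b.jsonl", "weight": 30}], steps=1000): ...
```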

**Use case example:**
```okt
# Mix general conversations (60%) with technical Q&A (30%) and creative writing (10%)
mix_datasets: [
    { path: "dataset/conversations.jsonl", weight: 60 },
    { path: "dataset/technical_qa.jsonl", weight: 30 },
    { path: "dataset/creative.jsonl", weight: 10 }
]
```

---

## 9. What is the difference between EXPORT: safetensors and EXPORT: okm?

**Answer:**

| Format | Purpose | Use Case |
|--------|---------|----------|
| `safetensors` | Standard PyTorch weights format | LoRA adapters, model weights, HuggingFace compatibility |
| `okm` | OktoSeek optimized package | OktoSeek IDE, Flutter SDK, mobile apps, exclusive tools |
| `onnx` | Universal inference format | Production deployment, cross-platform compatibility |
| `gguf` | Local inference format | Ollama, Llama.cpp, local deployment |

**For LoRA fine-tuning:**
- `safetensors` → Saves only the LoRA adapter patch (small file, ~10-100MB)
- `okm` → Saves a full OktoSeek model package (includes adapter + metadata)

**Example:**
```okt
FT_LORA {
    lora_rank: 8
}

EXPORT {
    format: ["safetensors", "okm"]
    path: "./export/"
}
```

**Output:**
- `./export/adapter.safetensors` - LoRA adapter (for HuggingFace/PyTorch)
- `./export/model.okm` - OktoSeek package (for OktoSeek ecosystem)
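
If you want to sanity-check the exported adapter outside the OktoSeek ecosystem, the standard `safetensors` Python package can open it. A small illustrative snippet (the file name mirrors the output listed above):

```python
from safetensors.torch import load_file

# Load the exported LoRA adapter and list its tensors
state = load_file("./export/adapter.safetensors")
for name, tensor in state.items():
    print(name, tuple(tensor.shape))
```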

**Why both?**
- `safetensors` for compatibility with standard ML tools
- `okm` for optimized OktoSeek ecosystem integration

---

## 10. Is OktoScript a programming language or a DSL?

**Answer:**

**OktoScript is a Domain-Specific Language (DSL).**

**What it is NOT:**
- ❌ A general-purpose programming language
- ❌ A scripting language with loops and variables
- ❌ A replacement for Python or JavaScript

**What it IS:**
- ✅ A declarative configuration language
- ✅ Purpose-built for AI pipelines
- ✅ Domain-specific (focused on AI training/deployment)

**Key characteristics:**
- **Declarative** - You describe **what** you want, not **how** to do it
- **No control flow** - No loops, conditionals, or functions
- **Block-based** - Configuration organized in semantic blocks
- **Type-safe** - Validated against grammar specification

**Why call it a DSL?**
- ✅ Technically accurate
- ✅ Increases professional credibility
- ✅ Sets correct expectations
- ✅ Distinguishes from general-purpose languages

**Analogy:** OktoScript is to AI pipelines what SQL is to databases: a specialized language for a specific domain.

---

## 11. What happens internally when I write FT_LORA?



**Answer:**



When you use `FT_LORA`, the OktoEngine performs these steps:

**1. Model Loading:**
- Loads the base model specified in `base_model`
- Initializes model architecture

**2. LoRA Adapter Injection:**
- Freezes the main model layers
- Adds LoRA adapters to selected modules (e.g., `q_proj`, `v_proj`)
- Adapters are small low-rank matrices whose size is set by `lora_rank` and whose update is scaled by `lora_alpha`

**3. Training:**
- Trains **only** the LoRA adapter weights
- Main model weights remain frozen
- Uses optimizer and settings from `TRAIN` block

**4. Export:**
- Saves adapter weights via `EXPORT` block
- Optionally merges adapter into base model (if specified)
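
Conceptually, the adapter injection in steps 2-3 looks like the simplified PyTorch sketch below. This is a generic illustration of the LoRA technique, not OktoEngine source code:

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen Linear layer plus a small trainable low-rank update (generic LoRA idea)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the original weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)               # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen output + scaled low-rank correction (only lora_a/lora_b are trained)
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```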

**Benefits:**
- ✅ **Reduced GPU usage** - Up to 90% less VRAM
- ✅ **Faster training** - Only small adapters are updated
- ✅ **Smaller files** - Adapter weights are tiny (~10-100MB)
- ✅ **Specialization** - Multiple adapters for different tasks
- ✅ **Flexibility** - Combine adapters at inference time

**Example flow:**
```
Base Model (7B params, frozen)
    ↓
+ LoRA Adapters (rank 8, alpha 32 - only a small fraction of the base parameters)
    ↓
Training (only adapters updated)
    ↓
Export adapter.safetensors (~50MB)
```

---

## 12. Why is explicit declaration required instead of auto-inference?

**Answer:**

**Because transparency is better than hidden assumptions**, especially in AI pipelines.

**Problems with auto-inference:**
- ❌ Hidden assumptions can lead to silent mistakes
- ❌ Difficult to debug when things go wrong
- ❌ Unclear what the system is actually doing
- ❌ Harder to audit and review

**Benefits of explicit declaration:**
- ✅ **Self-documenting** - Scripts explain themselves
- ✅ **Auditable** - Easy to review and validate
- ✅ **Beginner-friendly** - Clear what's happening
- ✅ **Safe** - No hidden behavior or assumptions

**Example of explicit vs implicit:**
```okt
# Explicit (OktoScript style)
MODEL {
    base: "oktoseek/base-llm-7b"
}

FT_LORA {
    base_model: "oktoseek/base-llm-7b"  # Explicit, even if redundant
}

# Implicit (what we avoid)
FT_LORA {
    # base_model inferred from MODEL block - NOT in OktoScript
}
```

**Philosophy:** In AI, explicit is safer than implicit. A few extra lines of configuration prevent costly mistakes.

---

## 13. Can I run LoRA without EXPORT?

**Answer:**

**Technically yes, but it's not recommended.**

**What happens without EXPORT:**
- ✅ Training completes successfully
- ✅ Adapter weights are trained
- ❌ Adapter weights are **not saved**
- ❌ The trained adapter is lost once the process ends

**Best practice:**
```okt
FT_LORA {
    lora_rank: 8
    lora_alpha: 32
}

EXPORT {
    format: ["safetensors", "okm"]
    path: "./export/"
}
```

**Why always include EXPORT:**
- ✅ Preserves your work
- ✅ Enables model reuse
- ✅ Allows deployment
- ✅ Supports version control

**Exception:** If you're only testing or debugging, you might skip EXPORT temporarily, but always add it before production training.

---

## 14. What if I want to merge a LoRA adapter into the final model later?

**Answer:**

**Current support (v1.1):**

You can merge LoRA adapters using OktoEngine's internal tools or Python hooks:

**Option 1: Using Hooks (Current)**
```okt
HOOKS {
    after_train: "scripts/merge_lora.py"
}
```
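
For reference, a `merge_lora.py` hook could be as small as the sketch below. It assumes a Hugging Face `transformers` + `peft` setup and uses illustrative paths; the actual OktoEngine merge path may differ:

```python
# scripts/merge_lora.py (hypothetical sketch, assuming transformers + peft are installed)
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("./models/base-model")   # illustrative path
adapter = PeftModel.from_pretrained(base, "./export/")               # directory holding the LoRA adapter
merged = adapter.merge_and_unload()                                   # fold adapter weights into the base
merged.save_pretrained("./export/merged-model")
```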

**Option 2: Manual merge with OktoEngine CLI**
```bash
okto_merge --adapter ./export/adapter.safetensors \
           --base ./models/base-model \
           --output ./export/merged-model
```

**Future support (v2.0+):**

A dedicated `MERGE` block is planned:

```okt
MERGE {
    source: "export/adapter.safetensors"
    target: "models/base-model"
    output: "export/merged-model"
    format: ["okm", "onnx"]
}
```

**Why merge?**
- ✅ Single model file (no separate adapter needed)
- ✅ Faster inference (no adapter loading)
- ✅ Easier deployment (one file instead of two)
- ✅ Better compatibility (works with standard tools)

**When to merge:**
- After training is complete
- Before deployment
- When you want a standalone model

---

## 15. Why choose OktoScript over YAML or Python scripts?

**Answer:**

**OktoScript is purpose-built for AI pipelines**, while YAML and Python are generic tools.

**Comparison:**

| Feature | OktoScript | YAML | Python |
|---------|------------|------|--------|
| **Purpose** | AI pipelines | Generic config | General programming |
| **Readability** | ✅ Block-based, semantic | ⚠️ Flat, no structure | ❌ Code complexity |
| **Validation** | ✅ Grammar-enforced | ⚠️ Manual validation | ❌ Runtime errors |
| **Type Safety** | ✅ Built-in | ❌ No types | ⚠️ Runtime checking |
| **AI-Specific** | ✅ LoRA, RAG, monitoring | ❌ Generic | ⚠️ Requires libraries |
| **Learning Curve** | ✅ Simple blocks | ⚠️ Syntax learning | ❌ Programming required |
| **IDE Support** | ✅ OktoSeek IDE | ⚠️ Generic editors | ✅ IDEs available |

**Key advantages of OktoScript:**

1. **Purpose-built for AI**
   - Native support for LoRA, RAG, monitoring
   - AI-specific blocks and concepts
   - Optimized for ML workflows

2. **Human-oriented**
   - Readable by non-programmers
   - Self-documenting structure
   - Clear semantic blocks

3. **Less error-prone**
   - Grammar validation
   - Type checking
   - Constraint enforcement

4. **Integrated ecosystem**
   - OktoSeek IDE support
   - OktoEngine integration
   - Flutter SDK compatibility

5. **Single config file**
   - Everything in one `.okt` file
   - No scattered configuration
   - Version control friendly

**Example comparison:**

**YAML (generic):**
```yaml
model:
  base: "oktoseek/base"
train:
  epochs: 5
  batch_size: 32
# No validation, no structure, unclear relationships
```

**Python (complex):**
```python
from transformers import Trainer, TrainingArguments
# 100+ lines of code
# Complex error handling
# Hard to read and maintain
```

**OktoScript (focused):**
```okt
MODEL {
    base: "oktoseek/base"
}

TRAIN {
    epochs: 5
    batch_size: 32
}
# Clear, validated, self-documenting
```

**Bottom line:** OktoScript is to AI pipelines what Docker Compose is to containers: a declarative DSL that simplifies complex operations.

---

## 16. How does OktoScript handle model versioning and checkpoints?

**Answer:**

OktoScript uses the `runs/` directory structure for automatic versioning and checkpoint management.

**Structure:**
```
runs/
  └── my-model/
      ├── checkpoint-100/
      │   └── model.safetensors
      ├── checkpoint-200/
      │   └── model.safetensors
      ├── tokenizer.json
      ├── training_logs.json
      └── metrics.json
```

**Checkpoint configuration:**
```okt
TRAIN {
    epochs: 10
    checkpoint_steps: 100  # Save every 100 steps
    checkpoint_path: "./checkpoints"
}
```

**Resume from checkpoint:**
```okt
TRAIN {
    resume_from_checkpoint: "./checkpoints/checkpoint-500"
    epochs: 10
}
```

**Benefits:**
- ✅ Automatic versioning by run name
- ✅ Step-based checkpointing
- ✅ Easy resume from any checkpoint
- ✅ Training logs and metrics per run

**Best practice:** Use descriptive project names in `PROJECT` block to organize runs.

---

## 17. Can I use custom Python code with OktoScript?

**Answer:**

**Yes!** OktoScript supports custom Python code through the `HOOKS` block.

**Available hooks:**
```okt
HOOKS {
    before_train: "scripts/preprocess.py"
    after_train: "scripts/postprocess.py"
    before_epoch: "scripts/custom_early_stop.py"
    after_epoch: "scripts/log_custom_metrics.py"
    on_checkpoint: "scripts/backup_checkpoint.sh"
    custom_metric: "scripts/toxicity_calculator.py"
}
```

**Hook script interface:**
```python
# scripts/preprocess.py
def before_train(config, dataset, model):
    # Custom preprocessing
    # Modify config if needed
    return config


# scripts/after_epoch.py
def after_epoch(epoch, metrics, model_state):
    # Custom logging, early stopping logic
    # Return True to stop training
    return False
```

**Use cases:**
- Custom data preprocessing
- Custom metrics calculation
- Custom early stopping logic
- External API integration
- Custom logging

**Key point:** OktoScript handles the configuration, Python handles the custom logic. Best of both worlds.

---

## 18. What happens if I specify conflicting configurations?

**Answer:**

OktoScript has **clear priority rules** to handle conflicts:

**Priority order (highest to lowest):**
1. Block-specific overrides (e.g., `mix_datasets` in `FT_LORA`)
2. Block-level settings (e.g., `FT_LORA` over `TRAIN` for LoRA)
3. Global settings (e.g., `DATASET.train`)

**Example conflicts and resolution:**

**Conflict 1: Dataset specification**
```okt
DATASET {
    train: "dataset/a.jsonl"  # Lower priority
}

FT_LORA {
    mix_datasets: [...]  # Higher priority - overrides DATASET.train
}
```
**Resolution:** `mix_datasets` is used, `DATASET.train` is ignored.

**Conflict 2: TRAIN vs FT_LORA**

```okt
TRAIN {
    epochs: 10
}

FT_LORA {
    epochs: 5  # This is used for LoRA training
}
```

**Resolution:** `FT_LORA.epochs` is used, but `TRAIN` optimizer/device settings still apply.

**Validation:**
- OktoEngine validates configurations before training
- Conflicts are reported with clear error messages
- Use `okto validate` to check before training

---

## 19. How do I debug an OktoScript file?

**Answer:**

**Step 1: Validate syntax**
```bash
okto validate train.okt
```

**Step 2: Check logs**
```okt
LOGGING {
    save_logs: true
    log_level: "debug"  # Enable debug logging
    log_every: 1
}
```

**Step 3: Use MONITOR for system diagnostics**
```okt
MONITOR {
    level: "full"
    log_system: ["gpu_memory_used", "cpu_usage", "temperature"]
    dashboard: true  # Real-time visualization
}
```

**Step 4: Check validation errors**
Common errors and solutions:
- `Dataset file not found` → Check file paths
- `Invalid optimizer` → Use allowed values (adam, adamw, sgd, etc.)
- `Model base not found` → Verify model path or HuggingFace name
- `Dataset mixing weights invalid` → Total must equal 100

**Step 5: Use system diagnostics**
```bash
okto_doctor  # Shows GPU, CUDA, RAM, drivers
```

**Best practices:**
- Always validate before training
- Start with `log_level: "debug"`
- Use `MONITOR` dashboard for real-time insights
- Check `runs/*/training_logs.json` for detailed logs

---

## 20. Is OktoScript production-ready?

**Answer:**

**Yes, OktoScript v1.1 is production-ready** for AI training and deployment pipelines.

**Production features:**
- ✅ **Stable grammar** - Well-defined and validated
- ✅ **Error handling** - Comprehensive validation
- ✅ **Monitoring** - System and training telemetry
- ✅ **Export formats** - Production-ready formats (ONNX, GGUF, OKM)
- ✅ **Deployment** - API, mobile, edge targets
- ✅ **Security** - Model encryption and watermarking
- ✅ **Logging** - Comprehensive logging and metrics

**Production checklist:**
```okt
PROJECT "ProductionModel"
VERSION "1.0"

# ... configuration ...

SECURITY {
    encrypt_model: true
    watermark: true
}

MONITOR {
    level: "full"
    dashboard: true
}

EXPORT {
    format: ["onnx", "okm"]  # Production formats
    optimize_for: "speed"
}

DEPLOY {
    target: "api"
    requires_auth: true
    max_concurrent_requests: 100
}
```

**Used by:**
- OktoSeek IDE (production)
- Research institutions
- AI development teams
- Educational platforms

**Version stability:**
- v1.0: Stable, production-ready
- v1.1: Backward compatible, adds LoRA and monitoring

---

## Need More Help?

- 📖 [Complete Grammar Specification](./grammar.md)
- 🚀 [Getting Started Guide](./GETTING_STARTED.md)
- ✅ [Validation Rules](../VALIDATION_RULES.md)
- 💡 [Examples](../examples/)
- 🐛 [Troubleshooting](./grammar.md#troubleshooting)

**Still have questions?** Open an issue on [GitHub](https://github.com/oktoseek/oktoscript/issues) or contact **service@oktoseek.com**.

---

**OktoScript** is developed and maintained by **OktoSeek AI**.