JARVIS-Mistral-Phase1a: Macedonian Language Foundation

Model ID: Miki-T/JARVIS-Mistral-Phase1a

A QLoRA fine-tuned Mistral 7B model trained on 500k rows of Macedonian web text to build language fluency as the foundation for JARVIS — a locally-hosted AI assistant inspired by Iron Man's JARVIS.


Model Details

Model Description

  • Developed by: Miki Trajkovski
  • Model type: Causal Language Model (fine-tuned via QLoRA)
  • Base model: mistralai/Mistral-7B-v0.1
  • Language(s): Macedonian (mk), with English support
  • License: MIT
  • Finetuned from model: Mistral 7B v0.1
  • Adapter type: LoRA (Low-Rank Adaptation)

Model Architecture

  • Base: Mistral 7B (7 billion parameters)
  • Fine-tuning method: QLoRA (4-bit quantization + LoRA adapters)
  • LoRA rank: 16
  • LoRA alpha: 32
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Max sequence length: 1024 tokens
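
To give a sense of scale for the adapter settings above, the following sketch estimates the number of trainable LoRA parameters. The layer dimensions (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers) are assumptions taken from the public Mistral-7B-v0.1 configuration, not from this card. Each adapted linear layer with `in` inputs and `out` outputs adds `r × (in + out)` parameters.

```python
# Rough count of trainable LoRA parameters for rank 16 on all seven projections.
# Dimensions are assumed from the public Mistral-7B-v0.1 config, not from this card.
HIDDEN = 4096     # model width
KV_WIDTH = 1024   # 8 key/value heads x 128 dims (grouped-query attention)
MLP = 14336       # feed-forward inner width
LAYERS = 32
RANK = 16

# (input features, output features) of each targeted linear layer
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_WIDTH),
    "v_proj": (HIDDEN, KV_WIDTH),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int = RANK) -> int:
    # LoRA adds two matrices: A (rank x in) and B (out x rank)
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o) for i, o in TARGETS.values())
total = per_layer * LAYERS

print(f"per layer: {per_layer:,}")
print(f"total: {total:,}")
print(f"size at 4 bytes/param: {total * 4 / 1e6:.0f} MB")
```

Under these assumptions the adapter has roughly 42 million trainable parameters, or about 168 MB when stored as 32-bit floats, which is in the same range as the ~200 MB adapter size reported in this card.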

Model Sources

  • Repository: https://github.com/MikiTrajkovski/JARVIS

Uses

Direct Use

This model is designed for:

  • Macedonian text generation — generates fluent Macedonian sentences
  • Language understanding — comprehends Macedonian grammar and semantics
  • Foundation for downstream tasks — serves as Phase 1a of the JARVIS training pipeline

Example usage:

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the base model with the LoRA adapter applied
model = AutoPeftModelForCausalLM.from_pretrained(
    "Miki-T/JARVIS-Mistral-Phase1a",
    device_map="auto",
    torch_dtype="auto",
)

# Merge the adapter weights into the base model for inference
model = model.merge_and_unload()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Miki-T/JARVIS-Mistral-Phase1a")

# Generate a continuation (the prompt means "Macedonia is a country known for")
prompt = "Македонија е земја позната по"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # keep inputs on the model's device
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Downstream Use (Phase 1b, 1c)

This model is Phase 1a of a multi-phase training pipeline:

  • Phase 1a (current): Macedonian language foundation
  • Phase 1b (next): Instruction following
  • Phase 1c (planned): Reasoning and problem-solving
  • Phase 2 (planned): Macedonian law domain expertise (RAG)

Each phase builds on the previous one. Do NOT train Phase 1b on a fresh base model.

Out-of-Scope Use

  • Not for production: This is a research/learning model
  • Not instruction-tuned: Phase 1a only teaches language fluency, not instruction following
  • Not domain-specific: Use Phase 2 for legal/specialized Macedonian tasks
  • Not multilingual: Optimized for Macedonian; English support varies

Limitations and Bias

Known Limitations

  1. Language fluency only: Phase 1a does not teach the model to follow instructions
    • Input: "Дај ми преводот" ("Give me the translation")
    • Output: likely a continuation in Macedonian rather than a translation
    • Phase 1b is intended to address this
  2. Training data bias: trained on Macedonian web text (Wikipedia, news, etc.)
    • May reflect biases present in those sources
    • Limited exposure to specialized domains (legal, medical, technical)
  3. Context window: 1024 tokens maximum, so very long Macedonian texts cannot be processed in one pass
  4. No fine-grained reasoning: reasoning capability is planned for Phase 1c; Phase 1a lacks it

Recommendations

  • Use this model only as a foundation for downstream phases
  • For production Macedonian tasks, wait for Phase 1b (instruction following) and Phase 1c (reasoning)
  • Fine-tune on domain-specific data if targeting legal, medical, or technical Macedonian
  • Always validate outputs for accuracy and bias

Training Details

Training Data

| Dataset | Rows | Source | Purpose |
| --- | --- | --- | --- |
| LVSTCK/macedonian-corpus-cleaned-dedup | 500,000 | HuggingFace | Macedonian language foundation |

  • Data format: Plain text (one document per line in JSONL)
  • Quality: Cleaned and deduplicated (higher quality than the raw corpus)
  • Language: 100% Macedonian (Cyrillic script)
  • Size: ~500k rows, ~2.5GB uncompressed

Training Procedure

Preprocessing

  • Tokenized with Mistral tokenizer
  • Max sequence length: 1024 tokens
  • Packing enabled (multiple short texts combined into context window)
  • No removal of special tokens or data cleaning beyond source dataset
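
The packing step can be pictured with a small sketch. This is a simplified illustration of the general idea (concatenate tokenized documents, separate them with an end-of-sequence token, cut into fixed-length blocks), not TRL's actual implementation. The function name `pack_sequences`, the `eos_id` value, and the toy token ids are hypothetical.

```python
from typing import Iterable, List


def pack_sequences(
    docs: Iterable[List[int]], eos_id: int, max_len: int = 1024
) -> List[List[int]]:
    """Concatenate tokenized documents, separated by EOS, into fixed-length blocks."""
    buffer: List[int] = []
    blocks: List[List[int]] = []
    for ids in docs:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the document boundary
        # Emit full blocks as soon as enough tokens have accumulated
        while len(buffer) >= max_len:
            blocks.append(buffer[:max_len])
            buffer = buffer[max_len:]
    # Leftover tokens shorter than max_len are dropped in this sketch
    return blocks


# Toy example with made-up token ids and a tiny window of 8 tokens
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11]]
blocks = pack_sequences(docs, eos_id=0, max_len=8)
print(blocks)  # [[1, 2, 3, 0, 4, 5, 0, 6]]
```

With packing, short documents share a context window instead of each being padded to 1024 tokens, so fewer positions in each batch are wasted on padding.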

Hyperparameters

| Parameter | Value | Reasoning |
| --- | --- | --- |
| Learning rate | 2e-4 | Standard QLoRA starting point |
| Warmup ratio | 5% | Prevent large initial updates |
| Learning rate scheduler | Cosine decay | Smooth decay to ~0 by end |
| Batch size | 2 | Fits in 12GB VRAM with QLoRA |
| Gradient accumulation | 8 | Effective batch = 16 |
| Epochs | 1 | Single pass through data (avoid overfitting) |
| Optimizer | AdamW 8-bit | Memory efficient |
| Gradient checkpointing | Enabled | Save VRAM at cost of speed |
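
The schedule implied by these settings can be sketched in plain Python. This is an illustration of linear warmup followed by cosine decay, not the trainer's exact code. The total of 9,502 steps is the figure reported elsewhere in this card, and rounding the warmup length up with `math.ceil` is an assumption.

```python
import math

BASE_LR = 2e-4
TOTAL_STEPS = 9502                            # optimizer steps reported in this card
WARMUP_STEPS = math.ceil(0.05 * TOTAL_STEPS)  # 5% warmup; rounding is an assumption
EFFECTIVE_BATCH = 2 * 8                       # per-device batch x gradient accumulation


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch:", EFFECTIVE_BATCH)
print("warmup steps:", WARMUP_STEPS)
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

The rate climbs from 0 to 2e-4 over the first 476 steps, then falls along a half cosine and reaches approximately 0 at the final step.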

Training Regime

  • Hardware: NVIDIA RTX 5070 (12GB VRAM)
  • Framework: PyTorch 2.7.0+cu128 with Hugging Face Transformers
  • Fine-tuning framework: TRL SFTTrainer + PEFT LoRA
  • Precision: 4-bit quantization (NF4) + bfloat16 math
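
For orientation, the quantization and adapter settings listed in this card map onto configuration objects roughly as follows. This is a sketch assuming `torch`, `transformers`, `peft`, and `bitsandbytes` are installed; it is not the author's training script. Settings the card does not state (LoRA dropout, double quantization) are left at library defaults, and `task_type="CAUSAL_LM"` is an assumption based on the model type.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 weights with bfloat16 compute, as listed under "Precision"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA rank, alpha, and target modules as listed under "Model Architecture"
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",  # assumption: not stated explicitly in the card
)
```

In a typical QLoRA setup, `bnb_config` is passed as `quantization_config` when loading the base model and `lora_config` is handed to the trainer or to `get_peft_model`.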

Speeds, Sizes, Times

| Metric | Value |
| --- | --- |
| Training duration | 5 days, 23 hours, 29 minutes |
| Total steps | 9,502 |
| Throughput | ~12-15 tokens/second |
| Adapter size | ~200 MB |
| Total VRAM used | ~8.5 GB / 12 GB |
| Total tokens processed | 7.6M tokens |

Note: Throughput was reduced by gradient checkpointing, which recomputes activations during the backward pass to save VRAM. Phase 1b plans to disable it, with a target speedup of up to 10x.


Evaluation

Testing Data

Evaluation was limited to:

  • Manual test: 3 Macedonian prompts, checked by hand for fluent generation
  • Benchmark: LVSTCK/macedonian-llm-eval (83 questions) was planned but not run, because the dataset was unavailable due to HuggingFace deprecation

Metrics

| Metric | Value | Interpretation |
| --- | --- | --- |
| Final loss | 1.2543 | Excellent convergence |
| Starting loss | 2.0910 | Loss fell by about 40% over training |
| Final perplexity | 3.51 | Roughly as uncertain as choosing among ~3.5 equally likely tokens |
| Best loss achieved | 1.2460 | Fully converged |
| Gradient norm (avg) | 0.583 | Stable training (healthy range: 0.1-2.0) |
| Gradient norm (max) | 1.258 | No exploding gradients |
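
The perplexity and the 40% figure follow directly from the two loss values. The short check below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training.

```python
import math

start_loss = 2.0910
final_loss = 1.2543

# Perplexity is the exponential of the mean cross-entropy loss (in nats)
perplexity = math.exp(final_loss)

# Relative reduction in loss from the first to the last logged value
relative_drop = (start_loss - final_loss) / start_loss

print(f"perplexity: {perplexity:.2f}")         # perplexity: 3.51
print(f"loss reduction: {relative_drop:.0%}")  # loss reduction: 40%
```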

Sample Outputs

Test prompt: "Скопје е главен град на" ("Skopje is the capital of")
Model output: "Република Македонија и има околу 600.000 жители." ("the Republic of Macedonia and has about 600,000 inhabitants.")
Interpretation: ✅ Fluent Macedonian text, maintains context, grammatically correct


Model Card Details

Environmental Impact

| Factor | Value |
| --- | --- |
| Hardware | NVIDIA RTX 5070 (12GB VRAM) |
| Training duration | 5 days, 23 hours |
| Power consumption (estimated) | ~150W continuous × 143.5 hours ≈ 21.5 kWh |
| Carbon emitted (estimated) | ~10-15 kg CO2e (depends on grid carbon intensity) |
| Cloud provider | None (local desktop GPU) |
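
The energy estimate can be reproduced with a few lines of arithmetic. The sketch uses the exact duration reported in this card and the ~150 W average draw. The grid intensity printed at the end is only what the 10-15 kg range implies, not a measured value.

```python
# Reproduce the energy estimate from the reported duration and assumed power draw
hours = 5 * 24 + 23 + 29 / 60   # 5 days, 23 hours, 29 minutes
power_kw = 0.150                # ~150 W average draw (estimate from this card)
energy_kwh = power_kw * hours

print(f"duration: {hours:.1f} h")
print(f"energy: {energy_kwh:.1f} kWh")

# Grid carbon intensity implied by the 10-15 kg CO2e range
for co2_kg in (10, 15):
    print(f"{co2_kg} kg CO2e -> {co2_kg / energy_kwh:.2f} kg CO2e per kWh")
```

The range quoted above corresponds to a grid intensity of roughly 0.46 to 0.70 kg CO2e per kWh, so the real figure depends on where and when the training ran.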

Compute Infrastructure

  • CPU: AMD Ryzen 7 7800X3D (8-core)
  • GPU: NVIDIA RTX 5070 (12GB GDDR7 VRAM)
  • RAM: 32GB DDR5
  • Storage: NVMe SSD (assumed)
  • OS: Windows 11
  • CUDA: CUDA 12.x

Software

  • PyTorch: 2.7.0+cu128
  • Transformers: 4.40.0
  • PEFT: 0.10.0
  • TRL: 0.8.6
  • Accelerate: 0.29.0
  • Bitsandbytes: 0.43.0
  • CTranslate2: (for Whisper STT, not used in this model)

See full requirements.txt in the JARVIS repository.


How to Use

Load the Model

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch

# Load with adapter (no merge)
model = AutoPeftModelForCausalLM.from_pretrained(
    "Miki-T/JARVIS-Mistral-Phase1a",
    device_map="auto",
    torch_dtype=torch.float16,
)

# Optional: merge the adapter into the base weights for faster inference
model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained("Miki-T/JARVIS-Mistral-Phase1a")

Generate Text

prompt = "Македонија е земја позната по"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
inputs = inputs.to(model.device)  # moves input_ids and attention_mask together

with torch.no_grad():
    output_ids = model.generate(
        **inputs,  # pass the attention mask along with the input ids
        max_new_tokens=50,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)

Fine-tune Further (Phase 1b)

from peft import AutoPeftModelForCausalLM

# Load the base model plus the Phase 1a adapter, keeping the adapter weights trainable
model = AutoPeftModelForCausalLM.from_pretrained(
    "Miki-T/JARVIS-Mistral-Phase1a",
    is_trainable=True,
)

# Use as starting point for Phase 1b training
# See: github.com/MikiTrajkovski/JARVIS/blob/main/tools/training_pipeline/train_phase1b.py

Citation

If you use this model, please cite:

BibTeX:

@misc{trajkovski2024jarvis,
  author = {Trajkovski, Miki},
  title = {JARVIS: Macedonian Language Foundation (Phase 1a)},
  year = {2024},
  publisher = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/Miki-T/JARVIS-Mistral-Phase1a}},
}

APA:

Trajkovski, M. (2024). JARVIS: Macedonian language foundation (Phase 1a) [Model]. Hugging Face Hub. https://huggingface.co/Miki-T/JARVIS-Mistral-Phase1a

Acknowledgments

  • Base model: Mistral AI (Mistral 7B v0.1)
  • Fine-tuning: Hugging Face TRL + PEFT libraries
  • Data: LVSTCK Macedonian corpus
  • Inspiration: Tony Stark's JARVIS from Marvel

License

This model is provided under the MIT License, same as the JARVIS project.


Model Card Contact

Author: Miki Trajkovski
GitHub: https://github.com/MikiTrajkovski/JARVIS
HuggingFace: https://huggingface.co/Miki-T


Framework Versions

  • PEFT: 0.10.0
  • Transformers: 4.40.0
  • PyTorch: 2.7.0+cu128
  • CUDA: 12.x