# Extending MK-LLM
This guide shows how to plug in different base models, datasets, and adapters.
## Swap base model

- Set `MODEL_PATH` in `.env` to a local dir or HF repo id.
- If using a HF repo, set `TRUST_REMOTE_CODE=true` when custom code is required.
- Low-VRAM: set `LOAD_IN_4BIT=true` (or `LOAD_IN_8BIT=true`).
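Putting the flags above together, a low-VRAM `.env` might look like this (the model id is only an illustrative placeholder):

```ini
# Hypothetical example values — substitute your own model
MODEL_PATH=mistralai/Mistral-7B-v0.1
TRUST_REMOTE_CODE=true
LOAD_IN_4BIT=true
```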
## Add datasets

- Place cleaned text into `data/cleaned/*.txt`, or generate `data/cleaned/mk_combined_data.txt` via `python -m data.process_all_data`.
- The trainer uses `examples/data_loader.load_mk_dataset()`, which prefers the combined file.
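The "prefers the combined file" behavior can be sketched as follows. This is an illustration of the loading logic, not the actual body of `examples/data_loader.load_mk_dataset()`, which may differ:

```python
from pathlib import Path

def load_mk_dataset(data_dir: str = "data/cleaned") -> list[str]:
    """Load cleaned Macedonian text, preferring the combined file.

    Sketch only: the real examples/data_loader.load_mk_dataset()
    may return a different structure.
    """
    root = Path(data_dir)
    combined = root / "mk_combined_data.txt"
    if combined.exists():
        files = [combined]                    # prefer the single combined file
    else:
        files = sorted(root.glob("*.txt"))    # fall back to individual files
    return [f.read_text(encoding="utf-8") for f in files]
```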
## Instruction tuning

- Convert text into chat turns and use `tokenizer.apply_chat_template` in the training collator.
- Provide Macedonian system prompts and stop sequences as needed.
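A minimal sketch of the conversion step: wrap each text pair in the message format that `tokenizer.apply_chat_template` consumes. The helper name and the Macedonian system prompt below are illustrative, not part of the codebase:

```python
def to_chat_turns(question: str, answer: str,
                  system_prompt: str = "Ти си корисен асистент.") -> list[dict]:
    """Wrap a Q/A pair in the chat-message format expected by
    tokenizer.apply_chat_template. Hypothetical helper for illustration."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]

# Inside the collator, the turns would then be rendered to tokens, e.g.:
# input_ids = tokenizer.apply_chat_template(turns, return_tensors="pt")
```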
## Custom inference params

- Use `POST /v1/chat/completions` with `temperature`, `top_p`, `max_tokens`, and `stream`.
- Configure defaults via `.env`.
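A sketch of building the request body with those parameters (the function is hypothetical; the field names follow the endpoint described above):

```python
import json

def build_chat_request(prompt: str, temperature: float = 0.7,
                       top_p: float = 0.9, max_tokens: int = 256,
                       stream: bool = False) -> str:
    """Build a JSON body for POST /v1/chat/completions.

    Illustrative helper; default values are assumptions, not the
    server's actual .env defaults.
    """
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "stream": stream,
    }
    return json.dumps(body, ensure_ascii=False)
```

The resulting string can be sent with any HTTP client to the server's chat endpoint; host and port depend on your deployment.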
## Contribute plugins

- Add new data collectors under `data/` and document their flags in the README.
- Add new generation strategies or safety middlewares in `inference/`.
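As one example of the kind of safety middleware `inference/` could host, here is a minimal post-generation blocklist filter. The function and its registration are assumptions for illustration; the project's actual middleware interface may differ:

```python
from typing import Callable

def blocklist_middleware(blocked: set[str]) -> Callable[[str], str]:
    """Return a post-generation filter that redacts blocked terms.

    Hypothetical sketch — how middlewares are registered is
    project-specific and not defined here.
    """
    def filter_text(text: str) -> str:
        for term in blocked:
            text = text.replace(term, "[REDACTED]")
        return text
    return filter_text
```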