---
language:
- mk
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- macedonian
- cyrillic
- mistral
- qlora
- peft
base_model: mistralai/Mistral-7B-v0.1
datasets:
- ainowmk/MK-LLM-Mistral-data
metrics:
- perplexity
pretty_name: MK-LLM (Mistral)
model-index:
- name: MK-LLM-Mistral
results: []
---
# 🇲🇰 MK-LLM: The First Open Macedonian Language Model
## 🌍 About This Project
MK-LLM is Macedonia's first open-source Large Language Model (LLM), developed for the community, by the community. This project is led by AI Now - Association for Artificial Intelligence in Macedonia.
📌 **Website:** [www.ainow.mk](https://www.ainow.mk)
📩 **Contact:** [contact@ainow.mk](mailto:contact@ainow.mk)
🛠 **Model:** [MK-LLM-Mistral](https://huggingface.co/ainowmk/MK-LLM-Mistral)
💻 **GitHub:** [MK-LLM](https://github.com/AI-now-mk/MK-LLM)
## 🆕 Latest Updates (14.10.2025)
- OpenAI-compatible endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/models` with JSON SSE streaming
- QLoRA training pipeline (4-bit) with LoRA adapters and gradient checkpointing
- Upgraded Macedonian data pipeline: cleaner extraction (trafilatura), gcld3 language filter, MinHash dedup
- Gradio demo UI and improved FastAPI server (env-based config, lazy model load, quantization toggles)
- Repository hygiene: LICENSE, model/dataset cards, Makefile, package inits, `.gitkeep` for data/models
## 📂 Repository Structure
```plaintext
MK-LLM/
├── data/
│   ├── wikipedia/
│   │   ├── download_wiki.py
│   │   └── parse_wiki.py
│   ├── cleaned/
│   ├── processed/
│   ├── raw/
│   ├── tokenized/
│   ├── eval/
│   │   └── mk_eval.jsonl
│   ├── process_all_data.py
│   └── clean_wikipedia.py
├── examples/
│   ├── client_python.py
│   ├── client_js.mjs
│   ├── data_loader.py
│   └── train_mistral_mk.py
├── inference/
│   ├── api.py
│   ├── gradio_app.py
│   └── chatbot.py
├── training/
│   ├── train_pipeline.py
│   └── fine_tune_mistral.py
├── scripts/
│   ├── preprocess_data.py
│   └── evaluate.py
├── configs/
│   ├── train_small.yaml
│   └── train_full.yaml
├── tests/
│   ├── test_api.py
│   ├── test_model.py
│   └── test_dataset.py
├── docs/
│   ├── EXTENDING.md
│   └── GITHUB_ISSUES.md
├── .github/
│   ├── workflows/ci.yml
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.yml
│   │   └── feature_request.yml
│   └── PULL_REQUEST_TEMPLATE.md
├── models/
├── notebooks/
│   └── evaluation.ipynb
├── Dockerfile
├── docker-compose.yml
├── Makefile
├── requirements.txt
├── constraints.txt
├── LICENSE
├── MODEL_CARD.md
├── DATASET_CARD.md
├── CODE_OF_CONDUCT.md
├── SECURITY.md
└── README.md
```
## Getting Started
1) Clone the repository
```bash
git clone https://github.com/AI-now-mk/MK-LLM.git
cd MK-LLM
```
2) Install dependencies
```bash
pip install -r requirements.txt
```
Optional (recommended): use a virtual environment
```bash
python -m venv .venv
# Windows
.\.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
```
3) Configure environment (optional)
Create a `.env` file in the project root:
```bash
HOST=0.0.0.0
PORT=8000
ALLOW_ORIGINS=*
MODEL_PATH=./models/mistral-finetuned-mk
MODEL_ID=mk-llm
TRUST_REMOTE_CODE=true
LOAD_IN_4BIT=false
LOAD_IN_8BIT=false
TORCH_DTYPE=float16
```
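If you prefer to read these settings programmatically, a minimal sketch with only the standard library might look like the following; `load_server_config` and `as_bool` are hypothetical helper names (the actual loading logic lives in `inference/api.py` and may differ), but the variable names and defaults match the `.env` example above:

```python
import os

def load_server_config() -> dict:
    """Read server settings from the environment, falling back to the defaults above."""
    def as_bool(name: str, default: str = "false") -> bool:
        # Accept common truthy spellings so `.env` values like "true" or "1" both work.
        return os.getenv(name, default).strip().lower() in {"1", "true", "yes"}

    return {
        "host": os.getenv("HOST", "0.0.0.0"),
        "port": int(os.getenv("PORT", "8000")),
        "model_path": os.getenv("MODEL_PATH", "./models/mistral-finetuned-mk"),
        "model_id": os.getenv("MODEL_ID", "mk-llm"),
        "trust_remote_code": as_bool("TRUST_REMOTE_CODE"),
        "load_in_4bit": as_bool("LOAD_IN_4BIT"),
        "load_in_8bit": as_bool("LOAD_IN_8BIT"),
        "torch_dtype": os.getenv("TORCH_DTYPE", "float16"),
    }
```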
4) Quick run: inference API
```bash
# Ensure a model exists at ./models/mistral-finetuned-mk (train or download)
python -m inference.api
# In another terminal, call the API
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Здраво Македонија!", "max_new_tokens":128}'
```
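The same call can be made from Python with only the standard library. A sketch, assuming the request body shown in the `curl` example above; `build_generate_request` and `generate` are hypothetical helper names, and the response schema depends on the server:

```python
import json
import urllib.request

def build_generate_request(prompt: str, max_new_tokens: int = 128) -> dict:
    """Build the JSON body expected by the /generate endpoint shown above."""
    return {"prompt": prompt, "max_new_tokens": max_new_tokens}

def generate(prompt: str, base_url: str = "http://localhost:8000", **kwargs) -> dict:
    """POST the prompt to a running API server and return the parsed JSON reply."""
    body = json.dumps(build_generate_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```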
5) Optional: Gradio demo UI
```bash
python -m inference.gradio_app
# Open http://localhost:7860
```
6) Prepare data (Macedonian)
```bash
# Download and extract Macedonian Wikipedia
python -m data.wikipedia.download_wiki
# Parse Wikipedia dump into clean text
python -m data.wikipedia.parse_wiki
# Collect web data, combine and clean it, and apply the Macedonian language filter
python -m data.process_all_data
```
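The MinHash deduplication mentioned in the updates above can be illustrated with a small self-contained sketch (standard library only; the actual pipeline in `data/process_all_data.py` may use a dedicated library and different parameters). Two texts whose estimated Jaccard similarity exceeds a threshold are treated as near-duplicates:

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Character k-shingles of a whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(text: str, num_hashes: int = 64) -> list:
    """One minimum per salted hash function; a compact fingerprint of the shingle set."""
    sh = shingles(text)
    sig = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        sig.append(min(
            int.from_bytes(hashlib.md5(salt + s.encode("utf-8")).digest()[:8], "big")
            for s in sh
        ))
    return sig

def estimated_jaccard(a: str, b: str) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Near-duplicate documents agree on most signature slots, so comparing short fixed-size signatures avoids comparing full texts pairwise.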
7) Train (example)
```bash
python -m training.train_pipeline
# or
python -m training.fine_tune_mistral
```
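For orientation, here is a minimal configuration sketch of the kind of QLoRA setup the training pipeline uses (a 4-bit quantized base model with trainable LoRA adapters and gradient checkpointing, per the updates above). The hyperparameters and target modules are illustrative assumptions; the authoritative values are in `training/fine_tune_mistral.py`:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization for the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # trade recomputation for memory

# Small trainable LoRA adapters on the attention projections (assumed targets).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

With this setup only the adapter weights receive gradients, which is what makes fine-tuning a 7B model feasible on a single consumer GPU.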
### Docker
Build and run the API with Docker:
```bash
docker build -t mk-llm .
docker run --gpus all -p 8000:8000 -e MODEL_PATH=./models/mistral-finetuned-mk mk-llm
```
Or via docker-compose:
```bash
docker-compose up --build
```
### Continuous Integration
This repository includes a GitHub Actions CI workflow that lints, type-checks, and runs tests on pull requests and commits to `main`.
### Constraints (reproducible installs)
To install with pinned versions:
```bash
pip install -r requirements.txt -c constraints.txt
```
### OpenAI-compatible endpoints
The server exposes OpenAI-style routes, so common OpenAI-compatible clients (including gpt-oss tooling) can connect without modification.
- Chat Completions (streaming supported):
```bash
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mk-llm",
    "messages": [
      {"role": "system", "content": "Ти си помошник кој зборува на македонски."},
      {"role": "user", "content": "Која е историјата на Охрид?"}
    ],
    "stream": false
  }'
```
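When `"stream": true` is set, the server sends SSE `data:` lines, each carrying a JSON chunk, terminated by `data: [DONE]` in the usual OpenAI convention. A small standard-library sketch for reassembling the assistant's text, assuming the chunks follow the OpenAI `delta` format:

```python
import json

def collect_stream(sse_lines) -> str:
    """Accumulate assistant text from OpenAI-style chat-completion SSE lines."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))
    return "".join(parts)
```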
- Text Completions:
```bash
curl http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Здраво Македонија!",
    "max_tokens": 128
  }'
```
Related project: [openai/gpt-oss](https://github.com/openai/gpt-oss) (open-weight models, client compatibility notes).
### Use with gpt-oss-compatible clients
Point any OpenAI-compatible client to this server.
Example (Python OpenAI SDK environment):
```bash
export OPENAI_API_KEY=dummy
export OPENAI_BASE_URL=http://localhost:8000/v1
```
Example Chat Completions (curl):
```bash
curl "$OPENAI_BASE_URL/chat/completions" \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "mk-llm",
    "messages": [
      {"role": "system", "content": "Ти си помошник кој зборува на македонски."},
      {"role": "user", "content": "Која е историјата на Охрид?"}
    ],
    "stream": true
  }'
```