Instructions to use mshojaei77/Gemma-2-2b-fa with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- Transformers
How to use mshojaei77/Gemma-2-2b-fa with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="mshojaei77/Gemma-2-2b-fa")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mshojaei77/Gemma-2-2b-fa")
model = AutoModelForCausalLM.from_pretrained("mshojaei77/Gemma-2-2b-fa")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- llama-cpp-python
How to use mshojaei77/Gemma-2-2b-fa with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="mshojaei77/Gemma-2-2b-fa",
    filename="Gemma_fa_2b_q8_0.gguf",
)
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use mshojaei77/Gemma-2-2b-fa with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf mshojaei77/Gemma-2-2b-fa:Q8_0

# Run inference directly in the terminal:
llama-cli -hf mshojaei77/Gemma-2-2b-fa:Q8_0
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf mshojaei77/Gemma-2-2b-fa:Q8_0

# Run inference directly in the terminal:
llama-cli -hf mshojaei77/Gemma-2-2b-fa:Q8_0
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf mshojaei77/Gemma-2-2b-fa:Q8_0

# Run inference directly in the terminal:
./llama-cli -hf mshojaei77/Gemma-2-2b-fa:Q8_0
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf mshojaei77/Gemma-2-2b-fa:Q8_0

# Run inference directly in the terminal:
./build/bin/llama-cli -hf mshojaei77/Gemma-2-2b-fa:Q8_0
Use Docker
docker model run hf.co/mshojaei77/Gemma-2-2b-fa:Q8_0
- LM Studio
- Jan
- vLLM
How to use mshojaei77/Gemma-2-2b-fa with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "mshojaei77/Gemma-2-2b-fa"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "mshojaei77/Gemma-2-2b-fa",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker
docker model run hf.co/mshojaei77/Gemma-2-2b-fa:Q8_0
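Because the vLLM server is OpenAI-compatible, it can also be called from Python with the openai client instead of curl. The snippet below is a minimal sketch assuming the default port used by the serve command above; adjust base_url if you changed it.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server (OpenAI-compatible API)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mshojaei77/Gemma-2-2b-fa",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)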
- SGLang
How to use mshojaei77/Gemma-2-2b-fa with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "mshojaei77/Gemma-2-2b-fa" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "mshojaei77/Gemma-2-2b-fa",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "mshojaei77/Gemma-2-2b-fa" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "mshojaei77/Gemma-2-2b-fa",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
- Ollama
How to use mshojaei77/Gemma-2-2b-fa with Ollama:
ollama run hf.co/mshojaei77/Gemma-2-2b-fa:Q8_0
- Unsloth Studio
How to use mshojaei77/Gemma-2-2b-fa with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for mshojaei77/Gemma-2-2b-fa to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for mshojaei77/Gemma-2-2b-fa to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for mshojaei77/Gemma-2-2b-fa to start chatting
- Docker Model Runner
How to use mshojaei77/Gemma-2-2b-fa with Docker Model Runner:
docker model run hf.co/mshojaei77/Gemma-2-2b-fa:Q8_0
- Lemonade
How to use mshojaei77/Gemma-2-2b-fa with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull mshojaei77/Gemma-2-2b-fa:Q8_0
Run and chat with the model
lemonade run user.Gemma-2-2b-fa-Q8_0
List all available models
lemonade list
Persian Gemma 2b - Conversational AI Experiment (Early Stage)
This repository presents Persian Gemma 2b, an early-stage experimental model derived from Google's Gemma-2-2b-it. It has been fine-tuned using QLoRA on the mshojaei77/Persian_sft dataset to explore its capabilities in Persian language conversational tasks.
1. Model Architecture
- Base Model: google/gemma-2-2b-it
- Architecture Type: Gemma2ForCausalLM
- Model Size: 2 billion parameters.
- Description: Persian Gemma 2b inherits the architecture of Gemma-2-2b-it, a lightweight yet capable model known for its efficiency and strong performance for its size. It is designed for text generation tasks and is particularly suited for conversational applications. The model uses standard transformer layers with attention mechanisms, enabling it to process and generate text in Persian.
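Because the architecture is inherited unchanged from the base model, a quick sanity check is to load only the configuration and confirm the Gemma-2 settings. This is an optional sketch, assuming the repository hosts a full merged model rather than only an adapter:

from transformers import AutoConfig

# Load only the configuration to inspect the inherited Gemma-2 architecture
config = AutoConfig.from_pretrained("mshojaei77/Gemma-2-2b-fa")
print(config.model_type)  # expected: "gemma2"
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)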
2. Training Details
- Fine-tuning Method: QLoRA (Quantized Low-Rank Adaptation)
- QLoRA is used for parameter-efficient fine-tuning, allowing adaptation of the base model with reduced computational resources and memory footprint.
- LoRA Rank (r): 32
- LoRA Alpha: 16
- LoRA Dropout: 0.05
- LoRA Target Modules: ['down_proj', 'gate_proj', 'k_proj', 'o_proj', 'q_proj', 'up_proj', 'v_proj'] (linear projection layers)
- Training Dataset: mshojaei77/Persian_sft
- Training Steps: 20 (Extremely limited - Proof of Concept)
- Hardware: Kaggle Notebook, T4 GPU
- Software: Axolotl library
- Optimizer: paged_adamw_32bit
- Learning Rate Scheduler: cosine
- Learning Rate: 0.0002
- Micro Batch Size: 1
- Gradient Accumulation Steps: 1
- Sequence Length: 2048
- Sample Packing: Enabled (sample_packing: true)
- Mixed Precision: FP16 (fp16: true), Load in 4-bit (load_in_4bit: true), BF16: Disabled (bf16: false)
- Gradient Checkpointing: Enabled (gradient_checkpointing: true)
- Attention Implementation: SDPA (default; Flash Attention explicitly disabled - flash_attention: false)
- Tokenizer: Uses the tokenizer from the base model google/gemma-2-2b-it
- Chat Template: gemma
- Training Objective: Supervised Fine-tuning (SFT) to adapt the base model for Persian conversational responses, guided by the Persian_sft dataset
- Validation Set: None used in this preliminary experiment
Critical Note: The model was trained for an exceptionally short duration (20 steps). This is insufficient for robust learning and generalization. Expect significantly under-optimized performance.
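The run itself was configured with Axolotl rather than written by hand. For readers more familiar with the Hugging Face stack, the following is a rough peft/bitsandbytes sketch of an equivalent QLoRA setup using the hyperparameters listed above; it is illustrative only and is not the actual training script.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit base model, matching load_in_4bit: true and fp16: true
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# LoRA hyperparameters from the Training Details list above
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["down_proj", "gate_proj", "k_proj", "o_proj", "q_proj", "up_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable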
3. Dataset Information
- Dataset Name: mshojaei77/Persian_sft
- Dataset Description: The Persian_sft dataset is a collection of Persian conversations designed for instruction fine-tuning of language models. It likely contains examples of user queries and desired model responses in Persian, formatted for conversational fine-tuning.
- Dataset Type: Supervised Fine-tuning (SFT) dataset for conversational AI.
- Language: Primarily Persian (fa).
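To inspect the training data yourself, the dataset can be loaded directly from the Hub with the datasets library. The split and column names are not documented here, so treat the access pattern below as an assumption and check the dataset card:

from datasets import load_dataset

# Download the SFT dataset from the Hugging Face Hub
ds = load_dataset("mshojaei77/Persian_sft")
print(ds)              # available splits and columns
print(ds["train"][0])  # first example (assumes a "train" split)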
4. Intended Use
Intended Use Cases:
- Research & Experimentation: Primary use is to investigate the feasibility of fine-tuning Gemma-2-2b-it for Persian language conversational tasks and to serve as a starting point for further research.
- Educational Purposes: Demonstration of QLoRA fine-tuning techniques using Axolotl, and a practical example for learning about Persian language model development.
- Community Development: To encourage community contributions towards building better Persian language models and resources.
- Prototyping (with caution): For rapid prototyping and exploring potential applications of Persian conversational AI, strictly acknowledging the model's limitations and preliminary state.
5. Limitations
- Severe Under-training: Trained for only 20 steps, leading to significantly sub-optimal performance across all aspects.
- Lack of Validation: Absence of a validation set hinders monitoring of generalization and increases the risk of overfitting.
- Limited Fluency and Coherence: May produce grammatically incorrect, disfluent, or incoherent Persian text, especially in complex or lengthy conversations.
- Hallucinations and Factual Errors: Prone to generating factually incorrect or nonsensical information. Verification of output is crucial.
- Bias: Likely inherits and potentially amplifies biases from the base model and the fine-tuning dataset, leading to biased or unfair outputs.
- Poor Generalization: Performance is expected to degrade significantly on data outside the training distribution (different conversational styles, topics, or domains).
- Limited Conversational Abilities: May struggle with complex conversational turns, context maintenance, and nuanced understanding of user intent.
- Ethical Concerns: Potential for biased, inaccurate, or inappropriate output raises ethical concerns, especially in sensitive applications.
6. Performance Metrics
Current Evaluation:
- No formal evaluation has been conducted for this preliminary model due to its extremely limited training. Performance is expected to be significantly below optimal.
7. How to Use
import torch
from transformers import pipeline

# Initialize the text generation pipeline
pipe = pipeline(
    "text-generation",
    model="mshojaei77/Gemma-2-2b-fa",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",  # Or "mps" for Macs with Apple Silicon
)

# Prepare input messages (the gemma chat template is applied automatically)
messages = [
    {"role": "user", "content": "سلام چطوری؟"},  # "Hi, how are you?"
]

# Generate a response with a maximum of 512 new tokens
outputs = pipe(messages, max_new_tokens=512)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)

# Example Output (Illustrative - Output quality may vary significantly):
# سلام! من خوبم، ممنون. شما چطوری؟ 😊  ("Hi! I'm fine, thanks. How are you?")
Important Usage Notes:
- library_name: transformers and pipeline_tag: text-generation: specified in the metadata and Model Details for discoverability and clarity.
- Chat Template: Gemma models use the gemma chat template; when a list of chat messages is passed, the pipeline applies the tokenizer's template automatically.
- Hardware Recommendations: a CUDA GPU is recommended; use device="mps" for Apple Silicon (performance may vary).
- Output Quality: Expect highly variable and often suboptimal output due to limited training. Critical evaluation of generated text is essential.