| # Hugging Face Inference Endpoint: Gemma-3n-E4B-it LoRA Adapter | |
| This repository provides a LoRA adapter fine-tuned on top of a Hugging Face Transformers model (e.g., Gemma-3n-E4B-it) using PEFT. It is ready to be deployed as a Hugging Face Inference Endpoint. | |
| ## How to Deploy as an Endpoint | |
| 1. **Upload the `adapter` directory (produced by training) to your Hugging Face Hub repository.** | |
   - The directory should contain `adapter_config.json`, the adapter weights (`adapter_model.safetensors`, or `adapter_model.bin` on older PEFT versions), and the tokenizer files.
| 2. **Add a `handler.py` file to define the endpoint logic.** | |
| 3. **Push to the Hugging Face Hub.** | |
| 4. **Deploy as an Inference Endpoint via the Hugging Face UI.** | |
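For reference, a minimal repository layout for the steps above might look like this (the file names inside `adapter/` follow the training output described in step 1):

```text
your-username/your-repo
├── adapter/
│   ├── adapter_config.json
│   ├── adapter_model.safetensors   # or adapter_model.bin
│   └── (tokenizer files)
├── handler.py
└── requirements.txt
```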
| --- | |
| ## Example `handler.py` | |
This file defines the `EndpointHandler` class that Inference Endpoints looks for in `handler.py`: it loads the base model and the LoRA adapter, and exposes a `__call__` method for inference.
```python
from typing import Any, Dict

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


class EndpointHandler:
    def __init__(self, path: str = "."):
        # Load the base model and tokenizer the adapter was trained on.
        base_model_id = "<BASE_MODEL_ID>"  # e.g., "google/gemma-2b"
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
        base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)

        # Attach the LoRA adapter uploaded alongside this handler.
        self.model = PeftModel.from_pretrained(base_model, f"{path}/adapter")
        self.model.eval()

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Accept either {"inputs": "..."} or a raw prompt string.
        prompt = data["inputs"] if isinstance(data, dict) else data

        # Optional generation settings, e.g. {"parameters": {"max_new_tokens": 128}}.
        parameters = data.get("parameters", {}) if isinstance(data, dict) else {}
        parameters.setdefault("max_new_tokens", 256)

        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        with torch.no_grad():
            output = self.model.generate(**inputs, **parameters)
        decoded = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return {"generated_text": decoded}
```
| - Replace `<BASE_MODEL_ID>` with the correct base model (e.g., `google/gemma-2b`). | |
- The endpoint accepts a JSON payload with an `inputs` field containing the prompt and, with the handler above, an optional `parameters` object for generation settings, as shown below.
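For example, once the endpoint is running, you can query it from Python (the URL and token below are placeholders for your own endpoint):

```python
import requests

# Placeholders: replace with your endpoint URL and an access token for it.
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = "hf_..."

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "inputs": "Explain LoRA in one sentence.",
        "parameters": {"max_new_tokens": 64},
    },
)
response.raise_for_status()
print(response.json()["generated_text"])
```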
| --- | |
| ## Notes | |
| - Make sure your `requirements.txt` includes `transformers`, `peft`, and `torch`. | |
| - For large models, use an Inference Endpoint with GPU. | |
- You can customize the handler for chat formatting, streaming, and similar needs; see the sketch after this list.
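As an illustration of the chat-formatting customization, the handler's `__call__` could build the prompt with the tokenizer's chat template instead of passing the raw input straight to the model. A sketch, assuming the base model ships a chat template (as instruction-tuned Gemma models do):

```python
# Sketch: format a single-turn chat prompt inside __call__.
messages = [{"role": "user", "content": prompt}]
chat_prompt = self.tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return a string, not token ids
    add_generation_prompt=True,  # append the assistant turn marker
)
inputs = self.tokenizer(chat_prompt, return_tensors="pt").to(self.device)
```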
| --- | |
| ## Quickstart | |
| 1. Train your adapter with `train_gemma_unsloth.py`. | |
| 2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo. | |
| 3. Deploy as an Inference Endpoint. | |
| 4. Send requests to your endpoint! | |
| ```` | |
| # Hugging Face Inference Endpoint: Gemma-3n-E4B-it LoRA Adapter | |
| This repository provides a LoRA adapter fine-tuned on top of a Hugging Face Transformers model (e.g., Gemma-3n-E4B-it) using PEFT. It is ready to be deployed as a Hugging Face Inference Endpoint. | |
| ## How to Deploy as an Endpoint | |
| 1. **Upload the `adapter` directory (produced by training) to your Hugging Face Hub repository.** | |
| - The directory should contain `adapter_config.json`, `adapter_model.bin`, and tokenizer files. | |
| 2. **Add a `handler.py` file to define the endpoint logic.** | |
| 3. **Push to the Hugging Face Hub.** | |
| 4. **Deploy as an Inference Endpoint via the Hugging Face UI.** | |
| --- | |
| ## Example `handler.py` | |
| This file loads the base model and LoRA adapter, and exposes a `__call__` method for inference. | |
| ```python | |
| from typing import Dict, Any | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| from peft import PeftModel, PeftConfig | |
| import torch | |
| class EndpointHandler: | |
| def __init__(self, path="."): | |
| # Load base model and tokenizer | |
| base_model_id = "<BASE_MODEL_ID>" # e.g., "google/gemma-2b" | |
| self.tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True) | |
| base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True) | |
| # Load LoRA adapter | |
| self.model = PeftModel.from_pretrained(base_model, f"{path}/adapter") | |
| self.model.eval() | |
| self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") | |
| self.model.to(self.device) | |
| def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]: | |
| prompt = data["inputs"] if isinstance(data, dict) else data | |
| inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device) | |
| with torch.no_grad(): | |
| output = self.model.generate(**inputs, max_new_tokens=256) | |
| decoded = self.tokenizer.decode(output[0], skip_special_tokens=True) | |
| return {"generated_text": decoded} | |
| ```` | |
| - Replace `<BASE_MODEL_ID>` with the correct base model (e.g., `google/gemma-2b`). | |
| - The endpoint will accept a JSON payload with an `inputs` field containing the prompt. | |
| --- | |
| ## Notes | |
| - Make sure your `requirements.txt` includes `transformers`, `peft`, and `torch`. | |
| - For large models, use an Inference Endpoint with GPU. | |
| - You can customize the handler for chat formatting, streaming, etc. | |
| --- | |
| ## Quickstart | |
| 1. Train your adapter with `train_gemma_unsloth.py`. | |
| 2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo. | |
| 3. Deploy as an Inference Endpoint. | |
| 4. Send requests to your endpoint! | |