--- base_model: unsloth/llama-3.1-8b-instruct-bnb-4bit library_name: peft pipeline_tag: text-generation tags: - lora - llama-3.1 - api-generation - function-calling - structured-output - fine-tuned language: - en license: llama3.1 --- # Llama 3.1 8B - Structured API Generation (LoRA Adapter) **Fine-tuned adapter for generating structured JSON API calls from natural language queries** This LoRA adapter demonstrates that **context-engineered small models** can outperform generic large models on structured tasks: **40% vs 20.5% exact match** compared to GPT-4 class baseline on our evaluation set. ## Model Overview This is a LoRA adapter fine-tuned on [unsloth/llama-3.1-8b-instruct-bnb-4bit](https://huggingface.co/unsloth/llama-3.1-8b-instruct-bnb-4bit) for structured API generation. The model takes natural language queries and tool specifications as input, and generates JSON objects with `query`, `tool_name`, and `arguments` fields. **Context Engineering Approach:** Instead of relying on a massive generic model, we teach a small 8B model to understand and maintain structured output constraints through domain-specific fine-tuning. This demonstrates the power of task-specific context engineering over general-purpose scale. ### Key Performance Metrics | Metric | Our Model | Azure GPT-4o | Improvement | |--------|-----------|--------------|-------------| | Exact Match Accuracy | **40.0%** (20/50) | 20.5% (10/50) | **+95%** | | Tool Name Accuracy | **98.0%** (49/50) | ~90% | **+8.9%** | | Arguments Partial Match | **76.0%** | 60.2% | **+26%** | | JSON Validity | **100%** (50/50) | 100% | - | | Model Size | 8B params | ~120B params | **15x smaller** | | Training Time | 4m 52s | N/A | - | **Baseline Details:** Azure GPT-4o (GPT-4 Optimized, ~120B parameters) evaluated on the same 50 test examples with temperature=0.7, using standard chat completion API with JSON schema enforcement. ## Quick Start ### Installation ```bash pip install torch transformers peft bitsandbytes accelerate ``` ### Usage ```python import torch from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel # Load base model and adapter base_model = "unsloth/llama-3.1-8b-instruct-bnb-4bit" adapter_path = "kineticdrive/llama-structured-api-adapter" model = AutoModelForCausalLM.from_pretrained( base_model, device_map="auto", torch_dtype=torch.bfloat16, load_in_4bit=True ) model = PeftModel.from_pretrained(model, adapter_path) model.eval() tokenizer = AutoTokenizer.from_pretrained(base_model) # Generate API call prompt = """Return a JSON object with keys query, tool_name, arguments describing the API call. Query: Fetch the first 100 countries in ascending order. Chosen tool: getallcountry Arguments should mirror the assistant's recommendation.""" messages = [{"role": "user", "content": prompt}] inputs = tokenizer.apply_chat_template( messages, return_tensors="pt", add_generation_prompt=True ).to(model.device) with torch.no_grad(): outputs = model.generate( inputs, max_new_tokens=256, temperature=0.0, do_sample=False, pad_token_id=tokenizer.pad_token_id ) result = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True) print(result) ``` **Output:** ```json { "arguments": {"limit": 100, "order": "asc"}, "query": "Fetch the first 100 countries in ascending order.", "tool_name": "getallcountry" } ``` ## Training Details ### Dataset **⚠️ Note:** This is a proof-of-concept with a small, domain-specific dataset: - **Training**: 300 examples (~6 examples per tool on average) - **Validation**: 60 examples - **Test**: 50 examples (held-out from training) - **Domains**: API calls, math functions, data processing, web services - **Tool Coverage**: 50+ unique functions **Why this works:** The base Llama 3.1 8B Instruct model already has strong reasoning and JSON generation capabilities. We're teaching it task-specific structure preservation, not training from scratch. With ~6 examples per tool, the model learns to maintain the structured format while generalizing across similar API patterns. ### Training Hyperparameters ```yaml LoRA Configuration: r: 32 # Low-rank dimension alpha: 64 # LoRA scaling factor dropout: 0.1 target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj] trainable_params: 84M (1.04% of base model) Training: max_epochs: 3 actual_steps: 39 # Early convergence after ~1.2 epochs batch_size: 2 gradient_accumulation_steps: 4 effective_batch_size: 8 # 2 * 4 learning_rate: 2e-4 lr_scheduler: linear warmup_steps: 10 optimizer: adamw_8bit weight_decay: 0.01 max_seq_length: 2048 ``` ### Training Results - Final Training Loss: 0.50 - Final Validation Loss: 0.58 - Training Time: 4m 52s - GPU: 2x RTX 3090 (21.8GB/24GB per GPU) - Total Steps: 39 (early stopping due to loss convergence) - Steps per Epoch: ~37 (300 examples / effective batch size 8) ## Evaluation ### Overall Results Tested on 50 held-out examples with diverse API calls: | Metric | Score | Definition | |--------|-------|------------| | Exact Match | 40.0% (20/50) | Strict JSON equality after key normalization | | Tool Name Accuracy | 98.0% (49/50) | Correct function name selected | | Query Preservation | 92.0% (46/50) | Original user query maintained in output | | Args Partial Match | 76.0% | Key-wise F1 score on arguments dict | | JSON Validity | 100% (50/50) | Parseable JSON with no syntax errors | | Functional Correctness | 71.0% | Tool call would succeed (correct tool + has required args) | **Baseline (Azure GPT-4o):** 20.5% exact match (10/50), 60.2% field F1 ### Metric Definitions Each metric measures a different aspect of **context engineering** — how well the model maintains structured constraints: 1. **Exact Match Accuracy** - **What**: Strict string equality after whitespace normalization and key sorting - **Why**: Measures perfect adherence to schema and value formats - **Context Engineering**: Tests whether model learned exact output templates - **Example**: `{"query": "...", "tool_name": "getallcountry", "arguments": {"limit": 100, "order": "asc"}}` must match exactly 2. **Tool Name Accuracy** - **What**: Percentage of predictions with correct `tool_name` field matching expected function - **Why**: Most critical metric — wrong tool = complete failure - **Context Engineering**: Tests tool routing learned from examples - **Example**: Query "fetch countries" → must output `"tool_name": "getallcountry"`, not `"getAllCountries"` or `"get_country"` 3. **Query Preservation** - **What**: Original user query appears verbatim (or case-normalized) in output `query` field - **Why**: Ensures no information loss in pipeline - **Context Engineering**: Tests whether model maintains input fidelity vs paraphrasing - **Example**: Input "Fetch the first 100 countries" → Output must contain `"query": "Fetch the first 100 countries"` (not "Get 100 countries") 4. **Arguments Partial Match** - **What**: Key-wise F1 score — for each expected argument key, check if present with correct value - **Why**: Captures "mostly correct" calls where 1-2 args differ - **Context Engineering**: Tests parameter mapping consistency - **Example**: Expected `{"limit": 100, "order": "asc"}` vs Predicted `{"limit": 100, "order": "ASC"}` = 1.0 key match, 0.5 value match 5. **JSON Validity** - **What**: Output is parseable JSON (no syntax errors, bracket matching, valid escaping) - **Why**: Invalid JSON = parsing error in production - **Context Engineering**: Tests structural constraint adherence - **Example**: Must output `{"key": "value"}` not `{key: value}` or `{"key": "value"` (missing brace) 6. **Functional Correctness** - **What**: Tool call would execute successfully — correct tool name + all required arguments present - **Why**: Captures "usable" outputs even if not exact match - **Context Engineering**: Tests minimum viable output quality - **Example**: `{"tool_name": "getallcountry", "arguments": {"limit": 100}}` is functional even if `"order"` is missing (assuming it's optional) ### Evaluation Setup Transparency **Test Set:** 50 examples held-out from training, covering diverse API calls across 50+ tools **Our Model:** - Base: `unsloth/llama-3.1-8b-instruct-bnb-4bit` - Adapter: This LoRA fine-tune - Temperature: 0.0 (deterministic) - Max tokens: 256 - Prompt format: Same as training (query + tool spec → JSON output) **Baseline (Azure GPT-4o):** - Model: Azure OpenAI GPT-4o (gpt-4o-2024-08-06, ~120B params) - Temperature: 0.7 (as per Azure defaults) - Max tokens: 256 - Prompt format: Chat completion with system message describing JSON schema - JSON mode: Enabled via API parameter **⚠️ Evaluation Limitations:** - **Small test set (n=50):** With 20/50 vs 10/50 exact matches, confidence intervals overlap. A larger test set (200-300 examples) would provide more robust comparisons. - **Baseline prompt optimization:** Azure GPT-4o was evaluated with standard JSON schema enforcement but not extensively prompt-engineered. A more optimized baseline prompt might close the gap. - **In-distribution generalization:** Test set covers same API domains as training. Out-of-distribution tools or phrasing patterns may degrade performance. ### Context Engineering Examples **Example 1: Exact Match (Both models)** Input: ``` Query: Get all documents sorted by date Tool: getDocuments Args: {"sort": "date", "order": "desc"} ``` Our Model Output: ```json {"query": "Get all documents sorted by date", "tool_name": "getDocuments", "arguments": {"sort": "date", "order": "desc"}} ``` GPT-4o Output: ```json {"query": "Get all documents sorted by date", "tool_name": "getDocuments", "arguments": {"sort": "date", "order": "desc"}} ``` ✅ Both models: Exact match --- **Example 2: Our model wins (Case normalization)** Input: ``` Query: Fetch first 100 countries in ascending order Tool: getallcountry Args: {"limit": 100, "order": "asc"} ``` Our Model Output: ```json {"query": "Fetch first 100 countries in ascending order", "tool_name": "getallcountry", "arguments": {"limit": 100, "order": "asc"}} ``` GPT-4o Output: ```json {"query": "Fetch first 100 countries in ascending order", "tool_name": "getallcountry", "arguments": {"limit": 100, "order": "ASC"}} ``` ✅ Our model: Exact match (learned lowercase "asc" from examples) ⚠️ GPT-4o: Functional correctness, but not exact match (case differs) --- **Example 3: Both models functional but not exact** Input: ``` Query: Calculate sum of [1, 2, 3, 4, 5] Tool: calculate Args: {"operation": "sum", "values": [1, 2, 3, 4, 5]} ``` Our Model Output: ```json {"query": "Calculate sum of [1, 2, 3, 4, 5]", "tool_name": "calculate", "arguments": {"operation": "sum", "numbers": [1, 2, 3, 4, 5]}} ``` GPT-4o Output: ```json {"query": "Calculate the sum of the array [1, 2, 3, 4, 5]", "tool_name": "calculate", "arguments": {"op": "sum", "values": [1, 2, 3, 4, 5]}} ``` ⚠️ Our model: Wrong key name (`"numbers"` instead of `"values"`) but correct tool ⚠️ GPT-4o: Paraphrased query + abbreviated arg key (`"op"`) Both: Functional correctness ✅, Not exact match ❌ ## Use Cases - **AI Agent API generation**: Route user queries to appropriate backend APIs - **Structured data extraction**: Convert natural language to database queries - **Function calling for LLMs**: Generate tool invocations for agent frameworks - **Tool routing and parameter extraction**: Map intents to functions with correct arguments - **API request generation**: Transform conversational requests into structured API calls **Best for:** High-volume, latency-sensitive, cost-constrained deployments where you control the API schema and need consistent structured output. ## Limitations ### Scope Limitations - **Single API calls only**: Optimized for one tool per query (not multi-step workflows) - **English language only**: Not tested on non-English queries - **Domain-specific**: Best performance on APIs similar to training distribution (REST APIs, CRUD operations, math functions) - **Proof-of-concept scale**: Trained on 300 examples across 50+ tools (~6 examples/tool average) ### Known Failure Modes - **Optional parameters**: May omit optional arguments not seen in training examples - **Case sensitivity**: Generally learns lowercase conventions from training data (e.g., "asc" not "ASC") - **Synonym handling**: May not recognize alternative phrasings for same tool (e.g., "retrieve" vs "fetch" vs "get") - **Argument key variations**: Expects exact key names from training (e.g., won't map "num" → "number") - **Complex nested args**: Struggles with deeply nested JSON structures (>2 levels) ### Evaluation Caveats - **Small test set (n=50)**: Statistical confidence is limited; need 200-300 examples for robust claims - **In-distribution bias**: Test set covers same domains as training; OOD generalization untested - **Baseline comparison**: Azure GPT-4o not extensively prompt-optimized for this specific task ## Future Work & Next Steps To strengthen this proof-of-concept into a production-grade system: ### Evaluation Robustness - [ ] **Expand test set to 200-300 examples** for statistically significant comparisons - [ ] **Hold-out tool evaluation**: Train on subset of tools, test on completely unseen tools - [ ] **OOD phrasing evaluation**: Test with paraphrased queries (synonyms, different word order, extra context) - [ ] **Fair baseline comparison**: Lock in Azure GPT-4o prompt template, temperature=0, optimize for this task ### Model Improvements - [ ] **Ablation study**: Evaluate base Llama 3.1 8B (no LoRA) to quantify adapter contribution - [ ] **Larger training set**: Scale to 1,000-5,000 examples for better generalization - [ ] **Multi-turn support**: Extend to conversational API generation (clarifying questions, follow-ups) - [ ] **Error recovery**: Fine-tune on failure cases to handle edge cases ### Deployment Hardening - [ ] **Latency optimization**: Quantize to INT4 or deploy with vLLM for sub-second inference - [ ] **Monitoring**: Add production metrics (latency P99, error rates, schema violations) - [ ] **A/B testing framework**: Compare SLM vs LLM in production traffic - [ ] **Fallback strategy**: Route complex queries to GPT-4 when confidence is low ## Model Details - **Developed by:** AI_ATL25 Team - **Model type:** LoRA Adapter for Llama 3.1 8B - **Language:** English - **License:** Llama 3.1 Community License - **Finetuned from:** unsloth/llama-3.1-8b-instruct-bnb-4bit - **Adapter Size:** 335MB - **Trainable Parameters:** 84M (1.04% of base model) - **Proof-of-concept**: Yes — intended to demonstrate feasibility, not production-ready without further evaluation ## Citation ```bibtex @misc{llama31-structured-api-adapter, title={Fine-tuned Llama 3.1 8B for Structured API Generation}, author={AI_ATL25 Team}, year={2025}, publisher={HuggingFace}, howpublished={\url{https://huggingface.co/kineticdrive/llama-structured-api-adapter}} } ``` ## Contact - GitHub: [AI_ATL25](https://github.com/kineticdrive/AI_ATL25) - HuggingFace: [@kineticdrive](https://huggingface.co/kineticdrive) ### Framework versions - PEFT 0.17.1