File size: 15,374 Bytes

---
base_model: unsloth/llama-3.1-8b-instruct-bnb-4bit
library_name: peft
pipeline_tag: text-generation
tags:
- lora
- llama-3.1
- api-generation
- function-calling
- structured-output
- fine-tuned
language:
- en
license: llama3.1
---

# Llama 3.1 8B - Structured API Generation (LoRA Adapter)

**Fine-tuned adapter for generating structured JSON API calls from natural language queries**

This LoRA adapter demonstrates that **context-engineered small models** can outperform generic large models on structured tasks: **40% vs 20.5% exact match** compared to GPT-4 class baseline on our evaluation set.

## Model Overview

This is a LoRA adapter fine-tuned on [unsloth/llama-3.1-8b-instruct-bnb-4bit](https://huggingface.co/unsloth/llama-3.1-8b-instruct-bnb-4bit) for structured API generation. The model takes natural language queries and tool specifications as input, and generates JSON objects with `query`, `tool_name`, and `arguments` fields.

**Context Engineering Approach:** Instead of relying on a massive generic model, we teach a small 8B model to understand and maintain structured output constraints through domain-specific fine-tuning. This demonstrates the power of task-specific context engineering over general-purpose scale.

### Key Performance Metrics

| Metric | Our Model | Azure GPT-4o | Improvement |
|--------|-----------|--------------|-------------|
| Exact Match Accuracy | **40.0%** (20/50) | 20.5% (10/50) | **+95%** |
| Tool Name Accuracy | **98.0%** (49/50) | ~90% | **+8.9%** |
| Arguments Partial Match | **76.0%** | 60.2% | **+26%** |
| JSON Validity | **100%** (50/50) | 100% | - |
| Model Size | 8B params | ~120B params | **15x smaller** |
| Training Time | 4m 52s | N/A | - |

**Baseline Details:** Azure GPT-4o (GPT-4 Optimized, ~120B parameters) evaluated on the same 50 test examples with temperature=0.7, using standard chat completion API with JSON schema enforcement.

## Quick Start

### Installation

```bash
pip install torch transformers peft bitsandbytes accelerate
```

### Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and adapter
base_model = "unsloth/llama-3.1-8b-instruct-bnb-4bit"
adapter_path = "kineticdrive/llama-structured-api-adapter"

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    load_in_4bit=True
)
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(base_model)

# Generate API call
prompt = """Return a JSON object with keys query, tool_name, arguments describing the API call.
Query: Fetch the first 100 countries in ascending order.
Chosen tool: getallcountry
Arguments should mirror the assistant's recommendation."""

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=256,
        temperature=0.0,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id
    )

result = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(result)
```

**Output:**
```json
{
  "arguments": {"limit": 100, "order": "asc"},
  "query": "Fetch the first 100 countries in ascending order.",
  "tool_name": "getallcountry"
}
```

## Training Details

### Dataset

**⚠️ Note:** This is a proof-of-concept with a small, domain-specific dataset:
- **Training**: 300 examples (~6 examples per tool on average)
- **Validation**: 60 examples
- **Test**: 50 examples (held-out from training)
- **Domains**: API calls, math functions, data processing, web services
- **Tool Coverage**: 50+ unique functions

**Why this works:** The base Llama 3.1 8B Instruct model already has strong reasoning and JSON generation capabilities. We're teaching it task-specific structure preservation, not training from scratch. With ~6 examples per tool, the model learns to maintain the structured format while generalizing across similar API patterns.

### Training Hyperparameters

```yaml
LoRA Configuration:
  r: 32                    # Low-rank dimension
  alpha: 64                # LoRA scaling factor
  dropout: 0.1
  target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
  trainable_params: 84M (1.04% of base model)

Training:
  max_epochs: 3
  actual_steps: 39         # Early convergence after ~1.2 epochs
  batch_size: 2
  gradient_accumulation_steps: 4
  effective_batch_size: 8  # 2 * 4
  learning_rate: 2e-4
  lr_scheduler: linear
  warmup_steps: 10
  optimizer: adamw_8bit
  weight_decay: 0.01
  max_seq_length: 2048
```

### Training Results
- Final Training Loss: 0.50
- Final Validation Loss: 0.58
- Training Time: 4m 52s
- GPU: 2x RTX 3090 (21.8GB/24GB per GPU)
- Total Steps: 39 (early stopping due to loss convergence)
- Steps per Epoch: ~37 (300 examples / effective batch size 8)

## Evaluation

### Overall Results

Tested on 50 held-out examples with diverse API calls:

| Metric | Score | Definition |
|--------|-------|------------|
| Exact Match | 40.0% (20/50) | Strict JSON equality after key normalization |
| Tool Name Accuracy | 98.0% (49/50) | Correct function name selected |
| Query Preservation | 92.0% (46/50) | Original user query maintained in output |
| Args Partial Match | 76.0% | Key-wise F1 score on arguments dict |
| JSON Validity | 100% (50/50) | Parseable JSON with no syntax errors |
| Functional Correctness | 71.0% | Tool call would succeed (correct tool + has required args) |

**Baseline (Azure GPT-4o):** 20.5% exact match (10/50), 60.2% field F1

### Metric Definitions

Each metric measures a different aspect of **context engineering** — how well the model maintains structured constraints:

1. **Exact Match Accuracy**
   - **What**: Strict string equality after whitespace normalization and key sorting
   - **Why**: Measures perfect adherence to schema and value formats
   - **Context Engineering**: Tests whether model learned exact output templates
   - **Example**: `{"query": "...", "tool_name": "getallcountry", "arguments": {"limit": 100, "order": "asc"}}` must match exactly

2. **Tool Name Accuracy**
   - **What**: Percentage of predictions with correct `tool_name` field matching expected function
   - **Why**: Most critical metric — wrong tool = complete failure
   - **Context Engineering**: Tests tool routing learned from examples
   - **Example**: Query "fetch countries" → must output `"tool_name": "getallcountry"`, not `"getAllCountries"` or `"get_country"`

3. **Query Preservation**
   - **What**: Original user query appears verbatim (or case-normalized) in output `query` field
   - **Why**: Ensures no information loss in pipeline
   - **Context Engineering**: Tests whether model maintains input fidelity vs paraphrasing
   - **Example**: Input "Fetch the first 100 countries" → Output must contain `"query": "Fetch the first 100 countries"` (not "Get 100 countries")

4. **Arguments Partial Match**
   - **What**: Key-wise F1 score — for each expected argument key, check if present with correct value
   - **Why**: Captures "mostly correct" calls where 1-2 args differ
   - **Context Engineering**: Tests parameter mapping consistency
   - **Example**: Expected `{"limit": 100, "order": "asc"}` vs Predicted `{"limit": 100, "order": "ASC"}` = 1.0 key match, 0.5 value match

5. **JSON Validity**
   - **What**: Output is parseable JSON (no syntax errors, bracket matching, valid escaping)
   - **Why**: Invalid JSON = parsing error in production
   - **Context Engineering**: Tests structural constraint adherence
   - **Example**: Must output `{"key": "value"}` not `{key: value}` or `{"key": "value"` (missing brace)

6. **Functional Correctness**
   - **What**: Tool call would execute successfully — correct tool name + all required arguments present
   - **Why**: Captures "usable" outputs even if not exact match
   - **Context Engineering**: Tests minimum viable output quality
   - **Example**: `{"tool_name": "getallcountry", "arguments": {"limit": 100}}` is functional even if `"order"` is missing (assuming it's optional)

### Evaluation Setup Transparency

**Test Set:** 50 examples held-out from training, covering diverse API calls across 50+ tools

**Our Model:**
- Base: `unsloth/llama-3.1-8b-instruct-bnb-4bit`
- Adapter: This LoRA fine-tune
- Temperature: 0.0 (deterministic)
- Max tokens: 256
- Prompt format: Same as training (query + tool spec → JSON output)

**Baseline (Azure GPT-4o):**
- Model: Azure OpenAI GPT-4o (gpt-4o-2024-08-06, ~120B params)
- Temperature: 0.7 (as per Azure defaults)
- Max tokens: 256
- Prompt format: Chat completion with system message describing JSON schema
- JSON mode: Enabled via API parameter

**⚠️ Evaluation Limitations:**
- **Small test set (n=50):** With 20/50 vs 10/50 exact matches, confidence intervals overlap. A larger test set (200-300 examples) would provide more robust comparisons.
- **Baseline prompt optimization:** Azure GPT-4o was evaluated with standard JSON schema enforcement but not extensively prompt-engineered. A more optimized baseline prompt might close the gap.
- **In-distribution generalization:** Test set covers same API domains as training. Out-of-distribution tools or phrasing patterns may degrade performance.

### Context Engineering Examples

**Example 1: Exact Match (Both models)**

Input:
```
Query: Get all documents sorted by date
Tool: getDocuments
Args: {"sort": "date", "order": "desc"}
```

Our Model Output:
```json
{"query": "Get all documents sorted by date", "tool_name": "getDocuments", "arguments": {"sort": "date", "order": "desc"}}
```

GPT-4o Output:
```json
{"query": "Get all documents sorted by date", "tool_name": "getDocuments", "arguments": {"sort": "date", "order": "desc"}}
```

✅ Both models: Exact match

---

**Example 2: Our model wins (Case normalization)**

Input:
```
Query: Fetch first 100 countries in ascending order
Tool: getallcountry
Args: {"limit": 100, "order": "asc"}
```

Our Model Output:
```json
{"query": "Fetch first 100 countries in ascending order", "tool_name": "getallcountry", "arguments": {"limit": 100, "order": "asc"}}
```

GPT-4o Output:
```json
{"query": "Fetch first 100 countries in ascending order", "tool_name": "getallcountry", "arguments": {"limit": 100, "order": "ASC"}}
```

✅ Our model: Exact match (learned lowercase "asc" from examples)
⚠️ GPT-4o: Functional correctness, but not exact match (case differs)

---

**Example 3: Both models functional but not exact**

Input:
```
Query: Calculate sum of [1, 2, 3, 4, 5]
Tool: calculate
Args: {"operation": "sum", "values": [1, 2, 3, 4, 5]}
```

Our Model Output:
```json
{"query": "Calculate sum of [1, 2, 3, 4, 5]", "tool_name": "calculate", "arguments": {"operation": "sum", "numbers": [1, 2, 3, 4, 5]}}
```

GPT-4o Output:
```json
{"query": "Calculate the sum of the array [1, 2, 3, 4, 5]", "tool_name": "calculate", "arguments": {"op": "sum", "values": [1, 2, 3, 4, 5]}}
```

⚠️ Our model: Wrong key name (`"numbers"` instead of `"values"`) but correct tool
⚠️ GPT-4o: Paraphrased query + abbreviated arg key (`"op"`)

Both: Functional correctness ✅, Not exact match ❌

## Use Cases

- **AI Agent API generation**: Route user queries to appropriate backend APIs
- **Structured data extraction**: Convert natural language to database queries
- **Function calling for LLMs**: Generate tool invocations for agent frameworks
- **Tool routing and parameter extraction**: Map intents to functions with correct arguments
- **API request generation**: Transform conversational requests into structured API calls

**Best for:** High-volume, latency-sensitive, cost-constrained deployments where you control the API schema and need consistent structured output.

## Limitations

### Scope Limitations
- **Single API calls only**: Optimized for one tool per query (not multi-step workflows)
- **English language only**: Not tested on non-English queries
- **Domain-specific**: Best performance on APIs similar to training distribution (REST APIs, CRUD operations, math functions)
- **Proof-of-concept scale**: Trained on 300 examples across 50+ tools (~6 examples/tool average)

### Known Failure Modes
- **Optional parameters**: May omit optional arguments not seen in training examples
- **Case sensitivity**: Generally learns lowercase conventions from training data (e.g., "asc" not "ASC")
- **Synonym handling**: May not recognize alternative phrasings for same tool (e.g., "retrieve" vs "fetch" vs "get")
- **Argument key variations**: Expects exact key names from training (e.g., won't map "num" → "number")
- **Complex nested args**: Struggles with deeply nested JSON structures (>2 levels)

### Evaluation Caveats
- **Small test set (n=50)**: Statistical confidence is limited; need 200-300 examples for robust claims
- **In-distribution bias**: Test set covers same domains as training; OOD generalization untested
- **Baseline comparison**: Azure GPT-4o not extensively prompt-optimized for this specific task

## Future Work & Next Steps

To strengthen this proof-of-concept into a production-grade system:

### Evaluation Robustness
- [ ] **Expand test set to 200-300 examples** for statistically significant comparisons
- [ ] **Hold-out tool evaluation**: Train on subset of tools, test on completely unseen tools
- [ ] **OOD phrasing evaluation**: Test with paraphrased queries (synonyms, different word order, extra context)
- [ ] **Fair baseline comparison**: Lock in Azure GPT-4o prompt template, temperature=0, optimize for this task

### Model Improvements
- [ ] **Ablation study**: Evaluate base Llama 3.1 8B (no LoRA) to quantify adapter contribution
- [ ] **Larger training set**: Scale to 1,000-5,000 examples for better generalization
- [ ] **Multi-turn support**: Extend to conversational API generation (clarifying questions, follow-ups)
- [ ] **Error recovery**: Fine-tune on failure cases to handle edge cases

### Deployment Hardening
- [ ] **Latency optimization**: Quantize to INT4 or deploy with vLLM for sub-second inference
- [ ] **Monitoring**: Add production metrics (latency P99, error rates, schema violations)
- [ ] **A/B testing framework**: Compare SLM vs LLM in production traffic
- [ ] **Fallback strategy**: Route complex queries to GPT-4 when confidence is low

## Model Details

- **Developed by:** AI_ATL25 Team
- **Model type:** LoRA Adapter for Llama 3.1 8B
- **Language:** English
- **License:** Llama 3.1 Community License
- **Finetuned from:** unsloth/llama-3.1-8b-instruct-bnb-4bit
- **Adapter Size:** 335MB
- **Trainable Parameters:** 84M (1.04% of base model)
- **Proof-of-concept**: Yes — intended to demonstrate feasibility, not production-ready without further evaluation

## Citation

```bibtex
@misc{llama31-structured-api-adapter,
  title={Fine-tuned Llama 3.1 8B for Structured API Generation},
  author={AI_ATL25 Team},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/kineticdrive/llama-structured-api-adapter}}
}
```

## Contact

- GitHub: [AI_ATL25](https://github.com/kineticdrive/AI_ATL25)
- HuggingFace: [@kineticdrive](https://huggingface.co/kineticdrive)
### Framework versions

- PEFT 0.17.1