---
base_model: unsloth/llama-3.1-8b-instruct-bnb-4bit
library_name: peft
pipeline_tag: text-generation
tags:
- lora
- llama-3.1
- api-generation
- function-calling
- structured-output
- fine-tuned
language:
- en
license: llama3.1
---
# Llama 3.1 8B - Structured API Generation (LoRA Adapter)
**Fine-tuned adapter for generating structured JSON API calls from natural language queries**
This LoRA adapter demonstrates that **context-engineered small models** can outperform generic large models on structured tasks: **40.0% vs 20.0% exact match** against an Azure GPT-4o baseline on our 50-example evaluation set.
## Model Overview
This is a LoRA adapter fine-tuned on [unsloth/llama-3.1-8b-instruct-bnb-4bit](https://huggingface.co/unsloth/llama-3.1-8b-instruct-bnb-4bit) for structured API generation. The model takes natural language queries and tool specifications as input, and generates JSON objects with `query`, `tool_name`, and `arguments` fields.
**Context Engineering Approach:** Instead of relying on a massive generic model, we teach a small 8B model to understand and maintain structured output constraints through domain-specific fine-tuning. This demonstrates the power of task-specific context engineering over general-purpose scale.
### Key Performance Metrics
| Metric | Our Model | Azure GPT-4o | Improvement |
|--------|-----------|--------------|-------------|
| Exact Match Accuracy | **40.0%** (20/50) | 20.0% (10/50) | **+100%** |
| Tool Name Accuracy | **98.0%** (49/50) | ~90% | **+8.9%** |
| Arguments Partial Match | **76.0%** | 60.2% | **+26%** |
| JSON Validity | **100%** (50/50) | 100% | - |
| Model Size | 8B params | ~120B params (est.) | **~15x smaller (est.)** |
| Training Time | 4m 52s | N/A | - |
**Baseline Details:** Azure OpenAI GPT-4o (parameter count undisclosed; ~120B is an estimate) evaluated on the same 50 test examples with temperature=0.7, using the standard chat completion API with JSON schema enforcement.
## Quick Start
### Installation
```bash
pip install torch transformers peft bitsandbytes accelerate
```
### Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model and adapter
base_model = "unsloth/llama-3.1-8b-instruct-bnb-4bit"
adapter_path = "kineticdrive/llama-structured-api-adapter"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)  # 4-bit quantization loads automatically from the checkpoint's bitsandbytes config
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(base_model)
# Generate API call
prompt = """Return a JSON object with keys query, tool_name, arguments describing the API call.
Query: Fetch the first 100 countries in ascending order.
Chosen tool: getallcountry
Arguments should mirror the assistant's recommendation."""
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
messages,
return_tensors="pt",
add_generation_prompt=True
).to(model.device)
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=256,
        do_sample=False,  # greedy decoding for deterministic output
        pad_token_id=tokenizer.eos_token_id,  # Llama 3.1 has no dedicated pad token
    )
result = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(result)
```
**Output:**
```json
{
"arguments": {"limit": 100, "order": "asc"},
"query": "Fetch the first 100 countries in ascending order.",
"tool_name": "getallcountry"
}
```
## Training Details
### Dataset
**⚠️ Note:** This is a proof-of-concept with a small, domain-specific dataset:
- **Training**: 300 examples (~6 examples per tool on average)
- **Validation**: 60 examples
- **Test**: 50 examples (held-out from training)
- **Domains**: API calls, math functions, data processing, web services
- **Tool Coverage**: 50+ unique functions
**Why this works:** The base Llama 3.1 8B Instruct model already has strong reasoning and JSON generation capabilities. We're teaching it task-specific structure preservation, not training from scratch. With ~6 examples per tool, the model learns to maintain the structured format while generalizing across similar API patterns.
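For illustration, a single training record might look like the following sketch. The `prompt`/`completion` field names are hypothetical; the exact serialization used in training is not published with this card.

```python
import json

# Hypothetical training record: the model sees the query and chosen tool,
# and learns to emit the structured JSON with query/tool_name/arguments.
record = {
    "prompt": (
        "Return a JSON object with keys query, tool_name, arguments "
        "describing the API call.\n"
        "Query: Fetch the first 100 countries in ascending order.\n"
        "Chosen tool: getallcountry"
    ),
    "completion": json.dumps(
        {
            "query": "Fetch the first 100 countries in ascending order.",
            "tool_name": "getallcountry",
            "arguments": {"limit": 100, "order": "asc"},
        },
        sort_keys=True,
    ),
}

print(record["completion"])
```

Serializing the target with `sort_keys=True` keeps key order deterministic, which simplifies strict exact-match evaluation later.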
### Training Hyperparameters
```yaml
LoRA Configuration:
r: 32 # Low-rank dimension
alpha: 64 # LoRA scaling factor
dropout: 0.1
target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
trainable_params: 84M (1.04% of base model)
Training:
max_epochs: 3
  actual_steps: 39 # Early convergence after ~1 epoch (~37.5 steps/epoch)
batch_size: 2
gradient_accumulation_steps: 4
effective_batch_size: 8 # 2 * 4
learning_rate: 2e-4
lr_scheduler: linear
warmup_steps: 10
optimizer: adamw_8bit
weight_decay: 0.01
max_seq_length: 2048
```
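The LoRA settings above map directly onto a `peft` `LoraConfig`; a minimal sketch (trainer, dataset, and quantized-model wiring omitted):

```python
from peft import LoraConfig

# Mirrors the YAML above; bias="none" and task_type are common defaults
# for causal-LM LoRA fine-tunes and are assumptions here.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```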
### Training Results
- Final Training Loss: 0.50
- Final Validation Loss: 0.58
- Training Time: 4m 52s
- GPU: 2x RTX 3090 (21.8GB/24GB per GPU)
- Total Steps: 39 (early stopping due to loss convergence)
- Steps per Epoch: ~37 (300 examples / effective batch size 8)
## Evaluation
### Overall Results
Tested on 50 held-out examples with diverse API calls:
| Metric | Score | Definition |
|--------|-------|------------|
| Exact Match | 40.0% (20/50) | Strict JSON equality after key normalization |
| Tool Name Accuracy | 98.0% (49/50) | Correct function name selected |
| Query Preservation | 92.0% (46/50) | Original user query maintained in output |
| Args Partial Match | 76.0% | Key-wise F1 score on arguments dict |
| JSON Validity | 100% (50/50) | Parseable JSON with no syntax errors |
| Functional Correctness | 71.0% | Tool call would succeed (correct tool + has required args) |
**Baseline (Azure GPT-4o):** 20.0% exact match (10/50), 60.2% field F1
### Metric Definitions
Each metric measures a different aspect of **context engineering** — how well the model maintains structured constraints:
1. **Exact Match Accuracy**
- **What**: Strict string equality after whitespace normalization and key sorting
- **Why**: Measures perfect adherence to schema and value formats
- **Context Engineering**: Tests whether model learned exact output templates
- **Example**: `{"query": "...", "tool_name": "getallcountry", "arguments": {"limit": 100, "order": "asc"}}` must match exactly
2. **Tool Name Accuracy**
- **What**: Percentage of predictions with correct `tool_name` field matching expected function
- **Why**: Most critical metric — wrong tool = complete failure
- **Context Engineering**: Tests tool routing learned from examples
- **Example**: Query "fetch countries" → must output `"tool_name": "getallcountry"`, not `"getAllCountries"` or `"get_country"`
3. **Query Preservation**
- **What**: Original user query appears verbatim (or case-normalized) in output `query` field
- **Why**: Ensures no information loss in pipeline
- **Context Engineering**: Tests whether model maintains input fidelity vs paraphrasing
- **Example**: Input "Fetch the first 100 countries" → Output must contain `"query": "Fetch the first 100 countries"` (not "Get 100 countries")
4. **Arguments Partial Match**
- **What**: Key-wise F1 score — for each expected argument key, check if present with correct value
- **Why**: Captures "mostly correct" calls where 1-2 args differ
- **Context Engineering**: Tests parameter mapping consistency
- **Example**: Expected `{"limit": 100, "order": "asc"}` vs Predicted `{"limit": 100, "order": "ASC"}` = 1.0 key match, 0.5 value match
5. **JSON Validity**
- **What**: Output is parseable JSON (no syntax errors, bracket matching, valid escaping)
- **Why**: Invalid JSON = parsing error in production
- **Context Engineering**: Tests structural constraint adherence
- **Example**: Must output `{"key": "value"}` not `{key: value}` or `{"key": "value"` (missing brace)
6. **Functional Correctness**
- **What**: Tool call would execute successfully — correct tool name + all required arguments present
- **Why**: Captures "usable" outputs even if not exact match
- **Context Engineering**: Tests minimum viable output quality
- **Example**: `{"tool_name": "getallcountry", "arguments": {"limit": 100}}` is functional even if `"order"` is missing (assuming it's optional)
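The evaluation scripts are not published with this card; a minimal sketch of how the exact-match and key-wise F1 metrics defined above could be computed:

```python
import json

def exact_match(pred: str, gold: str) -> bool:
    """Strict equality after parsing and key sorting (whitespace-insensitive)."""
    try:
        return (json.dumps(json.loads(pred), sort_keys=True)
                == json.dumps(json.loads(gold), sort_keys=True))
    except json.JSONDecodeError:
        return False

def args_partial_match(pred_args: dict, gold_args: dict) -> float:
    """Key-wise F1 on the arguments dict: a key counts as a hit only if
    it is present with exactly the expected value."""
    if not gold_args and not pred_args:
        return 1.0
    hits = sum(1 for k, v in gold_args.items() if pred_args.get(k) == v)
    precision = hits / len(pred_args) if pred_args else 0.0
    recall = hits / len(gold_args) if gold_args else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# "ASC" vs "asc": both keys present, but only one value correct
print(args_partial_match({"limit": 100, "order": "ASC"},
                         {"limit": 100, "order": "asc"}))  # 0.5
```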
### Evaluation Setup Transparency
**Test Set:** 50 examples held-out from training, covering diverse API calls across 50+ tools
**Our Model:**
- Base: `unsloth/llama-3.1-8b-instruct-bnb-4bit`
- Adapter: This LoRA fine-tune
- Temperature: 0.0 (deterministic)
- Max tokens: 256
- Prompt format: Same as training (query + tool spec → JSON output)
**Baseline (Azure GPT-4o):**
- Model: Azure OpenAI GPT-4o (gpt-4o-2024-08-06; parameter count undisclosed, ~120B estimated)
- Temperature: 0.7 (as per Azure defaults)
- Max tokens: 256
- Prompt format: Chat completion with system message describing JSON schema
- JSON mode: Enabled via API parameter
**⚠️ Evaluation Limitations:**
- **Small test set (n=50):** With 20/50 vs 10/50 exact matches, confidence intervals overlap. A larger test set (200-300 examples) would provide more robust comparisons.
- **Baseline prompt optimization:** Azure GPT-4o was evaluated with standard JSON schema enforcement but not extensively prompt-engineered. A more optimized baseline prompt might close the gap.
- **In-distribution generalization:** Test set covers same API domains as training. Out-of-distribution tools or phrasing patterns may degrade performance.
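The overlap claim can be checked directly: 95% Wilson score intervals for 20/50 vs 10/50 come out to roughly 0.28-0.54 vs 0.11-0.33, which do overlap. A standard-library sketch:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

ours = wilson_interval(20, 50)      # exact match 20/50 -> ~(0.276, 0.538)
baseline = wilson_interval(10, 50)  # exact match 10/50 -> ~(0.112, 0.330)
print(ours, baseline)
```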
### Context Engineering Examples
**Example 1: Exact Match (Both models)**
Input:
```
Query: Get all documents sorted by date
Tool: getDocuments
Args: {"sort": "date", "order": "desc"}
```
Our Model Output:
```json
{"query": "Get all documents sorted by date", "tool_name": "getDocuments", "arguments": {"sort": "date", "order": "desc"}}
```
GPT-4o Output:
```json
{"query": "Get all documents sorted by date", "tool_name": "getDocuments", "arguments": {"sort": "date", "order": "desc"}}
```
✅ Both models: Exact match
---
**Example 2: Our model wins (Case normalization)**
Input:
```
Query: Fetch first 100 countries in ascending order
Tool: getallcountry
Args: {"limit": 100, "order": "asc"}
```
Our Model Output:
```json
{"query": "Fetch first 100 countries in ascending order", "tool_name": "getallcountry", "arguments": {"limit": 100, "order": "asc"}}
```
GPT-4o Output:
```json
{"query": "Fetch first 100 countries in ascending order", "tool_name": "getallcountry", "arguments": {"limit": 100, "order": "ASC"}}
```
✅ Our model: Exact match (learned lowercase "asc" from examples)
⚠️ GPT-4o: Functional correctness, but not exact match (case differs)
---
**Example 3: Both models functional but not exact**
Input:
```
Query: Calculate sum of [1, 2, 3, 4, 5]
Tool: calculate
Args: {"operation": "sum", "values": [1, 2, 3, 4, 5]}
```
Our Model Output:
```json
{"query": "Calculate sum of [1, 2, 3, 4, 5]", "tool_name": "calculate", "arguments": {"operation": "sum", "numbers": [1, 2, 3, 4, 5]}}
```
GPT-4o Output:
```json
{"query": "Calculate the sum of the array [1, 2, 3, 4, 5]", "tool_name": "calculate", "arguments": {"op": "sum", "values": [1, 2, 3, 4, 5]}}
```
⚠️ Our model: Wrong key name (`"numbers"` instead of `"values"`) but correct tool
⚠️ GPT-4o: Paraphrased query + abbreviated arg key (`"op"`)
Both: Functional correctness ✅, Not exact match ❌
## Use Cases
- **AI Agent API generation**: Route user queries to appropriate backend APIs
- **Structured data extraction**: Convert natural language to database queries
- **Function calling for LLMs**: Generate tool invocations for agent frameworks
- **Tool routing and parameter extraction**: Map intents to functions with correct arguments
- **API request generation**: Transform conversational requests into structured API calls
**Best for:** High-volume, latency-sensitive, cost-constrained deployments where you control the API schema and need consistent structured output.
## Limitations
### Scope Limitations
- **Single API calls only**: Optimized for one tool per query (not multi-step workflows)
- **English language only**: Not tested on non-English queries
- **Domain-specific**: Best performance on APIs similar to training distribution (REST APIs, CRUD operations, math functions)
- **Proof-of-concept scale**: Trained on 300 examples across 50+ tools (~6 examples/tool average)
### Known Failure Modes
- **Optional parameters**: May omit optional arguments not seen in training examples
- **Case sensitivity**: Generally learns lowercase conventions from training data (e.g., "asc" not "ASC")
- **Synonym handling**: May not recognize alternative phrasings for same tool (e.g., "retrieve" vs "fetch" vs "get")
- **Argument key variations**: Expects exact key names from training (e.g., won't map "num" → "number")
- **Complex nested args**: Struggles with deeply nested JSON structures (>2 levels)
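Several of these failure modes (invalid JSON, unknown tools, missing required arguments) can be caught by a lightweight guard before a call reaches a backend. A standard-library sketch; the `REQUIRED_ARGS` table is hypothetical and would be derived from your actual API schemas:

```python
import json

# Hypothetical per-tool schema: required argument keys for each tool.
REQUIRED_ARGS = {
    "getallcountry": {"limit"},
    "calculate": {"operation", "values"},
}

def validate_call(raw: str) -> list:
    """Return a list of problems; an empty list means the call looks usable."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    problems = []
    for key in ("query", "tool_name", "arguments"):
        if key not in call:
            problems.append(f"missing field: {key}")
    tool = call.get("tool_name")
    if tool not in REQUIRED_ARGS:
        problems.append(f"unknown tool: {tool}")
    else:
        missing = REQUIRED_ARGS[tool] - set(call.get("arguments", {}))
        problems.extend(f"missing required arg: {k}" for k in sorted(missing))
    return problems

# The "numbers" vs "values" failure mode from Example 3 is caught here:
print(validate_call('{"query": "q", "tool_name": "calculate", '
                    '"arguments": {"operation": "sum", "numbers": [1]}}'))
# ['missing required arg: values']
```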
### Evaluation Caveats
- **Small test set (n=50)**: Statistical confidence is limited; need 200-300 examples for robust claims
- **In-distribution bias**: Test set covers same domains as training; OOD generalization untested
- **Baseline comparison**: Azure GPT-4o not extensively prompt-optimized for this specific task
## Future Work & Next Steps
To strengthen this proof-of-concept into a production-grade system:
### Evaluation Robustness
- [ ] **Expand test set to 200-300 examples** for statistically significant comparisons
- [ ] **Hold-out tool evaluation**: Train on subset of tools, test on completely unseen tools
- [ ] **OOD phrasing evaluation**: Test with paraphrased queries (synonyms, different word order, extra context)
- [ ] **Fair baseline comparison**: Lock in Azure GPT-4o prompt template, temperature=0, optimize for this task
### Model Improvements
- [ ] **Ablation study**: Evaluate base Llama 3.1 8B (no LoRA) to quantify adapter contribution
- [ ] **Larger training set**: Scale to 1,000-5,000 examples for better generalization
- [ ] **Multi-turn support**: Extend to conversational API generation (clarifying questions, follow-ups)
- [ ] **Error recovery**: Fine-tune on failure cases to handle edge cases
### Deployment Hardening
- [ ] **Latency optimization**: Quantize to INT4 or deploy with vLLM for sub-second inference
- [ ] **Monitoring**: Add production metrics (latency P99, error rates, schema violations)
- [ ] **A/B testing framework**: Compare SLM vs LLM in production traffic
- [ ] **Fallback strategy**: Route complex queries to GPT-4 when confidence is low
## Model Details
- **Developed by:** AI_ATL25 Team
- **Model type:** LoRA Adapter for Llama 3.1 8B
- **Language:** English
- **License:** Llama 3.1 Community License
- **Finetuned from:** unsloth/llama-3.1-8b-instruct-bnb-4bit
- **Adapter Size:** 335MB
- **Trainable Parameters:** 84M (1.04% of base model)
- **Proof-of-concept**: Yes — intended to demonstrate feasibility, not production-ready without further evaluation
## Citation
```bibtex
@misc{llama31-structured-api-adapter,
title={Fine-tuned Llama 3.1 8B for Structured API Generation},
author={AI_ATL25 Team},
year={2025},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/kineticdrive/llama-structured-api-adapter}}
}
```
## Contact
- GitHub: [AI_ATL25](https://github.com/kineticdrive/AI_ATL25)
- HuggingFace: [@kineticdrive](https://huggingface.co/kineticdrive)
### Framework versions
- PEFT 0.17.1