# QVAC Cross-Platform LoRA Adapters

Fine-tuned LoRA adapters trained with qvac-finetune, the first truly cross-platform LoRA fine-tuning framework for Large Language Models. These adapters run on any GPU (Adreno, Mali, Apple Silicon, AMD, Intel, NVIDIA) via the Vulkan, Metal, or CUDA backends.
## Important Disclaimer

These adapters are domain-specific and intended for biomedical Q&A tasks only.

The LoRA adapters were fine-tuned on PubMedQA biomedical data using a structured `Q: ... A:` prompt format. They are not general-purpose conversational models.

**What to expect with off-topic prompts:** If you provide casual or unrelated input, the model will not crash, but it will produce nonsensical or hallucinated biomedical-sounding text. This is expected behavior: the adapter has shifted the model's output distribution toward medical literature, so it will attempt to generate biomedical content regardless of the input.
For best results:

- Use the structured format: `Q: <question>\nA:`
- Keep prompts within the biomedical/clinical domain
- Use the recommended temperature settings (0.3-0.5 for factual answers)

This model is a research artifact and must NOT be used for actual medical diagnosis, treatment decisions, or clinical advice. The outputs may contain inaccuracies, hallucinations, or contradictory statements. Always consult qualified healthcare professionals for medical guidance.
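The structured prompt can be built programmatically so the newline between the question and `A:` is always a real newline. A minimal sketch (the `make_prompt` helper is hypothetical, not part of the qvac tooling):

```shell
# Hypothetical helper: wrap a raw question into the structured
# "Q: ...\nA:" format the adapter was trained on.
make_prompt() {
  # printf expands \n into a real newline, so llama-cli sees two lines.
  printf 'Q: %s\nA:' "$1"
}

make_prompt "Does vitamin D supplementation prevent fractures?"
```

The helper's output can be passed straight to `llama-cli -p "$(make_prompt "...")"`.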
## Available Adapters
| Adapter | Size | Base Model |
|---|---|---|
| Qwen3-0.6B LoRA Adapter | 20.5 MB | Qwen3-0.6B-GGUF |
| Qwen3-1.7B LoRA Adapter | 35.1 MB | Qwen3-1.7B-GGUF |
| Qwen3-4B LoRA Adapter | 60+ MB | Qwen3-4B-GGUF |
| Gemma3-1B LoRA Adapter | 26.6 MB | google/gemma-3-1b-gguf |
| Gemma3-4B LoRA Adapter | 60.2 MB | google/gemma-3-4b-gguf |
## Empowering the Community with Open Resources

To accelerate development and innovation, Tether Data is publicly releasing:

- Multi-Platform Binaries: qvac-rnd-fabric-llm-finetune
- Source Code: qvac-fabric-llm.cpp
## Quick Start Guide

### Option 1: Direct Inference (Recommended)

Use the adapter directly without merging; this is faster and uses less memory.

#### Step 1: Download a Platform-Specific Binary

**Linux/Windows (AMD/Intel/NVIDIA)**
```shell
# Download the Vulkan binary (works on all GPUs)
wget https://github.com/tetherto/qvac-finetune/releases/download/v1.0/qvac-linux-vulkan-x64-v1.0.zip
unzip qvac-linux-vulkan-x64-v1.0.zip
cd qvac-linux-vulkan-x64-v1.0
```
**macOS (Apple Silicon)**
```shell
# Download the Metal binary
curl -L https://github.com/tetherto/qvac-finetune/releases/download/v1.0/qvac-macos-apple-silicon-v1.0.zip -o qvac-macos.zip
unzip qvac-macos.zip
cd qvac-macos-apple-silicon-v1.0
```
**Android (Termux)**
```shell
# Download the Adreno/Mali binary
wget https://github.com/tetherto/qvac-finetune/releases/download/v1.0/qvac-android-adreno-arm64-v1.0.zip
unzip qvac-android-adreno-arm64-v1.0.zip
cd qvac-android-adreno-arm64-v1.0
export LD_LIBRARY_PATH=.
```
#### Step 2: Download Base Model & Adapter

Choose your model and download both the base model and the adapter:
```shell
# Create directories
mkdir -p models adapters

# === CHOOSE ONE MODEL ===
# Option 1: Qwen3-1.7B (recommended for most use cases)
wget https://huggingface.co/Qwen/Qwen3-1.7B-GGUF/resolve/main/qwen3-1_7b-q8_0.gguf -O models/base.gguf
wget https://huggingface.co/qvac/finetune/resolve/main/qwen3-1.7b-qkvo-ffn-lora-adapter.gguf -O adapters/adapter.gguf
```
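Before running inference, it can help to sanity-check the downloads. GGUF files begin with the four ASCII bytes `GGUF`, so a quick magic-byte check catches truncated downloads or saved HTML error pages. The `is_gguf` helper below is an illustrative sketch, not part of the release:

```shell
# Sketch: verify a file looks like a GGUF model by its 4-byte magic.
# A truncated download or an HTML error page will fail this check.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Paths follow Step 2; adjust if you chose a different model.
is_gguf models/base.gguf      && echo "base model: OK" || echo "base model: bad or missing"
is_gguf adapters/adapter.gguf && echo "adapter: OK"    || echo "adapter: bad or missing"
```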
#### Step 3: Run Inference with the Adapter
```shell
# Interactive chat mode
# $'...' is bash ANSI-C quoting: \n becomes a real newline in the prompt.
./bin/llama-cli \
  -m models/base.gguf \
  --lora adapters/adapter.gguf \
  -ngl 999 \
  -c 2048 \
  --temp 0.7 \
  -p $'Q: Does vitamin D supplementation prevent fractures?\nA:'
```

```shell
# Single prompt mode
./bin/llama-cli \
  -m models/base.gguf \
  --lora adapters/adapter.gguf \
  -ngl 999 \
  -p "Explain the mechanism of action of beta-blockers in treating hypertension."
```
**Expected output:**

```
Q: Does vitamin D supplementation prevent fractures?
A: Yes. Rationale: Meta-analysis of randomized controlled trials shows that
vitamin D supplementation, particularly when combined with calcium, significantly
reduces the risk of hip fractures and other non-vertebral fractures in elderly
populations...
```
### Option 2: Merge Adapter into Base Model

Merge the adapter permanently into the base model for distribution, or if you don't need to switch adapters.
#### Steps 1-2: Same as Option 1

Follow Steps 1 and 2 from Option 1 to download the binaries and models.
#### Step 3: Export & Merge the Adapter
```shell
# Export the LoRA adapter into the base model format
./bin/llama-export-lora \
  -m models/base.gguf \
  --lora adapters/adapter.gguf \
  -o models/merged.gguf

# Verify the merged model
ls -lh models/merged.gguf
```
#### Step 4: Run Inference with the Merged Model
```shell
# Use the merged model directly (no --lora flag needed)
./bin/llama-cli \
  -m models/merged.gguf \
  -ngl 999 \
  -c 2048 \
  -p $'Q: What are the contraindications for aspirin therapy?\nA:'
```
### Custom Temperature & Sampling

Tune the generation parameters for your use case:
```shell
# --temp 0.3           lower = more focused (good for medical answers)
# --top-p 0.9          nucleus sampling
# --top-k 40           top-k sampling
# --repeat-penalty 1.1 discourage repetition
# -n 512               max tokens to generate
# (Inline comments cannot follow a trailing backslash, so they are listed here.)
./bin/llama-cli \
  -m models/base.gguf \
  --lora adapters/adapter.gguf \
  -ngl 999 \
  --temp 0.3 \
  --top-p 0.9 \
  --top-k 40 \
  --repeat-penalty 1.1 \
  -n 512 \
  -p "Your prompt"
```
Recommended settings for biomedical Q&A:

- Temperature 0.3-0.5: deterministic, factual answers
- Temperature 0.7-0.9: more creative explanations
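These recommendations can be captured in a small wrapper. `sampling_flags` below is a hypothetical convenience function; the values simply follow the guidance above and are not tuned constants:

```shell
# Sketch: map a named preset to llama-cli sampling flags.
# "factual" follows the 0.3-0.5 guidance; "creative" the 0.7-0.9 range.
sampling_flags() {
  case "$1" in
    factual)  echo "--temp 0.4 --top-p 0.9 --top-k 40" ;;
    creative) echo "--temp 0.8 --top-p 0.95 --top-k 60" ;;
    *)        echo "--temp 0.7" ;;  # fall back to a neutral default
  esac
}

# Usage (unquoted expansion is intentional so the flags split into words):
# ./bin/llama-cli -m models/base.gguf --lora adapters/adapter.gguf \
#   $(sampling_flags factual) -p $'Q: ...\nA:'
echo "factual preset: $(sampling_flags factual)"
```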
### Batch Processing

Process multiple prompts from a file:
```shell
# Create a prompts file
cat > prompts.txt << 'EOF'
Q: Does vitamin D supplementation prevent fractures?
Q: Is aspirin effective for primary prevention of cardiovascular disease?
Q: Do statins reduce mortality in patients with heart failure?
EOF

# Process all prompts. IFS= read -r preserves each line exactly, and
# </dev/null stops llama-cli from consuming the remaining prompts.
while IFS= read -r prompt; do
  echo "=== Processing: $prompt ==="
  ./bin/llama-cli \
    -m models/base.gguf \
    --lora adapters/adapter.gguf \
    -ngl 999 \
    --temp 0.4 \
    -p "$prompt"$'\nA:' \
    < /dev/null
  echo ""
done < prompts.txt
```
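To save each batch answer to its own file, the question text needs to become a safe filename. A small slug helper (hypothetical, using only POSIX `tr` and `sed`) can do this:

```shell
# Sketch: turn a prompt into a lowercase, hyphen-separated filename slug.
slugify() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-//; s/-$//'
}

# Example use inside the batch loop above:
#   ... -p "$prompt"$'\nA:' > "out/$(slugify "$prompt").txt"
slugify "Q: Does vitamin D supplementation prevent fractures?"
```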
## Command Line Reference

### Essential Flags
| Flag | Description | Example | Default |
|---|---|---|---|
| `-m` | Base model path (required) | `-m model.gguf` | - |
| `--lora` | LoRA adapter path | `--lora adapter.gguf` | none |
| `-ngl` | GPU layers (999 = all) | `-ngl 999` | 0 |
| `-c` | Context size | `-c 2048` | 512 |
| `-p` | Prompt text | `-p "Question"` | - |
| `--temp` | Temperature (0-2) | `--temp 0.7` | 0.8 |
| `-n` | Max tokens to generate | `-n 512` | -1 |
| `-b` | Batch size | `-b 512` | 512 |
| `-fa` | Flash attention | `-fa off` | on |
### Mobile-Specific Flags

For Android/iOS with limited memory:

```shell
# -ngl 99: partial GPU offload     -c 512:  smaller context
# -b 128:  smaller batch           -ub 128: smaller micro-batch
# -fa off: disable flash attention (helps on Vulkan)
./bin/llama-cli \
  -m model.gguf \
  --lora adapter.gguf \
  -ngl 99 \
  -c 512 \
  -b 128 \
  -fa off \
  -ub 128
```
## Cross-Platform Compatibility

### Supported Platforms

These adapters work identically across:
| Platform | Hardware | Backend | Status |
|---|---|---|---|
| Android | Qualcomm Adreno, ARM Mali | Vulkan | Supported |
| iOS | Apple A-series | Metal | Supported |
| macOS | Apple M1/M2/M3/M4 | Metal | Supported |
| Linux | AMD, Intel, NVIDIA | Vulkan | Supported |
| Windows | AMD, Intel, NVIDIA | Vulkan | Supported |
| CPU | Any x86_64, ARM64 | CPU | Fallback |
### No Conversion Needed

Unlike traditional frameworks:

- No conversion between different frameworks
- No platform-specific model formats
- No separate training for each device
- Train once, run everywhere!
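"Train once, run everywhere" also applies to picking a binary: the right release archive can be chosen from `uname`. This sketch hard-codes the v1.0 asset names shown in the Quick Start; the host-to-asset mapping is an assumption for illustration, not an official installer:

```shell
# Sketch: suggest a v1.0 release archive for the current host.
release_asset() {
  case "$(uname -s)-$(uname -m)" in
    Linux-x86_64)  echo "qvac-linux-vulkan-x64-v1.0.zip" ;;
    Darwin-arm64)  echo "qvac-macos-apple-silicon-v1.0.zip" ;;
    Linux-aarch64) echo "qvac-android-adreno-arm64-v1.0.zip" ;;  # Termux
    *)             echo "" ;;  # no prebuilt binary known for this host
  esac
}

asset="$(release_asset)"
if [ -n "$asset" ]; then
  echo "Suggested download: $asset"
else
  echo "No prebuilt binary for this host; use the CPU fallback or build from source."
fi
```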
## Additional Resources

### Documentation

- qvac-finetune Repository
- Complete Documentation
- Detailed Benchmarks
- Training Guide

### Platform-Specific Guides

- Android Setup Guide
- macOS Setup Guide
- iOS Setup Guide
- Linux Setup Guide

### Community

- Discussion Forum
- Issue Tracker
- Release Notes
## Troubleshooting

### Common Issues

**1. "DeviceLost" error on Android/Adreno:**

```shell
# Use a smaller batch size and disable flash attention
./bin/llama-cli -m model.gguf --lora adapter.gguf -ngl 99 -c 512 -b 128 -ub 128 -fa off
```

**2. Out-of-memory (OOM) errors:**

```shell
# Reduce the context size, offload fewer layers, or use a smaller model
./bin/llama-cli -m model.gguf --lora adapter.gguf -ngl 50 -c 512
```

**3. Slow inference on mobile:**

```shell
# Offload fewer layers to the GPU
./bin/llama-cli -m model.gguf --lora adapter.gguf -ngl 20
```

**4. Adapter not loading:**

```shell
# Verify the adapter file exists and matches the model architecture
ls -lh adapters/
./bin/llama-cli -m model.gguf --lora adapter.gguf --verbose
```
## Citation

If you use these adapters in your research, please cite:

```bibtex
@article{qvac-finetune,
  title={An Edge-First Generalized LLM LoRA Fine-Tuning Framework for Heterogeneous GPUs},
  author={Subash, Akshay, Patrik, Milan, Nurman},
  journal={arXiv preprint},
  year={2025}
}
```
## License
- LoRA Adapters: Apache 2.0 License
- Base Models:
- Qwen3: Apache 2.0 license
- Gemma3: Gemma Terms of Use
- Training Framework (qvac-fabric-llm): Apache 2.0 License
## Acknowledgments
- llama.cpp - Foundation inference engine by Georgi Gerganov
- LoRA - Parameter-efficient fine-tuning method (Hu et al., 2021)
- PubMedQA - Biomedical dataset source (Jin et al., 2019)
- Qwen Team - Base models
- Google - Gemma base models
- Hardware vendors who provided testing devices
*Making LLM fine-tuning accessible to everyone, everywhere*

From smartphones to datacenters • No vendor lock-in • Privacy-preserving

Star the qvac-rnd-fabric-llm-finetune repo if you find it useful!