Text-to-MongoDB QLoRA

LoRA adapter that translates natural language questions into MongoDB find and aggregate queries.

Given a collection schema, allowed operators, and a plain-English intent, the model produces a valid MongoDB query as JSON, running locally on a single GPU with ~1 second inference latency.

Try it on Google Colab | Dataset | GitHub

Results

Model                         Split           Syntax  Operators  Fields  Overall
Qwen 7B baseline (zero-shot)  eval (171)      51.0%   51.0%      40.0%   40.0%
Qwen 7B baseline (zero-shot)  held-out (423)  53.8%   53.5%      39.9%   39.9%
Qwen 7B + LoRA r=8            eval (171)      98.8%   98.8%      98.8%   98.8%
Qwen 7B + LoRA r=8            held-out (423)  98.6%   98.6%      98.6%   98.6%

The held-out set uses 3 collection schemas the model never saw during training, testing generalization to unseen domains.

4-Layer Evaluation

Each generated query is validated through four layers applied sequentially:

  1. Syntax β€” Valid JSON, correct structure (find needs filter, aggregate needs pipeline)
  2. Operators β€” Every $-operator used is in the allowed list, no unsafe operators ($where, $merge, $out)
  3. Fields β€” Every field reference exists in the schema (catches hallucinated field names)
  4. Generalization β€” Compares train vs held-out pass rates to detect overfitting
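
The first three layers can be sketched as a small recursive validator. This is an illustrative sketch, not the repo's actual checker; `UNSAFE_OPS`, `collect_operators`, `collect_fields`, and `validate` are names assumed here:

```python
import json

UNSAFE_OPS = {"$where", "$merge", "$out"}

def collect_operators(node):
    """Recursively gather every $-prefixed key in a query document."""
    ops = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key.startswith("$"):
                ops.add(key)
            ops.update(collect_operators(value))
    elif isinstance(node, list):
        for item in node:
            ops.update(collect_operators(item))
    return ops

def collect_fields(node):
    """Recursively gather every non-operator key (candidate field names)."""
    fields = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if not key.startswith("$"):
                fields.add(key.split(".")[0])
            fields.update(collect_fields(value))
    elif isinstance(node, list):
        for item in node:
            fields.update(collect_fields(item))
    return fields

def validate(generated: str, allowed: set, schema_fields: set) -> str:
    """Return the name of the first failing layer, or "pass"."""
    # Layer 1: syntax — valid JSON with the expected top-level structure
    try:
        query = json.loads(generated)
    except json.JSONDecodeError:
        return "syntax"
    body_key = {"find": "filter", "aggregate": "pipeline"}.get(query.get("type"))
    if body_key is None or body_key not in query:
        return "syntax"
    body = query[body_key]
    # Layer 2: operators — every $-operator is allowed, none is unsafe
    ops = collect_operators(body)
    if ops & UNSAFE_OPS or not ops <= allowed:
        return "operators"
    # Layer 3: fields — every referenced field exists in the schema
    if not collect_fields(body) <= schema_fields:
        return "fields"
    return "pass"
```

Layer 4 is not per-query: it compares aggregate pass rates between the training-schema and held-out-schema splits.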

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_model = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter = "jmorenas/text-to-mongo-qlora"

# 4-bit NF4 + double quantization, matching the training setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter)

schema = """Collection: orders (ecommerce)
Fields:
- order_id: string [identifier] — Order ID
- total: double [measure] — Order total
- status: string [enum] — Order status (pending, shipped, delivered)
- created_at: date [timestamp] — Creation date"""

allowed_ops = """Stage: $match, $group, $sort, $limit, $project
Expression: $sum, $avg, $gt, $gte, $lt, $lte, $in, $eq"""

intent = "Find all pending orders over $100"

messages = [
    {"role": "system", "content": f"You are a MongoDB query generator.\n\nSchema:\n{schema}\n\nAllowed operators:\n{allowed_ops}"},
    {"role": "user", "content": intent},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Output:

{"type": "find", "filter": {"status": "pending", "total": {"$gt": 100}}}
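
The generated JSON can be dispatched to a pymongo-style collection directly. A minimal sketch; the `run_query` helper is illustrative, assuming the output shape above plus a `pipeline` key for aggregations (as required by the syntax layer):

```python
import json

def run_query(collection, generated: str):
    """Parse the model's JSON output and execute it against a collection
    exposing pymongo-style find()/aggregate() methods."""
    query = json.loads(generated)
    if query["type"] == "find":
        return collection.find(query["filter"])
    if query["type"] == "aggregate":
        return collection.aggregate(query["pipeline"])
    raise ValueError(f"unsupported query type: {query['type']}")
```

With a real pymongo `Collection` this returns a cursor; validating the query first (see the 4-layer evaluation above) is advisable before executing it.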

How It Works

User: "Find orders over $100 shipped last month"
         │
         ▼
┌─────────────────────────────────┐
│  Schema: orders collection      │
│  Fields: order_id, total,       │
│          status, created_at     │
│  Allowed ops: $match, $gte...   │
└─────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│  Qwen 7B + LoRA adapter         │
│  (4-bit quantized, ~5GB VRAM)   │
└─────────────────────────────────┘
         │
         ▼
{
  "type": "find",
  "filter": {
    "total": {"$gt": 100},
    "status": "shipped",
    "created_at": {"$gte": {"$date": "2025-01-01T00:00:00Z"}}
  }
}

The model reads the schema (field names, types, roles, enum values) and the allowed operators, then composes a query that respects both. It generalizes to schemas it has never seen.

Training Details

Parameter             Value
Base model            Qwen/Qwen2.5-Coder-7B-Instruct
Quantization          4-bit NF4 + double quantization
LoRA rank             8 (alpha=16)
LoRA targets          q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj
LoRA dropout          0.05
Epochs                3
Effective batch size  16 (batch=4, grad_accum=4)
Learning rate         2e-4 (cosine schedule, 5% warmup)
Max sequence length   512
Optimizer             paged_adamw_8bit
Framework             trl SFTTrainer (prompt-completion format)
Loss                  Completion-only (prompt tokens masked)
Training time         ~11 minutes on RTX 5090
VRAM usage            ~5 GB
Train loss            0.013
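
The table maps onto a peft/trl configuration roughly like the sketch below. This is not the actual training script; argument names follow recent peft and trl releases (`max_length` was called `max_seq_length` in older trl versions), and `output_dir` is a placeholder:

```python
from peft import LoraConfig
from trl import SFTConfig

lora_config = LoraConfig(
    r=8,                      # LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    max_length=512,
    optim="paged_adamw_8bit",
    output_dir="outputs",
)
# With a prompt-completion dataset, recent trl masks prompt tokens
# automatically, giving the completion-only loss listed above.
```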

Training Data

jmorenas/text-to-mongo-dataset-qlora: 1,544 train / 171 eval / 423 held-out examples across 19 collection schemas (16 train + 3 held-out).

The dataset is fully synthetic, generated deterministically from hand-crafted schemas and intent patterns. 12 generator patterns produce query types including filters, aggregations, projections, time ranges, top-N, counts, exists checks, enum filters, and date bucketing. Augmentation strategies (field name shuffling, date variation, operator subsetting, negatives) multiply the base examples ~7x.
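
As an illustration of one augmentation strategy, operator subsetting could look like this sketch. The function name and exact behavior are assumptions, not the dataset's actual generator; the idea is to vary the allowed-operator list while always keeping the operators the target query actually uses:

```python
import itertools

def operator_subsets(allowed_ops, required_ops, max_variants=3):
    """Yield variants of the allowed-operator list that still contain every
    operator the target query uses, teaching the model to respect a
    narrowed operator budget."""
    optional = sorted(set(allowed_ops) - set(required_ops))
    variants = []
    # Enumerate from the largest subsets down, deterministically
    for k in range(len(optional), -1, -1):
        for combo in itertools.combinations(optional, k):
            variants.append(sorted(set(required_ops) | set(combo)))
            if len(variants) == max_variants:
                return variants
    return variants

# Example: a query using only $match and $gt, drawn from a larger pool
print(operator_subsets(["$match", "$gt", "$group", "$sum"], ["$match", "$gt"], 2))
```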

Limitations

  • Designed for find and aggregate queries only; it does not generate update, delete, or insertOne
  • Field descriptions in the schema should be 2-5 words; longer descriptions degrade output quality
  • The adapter cannot be merged into the 4-bit base weights (merge_and_unload() produces garbage); load it as a PeftModel wrapper instead

License

MIT
