---
license: apache-2.0
language:
  - ca
  - es
  - en
base_model: BSC-LT/salamandra-7b-instruct-tools
library_name: transformers
pipeline_tag: text-generation
tags:
  - query-parsing
  - semantic-search
  - structured-output
  - json-generation
  - multilingual
  - catalan
  - spanish
  - LoRA
  - fine-tuned
  - AINA
  - R&D
datasets:
  - SIRIS-Lab/impuls-query-parsing
metrics:
  - accuracy
model-index:
  - name: IMPULS-Salamandra-7B-Query-Parser
    results:
      - task:
          type: text-generation
          name: Query Parsing
        metrics:
          - name: JSON Validity
            type: accuracy
            value: 1.0
          - name: Strict Accuracy
            type: accuracy
            value: 0.51
          - name: Relaxed Accuracy
            type: accuracy
            value: 0.65
          - name: Language Match
            type: accuracy
            value: 0.87
---

# IMPULS-Salamandra-7B-Query-Parser

A fine-tuned version of [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools) that converts natural language queries into structured JSON for semantic search over R&D projects.

## Model Description

This model was developed as part of the IMPULS project (AINA Challenge 2024), a collaboration between SIRIS Academic and the Generalitat de Catalunya to build a multilingual semantic search system for Catalonia's R&D ecosystem (the RIS3-MCAT platform).

The model converts natural language queries in Catalan, Spanish, and English into structured JSON containing:

- **Semantic query**: the core thematic content, used for vector search
- **Filters**: structured metadata (funding programme, year range, location, organization type)
- **Query rewrite**: a human-readable interpretation of the query
- **Metadata**: detected language and processing notes

## Example

**Input (Catalan):**

> projectes d'IA en salut finançats per H2020 des de 2020

("AI projects in health funded by H2020 since 2020")

**Output:**

```json
{
  "doc_type": "projects",
  "filters": {
    "programme": "Horizon 2020",
    "year": ">=2020"
  },
  "organisations": [],
  "semantic_query": "intel·ligència artificial salut",
  "query_rewrite": "Projectes sobre IA en salut del programa H2020 des de 2020",
  "meta": {
    "lang": "CA"
  }
}
```

## Training Details

### Base Model

[BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools)

### Fine-tuning Method

- **Technique**: LoRA (Low-Rank Adaptation)
- **Trainable parameters**: ~1% of the total (≈50 MB adapter)

### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
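In `peft` terms, the configuration above corresponds to a `LoraConfig` along these lines (a sketch based on the table; the exact training script is not published with this card):

```python
from peft import LoraConfig

# LoRA adapter settings matching the table above (illustrative sketch)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```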

### Training Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 16 (effective) |
| Learning rate | 2e-4 |
| Sequence length | 2048 |
| Precision | FP16 (mixed) |
| Optimizer | AdamW |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
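The cosine schedule with 10% warmup can be sketched as a plain function (a hypothetical helper mirroring the behaviour of the cosine-with-warmup scheduler in `transformers`, not the actual training code):

```python
import math

def lr_at_step(step: int, total_steps: int,
               base_lr: float = 2e-4, warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr over the first 10% of steps
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```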

### Training Data

- **Dataset**: [SIRIS-Lab/impuls-query-parsing](https://huggingface.co/datasets/SIRIS-Lab/impuls-query-parsing)
- **Training split**: 682 multilingual queries (synthetic, template-generated)
- **Language distribution**: ~33% Catalan, ~33% Spanish, ~33% English
- **Query types**: Discover (88%), Quantify (12%)

### Evaluation Data

- **Test split**: 100 real queries from domain experts at SIRIS Academic
- **Annotation**: manually annotated gold-standard JSON for each query

## Evaluation Results

### Overall Performance

| Metric | Base Model | Fine-tuned |
|---|---|---|
| JSON Validity | 100% | 100% |
| Strict Accuracy | 15% | 51% |
| Relaxed Accuracy | 29% | 65% |
| Language Match | 53% | 87% |
| Semantic Query Accuracy | 44% | 86% |

### Component-level Accuracy

| Component | Accuracy |
|---|---|
| Programme (H2020, FEDER, etc.) | 96% |
| Year extraction | 98% |
| Location | 91% |
| Organizations | 77% |
| Semantic query | 86% |

### Performance by Language

| Language | Relaxed Accuracy |
|---|---|
| English | 72% |
| Catalan | 64% |
| Spanish | 52% |

### Comparison with Other Models

| Model | Strict Accuracy | Relaxed Accuracy | JSON Valid |
|---|---|---|---|
| Salamandra-7B (ours) | 51% | 65% | 100% |
| Qwen 2.5-7B | 47% | 65% | 100% |
| Mistral-7B | 24% | 55% | 100% |

## Usage

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "SIRIS-Lab/impuls-salamandra-7b-query-parser"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# System prompt (simplified version)
system_prompt = """Convert natural language queries into structured JSON for R&D project search.
Output only valid JSON with the required schema."""

query = "projectes d'hidrogen finançats per H2020 des de 2020"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query}
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.1,
        do_sample=True
    )

# Decode only the newly generated tokens (skip the prompt)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
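The model is trained to emit JSON only, but a small defensive parser is still useful in case stray text surrounds the object (a hypothetical helper, not part of the released inference code):

```python
import json

def parse_model_output(response: str) -> dict:
    """Extract and parse the first {...} block from the model's response."""
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(response[start:end + 1])
```

`parse_model_output(response)` then yields a dict with `semantic_query`, `filters`, and the other fields of the output schema.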

### With 4-bit Quantization (recommended for limited VRAM)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "SIRIS-Lab/impuls-salamandra-7b-query-parser",
    quantization_config=quantization_config,
    device_map="auto"
)
# Reduces weight memory from ~14 GB (FP16) to ~3.5 GB
```

## Output Schema

```json
{
  "doc_type": "projects",
  "filters": {
    "programme": "string | null",
    "funding_level": "string | null",
    "year": "string | null",
    "location": "string | null",
    "location_level": "region | province | country | null"
  },
  "organisations": [
    {
      "type": "university | research_center | hospital | company | null",
      "name": "string | null",
      "location": "string | null",
      "location_level": "string | null"
    }
  ],
  "semantic_query": "string | null",
  "query_rewrite": "string",
  "meta": {
    "lang": "CA | ES | EN",
    "notes": "string | null"
  }
}
```
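A minimal validity check against this schema might look like the following (illustrative only; the key names and allowed languages are taken from the schema above):

```python
REQUIRED_KEYS = {
    "doc_type", "filters", "organisations",
    "semantic_query", "query_rewrite", "meta",
}
ALLOWED_LANGS = {"CA", "ES", "EN"}

def validate_parsed_query(obj: dict) -> list:
    """Return a list of schema violations (an empty list means valid)."""
    errors = []
    missing = REQUIRED_KEYS - set(obj)
    if missing:
        errors.append("missing keys: %s" % sorted(missing))
    if obj.get("meta", {}).get("lang") not in ALLOWED_LANGS:
        errors.append("meta.lang must be one of CA, ES, EN")
    if not isinstance(obj.get("organisations"), list):
        errors.append("organisations must be a list")
    return errors
```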

## Hardware Requirements

| Configuration | VRAM Required |
|---|---|
| FP16 (full precision) | ~14 GB |
| 4-bit quantization | ~3.5 GB |

Recommended: a GPU with 24 GB+ VRAM (e.g. A100) for FP16, or 4-bit quantization on consumer GPUs.
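The VRAM figures above follow from a weights-only back-of-envelope estimate for a 7B-parameter model (activations and KV cache add overhead on top):

```python
params = 7e9                  # ~7B parameters
fp16_gb = params * 2 / 1e9    # 2 bytes per weight in FP16 -> ~14 GB
nf4_gb = params * 0.5 / 1e9   # 4 bits per weight in NF4   -> ~3.5 GB
```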

## Limitations

- **Domain-specific**: optimized for R&D project search queries; may not generalize well to other domains
- **Schema-bound**: outputs follow a fixed JSON schema; the model cannot produce arbitrary structured formats
- **Language coverage**: best performance on English and Catalan; Spanish accuracy is lower
- **Complex queries**: struggles with queries requiring numerical aggregation or ranking operations

## Intended Use

This model is designed for:

- R&D project discovery platforms (e.g. RIS3CAT, Horizon Europe portals)
- Scientific literature search systems
- Multilingual semantic search applications
- Query understanding in Catalan, Spanish, and English

## Ethical Considerations

- The model was trained on synthetic, template-generated queries and evaluated on real queries from domain experts
- No personal or sensitive data was used in training
- The model is intended for search query parsing, not open-ended text generation

## Citation

If you use this model, please cite:

```bibtex
@misc{impuls-salamandra-2024,
  author = {SIRIS Academic},
  title = {IMPULS-Salamandra-7B-Query-Parser: Multilingual Query Parsing for R&D Semantic Search},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SIRIS-Lab/impuls-salamandra-7b-query-parser}}
}
```

## Acknowledgments

This model was developed within the IMPULS project as part of the AINA Challenge 2024, in collaboration with the Generalitat de Catalunya.

## License

This model is released under the Apache 2.0 license, consistent with the base Salamandra model.

## Links

- Base model: https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools
- Training dataset: https://huggingface.co/datasets/SIRIS-Lab/impuls-query-parsing