---
license: apache-2.0
language:
  - ca
  - es
  - en
base_model: BSC-LT/salamandra-7b-instruct-tools
library_name: transformers
pipeline_tag: text-generation
tags:
  - query-parsing
  - semantic-search
  - structured-output
  - json-generation
  - multilingual
  - catalan
  - spanish
  - LoRA
  - fine-tuned
  - AINA
  - R&D
datasets:
  - SIRIS-Lab/impuls-query-parsing
metrics:
  - accuracy
model-index:
  - name: IMPULS-Salamandra-7B-Query-Parser
    results:
      - task:
          type: text-generation
          name: Query Parsing
        metrics:
          - name: JSON Validity
            type: accuracy
            value: 1.0
          - name: Strict Accuracy
            type: accuracy
            value: 0.51
          - name: Relaxed Accuracy
            type: accuracy
            value: 0.65
          - name: Language Match
            type: accuracy
            value: 0.87
---

# IMPULS-Salamandra-7B-Query-Parser

A fine-tuned version of [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools) that converts natural language queries into structured JSON for semantic search over R&D projects.

## Model Description

This model was developed as part of the IMPULS project (AINA Challenge 2024), a collaboration between SIRIS Academic and the Generalitat de Catalunya to build a multilingual semantic search system for Catalonia's R&D ecosystem (the RIS3-MCAT platform).

The model converts natural language queries in Catalan, Spanish, and English into structured JSON containing:

- **Semantic query**: the core thematic content, used for vector search
- **Filters**: structured metadata (funding programme, year range, location, organization type)
- **Query rewrite**: a human-readable interpretation of the query
- **Metadata**: detected language and processing notes

## Example

**Input (Catalan):**

> projectes d'IA en salut finançats per H2020 des de 2020

("AI projects in health funded by H2020 since 2020")

**Output:**

```json
{
  "doc_type": "projects",
  "filters": {
    "programme": "Horizon 2020",
    "year": ">=2020"
  },
  "organisations": [],
  "semantic_query": "intel·ligència artificial salut",
  "query_rewrite": "Projectes sobre IA en salut del programa H2020 des de 2020",
  "meta": {
    "lang": "CA"
  }
}
```

## Training Details

### Base Model

[BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools)

### Fine-tuning Method

- **Technique**: LoRA (Low-Rank Adaptation)
- **Trainable parameters**: ~1% of the total (≈50 MB adapter)

### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
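In `peft` terms, the configuration above corresponds to a `LoraConfig` along these lines (a sketch based on the table; the exact training script is not published with this card):

```python
from peft import LoraConfig

# LoRA adapter settings matching the table above (illustrative sketch)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```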

### Training Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 16 (effective) |
| Learning rate | 2e-4 |
| Sequence length | 2048 |
| Precision | FP16 (mixed) |
| Optimizer | AdamW |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
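The cosine schedule with 10% warmup can be sketched as a plain function (a hypothetical helper mirroring the behaviour of the cosine-with-warmup scheduler in `transformers`, not the actual training code):

```python
import math

def lr_at_step(step: int, total_steps: int,
               base_lr: float = 2e-4, warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr over the first 10% of steps
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```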

### Training Data

- **Dataset**: [SIRIS-Lab/impuls-query-parsing](https://huggingface.co/datasets/SIRIS-Lab/impuls-query-parsing)
- **Training split**: 682 multilingual queries (synthetic, template-generated)
- **Language distribution**: ~33% Catalan, ~33% Spanish, ~33% English
- **Query types**: Discover (88%), Quantify (12%)

### Evaluation Data

- **Test split**: 100 real queries from domain experts at SIRIS Academic
- **Annotation**: manually annotated gold-standard JSON for each query

## Evaluation Results

### Overall Performance

| Metric | Base Model | Fine-tuned |
|---|---|---|
| JSON Validity | 100% | 100% |
| Strict Accuracy | 15% | 51% |
| Relaxed Accuracy | 29% | 65% |
| Language Match | 53% | 87% |
| Semantic Query Accuracy | 44% | 86% |

### Component-level Accuracy

| Component | Accuracy |
|---|---|
| Programme (H2020, FEDER, etc.) | 96% |
| Year extraction | 98% |
| Location | 91% |
| Organizations | 77% |
| Semantic query | 86% |

### Performance by Language

| Language | Relaxed Accuracy |
|---|---|
| English | 72% |
| Catalan | 64% |
| Spanish | 52% |

### Comparison with Other Models

| Model | Strict Accuracy | Relaxed Accuracy | JSON Valid |
|---|---|---|---|
| Salamandra-7B (ours) | 51% | 65% | 100% |
| Qwen 2.5-7B | 47% | 65% | 100% |
| Mistral-7B | 24% | 55% | 100% |

## Usage

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "SIRIS-Lab/impuls-salamandra-7b-query-parser"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# System prompt (simplified version)
system_prompt = """Convert natural language queries into structured JSON for R&D project search.
Output only valid JSON with the required schema."""

query = "projectes d'hidrogen finançats per H2020 des de 2020"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query}
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.1,
        do_sample=True
    )

# Decode only the newly generated tokens (skip the prompt)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
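The model is trained to emit JSON only, but a small defensive parser is still useful in case stray text surrounds the object (a hypothetical helper, not part of the released inference code):

```python
import json

def parse_model_output(response: str) -> dict:
    """Extract and parse the first {...} block from the model's response."""
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(response[start:end + 1])
```

`parse_model_output(response)` then yields a dict with `semantic_query`, `filters`, and the other fields of the output schema.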

### With 4-bit Quantization (recommended for limited VRAM)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "SIRIS-Lab/impuls-salamandra-7b-query-parser",
    quantization_config=quantization_config,
    device_map="auto"
)
# Reduces weight memory from ~14 GB (FP16) to ~3.5 GB
```

## Output Schema

```json
{
  "doc_type": "projects",
  "filters": {
    "programme": "string | null",
    "funding_level": "string | null",
    "year": "string | null",
    "location": "string | null",
    "location_level": "region | province | country | null"
  },
  "organisations": [
    {
      "type": "university | research_center | hospital | company | null",
      "name": "string | null",
      "location": "string | null",
      "location_level": "string | null"
    }
  ],
  "semantic_query": "string | null",
  "query_rewrite": "string",
  "meta": {
    "lang": "CA | ES | EN",
    "notes": "string | null"
  }
}
```
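A minimal validity check against this schema might look like the following (illustrative only; the key names and allowed languages are taken from the schema above):

```python
REQUIRED_KEYS = {
    "doc_type", "filters", "organisations",
    "semantic_query", "query_rewrite", "meta",
}
ALLOWED_LANGS = {"CA", "ES", "EN"}

def validate_parsed_query(obj: dict) -> list:
    """Return a list of schema violations (an empty list means valid)."""
    errors = []
    missing = REQUIRED_KEYS - set(obj)
    if missing:
        errors.append("missing keys: %s" % sorted(missing))
    if obj.get("meta", {}).get("lang") not in ALLOWED_LANGS:
        errors.append("meta.lang must be one of CA, ES, EN")
    if not isinstance(obj.get("organisations"), list):
        errors.append("organisations must be a list")
    return errors
```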

## Hardware Requirements

| Configuration | VRAM Required |
|---|---|
| FP16 (full precision) | ~14 GB |
| 4-bit quantization | ~3.5 GB |

Recommended: a GPU with 24 GB+ VRAM (e.g. A100) for FP16, or 4-bit quantization on consumer GPUs.
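The VRAM figures above follow from a weights-only back-of-envelope estimate for a 7B-parameter model (activations and KV cache add overhead on top):

```python
params = 7e9                  # ~7B parameters
fp16_gb = params * 2 / 1e9    # 2 bytes per weight in FP16 -> ~14 GB
nf4_gb = params * 0.5 / 1e9   # 4 bits per weight in NF4   -> ~3.5 GB
```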

## Limitations

- **Domain-specific**: optimized for R&D project search queries; may not generalize well to other domains
- **Schema-bound**: outputs follow a fixed JSON schema; the model cannot produce arbitrary structured formats
- **Language coverage**: best performance on English and Catalan; Spanish accuracy is lower
- **Complex queries**: struggles with queries requiring numerical aggregation or ranking operations

## Intended Use

This model is designed for:

- R&D project discovery platforms (e.g. RIS3CAT, Horizon Europe portals)
- Scientific literature search systems
- Multilingual semantic search applications
- Query understanding in Catalan, Spanish, and English

## Ethical Considerations

- The model was trained on synthetic, template-generated queries and evaluated on real queries from domain experts
- No personal or sensitive data was used in training
- The model is intended for search query parsing, not open-ended text generation

## Citation

If you use this model, please cite:

```bibtex
@misc{impuls-salamandra-2024,
  author = {SIRIS Academic},
  title = {IMPULS-Salamandra-7B-Query-Parser: Multilingual Query Parsing for R&D Semantic Search},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SIRIS-Lab/impuls-salamandra-7b-query-parser}}
}
```

## Acknowledgments

This model was developed within the IMPULS project as part of the AINA Challenge 2024, in collaboration with the Generalitat de Catalunya.

## License

This model is released under the Apache 2.0 license, consistent with the base Salamandra model.

## Links

- Base model: https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools
- Training dataset: https://huggingface.co/datasets/SIRIS-Lab/impuls-query-parsing