|
|
---
license: apache-2.0
language:
- ca
- es
- en
base_model: BSC-LT/salamandra-7b-instruct-tools
library_name: transformers
pipeline_tag: text-generation
tags:
- query-parsing
- semantic-search
- structured-output
- json-generation
- multilingual
- catalan
- spanish
- LoRA
- fine-tuned
- AINA
- R&D
datasets:
- SIRIS-Lab/impuls-query-parsing
metrics:
- accuracy
model-index:
- name: IMPULS-Salamandra-7B-Query-Parser
  results:
  - task:
      type: text-generation
      name: Query Parsing
    metrics:
    - name: JSON Validity
      type: accuracy
      value: 1.0
    - name: Strict Accuracy
      type: accuracy
      value: 0.51
    - name: Relaxed Accuracy
      type: accuracy
      value: 0.65
    - name: Language Match
      type: accuracy
      value: 0.87
---
|
|
|
|
|
# IMPULS-Salamandra-7B-Query-Parser |
|
|
|
|
|
A fine-tuned version of [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools) for converting natural language queries into structured JSON for R&D project semantic search. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model was developed as part of the **IMPULS project** (AINA Challenge 2024), a collaboration between [SIRIS Academic](https://sirisacademic.com/) and [Generalitat de Catalunya](https://web.gencat.cat/) to build a multilingual semantic search system for Catalonia's R&D ecosystem (RIS3-MCAT platform). |
|
|
|
|
|
The model converts natural language queries in **Catalan, Spanish, and English** into structured JSON containing: |
|
|
- **Semantic query**: Core thematic content for vector search
- **Filters**: Structured metadata (funding programme, year range, location, organization type)
- **Query rewrite**: Human-readable interpretation of the query
- **Metadata**: Language detection and processing notes
|
|
|
|
|
### Example |
|
|
|
|
|
**Input (Catalan):** |
|
|
```
projectes d'IA en salut finançats per H2020 des de 2020
```
|
|
|
|
|
**Output:** |
|
|
```json
{
  "doc_type": "projects",
  "filters": {
    "programme": "Horizon 2020",
    "year": ">=2020"
  },
  "organisations": [],
  "semantic_query": "intel·ligència artificial salut",
  "query_rewrite": "Projectes sobre IA en salut del programa H2020 des de 2020",
  "meta": {
    "lang": "CA"
  }
}
```
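Filter values such as `"year": ">=2020"` are plain strings that a downstream search backend has to interpret. As a minimal sketch (the helper name and the exact operator set are assumptions, not part of the released pipeline), such a string can be turned into a predicate:

```python
import re

def year_predicate(filter_str):
    """Turn a year filter string like '>=2020' or '2018' into a predicate.

    The string format follows the example output above; this helper itself
    is illustrative, not a function shipped with the model."""
    m = re.fullmatch(r"(>=|<=|>|<|==)?\s*(\d{4})", filter_str.strip())
    if m is None:
        raise ValueError(f"unrecognised year filter: {filter_str!r}")
    op, year = m.group(1) or "==", int(m.group(2))
    ops = {
        ">=": lambda y: y >= year,
        "<=": lambda y: y <= year,
        ">":  lambda y: y > year,
        "<":  lambda y: y < year,
        "==": lambda y: y == year,
    }
    return ops[op]

keep = year_predicate(">=2020")
print([y for y in (2018, 2020, 2023) if keep(y)])  # [2020, 2023]
```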
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Base Model |
|
|
- **Model**: [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools)
- **Architecture**: LlamaForCausalLM (7B parameters)
|
|
|
|
|
### Fine-tuning Method |
|
|
- **Technique**: LoRA (Low-Rank Adaptation)
- **Trainable parameters**: ~1% of total (~50 MB adapter)
|
|
|
|
|
### LoRA Configuration |
|
|
| Parameter | Value |
|-----------|-------|
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
|
|
|
|
|
### Training Hyperparameters |
|
|
| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch size | 16 (effective) |
| Learning rate | 2e-4 |
| Sequence length | 2048 |
| Precision | FP16 (mixed) |
| Optimizer | AdamW |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
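The cosine schedule with linear warmup from the table can be sketched as follows. The peak LR (2e-4) and warmup ratio (0.1) come from the table; the step bookkeeping is illustrative and trainer internals may differ slightly:

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr over the first warmup_ratio of training,
    then cosine decay to zero over the remaining steps."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(50, 1000))    # halfway through warmup: 1e-4
print(lr_at_step(100, 1000))   # peak: 2e-4
print(lr_at_step(1000, 1000))  # end of training: 0.0
```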
|
|
|
|
|
### Training Data |
|
|
|
|
|
- **Dataset**: [SIRIS-Lab/impuls-query-parsing](https://huggingface.co/datasets/SIRIS-Lab/impuls-query-parsing)
- **Training split**: 682 multilingual queries (synthetic, template-generated)
- **Language distribution**: ~33% Catalan, ~33% Spanish, ~33% English
- **Query types**: Discover (88%), Quantify (12%)
|
|
|
|
|
### Evaluation Data |
|
|
|
|
|
- **Test split**: 100 real queries from domain experts (SIRIS Academic)
- **Annotation**: Manual gold-standard JSON for each query
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
### Overall Performance |
|
|
|
|
|
| Metric | Base Model | Fine-tuned |
|--------|------------|------------|
| JSON Validity | 100% | **100%** |
| Strict Accuracy | 15% | **51%** |
| Relaxed Accuracy | 29% | **65%** |
| Language Match | 53% | **87%** |
| Semantic Query Accuracy | 44% | **86%** |
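The strict/relaxed distinction can be sketched as follows. Note that the 0.75 agreement threshold and the field-level equality test here are assumptions for illustration, not the project's exact metric definitions:

```python
def strict_match(pred: dict, gold: dict) -> bool:
    # Strict: the parsed JSON must equal the gold annotation exactly.
    return pred == gold

def relaxed_match(pred: dict, gold: dict, threshold: float = 0.75) -> bool:
    # Relaxed: enough top-level fields agree with the gold annotation.
    keys = set(gold) | set(pred)
    agree = sum(pred.get(k) == gold.get(k) for k in keys)
    return agree / len(keys) >= threshold

gold = {"semantic_query": "hidrogen", "filters": {"programme": "Horizon 2020"},
        "organisations": [], "meta": {"lang": "CA"}}
pred = dict(gold, semantic_query="hidrogen verd")  # one field differs

print(strict_match(pred, gold), relaxed_match(pred, gold))  # False True
```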
|
|
|
|
|
### Component-level Accuracy |
|
|
|
|
|
| Component | Accuracy |
|-----------|----------|
| Programme (H2020, FEDER, etc.) | 96% |
| Year extraction | 98% |
| Location | 91% |
| Organizations | 77% |
| Semantic Query | 86% |
|
|
|
|
|
### Performance by Language |
|
|
|
|
|
| Language | Relaxed Accuracy |
|----------|------------------|
| English | 72% |
| Catalan | 64% |
| Spanish | 52% |
|
|
|
|
|
### Comparison with Other Models |
|
|
|
|
|
| Model | Strict Accuracy | Relaxed Accuracy | JSON Valid |
|-------|-----------------|------------------|------------|
| **Salamandra-7B (ours)** | **51%** | **65%** | 100% |
| Qwen 2.5-7B | 47% | 65% | 100% |
| Mistral-7B | 24% | 55% | 100% |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "SIRIS-Lab/impuls-salamandra-7b-query-parser"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# System prompt (simplified version)
system_prompt = """Convert natural language queries into structured JSON for R&D project search.
Output only valid JSON with the required schema."""

query = "projectes d'hidrogen finançats per H2020 des de 2020"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query}
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.1,
        do_sample=True
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
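The model reliably emits valid JSON (100% validity on the test set), but when integrating it into a pipeline it is still prudent to extract the JSON object defensively before parsing. A minimal sketch (the helper name and approach are illustrative; note it does not handle braces inside string values):

```python
import json

def extract_json(text: str) -> dict:
    """Pull the first balanced JSON object out of a generated string."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i + 1])
    raise ValueError("unbalanced JSON in model output")

parsed = extract_json('noise {"doc_type": "projects", "meta": {"lang": "CA"}} trailing')
print(parsed["meta"]["lang"])  # CA
```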
|
|
|
|
|
### With 4-bit Quantization (Recommended for limited VRAM) |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "SIRIS-Lab/impuls-salamandra-7b-query-parser",
    quantization_config=quantization_config,
    device_map="auto"
)
# Reduces memory from ~14 GB to ~3.5 GB
```
|
|
|
|
|
## Output Schema |
|
|
|
|
|
```json
{
  "doc_type": "projects",
  "filters": {
    "programme": "string | null",
    "funding_level": "string | null",
    "year": "string | null",
    "location": "string | null",
    "location_level": "region | province | country | null"
  },
  "organisations": [
    {
      "type": "university | research_center | hospital | company | null",
      "name": "string | null",
      "location": "string | null",
      "location_level": "string | null"
    }
  ],
  "semantic_query": "string | null",
  "query_rewrite": "string",
  "meta": {
    "lang": "CA | ES | EN",
    "notes": "string | null"
  }
}
```
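Downstream consumers can run a lightweight structural check against this schema before using a parse. A minimal sketch (the function is illustrative, not a validator shipped with the model; a full solution would use a JSON Schema library):

```python
def validate_parse(parsed: dict) -> list:
    """Return a list of structural problems (empty list = OK)."""
    problems = []
    for key in ("doc_type", "filters", "organisations",
                "semantic_query", "query_rewrite", "meta"):
        if key not in parsed:
            problems.append(f"missing key: {key}")
    if parsed.get("meta", {}).get("lang") not in ("CA", "ES", "EN"):
        problems.append("meta.lang must be CA, ES or EN")
    if not isinstance(parsed.get("organisations"), list):
        problems.append("organisations must be a list")
    return problems

example = {
    "doc_type": "projects",
    "filters": {"programme": "Horizon 2020", "year": ">=2020"},
    "organisations": [],
    "semantic_query": "intel·ligència artificial salut",
    "query_rewrite": "Projectes sobre IA en salut del programa H2020 des de 2020",
    "meta": {"lang": "CA"},
}
print(validate_parse(example))  # []
```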
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
| Configuration | VRAM Required |
|---------------|---------------|
| FP16 (full precision) | ~14 GB |
| 4-bit quantization | ~3.5 GB |
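These figures follow directly from the parameter count: 7B parameters at 16 bits each is ~14 GB of weights, and at 4 bits ~3.5 GB (activations and KV cache add overhead on top):

```python
def weight_memory_gb(n_params=7e9, bits_per_param=16):
    """Back-of-envelope weight memory; excludes activations and KV cache."""
    return n_params * bits_per_param / 8 / 1e9

print(round(weight_memory_gb(bits_per_param=16), 1))  # 14.0
print(round(weight_memory_gb(bits_per_param=4), 1))   # 3.5
```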
|
|
|
|
|
**Recommended**: GPU with 24GB+ VRAM (A100) or 4-bit quantization on consumer GPUs. |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Domain-specific**: Optimized for R&D project search queries; may not generalize well to other domains
- **Schema-bound**: Outputs follow a fixed JSON schema; cannot handle arbitrary structured formats
- **Language coverage**: Strongest on English, followed by Catalan; Spanish accuracy is noticeably lower
- **Complex queries**: Struggles with queries requiring numerical aggregation or ranking operations
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is designed for: |
|
|
- R&D project discovery platforms (RIS3CAT, Horizon Europe portals)
- Scientific literature search systems
- Multilingual semantic search applications
- Query understanding in Catalan, Spanish, and English
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
- The model was trained on synthetic queries generated from templates and real queries from domain experts
- No personal or sensitive data was used in training
- The model is intended for search query parsing and does not generate harmful content
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{impuls-salamandra-2024,
  author = {SIRIS Academic},
  title = {IMPULS-Salamandra-7B-Query-Parser: Multilingual Query Parsing for R&D Semantic Search},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SIRIS-Lab/impuls-salamandra-7b-query-parser}}
}
```
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **[Barcelona Supercomputing Center (BSC)](https://www.bsc.es/)** - For the Salamandra base model and AINA infrastructure
- **[Generalitat de Catalunya](https://web.gencat.cat/)** - For funding and the RIS3-MCAT platform
- **[AINA Project](https://projecteaina.cat/)** - For the AINA Challenge 2024 framework
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the base Salamandra model. |
|
|
|
|
|
## Links |
|
|
|
|
|
- **Training Dataset**: [SIRIS-Lab/impuls-query-parsing](https://huggingface.co/datasets/SIRIS-Lab/impuls-query-parsing)
- **Project Repository**: [github.com/sirisacademic/aina-impulse](https://github.com/sirisacademic/aina-impulse)
- **Base Model**: [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools)
- **AINA Project**: [projecteaina.cat](https://projecteaina.cat/)
- **SIRIS Academic**: [sirisacademic.com](https://sirisacademic.com/)
|
|
|