---
license: apache-2.0
language:
  - ca
  - es
  - en
base_model: BSC-LT/salamandra-7b-instruct-tools
library_name: transformers
pipeline_tag: text-generation
tags:
  - query-parsing
  - semantic-search
  - structured-output
  - json-generation
  - multilingual
  - catalan
  - spanish
  - LoRA
  - fine-tuned
  - AINA
  - R&D
datasets:
  - SIRIS-Lab/impuls-query-parsing
metrics:
  - accuracy
model-index:
  - name: IMPULS-Salamandra-7B-Query-Parser
    results:
      - task:
          type: text-generation
          name: Query Parsing
        metrics:
          - name: JSON Validity
            type: accuracy
            value: 1.0
          - name: Strict Accuracy
            type: accuracy
            value: 0.51
          - name: Relaxed Accuracy
            type: accuracy
            value: 0.65
          - name: Language Match
            type: accuracy
            value: 0.87
---

# IMPULS-Salamandra-7B-Query-Parser

A fine-tuned version of [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools) for converting natural language queries into structured JSON for R&D project semantic search.

## Model Description

This model was developed as part of the **IMPULS project** (AINA Challenge 2024), a collaboration between [SIRIS Academic](https://sirisacademic.com/) and the [Generalitat de Catalunya](https://web.gencat.cat/) to build a multilingual semantic search system for Catalonia's R&D ecosystem (RIS3-MCAT platform).

The model converts natural language queries in **Catalan, Spanish, and English** into structured JSON containing:
- **Semantic query**: Core thematic content for vector search
- **Filters**: Structured metadata (funding programme, year range, location, organization type)
- **Query rewrite**: Human-readable interpretation of the query
- **Metadata**: Language detection and processing notes

### Example

**Input (Catalan):**
```
projectes d'IA en salut finançats per H2020 des de 2020
```

**Output:**
```json
{
  "doc_type": "projects",
  "filters": {
    "programme": "Horizon 2020",
    "year": ">=2020"
  },
  "organisations": [],
  "semantic_query": "intel·ligència artificial salut",
  "query_rewrite": "Projectes sobre IA en salut del programa H2020 des de 2020",
  "meta": {
    "lang": "CA"
  }
}
```
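
A downstream search service has to turn string filters such as `"year": ">=2020"` into something it can query with. The sketch below is one hypothetical way to do that in Python. Only the `>=` form appears in this card, so the other operators and the `parse_year_filter` helper are assumptions for illustration, not part of the model or the IMPULS codebase.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper: only the ">=2020" form is shown in this card.
# The other operators are assumed for illustration.
_YEAR_RE = re.compile(r"^\s*(>=|<=|>|<|=)?\s*(\d{4})\s*$")


def parse_year_filter(value: Optional[str]) -> Tuple[Optional[int], Optional[int]]:
    """Return an inclusive (min_year, max_year) range; None means unbounded."""
    if value is None:
        return (None, None)
    match = _YEAR_RE.match(value)
    if match is None:
        raise ValueError(f"Unrecognised year filter: {value!r}")
    op, year = match.group(1) or "=", int(match.group(2))
    if op == ">=":
        return (year, None)
    if op == ">":
        return (year + 1, None)
    if op == "<=":
        return (None, year)
    if op == "<":
        return (None, year - 1)
    return (year, year)


print(parse_year_filter(">=2020"))  # (2020, None)
```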

## Training Details

### Base Model
- **Model**: [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools)
- **Architecture**: LlamaForCausalLM (7B parameters)

### Fine-tuning Method
- **Technique**: LoRA (Low-Rank Adaptation)
- **Trainable parameters**: ~1% of total (~50MB adapter)

### LoRA Configuration
| Parameter | Value |
|-----------|-------|
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
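
To make the rank and alpha values concrete, the following sketch shows the standard LoRA update on a single linear layer in plain Python: the frozen weight `W` is left untouched and a low-rank product of `B` and `A`, scaled by `alpha / r`, is added to its output. The toy matrices and the 4096 x 4096 layer size are illustrative assumptions, not values taken from the Salamandra architecture or the training code.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(x: List[float], W: Matrix, A: Matrix, B: Matrix,
                 alpha: float, r: int) -> List[float]:
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    scale = alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Extra trainable parameters LoRA adds to one d_in -> d_out layer."""
    return r * (d_in + d_out)


# Values from the table above.
R, ALPHA = 16, 32
print(ALPHA / R)  # scaling factor applied to the low-rank update: 2.0

# Hypothetical 4096 x 4096 projection, for illustration only.
print(lora_param_count(4096, 4096, R))  # 131072 (vs. 16777216 frozen weights)

# Toy example with rank 1 on a 2 x 2 layer.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]    # shape (r, d_in)
B = [[0.5], [0.0]]  # shape (d_out, r)
print(lora_forward([1.0, 2.0], W, A, B, alpha=2.0, r=1))  # [4.0, 2.0]
```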

### Training Hyperparameters
| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch size | 16 (effective) |
| Learning rate | 2e-4 |
| Sequence length | 2048 |
| Precision | FP16 (mixed) |
| Optimizer | AdamW |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |

### Training Data

- **Dataset**: [SIRIS-Lab/impuls-query-parsing](https://huggingface.co/datasets/SIRIS-Lab/impuls-query-parsing)
- **Training split**: 682 multilingual queries (synthetic, template-generated)
- **Language distribution**: ~33% Catalan, ~33% Spanish, ~33% English
- **Query types**: Discover (88%), Quantify (12%)
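
As a rough sanity check on the schedule, the hyperparameters above combined with the 682-query training split imply roughly 130 optimizer steps in total. The sketch below works through that arithmetic and a linear-warmup-plus-cosine learning-rate curve. It assumes no sequence packing, a final partial batch that is kept, and warmup steps rounded up, none of which this card states.

```python
import math

TRAIN_EXAMPLES = 682   # training split size
EFFECTIVE_BATCH = 16   # per-device batch size x gradient accumulation
EPOCHS = 3
PEAK_LR = 2e-4
WARMUP_RATIO = 0.1

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)  # 43 129 13
print(lr_at(warmup_steps))  # 0.0002 (peak learning rate)
```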

### Evaluation Data

- **Test split**: 100 real queries from domain experts (SIRIS Academic)
- **Annotation**: Manual gold-standard JSON for each query

## Evaluation Results

### Overall Performance

| Metric | Base Model | Fine-tuned |
|--------|------------|------------|
| JSON Validity | 100% | **100%** |
| Strict Accuracy | 15% | **51%** |
| Relaxed Accuracy | 29% | **65%** |
| Language Match | 53% | **87%** |
| Semantic Query Accuracy | 44% | **86%** |
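
This card does not spell out how each metric is computed. As a hedged illustration only, the sketch below shows one plausible reading of the two simplest ones: JSON validity as the share of outputs that parse as a JSON object, and language match as the share whose `meta.lang` equals the gold label. The function names and toy data are hypothetical, and the strict and relaxed accuracy definitions are deliberately left out because they are not given here.

```python
import json
from typing import List, Optional


def try_parse(text: str) -> Optional[dict]:
    """Return the parsed JSON object, or None if the text is not a JSON object."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def json_validity(outputs: List[str]) -> float:
    """Fraction of raw model outputs that parse as a JSON object."""
    return sum(try_parse(o) is not None for o in outputs) / len(outputs)


def language_match(outputs: List[str], gold_langs: List[str]) -> float:
    """Fraction of outputs whose meta.lang equals the gold language code."""
    hits = 0
    for raw, gold in zip(outputs, gold_langs):
        parsed = try_parse(raw) or {}
        meta = parsed.get("meta")
        if isinstance(meta, dict) and meta.get("lang") == gold:
            hits += 1
    return hits / len(gold_langs)


# Toy data, not taken from the evaluation set.
outputs = ['{"meta": {"lang": "CA"}}', '{"meta": {"lang": "ES"}}', "not json"]
print(json_validity(outputs))                       # 2 of 3 outputs parse
print(language_match(outputs, ["CA", "EN", "EN"]))  # 1 of 3 labels match
```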

### Component-level Accuracy

| Component | Accuracy |
|-----------|----------|
| Programme (H2020, FEDER, etc.) | 96% |
| Year extraction | 98% |
| Location | 91% |
| Organizations | 77% |
| Semantic Query | 86% |

### Performance by Language

| Language | Relaxed Accuracy |
|----------|------------------|
| English | 72% |
| Catalan | 64% |
| Spanish | 52% |

### Comparison with Other Models

| Model | Strict Accuracy | Relaxed Accuracy | JSON Valid |
|-------|-----------------|------------------|------------|
| **Salamandra-7B (ours)** | **51%** | **65%** | 100% |
| Qwen 2.5-7B | 47% | 65% | 100% |
| Mistral-7B | 24% | 55% | 100% |

## Usage

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "SIRIS-Lab/impuls-salamandra-7b-query-parser"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# System prompt (simplified version)
system_prompt = """Convert natural language queries into structured JSON for R&D project search.
Output only valid JSON with the required schema."""

query = "projectes d'hidrogen finançats per H2020 des de 2020"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query}
]

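# Render the chat messages into a single prompt string, then tokenize it
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Low-temperature sampling keeps the JSON output close to deterministic
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.1,
        do_sample=True
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)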
```
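
Because the model returns text, callers typically need to turn `response` into a Python dict before using it. The following is a minimal sketch, assuming the output may occasionally carry stray text around the JSON object; `extract_json` is a hypothetical helper, not part of this repository.

```python
import json
from typing import Optional


def extract_json(text: str) -> Optional[dict]:
    """Return the first JSON object found in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


# Stand-in for the `response` string produced by the generation code above.
response = 'Here is the result: {"doc_type": "projects", "meta": {"lang": "CA"}}'
parsed = extract_json(response)
print(parsed["meta"]["lang"])  # CA
```

Returning `None` instead of raising lets the caller decide whether to retry generation or fall back to a plain semantic search.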

### With 4-bit Quantization (Recommended for limited VRAM)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "SIRIS-Lab/impuls-salamandra-7b-query-parser",
    quantization_config=quantization_config,
    device_map="auto"
)
# Reduces memory from ~14GB to ~3.5GB
```

## Output Schema

```json
{
  "doc_type": "projects",
  "filters": {
    "programme": "string | null",
    "funding_level": "string | null", 
    "year": "string | null",
    "location": "string | null",
    "location_level": "region | province | country | null"
  },
  "organisations": [
    {
      "type": "university | research_center | hospital | company | null",
      "name": "string | null",
      "location": "string | null",
      "location_level": "string | null"
    }
  ],
  "semantic_query": "string | null",
  "query_rewrite": "string",
  "meta": {
    "lang": "CA | ES | EN",
    "notes": "string | null"
  }
}
```
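
Before passing parsed output to a search backend, it can help to check it against the schema above. The following is a minimal, standard-library sketch; the `validate_parsed_query` helper and the choice of which keys to treat as required are assumptions for illustration, since this card does not state which fields are mandatory.

```python
from typing import List

# Assumption: these top-level keys are treated as required for this sketch.
REQUIRED_KEYS = {
    "doc_type": str,
    "filters": dict,
    "organisations": list,
    "query_rewrite": str,
    "meta": dict,
}
ALLOWED_LANGS = {"CA", "ES", "EN"}


def validate_parsed_query(parsed: dict) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in parsed:
            problems.append(f"missing key: {key}")
        elif not isinstance(parsed[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    meta = parsed.get("meta")
    if isinstance(meta, dict) and meta.get("lang") not in ALLOWED_LANGS:
        problems.append(f"unexpected meta.lang: {meta.get('lang')!r}")
    semantic_query = parsed.get("semantic_query")
    if semantic_query is not None and not isinstance(semantic_query, str):
        problems.append("semantic_query should be a string or null")
    return problems


example = {
    "doc_type": "projects",
    "filters": {"programme": "Horizon 2020", "year": ">=2020"},
    "organisations": [],
    "semantic_query": "intel·ligència artificial salut",
    "query_rewrite": "Projectes sobre IA en salut del programa H2020 des de 2020",
    "meta": {"lang": "CA"},
}
print(validate_parsed_query(example))  # []
```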

## Hardware Requirements

| Configuration | VRAM Required |
|---------------|---------------|
| FP16 (half precision, unquantized) | ~14 GB |
| 4-bit quantization | ~3.5 GB |

**Recommended**: GPU with 24GB+ VRAM (e.g., A100), or 4-bit quantization on consumer GPUs.
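
The figures in the table follow from simple arithmetic on the weight count. The sketch below is a back-of-the-envelope estimate, assuming a round 7 billion parameters and counting weights only; activations, the KV cache, and quantization overhead add to these numbers in practice.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: nominal "7B" parameter count

print(weight_memory_gb(N_PARAMS, 16))  # 14.0
print(weight_memory_gb(N_PARAMS, 4))   # 3.5
```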

## Limitations

- **Domain-specific**: Optimized for R&D project search queries; may not generalize well to other domains
- **Schema-bound**: Outputs follow a fixed JSON schema; cannot handle arbitrary structured formats
- **Language coverage**: Relaxed accuracy is highest on English (72%) and Catalan (64%); Spanish is lower (52%)
- **Complex queries**: Struggles with queries requiring numerical aggregation or ranking operations

## Intended Use

This model is designed for:
- R&D project discovery platforms (RIS3CAT, Horizon Europe portals)
- Scientific literature search systems
- Multilingual semantic search applications
- Query understanding in Catalan, Spanish, and English

## Ethical Considerations

- The model was trained on synthetic, template-generated queries and evaluated on real queries written by domain experts
- No personal or sensitive data was used in training
- The model is intended for search query parsing and does not generate harmful content

## Citation

If you use this model, please cite:

```bibtex
@misc{impuls-salamandra-2024,
  author = {SIRIS Academic},
  title = {IMPULS-Salamandra-7B-Query-Parser: Multilingual Query Parsing for R&D Semantic Search},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SIRIS-Lab/impuls-salamandra-7b-query-parser}}
}
```

## Acknowledgments

- **[Barcelona Supercomputing Center (BSC)](https://www.bsc.es/)** - For the Salamandra base model and AINA infrastructure
- **[Generalitat de Catalunya](https://web.gencat.cat/)** - For funding and the RIS3-MCAT platform
- **[AINA Project](https://projecteaina.cat/)** - For the AINA Challenge 2024 framework

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the base Salamandra model.

## Links

- **Training Dataset**: [SIRIS-Lab/impuls-query-parsing](https://huggingface.co/datasets/SIRIS-Lab/impuls-query-parsing)
- **Project Repository**: [github.com/sirisacademic/aina-impulse](https://github.com/sirisacademic/aina-impulse)
- **Base Model**: [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools)
- **AINA Project**: [projecteaina.cat](https://projecteaina.cat/)
- **SIRIS Academic**: [sirisacademic.com](https://sirisacademic.com/)