|
|
---
license: apache-2.0
language:
- ca
- es
- en
base_model: BSC-LT/salamandra-7b-instruct-tools
library_name: transformers
pipeline_tag: text-generation
tags:
- query-parsing
- semantic-search
- structured-output
- json-generation
- multilingual
- catalan
- spanish
- LoRA
- fine-tuned
- AINA
- R&D
datasets:
- SIRIS-Lab/impuls-query-parsing
metrics:
- accuracy
model-index:
- name: IMPULS-Salamandra-7B-Query-Parser
  results:
  - task:
      type: text-generation
      name: Query Parsing
    metrics:
    - name: JSON Validity
      type: accuracy
      value: 1.0
    - name: Strict Accuracy
      type: accuracy
      value: 0.51
    - name: Relaxed Accuracy
      type: accuracy
      value: 0.65
    - name: Language Match
      type: accuracy
      value: 0.87
---
|
|
|
|
|
# IMPULS-Salamandra-7B-Query-Parser |
|
|
|
|
|
A fine-tuned version of [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools) for converting natural language queries into structured JSON for R&D project semantic search. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model was developed as part of the **IMPULS project** (AINA Challenge 2024), a collaboration between [SIRIS Academic](https://sirisacademic.com/) and [Generalitat de Catalunya](https://web.gencat.cat/) to build a multilingual semantic search system for Catalonia's R&D ecosystem (RIS3-MCAT platform). |
|
|
|
|
|
The model converts natural language queries in **Catalan, Spanish, and English** into structured JSON containing: |
|
|
- **Semantic query**: Core thematic content for vector search
- **Filters**: Structured metadata (funding programme, year range, location, organization type)
- **Query rewrite**: Human-readable interpretation of the query
- **Metadata**: Language detection and processing notes
|
|
|
|
|
### Example |
|
|
|
|
|
**Input (Catalan):** |
|
|
```
projectes d'IA en salut finançats per H2020 des de 2020
```
|
|
|
|
|
**Output:** |
|
|
```json
{
  "doc_type": "projects",
  "filters": {
    "programme": "Horizon 2020",
    "year": ">=2020"
  },
  "organisations": [],
  "semantic_query": "intel·ligència artificial salut",
  "query_rewrite": "Projectes sobre IA en salut del programa H2020 des de 2020",
  "meta": {
    "lang": "CA"
  }
}
```
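Filter values such as `"year": ">=2020"` are plain strings that a downstream search backend has to interpret. As a minimal sketch (the helper name and the exact operator set are assumptions, not part of the released pipeline), such a string can be turned into a predicate:

```python
import re

def year_predicate(filter_str):
    """Turn a year filter string like '>=2020' or '2018' into a predicate.

    The string format follows the example output above; this helper itself
    is illustrative, not a function shipped with the model."""
    m = re.fullmatch(r"(>=|<=|>|<|==)?\s*(\d{4})", filter_str.strip())
    if m is None:
        raise ValueError(f"unrecognised year filter: {filter_str!r}")
    op, year = m.group(1) or "==", int(m.group(2))
    ops = {
        ">=": lambda y: y >= year,
        "<=": lambda y: y <= year,
        ">":  lambda y: y > year,
        "<":  lambda y: y < year,
        "==": lambda y: y == year,
    }
    return ops[op]

keep = year_predicate(">=2020")
print([y for y in (2018, 2020, 2023) if keep(y)])  # [2020, 2023]
```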
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Base Model |
|
|
- **Model**: [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools)
- **Architecture**: LlamaForCausalLM (7B parameters)
|
|
|
|
|
### Fine-tuning Method |
|
|
- **Technique**: LoRA (Low-Rank Adaptation)
- **Trainable parameters**: ~1% of total (~50 MB adapter)
|
|
|
|
|
### LoRA Configuration |
|
|
| Parameter | Value |
|-----------|-------|
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
|
|
|
|
|
### Training Hyperparameters |
|
|
| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch size | 16 (effective) |
| Learning rate | 2e-4 |
| Sequence length | 2048 |
| Precision | FP16 (mixed) |
| Optimizer | AdamW |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
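The cosine schedule with linear warmup from the table can be sketched as follows. The peak LR (2e-4) and warmup ratio (0.1) come from the table; the step bookkeeping is illustrative and trainer internals may differ slightly:

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr over the first warmup_ratio of training,
    then cosine decay to zero over the remaining steps."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(50, 1000))    # halfway through warmup: 1e-4
print(lr_at_step(100, 1000))   # peak: 2e-4
print(lr_at_step(1000, 1000))  # end of training: 0.0
```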
|
|
|
|
|
### Training Data |
|
|
|
|
|
- **Dataset**: [SIRIS-Lab/impuls-query-parsing](https://huggingface.co/datasets/SIRIS-Lab/impuls-query-parsing)
- **Training split**: 682 multilingual queries (synthetic, template-generated)
- **Language distribution**: ~33% Catalan, ~33% Spanish, ~33% English
- **Query types**: Discover (88%), Quantify (12%)
|
|
|
|
|
### Evaluation Data |
|
|
|
|
|
- **Test split**: 100 real queries from domain experts (SIRIS Academic)
- **Annotation**: Manual gold-standard JSON for each query
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
### Overall Performance |
|
|
|
|
|
| Metric | Base Model | Fine-tuned |
|--------|------------|------------|
| JSON Validity | 100% | **100%** |
| Strict Accuracy | 15% | **51%** |
| Relaxed Accuracy | 29% | **65%** |
| Language Match | 53% | **87%** |
| Semantic Query Accuracy | 44% | **86%** |
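The strict/relaxed distinction can be sketched as follows. Note that the 0.75 agreement threshold and the field-level equality test here are assumptions for illustration, not the project's exact metric definitions:

```python
def strict_match(pred: dict, gold: dict) -> bool:
    # Strict: the parsed JSON must equal the gold annotation exactly.
    return pred == gold

def relaxed_match(pred: dict, gold: dict, threshold: float = 0.75) -> bool:
    # Relaxed: enough top-level fields agree with the gold annotation.
    keys = set(gold) | set(pred)
    agree = sum(pred.get(k) == gold.get(k) for k in keys)
    return agree / len(keys) >= threshold

gold = {"semantic_query": "hidrogen", "filters": {"programme": "Horizon 2020"},
        "organisations": [], "meta": {"lang": "CA"}}
pred = dict(gold, semantic_query="hidrogen verd")  # one field differs

print(strict_match(pred, gold), relaxed_match(pred, gold))  # False True
```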
|
|
|
|
|
### Component-level Accuracy |
|
|
|
|
|
| Component | Accuracy |
|-----------|----------|
| Programme (H2020, FEDER, etc.) | 96% |
| Year extraction | 98% |
| Location | 91% |
| Organizations | 77% |
| Semantic Query | 86% |
|
|
|
|
|
### Performance by Language |
|
|
|
|
|
| Language | Relaxed Accuracy |
|----------|------------------|
| English | 72% |
| Catalan | 64% |
| Spanish | 52% |
|
|
|
|
|
### Comparison with Other Models |
|
|
|
|
|
| Model | Strict Accuracy | Relaxed Accuracy | JSON Valid |
|-------|-----------------|------------------|------------|
| **Salamandra-7B (ours)** | **51%** | **65%** | 100% |
| Qwen 2.5-7B | 47% | 65% | 100% |
| Mistral-7B | 24% | 55% | 100% |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "SIRIS-Lab/impuls-salamandra-7b-query-parser"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# System prompt (simplified version)
system_prompt = """Convert natural language queries into structured JSON for R&D project search.
Output only valid JSON with the required schema."""

query = "projectes d'hidrogen finançats per H2020 des de 2020"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query}
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.1,
        do_sample=True
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
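The model reliably emits valid JSON (100% validity on the test set), but when integrating it into a pipeline it is still prudent to extract the JSON object defensively before parsing. A minimal sketch (the helper name and approach are illustrative; note it does not handle braces inside string values):

```python
import json

def extract_json(text: str) -> dict:
    """Pull the first balanced JSON object out of a generated string."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i + 1])
    raise ValueError("unbalanced JSON in model output")

parsed = extract_json('noise {"doc_type": "projects", "meta": {"lang": "CA"}} trailing')
print(parsed["meta"]["lang"])  # CA
```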
|
|
|
|
|
### With 4-bit Quantization (Recommended for limited VRAM) |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "SIRIS-Lab/impuls-salamandra-7b-query-parser",
    quantization_config=quantization_config,
    device_map="auto"
)
# Reduces memory from ~14 GB to ~3.5 GB
```
|
|
|
|
|
## Output Schema |
|
|
|
|
|
```json
{
  "doc_type": "projects",
  "filters": {
    "programme": "string | null",
    "funding_level": "string | null",
    "year": "string | null",
    "location": "string | null",
    "location_level": "region | province | country | null"
  },
  "organisations": [
    {
      "type": "university | research_center | hospital | company | null",
      "name": "string | null",
      "location": "string | null",
      "location_level": "string | null"
    }
  ],
  "semantic_query": "string | null",
  "query_rewrite": "string",
  "meta": {
    "lang": "CA | ES | EN",
    "notes": "string | null"
  }
}
```
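Downstream consumers can run a lightweight structural check against this schema before using a parse. A minimal sketch (the function is illustrative, not a validator shipped with the model; a full solution would use a JSON Schema library):

```python
def validate_parse(parsed: dict) -> list:
    """Return a list of structural problems (empty list = OK)."""
    problems = []
    for key in ("doc_type", "filters", "organisations",
                "semantic_query", "query_rewrite", "meta"):
        if key not in parsed:
            problems.append(f"missing key: {key}")
    if parsed.get("meta", {}).get("lang") not in ("CA", "ES", "EN"):
        problems.append("meta.lang must be CA, ES or EN")
    if not isinstance(parsed.get("organisations"), list):
        problems.append("organisations must be a list")
    return problems

example = {
    "doc_type": "projects",
    "filters": {"programme": "Horizon 2020", "year": ">=2020"},
    "organisations": [],
    "semantic_query": "intel·ligència artificial salut",
    "query_rewrite": "Projectes sobre IA en salut del programa H2020 des de 2020",
    "meta": {"lang": "CA"},
}
print(validate_parse(example))  # []
```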
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
| Configuration | VRAM Required |
|---------------|---------------|
| FP16 (full precision) | ~14 GB |
| 4-bit quantization | ~3.5 GB |
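These figures follow directly from the parameter count: 7B parameters at 16 bits each is ~14 GB of weights, and at 4 bits ~3.5 GB (activations and KV cache add overhead on top):

```python
def weight_memory_gb(n_params=7e9, bits_per_param=16):
    """Back-of-envelope weight memory; excludes activations and KV cache."""
    return n_params * bits_per_param / 8 / 1e9

print(round(weight_memory_gb(bits_per_param=16), 1))  # 14.0
print(round(weight_memory_gb(bits_per_param=4), 1))   # 3.5
```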
|
|
|
|
|
**Recommended**: GPU with 24GB+ VRAM (A100) or 4-bit quantization on consumer GPUs. |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Domain-specific**: Optimized for R&D project search queries; may not generalize well to other domains
- **Schema-bound**: Outputs follow a fixed JSON schema; cannot handle arbitrary structured formats
- **Language coverage**: Strongest on English, followed by Catalan; Spanish accuracy is noticeably lower
- **Complex queries**: Struggles with queries requiring numerical aggregation or ranking operations
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is designed for: |
|
|
- R&D project discovery platforms (RIS3CAT, Horizon Europe portals)
- Scientific literature search systems
- Multilingual semantic search applications
- Query understanding in Catalan, Spanish, and English
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
- The model was trained on synthetic queries generated from templates and real queries from domain experts
- No personal or sensitive data was used in training
- The model is intended for search query parsing and does not generate harmful content
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{impuls-salamandra-2024,
  author = {SIRIS Academic},
  title = {IMPULS-Salamandra-7B-Query-Parser: Multilingual Query Parsing for R&D Semantic Search},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SIRIS-Lab/impuls-salamandra-7b-query-parser}}
}
```
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **[Barcelona Supercomputing Center (BSC)](https://www.bsc.es/)** - For the Salamandra base model and AINA infrastructure
- **[Generalitat de Catalunya](https://web.gencat.cat/)** - For funding and the RIS3-MCAT platform
- **[AINA Project](https://projecteaina.cat/)** - For the AINA Challenge 2024 framework
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the base Salamandra model. |
|
|
|
|
|
## Links |
|
|
|
|
|
- **Training Dataset**: [SIRIS-Lab/impuls-query-parsing](https://huggingface.co/datasets/SIRIS-Lab/impuls-query-parsing)
- **Project Repository**: [github.com/sirisacademic/aina-impulse](https://github.com/sirisacademic/aina-impulse)
- **Base Model**: [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools)
- **AINA Project**: [projecteaina.cat](https://projecteaina.cat/)
- **SIRIS Academic**: [sirisacademic.com](https://sirisacademic.com/)
|
|
|