---
base_model:
- unsloth/Qwen3-14B-unsloth-bnb-4bit
license: apache-2.0
language:
- en
tags:
- geocoding
- unsloth
---

This fine-tuned LLM is intended for the task of geocoding complex location references, and accompanies [Coordinates from Context: Using LLMs to Ground Complex Location References](https://arxiv.org/pdf/2510.08741) (Masis & O'Connor, EACL 2026).
The model is referred to as "Geoparser-augmented FT Qwen 14B" in the paper. 

### Model description
The base model is a quantized Qwen3-14B model (`unsloth/Qwen3-14B-unsloth-bnb-4bit`) that has been fine-tuned for geocoding, i.e., linking a location reference to an actual geographic location.
The model was trained with parameter-efficient fine-tuning via low-rank adaptation (LoRA).
It was trained for our 'Geoparser-augmented' approach, in which a separate geoparsing tool augments each input with the center coordinates of the locations it mentions;
our fine-tuned model then uses both the original location reference and those coordinates to generate a bounding box for the described location.
For more details, please see the accompanying paper.

### Training data
The model was trained on 13k examples from the training subset of the [GeoCoDe dataset](https://github.com/EgoLaparra/geocode-data). Each input consists of a complex location reference together with the center coordinates of each mentioned location, and the output is the bounding box of the described location.
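
To make the relationship between center coordinates and a bounding box concrete, here is a minimal sketch. The `BoundingBox` class, its field names, and the coordinate values are hypothetical and chosen only for illustration; they are not the GeoCoDe data format or the model's output format, for which please see the paper and the dataset repository.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BoundingBox:
    # Hypothetical layout (an assumption, not the dataset's format):
    # west/south/east/north edges in decimal degrees.
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def center(self) -> Tuple[float, float]:
        """Return the (lon, lat) midpoint of the box."""
        return ((self.min_lon + self.max_lon) / 2,
                (self.min_lat + self.max_lat) / 2)

    def contains(self, lon: float, lat: float) -> bool:
        """Check whether a point lies inside the box (edges included)."""
        return (self.min_lon <= lon <= self.max_lon
                and self.min_lat <= lat <= self.max_lat)


# Made-up coordinates, for illustration only.
box = BoundingBox(min_lon=-73.0, min_lat=42.0, max_lon=-72.0, max_lat=43.0)
print(box.center())               # (-72.5, 42.5)
print(box.contains(-72.5, 42.4))  # True
```

This simple midpoint does not handle boxes that cross the antimeridian.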

### Limitations
Due to data limitations, this model has been trained and evaluated for our task only in Mainstream American English. 

### Usage (unsloth)
The following code snippet illustrates how to use the model. For the system prompt we used and for example prompts, please see the appendices in the accompanying paper. 

```python
from unsloth import FastLanguageModel
import torch

model_name = "tmasis/geocoding-complex-location-references"

# Load model and tokenizer from Hugging Face Hub
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

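# Prepare model input
# Placeholders: replace with the system prompt and a user prompt (see the paper's appendices)
system_prompt = "<system_prompt>"
prompt = "<prompt>"
messages = [{"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt}]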
text = tokenizer.apply_chat_template(messages,
    tokenize=False,
    add_generation_prompt = True,
    enable_thinking = False
)

# Conduct text generation (do_sample=True so the sampling parameters take effect)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs,
    max_new_tokens=1024, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)
```

### Usage (HuggingFace transformers)
Alternatively, you can use the HuggingFace transformers library. 

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tmasis/geocoding-complex-location-references"

# Load model and tokenizer from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype = "auto",
    device_map = "auto"
)

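# Prepare model input
# Placeholders: replace with the system prompt and a user prompt (see the paper's appendices)
system_prompt = "<system_prompt>"
prompt = "<prompt>"
messages = [{"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt}]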
text = tokenizer.apply_chat_template(messages,
    tokenize=False,
    add_generation_prompt = True,
    enable_thinking = False
)

# Conduct text generation (do_sample=True so the sampling parameters take effect)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs,
    max_new_tokens=1024, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)
```
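
In both snippets above, `response` contains the prompt as well as the generated text, because `generate` on a causal language model returns the prompt token ids followed by the new ones. If only the model's answer is needed, the prompt ids can be sliced off before decoding. The helper below is a minimal sketch of that idea; `generated_token_ids` is a hypothetical name, and the toy ids stand in for a real generation.

```python
from typing import List, Sequence


def generated_token_ids(output_ids: Sequence[int], prompt_length: int) -> List[int]:
    """Return only the token ids that follow the first `prompt_length` ids."""
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return list(output_ids[prompt_length:])


# Toy ids standing in for a real generation: 3 prompt tokens, then 2 new tokens.
toy_output = [101, 7, 8, 42, 43]
print(generated_token_ids(toy_output, prompt_length=3))  # [42, 43]
```

With a real model, one way to apply this is to tokenize once into a variable, e.g. `inputs = tokenizer(text, return_tensors="pt").to(model.device)`, take `prompt_length = inputs["input_ids"].shape[1]`, and then call `tokenizer.decode(generated_token_ids(outputs[0].tolist(), prompt_length), skip_special_tokens=True)`.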