---
license: apache-2.0
metrics:
- accuracy
base_model:
- google/gemma-3-270m
tags:
- unstructured-to-structured-data
- fine-tune
- salary-normalizer
- salary-parser
---

# Salary Normalizer

A fine-tuned [Gemma 3 270M](https://huggingface.co/google/gemma-3-270m) model that parses
and standardizes free-form salary text into structured JSON. Given a salary string and a
country name, it extracts the currency symbol, ISO 4217 code, numeric range, and pay
cadence in a single inference pass.

## Model Details

| Property       | Value                             |
|----------------|-----------------------------------|
| Base model     | `google/gemma-3-270m`             |
| Fine-tune type | Supervised (instruction-tuned)    |
| License        | Apache 2.0                        |
| Task           | Structured information extraction |

## Intended Use

- Extract and normalize salary mentions from job descriptions or candidate profiles.
- Standardize heterogeneous salary formats (e.g., `12 LPA`, `$60k–$80k/yr`, `€45,000 p.a.`)
  into a consistent schema for downstream analytics or storage.

**Out-of-scope use:** This model is not designed for general text generation or tasks
unrelated to salary parsing.


## Input Format

Prompts must follow this exact template:
```text
<start_of_turn>user
summarize salary: <SALARY_TEXT>
country: <COUNTRY_NAME><end_of_turn>
<start_of_turn>model
```

**Examples of valid salary text:**
- `$60k - $80k per year`
- `INR 12 LPA`
- `€45,000 annually`
- `12 to 12.5 US $ per hr`

Country names must match one of the [supported countries](#supported-countries) listed below.

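The template above can be filled in programmatically. The following is a minimal sketch, not part of the released code: `build_prompt` is a hypothetical helper name, and collapsing whitespace in the inputs is an assumption made here so that each value stays on its own template line.

```python
def build_prompt(salary_text: str, country: str) -> str:
    """Fill the prompt template with one salary string and one country name."""
    # Assumption: internal newlines and runs of spaces carry no meaning,
    # so collapse them to keep each value on its own template line.
    salary_text = " ".join(salary_text.split())
    country = " ".join(country.split())
    if not salary_text or not country:
        raise ValueError("salary_text and country must both be non-empty")
    return (
        "<start_of_turn>user\n"
        f"summarize salary: {salary_text}\n"
        f"country: {country}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


print(build_prompt("INR 12 LPA", "India"))
```

Keeping the template in one function means the exact turn markers only need to be typed once, which helps when many salary strings are processed.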

## Output Schema

The model returns a JSON object with the following fields:
```json
{
  "currency": "$",
  "iso_code": "USD",
  "min_amount": 60000,
  "max_amount": 80000,
  "pay_rate": "ANNUALLY"
}
```

| Field        | Type          | Description                                                  |
|--------------|---------------|--------------------------------------------------------------|
| `currency`   | `string`      | Raw currency symbol or string as it appears in the input     |
| `iso_code`   | `string`      | Standardized ISO 4217 currency code                          |
| `min_amount` | `int / float` | Lower bound of the salary range (annualized or as stated)    |
| `max_amount` | `int / float` | Upper bound of the salary range (annualized or as stated)    |
| `pay_rate`   | `string`      | One of: `HOURLY`, `DAILY`, `WEEKLY`, `BI-WEEKLY`, `MONTHLY`, `ANNUALLY`, `OTHERS` |

> **Note:** `min_amount` and `max_amount` are normalized numeric values rather than
> raw token extractions (for example, `60k` is expanded to `60000`). For single-value
> salaries, both fields hold the same value.

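Because the model emits JSON as plain text, downstream code may want to check each output against the schema above before storing it. The sketch below is one possible validator using only the standard library. The function name `parse_salary_output` is hypothetical, and the check that `min_amount` does not exceed `max_amount` is an assumption about well-formed ranges rather than a documented guarantee.

```python
import json

PAY_RATES = {"HOURLY", "DAILY", "WEEKLY", "BI-WEEKLY", "MONTHLY", "ANNUALLY", "OTHERS"}
STRING_FIELDS = ("currency", "iso_code", "pay_rate")
NUMBER_FIELDS = ("min_amount", "max_amount")


def parse_salary_output(raw: str) -> dict:
    """Parse one decoded model output and check it against the documented schema."""
    # json.JSONDecodeError subclasses ValueError, so callers can catch ValueError only.
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")

    for field in STRING_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"{field!r} is missing or not a string")
    for field in NUMBER_FIELDS:
        value = record.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{field!r} is missing or not a number")

    if record["pay_rate"] not in PAY_RATES:
        raise ValueError(f"unexpected pay_rate: {record['pay_rate']!r}")
    # Assumption: a well-formed range never has its lower bound above its upper bound.
    if record["min_amount"] > record["max_amount"]:
        raise ValueError("min_amount is greater than max_amount")
    return record


sample = '{"currency": "$", "iso_code": "USD", "min_amount": 60000, "max_amount": 80000, "pay_rate": "ANNUALLY"}'
print(parse_salary_output(sample))
```

Raising `ValueError` for every failure keeps the calling code simple: a single `except ValueError` can route malformed outputs to a retry or a manual-review queue.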
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub.
model_name = "Draup/salary-normalizer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

salary_text = "12 to 12.5 US $ per hr"
country = "United States"

# Build the prompt exactly as described under "Input Format".
prompt = (
    "<start_of_turn>user\n"
    f"summarize salary: {salary_text}\n"
    f"country: {country}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)

# Greedy decoding; stop as soon as the model closes its turn.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_turn>"),
)

# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(result)
# {"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"}
```

## Supported Countries

The model supports salary parsing for the following 49 countries:

| | | | |
|---|---|---|---|
| Argentina | Australia | Austria | Belgium |
| Brazil | Canada | Chile | China |
| Colombia | Czechia | Denmark | Egypt |
| Finland | France | Germany | Hong Kong |
| Hungary | India | Indonesia | Ireland |
| Israel | Italy | Japan | Malaysia |
| Mexico | Netherlands | New Zealand | Norway |
| Pakistan | Peru | Philippines | Poland |
| Portugal | Romania | Russia | Saudi Arabia |
| Singapore | South Africa | South Korea | Spain |
| Sweden | Switzerland | Taiwan | Thailand |
| Turkey | United Arab Emirates | United Kingdom | United States |
| Vietnam | | | |

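Since country names must match the table above, it can help to check them before building a prompt. The following sketch assumes the model expects the spellings exactly as listed. The helper name `resolve_country` is hypothetical, and the case-insensitive matching is a convenience added here, not a documented feature of the model.

```python
SUPPORTED_COUNTRIES = frozenset({
    "Argentina", "Australia", "Austria", "Belgium",
    "Brazil", "Canada", "Chile", "China",
    "Colombia", "Czechia", "Denmark", "Egypt",
    "Finland", "France", "Germany", "Hong Kong",
    "Hungary", "India", "Indonesia", "Ireland",
    "Israel", "Italy", "Japan", "Malaysia",
    "Mexico", "Netherlands", "New Zealand", "Norway",
    "Pakistan", "Peru", "Philippines", "Poland",
    "Portugal", "Romania", "Russia", "Saudi Arabia",
    "Singapore", "South Africa", "South Korea", "Spain",
    "Sweden", "Switzerland", "Taiwan", "Thailand",
    "Turkey", "United Arab Emirates", "United Kingdom", "United States",
    "Vietnam",
})

# Map a case-folded spelling back to the canonical one from the table.
_CANONICAL = {name.casefold(): name for name in SUPPORTED_COUNTRIES}


def resolve_country(name: str) -> str:
    """Return the table spelling of a supported country, or raise ValueError."""
    key = " ".join(name.split()).casefold()  # trim and collapse whitespace
    if key not in _CANONICAL:
        raise ValueError(f"unsupported country: {name!r}")
    return _CANONICAL[key]


print(resolve_country("united kingdom"))
```

Aliases such as "USA" or "UK" are deliberately not mapped here; whether the model handles them is not stated, so callers would need to add their own alias table.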
## Limitations

- Performance may degrade on salary formats that are not well represented in the training data.
- Country context is used for currency disambiguation; an incorrect country input may
  produce inaccurate `iso_code` or `currency` values.
- The model is not multilingual: salary text is expected to be in English or to use
  standard numeric/symbol conventions.