Salary Normalizer
A fine-tuned Gemma 3 270M model that parses and standardizes free-form salary text into structured JSON. Given an arbitrary salary string and a country name, it extracts currency symbol, ISO code, numeric range, and pay cadence in a single inference pass.
Model Details
| Property | Value |
|---|---|
| Base model | google/gemma-3-270m |
| Fine-tune type | Supervised (instruction-tuned) |
| License | Apache 2.0 |
| Task | Structured information extraction |
Intended Use
- Extract and normalize salary mentions from job descriptions or candidate profiles.
- Standardize heterogeneous salary formats (e.g., `12 LPA`, `$60k–$80k/yr`, `€45,000 p.a.`) into a consistent schema for downstream analytics or storage.
Out-of-scope use: This model is not designed for general text generation or tasks unrelated to salary parsing.
Input Format
Prompts must follow this exact template:
<start_of_turn>user
summarize salary: <SALARY_TEXT>
country: <COUNTRY_NAME><end_of_turn>
<start_of_turn>model
Examples of valid salary text:
- `$60k - $80k per year`
- `INR 12 LPA`
- `€45,000 annually`
- `12 to 12.5 US $ per hr`
Country names must match one of the supported countries listed below.
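The template above can be filled programmatically. The following is a minimal sketch, not part of the model's own tooling: `build_prompt` is a hypothetical helper name, and collapsing whitespace in the inputs is an assumption made so that each field stays on its own line of the template.

```python
def build_prompt(salary_text: str, country: str) -> str:
    """Fill the prompt template (hypothetical helper, not shipped with the model)."""
    # Assumption: collapse runs of whitespace and newlines so that each
    # field occupies exactly one line of the template.
    salary_text = " ".join(salary_text.split())
    country = " ".join(country.split())
    if not salary_text or not country:
        raise ValueError("salary_text and country must be non-empty")
    return (
        "<start_of_turn>user\n"
        f"summarize salary: {salary_text}\n"
        f"country: {country}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt = build_prompt("INR 12 LPA", "India")
print(prompt)
```

Keeping the template in one function avoids small drifts (a missing newline or turn marker) between call sites, which matters because the template must be followed exactly.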
Output Schema
The model returns a JSON object with the following fields:
{
  "currency": "$",
  "iso_code": "USD",
  "min_amount": 60000,
  "max_amount": 80000,
  "pay_rate": "ANNUALLY"
}
| Field | Type | Description |
|---|---|---|
| `currency` | string | Raw currency symbol or string as it appears in the input |
| `iso_code` | string | Standardized ISO 4217 currency code |
| `min_amount` | int / float | Lower bound of the salary range (annualized or as stated) |
| `max_amount` | int / float | Upper bound of the salary range (annualized or as stated) |
| `pay_rate` | string | One of: `HOURLY`, `DAILY`, `WEEKLY`, `BI-WEEKLY`, `MONTHLY`, `ANNUALLY`, `OTHERS` |
Note: `min_amount` and `max_amount` reflect normalized numeric values, not raw token extractions. For single-value salaries, both fields will hold the same value.
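Because the output is generated text, it may be worth checking it against the schema before storing it. The sketch below is one possible validator under stated assumptions: `parse_salary_output`, `PAY_RATES`, and `REQUIRED_FIELDS` are hypothetical names, and the `min_amount <= max_amount` check is inferred from the "lower bound" and "upper bound" descriptions rather than documented model behavior.

```python
import json

# Allowed pay_rate values, copied from the schema table.
PAY_RATES = {"HOURLY", "DAILY", "WEEKLY", "BI-WEEKLY", "MONTHLY", "ANNUALLY", "OTHERS"}
REQUIRED_FIELDS = ("currency", "iso_code", "min_amount", "max_amount", "pay_rate")


def parse_salary_output(raw: str) -> dict:
    """Parse the model's JSON text and check it against the schema (hypothetical helper)."""
    data = json.loads(raw.strip())
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for name in ("currency", "iso_code", "pay_rate"):
        if not isinstance(data[name], str):
            raise ValueError(f"{name} must be a string")
    for name in ("min_amount", "max_amount"):
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be a number")
    if data["pay_rate"] not in PAY_RATES:
        raise ValueError(f"unexpected pay_rate: {data['pay_rate']!r}")
    # Assumption: the lower bound should never exceed the upper bound.
    if data["min_amount"] > data["max_amount"]:
        raise ValueError("min_amount is greater than max_amount")
    return data


record = parse_salary_output(
    '{"currency": "$", "iso_code": "USD", "min_amount": 60000, '
    '"max_amount": 80000, "pay_rate": "ANNUALLY"}'
)
print(record["iso_code"], record["min_amount"], record["max_amount"])
```

Raising on malformed output, rather than silently repairing it, keeps bad records out of downstream analytics; callers can catch `ValueError` (which `json.JSONDecodeError` subclasses) and route failures to manual review.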
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Draup/salary-normalizer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

salary_text = "12 to 12.5 US $ per hr"
country = "United States"

# Build the prompt exactly as described under "Input Format".
prompt = (
    "<start_of_turn>user\n"
    f"summarize salary: {salary_text}\n"
    f"country: {country}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)

# Greedy decoding; stop as soon as the model closes its turn.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_turn>"),
)

# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(result)
# {"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"}
Supported Countries
The model supports salary parsing for the following 49 countries:
| Argentina | Australia | Austria | Belgium |
| Brazil | Canada | Chile | China |
| Colombia | Czechia | Denmark | Egypt |
| Finland | France | Germany | Hong Kong |
| Hungary | India | Indonesia | Ireland |
| Israel | Italy | Japan | Malaysia |
| Mexico | Netherlands | New Zealand | Norway |
| Pakistan | Peru | Philippines | Poland |
| Portugal | Romania | Russia | Saudi Arabia |
| Singapore | South Africa | South Korea | Spain |
| Sweden | Switzerland | Taiwan | Thailand |
| Turkey | United Arab Emirates | United Kingdom | United States |
| Vietnam |
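Since country names must match this list, a small lookup can reject unsupported names before inference. This is an illustrative sketch: `SUPPORTED_COUNTRIES` and `is_supported_country` are hypothetical names, the set is transcribed from the table above, and trimming surrounding whitespace before comparison is an assumption.

```python
# The 49 supported country names, transcribed from the table above.
SUPPORTED_COUNTRIES = frozenset({
    "Argentina", "Australia", "Austria", "Belgium",
    "Brazil", "Canada", "Chile", "China",
    "Colombia", "Czechia", "Denmark", "Egypt",
    "Finland", "France", "Germany", "Hong Kong",
    "Hungary", "India", "Indonesia", "Ireland",
    "Israel", "Italy", "Japan", "Malaysia",
    "Mexico", "Netherlands", "New Zealand", "Norway",
    "Pakistan", "Peru", "Philippines", "Poland",
    "Portugal", "Romania", "Russia", "Saudi Arabia",
    "Singapore", "South Africa", "South Korea", "Spain",
    "Sweden", "Switzerland", "Taiwan", "Thailand",
    "Turkey", "United Arab Emirates", "United Kingdom", "United States",
    "Vietnam",
})


def is_supported_country(country: str) -> bool:
    """Return True if the name matches a supported country exactly (hypothetical helper)."""
    # Assumption: surrounding whitespace is not significant; the match is
    # otherwise exact and case-sensitive, mirroring the spellings in the table.
    return country.strip() in SUPPORTED_COUNTRIES


print(is_supported_country("United States"))
print(is_supported_country("USA"))
```

Variants such as "USA" or "UK" are treated as unsupported here; mapping aliases to the listed spellings would be a separate step that this sketch does not attempt.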
Limitations
- Performance may degrade on salary formats not well-represented in training data.
- Country context is used for currency disambiguation; incorrect country input may produce inaccurate `iso_code` or `currency` values.
- The model is not multilingual; salary text is expected to be in English or use standard numeric/symbol conventions.