Salary Normalizer

A fine-tuned Gemma 3 270M model that parses free-form salary text and standardizes it into structured JSON. Given an arbitrary salary string and a country name, it extracts the currency symbol, ISO 4217 currency code, numeric range, and pay cadence in a single inference pass.

Model Details

| Property       | Value                                |
|----------------|--------------------------------------|
| Base model     | google/gemma-3-270m                  |
| Fine-tune type | Supervised (instruction-tuned)       |
| License        | Apache 2.0                           |
| Task           | Structured information extraction    |

Intended Use

  • Extract and normalize salary mentions from job descriptions or candidate profiles.
  • Standardize heterogeneous salary formats (e.g., 12 LPA, $60k–$80k/yr, €45,000 p.a.) into a consistent schema for downstream analytics or storage.

Out-of-scope use: This model is not designed for general text generation or tasks unrelated to salary parsing.

Input Format

Prompts must follow this exact template:

<start_of_turn>user
summarize salary: <SALARY_TEXT>
country: <COUNTRY_NAME><end_of_turn>
<start_of_turn>model

Examples of valid salary text:

  • $60k - $80k per year
  • INR 12 LPA
  • €45,000 annually
  • 12 to 12.5 US $ per hr

Country names must match one of the supported countries listed below.

Output Schema

The model returns a JSON object with the following fields:

{
  "currency": "$",
  "iso_code": "USD",
  "min_amount": 60000,
  "max_amount": 80000,
  "pay_rate": "ANNUALLY"
}

| Field      | Type        | Description                                                          |
|------------|-------------|----------------------------------------------------------------------|
| currency   | string      | Raw currency symbol or string as it appears in the input             |
| iso_code   | string      | Standardized ISO 4217 currency code                                  |
| min_amount | int / float | Lower bound of the salary range (annualized or as stated)            |
| max_amount | int / float | Upper bound of the salary range (annualized or as stated)            |
| pay_rate   | string      | One of: HOURLY, DAILY, WEEKLY, BI-WEEKLY, MONTHLY, ANNUALLY, OTHERS  |

Note: min_amount and max_amount hold normalized numeric values, not raw token extractions; for example, a shorthand such as $60k corresponds to 60000 in the sample output above. For single-value salaries, both fields hold the same value.
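
Because the model emits its JSON as plain text, downstream code will usually want to parse and sanity-check each response before storing it. The following is a minimal sketch of such a check, not part of the published model: the helper name parse_salary_output is hypothetical, and the rule that min_amount must not exceed max_amount is an assumption inferred from the "lower bound" and "upper bound" descriptions in the field table.

```python
import json

# Allowed pay_rate values and field names, copied from the field table above.
PAY_RATES = {"HOURLY", "DAILY", "WEEKLY", "BI-WEEKLY", "MONTHLY", "ANNUALLY", "OTHERS"}
REQUIRED_FIELDS = ("currency", "iso_code", "min_amount", "max_amount", "pay_rate")


def parse_salary_output(raw: str) -> dict:
    """Parse one model response and check it against the output schema.

    Hypothetical helper; raises ValueError on any schema violation
    (json.JSONDecodeError is itself a subclass of ValueError).
    """
    record = json.loads(raw.strip())
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")

    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    for name in ("currency", "iso_code", "pay_rate"):
        if not isinstance(record[name], str):
            raise ValueError(f"{name} must be a string")

    for name in ("min_amount", "max_amount"):
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be an int or float")

    if record["pay_rate"] not in PAY_RATES:
        raise ValueError(f"unexpected pay_rate: {record['pay_rate']!r}")

    # Assumption: the lower bound should never exceed the upper bound.
    if record["min_amount"] > record["max_amount"]:
        raise ValueError("min_amount is greater than max_amount")

    return record
```

Responses that fail the check can be logged and routed to manual review rather than written to storage.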

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model_name = "Draup/salary-normalizer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

salary_text = "12 to 12.5 US $ per hr"
country = "United States"

# Build the prompt with the exact template described under "Input Format".
prompt = (
    "<start_of_turn>user\n"
    f"summarize salary: {salary_text}\n"
    f"country: {country}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)

# Greedy decoding; generation stops at the end-of-turn token.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_turn>")
)

# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)
print(result)
# {"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"}

Supported Countries

The model supports salary parsing for the following 49 countries:

Argentina, Australia, Austria, Belgium,
Brazil, Canada, Chile, China,
Colombia, Czechia, Denmark, Egypt,
Finland, France, Germany, Hong Kong,
Hungary, India, Indonesia, Ireland,
Israel, Italy, Japan, Malaysia,
Mexico, Netherlands, New Zealand, Norway,
Pakistan, Peru, Philippines, Poland,
Portugal, Romania, Russia, Saudi Arabia,
Singapore, South Africa, South Korea, Spain,
Sweden, Switzerland, Taiwan, Thailand,
Turkey, United Arab Emirates, United Kingdom, United States,
Vietnam
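
Since the prompt expects a country name from this list, it can help to validate the country before building the prompt. The sketch below is illustrative only: build_prompt and SUPPORTED_COUNTRIES are hypothetical names, and exact, case-sensitive matching against the spellings above is an assumption about what "match" means for this model.

```python
# The 49 supported countries, spelled as in the list above.
SUPPORTED_COUNTRIES = frozenset({
    "Argentina", "Australia", "Austria", "Belgium",
    "Brazil", "Canada", "Chile", "China",
    "Colombia", "Czechia", "Denmark", "Egypt",
    "Finland", "France", "Germany", "Hong Kong",
    "Hungary", "India", "Indonesia", "Ireland",
    "Israel", "Italy", "Japan", "Malaysia",
    "Mexico", "Netherlands", "New Zealand", "Norway",
    "Pakistan", "Peru", "Philippines", "Poland",
    "Portugal", "Romania", "Russia", "Saudi Arabia",
    "Singapore", "South Africa", "South Korea", "Spain",
    "Sweden", "Switzerland", "Taiwan", "Thailand",
    "Turkey", "United Arab Emirates", "United Kingdom", "United States",
    "Vietnam",
})


def build_prompt(salary_text: str, country: str) -> str:
    """Return the prompt in the template from "Input Format".

    Hypothetical helper. Assumption: the country must match a supported
    name exactly (same spelling and capitalization).
    """
    if country not in SUPPORTED_COUNTRIES:
        raise ValueError(f"unsupported country: {country!r}")
    return (
        "<start_of_turn>user\n"
        f"summarize salary: {salary_text}\n"
        f"country: {country}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )
```

For example, build_prompt("INR 12 LPA", "India") returns a prompt ready for tokenization, while an unlisted country raises ValueError before any inference is run.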

Limitations

  • Performance may degrade on salary formats not well-represented in training data.
  • Country context is used for currency disambiguation; incorrect country input may produce inaccurate iso_code or currency values.
  • The model is not multilingual — salary text is expected to be in English or use standard numeric/symbol conventions.