--- license: apache-2.0 metrics: - accuracy base_model: - google/gemma-3-270m tags: - unstructured-to-structured-data - fine-tune - salary-normalizer - salary-parser --- # Salary Normalizer A fine-tuned [Gemma 3 270M](https://huggingface.co/google/gemma-3-270m) model that parses and standardizes free-form salary text into structured JSON. Given an arbitrary salary string and a country name, it extracts currency symbol, ISO code, numeric range, and pay cadence in a single inference pass. ## Model Details | Property | Value | |----------------|-------------------------------| | Base model | `google/gemma-3-270m` | | Fine-tune type | Supervised (instruction-tuned)| | License | Apache 2.0 | | Task | Structured information extraction | ## Intended Use - Extract and normalize salary mentions from job descriptions or candidate profiles. - Standardize heterogeneous salary formats (e.g., `12 LPA`, `$60k–$80k/yr`, `€45,000 p.a.`) into a consistent schema for downstream analytics or storage. **Out-of-scope use:** This model is not designed for general text generation or tasks unrelated to salary parsing. ## Input Format Prompts must follow this exact template: ```text user summarize salary: country: model ``` **Examples of valid salary text:** - `$60k - $80k per year` - `INR 12 LPA` - `€45,000 annually` - `12 to 12.5 US $ per hr` Country names must match one of the [supported countries](#supported-countries) listed below. ## Output Schema The model returns a JSON object with the following fields: ```json { "currency": "$", "iso_code": "USD", "min_amount": 60000, "max_amount": 80000, "pay_rate": "ANNUALLY" } ``` | Field | Type | Description | |--------------|---------------|--------------------------------------------------------------| | `currency` | `string` | Raw currency symbol or string as it appears in the input | | `iso_code` | `string` | Standardized ISO 4217 currency code | | `min_amount` | `int / float` | Lower bound of the salary range (annualized or as stated) | | `max_amount` | `int / float` | Upper bound of the salary range (annualized or as stated) | | `pay_rate` | `string` | One of: `HOURLY`, `DAILY`, `WEEKLY`, `BI-WEEKLY`, `MONTHLY`, `ANNUALLY`, `OTHERS` | > **Note:** `min_amount` and `max_amount` reflect normalized numeric values, > not raw token extractions. For single-value salaries, both fields will hold the same value. ## Usage ```python from transformers import AutoTokenizer, AutoModelForCausalLM model_name = "Draup/salary-normalizer" tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForCausalLM.from_pretrained(model_name) salary_text = "12 to 12.5 US $ per hr" country = "United States" prompt = ( f"user\n" f"summarize salary: {salary_text}\n" f"country: {country}\n" f"model\n" ) inputs = tokenizer(prompt, return_tensors="pt", truncation=True) outputs = model.generate( **inputs, max_new_tokens=64, do_sample=False, eos_token_id=tokenizer.convert_tokens_to_ids("") ) result = tokenizer.decode( outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True ) print(result) # {"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"} ``` ## Supported Countries The model supports salary parsing for the following 49 countries: | | | | | |---|---|---|---| | Argentina | Australia | Austria | Belgium | | Brazil | Canada | Chile | China | | Colombia | Czechia | Denmark | Egypt | | Finland | France | Germany | Hong Kong | | Hungary | India | Indonesia | Ireland | | Israel | Italy | Japan | Malaysia | | Mexico | Netherlands | New Zealand | Norway | | Pakistan | Peru | Philippines | Poland | | Portugal | Romania | Russia | Saudi Arabia | | Singapore | South Africa | South Korea | Spain | | Sweden | Switzerland | Taiwan | Thailand | | Turkey | United Arab Emirates | United Kingdom | United States | | Vietnam | | | | ## Limitations - Performance may degrade on salary formats not well-represented in training data. - Country context is used for currency disambiguation; incorrect country input may produce inaccurate `iso_code` or `currency` values. - The model is not multilingual — salary text is expected to be in English or use standard numeric/symbol conventions.