	---
	license: apache-2.0
	metrics:
	- accuracy
	base_model:
	- google/gemma-3-270m
	tags:
	- unstructured-to-structured-data
	- fine-tune
	- salary-normalizer
	- salary-parser
	---

	# Salary Normalizer

	A fine-tuned [Gemma 3 270M](https://huggingface.co/google/gemma-3-270m) model that parses
	and standardizes free-form salary text into structured JSON. Given an arbitrary salary
	string and a country name, it extracts currency symbol, ISO code, numeric range, and
	pay cadence in a single inference pass.

	## Model Details

	| Property       | Value                             |
	|----------------|-----------------------------------|
	| Base model     | `google/gemma-3-270m`             |
	| Fine-tune type | Supervised (instruction-tuned)    |
	| License        | Apache 2.0                        |
	| Task           | Structured information extraction |

	## Intended Use

	- Extract and normalize salary mentions from job descriptions or candidate profiles.
	- Standardize heterogeneous salary formats (e.g., `12 LPA`, `$60k–$80k/yr`, `€45,000 p.a.`)
	into a consistent schema for downstream analytics or storage.

	Out-of-scope use: This model is not designed for general text generation or tasks
	unrelated to salary parsing.


	## Input Format

	Prompts must follow this exact template:
	```text
	<start_of_turn>user
	summarize salary: <SALARY_TEXT>
	country: <COUNTRY_NAME><end_of_turn>
	<start_of_turn>model
	```

	Examples of valid salary text:
	- `$60k - $80k per year`
	- `INR 12 LPA`
	- `€45,000 annually`
	- `12 to 12.5 US $ per hr`

	Country names must match one of the [supported countries](#supported-countries) listed below.
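
Because the template must be reproduced exactly, it can help to build prompts in one place rather than by hand. The sketch below is one possible helper, not part of the released model: `build_prompt` is a hypothetical name, and collapsing whitespace in the inputs is an assumption made here so that the salary text stays on a single line.

```python
def build_prompt(salary_text: str, country: str) -> str:
    """Return a prompt that follows the template described above."""
    # Assumption: collapse runs of whitespace (including newlines) so the
    # salary text and country each occupy exactly one line of the template.
    salary = " ".join(salary_text.split())
    country_name = " ".join(country.split())
    return (
        "<start_of_turn>user\n"
        f"summarize salary: {salary}\n"
        f"country: {country_name}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


if __name__ == "__main__":
    print(build_prompt("INR 12 LPA", "India"))
```

The returned string ends with the open `model` turn, so generation starts directly at the JSON output.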


	## Output Schema

	The model returns a JSON object with the following fields:
	```json
	{
	"currency": "$",
	"iso_code": "USD",
	"min_amount": 60000,
	"max_amount": 80000,
	"pay_rate": "ANNUALLY"
	}
	```

	| Field        | Type          | Description                                                                      |
	|--------------|---------------|----------------------------------------------------------------------------------|
	| `currency`   | `string`      | Raw currency symbol or string as it appears in the input                         |
	| `iso_code`   | `string`      | Standardized ISO 4217 currency code                                              |
	| `min_amount` | `int / float` | Lower bound of the salary range (annualized or as stated)                        |
	| `max_amount` | `int / float` | Upper bound of the salary range (annualized or as stated)                        |
	| `pay_rate`   | `string`      | One of: `HOURLY`, `DAILY`, `WEEKLY`, `BI-WEEKLY`, `MONTHLY`, `ANNUALLY`, `OTHERS` |

	> Note: `min_amount` and `max_amount` reflect normalized numeric values,
	> not raw token extractions. For single-value salaries, both fields will hold the same value.
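
Since the output is generated text, downstream code may want to check it against the schema before storing it. The following is a minimal sketch under stated assumptions: `parse_salary_output` is a hypothetical helper, and the checks (all five fields present, numeric amounts, `min_amount <= max_amount`, a known `pay_rate`) are inferred from the table above rather than guaranteed by the model.

```python
import json

# Allowed values for pay_rate, copied from the schema table.
PAY_RATES = {"HOURLY", "DAILY", "WEEKLY", "BI-WEEKLY", "MONTHLY", "ANNUALLY", "OTHERS"}
REQUIRED_FIELDS = ("currency", "iso_code", "min_amount", "max_amount", "pay_rate")


def parse_salary_output(raw: str) -> dict:
    """Parse the model's JSON output and validate it against the schema."""
    # json.JSONDecodeError is a subclass of ValueError, so malformed JSON
    # surfaces as the same exception type as the schema checks below.
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")

    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    for name in ("min_amount", "max_amount"):
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be a number, got {value!r}")

    if record["min_amount"] > record["max_amount"]:
        raise ValueError("min_amount exceeds max_amount")
    if record["pay_rate"] not in PAY_RATES:
        raise ValueError(f"unknown pay_rate: {record['pay_rate']!r}")
    return record


if __name__ == "__main__":
    sample = (
        '{"currency": "$", "iso_code": "USD", '
        '"min_amount": 60000, "max_amount": 80000, "pay_rate": "ANNUALLY"}'
    )
    print(parse_salary_output(sample))
```

Raising `ValueError` on any mismatch keeps the failure path uniform, so a caller can wrap the call in a single `try`/`except` and route rejected outputs to manual review.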

	## Usage
	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	model_name = "Draup/salary-normalizer"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(model_name)

	salary_text = "12 to 12.5 US $ per hr"
	country = "United States"

	# Build the prompt in the exact template described under "Input Format".
	prompt = (
	    "<start_of_turn>user\n"
	    f"summarize salary: {salary_text}\n"
	    f"country: {country}<end_of_turn>\n"
	    "<start_of_turn>model\n"
	)

	inputs = tokenizer(prompt, return_tensors="pt", truncation=True)

	# Greedy decoding; stop as soon as the model closes its turn.
	outputs = model.generate(
	    **inputs,
	    max_new_tokens=64,
	    do_sample=False,
	    eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_turn>"),
	)

	# Decode only the newly generated tokens, skipping the prompt.
	result = tokenizer.decode(
	    outputs[0][inputs["input_ids"].shape[-1]:],
	    skip_special_tokens=True,
	)
	print(result)
	# {"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"}
	```

	## Supported Countries

	The model supports salary parsing for the following 49 countries:

	|             |                      |                |               |
	|-------------|----------------------|----------------|---------------|
	| Argentina   | Australia            | Austria        | Belgium       |
	| Brazil      | Canada               | Chile          | China         |
	| Colombia    | Czechia              | Denmark        | Egypt         |
	| Finland     | France               | Germany        | Hong Kong     |
	| Hungary     | India                | Indonesia      | Ireland       |
	| Israel      | Italy                | Japan          | Malaysia      |
	| Mexico      | Netherlands          | New Zealand    | Norway        |
	| Pakistan    | Peru                 | Philippines    | Poland        |
	| Portugal    | Romania              | Russia         | Saudi Arabia  |
	| Singapore   | South Africa         | South Korea    | Spain         |
	| Sweden      | Switzerland          | Taiwan         | Thailand      |
	| Turkey      | United Arab Emirates | United Kingdom | United States |
	| Vietnam     |                      |                |               |
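
Country names outside this list are not supported, so it may be worth rejecting them before calling the model. The sketch below shows one way to do that. `SUPPORTED_COUNTRIES` and `check_country` are hypothetical names, and treating the match as exact and case-sensitive (after trimming surrounding whitespace) is an assumption, since the card only states that names must match the list.

```python
# The 49 country names from the table above, spelled exactly as listed.
SUPPORTED_COUNTRIES = frozenset({
    "Argentina", "Australia", "Austria", "Belgium",
    "Brazil", "Canada", "Chile", "China",
    "Colombia", "Czechia", "Denmark", "Egypt",
    "Finland", "France", "Germany", "Hong Kong",
    "Hungary", "India", "Indonesia", "Ireland",
    "Israel", "Italy", "Japan", "Malaysia",
    "Mexico", "Netherlands", "New Zealand", "Norway",
    "Pakistan", "Peru", "Philippines", "Poland",
    "Portugal", "Romania", "Russia", "Saudi Arabia",
    "Singapore", "South Africa", "South Korea", "Spain",
    "Sweden", "Switzerland", "Taiwan", "Thailand",
    "Turkey", "United Arab Emirates", "United Kingdom", "United States",
    "Vietnam",
})


def check_country(name: str) -> str:
    """Return the trimmed country name, or raise if it is not supported."""
    cleaned = name.strip()
    # Assumption: exact, case-sensitive match against the listed spelling.
    if cleaned not in SUPPORTED_COUNTRIES:
        raise ValueError(f"unsupported country: {name!r}")
    return cleaned


if __name__ == "__main__":
    print(check_country(" United States "))
```

A caller that receives free-form country input (for example `USA` or `UK`) would need its own mapping to these spellings before using this check.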

	## Limitations

	- Performance may degrade on salary formats that are not well represented in the training data.
	- Country context is used for currency disambiguation; incorrect country input may
	produce inaccurate `iso_code` or `currency` values.
	- The model is not multilingual: salary text is expected to be in English or to use
	standard numeric/symbol conventions.