Job Posting Extractor (Qwen2.5-3B)

A fine-tuned version of Qwen2.5-3B-Instruct specialized in extracting structured JSON data from job postings. Built to replace expensive API calls for web scraping tasks.

What This Model Does

Given a job posting in markdown format, this model extracts structured JSON containing:

  • job_title

  • company

  • location

  • description

  • salary (when available)

  • requirements (when available)
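For the sample posting used in the Quick Start below, the extracted output looks roughly like this (the values here are illustrative, not actual model output):

```python
import json

# Illustrative output shape; optional fields are null when absent from the posting
example = json.loads("""
{
  "job_title": "Senior Python Developer",
  "company": "TechCorp",
  "location": "San Francisco, CA",
  "description": "We are looking for an experienced Python developer...",
  "salary": null,
  "requirements": null
}
""")
print(sorted(example.keys()))
```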

Quick Start

from unsloth import FastLanguageModel
import json

# Load model from HuggingFace
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="HelixCipher/job-posting-extractor-qwen",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# Example input
job_markdown = """# Job Position
**Position:** Senior Python Developer
**Company:** TechCorp
**Location:** San Francisco, CA

## Job Description
We are looking for an experienced Python developer...
"""

# Extract JSON
messages = [
    {"role": "system", "content": "You are a JSON extraction assistant. Always output ONLY valid JSON."},
    {"role": "user", "content": f"Extract job fields as JSON.\n\n{job_markdown}"}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=500, temperature=0.1, do_sample=True)

# Decode only the newly generated tokens, not the echoed prompt
result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(result)
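The decoded string should be valid JSON, but small models occasionally wrap output in code fences or add stray text. A defensive parser helps; this helper is a sketch for illustration, not part of the model's API:

```python
import json
import re

def parse_model_json(text: str) -> dict:
    """Extract the first JSON object from model output, tolerating code fences."""
    # Strip markdown code fences if the model added them
    text = re.sub(r"```(?:json)?", "", text)
    # Grab the outermost {...} span
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    return json.loads(text[start:end + 1])

data = parse_model_json('```json\n{"job_title": "Senior Python Developer"}\n```')
print(data["job_title"])
```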

Training Details

  • Base Model: unsloth/qwen2.5-3b-instruct-bnb-4bit.

  • Training Data: 12,000 job posting examples.

  • Training Approach: LoRA with Unsloth.

  • Fine-tuning Library: TRL (Transformer Reinforcement Learning).
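The LoRA-with-Unsloth setup above can be sketched as a configuration fragment; the rank, alpha, and target modules below are illustrative assumptions, since the actual training hyperparameters are not published:

```python
from unsloth import FastLanguageModel

# Load the 4-bit base model, then attach LoRA adapters
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/qwen2.5-3b-instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # LoRA rank (assumed)
    lora_alpha=16,  # scaling factor (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    lora_dropout=0.0,
)
```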

Framework Versions

  • PEFT: 0.18.1

  • TRL: 0.24.0

  • Transformers: 4.57.6

  • PyTorch: 2.10.0+cu126

Use Cases

  • Extract job postings from scraped websites.

  • Convert unstructured job listings to structured JSON.

  • Automate data collection for job aggregators.

  • Replace expensive LLM API calls with local inference.

Limitations

  • Trained specifically on job postings; may not generalize well to other document types.

  • Works best with markdown-formatted input (similar to html2text output).

  • Maximum context: 2048 tokens.
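Since the model expects html2text-style markdown, scraped HTML should be converted to plain text before inference. A minimal stdlib-only sketch is below; a real pipeline would presumably use the html2text package, which produces richer markdown:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Crude HTML-to-text conversion for demonstration purposes."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)

    def text(self):
        return "\n".join(self.parts)

extractor = TextExtractor()
extractor.feed("<h1>Job Position</h1><p>Senior Python Developer at TechCorp</p>")
print(extractor.text())
```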

License & Attribution

This project is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

You are free to use, share, copy, modify, and redistribute this material for any purpose (including commercial use), provided that proper attribution is given.

Attribution requirements

Any reuse, redistribution, or derivative work must include:

  1. The creator's name: HelixCipher

  2. A link to the original repository:

    https://github.com/HelixCipher/fine-tuning-an-local-llm-for-web-scraping

  3. An indication of whether changes were made

  4. A reference to the license (CC BY 4.0)

Example Attribution

This work is based on Fine-Tuning An Local LLM for Web Scraping by HelixCipher.
Original source: https://github.com/HelixCipher/fine-tuning-an-local-llm-for-web-scraping

Licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0).

You may place this attribution in a README, documentation, credits section, or other visible location appropriate to the medium.

Full license text: https://creativecommons.org/licenses/by/4.0/

Citation

@software{job_posting_extractor,
  author = {HelixCipher},
  title = {Job Posting Extractor - Qwen2.5-3B Fine-tuned Model},
  year = {2026},
  url = {https://huggingface.co/HelixCipher/job-posting-extractor-qwen}
}