# Job Posting Extractor (Qwen2.5-3B)

A fine-tuned version of Qwen2.5-3B-Instruct specialized in extracting structured JSON data from job postings. Built to replace expensive API calls for web-scraping tasks.
## What This Model Does

Given a job posting in markdown format, this model extracts structured JSON containing:

- `job_title`
- `company`
- `location`
- `description`
- `salary` (when available)
- `requirements` (when available)
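An illustrative output for a typical posting might look like the following (the field values are hypothetical, and whether absent fields are emitted as `null` or simply omitted depends on the training data):

```json
{
  "job_title": "Senior Python Developer",
  "company": "TechCorp",
  "location": "San Francisco, CA",
  "description": "We are looking for an experienced Python developer...",
  "salary": null,
  "requirements": null
}
```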
## Quick Start

```python
from unsloth import FastLanguageModel

# Load model from Hugging Face
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="HelixCipher/job-posting-extractor-qwen",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# Example input
job_markdown = """# Job Position
**Position:** Senior Python Developer
**Company:** TechCorp
**Location:** San Francisco, CA

## Job Description
We are looking for an experienced Python developer...
"""

# Extract JSON
messages = [
    {"role": "system", "content": "You are a JSON extraction assistant. Always output ONLY valid JSON."},
    {"role": "user", "content": f"Extract job fields as JSON.\n\n{job_markdown}"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.1)

# Decode only the newly generated tokens, not the echoed prompt
result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(result)
```
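The raw generation can include stray text around the JSON, so it is worth parsing defensively. A minimal sketch (a helper of my own, not part of the model card's API) that pulls the first JSON object out of the decoded string:

```python
import json


def extract_json(text: str) -> dict:
    """Parse the first JSON object found in model output.

    Tolerates stray text before or after the object, which small models
    occasionally emit despite the "ONLY valid JSON" instruction.
    """
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object in model output")
    decoder = json.JSONDecoder()
    # raw_decode parses one JSON value and ignores any trailing text
    obj, _ = decoder.raw_decode(text[start:])
    return obj
```

Typical usage would be `job = extract_json(result)` followed by field access such as `job["job_title"]`.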
## Training Details

- **Base Model:** unsloth/qwen2.5-3b-instruct-bnb-4bit
- **Training Data:** 12,000 job posting examples
- **Training Approach:** LoRA with Unsloth
- **Fine-tuning Library:** TRL (Transformer Reinforcement Learning)
## Framework Versions

- PEFT: 0.18.1
- TRL: 0.24.0
- Transformers: 4.57.6
- PyTorch: 2.10.0+cu126
## Use Cases

- Extract job postings from scraped websites
- Convert unstructured job listings to structured JSON
- Automate data collection for job aggregators
- Replace expensive LLM API calls with local inference
## Limitations

- Trained specifically on job postings; may not work well for other data types.
- Works best with markdown-formatted input (similar to html2text output).
- Maximum context: 2048 tokens.
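Because input beyond the 2048-token context is truncated, it can help to pre-trim long postings before prompting. A minimal sketch using a rough characters-per-token heuristic (the ~4 chars/token ratio and the `reserved` budget are assumptions; use the model's tokenizer for an exact count):

```python
def truncate_to_budget(markdown: str, max_tokens: int = 2048, reserved: int = 600) -> str:
    """Roughly cap input size so the prompt plus generated JSON fit the context window.

    Assumes ~4 characters per token (a common heuristic for English text);
    `reserved` leaves room for the chat template and the generated output.
    """
    max_chars = (max_tokens - reserved) * 4
    if len(markdown) <= max_chars:
        return markdown
    # Cut at the last line break before the budget to avoid splitting mid-line
    cut = markdown.rfind("\n", 0, max_chars)
    return markdown[:cut if cut > 0 else max_chars]
```

Short inputs pass through unchanged; only postings that would overflow the heuristic budget are trimmed.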
## License & Attribution
This project is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
You are free to use, share, copy, modify, and redistribute this material for any purpose (including commercial use), provided that proper attribution is given.
### Attribution requirements

Any reuse, redistribution, or derivative work must include:

- The creator's name: HelixCipher
- A link to the original repository: https://github.com/HelixCipher/fine-tuning-an-local-llm-for-web-scraping
- An indication of whether changes were made
- A reference to the license (CC BY 4.0)
### Example Attribution

> This work is based on *Fine-Tuning An Local LLM for Web Scraping* by HelixCipher.
> Original source: https://github.com/HelixCipher/fine-tuning-an-local-llm-for-web-scraping
> Licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
You may place this attribution in a README, documentation, credits section, or other visible location appropriate to the medium.
Full license text: https://creativecommons.org/licenses/by/4.0/
## Citation

```bibtex
@software{job_posting_extractor,
  author = {HelixCipher},
  title  = {Job Posting Extractor - Qwen2.5-3B Fine-tuned Model},
  year   = {2026},
  url    = {https://huggingface.co/HelixCipher/job-posting-extractor-qwen}
}
```