Job ACOS Extractor v4 - Deployment
A fine-tuned Qwen3.5-2B model for extracting structured ACOS (Action-Context-Outcome-Skill) data from job descriptions.
Quick Start
1. Clone the repo (with Git LFS for model weights)
git lfs install
git clone https://huggingface.co/team-loxo/jd-acos-extractor-v4
cd jd-acos-extractor-v4
The model weights (model/model.safetensors, ~3.8 GB) and model/tokenizer.json are
stored in Git LFS and are downloaded automatically by the clone.
2. Create a Python virtual env (Python 3.10+ recommended)
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
3. Install dependencies
Important: Use the PyTorch CUDA 12.8 wheel index. The default PyPI torch build targets CUDA 13 and won't run on NVIDIA drivers that only support CUDA 12.x.
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu128
Requirements:
- NVIDIA GPU with CUDA driver ≥ 12.4 (e.g., RTX 30/40/50-series, A100, H100)
- ~10 GB free disk for model + Python deps
- ~6 GB GPU VRAM for inference (BF16)
4. Run examples
# Single example (uses tests/jd1.txt, prints extraction)
python run.py run
# Test suite: 3 examples + baseline match check
python run.py test
Expected output for python run.py test:
Running 3 test examples...
...
SUMMARY
Examples run: 3
Total time: <a few seconds>s
Avg time/example: <a few seconds>s
Baseline matches: 3/3
Troubleshooting
| Problem | Fix |
|---|---|
| model.safetensors is a small text file (~150 bytes) | LFS not pulled. Run git lfs install && git lfs pull |
| RuntimeError: NVIDIA driver ... too old | Reinstall torch with --extra-index-url https://download.pytorch.org/whl/cu128 |
| Out-of-memory on GPU | Use python run.py run (single example) instead of batched workloads, or set ACOS_DEVICE=cpu |
| ModuleNotFoundError: transformers | Forgot to activate venv: source .venv/bin/activate |
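The ACOS_DEVICE override mentioned in the table could be read along the following lines. This is only a sketch: how config.py actually parses the variable is not shown in this README, so the helper name resolve_device, the accepted values ("cpu" and "cuda"), the default, and the silent CPU fallback are all assumptions.

```python
import os


def resolve_device(env=None, cuda_available=True):
    """Pick "cpu" or "cuda" from ACOS_DEVICE (hypothetical helper, not the repo's code)."""
    env = os.environ if env is None else env
    # Assumed default: use the GPU unless the variable says otherwise.
    requested = env.get("ACOS_DEVICE", "cuda").strip().lower()
    if requested not in {"cpu", "cuda"}:
        raise ValueError(f"Unsupported ACOS_DEVICE value: {requested!r}")
    # Assumed behavior: fall back to CPU when no CUDA device is usable.
    if requested == "cuda" and not cuda_available:
        return "cpu"
    return requested


print(resolve_device({"ACOS_DEVICE": "cpu"}))  # cpu
```

In a real loader, cuda_available would typically come from torch.cuda.is_available().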
Output Schema
The model outputs a JSON object with exactly 3 fields:
{
"core_responsibilities": ["Design ML pipelines", "Collaborate with data team"],
"hard_requirements": ["Python", "ML frameworks", "distributed systems"],
"bonus_skills": ["PyTorch", "TensorFlow", "Kubernetes"]
}
| Field | Type | Description |
|---|---|---|
| core_responsibilities | list[str] | Primary duties and day-to-day responsibilities |
| hard_requirements | list[str] | Core skills and technologies required (skill names only, no experience levels) |
| bonus_skills | list[str] | Preferred or "nice-to-have" qualifications |
Note: The model extracts skill names only, not experience requirements. For example:
- Input: "5+ years Python experience" → Output: "Python"
- Input: "Experience with ML frameworks (e.g., PyTorch)" → Output: "ML frameworks" (with PyTorch in bonus_skills)
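Downstream code may want to confirm that a raw model response really has the three-field shape described above before using it. The following is a minimal sketch of such a check; validate_extraction is a hypothetical helper, not part of this repo, and it assumes the response arrives as a JSON string.

```python
import json

EXPECTED_FIELDS = ("core_responsibilities", "hard_requirements", "bonus_skills")


def validate_extraction(raw):
    """Parse a JSON string and check it matches the three-field schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    # "Exactly 3 fields": no missing keys and no extras.
    if set(data) != set(EXPECTED_FIELDS):
        raise ValueError(f"unexpected fields: {sorted(data)}")
    for field in EXPECTED_FIELDS:
        values = data[field]
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise ValueError(f"{field} must be a list of strings")
    return data


sample = (
    '{"core_responsibilities": ["Design ML pipelines"], '
    '"hard_requirements": ["Python"], "bonus_skills": ["PyTorch"]}'
)
print(validate_extraction(sample)["hard_requirements"])  # ['Python']
```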
Project Structure
deploy/
├── model.py              # Model interface and loader (BF16)
├── config.py             # Paths and configuration
├── run.py                # Orchestration (run/test commands)
├── requirements.txt      # Dependencies
├── model/                # Model weights (downloaded from HF)
│   ├── model.safetensors
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   ├── config.json
│   └── generation_config.json
└── tests/                # Test examples and baseline
    ├── jd1.txt
    ├── jd2.txt
    ├── jd3.txt
    └── baseline.json
API Usage
from model import load_model
# Load model (singleton, BF16)
extractor = load_model()
# Extract from job description
jd_text = """
Senior Software Engineer - Machine Learning
Requirements:
- 5+ years Python experience
- Experience with ML frameworks (e.g., PyTorch, TensorFlow)
"""
result = extractor.extract(jd_text)
print(result)
# {
# "core_responsibilities": [...],
# "hard_requirements": ["Python", "ML frameworks"],
# "bonus_skills": ["PyTorch", "TensorFlow"]
# }
# Batch extraction (recommended for production)
jd_texts = [jd_text]  # List of job description strings (add more as needed)
results = extractor.extract_batch(jd_texts, batch_size=128)
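To make the batch_size parameter concrete, the sketch below shows how a list of inputs splits into chunks of at most 128. Whether extract_batch chunks its input this way internally is not stated here; iter_batches is a hypothetical helper used only to illustrate the arithmetic.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder inputs standing in for real job descriptions.
jd_texts = [f"job description {i}" for i in range(300)]
sizes = [len(batch) for batch in iter_batches(jd_texts, 128)]
print(sizes)  # [128, 128, 44]
```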
Production Alignment
This deployment uses the same inference-optimized configuration that produced the production metrics below (91.8% entity-level F1, 5.89 samples/sec at batch size 128).
| Component | Value | Notes |
|---|---|---|
| System prompt | 382 chars / 78 tokens | Inference-optimized (shorter than training prompt for speed) |
| User message format | "Extract structured data...{jd_text}" | Matches eval/benchmark configuration |
| MAX_LENGTH | 1,500 tokens | Matches benchmark setup |
| Chat template | qwen (chat_template.jinja) | Matches training |
| Tokenizer | Qwen3.5-2B | Matches training |
| Precision | BF16 | Matches training |
The model is robust to prompt variations: the shorter inference prompt achieves 91.8% F1 with significantly faster throughput than the original 1,471-char training prompt would.
Performance
Entity-Level Metrics (each sample = one entity, n=2,051)
| Metric | Value |
|---|---|
| Precision | 91.8% |
| Recall | 91.8% |
| F1 | 91.8% |
How this is computed:
- Compute item-level F1 between model prediction and gold label for each sample.
- Non-hard failures (F1 ≥ 0.5): 1,613 samples → TP.
- Hard failures (F1 < 0.5, n=439) sent to GPT-5.5 with full JD for adjudication:
  - A (157): Model is genuinely wrong → FP/FN
  - B (121): Gold is wrong, model correct → TP
  - BOTH_OK (21): Both valid → TP
  - NEITHER (111): Both have problems → excluded (ambiguous)
  - Judge ERROR (28): excluded
- TP = 1,613 + 121 + 21 = 1,755, FP = FN = 157
- Precision = Recall = F1 = 1,755 / (1,755 + 157) = 91.8%
P and R converge at entity-level because each sample produces one extraction event: a wrong extraction simultaneously counts as both FP and FN for that sample.
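The arithmetic above can be reproduced in a few lines. This sketch only restates the counts given in this section; the variable names are illustrative.

```python
# Counts taken from the adjudication steps described above.
non_hard_tp = 1613   # samples with item-level F1 >= 0.5
verdict_a = 157      # model genuinely wrong: counted as both FP and FN
verdict_b = 121      # gold wrong, model correct: counted as TP
both_ok = 21         # both valid: counted as TP

tp = non_hard_tp + verdict_b + both_ok
fp = fn = verdict_a

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, f"{precision:.1%}", f"{recall:.1%}", f"{f1:.1%}")  # 1755 91.8% 91.8% 91.8%
```

Because FP and FN are the same count, precision and recall are identical and F1 equals both.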
Hard Failure Breakdown (n=439, GPT-5.5 adjudicated)
| Verdict | Count | % of failures | Meaning |
|---|---|---|---|
| A | 157 | 35.8% | Real model errors |
| B | 121 | 27.6% | Gold label has spurious items, model OK |
| BOTH_OK | 21 | 4.8% | Both acceptable |
| NEITHER | 111 | 25.3% | Both have problems |
| Judge ERROR | 28 | 6.4% | Adjudication failed |
Item-Level Metrics (raw, no GPT correction)
| Field | Precision | Recall | F1 |
|---|---|---|---|
| core_responsibilities | 78.6% | 72.6% | 75.5% |
| hard_requirements | 65.4% | 46.8% | 54.6% |
| bonus_skills | 71.8% | 39.6% | 51.1% |
| Overall | 73.2% | 58.3% | 64.9% |
Item-level recall is dragged down by gold-label issues: in 27.6% of hard failures, the gold label contained items that are not in the JD. True item-level recall is therefore likely higher.
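For readers unfamiliar with item-level scoring, the sketch below computes precision, recall, and F1 for one field of one sample. The matching rule is an assumption: this README does not say how predicted and gold items were matched, so exact match after trimming and case-folding is used purely for illustration, and item_prf is a hypothetical helper.

```python
def item_prf(predicted, gold):
    """Item-level precision, recall, and F1 for one field of one sample."""
    # Assumed matching rule: exact match after trimming and case-folding.
    pred = {p.strip().casefold() for p in predicted}
    ref = {g.strip().casefold() for g in gold}
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# The model found 2 of the 3 gold items and predicted nothing spurious.
p, r, f1 = item_prf(["Python", "ML frameworks"], ["python", "ML frameworks", "SQL"])
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=1.00 R=0.67 F1=0.80
```

Under this rule, a gold label containing an item that is absent from the JD lowers recall even when the model's output is correct, which is the effect described above.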
Speed Benchmarks (BF16)
Single RTX 5090:
| Batch Size | Samples/sec | Latency P50 |
|---|---|---|
| 16 | 0.94 | 1046ms |
| 32 | 1.51 | 657ms |
| 64 | 2.65 | 381ms |
| 128 (optimal) | 5.89 | 168ms |
Multi-GPU Production (8 × RTX 5090): 19.18 samples/sec (~1.66M samples/day)
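The daily figure follows directly from the measured rate. The sketch below reproduces that conversion and adds a rough time estimate for a job of a given size; hours_for is a hypothetical helper, and it assumes the measured rate is sustained for the whole run.

```python
SECONDS_PER_DAY = 24 * 60 * 60

multi_gpu_rate = 19.18  # samples/sec, 8-GPU production figure quoted above

per_day = multi_gpu_rate * SECONDS_PER_DAY
print(f"{per_day:,.0f} samples/day")  # 1,657,152 samples/day


def hours_for(n_samples, rate_per_sec):
    """Estimated wall-clock hours to process n_samples at a steady rate."""
    return n_samples / rate_per_sec / 3600


print(f"{hours_for(1_000_000, multi_gpu_rate):.1f} hours for 1M job descriptions")
```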
Other Metrics
| Spec | Value |
|---|---|
| Model Size | 3.76 GB |
| Precision | BF16 |
| JSON Parse Rate | 100% |
| Max New Tokens | 400 (retry: 600) |
| Max Input Length | 1500 tokens |
| Max JD Characters | 2500 |
For the full breakdown, including extraction and judge prompts, see v4_production_report.html.
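One plausible reading of "Max New Tokens: 400 (retry: 600)" is that generation runs first with a 400-token budget and is repeated with 600 tokens if the output does not parse as JSON. The sketch below illustrates that pattern under this assumption. The real retry logic lives in the repo's code and may differ; extract_with_retry and the stand-in fake_generate are hypothetical.

```python
import json


def extract_with_retry(generate, prompt, budgets=(400, 600)):
    """Call generate(prompt, max_new_tokens) with growing budgets until JSON parses."""
    last_error = None
    for max_new_tokens in budgets:
        text = generate(prompt, max_new_tokens)
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            # Output was most likely cut off mid-object; try the next budget.
            last_error = err
    raise ValueError(f"no parseable JSON after budgets {budgets}") from last_error


def fake_generate(prompt, max_new_tokens):
    """Stand-in for the model: truncates its output when the budget is small."""
    full = '{"core_responsibilities": [], "hard_requirements": ["Python"], "bonus_skills": []}'
    return full if max_new_tokens >= 600 else full[:40]


result = extract_with_retry(fake_generate, "Extract structured data...")
print(result["hard_requirements"])  # ['Python']
```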
Deployment Notes
- Runtime: Loads from local model/ directory only (no remote HF fetch)
- Precision: BF16 (optimal for RTX 5090 and modern GPUs)
- Validation: Detects Git LFS pointers and requires real weights before running
Preflight Checks
Before running:
- Verify model/model.safetensors exists and is not a Git LFS pointer
- Install dependencies: pip install -r requirements.txt
- Confirm test files exist in tests/
# Check model file is real (not LFS pointer)
head -c 64 model/model.safetensors | xxd
# Should NOT show "version https://git-lfs"
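The same check can be scripted in Python, which is convenient on systems without xxd. This is a sketch rather than the validation code the repo itself uses; is_lfs_pointer is a hypothetical helper. It relies on the fact that Git LFS pointer files begin with the line "version https://git-lfs.github.com/spec/v1".

```python
from pathlib import Path

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


weights = Path("model/model.safetensors")
if not weights.exists():
    print("weights file is missing")
elif is_lfs_pointer(weights):
    print("weights are an LFS pointer; run: git lfs install && git lfs pull")
else:
    print("weights look like a real file")
```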
License
Internal use only. Contact team-loxo for licensing inquiries.