Job ACOS Extractor v4 - Deployment

A fine-tuned Qwen3.5-2B model for extracting structured ACOS (Action-Context-Outcome-Skill) data from job descriptions.

Quick Start

1. Clone the repo (with Git LFS for model weights)

git lfs install
git clone https://huggingface.co/team-loxo/jd-acos-extractor-v4
cd jd-acos-extractor-v4

The model weights (model/model.safetensors, ~3.8 GB) and model/tokenizer.json are stored in Git LFS and are downloaded automatically during the clone, provided Git LFS is installed.

2. Create a Python virtual env (Python 3.10+ recommended)

python3 -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install --upgrade pip

3. Install dependencies

Important: Use the PyTorch CUDA 12.8 wheel index. The default PyPI torch is built for CUDA 13 and won't run on common NVIDIA drivers (CUDA 12.x).

pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu128

Requirements:

NVIDIA GPU with CUDA driver ≥ 12.4 (e.g., RTX 30/40/50-series, A100, H100)
~10 GB free disk for model + Python deps
~6 GB GPU VRAM for inference (BF16)

4. Run examples

# Single example (uses tests/jd1.txt, prints extraction)
python run.py run

# Test suite: 3 examples + baseline match check
python run.py test

Expected output for python run.py test:

Running 3 test examples...
...
SUMMARY
  Examples run:     3
  Total time:       <a few seconds>s
  Avg time/example: <a few seconds>s
  Baseline matches: 3/3

Troubleshooting

Problem	Fix
`model.safetensors` is a small text file (~150 bytes)	LFS not pulled. Run `git lfs install && git lfs pull`
`RuntimeError: NVIDIA driver ... too old`	Reinstall torch with `--extra-index-url https://download.pytorch.org/whl/cu128`
Out-of-memory on GPU	Use `python run.py run` (single example) instead of batched workloads, or set `ACOS_DEVICE=cpu`
`ModuleNotFoundError: No module named 'transformers'`	The virtual env is not activated. Run `source .venv/bin/activate`

Output Schema

The model outputs a JSON object with exactly 3 fields:

{
  "core_responsibilities": ["Design ML pipelines", "Collaborate with data team"],
  "hard_requirements": ["Python", "ML frameworks", "distributed systems"],
  "bonus_skills": ["PyTorch", "TensorFlow", "Kubernetes"]
}

Field	Type	Description
`core_responsibilities`	`list[str]`	Primary duties and day-to-day responsibilities
`hard_requirements`	`list[str]`	Core skills and technologies required (skill names only, no experience levels)
`bonus_skills`	`list[str]`	Preferred or "nice-to-have" qualifications

Note: The model extracts skill names only, not experience requirements. For example:

Input: "5+ years Python experience" → Output: "Python"
Input: "Experience with ML frameworks (e.g., PyTorch)" → Output: "ML frameworks" (with PyTorch in bonus_skills)
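The three-field schema above can be checked mechanically before results are passed downstream. The following is a minimal sketch, not part of this repo: `validate_acos` and `EXPECTED_FIELDS` are hypothetical names, and the function assumes the output arrives either as a JSON string or as an already-parsed dict.

```python
import json

# Field names taken from the schema table above.
EXPECTED_FIELDS = ("core_responsibilities", "hard_requirements", "bonus_skills")


def validate_acos(payload):
    """Check a model output against the three-field schema.

    Accepts a JSON string or a dict (hypothetical helper, not the repo's API).
    Returns the parsed dict, or raises ValueError on any mismatch.
    """
    if isinstance(payload, str):
        try:
            payload = json.loads(payload)
        except json.JSONDecodeError as exc:
            raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("top-level value must be a JSON object")
    if set(payload) != set(EXPECTED_FIELDS):
        raise ValueError(f"expected fields {EXPECTED_FIELDS}, got {sorted(payload)}")
    for field in EXPECTED_FIELDS:
        value = payload[field]
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{field} must be a list of strings")
    return payload
```

A caller could wrap each extraction result in `validate_acos(...)` and route any `ValueError` to a review queue instead of letting it fail silently.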

Project Structure

deploy/
├── model.py          # Model interface and loader (BF16)
├── config.py         # Paths and configuration
├── run.py            # Orchestration (run/test commands)
├── requirements.txt  # Dependencies
├── model/            # Model weights (downloaded from HF)
│   ├── model.safetensors
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   ├── config.json
│   └── generation_config.json
└── tests/            # Test examples and baseline
    ├── jd1.txt
    ├── jd2.txt
    ├── jd3.txt
    └── baseline.json

API Usage

from model import load_model

# Load model (singleton, BF16)
extractor = load_model()

# Extract from job description
jd_text = """
Senior Software Engineer - Machine Learning

Requirements:
- 5+ years Python experience
- Experience with ML frameworks (e.g., PyTorch, TensorFlow)
"""

result = extractor.extract(jd_text)
print(result)
# {
#   "core_responsibilities": [...],
#   "hard_requirements": ["Python", "ML frameworks"],
#   "bonus_skills": ["PyTorch", "TensorFlow"]
# }

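# Batch extraction (recommended for production)
from pathlib import Path
# Load the three bundled test JDs here; substitute your own list of strings
jd_texts = [Path(f"tests/jd{i}.txt").read_text(encoding="utf-8") for i in (1, 2, 3)]
results = extractor.extract_batch(jd_texts, batch_size=128)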

Production Alignment

This deployment uses the same inference-optimized configuration that produced the production metrics below (91.8% entity-level F1, 5.89 samples/sec at batch size 128).

Component	Value	Notes
System prompt	382 chars / 78 tokens	Inference-optimized (shorter than training prompt for speed)
User message format	`"Extract structured data...{jd_text}"`	Matches eval/benchmark configuration
MAX_LENGTH	1,500 tokens	Matches benchmark setup
Chat template	qwen (chat_template.jinja)	Matches training
Tokenizer	Qwen3.5-2B	Matches training
Precision	BF16	Matches training

The model is robust to prompt variations: the shorter inference prompt achieves 91.8% F1 with significantly faster throughput than the original 1,471-char training prompt would.

Performance

Entity-Level Metrics (each sample = one entity, n=2,051)

Metric	Value
Precision	91.8%
Recall	91.8%
F1	91.8%

How this is computed:

1. Compute item-level F1 between the model prediction and the gold label for each sample.
2. Non-hard failures (F1 ≥ 0.5): 1,613 samples → TP.
3. Hard failures (F1 < 0.5, n=438) are sent to GPT-5.5 with the full JD for adjudication:
- A (157): Model is genuinely wrong → FP/FN
- B (121): Gold is wrong, model correct → TP
- BOTH_OK (21): Both valid → TP
- NEITHER (111): Both have problems → excluded (ambiguous)
- Judge ERROR (28): excluded
4. TP = 1,613 + 121 + 21 = 1,755, FP = FN = 157
5. Precision = Recall = F1 = 1,755 / (1,755 + 157) = 91.8%

Precision and recall are identical at the entity level because each sample produces one extraction event: a wrong extraction counts as both an FP and an FN for that sample.
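The arithmetic above can be reproduced in a few lines. This is only a worked check of the reported counts, with variable names chosen for illustration; it is not the evaluation script itself.

```python
# Counts taken from the adjudication summary above.
non_hard = 1613        # samples with item-level F1 >= 0.5, counted as TP
verdict_b = 121        # gold wrong, model correct -> TP
verdict_both_ok = 21   # both valid -> TP
verdict_a = 157        # model genuinely wrong -> one FP and one FN each

tp = non_hard + verdict_b + verdict_both_ok
fp = fn = verdict_a

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, f"{precision:.1%}", f"{recall:.1%}", f"{f1:.1%}")
# prints: 1755 91.8% 91.8% 91.8%
```

Because FP and FN are the same count, precision and recall share a denominator, and F1 (their harmonic mean) equals both.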

Hard Failure Breakdown (n=438, GPT-5.5 adjudicated)

Verdict	Count	% of failures	Meaning
A	157	35.8%	Real model errors
B	121	27.6%	Gold label has spurious items, model OK
BOTH_OK	21	4.8%	Both acceptable
NEITHER	111	25.3%	Both have problems
Judge ERROR	28	6.4%	Adjudication failed

Item-Level Metrics (raw, no GPT correction)

Field	Precision	Recall	F1
core_responsibilities	78.6%	72.6%	75.5%
hard_requirements	65.4%	46.8%	54.6%
bonus_skills	71.8%	39.6%	51.1%
Overall	73.2%	58.3%	64.9%

Item-level recall is dragged down by gold-label issues: in 27.6% of hard failures (verdict B), the gold label contained items that are not in the JD. True item-level recall is therefore likely higher.
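For readers who want to reproduce item-level numbers on their own data, here is a minimal sketch of a set-based precision/recall/F1. The matching rule (case-insensitive exact match after stripping whitespace) is an assumption for illustration; the matching used in the actual evaluation is not described here. `item_prf` and `f1_from` are hypothetical helper names.

```python
def item_prf(predicted, gold):
    """Precision, recall and F1 over items.

    Assumption: items match on case-insensitive exact string equality.
    """
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Two of three gold items recovered, no spurious predictions.
p, r, f1 = item_prf(
    ["Python", "ML frameworks"],
    ["python", "ML frameworks", "distributed systems"],
)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # prints: P=1.00 R=0.67 F1=0.80

# The "Overall" row of the table is consistent with its own P and R.
print(f"{f1_from(0.732, 0.583):.1%}")  # prints: 64.9%
```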

Speed Benchmarks (BF16)

Single RTX 5090:

Batch Size	Samples/sec	Latency P50
16	0.94	1046ms
32	1.51	657ms
64	2.65	381ms
128 (optimal)	5.89	168ms

Multi-GPU Production (8 × RTX 5090): 19.18 samples/sec (~1.66M samples/day)
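The daily figure follows directly from the per-second throughput. A short sketch of the conversion, plus a helper for estimating how long a backlog would take (both function names are illustrative, and the estimate assumes sustained throughput with no downtime):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(samples_per_sec):
    """Convert sustained throughput to a daily volume."""
    return samples_per_sec * SECONDS_PER_DAY


def hours_to_process(n_samples, samples_per_sec):
    """Wall-clock hours needed for a backlog at a given throughput."""
    return n_samples / samples_per_sec / 3600


multi_gpu = samples_per_day(19.18)   # 8 x RTX 5090 figure
single_gpu = samples_per_day(5.89)   # single RTX 5090 at batch size 128
print(f"{multi_gpu / 1e6:.2f}M/day, {single_gpu / 1e6:.2f}M/day")
# prints: 1.66M/day, 0.51M/day
```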

Other Metrics

Spec	Value
Model Size	3.76 GB
Precision	BF16
JSON Parse Rate	100%
Max New Tokens	400 (retry: 600)
Max Input Length	1500 tokens
Max JD Characters	2500
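The "Max New Tokens 400 (retry: 600)" row suggests a two-step generation budget. The sketch below shows one way such a retry could work, assuming the retry is triggered when the first output does not parse as a JSON object; the actual trigger in this repo is not documented here. `extract_with_retry` and `fake_generate` are hypothetical, and `generate` stands in for whatever function produces raw model text.

```python
import json
from typing import Callable, Optional


def extract_with_retry(
    generate: Callable[[str, int], str],
    prompt: str,
    token_budgets=(400, 600),
) -> Optional[dict]:
    """Try each token budget in turn; return the first output that parses
    as a JSON object, or None if every attempt fails."""
    for max_new_tokens in token_budgets:
        raw = generate(prompt, max_new_tokens)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # output was likely cut off; try the next budget
        if isinstance(parsed, dict):
            return parsed
    return None


def fake_generate(prompt: str, max_new_tokens: int) -> str:
    """Stand-in for the model: truncates its output unless given 600 tokens."""
    full = '{"core_responsibilities": [], "hard_requirements": ["Python"], "bonus_skills": []}'
    return full if max_new_tokens >= 600 else full[:20]


result = extract_with_retry(fake_generate, "example JD text")
print(result["hard_requirements"])  # prints: ['Python']
```

Keeping the first budget small keeps typical latency low, while the larger budget is only paid for on the rare long output.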

For the full breakdown including extraction & judge prompts, see v4_production_report.html.

Deployment Notes

Runtime: Loads from local model/ directory only (no remote HF fetch)
Precision: BF16 (optimal for RTX 5090 and modern GPUs)
Validation: Detects Git LFS pointers and requires real weights before running

Preflight Checks

Before running:

Verify model/model.safetensors exists and is not a Git LFS pointer
Install dependencies: pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu128
Confirm test files exist in tests/

# Check model file is real (not LFS pointer)
head -c 64 model/model.safetensors | xxd
# Should NOT show "version https://git-lfs"
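The same check can be done from Python, which is convenient on systems without `xxd`. This is a minimal sketch and not the repo's own validation code; `is_lfs_pointer` is a hypothetical helper that relies only on the fact that Git LFS pointer files begin with the text `version https://git-lfs`.

```python
from pathlib import Path

# Git LFS pointer files are small text files that begin with this header.
LFS_POINTER_PREFIX = b"version https://git-lfs"


def is_lfs_pointer(path) -> bool:
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


weights = Path("model/model.safetensors")
if not weights.exists():
    print("model/model.safetensors not found; run this from the repo root")
elif is_lfs_pointer(weights):
    print("model.safetensors is a Git LFS pointer; run `git lfs pull`")
else:
    print(f"weights look real ({weights.stat().st_size / 1e9:.2f} GB)")
```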

License

Internal use only. Contact team-loxo for licensing inquiries.
