Joby IT Service Desk — Gemma 4 31B (Fine-Tuned, Tool-Aware)
A LoRA fine-tune of Google's Gemma 4 31B-it (dense, multimodal) specialized for Joby Aviation's internal IT service desk workflows, with native tool-calling preserved through a mixed Joby + general-purpose function-calling training corpus.
Private, internal model. Weights contain Joby-specific references (Jira project keys, Confluence KB content, internal hostnames, infrastructure conventions). Distributed only to Joby engineering staff via the PRE installer. Not for public release.
- Repo (this card): sunkencity/joby-it-servicedesk (GGUF, private)
- Adapter repo: sunkencity/joby-it-servicedesk-lora (private)
- Data pipeline: https://github.com/sunkencity999/joby-datasets (private)
- Released: 2026-05-30 (v0.4)
- Maintainer: Christopher Bradford, Systems Administration / AI Engineering, Joby Aviation
v0.4 — tool-protocol fix
v0.3 fabricated tool responses inline: it would emit a valid tool call and then immediately roleplay the tool's reply and a final answer in one generation (response:NAME{value:<|"|>{...}<|"|>}<tool_response|>The status is...). Root cause was a training-format defect — Gemma 4's stock chat_template.jinja packs the tool call, tool response, and assistant follow-up into a single <|turn>model block, and the LoRA learned to imitate that one-shot shape.
v0.4 corrects it at the source:
- New renderer (joby_chat_template.py) puts the tool call, tool response, and assistant follow-up in separate <|turn> blocks (<|tool_call>...<tool_call|><turn|> → <|turn>tool ... <turn|> → fresh <|turn>model). Same Gemma 4 native special tokens, corrected boundaries.
- Pre-rendered training data (build_mixed_dataset.py emits {"text": ...} records) bypasses MLX-LM's default chat-template application.
- Wider LoRA: attention + FFN rank-32 (q/v/gate/up/down) instead of attention-only rank-16. The "preserve FFN tool-call pathway" hypothesis was wrong; that pathway was already broken.
- Best-val checkpoint shipped: iter 6,750 of the retrain (val 0.638).
Inference: Ollama's stock gemma4 renderer doesn't reproduce the corrected turn structure, so PRETeams routes this model through /api/generate raw mode with a JS port of joby_chat_template (preteams:src/joby-template.js). Base PRE keeps using /api/chat for other models.
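The corrected turn boundaries can be illustrated with a minimal Python sketch. It is not joby_chat_template.py: the function name, the newline after the role, and the JSON payload inside the tool-call tokens are assumptions made for illustration. Only the special tokens and the one-block-per-step layout come from the description above.

```python
import json

# Special tokens as quoted in the v0.4 notes above.
TURN_OPEN, TURN_CLOSE = "<|turn>", "<turn|>"
CALL_OPEN, CALL_CLOSE = "<|tool_call>", "<tool_call|>"


def render_turns(messages, add_generation_prompt=True):
    """Render each message as its own turn block (hypothetical helper)."""
    blocks = []
    for msg in messages:
        # Gemma names the assistant role "model"; other roles pass through.
        role = "model" if msg["role"] == "assistant" else msg["role"]
        if msg.get("tool_calls"):
            # Assumption: the call payload is shown as JSON for readability.
            body = "".join(
                CALL_OPEN + json.dumps(call, sort_keys=True) + CALL_CLOSE
                for call in msg["tool_calls"]
            )
        else:
            body = msg["content"]
        blocks.append(f"{TURN_OPEN}{role}\n{body}{TURN_CLOSE}\n")
    if add_generation_prompt:
        # A fresh model turn, so the follow-up is generated separately.
        blocks.append(f"{TURN_OPEN}model\n")
    return "".join(blocks)


# Hypothetical conversation; the tool name and ticket key are illustrative.
conversation = [
    {"role": "user", "content": "What is the status of IT-123?"},
    {"role": "assistant", "tool_calls": [
        {"name": "jira_get_issue", "arguments": {"key": "IT-123"}}]},
    {"role": "tool", "content": '{"status": "Waiting for Customer"}'},
]
prompt = render_turns(conversation)
print(prompt)
```

The point of the layout is that the model turn ends right after the tool call, so generation stops there and the tool's reply arrives in a block the model did not write.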
1. At a glance
| Base | google/gemma-4-31B-it (dense, multimodal, 31B params) via mlx-community/gemma-4-31b-it-bf16 |
| Method | Parameter-efficient LoRA — attention + FFN, top-16 layers |
| Adapter size | 211 MB (~53.0M trainable params, ~0.173% of base) |
| Training data | 11,062 examples pre-rendered to text — 72% Joby IT / 25% Glaive function-calling / 3% PRE tool-use traces |
| Tool-call coverage | 28.1% of examples carry structured tool_calls (rendered into the trained text via the new template) |
| Hardware | M4 Max, 128 GB unified memory (Apple Silicon, MLX-LM) |
| Training time | ~30 min initial run (2,250 iters before OOM) + ~5.5 h resume with grad_checkpoint=true (10,000 more iters) |
| Best checkpoint | iter 6,750 of the resume (val loss 0.638, used for the fused model) |
| Final-iter val | 0.843 — not released; train loss had drifted to ~0.57, mild memorization |
| Tool-call probe | Turn 1 emits clean structured calls and stops; turn 2 grounds on real tool results (e.g. "Waiting for Customer" → correct status + assignee) |
| Released quants | f16 ( |
| License | Gemma Terms of Use (base) + internal-only on derivative weights |
2. What it's good at
- Native tool-calling preserved. Smoke-tests at 3 / 3 on PRE-style probes (date, bash, memory_search), the same activation rate as the unmodified base model. Designed to operate as a tool-augmented agent, not a closed-book oracle.
- Joby IT vocabulary and patterns. Internal terminology, ticket structure, and resolution conventions for license provisioning, network ergonomics, hardware lifecycle, account ops, and the day-to-day Joby IT helpdesk shape. Strong familiarity with the systems landscape: Jira (IT / DHLP / JSD projects), Confluence (ITKB space), D365, Smartsheet, Active Directory / Entra ID, Intune / Jamf, VPN / SSO / SAML, M365.
- Structured technical responses. Clean step-by-step procedures, properly fenced shell snippets, headers when length warrants. Inherits Gemma 4's strong instruction-following on top of the Joby-specific stylistic prior learned from agent replies.
- Multi-turn agent loops. Mixed in 241 real PRE tool-use traces during training, so the model has seen the full ChatML shape (system → user → assistant{tool_calls} → tool{result} → assistant) and handles multi-step plans without falling out of tool format.
- Long-context aware. Inherits Gemma 4's 256 K (262,144) token position embeddings. In practice PRE deploys at 8K–128K depending on RAM headroom (see PRE's context sizing table); the model itself is not the bottleneck.
3. What it isn't
- Not a substitute for live tool calls. Joby-specific facts (current ticket IDs, URLs, account numbers, AWS account IDs, network ranges, on-call rotations) must come from confluence.search, jira.get_issue, smartsheet.*, etc., at inference time, not from the model's baked-in knowledge. The mixed-data recipe intentionally weakened memorization in favor of tool-use behavior. Expect hallucinated URLs, ticket IDs, and account references if the model is run without tools.
- Not a general-purpose chatbot. Capability outside Joby's operational footprint is no better than base Gemma 4 31B-it, and may be slightly worse in stylistic register (responses skew toward ticket-resolution prose).
- Not for incident response without human review. A senior IT staff member must validate any output that triggers operational change (account provisioning, group-membership changes, MFA resets, MDM commands).
- Not multimodal at inference. Although the base model is multimodal (image + audio tokens in its vocabulary), the fine-tune is text-only. Images submitted at inference will be tokenized but the model has no learned grounding for them in the Joby domain.
- Not bilingual. Training data is English-only. Spanish/Portuguese fallback is base-Gemma-quality at best.
4. Architecture (base model)
| Model type | gemma4 (dense, multimodal) |
| Parameters | 31 B |
| Hidden size | 5,376 |
| Layers | 60 transformer blocks |
| Attention heads | 32 query / 16 KV (GQA, 2:1) |
| Head dim | 256 |
| FFN intermediate | 21,504 |
| Sliding window | 1,024 (interleaved with global attention) |
| Vocab size | 262,144 (text + image + audio + control tokens) |
| Max position | 262,144 (256 K) |
| Tie embeddings | Yes |
| dtype (train) | bfloat16 |
Only the language-model trunk is touched by the LoRA. Vision and audio towers are frozen and effectively unused in this deployment.
5. Training data
A deliberately mixed corpus — 10,990 ChatML conversations. The mix is the central design choice of v0.3 and is what made tool-calling survive (see §9 Version History).
| Source | Count | Share | Role |
|---|---|---|---|
| Joby IT tickets + KB (Jira + Confluence, LLM-synthesized) | 7,949 | 72.3% | Domain knowledge + Joby register |
| Glaive function-calling v2 (glaiveai/glaive-function-calling-v2) | 2,800 | 25.5% | Tool-call shape & format |
| PRE session tool-use traces (~/.pre/sessions/) | 241 | 2.2% | Realistic multi-turn agent loops |
| Total | 10,990 | 100% | 90/10 train/val split |
27.6% of examples contain structured tool_calls fields. 72.4% are plain assistant-text completions.
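As a quick arithmetic check, the shares and the 90/10 split follow directly from the counts in the table. The snippet below is only a sketch for verifying the numbers; the dictionary keys are shorthand chosen here.

```python
# Counts taken from the corpus table above.
counts = {"joby": 7949, "glaive": 2800, "pre": 241}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# 90/10 train/val split over the full corpus.
val_size = round(total * 0.10)
train_size = total - val_size

print(total, shares, train_size, val_size)
```

The training-split size it produces matches the 9,891-example figure quoted in the training recipe.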
5.1 Joby tickets + KB (7,949)
Built by the joby-datasets pipeline (~/joby-datasets/). Each row is one Jira ticket or one Confluence section, rewritten by a local LLM (pre-gemma4 via Ollama) into a clean instruction → agent-style answer pair:
Extract         → Transform (synthesized) → Filter     → Format (ChatML)
Jira/Confluence   LLM-rewrite               dedup/trim   embed system prompt
- Jira JQL. project = IT AND statusCategory = Done AND resolved >= "-730d": the last two years of resolved IT tickets, paginated 100 at a time with retry on 429/5xx. Fields requested: summary, description, status, resolution, priority, issuetype, labels, components, created/updated/resolved dates, reporter, assignee, full comment thread.
- Confluence space. ITKB (IT Knowledge Base), 69 pages, recursive descent. Min page length 200 chars.
- Transform (synthesized mode). Each Jira ticket's full comment thread is rewritten by pre-gemma4 into a clean agent reply. Confluence pages are chunked at H1/H2 boundaries; each chunk is paired with LLM-generated questions to form instruction/answer pairs.
- Filter. Length bounds, one-liner rejection (fixed, duplicate, see IT-123), email-signature stripping, attachment-marker removal, fuzzy dedup with rapidfuzz, mojibake/non-ASCII bloat removal.
- System prompt baked in (ChatML): identifies the model as a Joby IT service-desk assistant.
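Two of the filter steps above, one-liner rejection and fuzzy dedup, can be sketched as follows. This is not the pipeline's code: the thresholds and sample rows are placeholders, and the standard library's difflib stands in for rapidfuzz so the example runs without extra dependencies.

```python
import re
from difflib import SequenceMatcher

# One-liner replies that carry no resolution detail (examples from the list above).
ONE_LINER = re.compile(r"^(fixed|duplicate|see [A-Z]+-\d+)\.?$", re.IGNORECASE)


def keep(answer, min_chars=40, max_chars=8000):
    """Length bounds plus one-liner rejection; both thresholds are placeholders."""
    text = answer.strip()
    if ONE_LINER.match(text):
        return False
    return min_chars <= len(text) <= max_chars


def fuzzy_dedup(answers, threshold=0.9):
    """Keep an answer only if it is not a near-duplicate of one already kept."""
    kept = []
    for text in answers:
        if all(SequenceMatcher(None, text, k).ratio() < threshold for k in kept):
            kept.append(text)
    return kept


# Hypothetical rows, invented for illustration.
rows = [
    "fixed",
    "see IT-123",
    "Reset the user's MFA registration in Entra ID, then ask them to re-enroll.",
    "Reset the user's MFA registration in Entra ID, then ask them to re-enrol.",
    "Reinstall the VPN profile from the company portal and reconnect.",
]
clean = fuzzy_dedup([r for r in rows if keep(r)])
print(clean)
```

The pairwise comparison is quadratic in the number of rows, which is tolerable at the scale of a few thousand tickets but would need blocking or hashing on a much larger corpus.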
5.2 Glaive function-calling v2 (2,800)
glaiveai/glaive-function-calling-v2 — the open-source canonical function-call training set. Streamed via ijson from the local HF cache; only conversations containing at least one <functioncall>{...}</functioncall> block are kept (non-tool turns are discarded). Glaive's idiosyncratic markup (USER: / A: / ASSISTANT: / FUNCTION RESPONSE: role tags, single-quote-inside-double-quote argument strings) is parsed by build_mixed_dataset.py::parse_glaive_chat into proper ChatML messages with structured tool_calls fields. 2,800 records were sampled with seed=42.
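The conversion described above can be sketched roughly as below. This is not parse_glaive_chat itself: the regular expressions, the sample transcript, and the exact tool_calls layout (arguments decoded into a dict) are assumptions made for illustration.

```python
import json
import re

# Role tags open each turn; the text up to the next tag is that turn's body.
ROLE_SPLIT = re.compile(r"^(USER|ASSISTANT|A|FUNCTION RESPONSE):\s*", re.MULTILINE)
# Arguments arrive as a single-quoted JSON string inside the outer object.
CALL = re.compile(
    r"""<functioncall>\s*\{\s*"name":\s*"([^"]+)",\s*"arguments":\s*'(.*?)'\s*\}""",
    re.DOTALL,
)
ROLES = {"USER": "user", "ASSISTANT": "assistant", "A": "assistant",
         "FUNCTION RESPONSE": "tool"}


def parse_chat(chat):
    """Turn role-tagged text into ChatML-style messages (illustrative only)."""
    pieces = ROLE_SPLIT.split(chat)[1:]  # [tag, text, tag, text, ...]
    messages = []
    for tag, text in zip(pieces[0::2], pieces[1::2]):
        text = text.replace("<|endoftext|>", "").strip()
        call = CALL.search(text) if ROLES[tag] == "assistant" else None
        if call:
            function = {"name": call.group(1), "arguments": json.loads(call.group(2))}
            messages.append({"role": "assistant", "content": "",
                             "tool_calls": [{"type": "function", "function": function}]})
        else:
            messages.append({"role": ROLES[tag], "content": text})
    return messages


# Hypothetical transcript in the markup style described above.
sample = """USER: Can you convert 100 USD to EUR?

ASSISTANT: <functioncall> {"name": "convert_currency", "arguments": '{"amount": 100, "from": "USD", "to": "EUR"}'} <|endoftext|>

FUNCTION RESPONSE: {"converted_amount": 85.0}

ASSISTANT: 100 USD is about 85 EUR. <|endoftext|>
"""
messages = parse_chat(sample)
print(json.dumps(messages, indent=2))
```

A line inside a reply that happens to start with a role tag would be split incorrectly here; a production parser needs a stricter rule for turn boundaries.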
5.3 PRE session tool-use traces (241)
Real conversation histories from ~/.pre/sessions/ exported via web/src/training.js's exportTrainingData({format:'chatml', minToolCalls:1}). PRE emits tool calls as inline <tool_call>{...}</tool_call> markup in assistant content; the build script (_convert_pre_record) promotes those to structured tool_calls fields so Gemma 4's chat template renders them with native function-call tokens.
These are the most "in-domain" tool-use examples in the mix — they contain real PRE tool names (bash, memory_search, confluence_search, jira_get_issue, calendar_list_events, apple_mail_compose, rag_search, …) and the real ChatML shape the deployed model will see.
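The promotion step can be pictured with a short sketch. It is not the _convert_pre_record implementation: the helper name, the JSON layout inside the <tool_call> markup, and the sample record are assumptions.

```python
import json
import re

# Inline markup that PRE writes into assistant content.
TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def promote_tool_calls(message):
    """Move inline <tool_call> JSON out of assistant content into tool_calls."""
    if message["role"] != "assistant":
        return message
    calls = [json.loads(raw) for raw in TOOL_CALL.findall(message["content"])]
    if not calls:
        return message
    remaining = TOOL_CALL.sub("", message["content"]).strip()
    return {"role": "assistant", "content": remaining, "tool_calls": calls}


# Hypothetical record; the payload layout is illustrative.
record = {
    "role": "assistant",
    "content": 'Checking the ticket. <tool_call>{"name": "jira_get_issue", '
               '"arguments": {"key": "IT-123"}}</tool_call>',
}
converted = promote_tool_calls(record)
print(converted)
```

Messages from other roles, and assistant messages without markup, pass through unchanged.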
5.4 Quality controls and what was not done
- No PII redaction. Joby IT tickets are confirmed PII-free at design time. Internal account IDs, hostnames, and infrastructure conventions are present in the weights — treat outputs as internal-classification material.
- No license/copyright scrubbing of Confluence content beyond the standard filter pipeline.
- No safety RLHF. This is a supervised LoRA only. The base model's RLHF/safety tuning is the only behavioral guardrail.
- No multimodal data. Text-only fine-tune; vision/audio pathways untouched.
6. Training recipe
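Full reproducible config in lora_config.yaml. Mixing script in build_mixed_dataset.py. The values in this section are the v0.3 recipe (attention-only, rank 16); v0.4 widened the LoRA to attention + FFN at rank 32 and resumed with gradient checkpointing, as described in the v0.4 notes above.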
| Knob | Value | Why |
|---|---|---|
| Base | mlx-community/gemma-4-31b-it-bf16 | Apple-native bf16, fast on MLX |
| Method | LoRA (PEFT) | 5.84M trainable / 31B base ≈ 0.019%; cheap, reversible |
| Target modules | self_attn.q_proj, self_attn.v_proj | Attention-only: leaves the FFN untouched. The FFN is where Gemma 4's tool-call routing primarily lives; disturbing it caused the v0.2 regression to 0/3 tool calls. |
| LoRA rank r | 16 | Sufficient for ~10K examples; rank ≥ 32 begins to memorize ticket IDs |
| LoRA alpha | 32 | 2:1 alpha:rank; effective LR multiplier = 2.0 |
| LoRA dropout | 0.05 | Mild regularization on noisy ticket data |
| Layers | top 16 of 60 (last quarter) | Instruction-following signal concentrates near the head |
| Sequence length | 4,096 tokens | Covers >99% of training rows after filter |
| Batch size | 1 | Dense 31B in bf16 — single batch peaks at 102.96 GB of unified memory |
| Gradient accumulation | 1 (none) | M4 Max headroom permits |
| Optimizer | AdamW (MLX-LM default) | β1=0.9, β2=0.999, ε=1e-8 |
| Learning rate | 1.0 × 10⁻⁵ | Conservative for instruction tuning; no warmup, constant schedule |
| Iterations | 10,000 | ≈ 1 epoch over 9,891-example training split |
| Seed | 42 | Deterministic split + sampling |
| Gradient checkpointing | off | Not memory-bound at batch=1 on 128 GB |
| Eval cadence | every 250 iters, 50 val batches | ≈ 63% of held-out val per eval |
| Checkpoint cadence | every 250 iters | 40 intermediate snapshots saved |
6.1 Why attention-only LoRA
The v0.2 fine-tune (Joby-only, full LoRA targeting q/k/v/o + gate/up/down) collapsed tool-calling to 0 / 3. The Joby corpus has zero tool-call traces, and full LoRA strongly overfit the FFN toward plain-text replies, overriding the base model's learned tool-call routing.
v0.3 fixes this two ways: (a) mix in 27.6% structured tool-call examples so the model actually sees tool calls during training, and (b) restrict LoRA to attention q_proj / v_proj only, leaving the FFN — where tool routing lives — completely untouched. The two changes are complementary; ablating either reproduces the regression.
6.2 Hardware & throughput
- M4 Max, 128 GB unified memory, macOS, MLX-LM trainer (Apple Silicon native, Metal-backed).
- Sustained throughput: ~120 tokens/sec, ~0.4 iter/sec (batch 1, seq 4096).
- Peak memory: 102.96 GB unified (≈80% of 128 GB; macOS swap headroom kept the system responsive).
- Wall time: ~5 hours for 10,000 iters, ~80 sec/eval for 50 val batches.
6.3 Training dynamics
Validation loss is evaluated every 250 iters. The best checkpoint is iter 9,250 (val 0.678). The final iteration's spike to 1.077 is consistent with noisy batch-size-1 updates on a mixed corpus and is not what was released; the published GGUF for this run uses iter 9,250.
| Iter | Val | Iter | Val | Iter | Val |
|---|---|---|---|---|---|
| 1 | 6.075 | 3,500 | 1.096 | 7,000 | 0.762 |
| 250 | 1.082 | 3,750 | 1.148 | 7,250 | 0.867 |
| 500 | 0.955 | 4,000 | 0.900 | 7,500 | 0.912 |
| 750 | 0.840 | 4,250 | 0.971 | 7,750 | 0.897 |
| 1,000 | 1.246 | 4,500 | 1.067 | 8,000 | 1.024 |
| 1,250 | 1.020 | 4,750 | 0.979 | 8,250 | 0.841 |
| 1,500 | 0.851 | 5,000 | 0.954 | 8,500 | 0.740 |
| 1,750 | 0.964 | 5,250 | 0.972 | 8,750 | 1.022 |
| 2,000 | 1.010 | 5,500 | 0.958 | 9,000 | 1.045 |
| 2,250 | 0.892 | 5,750 | 0.944 | 9,250 | 0.678 |
| 2,500 | 1.169 | 6,000 | 1.107 | 9,500 | 0.810 |
| 2,750 | 0.838 | 6,250 | 0.903 | 9,750 | 0.964 |
| 3,000 | 0.808 | 6,500 | 0.784 | 10,000 | 1.077 |
| 3,250 | 0.995 | 6,750 | 0.956 | | |
Loss is noisy because batch size = 1 is the dominant variance source. The trend is clearly downward through ~iter 9,250 with the floor stepping down through 0.840 → 0.808 → 0.762 → 0.740 → 0.678.
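Best-checkpoint selection and the stepping floor can be reproduced from the table with a few lines of Python. This is a sketch over the transcribed values, not part of evaluate.py.

```python
# Validation losses transcribed from the table above (iteration -> val loss).
val = {
    1: 6.075, 250: 1.082, 500: 0.955, 750: 0.840, 1000: 1.246, 1250: 1.020,
    1500: 0.851, 1750: 0.964, 2000: 1.010, 2250: 0.892, 2500: 1.169, 2750: 0.838,
    3000: 0.808, 3250: 0.995, 3500: 1.096, 3750: 1.148, 4000: 0.900, 4250: 0.971,
    4500: 1.067, 4750: 0.979, 5000: 0.954, 5250: 0.972, 5500: 0.958, 5750: 0.944,
    6000: 1.107, 6250: 0.903, 6500: 0.784, 6750: 0.956, 7000: 0.762, 7250: 0.867,
    7500: 0.912, 7750: 0.897, 8000: 1.024, 8250: 0.841, 8500: 0.740, 8750: 1.022,
    9000: 1.045, 9250: 0.678, 9500: 0.810, 9750: 0.964, 10000: 1.077,
}

# Best checkpoint: the iteration with the lowest validation loss.
best_iter = min(val, key=val.get)

# Running floor: every point at which a new minimum was reached.
floor, steps = float("inf"), []
for it in sorted(val):
    if val[it] < floor:
        floor = val[it]
        steps.append((it, floor))

print(best_iter, val[best_iter])
print(steps)
```

The running floor also passes through 0.838 (iter 2,750) and 0.784 (iter 6,500), two intermediate steps that the summary above leaves out.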
7. Evaluation
Run with evaluate.py after ollama create joby-it-servicedesk -f Modelfile.joby:
python evaluate.py --model joby-it-servicedesk:q8_0 --base pre-gemma4
7.1 Tool-calling smoke test
Three prompts that have no overlap with the training corpus, designed to probe whether native tool activation survived the LoRA. Each prompt is paired with the minimum tool schema (bash, date, memory_search) the model needs to respond correctly.
| Probe | Expected tool | v0.3 adapted | base pre-gemma4 |
|---|---|---|---|
| "What time is it right now?" | date | ✓ | ✓ |
| "List the files in my home directory." | bash | ✓ | ✓ |
| "What do you remember about my role at Joby?" | memory_search | ✓ | ✓ |
| Activation rate | | 3 / 3 | 3 / 3 |
The fine-tune matches the base model's tool-activation rate. This was the binary success criterion that v0.1 and v0.2 failed.
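Scoring the smoke test amounts to checking, per probe, whether the expected tool appears among the calls the model emitted. The sketch below is not evaluate.py: the function name and the parsed-response shape are assumptions, and the responses are made-up stand-ins for real model output.

```python
# Probes and expected tools from the table above.
PROBES = [
    ("What time is it right now?", "date"),
    ("List the files in my home directory.", "bash"),
    ("What do you remember about my role at Joby?", "memory_search"),
]


def activation_rate(responses):
    """Count probes whose response calls the expected tool."""
    hits = 0
    for (_prompt, expected), response in zip(PROBES, responses):
        called = {call["name"] for call in response.get("tool_calls", [])}
        hits += expected in called
    return hits, len(PROBES)


# Hypothetical model outputs, already parsed into a tool_calls list.
responses = [
    {"tool_calls": [{"name": "date", "arguments": {}}]},
    {"tool_calls": [{"name": "bash", "arguments": {"command": "ls ~"}}]},
    {"tool_calls": [{"name": "memory_search", "arguments": {"query": "role at Joby"}}]},
]
hits, total = activation_rate(responses)
print(f"{hits} / {total}")
```

A response that answers in plain text, with no tool_calls entry, counts as a miss for that probe.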
7.2 Held-out validation loss
| Checkpoint | Val loss | Notes |
|---|---|---|
| Iter 9,250 (released) | 0.678 | Best of 40 checkpoints |
| Iter 8,500 | 0.740 | Second-best |
| Iter 7,000 | 0.762 | First sub-0.80 sustained |
| Iter 10,000 | 1.077 | Final-iter — not released |
7.3 Domain knowledge probes
Free-form, no-tool generation. Used as a qualitative sanity check (correctness must still be verified against live Confluence/Jira):
- How do I request a Fusion 360 license at Joby?
- Where is the IT Knowledge Base in Confluence?
- What's the Jira project key for the IT service desk?
- How do I connect to the Joby VPN?
These produce on-format, Joby-styled answers. They are not authoritative — the model may invent URLs, ticket numbers, or KB titles. Always couple with a real confluence.search / jira.get_issue tool call before acting.
7.4 What is not evaluated here
- Exact-match accuracy on closed-book Joby facts. Deliberately not measured, because we want the model to defer to tools.
- Toxicity / safety. Inherits Gemma 4's RLHF; no separate red-teaming for the adapter.
- Long-context comprehension >32K. Inherits Gemma 4's 256K positional embeddings but was trained at max_seq_length=4096. Behavior at 32K–128K context is governed by the base; expect graceful degradation, not measured here.
8. Files in this repo
| File | Size | Purpose |
|---|---|---|
| joby-it-servicedesk.q8_0.gguf | ~30 GB | Primary artifact. Matches PRE's default quantization for Apple Silicon with ≥28 GB VRAM/unified-memory headroom. |
| joby-it-servicedesk.q4_K_M.gguf | ~17 GB | Lower-VRAM variant for Intel Macs with 16 GB eGPU, Windows boxes with smaller GPUs, or any setup where q8 won't fit fully on-device. |
| joby-it-servicedesk.f16.gguf | ~57 GB | Full-precision reference. Use as the source for custom requants. |
| Modelfile.joby | n/a | Ollama Modelfile (sampling defaults match PRE's engine/Modelfile) |
Companion adapter repo (sunkencity/joby-it-servicedesk-lora) ships:
- adapter_model.safetensors: 22.27 MB LoRA weights (rank-16, attention-only).
- lora_config.yaml: exact MLX-LM config used to train.
9. Version history
| Version | Date | Base | Corpus | Tool-call probe | Status |
|---|---|---|---|---|---|
| v0.1 | 2026-05-15 | Gemma 4 26B-A4B MoE | Joby-only (7,154 ex.) | n/a — couldn't ship | Aborted — MLX-LM's fused MoE renames experts.switch_glu.* tensors in a way llama.cpp's converter doesn't recognize. No GGUF, no Ollama deployment path. |
| v0.2 | 2026-05-16 | Gemma 4 31B dense | Joby-only (7,154 ex.) | 0 / 3 | Regressed. GGUF conversion succeeded, but the Joby-only corpus has zero tool-call traces and the full LoRA overrode the base model's tool-call routing. |
| v0.3 | 2026-05-17 | Gemma 4 31B dense | Mixed (Joby + Glaive + PRE, 10,990 ex.) | 3 / 3 | Released. Mixed corpus + attention-only LoRA + top-16 layers — tool-calling restored, Joby knowledge retained. |
| v0.4 | 2026-05-30 | Gemma 4 31B dense | Mixed, pre-rendered to text (11,062 ex.) | Turn 1 emits clean structured calls and stops; turn 2 grounds on real tool results | Released. Tool-protocol fix: separate turn blocks for tool call, tool response, and follow-up; attention + FFN rank-32 LoRA; best-val checkpoint iter 6,750 (val 0.638). |
10. Usage
10.1 Ollama (recommended)
# Private repo — needs an HF token
export HF_TOKEN=<your-hf-token>
ollama pull hf.co/sunkencity/joby-it-servicedesk:q8_0
ollama run hf.co/sunkencity/joby-it-servicedesk:q8_0
10.2 Via PRE installer
cd ~/pre && ./install.sh
# Installer detects Joby access and offers to pull this model,
# aliasing it locally as `pre-gemma4-itsd`.
PRE then routes requests to it via the standard pre-gemma4 channel — same tool wiring, same ~/.pre/ data dir, same sessions.
10.3 Sampling defaults (Modelfile.joby)
| Parameter | Value | Note |
|---|---|---|
| num_ctx | 8,192 (default) | PRE sends num_ctx per-request; scales dynamically up to the installed limit. |
| num_batch | 512 | Faster prefill |
| temperature | 1.0 | Google's upstream Gemma default |
| top_k | 64 | Google's upstream Gemma default |
| top_p | 0.95 | Google's upstream Gemma default |
| min_p | 0.05 | Diversity floor added on top of Google's defaults |
| repeat_penalty | 1.1 | Loop suppression |
| repeat_last_n | 256 | |
For Q&A-style use with strict factual answers, override at runtime: temperature 0.2, drop top_p/top_k/min_p to defaults.
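One way to apply such an override per request is through the options field of Ollama's chat endpoint. The sketch below assumes a local Ollama server on the default port and the joby-it-servicedesk:q8_0 tag created earlier; the helper names are made up for illustration, and only temperature is overridden because the intended values for the other samplers are not specified here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local port


def build_request(question, model="joby-it-servicedesk:q8_0"):
    """Build a non-streaming chat request with a low-temperature override."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "options": {"temperature": 0.2},  # per-request override of the Modelfile value
        "stream": False,
    }


def send(payload):
    """POST the payload to a running Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]


payload = build_request("Where is the IT Knowledge Base in Confluence?")
print(json.dumps(payload, indent=2))
# With a local server running: print(send(payload))
```

Request-level options take precedence over Modelfile parameters for that call only, so the defaults in the table stay in place for other clients.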
10.4 Chat template
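The merged model carries Gemma 4's chat_template.jinja unchanged. Tool calls in the OpenAI/Glaive tool_calls shape are rendered by that template into native Gemma function-call tokens, so v0.3 clients needed no custom wrapper. v0.4 differs: it was trained on text pre-rendered by joby_chat_template.py, with the tool call, tool response, and assistant follow-up in separate turn blocks, and Ollama's stock gemma4 renderer does not reproduce that structure. PRETeams therefore sends prompts through /api/generate in raw mode using a JS port of the renderer (preteams:src/joby-template.js).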
11. Limitations & responsible use
- Internal references baked in. AWS account IDs, hostnames, internal URL patterns, and Joby-specific infrastructure conventions are present in the weights. Treat all outputs as internal-only material — equivalent to forwarding a Jira ticket excerpt.
- No PII redaction pass. The pipeline does not redact, because the source corpus is confirmed PII-free. If future Joby ticket data contains PII, add a redaction stage between transform and filter in ~/joby-datasets/.
- Hallucination of specifics is expected without tools. The training mix deliberately downweights memorized factual recall (URLs, ticket IDs, account numbers) in favor of tool-using behavior. Always run with confluence.search, jira.get_issue, smartsheet.*, rag.search available.
- No incident-response autonomy. Human review required for any action-taking output (account changes, MDM commands, permission grants).
- Knowledge cutoff = data cutoff. Jira pull is bounded to resolved >= "-730d" from the data extraction date (2026-05-15). Anything resolved before mid-2024 or after that snapshot is not represented.
- Gemma 4 license applies. Use of the weights is governed by the Gemma Terms of Use. Derivative weights inherit the license.
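The lower edge of the knowledge window follows from simple date arithmetic on the extraction date and the 730-day JQL bound. The variable names below are illustrative.

```python
from datetime import date, timedelta

extraction_date = date(2026, 5, 15)  # data extraction date stated above
window = timedelta(days=730)         # the JQL "-730d" bound

# Earliest resolution date that the Jira pull could have included.
earliest_resolved = extraction_date - window
print(earliest_resolved.isoformat())
```

The result falls in mid-May 2024, which is the "mid-2024" boundary referred to above.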
12. Reproducibility
The full pipeline lives in ~/joby-datasets/:
joby-datasets/
├── src/ # Jira/Confluence extraction + transform + filter + format
├── config.yaml # JQL queries, Confluence spaces, filter thresholds
├── training/
│ ├── build_mixed_dataset.py # Joby + Glaive + PRE mixer (this fine-tune's secret sauce)
│ ├── prepare_data.py # Stage ChatML into MLX-LM-expected layout
│ ├── lora_config.yaml # MLX-LM hyperparameters
│ ├── train.sh # mlx_lm.lora -c lora_config.yaml
│ ├── fuse.sh # mlx_lm.fuse adapter → base
│ ├── convert_to_gguf.sh # llama.cpp GGUF + quant
│ ├── evaluate.py # val loss + tool-call probe
│ └── publish_hf.sh # push to HF (private)
To rebuild end-to-end:
# 1. Extract
python -m src.extract_jira
python -m src.extract_confluence
# 2. Transform + filter + format
python -m src.transform --mode synthesized
python -m src.filter
python -m src.format --format chatml --split 0.1
# 3. Mix + stage
cd training && python build_mixed_dataset.py --glaive-n 2800 --seed 42
python prepare_data.py
# 4. Train (≈5 hours on M4 Max 128GB)
./train.sh
# 5. Fuse → convert → register → evaluate → publish
./fuse.sh
./convert_to_gguf.sh
ollama create joby-it-servicedesk -f Modelfile.joby
python evaluate.py --model joby-it-servicedesk:q8_0 --base pre-gemma4
./publish_hf.sh
13. Citation
@misc{joby_it_servicedesk_2026,
author = {Bradford, Christopher},
title = {Joby IT Service Desk — Gemma 4 31B (Fine-Tuned, Tool-Aware)},
year = {2026},
version = {0.3},
howpublished = {HuggingFace private repo: sunkencity/joby-it-servicedesk},
note = {Private internal Joby Aviation model. Derived from google/gemma-4-31B-it.}
}
14. Acknowledgements
- Google DeepMind for Gemma 4.
- MLX team (Apple) for MLX-LM — the only LoRA trainer that comfortably handles a dense 31B base in 128 GB unified memory.
- Glaive AI for glaive-function-calling-v2, the function-call corpus that made v0.3 possible.
- llama.cpp maintainers for keeping the Gemma 4 GGUF converter current.
- Joby IT for two years of clean, well-resolved service-desk tickets.