The Great Classification Showdown: OSS vs BERT on Consumer Hardware

Community Article · Published January 26, 2026

Can you really train production-grade AI models on the same GPU that runs your Steam library? Spoiler: yes, and your electricity bill will thank you.

HELIOS-01 Gaming PC with RGB lighting

Meet HELIOS-01, my home office's secondary heating system that occasionally trains ML models.

The Setup: A Tale of Skeptical Colleagues

Picture this: a typical Monday morning standup. The team is discussing how to classify thousands of multilingual customer support messages per week. Multiple tags per message. Four languages. The usual enterprise chaos.

"Just use ChatGPT," someone suggests. "Classification is a solved problem."

And they're not wrong! API-based solutions work great... until you start doing the math on processing 1000+ messages weekly across four languages. Then there's the latency, the vendor lock-in, the data privacy concerns, and that nagging feeling that you're paying someone else to run model.predict() on your data.

So I did what any reasonable ML engineer would do: I went home and asked my GPU to prove a point.

The Challenge

EuroChef+ Logo

For this experiment, I created EuroChef+, a fictional European culinary streaming platform (think Netflix, but every show ends with someone saying "bon appétit" in four different languages). Their customer support team deals with:

  • 4 languages: English, French, Dutch, and German
  • 15 possible tags per message: technical issues, billing, content requests, urgency levels, user types, emotional states
  • Multi-label classification: A single message might be tagged as [technical_issue, urgent, premium_user, frustrated]
  • Real-world messiness: Typos, franglais, varying message lengths, and the occasional CAPS LOCK customer

The dataset (1000 synthetic messages) was generated using OpenAI and Gemini APIs with carefully crafted prompts to mimic real customer support scenarios.

A Note on Synthetic Data

"But wait," I hear you say, "synthetic data? Isn't that cheating?"

Fair question. The dataset deliberately includes imperfect labels, typos, and realistic edge cases—because real customer data is messy too. The goal wasn't to create a perfect benchmark, but to simulate what you'd actually encounter in production: customers who write "J'ai un problème avec le streaming!!!" while also being tagged as both frustrated and premium_user. The labels were generated with guidance but not manually verified—just like when you're bootstrapping a classification system with limited annotation budget.

The Contenders

In the Red Corner: GPT-OSS-20B with LoRA

GPT-OSS-20B

OpenAI's GPT-OSS-20B is a 21-billion parameter model that punches way above its weight class. Thanks to its Mixture-of-Experts (MoE) architecture and MXFP4 quantization, this beast fits in ~16GB of VRAM.

MXFP4, you say? It's a quantization format from the Open Compute Project's Microscaling spec, which OpenAI uses to ship GPT-OSS weights at roughly 4 bits per parameter. The "MX" stands for Microscaling—a block-wise quantization approach where small groups of weights share a scaling factor, preserving more precision than naive 4-bit quantization. The result? A 20B model that plays nice with consumer GPUs without completely destroying your inference quality.

A good article on MXFP4: https://huggingface.co/blog/RakshitAralimatti/learn-ai-with-me
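
To make the microscaling idea concrete, here's a toy Python sketch of block-wise 4-bit quantization. It mimics the E2M1 value grid and the shared per-block scale, but it is a simplified illustration, not the actual MXFP4 kernel—rounding conventions and storage layout differ in real implementations.

```python
import numpy as np

# E2M1 (4-bit float) magnitudes: the only absolute values an FP4 element can take
FP4_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_mxfp4(block):
    """Quantize-dequantize one block of 32 weights sharing a single power-of-two scale."""
    amax = np.abs(block).max() + 1e-12
    # Shared scale chosen so the largest weight in the block fits the FP4 range (<= 6.0)
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_MAGNITUDES[-1]))
    mags = np.abs(block) / scale
    # Snap each scaled magnitude to the nearest representable FP4 value
    nearest = FP4_MAGNITUDES[np.abs(mags[:, None] - FP4_MAGNITUDES[None, :]).argmin(axis=1)]
    return np.sign(block) * nearest * scale

weights = np.random.randn(4096).astype(np.float32)
dequantized = np.concatenate([fake_mxfp4(b) for b in weights.reshape(-1, 32)])
print("mean abs quantization error:", np.abs(weights - dequantized).mean())
```

The key point: because each group of 32 weights gets its own scale, an outlier in one block doesn't wreck the precision of every other block.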

For fine-tuning, I used **LoRA (Low-Rank Adaptation)**—a technique that freezes the original weights and trains small adapter matrices instead. The math is elegant:

W' = W + BA

Where W is your frozen 4096×4096 weight matrix, and B (4096×r) and A (r×4096) are your tiny trainable adapters. With rank r = 8, that's 2 × 4096 × 8 ≈ 65K trainable parameters per projection instead of the original ~16.8 million, roughly 0.4%. That's a massive discount on your VRAM bill.

LoRA Config:

LORA_R = 8
LORA_ALPHA = 16
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
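
For reference, here's a minimal sketch of how that config plugs into PEFT. The model ID is the public `openai/gpt-oss-20b` checkpoint; the dropout value and everything beyond the three settings above are illustrative guesses, not the exact setup from this experiment.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", device_map="auto")

lora_config = LoraConfig(
    r=8,                 # LORA_R: rank of the B and A adapter matrices
    lora_alpha=16,       # LORA_ALPHA: scaling applied to the BA update
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,   # assumed value, not from the post
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only a tiny fraction of weights will train
```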

In the Blue Corner: mDeBERTa-v3-base

mDeBERTa

Microsoft's mDeBERTa-v3-base is the multilingual version of DeBERTa—a BERT variant with disentangled attention and enhanced mask decoder. At 278M parameters, it's roughly 75x smaller than our LLM contender.

The "m" stands for multilingual, supporting 100+ languages out of the box. Perfect for our European customer base who might write "Mon streaming ne marche pas!!!" or "Der Download funktioniert nicht" in the same ticket queue.

Training approach: Full fine-tuning with weighted BCE loss to handle class imbalance. Some labels like enterprise appear in only 7 samples, while premium_user shows up 83 times—the weighted loss helps the model not ignore the rare classes.
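
A minimal sketch of that weighted-loss setup, assuming multi-hot label vectors. The per-label counts below are placeholders except the two mentioned above, and the exact weighting scheme used in the experiment may differ:

```python
import torch
from torch import nn
from transformers import AutoModelForSequenceClassification

NUM_LABELS = 15
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/mdeberta-v3-base",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid per label instead of softmax
)

# pos_weight = negatives / positives per label, so rare tags like `enterprise`
# cost more when missed. Only the first two counts come from the post.
label_counts = torch.tensor([7.0, 83.0] + [50.0] * (NUM_LABELS - 2))
num_train = 900.0  # roughly the training split size mentioned later in the post
pos_weight = (num_train - label_counts) / label_counts
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# During training: loss = loss_fn(outputs.logits, multi_hot_targets.float())
```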

The Arena: HELIOS-01

All training and inference happened on my home-built machine, which I affectionately call HELIOS-01. Why? Because during training it pulls roughly 350 watts from the wall and heats my home office like a small sun.

Specs:

  • GPU: NVIDIA RTX 4090 (24GB VRAM)
  • The rest: Enough CPU and RAM to not bottleneck the GPU

No cloud. No datacenter. Just consumer hardware, RGB lights, and the comforting hum of fans spinning at 80%.

Training Time: The First Surprise

Before we even look at accuracy, let's talk about how long these models take to train:

| Model | Training Time |
|---|---|
| mDeBERTa-v3 | ~1.5 minutes |
| GPT-OSS-20B + LoRA | ~25 minutes |

Yes, you read that right. The BERT model trained in ninety seconds. The LLM took roughly 17x longer. This isn't a knock on the LLM—it's a 20B-parameter model being fine-tuned to generate label text token by token—but it's a reality check for anyone thinking bigger is always better.

The Results: Let's Get to the Numbers

After training both models on the same dataset split, here's how they performed on 127 test messages:

Understanding the Metrics

Before diving into the numbers, here's a quick primer on what each metric tells us:

  • Precision: Of all the labels the model predicted, how many were actually correct? High precision means fewer false alarms.
  • Recall: Of all the labels that should have been predicted, how many did the model actually find? High recall means the model doesn't miss much.
  • F1 Score: The harmonic mean of precision and recall—a single number that balances both. An F1 of 1.0 is perfect, 0.0 is terrible.
  • F1 Micro: Pools all predictions across all labels together, then calculates the F1 score. This gives more weight to frequent labels and reflects overall performance.
  • F1 Macro: Calculates F1 for each label separately, then averages them equally. This treats rare labels (like enterprise with only 7 samples) as equally important as common ones.
  • Exact Match: The strictest metric—it only counts a prediction as correct if every single label for a message is predicted perfectly. Getting 14 out of 15 labels right still counts as a miss. This is why exact match scores look lower than F1 scores.
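
If you want to compute the same metrics on your own predictions, scikit-learn covers all of them. The arrays below are illustrative multi-hot matrices (one column per tag), not the actual test set, and the averaging choice for precision/recall is an assumption:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])  # gold multi-hot labels
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])  # model predictions

print("F1 micro:   ", f1_score(y_true, y_pred, average="micro"))
print("F1 macro:   ", f1_score(y_true, y_pred, average="macro"))
print("Precision:  ", precision_score(y_true, y_pred, average="micro"))
print("Recall:     ", recall_score(y_true, y_pred, average="micro"))
print("Exact match:", accuracy_score(y_true, y_pred))  # subset accuracy for multi-label
```

With those definitions out of the way, here are the numbers: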
| Metric | mDeBERTa-v3 | GPT-OSS-20B (Base) | GPT-OSS-20B (LoRA) |
|---|---|---|---|
| F1 Micro | 0.810 | 0.575 | 0.802 |
| F1 Macro | 0.810 | 0.557 | 0.781 |
| Precision | 0.761 | 0.679 | 0.808 |
| Recall | 0.865 | 0.499 | 0.796 |
| Exact Match | 0.354 | 0.008 | 0.409 |
| Latency (ms/sample) | 4.3 | 8,199 | 740 |
| Throughput (samples/s) | 235 | 0.12 | 1.35 |

The Verdict

mDeBERTa wins on speed and matches on accuracy. The BERT-based approach achieves slightly better F1 scores while being ~174x faster than the fine-tuned LLM at inference time.

But wait—there's nuance here:

  1. GPT-OSS-20B (LoRA) nails exact matches better: 40.9% of predictions were perfectly correct across all 15 labels, compared to 35.4% for mDeBERTa. When the LLM gets it right, it really gets it right.

  2. Base LLM without fine-tuning is... not great: The base GPT-OSS-20B achieved only 57.5% F1 micro and a dismal 0.8% exact match. This isn't surprising—the model wasn't trained to output JSON with specific tag names. Fine-tuning matters, even for large models.

  3. Higher precision vs. higher recall: The LoRA model has better precision (0.808 vs 0.761), while mDeBERTa has better recall (0.865 vs 0.796). Depending on your use case—"don't miss anything" vs "don't cry wolf"—one might be preferable.

Performance by Language

Both models handle all four languages reasonably well:

| Language | mDeBERTa F1 | LoRA F1 |
|---|---|---|
| German | 0.797 | 0.871 |
| English | 0.830 | 0.805 |
| Dutch | 0.824 | 0.795 |
| French | 0.804 | 0.794 |

The LLM particularly excels at German (perhaps MoE experts specializing in that language?), while mDeBERTa is more consistent across the board.

Per-Label Deep Dive

Some labels are just harder than others:

| Label | mDeBERTa F1 | LoRA F1 | Notes |
|---|---|---|---|
| enterprise | 1.000 | 0.933 | Only 7 samples—mDeBERTa nailed it |
| feature_request | 0.923 | 0.906 | Both excellent |
| urgent | 0.480 | 0.629 | LLM understands urgency context better |
| frustrated | 0.677 | 0.630 | Both struggle with sentiment |
| aggressive | 0.750 | 0.571 | mDeBERTa handles emotions better overall |
| low_priority | 0.844 | 0.800 | Both good at "meh" detection |

The LLM handles urgent better—perhaps its reasoning capabilities help understand time-sensitivity context like "guests arriving in 30 minutes and the app won't load!" Meanwhile, mDeBERTa absolutely crushes the rare enterprise label.

What This Means

Choose mDeBERTa if:

  • You need speed (235 samples/second vs 1.35)
  • You're processing high volumes of text
  • Your GPU budget is tight (trains in 90 seconds!)
  • You want a simpler deployment story
  • Latency matters (4ms vs 740ms per request)

Choose the LLM + LoRA approach if:

  • You need the model to generalize beyond the training distribution
  • Your classification task might evolve (LLMs can be re-prompted for new labels)
  • You value exact match accuracy over aggregate metrics
  • You're already using the LLM for other tasks anyway
  • You need to explain the classification (LLMs can be prompted to justify)

Consider a hybrid approach?

Use mDeBERTa for fast, bulk classification, and escalate edge cases (low-confidence predictions) to the LLM for a second opinion. Potentially the best of both worlds, though I haven't tested it yet.
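
A rough sketch of what that routing could look like, assuming a trained multi-label mDeBERTa model; the thresholds and the `escalate_to_llm()` helper are hypothetical placeholders, not code from this experiment:

```python
import torch

def route(message, bert_model, tokenizer, low=0.35, high=0.65):
    """Classify with mDeBERTa; hand off to the LLM if any label sits near the decision boundary."""
    inputs = tokenizer(message, return_tensors="pt", truncation=True)
    probs = torch.sigmoid(bert_model(**inputs).logits)[0]
    ambiguous = ((probs > low) & (probs < high)).any()
    if ambiguous:
        return escalate_to_llm(message)  # hypothetical second-opinion call to the LoRA model
    return (probs > 0.5).int().tolist()  # confident enough: keep the fast path
```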

Room for Improvement

Both models could likely do better with:

  • More data: 900 training samples is on the smaller side
  • Hyperparameter tuning: I used reasonable defaults, not exhaustively searched
  • Better class balancing: The weighted loss helped, but techniques like oversampling could help more
  • Longer training: The LoRA model could probably benefit from more epochs
  • Threshold tuning: For mDeBERTa, tuning per-label decision thresholds improved results slightly (a sketch follows this list)
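
Here's a minimal sketch of per-label threshold tuning on a validation split, assuming multi-hot `val_labels` and sigmoid `val_probs` arrays (the names and candidate grid are illustrative, not the exact procedure used here):

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(val_probs, val_labels, candidates=np.arange(0.10, 0.91, 0.05)):
    """Pick, for each label, the threshold that maximises that label's validation F1."""
    thresholds = []
    for j in range(val_labels.shape[1]):
        scores = [f1_score(val_labels[:, j], (val_probs[:, j] > t).astype(int)) for t in candidates]
        thresholds.append(candidates[int(np.argmax(scores))])
    return np.array(thresholds)

# At inference time: preds = (test_probs > thresholds).astype(int)
```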

This experiment was about proving feasibility, not squeezing out every last percentage point.

The Real MVP: The Hugging Face Ecosystem

None of this would be possible without the incredible open-source ecosystem.

The fact that I can fine-tune a 20B parameter model on consumer hardware using a few lines of config is genuinely wild. Five years ago this would have required a small datacenter.

Try It Yourself

Everything from this experiment is open and available:

Models:

Dataset:

Code:

Final Thoughts

So, can consumer hardware handle production ML tasks? Absolutely. The combination of:

  • Efficient model architectures (MoE, disentangled attention)
  • Smart quantization (MXFP4)
  • Parameter-efficient fine-tuning (LoRA)
  • A GPU that can also play Cyberpunk at 4K

...means the barrier to entry for custom ML solutions has never been lower.

My colleagues were right—classification is a solved problem. But the solution doesn't have to involve sending your data to an API endpoint and paying per token. Sometimes, the answer is 2 minutes of training on a machine that doubles as a space heater.

The results speak for themselves:

  • mDeBERTa: Fast, accurate, trains in 90 seconds, perfect for high-volume classification
  • GPT-OSS-20B + LoRA: Slightly better at exact matches, more flexible, but 174x slower at inference

Choose your fighter based on your constraints. Or better yet—use both.

Now if you'll excuse me, HELIOS-01 and I have some more experiments to run. The office is getting cold.


Built with lots of coffee and the occasional "CUDA out of memory" error on HELIOS-01. All code, models, and data are available on GitHub and Hugging Face. Special thanks to the Hugging Face team for making this democratization of ML possible.
