AlephBERT Hebrew Shopping Intent Classifier

Hebrew intent classification

This is a small Hebrew text classifier. You give it a short message like "תוסיף חלב וביצים" ("add milk and eggs") and it tells you what the person wants, out of 17 possible intents (add an item, show the list, clear the list, and so on). It is a fine-tune of AlephBERT, a Hebrew BERT model, and I built it as a learning project.

I am not an ML researcher, and this is not a Hebrew NLP benchmark. It is a worked example of turning one narrow problem into a small fine-tuned model and checking it honestly. The full tutorial and code are in the GitHub repo: github.com/spivi/alephbert-intent-he.

Important limitation: the train and test data are synthetic paraphrases, not real chat traffic. The split avoids leakage by holding out source seeds before paraphrasing, but read the numbers as a controlled experiment, not as proof of real-world accuracy.

Try it

from transformers import pipeline

clf = pipeline("text-classification", model="spivi87/alephbert-intent-he", top_k=3)
clf("תקנה ביצים")
# [{'label': 'GROCERY_REQUEST', 'score': 0.90}, ...]

You get readable label names back (not LABEL_0). The mapping from numbers to names is saved inside config.json. If you would rather run without PyTorch, the GitHub repo has a one-command ONNX export.

There is also a live demo you can try in the browser, with no setup: spivi87/alephbert-intent-he-demo.

What it is for

It sorts short Hebrew messages from a shopping or grocery chat into one of 17 intents: add an item, show the list, clear the list, a recipe link, group admin actions, and so on (the full list is in the table further down).

It also works as a starting point if you want to fine-tune your own Hebrew classifier for a different topic. The weights are already comfortable with short, informal Hebrew that has typos, emoji, and a little English mixed in.

A practical tip: I use about 0.7 as a confidence cutoff on the top prediction's score in the demo. I did not calibrate the probabilities, so treat it as an operational rule of thumb, not a proven threshold. When the top score falls below the cutoff, fall back to a larger model or treat the message as OTHER.
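The rule of thumb above can be sketched as a small routing function. This is a minimal sketch, not the demo's actual code: the names `route_intent` and `FALLBACK_LABEL`, and the choice to return OTHER rather than call a larger model, are assumptions for illustration. The input is a list of label/score dictionaries shaped like the pipeline output shown under "Try it".

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]

CONFIDENCE_THRESHOLD = 0.7  # heuristic from the text above, not a calibrated value
FALLBACK_LABEL = "OTHER"    # assumption: fall back to OTHER instead of a larger model


def route_intent(predictions: List[Prediction],
                 threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return the top label if its score reaches the threshold, else the fallback."""
    if not predictions:
        return FALLBACK_LABEL
    # Do not rely on the input being sorted; pick the highest score explicitly.
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else FALLBACK_LABEL


# Hypothetical scores, shaped like the pipeline output shown under "Try it".
confident = [{"label": "GROCERY_REQUEST", "score": 0.90},
             {"label": "REMOVE_ITEM", "score": 0.06}]
unsure = [{"label": "BUG_REPORT", "score": 0.41},
          {"label": "OTHER", "score": 0.38}]

print(route_intent(confident))  # GROCERY_REQUEST
print(route_intent(unsure))     # OTHER
```

Keeping the threshold as a parameter makes it easy to tune later against real traffic, which the synthetic test set cannot substitute for.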

Results

Mean over 3 training runs (seeds 42, 43, 44), on a held-out test set of 374 messages (22 per intent). The published checkpoint is the seed 42 run.

  • Accuracy: 0.773 ± 0.008
  • Macro F1: 0.763 ± 0.008

Per-intent metrics (seed 42)

| Intent | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| GROCERY_REQUEST | 0.800 | 0.909 | 0.851 | 22 |
| RECIPE_URL | 0.900 | 0.818 | 0.857 | 22 |
| LIST_QUERY | 0.760 | 0.864 | 0.809 | 22 |
| CLEAR_LIST | 0.722 | 0.591 | 0.650 | 22 |
| REMOVE_ITEM | 0.750 | 0.818 | 0.783 | 22 |
| PARTIAL_COMPLETION | 0.909 | 0.909 | 0.909 | 22 |
| GROUP_INFO | 1.000 | 0.545 | 0.706 | 22 |
| GET_INVITE_CODE | 0.786 | 1.000 | 0.880 | 22 |
| CREATE_INVITE | 0.619 | 0.591 | 0.605 | 22 |
| RENAME_GROUP | 0.957 | 1.000 | 0.978 | 22 |
| LEAVE_GROUP | 0.714 | 0.909 | 0.800 | 22 |
| NOTIFICATION_SETTINGS | 0.857 | 0.545 | 0.667 | 22 |
| REVOKE_INVITE | 0.808 | 0.955 | 0.875 | 22 |
| RECIPE_SEARCH | 0.808 | 0.955 | 0.875 | 22 |
| UPDATE_QUANTITY | 1.000 | 0.955 | 0.977 | 22 |
| BUG_REPORT | 0.375 | 0.273 | 0.316 | 22 |
| OTHER | 0.600 | 0.682 | 0.638 | 22 |

The full report and confusion matrix are in EVALUATION.md and confusion_matrix.png.
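Each F1 value in the per-intent table is the harmonic mean of that row's precision and recall. The sketch below recomputes three rows from the rounded numbers in the table. The helper name `f1_from_pr` is made up for this example, and because the inputs are rounded to three decimals, other rows may differ from the full report in the last digit.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) copied from the seed 42 table above.
rows = {
    "GROCERY_REQUEST": (0.800, 0.909),
    "GROUP_INFO": (1.000, 0.545),
    "BUG_REPORT": (0.375, 0.273),
}

for intent, (p, r) in rows.items():
    print(f"{intent}: F1 = {f1_from_pr(p, r):.3f}")
# GROCERY_REQUEST: F1 = 0.851
# GROUP_INFO: F1 = 0.706
# BUG_REPORT: F1 = 0.316
```

The GROUP_INFO row is a useful reminder that perfect precision does not rescue F1 when recall is low: the model is never wrong when it predicts that label, but it misses almost half of the true cases.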

How it compares to a zero-shot LLM

On the same 374-message test set:

| Approach | Accuracy | Cost per 1,000 messages |
|---|---|---|
| Random guessing | 0.0668 | $0 |
| Always the most common label | 0.0588 | $0 |
| Keyword rules (hand-written) | 0.2487 | $0 |
| GPT-4o-mini, zero-shot | 0.5722 | about $0.05 (estimate, see note) |
| AlephBERT fine-tune (this model, seed 42) | 0.7834 | $0 |

On this narrow synthetic grocery-intent task the fine-tune scored higher than my zero-shot GPT-4o-mini baseline on the same split. That does not mean it is a better general Hebrew model. It means that for a narrow, repeated task, a small task-specific model can be cheaper, private, and surprisingly effective. The GPT-4o-mini cost is an estimate for my zero-shot prompt and these short messages; OpenAI bills per token, not per message, so your cost will vary with prompt length, number of labels, and output format.
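Because billing is per token, a per-1,000-message figure has to be derived from token counts. The sketch below shows only that arithmetic. Every number in the example call is a placeholder chosen for illustration, not the prompt size measured for this card and not a quoted price, so substitute your own token counts and the provider's current rates.

```python
def cost_per_1000_messages(input_tokens_per_message: float,
                           output_tokens_per_message: float,
                           usd_per_million_input_tokens: float,
                           usd_per_million_output_tokens: float) -> float:
    """Estimate the USD cost of classifying 1,000 messages with a per-token API."""
    input_cost = input_tokens_per_message * usd_per_million_input_tokens / 1_000_000
    output_cost = output_tokens_per_message * usd_per_million_output_tokens / 1_000_000
    return 1000 * (input_cost + output_cost)


# Placeholder values only (hypothetical prompt size and hypothetical rates).
estimate = cost_per_1000_messages(
    input_tokens_per_message=300,    # instructions + label list + the message itself
    output_tokens_per_message=10,    # a single label name
    usd_per_million_input_tokens=0.15,
    usd_per_million_output_tokens=0.60,
)
print(f"${estimate:.3f} per 1,000 messages")
```

With a 17-label prompt, the instructions and label list usually dominate the input token count, so shortening the prompt moves the estimate more than shortening the messages.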

How it was trained (short version)

The training data is synthetic. I started from a handful of Hebrew example sentences per intent and asked an LLM to rewrite each one into many variations. After testing the first version I found gaps and fed the failures back into the data: "buy X" requests like "תקנה חלב" were misread, and short single-item requests like "תקנה ביצים" were confused with removing an item. I added hand-authored examples to cover the missing phrasings and retrained. No real user messages were used, so there is no private data. A sample is in the GitHub repo, under data/sample.jsonl.

The train/test split happens at the seed level, before the rewriting step. For each intent, 2 seeds are held out completely, and the test set contains only rewrites of those held-out seeds. Splitting after rewriting would let rewrites of the same sentence land on both sides, so the model would effectively have seen the test questions and the accuracy would look better than it really is.
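The order of operations matters here, so a small sketch may help. The code below holds out 2 seeds per intent first and only then generates rewrites for each side. The `paraphrase` function is a stand-in that fakes variations with string suffixes (the real project uses an LLM for this step), and the function names, the English example seeds, and the use of `random.Random(42)` are all assumptions for illustration, not the repo's actual code.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (text, intent)


def paraphrase(seed_sentence: str, n: int) -> List[str]:
    """Stand-in for the LLM rewriting step: returns n fake variations."""
    return [f"{seed_sentence} (variation {i})" for i in range(n)]


def split_then_paraphrase(seeds_by_intent: Dict[str, List[str]],
                          held_out_per_intent: int = 2,
                          variations: int = 3,
                          rng_seed: int = 42) -> Tuple[List[Example], List[Example]]:
    """Hold out seeds per intent first, then paraphrase each side separately."""
    rng = random.Random(rng_seed)
    train: List[Example] = []
    test: List[Example] = []
    for intent, seeds in seeds_by_intent.items():
        shuffled = list(seeds)
        rng.shuffle(shuffled)
        test_seeds = shuffled[:held_out_per_intent]
        train_seeds = shuffled[held_out_per_intent:]
        # Rewrites inherit the side of their source seed, so no seed spans both.
        for s in train_seeds:
            train += [(text, intent) for text in paraphrase(s, variations)]
        for s in test_seeds:
            test += [(text, intent) for text in paraphrase(s, variations)]
    return train, test


# Hypothetical English seeds; the real seeds are Hebrew sentences.
seeds = {
    "GROCERY_REQUEST": ["add milk", "buy eggs", "we need bread",
                        "get tomatoes", "put rice on the list"],
    "LIST_QUERY": ["show the list", "what is on the list", "what do we need",
                   "list please", "what are we missing"],
}
train, test = split_then_paraphrase(seeds)
print(len(train), len(test))  # 18 12
```

The point of the design is that the split decision is made on seeds, so every rewrite automatically lands on the same side as its source sentence.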

Training settings

| Setting | Value |
|---|---|
| Base model | onlplab/alephbert-base |
| Optimizer | AdamW (HF Trainer default) |
| Learning rate | 2e-5 (linear warmup, linear decay) |
| Batch size | 16 (train) / 32 (eval) |
| Max sequence length | 128 tokens |
| Max epochs | 10 (early stopping on eval_accuracy, patience 3) |
| Seeds | 42, 43, 44 (seed 42 is the published checkpoint) |

The full step-by-step instructions are in the GitHub repo.
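For readers who want to try a similar setup, the settings table maps roughly onto keyword arguments for Hugging Face `TrainingArguments`. This is a sketch under assumptions rather than the repo's actual script: the per-epoch evaluation and save cadence is a guess, the batch sizes are assumed to be per device, the warmup amount is not stated in the table so it is left out, and the evaluation-strategy argument is named `eval_strategy` in newer transformers releases and `evaluation_strategy` in older ones.

```python
# Keyword arguments intended for transformers.TrainingArguments.
# Values come from the settings table unless marked as an assumption.
training_kwargs = {
    "learning_rate": 2e-5,
    "lr_scheduler_type": "linear",        # linear decay after warmup
    "per_device_train_batch_size": 16,    # assumption: table value is per device
    "per_device_eval_batch_size": 32,     # assumption: table value is per device
    "num_train_epochs": 10,               # upper bound; early stopping may end sooner
    "seed": 42,                           # 43 and 44 for the other two runs
    "eval_strategy": "epoch",             # assumption; "evaluation_strategy" on older versions
    "save_strategy": "epoch",             # assumption; has to match the eval strategy
    "load_best_model_at_end": True,       # needed for EarlyStoppingCallback
    "metric_for_best_model": "eval_accuracy",
    "greater_is_better": True,
}

EARLY_STOPPING_PATIENCE = 3
MAX_SEQUENCE_LENGTH = 128  # pass as max_length (with truncation=True) when tokenizing

# Intended use (not executed here; needs transformers and a tokenized dataset):
#   args = TrainingArguments(output_dir="out", **training_kwargs)
#   trainer = Trainer(..., args=args, callbacks=[
#       EarlyStoppingCallback(early_stopping_patience=EARLY_STOPPING_PATIENCE)])
```

Keeping the settings in a plain dictionary makes it easy to loop over the three seeds by overriding only `seed` for each run.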

Labels

| ID | Label | What it means |
|---|---|---|
| 0 | GROCERY_REQUEST | Add items to the shopping list |
| 1 | RECIPE_URL | A recipe link: pull the ingredients from it |
| 2 | LIST_QUERY | Show the current shopping list |
| 3 | CLEAR_LIST | Mark everything bought and clear the list |
| 4 | REMOVE_ITEM | Remove one item from the list |
| 5 | PARTIAL_COMPLETION | Mark most items bought, except a few |
| 6 | GROUP_INFO | Show the group members and details |
| 7 | GET_INVITE_CODE | Get the existing group invite code |
| 8 | CREATE_INVITE | Generate a new group invite code |
| 9 | RENAME_GROUP | Change the group name |
| 10 | LEAVE_GROUP | Leave the current group |
| 11 | NOTIFICATION_SETTINGS | Toggle notification preferences |
| 12 | REVOKE_INVITE | Cancel a group invite code |
| 13 | RECIPE_SEARCH | Build a shopping list for a known dish |
| 14 | UPDATE_QUANTITY | Change the quantity of an item already on the list |
| 15 | BUG_REPORT | Report a problem with the bot |
| 16 | OTHER | A chat or off-topic message, not a shopping action |
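If you fine-tune your own variant, the label table translates directly into the `id2label` and `label2id` mappings that Hugging Face model configs use. The sketch below builds both from an ordered list. The variable names are illustrative, and the config.json shipped with the checkpoint remains the source of truth for this model.

```python
# Label names in ID order, copied from the label table above.
LABELS = [
    "GROCERY_REQUEST", "RECIPE_URL", "LIST_QUERY", "CLEAR_LIST",
    "REMOVE_ITEM", "PARTIAL_COMPLETION", "GROUP_INFO", "GET_INVITE_CODE",
    "CREATE_INVITE", "RENAME_GROUP", "LEAVE_GROUP", "NOTIFICATION_SETTINGS",
    "REVOKE_INVITE", "RECIPE_SEARCH", "UPDATE_QUANTITY", "BUG_REPORT", "OTHER",
]

id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}

print(len(id2label))      # 17
print(id2label[13])       # RECIPE_SEARCH
print(label2id["OTHER"])  # 16
```

Passing these as `id2label=` and `label2id=` (together with `num_labels=17`) to `AutoModelForSequenceClassification.from_pretrained` is the usual way to get readable names instead of LABEL_0 from the pipeline.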

Good to know before you rely on it

  • It was trained only on shopping and grocery messages. For another topic, treat it as a base to fine-tune, not a finished classifier.
  • BUG_REPORT is by far the weakest class (F1 0.316 on seed 42). CREATE_INVITE, OTHER, and CLEAR_LIST are also weak, with F1 between 0.60 and 0.65. Read a low-confidence prediction as "not sure" rather than a firm label.
  • It expects Hebrew, in short messages of up to 128 tokens. A little Hebrew and English mixing is fine.
  • The training data is synthetic, so unusual or very noisy phrasing may be classified less reliably than the test examples.

Credit and license

This model is only a fine-tune. The real work, teaching a model the Hebrew language in the first place, was done by the AlephBERT team at the ONLP Lab, Bar-Ilan University: Amit Seker, Elron Bandel, Dan Bareket, Idan Brusilovsky, Refael Greenfeld, and Reut Tsarfaty. Thank you to them. Their code and models are at github.com/OnlpLab/AlephBERT. If AlephBERT is useful in your own work, please cite their paper.

Base model: onlplab/alephbert-base (Apache 2.0). This fine-tune is released under Apache 2.0 as well.

@inproceedings{seker-etal-2022-alephbert,
    title     = "{A}leph{BERT}: Language Model Pre-training and Evaluation from Sub-Word to Sentence Level",
    author    = "Seker, Amit and Bandel, Elron and Bareket, Dan and Brusilovsky, Idan and Greenfeld, Refael and Tsarfaty, Reut",
    booktitle = "Proceedings of the 60th Annual Meeting of the ACL",
    year      = "2022",
    address   = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url       = "https://aclanthology.org/2022.acl-long.4",
    pages     = "46--56",
}