AlephBERT Hebrew Shopping Intent Classifier

Hebrew intent classification (סיווג כוונות בעברית)

This is a small Hebrew text classifier. You give it a short message like "תוסיף חלב וביצים" ("add milk and eggs") and it tells you what the person wants, out of 17 possible intents (add an item, show the list, clear the list, and so on). It is a fine-tune of AlephBERT, a Hebrew BERT model, and I built it as a learning project.

I am not an ML researcher, and this is not a Hebrew NLP benchmark. It is a worked example of turning one narrow problem into a small fine-tuned model and checking it honestly. The full tutorial and code are in the GitHub repo: github.com/spivi/alephbert-intent-he.

Important limitation: the train and test data are synthetic paraphrases, not real chat traffic. The split avoids leakage by holding out source seed sentences before paraphrasing, but read the numbers as a controlled experiment, not as proof of real-world accuracy.

Try it

from transformers import pipeline

# top_k=3 returns the three highest-scoring intents for the message.
clf = pipeline("text-classification", model="spivi87/alephbert-intent-he", top_k=3)
clf("תקנה ביצים")  # "buy eggs"
# [{'label': 'GROCERY_REQUEST', 'score': 0.90}, ...]

You get readable label names back (not LABEL_0). The mapping from numbers to names is saved inside config.json. If you would rather run without PyTorch, the GitHub repo has a one-command ONNX export.
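As a rough illustration of that mapping, here is what reading it back could look like. The JSON string is a hand-written excerpt in the shape transformers normally uses for `id2label` in config.json (an assumption about this checkpoint's file, with only three of the 17 labels shown). JSON object keys are always strings, so the sketch converts them to integers.

```python
import json

# Hypothetical excerpt of config.json; the real file would list all 17 labels.
CONFIG_EXCERPT = """
{
  "id2label": {
    "0": "GROCERY_REQUEST",
    "1": "RECIPE_URL",
    "2": "LIST_QUERY"
  }
}
"""


def load_id2label(config_text):
    """Parse the id2label table, converting JSON's string keys to ints."""
    raw = json.loads(config_text)["id2label"]
    return {int(idx): name for idx, name in raw.items()}


id2label = load_id2label(CONFIG_EXCERPT)
print(id2label[0])  # GROCERY_REQUEST
```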

There is also a live demo you can try in the browser, with no setup: spivi87/alephbert-intent-he-demo.

What it is for

It sorts short Hebrew messages from a shopping or grocery chat into one of 17 intents: add an item, show the list, clear the list, a recipe link, group admin actions, and so on (the full list is in the table further down).

It also works as a starting point if you want to fine-tune your own Hebrew classifier for a different topic. The weights are already comfortable with short, informal Hebrew that has typos, emoji, and a little English mixed in.

A practical tip: the demo uses a top-score cutoff of about 0.7 as a confidence heuristic. I did not calibrate the probabilities, so treat it as an operational rule of thumb, not a proven threshold. When the top score falls below it, fall back to a larger model or treat the message as OTHER.
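A minimal sketch of that rule, assuming the input is the list of label/score dictionaries that the pipeline returns for one message. The function name `choose_intent` and the example scores are made up for illustration; only the 0.7 cutoff and the OTHER fallback come from the text above.

```python
CONFIDENCE_THRESHOLD = 0.7  # the demo's rule of thumb, not a calibrated value
FALLBACK_LABEL = "OTHER"


def choose_intent(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Return the top label, or the fallback when the top score is too low.

    `predictions` is a list of {"label": str, "score": float} dicts for a
    single message.
    """
    if not predictions:
        return FALLBACK_LABEL
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else FALLBACK_LABEL


# Hand-written scores for illustration, not real model output.
confident = [{"label": "GROCERY_REQUEST", "score": 0.90},
             {"label": "REMOVE_ITEM", "score": 0.06}]
unsure = [{"label": "BUG_REPORT", "score": 0.41},
          {"label": "OTHER", "score": 0.38}]

print(choose_intent(confident))  # GROCERY_REQUEST
print(choose_intent(unsure))     # OTHER
```

Returning OTHER is the simpler of the two fallbacks mentioned above; routing the message to a larger model would replace the fallback return with a call to that model.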

Results

The headline numbers are the mean over 3 training runs (random seeds 42, 43, 44) on a held-out test set of 374 messages (17 intents, 22 messages each). The published checkpoint is the seed 42 run.

Accuracy: 0.773 ± 0.008
Macro F1: 0.763 ± 0.008

Per-intent metrics (seed 42)

Intent	Precision	Recall	F1	Support
`GROCERY_REQUEST`	0.800	0.909	0.851	22
`RECIPE_URL`	0.900	0.818	0.857	22
`LIST_QUERY`	0.760	0.864	0.809	22
`CLEAR_LIST`	0.722	0.591	0.650	22
`REMOVE_ITEM`	0.750	0.818	0.783	22
`PARTIAL_COMPLETION`	0.909	0.909	0.909	22
`GROUP_INFO`	1.000	0.545	0.706	22
`GET_INVITE_CODE`	0.786	1.000	0.880	22
`CREATE_INVITE`	0.619	0.591	0.605	22
`RENAME_GROUP`	0.957	1.000	0.978	22
`LEAVE_GROUP`	0.714	0.909	0.800	22
`NOTIFICATION_SETTINGS`	0.857	0.545	0.667	22
`REVOKE_INVITE`	0.808	0.955	0.875	22
`RECIPE_SEARCH`	0.808	0.955	0.875	22
`UPDATE_QUANTITY`	1.000	0.955	0.977	22
`BUG_REPORT`	0.375	0.273	0.316	22
`OTHER`	0.600	0.682	0.638	22

The full report and confusion matrix are in EVALUATION.md and confusion_matrix.png.
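For readers who want to check how the columns relate, the sketch below recomputes F1, macro F1, and overall accuracy from the precision and recall values in the seed 42 table. It works from the rounded table values, so results can differ from EVALUATION.md in the last digit.

```python
SUPPORT = 22  # test messages per intent

# (precision, recall) per intent, copied from the seed 42 table.
PER_INTENT = {
    "GROCERY_REQUEST": (0.800, 0.909),
    "RECIPE_URL": (0.900, 0.818),
    "LIST_QUERY": (0.760, 0.864),
    "CLEAR_LIST": (0.722, 0.591),
    "REMOVE_ITEM": (0.750, 0.818),
    "PARTIAL_COMPLETION": (0.909, 0.909),
    "GROUP_INFO": (1.000, 0.545),
    "GET_INVITE_CODE": (0.786, 1.000),
    "CREATE_INVITE": (0.619, 0.591),
    "RENAME_GROUP": (0.957, 1.000),
    "LEAVE_GROUP": (0.714, 0.909),
    "NOTIFICATION_SETTINGS": (0.857, 0.545),
    "REVOKE_INVITE": (0.808, 0.955),
    "RECIPE_SEARCH": (0.808, 0.955),
    "UPDATE_QUANTITY": (1.000, 0.955),
    "BUG_REPORT": (0.375, 0.273),
    "OTHER": (0.600, 0.682),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


scores = {intent: f1(p, r) for intent, (p, r) in PER_INTENT.items()}

# Macro F1 is the unweighted mean of the per-intent F1 scores.
macro_f1 = sum(scores.values()) / len(scores)

# Recall times support gives the correctly labelled messages per intent.
correct = sum(round(r * SUPPORT) for _, r in PER_INTENT.values())
accuracy = correct / (SUPPORT * len(PER_INTENT))

weakest = min(scores, key=scores.get)
print(f"macro F1 = {macro_f1:.3f}, accuracy = {accuracy:.4f}, weakest = {weakest}")
```

The recovered accuracy is 293 / 374, which matches the 0.7834 reported for seed 42 in the comparison table.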

How it compares to a zero-shot LLM

On the same 374-message test set:

Approach	Accuracy	Cost per 1,000 messages
Random guessing	0.0668	$0
Always the most common label	0.0588	$0
Keyword rules (hand-written)	0.2487	$0
GPT-4o-mini, zero-shot	0.5722	about $0.05 (estimate, see note)
AlephBERT fine-tune (this model, seed 42)	0.7834	$0

On this narrow synthetic grocery-intent task the fine-tune scored higher than my zero-shot GPT-4o-mini baseline on the same split. That does not mean it is a better general Hebrew model. It means that for a narrow, repeated task, a small task-specific model can be cheaper, private, and surprisingly effective. The GPT-4o-mini cost is an estimate for my zero-shot prompt and these short messages; OpenAI bills per token, not per message, so your cost will vary with prompt length, number of labels, and output format.
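To make the accuracy column more concrete, this short sketch converts each reported accuracy back into a count of correctly labelled messages and compares the two trivial baselines with what chance predicts. It uses only numbers from the table; the variable names are illustrative.

```python
TEST_SIZE = 374    # messages in the held-out test set
NUM_INTENTS = 17   # balanced: 22 messages per intent

# Accuracies as reported in the comparison table.
reported = {
    "random guessing": 0.0668,
    "most common label": 0.0588,
    "keyword rules": 0.2487,
    "GPT-4o-mini zero-shot": 0.5722,
    "AlephBERT fine-tune (seed 42)": 0.7834,
}

# Convert each accuracy back to a count of correctly labelled messages.
correct = {name: round(acc * TEST_SIZE) for name, acc in reported.items()}

# With balanced classes, uniform guessing and "always one label" both
# have an expected accuracy of 1 / NUM_INTENTS.
chance = 1 / NUM_INTENTS

for name, n in correct.items():
    print(f"{name:32s} {n:3d} / {TEST_SIZE}")
print(f"expected chance accuracy: {chance:.4f}")
```

The counts come out as 25, 22, 93, 214, and 293, so the fine-tune labels 79 more test messages correctly than the zero-shot baseline. The random-guessing row (25 correct) sits slightly above the 22 expected by chance, which is consistent with it being one random draw and not an expected value.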

How it was trained (short version)

The training data is synthetic. I started from a handful of Hebrew example sentences per intent and asked an LLM to rewrite each one into many variations. After testing the first version I found gaps and fed the failures back into the data: "buy X" requests like "תקנה חלב" ("buy milk") were misread, and short single-item requests like "תקנה ביצים" ("buy eggs") were confused with removing an item. I added hand-authored examples to cover the missing phrasings and retrained. No real user messages were used, so there is no private data. A sample is in the GitHub repo, under data/sample.jsonl.

The train/test split happens at the level of seed sentences, before the rewriting step. For each intent, 2 seed sentences are held out completely, and the test set only contains rewrites of those held-out seeds. Splitting after rewriting would instead let rewrites of the same sentence land on both sides, so the model would effectively have seen the test questions and the accuracy would look better than it is.
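The idea can be sketched in a few lines. This is not the repo's code: the seed sentences are English placeholders, and `paraphrase` is a hypothetical stand-in for the LLM rewriting step that makes no API call. The point is only the order of operations: split the seed sentences first, then rewrite each side separately.

```python
import random


def split_seeds(seeds_by_intent, held_out_per_intent=2, rng_seed=42):
    """Pick held-out seed sentences per intent BEFORE any paraphrasing."""
    rng = random.Random(rng_seed)
    train_seeds, test_seeds = {}, {}
    for intent, seeds in seeds_by_intent.items():
        shuffled = list(seeds)
        rng.shuffle(shuffled)
        test_seeds[intent] = shuffled[:held_out_per_intent]
        train_seeds[intent] = shuffled[held_out_per_intent:]
    return train_seeds, test_seeds


def paraphrase(seed, n=3):
    """Hypothetical stand-in for the LLM rewriting step."""
    return [f"{seed} (variant {i})" for i in range(1, n + 1)]


def build_examples(seeds_by_intent):
    """Expand every seed into labelled rewrites, remembering its source seed."""
    return [
        {"text": text, "label": intent, "seed": seed}
        for intent, seeds in seeds_by_intent.items()
        for seed in seeds
        for text in paraphrase(seed)
    ]


# Placeholder seeds; the real ones are Hebrew sentences.
seeds_by_intent = {
    "GROCERY_REQUEST": ["add milk", "buy eggs", "we need bread", "put rice on the list"],
    "LIST_QUERY": ["show the list", "what is on the list", "what do we need", "list please"],
}

train_seeds, test_seeds = split_seeds(seeds_by_intent)
train = build_examples(train_seeds)
test = build_examples(test_seeds)

# No seed sentence should contribute rewrites to both sides.
overlap = {ex["seed"] for ex in train} & {ex["seed"] for ex in test}
print(len(train), len(test), overlap)  # 12 12 set()
```

Keeping the source seed on every example makes the leakage check a one-line set intersection, which is cheap to rerun whenever the data is regenerated.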

Training settings

Setting	Value
Base model	`onlplab/alephbert-base`
Optimizer	AdamW (HF `Trainer` default)
Learning rate	`2e-5` (linear warmup, linear decay)
Batch size	16 (train) / 32 (eval)
Max sequence length	128 tokens
Max epochs	10 (early stopping on `eval_accuracy`, patience 3)
Random seeds	42, 43, 44 (seed 42 is the published checkpoint)

The full step-by-step instructions are in the GitHub repo.
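The settings table maps onto Hugging Face `Trainer` arguments roughly as follows. This is a sketch, not the repo's training script: evaluating and saving once per epoch, reading the batch sizes as per-device values, and the `output_dir` name are assumptions, and the warmup length is left out because the card does not state it. The `eval_strategy` argument is spelled `evaluation_strategy` in older transformers releases.

```python
# Keyword arguments for transformers.TrainingArguments, taken from the table
# except where a comment marks an assumption.
TRAINING_KWARGS = {
    "learning_rate": 2e-5,
    "lr_scheduler_type": "linear",          # linear decay
    "per_device_train_batch_size": 16,      # assumes a single device
    "per_device_eval_batch_size": 32,
    "num_train_epochs": 10,                 # upper bound; early stopping may end sooner
    "eval_strategy": "epoch",               # assumption: evaluate once per epoch
    "save_strategy": "epoch",               # must match eval_strategy for best-model loading
    "load_best_model_at_end": True,         # needed by EarlyStoppingCallback
    "metric_for_best_model": "eval_accuracy",
    "seed": 42,                             # the published checkpoint
}
EARLY_STOPPING_PATIENCE = 3
MAX_SEQ_LENGTH = 128  # passed to the tokenizer, not to TrainingArguments


def build_training_setup(output_dir="out"):
    """Build TrainingArguments and callbacks; imports transformers lazily."""
    from transformers import EarlyStoppingCallback, TrainingArguments

    args = TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
    callbacks = [EarlyStoppingCallback(early_stopping_patience=EARLY_STOPPING_PATIENCE)]
    return args, callbacks
```

The returned `args` and `callbacks` would be passed to `Trainer` together with a `compute_metrics` function that reports an `accuracy` key, since that is what `eval_accuracy` refers to.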

Labels

ID	Label	What it means
0	`GROCERY_REQUEST`	Add items to the shopping list
1	`RECIPE_URL`	A recipe link: pull the ingredients from it
2	`LIST_QUERY`	Show the current shopping list
3	`CLEAR_LIST`	Mark everything bought and clear the list
4	`REMOVE_ITEM`	Remove one item from the list
5	`PARTIAL_COMPLETION`	Mark most items bought, except a few
6	`GROUP_INFO`	Show the group members and details
7	`GET_INVITE_CODE`	Get the existing group invite code
8	`CREATE_INVITE`	Generate a new group invite code
9	`RENAME_GROUP`	Change the group name
10	`LEAVE_GROUP`	Leave the current group
11	`NOTIFICATION_SETTINGS`	Toggle notification preferences
12	`REVOKE_INVITE`	Cancel a group invite code
13	`RECIPE_SEARCH`	Build a shopping list for a known dish
14	`UPDATE_QUANTITY`	Change the quantity of an item already on the list
15	`BUG_REPORT`	Report a problem with the bot
16	`OTHER`	A chat or off-topic message, not a shopping action

Good to know before you rely on it

It was trained only on shopping and grocery messages. For another topic, treat it as a base to fine-tune, not a finished classifier.
BUG_REPORT is by far the weakest class (F1 0.316), followed by CREATE_INVITE (0.605), OTHER (0.638), and CLEAR_LIST (0.650). Read a low-confidence prediction as "not sure" rather than a firm label.
It expects Hebrew, in short messages of up to 128 tokens. A little Hebrew and English mixing is fine.
The training data is synthetic, so unusual or very noisy phrasing may be classified less reliably than the test examples.

Credit and license

This model is only a fine-tune. The real work, teaching a model the Hebrew language in the first place, was done by the AlephBERT team at the ONLP Lab, Bar-Ilan University: Amit Seker, Elron Bandel, Dan Bareket, Idan Brusilovsky, Refael Greenfeld, and Reut Tsarfaty. Thank you to them. Their code and models are at github.com/OnlpLab/AlephBERT. If AlephBERT is useful in your own work, please cite their paper.

Base model: onlplab/alephbert-base (Apache 2.0). This fine-tune is released under Apache 2.0 as well.

@inproceedings{seker-etal-2022-alephbert,
    title     = "{A}leph{BERT}: Language Model Pre-training and Evaluation from Sub-Word to Sentence Level",
    author    = "Seker, Amit and Bandel, Elron and Bareket, Dan and Brusilovsky, Idan and Greenfeld, Refael and Tsarfaty, Reut",
    booktitle = "Proceedings of the 60th Annual Meeting of the ACL",
    year      = "2022",
    address   = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url       = "https://aclanthology.org/2022.acl-long.4",
    pages     = "46--56",
}

Model size: 0.1B params (Safetensors, tensor type F32).