rmtariq committed
Commit f69c15d · verified · 1 Parent(s): 2cb076d

Model save

Files changed (2)
  1. README.md +77 -119
  2. model.safetensors +1 -1
README.md CHANGED
@@ -1,119 +1,77 @@
- # 🇲🇾 Malay Claim Classification Model
-
- This is a fine-tuned BERT model built to classify claims in Malay (and English) into 21 categories.
-
- ## 📊 Categories
-
- The model classifies claims into the following categories:
-
- - `Politik` (Politics)
- - `Perpaduan` (Unity)
- - `Keluarga` (Family)
- - `Belia` (Youth)
- - `Perumahan` (Housing)
- - `Internet` (Internet)
- - `Pengguna` (Consumer)
- - `Makanan` (Food)
- - `Pekerjaan` (Employment)
- - `Pengangkutan` (Transportation)
- - `Sukan` (Sports)
- - `Ekonomi` (Economy)
- - `Hiburan` (Entertainment)
- - `Jenayah` (Crime)
- - `Alam Sekitar` (Environment)
- - `Teknologi` (Technology)
- - `Pendidikan` (Education)
- - `Agama` (Religion)
- - `Sosial` (Social)
- - `Kesihatan` (Health)
- - `Halal` (Halal)
-
- ## 🧠 Base Model
-
- Fine-tuned from `bert-base-multilingual-cased`, which supports both Malay and English text.
-
- ## 🧪 Example Usage
-
- ```python
- from transformers import AutoModelForSequenceClassification, AutoTokenizer
- import torch
-
- # Load model and tokenizer
- model_name = "rmtariq/malay_classification"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForSequenceClassification.from_pretrained(model_name)
-
- # Function to classify a claim
- def classify_claim(claim):
-     # Prepare the input
-     inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=128)
-
-     # Get the prediction
-     with torch.no_grad():
-         outputs = model(**inputs)
-
-     # Get the predicted class
-     logits = outputs.logits
-     predicted_class_id = logits.argmax().item()
-
-     # Get the confidence score
-     probabilities = torch.nn.functional.softmax(logits, dim=1)[0]
-     confidence = probabilities[predicted_class_id].item()
-
-     # Map to category
-     category = model.config.id2label[predicted_class_id]
-
-     return category, confidence
-
- # Example claims
- examples = [
-     "Projek mega kerajaan penuh dengan ketirisan.",
-     "Harga barang keperluan naik setiap bulan.",
-     "Program vaksinasi tidak mencakupi golongan luar bandar.",
-     "Makanan di hotel lima bintang tidak jelas status halalnya."
- ]
-
- # Classify each example
- for claim in examples:
-     category, confidence = classify_claim(claim)
-     print(f"Claim: {claim}")
-     print(f"Category: {category}")
-     print(f"Confidence: {confidence:.4f}")
-     print("-" * 50)
- ```
-
- ## 📚 Dataset
-
- Fine-tuned on a custom dataset with 3,675 claims labeled by category, with an 80/20 train/test split.
-
- ## 🔍 Evaluation
-
- The model achieves high accuracy on the test set, with most predictions having confidence scores above 0.95.
-
- ## 🎯 Specific Claim Patterns
-
- The model includes special handling for specific claim patterns:
-
- 1. **Police-related claims**: Claims about the police chief, summons, or threats
-    - Example: "Ketua Polis Negara (KPN) Tan Sri Razarudin Husain hantar e-mel berkaitan saman dan berbaur ugutan kepada orang awam"
-    - Category: Jenayah (Crime)
-
- 2. **Zakat-related claims**: Claims about zakat fitrah, rice types, or payment validity
-    - Example: "Zakat fitrah tidak sah jika dibayar tidak mengikut jenis beras yang dimakan"
-    - Category: Agama (Religion)
-
- 3. **Tax-related claims**: Claims about government taxes, especially on palm oil
-    - Example: "Kerajaan akan memperkenalkan cukai khas minyak sawit mentah"
-    - Category: Ekonomi (Economy)
-
- 4. **Consumer product claims**: Claims about contact lenses or online sales
-    - Example: "Kanta lekap tidak boleh dijual secara dalam talian"
-    - Category: Pengguna (Consumer)
-
- 5. **National security claims**: Claims about ammunition, colonization, or enemies
-    - Example: "Penemuan 50 tan kelongsong dan peluru petanda negara bakal dijajah musuh"
-    - Category: Politik (Politics)
- ## 📋 License
-
- MIT License
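The `classify_claim` function in the removed README boils down to a softmax over the logits followed by an argmax. A dependency-free sketch of that step, using hypothetical logits and a four-label subset of the card's 21 categories (the real `id2label` mapping lives in `model.config` and is larger):

```python
import math

# Hypothetical 4-way subset of the model's 21 categories (illustration only)
id2label = {0: "Politik", 1: "Ekonomi", 2: "Agama", 3: "Pengguna"}

def softmax(logits):
    # Numerically stable softmax: shift by the max before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits, standing in for model(**inputs).logits[0]
logits = [0.3, 4.1, 0.2, 1.0]
probs = softmax(logits)

predicted_id = max(range(len(probs)), key=probs.__getitem__)
category = id2label[predicted_id]        # predicted label
confidence = probs[predicted_id]         # probability of that label
```

The sketch mirrors the README code's two outputs: the mapped category name and the softmax probability reported as the confidence score.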
 
+ ---
+ library_name: transformers
+ base_model: rmtariq/malay_classification
+ tags:
+ - generated_from_trainer
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ model-index:
+ - name: malay_classification
+   results: []
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # malay_classification
+
+ This model is a fine-tuned version of [rmtariq/malay_classification](https://huggingface.co/rmtariq/malay_classification) on an unspecified dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.0024
+ - Accuracy: 0.9990
+ - F1: 0.9990
+ - Precision: 0.9991
+ - Recall: 0.9990
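The Accuracy, F1, Precision, and Recall figures above are standard multi-class classification metrics. A minimal pure-Python sketch of how the per-class versions are computed, on toy labels drawn from the category set (the averaging scheme the training script uses across the 21 classes is not stated in the card):

```python
def per_class_metrics(y_true, y_pred, label):
    # Precision, recall, and F1 for a single class, from raw label lists
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy gold and predicted labels (hypothetical, for illustration)
y_true = ["Politik", "Ekonomi", "Agama", "Ekonomi", "Politik"]
y_pred = ["Politik", "Ekonomi", "Ekonomi", "Ekonomi", "Politik"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = per_class_metrics(y_true, y_pred, "Ekonomi")
```

Per-class results are then averaged (macro, micro, or weighted) to give the single numbers reported in the card.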
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 5e-05
+ - train_batch_size: 8
+ - eval_batch_size: 16
+ - seed: 42
+ - optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_steps: 500
+ - num_epochs: 3
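With `lr_scheduler_type: linear` and 500 warmup steps, the learning rate climbs from 0 to 5e-05 over the first 500 steps and then decays linearly to 0 by the last step. A sketch of that shape (the total step count of roughly 5,500 is an assumption read off the last row of the training-results table, not a value stated by the card):

```python
def linear_warmup_lr(step, base_lr=5e-05, warmup_steps=500, total_steps=5500):
    # Linear warmup from 0 to base_lr, then linear decay back to 0, matching
    # the shape of transformers' linear schedule with warmup.
    # total_steps=5500 is an assumption for illustration.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

Halfway through warmup the rate is half of `base_lr`; after the warmup peak it falls off linearly until training ends.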
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Precision | Recall |
+ |:-------------:|:------:|:----:|:---------------:|:--------:|:------:|:---------:|:------:|
+ | 0.1691 | 0.2720 | 500 | 0.1373 | 0.9717 | 0.9717 | 0.9730 | 0.9717 |
+ | 0.0493 | 0.5441 | 1000 | 0.0369 | 0.9943 | 0.9943 | 0.9945 | 0.9943 |
+ | 0.0669 | 0.8161 | 1500 | 0.0406 | 0.9952 | 0.9952 | 0.9954 | 0.9952 |
+ | 0.0287 | 1.0881 | 2000 | 0.0276 | 0.9943 | 0.9944 | 0.9948 | 0.9943 |
+ | 0.0061 | 1.3602 | 2500 | 0.0168 | 0.9971 | 0.9971 | 0.9972 | 0.9971 |
+ | 0.0137 | 1.6322 | 3000 | 0.0128 | 0.9981 | 0.9981 | 0.9981 | 0.9981 |
+ | 0.0178 | 1.9042 | 3500 | 0.0179 | 0.9968 | 0.9968 | 0.9969 | 0.9968 |
+ | 0.0112 | 2.1763 | 4000 | 0.0110 | 0.9975 | 0.9975 | 0.9975 | 0.9975 |
+ | 0.0001 | 2.4483 | 4500 | 0.0079 | 0.9987 | 0.9987 | 0.9988 | 0.9987 |
+ | 0.0001 | 2.7203 | 5000 | 0.0021 | 0.9987 | 0.9987 | 0.9987 | 0.9987 |
+ | 0.0003 | 2.9924 | 5500 | 0.0024 | 0.9990 | 0.9990 | 0.9991 | 0.9990 |
+
+ ### Framework versions
+
+ - Transformers 4.53.1
+ - Pytorch 2.7.1
+ - Datasets 3.6.0
+ - Tokenizers 0.21.2
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:6c1d91e1af8ac7d950b8aaa56fe4210b91e19b4da2a78b94275fe0e46baf0a90
+ oid sha256:4b2d031437b7b3ceed085985b1a9a59ced72928e3b8c09fa62fa3969391c8b34
  size 711501908