ARISCOT committed on
Commit 512697c · verified · 1 Parent(s): ae90040

Update README.md

Files changed (1)
  1. README.md +34 -1
README.md CHANGED
@@ -30,9 +30,42 @@ widget:
  - text: "Scientists have discovered a planet made entirely of diamond."
    example_title: "Science Claim"
  ---
+
+ # 1. Load the different "Subject Experts"
+ # We take a sample of 5,000 from each to keep the model balanced
+ global_news = load_dataset("jason1966/algozee_fake-news", split='train[:5000]')
+ politics = load_dataset("ucsbnlp/liar", split='train[:5000]')
+ science_health = load_dataset("Intel/misinformation-guard", split='train[:5000]')
+
+ # 2. Label Harmonization
+ # Different datasets use different numbers for "False".
+ # We force them all to use: 0 for False, 1 for True.
+ def clean_labels(example):
+     # Example logic: if the label is 'fake' or 0, it stays 0
+     if str(example['label']).lower() in ['fake', 'false', '0']:
+         example['label'] = 0
+     else:
+         example['label'] = 1
+     return example
+
+ # Apply the cleaning to all datasets
+ global_news = global_news.map(clean_labels)
+ politics = politics.map(clean_labels)
+ science_health = science_health.map(clean_labels)
+
+ # 3. Create the "Super Dataset"
+ universal_data = concatenate_datasets([global_news, politics, science_health])
+
+ # 4. Shuffle so the model learns all subjects at the same time
+ universal_data = universal_data.shuffle(seed=42)
+
+ print(f"Universal model is ready to train on {len(universal_data)} claims across all categories!")
+ ---
+
  # Digital Literacy & Fact-Checker AI 🌍
  
  This AI helps verify news claims globally, with a specialized focus on digital literacy and misinformation trends in West Africa."
  
  ## How it Works
- This model uses the RoBERTa architecture to classify news claims into four categories: reliable, misleading, false, or unverified.
+ This model uses the RoBERTa architecture to classify news claims into four categories: reliable, misleading, false, or unverified.
+ from datasets import load_dataset, concatenate_datasets, DatasetDict
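
Note on the label harmonization step: as committed, `clean_labels` maps every value outside `['fake', 'false', '0']` to 1, so a multi-class label such as "half-true" would be counted as true. The sketch below is one possible stricter variant, not the author's method. It uses plain dictionaries instead of the `datasets` library so it runs without downloads. The `FALSE_LABELS` and `TRUE_LABELS` vocabularies and the `harmonize_label` helper are hypothetical names introduced here; the actual label sets of the three datasets are not visible in this diff and would need to be checked against each dataset card.

```python
import random

# Hypothetical vocabularies (assumption): the real label values of the three
# source datasets are not shown in the diff and must be verified separately.
FALSE_LABELS = {"fake", "false", "0"}
TRUE_LABELS = {"real", "true", "1"}


def harmonize_label(raw_label):
    """Map a raw label to 0 (false) or 1 (true); raise on anything unmapped."""
    key = str(raw_label).strip().lower()
    if key in FALSE_LABELS:
        return 0
    if key in TRUE_LABELS:
        return 1
    # Failing loudly avoids silently counting unknown classes as "true".
    raise ValueError(f"Unmapped label: {raw_label!r}")


def clean_labels(example):
    """Return a copy of the example with a harmonized integer label."""
    return {**example, "label": harmonize_label(example["label"])}


if __name__ == "__main__":
    # Toy rows standing in for the three sources (made-up illustration data).
    global_news = [{"text": "claim a", "label": "fake"}]
    politics = [{"text": "claim b", "label": 1}]
    science_health = [{"text": "claim c", "label": "TRUE"}]

    # Harmonize each source, then combine and shuffle with a fixed seed.
    combined = [
        clean_labels(row)
        for source in (global_news, politics, science_health)
        for row in source
    ]
    random.Random(42).shuffle(combined)
    print(f"{len(combined)} harmonized examples")
```

With the `datasets` library, the same `clean_labels` function could be passed to `Dataset.map`, since it returns a dictionary of updated columns. Note also that a binary 0/1 target does not by itself support the four categories named under "How it Works"; a four-way label scheme would need its own mapping.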