Executor-Tyrant-Framework committed on
Commit 4e714d6 (verified) · 1 Parent(s): 8e2ec90

Sync from GitHub: 080bffac0f304b7f3781266ca4fb761c2974f8a6

Files changed (1)
  1. grammars/concepts.gbnf +8 -10
grammars/concepts.gbnf CHANGED
@@ -28,14 +28,12 @@
 
 root ::= item ("," ws item){0,7}
 item ::= word (ws word){0,3}
-# Word: first char is any letter (allows TitleCase proper nouns
-# like "Calvin" or "RSA"); body is lowercase + digits + hyphen only.
-# Mid-word capitals are forbidden. This forces Falcon3's BPE
-# tokenizer to emit space-prefixed continuation tokens (" Sunlight")
-# rather than jamming tokens together as one camelCase "word"
-# ("thatGreenhouseCarbon"). Per arXiv 2502.14969, space-prefixed
-# tokens have better-trained embeddings (5-10% gain), which is
-# particularly important for smaller/quantized models like
-# Falcon3-10B-1.58bit.
-word ::= [a-zA-Z] [a-z0-9-]{2,19}
+# Word: any letter + alphanumerics + hyphens. Mid-word capitals
+# are permitted because legitimate concepts frequently contain them:
+# acronyms (RSA, CPU, DNA, ATP, NADPH), patronymic proper nouns
+# (McDonalds, MacPhearson), brand/product names (iPhone, eBay).
+# Length-cap of 20 chars + the defensive parser's word-count gate
+# handle the run-7 token-jam pollution ("thatGreenhouseCarbon")
+# without sacrificing legitimate mid-capital content.
+word ::= [a-zA-Z] [a-zA-Z0-9-]{2,19}
 ws ::= " "
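The effect of the `word` change can be checked outside the grammar engine with a small script. The following is a minimal sketch, not part of the commit: it assumes the two GBNF `word` rules translate directly into Python regular expressions, and the names `OLD_WORD`, `NEW_WORD`, and `accepts` are hypothetical helpers introduced for illustration. The sample words are taken from the comments in the diff.

```python
import re

# Assumed regex translations of the two GBNF `word` rules in this diff.
# (GBNF character classes and {m,n} repetition are taken to behave like
# their Python regex counterparts; this is an assumption, not a spec quote.)
OLD_WORD = re.compile(r"[a-zA-Z][a-z0-9-]{2,19}")      # removed rule
NEW_WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9-]{2,19}")   # added rule


def accepts(pattern: re.Pattern, word: str) -> bool:
    """Return True if the entire word matches the given rule."""
    return pattern.fullmatch(word) is not None


if __name__ == "__main__":
    # Words quoted in the old and new grammar comments.
    samples = [
        "Calvin", "RSA", "NADPH", "McDonalds",
        "iPhone", "eBay", "thatGreenhouseCarbon",
    ]
    for w in samples:
        print(f"{w:22} old={accepts(OLD_WORD, w)} new={accepts(NEW_WORD, w)}")
```

Under this translation, the removed rule rejects "RSA" even though its comment listed it as allowed, since the body class excluded uppercase letters. The added rule accepts all the listed acronyms and brand names. It also accepts "thatGreenhouseCarbon", which is exactly 20 characters, so for that particular string the length cap alone does not filter it out and the parser-side word-count gate mentioned in the comment (not shown in this diff) would have to do so.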