Executor-Tyrant-Framework/NuWave

Commit 8e2ec90

1 Parent(s): 6c53a1d

Sync from GitHub: 0c5737d7108460e1c2b09e575eccf57cc50766be

Files changed (1)

grammars/concepts.gbnf

@@ -28,5 +28,14 @@
 root    ::= item ("," ws item){0,7}
 item    ::= word (ws word){0,3}
-word    ::= [a-zA-Z] [a-zA-Z0-9-]{2,19}
+# Word: first char is any letter (allows TitleCase proper nouns
+# like "Calvin" or "RSA"); body is lowercase + digits + hyphen only.
+# Mid-word capitals are forbidden. This forces Falcon3's BPE
+# tokenizer to emit space-prefixed continuation tokens (" Sunlight")
+# rather than jamming tokens together as one camelCase "word"
+# ("thatGreenhouseCarbon"). Per arXiv 2502.14969, space-prefixed
+# tokens have better-trained embeddings (5-10% gain), which is
+# particularly important for smaller/quantized models like
+# Falcon3-10B-1.58bit.
+word    ::= [a-zA-Z] [a-z0-9-]{2,19}
 ws      ::= " "
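
The effect of the `word` change can be checked outside the model with a quick regex mirror of the rules in this hunk. This is only a sketch: the regex translation, the helper names `build_root` and `accepts`, and the assumption that `root` covers the whole output are illustrative and not part of the commit, and llama.cpp does not evaluate GBNF this way. One thing the check surfaces is that an all-caps word such as "RSA", which the new comment lists as allowed, is rejected by the new rule because its second and third characters are uppercase.

```python
import re

# Regex mirrors of the GBNF rules shown in the diff (an approximation for
# offline testing only; the grammar engine itself works token by token).
OLD_WORD = r"[a-zA-Z][a-zA-Z0-9-]{2,19}"   # removed line
NEW_WORD = r"[a-zA-Z][a-z0-9-]{2,19}"      # added line


def build_root(word: str) -> re.Pattern:
    """Compose item and root from a word pattern, following the grammar."""
    item = rf"{word}(?: {word}){{0,3}}"      # item ::= word (ws word){0,3}
    root = rf"{item}(?:, {item}){{0,7}}"     # root ::= item ("," ws item){0,7}
    return re.compile(root)


OLD_ROOT = build_root(OLD_WORD)
NEW_ROOT = build_root(NEW_WORD)


def accepts(pattern: re.Pattern, text: str) -> bool:
    """True if the whole string is derivable from the mirrored root rule."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    samples = [
        "thatGreenhouseCarbon",
        "that Greenhouse Carbon",
        "Calvin cycle, carbon fixation",
        "RSA",
    ]
    for s in samples:
        print(f"{s!r}: old={accepts(OLD_ROOT, s)} new={accepts(NEW_ROOT, s)}")
```

If acronyms are meant to stay legal, the rule would need an extra alternative for all-uppercase words; whether that is desirable here depends on how the tokenizer handles them, which this sketch does not test.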