Script that spits out every damned hyphen in Unicode, because Gemma loves them

I mean, legitimately **banning the fucking hyphen** is damn near the only way to get Gemma models to not-do-the-thing.

Files changed (1)

every_hyphen.py +49 -0

	@@ -0,0 +1,49 @@

+#!/usr/bin/env python3
+# -*- coding: utf-8; -*-
+#  table sourced from http://jkorpela.fi/dashes.html
+# I know, it has tabs, it's disgusting. Sorry.
+## -  	 U+002D 	 &#45;    	 hyphen-minus                           	 the Ascii hyphen, with multiple usage, or “ambiguous semantic value”; the width should be “average”
+## ~  	 U+007E 	 &#126;   	 tilde                                  	 the Ascii tilde, with multiple usage; “swung dash”
+##   	 U+00AD 	 &#173;   	 soft hyphen                            	 “discretionary hyphen”
+## ֊  	 U+058A 	 &#1418;  	 armenian hyphen                        	 as soft hyphen, but different in shape
+## ־  	 U+05BE 	 &#1470;  	 hebrew punctuation maqaf               	 word hyphen in Hebrew
+## ᐀  	 U+1400 	 &#5120;  	 canadian syllabics hyphen              	 used in Canadian Aboriginal Syllabics
+## ᠆  	 U+1806 	 &#6150;  	 mongolian todo soft hyphen             	 as soft hyphen, but displayed at the beginning of the second line
+## ‐  	 U+2010 	 &#8208;  	 hyphen                                 	 unambiguously a hyphen character, as in “left-to-right”; narrow width
+## ‑  	 U+2011 	 &#8209;  	 non-breaking hyphen                    	 as hyphen (U+2010), but not an allowed line break point
+## ‒  	 U+2012 	 &#8210;  	 figure dash                            	 as hyphen-minus, but has the same width as digits
+## –  	 U+2013 	 &#8211;  	 en dash                                	 used e.g. to indicate a range of values
+## —  	 U+2014 	 &#8212;  	 em dash                                	 used e.g. to make a break in the flow of a sentence
+## ―  	 U+2015 	 &#8213;  	 horizontal bar                         	 used to introduce quoted text in some typographic styles; “quotation dash”; often (e.g., in the representative glyph in the Unicode standard) longer than em dash
+## ⁓  	 U+2053 	 &#8275;  	 swung dash                             	 like a large tilde
+## ⁻  	 U+207B 	 &#8315;  	 superscript minus                      	 a compatibility character which is equivalent to minus sign U+2212 in superscript style
+## ₋  	 U+208B 	 &#8331;  	 subscript minus                        	 a compatibility character which is equivalent to minus sign U+2212 in subscript style
+## −  	 U+2212 	 &#8722;  	 minus sign                             	 an arithmetic operator; the glyph may look the same as the glyph for a hyphen-minus, or may be longer
+## ⸗  	 U+2E17 	 &#11799; 	 double oblique hyphen                  	 used in ancient Near-Eastern linguistics; not in Fraktur, but the glyph of Ascii hyphen or hyphen is similar to this character in Fraktur fonts
+## ⸺  	 U+2E3A 	 &#11834; 	 two-em dash                            	 omission dash, 2 em units wide
+## ⸻  	 U+2E3B 	 &#11835; 	 three-em dash                          	 used in bibliographies, 3 em units wide
+## 〜 	 U+301C 	 &#12316; 	 wave dash                              	 a Chinese/Japanese/Korean character
+## 〰 	 U+3030 	 &#12336; 	 wavy dash                              	 a Chinese/Japanese/Korean character
+## ゠ 	 U+30A0 	 &#12448; 	 katakana-hiragana double hyphen        	 in Japanese kana writing
+## ︱ 	 U+FE31 	 &#65073; 	 presentation form for vertical em dash 	 vertical variant of em dash
+## ︲ 	 U+FE32 	 &#65074; 	 presentation form for vertical en dash 	 vertical variant of en dash
+## ﹘ 	 U+FE58 	 &#65112; 	 small em dash                          	 small variant of em dash
+## ﹣ 	 U+FE63 	 &#65123; 	 small hyphen-minus                     	 small variant of Ascii hyphen
+## － 	 U+FF0D 	 &#65293; 	 fullwidth hyphen-minus                 	 variant of Ascii hyphen for use with CJK characters
+# for i in ("\u002D", "\u007E", "\u00AD", "\u058A", "\u05BE", "\u1400", "\u1806", "\u2010", "\u2011", "\u2012", "\u2013", "\u2014", "\u2015", "\u2053", "\u207B", "\u208B", "\u2212", "\u2E17", "\u2E3A", "\u2E3B", "\u301C", "\u3030", "\u30A0", "\uFE31", "\uFE32", "\uFE58", "\uFE63", "\uFF0D"):
+#     print(i * 72)
+import os
+import re
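+# Read this script's own source; the "## " rows above are the data.
+# Use the full path so it also works when run from another directory.
+with open(os.path.abspath(__file__), mode='r', encoding='utf-8') as f:
+    r = re.compile(r'^## ')
+    for line in f:
+        if r.search(line):
+            # Columns are tab-separated: glyph, code point, HTML entity, name, notes.
+            q = [x.strip() for x in line.split('\t')]
+            # Drop the leading "##" marker, keeping only the glyph itself.
+            q[0] = q[0].split()[1]
+            print("{:40} {}".format(q[3] + ':', q[0] * 18))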
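
For the actual banning, here is a minimal sketch that is not part of this commit. It puts the same 28 code points from the table into a plain tuple, so nothing has to read its own source, then either reports or replaces them in a string. The names `DASH_CODE_POINTS`, `DASHES`, `find_dashes`, and `strip_dashes` are made up for illustration. The Unicode names come from Python's `unicodedata` module, so they will not match the table's wording exactly.

```python
import unicodedata

# The 28 code points from the table in every_hyphen.py, written as numbers so
# the invisible ones (such as the soft hyphen, U+00AD) survive copy and paste.
DASH_CODE_POINTS = (
    0x002D, 0x007E, 0x00AD, 0x058A, 0x05BE, 0x1400, 0x1806,
    0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x2015, 0x2053,
    0x207B, 0x208B, 0x2212, 0x2E17, 0x2E3A, 0x2E3B, 0x301C,
    0x3030, 0x30A0, 0xFE31, 0xFE32, 0xFE58, 0xFE63, 0xFF0D,
)
DASHES = frozenset(chr(cp) for cp in DASH_CODE_POINTS)


def find_dashes(text):
    """Return (index, "U+XXXX", Unicode name) for each listed character in text."""
    return [
        (i, "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ch in DASHES
    ]


def strip_dashes(text, replacement=" "):
    """Replace every listed character in text with `replacement`."""
    return "".join(replacement if ch in DASHES else ch for ch in text)


if __name__ == "__main__":
    # Hypothetical sample; escapes keep the exact characters unambiguous.
    sample = "left\u2010to\u2010right \u2014 pages 3\u20135"
    for hit in find_dashes(sample):
        print(hit)
    print(strip_dashes(sample))
```

Note that the table also lists the tilde (U+007E) and the wave dashes, so `strip_dashes` removes those too. Trim the tuple if that is more banning than you want.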