{
  "expressions": [
    "x^2 + 2*x + 1",
    "sin(x)^2 + cos(x)^2",
    "x^3 - 3*x^2 + 3*x - 1",
    "e^(i*pi) + 1",
    "log(x*y)",
    "sqrt(x^2 + y^2)",
    "1/(1 + e^(-x))",
    "x^2 - y^2",
    "a^2 + 2*a*b + b^2",
    "(x+1)*(x-1)",
    "diff(sin(x), x)",
    "integrate(x^2, x)",
    "limit(sin(x)/x, x, 0)",
    "sum(k^2, k, 1, n)",
    "factorial(n) / (factorial(k)*factorial(n-k))",
    "exp(-x^2/2) / sqrt(2*pi)",
    "a*x^2 + b*x + c",
    "(-b + sqrt(b^2 - 4*a*c)) / (2*a)",
    "log(1 + x)",
    "x - x^3/6 + x^5/120",
    "1 + 1/2 + 1/4 + 1/8",
    "n*(n+1)/2",
    "2^10",
    "abs(x - y)",
    "floor(x) + ceil(-x)",
    "gamma(n+1)",
    "sinh(x) + cosh(x)",
    "atan(y/x)",
    "x^2 + y^2 + z^2",
    "det([[a,b],[c,d]])"
  ],

  "equivalent_pairs": [
    ["x^2 + 2*x + 1",      "(x+1)^2"],
    ["a^2 - b^2",          "(a+b)*(a-b)"],
    ["a^2 + 2*a*b + b^2",  "(a+b)^2"],
    ["x^3 - y^3",          "(x-y)*(x^2 + x*y + y^2)"],
    ["sin(x)^2 + cos(x)^2","1"],
    ["log(x) + log(y)",    "log(x*y)"],
    ["e^x * e^y",          "e^(x+y)"],
    ["1/x + 1/y",          "(x+y)/(x*y)"],
    ["b + a",              "a + b"],
    ["2*x + 2*y",          "2*(x+y)"],
    ["x/2",                "x * (1/2)"],
    ["x^2 * x^3",          "x^5"],
    ["(x^2)^3",            "x^6"],
    ["log(e^x)",           "x"],
    ["e^(log(x))",         "x"],
    ["n*(n+1)/2",          "n/2 + n^2/2"],
    ["1 + x + x^2",        "(x^3 - 1)/(x-1)"],
    ["cos(2*x)",           "1 - 2*sin(x)^2"],
    ["tan(x)",             "sin(x)/cos(x)"],
    ["cosh(x)^2 - sinh(x)^2","1"]
  ],

  "rewriting_groups": [
    ["x^2 + 2*x + 1", "(x+1)^2", "x*(x+2) + 1"],
    ["a*b + a*c",     "a*(b+c)", "a*c + a*b"],
    ["sin(x)/cos(x)", "tan(x)",  "sin(x)*sec(x)"],
    ["e^(x+y)",       "e^x * e^y"],
    ["log(x^2)",      "2*log(x)","log(x) + log(x)"],
    ["n*(n+1)/2",     "n/2*(n+1)", "sum(k, k, 1, n)"]
  ],

  "mixed_text_math": [
    "The derivative of $\\sin(x^2)$ with respect to $x$ is $2x\\cos(x^2)$.",
    "Let $f(x) = x^2 + 2x + 1$. Then $f(x) = (x+1)^2$.",
    "The quadratic formula gives $x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a}$.",
    "Euler's identity states that $e^{i\\pi} + 1 = 0$.",
    "The integral $\\int_0^1 x^2 dx = \\frac{1}{3}$.",
    "For any $n \\geq 1$, the sum $\\sum_{k=1}^{n} k = \\frac{n(n+1)}{2}$.",
    "The Pythagorean theorem: $a^2 + b^2 = c^2$ for right triangles.",
    "The standard normal density is $f(x) = \\frac{1}{\\sqrt{2\\pi}}e^{-x^2/2}$.",
    "If $\\sin^2(x) + \\cos^2(x) = 1$ then $\\tan^2(x) + 1 = \\sec^2(x)$.",
    "The limit $\\lim_{x \\to 0} \\frac{\\sin(x)}{x} = 1$ is fundamental.",
    "Find the derivative of f(x) = sin(x^2) + 3x.",
    "Solve for x: x^2 - 5*x + 6 = 0.",
    "The area of a circle of radius r is pi*r^2.",
    "Simplify: (a+b)^2 - (a-b)^2.",
    "Compute the Taylor series of exp(x) around x=0."
  ],

  "latex_only": [
    "\\frac{x^2 - 1}{x + 1}",
    "\\sqrt{\\frac{a^2 + b^2}{2}}",
    "\\int_0^\\infty e^{-x^2} dx",
    "\\sum_{n=0}^{\\infty} \\frac{x^n}{n!}",
    "\\lim_{n \\to \\infty} \\left(1 + \\frac{1}{n}\\right)^n",
    "\\binom{n}{k} = \\frac{n!}{k!(n-k)!}",
    "\\frac{d}{dx}\\left[\\ln(x)\\right] = \\frac{1}{x}",
    "\\nabla^2 f = \\frac{\\partial^2 f}{\\partial x^2} + \\frac{\\partial^2 f}{\\partial y^2}"
  ],

  "ascii_only": [
    "x**2 + 2*x + 1",
    "sin(x)**2 + cos(x)**2",
    "exp(-x**2 / 2) / sqrt(2*pi)",
    "factorial(n) / (factorial(k) * factorial(n - k))",
    "log(x**2) - 2*log(x)",
    "abs(a - b) + abs(b - c)",
    "floor(x/2) * 2",
    "gamma(n + 1) / gamma(n)"
  ],

  "metadata": {
    "version": "1.0",
    "description": "MathTok benchmark dataset — curated expressions for evaluating structural tokenization quality",
    "sources": ["handcrafted", "DeepMind-Mathematics-inspired"],
    "num_expressions": 30,
    "num_equivalent_pairs": 20,
    "num_rewriting_groups": 6,
    "num_mixed": 15
  }
}