ACDRepo committed on
Commit fbc206b · verified · 1 Parent(s): 24f62d5

Update README.md

Files changed (1)
  1. README.md +132 -0
README.md CHANGED
@@ -1,3 +1,135 @@
---
license: cc-by-4.0
pipeline_tag: text-generation
tags:
- math
- combinatorics
- permutations
- algebraic-combinatorics
- llama
- causal-lm
---

# PermuFormer

PermuFormer is a small Llama-style causal language model trained on symbolic permutation tasks from algebraic combinatorics. It is intended as a specialist base model for permutation representation, reasoning, and finetuning experiments rather than as a general natural-language assistant.

The model operates on a compact whitespace-tokenized vocabulary for permutations. Prompts are formulaic equations: the left side specifies a permutation task and generation begins after the `=` token.

## Model Details

- **Architecture:** `LlamaForCausalLM`
- **Parameters:** about 75.7M
- **Layers:** 12
- **Hidden size:** 768
- **Attention heads:** 12 query heads, 4 key/value heads
- **MLP intermediate size:** 2048
- **Activation:** SiLU/SwiGLU
- **Position encoding:** RoPE, theta 10000
- **Vocabulary size:** 186
- **Context length used by tokenizer:** 1000 tokens
- **Checkpoint:** `step_2600000`

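As a rough cross-check of the parameter figure above, the count can be reproduced from the listed dimensions. This is a minimal sketch, assuming the standard Llama layout (bias-free projections, a gated three-matrix MLP, RMSNorm weights) and a head dimension of 768 / 12 = 64. Whether the input and output embeddings are tied is not stated here, so both cases are computed. The function name `llama_param_count` is illustrative.

```python
def llama_param_count(hidden=768, layers=12, q_heads=12, kv_heads=4,
                      intermediate=2048, vocab=186, tied_embeddings=True):
    """Count weights of a bias-free Llama-style decoder (assumed layout)."""
    head_dim = hidden // q_heads                       # 64 for the listed sizes
    kv_dim = kv_heads * head_dim                       # 256: grouped-query K/V width
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # q and o, plus k and v
    mlp = 3 * hidden * intermediate                    # gate, up, down projections
    norms = 2 * hidden                                 # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    embed = vocab * hidden
    total = layers * per_layer + embed + hidden        # trailing term: final norm
    if not tied_embeddings:
        total += vocab * hidden                        # separate LM head (assumption)
    return total

tied = llama_param_count(tied_embeddings=True)
untied = llama_param_count(tied_embeddings=False)
print(f"tied: {tied / 1e6:.2f}M, untied: {untied / 1e6:.2f}M")
```

Under these assumptions the tied variant comes to 75.66M and the untied variant to 75.80M, both close to the stated figure of about 75.7M.
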
## Training Data

PermuFormer was trained autoregressively on synthetic permutation examples generated with exact combinatorial algorithms. The paper describes a dataset of 39.8M instances, approximately 2.66B tokens, over the symmetric groups `S_2` through `S_11`.

Training tasks cover three broad families:

- **Translation between encodings:** one-line notation, cycle notation, reduced Coxeter expressions, RSK tableaux, inversion vectors, and Lehmer codes.
- **Permutation statistics and properties:** length, descents, fixed points, sign/parity, cycle type, RSK shape, pattern avoidance, longest increasing/decreasing subsequences, and related statistics.
- **Algebraic operations and comparisons:** product/composition, inverse, powers, conjugation, commutator, relative products, multiplication by simple transpositions, complement, reverse, descent tests, and Bruhat order.

Some targets include computational witnesses before the final answer, for example inversion lists before a length answer or pattern witnesses before an avoidance answer.

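To make the encodings and witnesses above concrete, the following sketch computes a few of them exactly in plain Python: cycle notation, the inversion list whose size is the Coxeter length, and the Lehmer code. The conventions used here (1-based values, inversions as position pairs, cycles listed with fixed points and starting from their smallest element) are assumptions for illustration and may differ from the model's canonical output format.

```python
def to_cycles(perm):
    """Cycle decomposition of a permutation in one-line notation (1-based)."""
    seen, cycles = set(), []
    for start in range(1, len(perm) + 1):
        if start in seen:
            continue
        cycle, x = [], start
        while x not in seen:
            seen.add(x)
            cycle.append(x)
            x = perm[x - 1]          # follow x -> perm(x)
        cycles.append(cycle)
    return cycles

def inversions(perm):
    """Position pairs (i, j), i < j, with perm[i] > perm[j] (1-based)."""
    n = len(perm)
    return [(i + 1, j + 1) for i in range(n) for j in range(i + 1, n)
            if perm[i] > perm[j]]

def lehmer_code(perm):
    """Entry i counts the later entries that are smaller than perm[i]."""
    return [sum(1 for y in perm[i + 1:] if y < x) for i, x in enumerate(perm)]

w = [3, 1, 2]
print(to_cycles(w))        # [[1, 3, 2]]
print(inversions(w))       # [(1, 2), (1, 3)]
print(len(inversions(w)))  # 2, the length of w
print(lehmer_code(w))      # [2, 0, 0]
```

The inversion list plays the role of a witness: its size is the length, so a reader can check the final answer against the listed pairs.
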
## Usage

Use deterministic (greedy) decoding for most evaluation-style tasks, and take the special token IDs from the tokenizer rather than hard-coding them.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "YOUR_ORG/permuformer"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Tokens are space-separated; generation continues after the final "=".
prompt = (
    "<|endoftext|> n3 "
    "1linebegin [ 3 , 1 , 2 ] 1lineend "
    "in cyclenotationmake ="
)

inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=False,  # no sampling, so decoding is deterministic
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

# Keep special tokens so the end-of-sequence marker stays visible.
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
```

### Prompt Format

All tokens are separated by spaces. Multi-digit integers, delimiters, and task names are each a single token. A typical example starts with `<|endoftext|>`, then a size token such as `n7`, then the task expression, then `=`.

Translation example:

```text
<|endoftext|> n3 1linebegin [ 3 , 1 , 2 ] 1lineend in cyclenotationmake =
```

Property example:

```text
<|endoftext|> n3 1linebegin [ 3 , 2 , 1 ] 1lineend property lengthmake =
```

Algebraic operation example:

```text
<|endoftext|> n3 1linebegin [ 2 , 1 , 3 ] 1lineend inversemake =
```

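The three examples above share one shape, which a small helper can assemble. This is a sketch inferred only from those examples: `one_line_tokens` and `build_prompt` are hypothetical names, and any task suffix other than the three shown would need to be checked against the tokenizer's vocabulary.

```python
BOS = "<|endoftext|>"

def one_line_tokens(perm):
    """Render a permutation as the space-separated one-line block."""
    body = " , ".join(str(x) for x in perm)
    return f"1linebegin [ {body} ] 1lineend"

def build_prompt(perm, task):
    """Assemble start token, size token, permutation, task suffix, and '='."""
    return f"{BOS} n{len(perm)} {one_line_tokens(perm)} {task} ="

print(build_prompt([3, 1, 2], "in cyclenotationmake"))
print(build_prompt([3, 2, 1], "property lengthmake"))
print(build_prompt([2, 1, 3], "inversemake"))
```

Building prompts programmatically avoids stray or missing spaces, which matter because the model expects the exact whitespace-tokenized syntax.
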
## Evaluation Notes

The training code evaluates by exact match on the generated right-hand side after `=`. The local training log for this repository reports, at step 2,522,000 on a 2,560-example stratified evaluation sample:

- Overall exact match: **98.44%**
- Translation: **97.78%**
- Property/statistic tasks: **99.17%**
- Algebraic tasks: **98.36%**

These figures come from the local log at a step earlier than the released `step_2600000` checkpoint. Treat them as indicative repository metadata, not as a full benchmark report for every downstream setting.

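A minimal sketch of exact-match scoring on the right-hand side follows. It assumes the answer is everything after the first `=` token, truncated at the next `<|endoftext|>`, and compared token by token. The training code's actual normalization is not shown here, and `extract_rhs` and `exact_match_rate` are illustrative names.

```python
EOS = "<|endoftext|>"

def extract_rhs(text):
    """Return the tokens after the first '=' up to the next end marker."""
    tokens = text.split()
    if "=" not in tokens:
        return []
    rhs = tokens[tokens.index("=") + 1:]
    if EOS in rhs:
        rhs = rhs[:rhs.index(EOS)]   # drop the end marker and anything after it
    return rhs

def exact_match_rate(decoded_outputs, targets):
    """Fraction of outputs whose right-hand side equals the target tokens."""
    hits = sum(extract_rhs(out) == tgt.split()
               for out, tgt in zip(decoded_outputs, targets))
    return hits / len(targets)

# Made-up strings with a placeholder answer syntax, for illustration only.
outputs = [f"{EOS} n3 task = a b {EOS}", f"{EOS} n3 task = a c {EOS}"]
print(exact_match_rate(outputs, ["a b", "a b"]))  # 0.5
```

Because the comparison is token-exact, an answer that is mathematically correct but formatted differently from the canonical target counts as a miss.
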
The paper also reports that PermuFormer is substantially more accurate than frontier general-purpose LLMs on a small held-out sample from the model's symbolic test distribution, while noting that the comparison is imperfect because PermuFormer was trained directly in this syntax.

## Finetuning

PermuFormer is designed to be finetuned on specialized permutation tasks. Experiments in the paper include:

- 231-avoidance and 2143-avoidance
- mHeight
- Schubert polynomial structure constants
- Kazhdan-Lusztig polynomial degree prediction

The repository's finetuning scripts compare starting from this pretrained checkpoint with training the same architecture from scratch.

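For the avoidance tasks, labels can be produced by brute force. The sketch below assumes classical pattern avoidance (no subsequence order-isomorphic to the pattern). It is not the paper's data pipeline, and `standardize` and `avoids` are illustrative names.

```python
from itertools import combinations, permutations

def standardize(seq):
    """Relative order of distinct values, e.g. (5, 9, 2) -> (2, 3, 1)."""
    rank = {v: r + 1 for r, v in enumerate(sorted(seq))}
    return tuple(rank[v] for v in seq)

def avoids(perm, pattern):
    """True if no subsequence of perm is order-isomorphic to pattern."""
    pattern = tuple(pattern)
    # combinations() keeps positions in increasing order, so each tuple
    # is a genuine subsequence of perm.
    return all(standardize(sub) != pattern
               for sub in combinations(perm, len(pattern)))

# Count avoiders in S_4 by exhaustive enumeration.
s4 = list(permutations(range(1, 5)))
print(sum(avoids(w, (2, 3, 1)) for w in s4))     # 14
print(sum(avoids(w, (2, 1, 4, 3)) for w in s4))  # 23
```

The check inspects every subsequence of the pattern's length, which is affordable for permutations up to the sizes in the training data but grows quickly beyond that.
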
## Limitations

- This is a specialist symbolic model. It expects the exact whitespace-tokenized syntax used during training and is brittle to natural-language paraphrases or malformed prompts.
- Training covers permutation sizes primarily in `S_2` through `S_11`; behavior outside that range is not guaranteed.
- Exact-match accuracy depends on canonical output formatting. Some mathematical tasks may have multiple valid answers, but evaluation expects the chosen canonical form.
- The model focuses on permutations. It does not natively handle broader combinatorial structures such as arbitrary graphs or partitions unless encoded through the supported task syntax.
- Outputs should be verified by exact combinatorial software for research-critical use.

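As one way to follow the last point, a predicted answer can be checked against an exact computation. The sketch below verifies a claimed inverse in plain Python. Parsing the model's output tokens into a list is left out because it depends on the output format, and `inverse` and `check_inverse` are illustrative names.

```python
def inverse(perm):
    """Exact inverse of a permutation in one-line notation (1-based)."""
    inv = [0] * len(perm)
    for position, value in enumerate(perm, start=1):
        inv[value - 1] = position    # perm sends position to value
    return inv

def check_inverse(perm, predicted):
    """Accept the prediction only if it equals the exact inverse."""
    return list(predicted) == inverse(perm)

print(check_inverse([3, 1, 2], [2, 3, 1]))  # True
print(check_inverse([3, 1, 2], [3, 1, 2]))  # False
```

The same pattern applies to other tasks: compute the answer with an exact routine or a computer algebra system and compare it with the model's output before relying on it.
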
## Citation

If you use this model, please cite the accompanying PermuFormer paper once citation details are available.