---
license: mit
tags:
  - auditing
  - llm
  - reasoning-tokens
  - sentence-transformers
  - matching-head
datasets:
  - s1ghhh/CoIn-Auditing-Dataset
language:
  - en
pipeline_tag: feature-extraction
---

# CoIn-Matching-Head

Pre-trained matching head models for the **CoIn** framework, a system for auditing hidden reasoning tokens in commercial LLM APIs.

**Paper**: [CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs](https://arxiv.org/abs/2505.13778)

**Code**: [GitHub](https://github.com/s1ghhh/LLM-Auditing-CoIn)

## Model Description

This repository contains three pre-trained models used in the CoIn auditing pipeline:

### 1. Tokens2Block Matching Head (Model A)
- **Purpose**: Verifies that sampled token IDs match their corresponding reasoning blocks
- **Architecture**: `sentence-transformers/all-MiniLM-L6-v2` base encoder + cosine similarity matching head
- **Input**: Token ID embeddings (mean-pooled) + reasoning block text embedding
- **Output**: Match probability (0-1)
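
The mean-pooling and cosine-similarity steps listed above can be sketched as follows. This is a minimal illustration, not the trained head: the random tensors stand in for real token ID and block embeddings, and `cosine_match_probability` with its `scale` and `bias` parameters is a hypothetical way of turning a similarity into a probability.

```python
import torch
import torch.nn.functional as F


def mean_pool(token_embeddings: torch.Tensor) -> torch.Tensor:
    """Average a (num_tokens, dim) matrix into a single (dim,) vector."""
    return token_embeddings.mean(dim=0)


def cosine_match_probability(emb_a: torch.Tensor, emb_b: torch.Tensor,
                             scale: float = 1.0, bias: float = 0.0) -> float:
    """Hypothetical head: sigmoid(scale * cos(emb_a, emb_b) + bias)."""
    cos = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0), dim=-1)
    return torch.sigmoid(scale * cos + bias).item()


torch.manual_seed(0)
dim = 384  # embedding size of all-MiniLM-L6-v2

# Stand-ins (assumption): 8 sampled token ID embeddings and one block embedding.
sampled_token_embeddings = torch.randn(8, dim)
block_embedding = torch.randn(dim)

pooled = mean_pool(sampled_token_embeddings)
prob = cosine_match_probability(pooled, block_embedding)
print(f"Match probability (untrained sketch): {prob:.4f}")
```

With real weights, the released `matching_head.pt` replaces the fixed `scale` and `bias` used here.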

### 2. Block2Answer Matching Head (Model B)
- **Purpose**: Verifies that each reasoning block is semantically relevant to the final answer
- **Architecture**: `sentence-transformers/all-MiniLM-L6-v2` base encoder + cosine similarity matching head
- **Input**: Reasoning block text embedding + answer text embedding
- **Output**: Match probability (0-1)

### 3. DeepSet Verifier
- **Purpose**: Aggregates per-block matching scores into a final benign/malicious prediction
- **Architecture**: DeepSet (permutation-invariant set encoding)
- **Input**: Sequence of interleaved (score_a, score_b) pairs from Models A and B
- **Output**: Probability of the sample being benign (0-1)
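
To illustrate what permutation-invariant set encoding means for this input, here is a minimal DeepSet sketch. The class name `TinyDeepSet`, the layer sizes, and the choice of mean pooling are assumptions for illustration; the released architecture is defined by `model_cfg.py`, which is not reproduced here. The weights are untrained, so the output value itself carries no meaning.

```python
import torch
from torch import nn


class TinyDeepSet(nn.Module):
    """Hypothetical DeepSet: per-element encoder phi, pooling, then decoder rho."""

    def __init__(self, in_dim: int = 2, hidden_dim: int = 32):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.rho = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, pairs: torch.Tensor) -> torch.Tensor:
        # pairs: (num_blocks, 2). Mean pooling ignores element order and
        # keeps the scale independent of the number of blocks.
        pooled = self.phi(pairs).mean(dim=0)
        return torch.sigmoid(self.rho(pooled)).squeeze(-1)


torch.manual_seed(0)
model = TinyDeepSet().eval()

# Interleaved scores [a1, b1, a2, b2, a3, b3] reshaped into (score_a, score_b) rows.
flat_scores = torch.tensor([0.91, 0.88, 0.95, 0.90, 0.12, 0.40])
pairs = flat_scores.view(-1, 2)

with torch.no_grad():
    p_benign = model(pairs)
    p_shuffled = model(pairs[[2, 0, 1]])  # same blocks, different order

print(f"P(benign) = {p_benign.item():.4f}, shuffled = {p_shuffled.item():.4f}")
```

The two printed values agree up to floating-point error, which is the property that lets the verifier treat the per-block scores as a set rather than an ordered sequence.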

## Training Details

- **Base Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2`
- **Matching Head Type**: Cosine similarity head (`cos_sim`)
- **Loss Function**: Focal Loss
- **Optimizer**: Adam
- **Learning Rate**: 2e-5
- **Batch Size**: 128
- **Epochs**: 3
- **Random Seed**: 42
- **Training Data**: [CoIn-Auditing-Dataset](https://huggingface.co/datasets/s1ghhh/CoIn-Auditing-Dataset)
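
As a rough sketch of how the loss and optimizer settings above fit together, the following shows a binary focal loss and a single Adam step at the listed learning rate and batch size. The `gamma` and `alpha` values, the `nn.Linear` stand-in head, and the random batch are assumptions for illustration; the actual training script lives in the GitHub repository.

```python
import torch
import torch.nn.functional as F
from torch import nn


def binary_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                      gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Focal loss: alpha_t * (1 - p_t)^gamma * BCE, averaged over the batch."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()


torch.manual_seed(42)
head = nn.Linear(1, 1)  # hypothetical stand-in for a trainable matching head
optimizer = torch.optim.Adam(head.parameters(), lr=2e-5)

# Random stand-in batch of 128 similarity features with binary labels.
features = torch.rand(128, 1)
labels = (torch.rand(128) > 0.5).float()

loss = binary_focal_loss(head(features).squeeze(-1), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"Focal loss on one batch: {loss.item():.4f}")
```

The `(1 - p_t) ** gamma` factor shrinks the contribution of pairs the model already classifies confidently, so training concentrates on the harder pairs.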

## Usage

### Quick Start

```python
import torch
from sentence_transformers import SentenceTransformer

# This import assumes the `heads` module from the CoIn GitHub repository
# is on your Python path.
from heads import get_matching_head

# Load Model B (Block2Answer, block_size=256)
model_dir = "./matching_head_BlockToAnswer/256/train_all-MiniLM-L6-v2_mixed_pos_merged_4_domain_0.5_hard_easy_mixed_neg_4_domain_limit0_cos_sim_focal_freeze"
embedding_model = SentenceTransformer(f"{model_dir}/embedding_model", trust_remote_code=True)

# Load the matching head and place it on the same device as the encoder
embedding_dim = embedding_model.get_sentence_embedding_dimension()
matching_head = get_matching_head("cos_sim", embedding_dim)
state_dict = torch.load(f"{model_dir}/matching_head.pt", map_location="cpu")
matching_head.load_state_dict(state_dict)
matching_head.to(embedding_model.device).eval()

# Score a (reasoning_block, answer) pair
emb_block = embedding_model.encode("The derivative of x^2 is 2x...", convert_to_tensor=True)
emb_answer = embedding_model.encode("The answer is 2x.", convert_to_tensor=True)

# The head expects batched inputs, so add a leading batch dimension
features = {"embedding_a": emb_block.unsqueeze(0), "embedding_b": emb_answer.unsqueeze(0)}
with torch.no_grad():
    logits = matching_head(features)["logits"]
    score = torch.sigmoid(logits).item()
print(f"Match score: {score:.4f}")
```

### Full Pipeline

See the [GitHub repository](https://github.com/s1ghhh/LLM-Auditing-CoIn) for the complete CoIn pipeline usage.

## File Structure

```
CoIn-Matching-Head/
├── matching_head_TokensToBlock/         # Model A
│   └── {256,512,1024}/                  # Block size variants
│       └── train_.../
│           ├── embedding_model/         # Sentence-transformers model
│           ├── matching_head.pt         # Matching head weights
│           └── tokenid_embedding_cache.pt
├── matching_head_BlockToAnswer/         # Model B
│   └── {256,512,1024}/
│       └── train_.../
│           ├── embedding_model/
│           └── matching_head.pt
└── learned_verifier/
    ├── DeepSet/
    │   ├── deepset_weight.pt            # DeepSet verifier weights
    │   └── model_cfg.py                 # Model config
    └── RNN/                             # RNN verifier variant
```
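
Because the run directory names are long, it can be convenient to locate them by pattern rather than typing them out. The helper below is a hypothetical sketch (`find_run_dir` is not part of the CoIn code) and assumes a local copy of the repository laid out as in the tree above, with exactly one `train_*` directory per block size.

```python
from pathlib import Path

VALID_BLOCK_SIZES = (256, 512, 1024)


def find_run_dir(repo_root: str, model: str, block_size: int) -> Path:
    """Return the single train_* directory for a model and block size.

    model is "TokensToBlock" (Model A) or "BlockToAnswer" (Model B).
    """
    if model not in ("TokensToBlock", "BlockToAnswer"):
        raise ValueError(f"unknown model: {model!r}")
    if block_size not in VALID_BLOCK_SIZES:
        raise ValueError(f"block_size must be one of {VALID_BLOCK_SIZES}")

    parent = Path(repo_root) / f"matching_head_{model}" / str(block_size)
    candidates = sorted(p for p in parent.glob("train_*") if p.is_dir())
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected exactly one train_* directory in {parent}, "
            f"found {len(candidates)}"
        )
    return candidates[0]


# Example (requires a local copy of the repository):
# model_dir = find_run_dir("./CoIn-Matching-Head", "BlockToAnswer", 256)
```

The returned path can be used as `model_dir` when loading `embedding_model/` and `matching_head.pt`.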

## Citation

```bibtex
@article{sun2025coin,
  title={{CoIn}: Counting the Invisible Reasoning Tokens in Commercial Opaque {LLM} {API}s},
  author={Sun, Guoheng and Wang, Ziyao and Tian, Bowei and Liu, Meng and Shen, Zheyu and He, Shwai and He, Yexiao and Ye, Wanghao and Wang, Yiting and Li, Ang},
  journal={arXiv preprint arXiv:2505.13778},
  year={2025}
}
```