---
license: mit
tags:
- auditing
- llm
- reasoning-tokens
- sentence-transformers
- matching-head
datasets:
- s1ghhh/CoIn-Auditing-Dataset
language:
- en
pipeline_tag: feature-extraction
---
# CoIn-Matching-Head
Pre-trained matching head models for the **CoIn** framework, a system for auditing hidden reasoning tokens in commercial LLM APIs.

**Paper**: [CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs](https://arxiv.org/abs/2505.13778)

**Code**: [GitHub](https://github.com/s1ghhh/LLM-Auditing-CoIn)
## Model Description
This repository contains three pre-trained models used in the CoIn auditing pipeline:
### 1. Tokens2Block Matching Head (Model A)
- **Purpose**: Verifies that sampled token IDs match their corresponding reasoning blocks
- **Architecture**: `sentence-transformers/all-MiniLM-L6-v2` base encoder + cosine similarity matching head
- **Input**: Token ID embeddings (mean-pooled) + reasoning block text embedding
- **Output**: Match probability (0-1)
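
Model A pairs a mean-pooled embedding of the sampled token IDs with the embedding of the reasoning block text. The sketch below shows one way the pooling step could work, assuming a precomputed lookup table from token ID to embedding vector. The `tokenid_embedding_cache.pt` file listed under File Structure may play this role, but its format is not documented here, so the `(vocab_size, dim)` tensor layout and the `mean_pool_token_ids` helper are hypothetical.

```python
import torch


def mean_pool_token_ids(token_ids: list, cache: torch.Tensor) -> torch.Tensor:
    """Average the cached embeddings of the sampled token IDs.

    Hypothetical helper: `cache` is assumed to be a (vocab_size, dim) tensor
    whose row i holds the embedding of token ID i.
    """
    if not token_ids:
        raise ValueError("token_ids must be non-empty")
    ids = torch.tensor(token_ids, dtype=torch.long)
    vectors = cache.index_select(0, ids)  # (num_tokens, dim)
    return vectors.mean(dim=0)            # (dim,)


# Toy example: a 5-token vocabulary with 3-dimensional embeddings
cache = torch.arange(15, dtype=torch.float32).reshape(5, 3)
pooled = mean_pool_token_ids([0, 2, 4], cache)
print(pooled)  # tensor([6., 7., 8.])
```

Mean pooling makes the token-side embedding independent of how many tokens were sampled, so it lives in the same fixed-size space as the block text embedding.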
### 2. Block2Answer Matching Head (Model B)
- **Purpose**: Verifies that each reasoning block is semantically relevant to the final answer
- **Architecture**: `sentence-transformers/all-MiniLM-L6-v2` base encoder + cosine similarity matching head
- **Input**: Reasoning block text embedding + answer text embedding
- **Output**: Match probability (0-1)
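
Both matching heads use a cosine-similarity (`cos_sim`) head over a pair of embeddings. The real head is defined in the `heads` module of the GitHub repository; the class below is a hypothetical stand-in that assumes the head scales the cosine similarity by a learnable factor and adds a learnable bias to produce a logit. It accepts the same `{"embedding_a", "embedding_b"}` feature dict and returns a `"logits"` entry, mirroring the Quick Start interface, so a sigmoid turns its output into a match probability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CosSimMatchingHead(nn.Module):
    """Hypothetical cosine-similarity matching head (not the repo's implementation).

    logit = scale * cos(embedding_a, embedding_b) + bias
    """

    def __init__(self, init_scale: float = 10.0, init_bias: float = 0.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))
        self.bias = nn.Parameter(torch.tensor(init_bias))

    def forward(self, features: dict) -> dict:
        a, b = features["embedding_a"], features["embedding_b"]  # (batch, dim) each
        cos = F.cosine_similarity(a, b, dim=-1)                  # (batch,)
        return {"logits": self.scale * cos + self.bias}


head = CosSimMatchingHead()
a = torch.tensor([[1.0, 0.0], [1.0, 0.0]])
b = torch.tensor([[1.0, 0.0], [-1.0, 0.0]])
with torch.no_grad():
    probs = torch.sigmoid(head({"embedding_a": a, "embedding_b": b})["logits"])
print(probs)  # identical pair -> near 1, opposite pair -> near 0
```

The scale matters because raw cosine similarity is confined to [-1, 1]; fed straight into a sigmoid it could only yield probabilities between roughly 0.27 and 0.73.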
### 3. DeepSet Verifier
- **Purpose**: Aggregates per-block matching scores into a final benign/malicious prediction
- **Architecture**: DeepSet (permutation-invariant set encoding)
- **Input**: Sequence of interleaved (score_a, score_b) pairs from Model A and B
- **Output**: Probability of the sample being benign (0-1)
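
A DeepSet encodes each element of a set with a shared network φ, pools the results with a symmetric operation such as a sum or mean, and maps the pooled vector through a second network ρ, so the output does not depend on element order. The sketch below applies that pattern to per-block (score_a, score_b) pairs. The layer sizes, mean pooling, and the `DeepSetVerifier` name are assumptions; the released configuration lives in `learned_verifier/DeepSet/model_cfg.py`.

```python
import torch
import torch.nn as nn


class DeepSetVerifier(nn.Module):
    """Hypothetical DeepSet over per-block (score_a, score_b) pairs.

    Input shape: (batch, num_blocks, 2). Output: P(benign) per sample.
    """

    def __init__(self, hidden: int = 32):
        super().__init__()
        # phi: shared encoder applied to every (score_a, score_b) pair
        self.phi = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        # rho: maps the pooled set representation to a single logit
        self.rho = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, pairs: torch.Tensor) -> torch.Tensor:
        encoded = self.phi(pairs)     # (batch, num_blocks, hidden)
        pooled = encoded.mean(dim=1)  # symmetric pooling: block order is irrelevant
        return torch.sigmoid(self.rho(pooled)).squeeze(-1)  # (batch,)


# Made-up per-block scores, for illustration only
scores_a = torch.tensor([0.91, 0.85, 0.12])  # Model A (Tokens2Block), one per block
scores_b = torch.tensor([0.88, 0.79, 0.30])  # Model B (Block2Answer), one per block
pairs = torch.stack([scores_a, scores_b], dim=-1).unsqueeze(0)  # (1, 3, 2)

torch.manual_seed(0)
verifier = DeepSetVerifier().eval()  # untrained weights, so the value itself is meaningless
with torch.no_grad():
    p_benign = verifier(pairs)
    p_shuffled = verifier(pairs[:, [2, 0, 1], :])  # same blocks, different order
print(p_benign, p_shuffled)  # equal up to floating-point rounding
```

If the released verifier instead expects the flat interleaved sequence `[a1, b1, a2, b2, ...]`, `pairs.flatten(1)` produces exactly that ordering from the stacked tensor.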
## Training Details
- **Base Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2`
- **Matching Head Type**: Cosine similarity head (`cos_sim`)
- **Loss Function**: Focal Loss
- **Optimizer**: Adam
- **Learning Rate**: 2e-5
- **Batch Size**: 128
- **Epochs**: 3
- **Random Seed**: 42
- **Training Data**: [CoIn-Auditing-Dataset](https://huggingface.co/datasets/s1ghhh/CoIn-Auditing-Dataset)
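
Focal loss down-weights easy examples by scaling binary cross-entropy with (1 - p_t)^γ, where p_t is the predicted probability of the true label. The sketch below implements the standard binary focal loss (Lin et al., 2017) and wires it to the listed optimizer settings (Adam, learning rate 2e-5, batch size 128, seed 42). The values γ = 2 and α = 0.25 are the common defaults from that paper, not values confirmed by this repository, and the linear layer is a placeholder for the actual matching head.

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


def binary_focal_loss(
    logits: torch.Tensor,
    targets: torch.Tensor,
    gamma: float = 2.0,             # assumed focusing parameter
    alpha: Optional[float] = 0.25,  # assumed class-balance weight; None disables it
) -> torch.Tensor:
    """Mean of alpha_t * (1 - p_t)**gamma * BCE(logits, targets)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)  # probability given to the true label
    loss = (1 - p_t) ** gamma * bce
    if alpha is not None:
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        loss = alpha_t * loss
    return loss.mean()


torch.manual_seed(42)  # random seed from Training Details
# Placeholder model: 384 is the all-MiniLM-L6-v2 embedding size
model = nn.Linear(384, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# One optimization step on a random toy batch of 128 examples
x = torch.randn(128, 384)
y = torch.randint(0, 2, (128,)).float()
loss = binary_focal_loss(model(x).squeeze(-1), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"focal loss: {loss.item():.4f}")
```

With γ = 0 and α disabled the function reduces to plain binary cross-entropy, which is a quick sanity check when porting it.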
## Usage
### Quick Start
```python
import torch
from sentence_transformers import SentenceTransformer

# `heads` is not a pip package; it is expected to come from the CoIn GitHub
# repository (run from the repo root or add it to PYTHONPATH).
from heads import get_matching_head

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model B (Block2Answer, block_size=256) from a local copy of this repository
model_dir = "./matching_head_BlockToAnswer/256/train_all-MiniLM-L6-v2_mixed_pos_merged_4_domain_0.5_hard_easy_mixed_neg_4_domain_limit0_cos_sim_focal_freeze"
embedding_model = SentenceTransformer(f"{model_dir}/embedding_model", device=device, trust_remote_code=True)

# Load the matching head and place it on the same device as the encoder
embedding_dim = embedding_model.get_sentence_embedding_dimension()
matching_head = get_matching_head("cos_sim", embedding_dim)
matching_head.load_state_dict(torch.load(f"{model_dir}/matching_head.pt", map_location=device))
matching_head.to(device)
matching_head.eval()

# Score a (reasoning_block, answer) pair
emb_block = embedding_model.encode("The derivative of x^2 is 2x...", convert_to_tensor=True)
emb_answer = embedding_model.encode("The answer is 2x.", convert_to_tensor=True)
features = {"embedding_a": emb_block.unsqueeze(0), "embedding_b": emb_answer.unsqueeze(0)}
with torch.no_grad():
    logits = matching_head(features)["logits"]
score = torch.sigmoid(logits).item()  # match probability in [0, 1]
print(f"Match score: {score:.4f}")
```
### Full Pipeline
See the [GitHub repository](https://github.com/s1ghhh/LLM-Auditing-CoIn) for the complete CoIn pipeline usage.
## File Structure
```
CoIn-Matching-Head/
├── matching_head_TokensToBlock/           # Model A
│   └── {256,512,1024}/                    # Block size variants
│       └── train_.../
│           ├── embedding_model/           # Sentence-transformers model
│           ├── matching_head.pt           # Matching head weights
│           └── tokenid_embedding_cache.pt
├── matching_head_BlockToAnswer/           # Model B
│   └── {256,512,1024}/
│       └── train_.../
│           ├── embedding_model/
│           └── matching_head.pt
└── learned_verifier/
    ├── DeepSet/
    │   ├── deepset_weight.pt              # DeepSet verifier weights
    │   └── model_cfg.py                   # Model config
    └── RNN/                               # RNN verifier variant
```
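
Because each run directory carries a long `train_...` name, it can be easier to locate checkpoints by globbing than by hard-coding paths. The helper below is a hypothetical convenience function that assumes a local copy of this repository laid out as in the tree above; it maps `(model_dir_name, block_size)` to the run directory that contains `matching_head.pt`.

```python
from __future__ import annotations

from pathlib import Path


def find_matching_heads(root: str | Path) -> dict[tuple[str, int], Path]:
    """Map (model_dir_name, block_size) -> run directory holding matching_head.pt.

    Hypothetical helper assuming the layout
    <root>/matching_head_<Name>/<block_size>/train_*/matching_head.pt.
    If several runs share a block size, the last one in sorted order wins.
    """
    found = {}
    for weights in sorted(Path(root).glob("matching_head_*/*/train_*/matching_head.pt")):
        run_dir = weights.parent
        block_dir = run_dir.parent
        model_dir = block_dir.parent
        if block_dir.name.isdigit():  # skip anything that is not a block-size folder
            found[(model_dir.name, int(block_dir.name))] = run_dir
    return found


checkpoints = find_matching_heads("./CoIn-Matching-Head")
for (model, block_size), run_dir in sorted(checkpoints.items()):
    print(model, block_size, run_dir.name)
```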
## Citation
```bibtex
@article{sun2025coin,
title={Coin: Counting the invisible reasoning tokens in commercial opaque llm apis},
author={Sun, Guoheng and Wang, Ziyao and Tian, Bowei and Liu, Meng and Shen, Zheyu and He, Shwai and He, Yexiao and Ye, Wanghao and Wang, Yiting and Li, Ang},
journal={arXiv preprint arXiv:2505.13778},
year={2025}
}
```