	---
	license: mit
	tags:
	- auditing
	- llm
	- reasoning-tokens
	- sentence-transformers
	- matching-head
	datasets:
	- s1ghhh/CoIn-Auditing-Dataset
	language:
	- en
	pipeline_tag: feature-extraction
	---

	# CoIn-Matching-Head

	Pre-trained matching head models for CoIn, a framework for auditing hidden reasoning tokens in commercial LLM APIs.

	Paper: [CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs](https://arxiv.org/abs/2505.13778)

	Code: [GitHub](https://github.com/s1ghhh/LLM-Auditing-CoIn)

	## Model Description

	This repository contains three pre-trained models used in the CoIn auditing pipeline:

	### 1. Tokens2Block Matching Head (Model A)
	- Purpose: Verifies that sampled token IDs match their corresponding reasoning blocks
	- Architecture: `sentence-transformers/all-MiniLM-L6-v2` base encoder + cosine similarity matching head
	- Input: Token ID embeddings (mean-pooled) + reasoning block text embedding
	- Output: Match probability (0-1)
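
To make the input and output above concrete, here is a minimal, dependency-free sketch of mean pooling followed by a cosine-similarity score. It is not the trained head from this repository: `mean_pool`, `cosine_similarity`, `match_probability`, and the `scale` and `bias` values are all hypothetical names and numbers chosen for illustration, and the real head loads learned weights from `matching_head.pt`.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_probability(emb_a, emb_b, scale=5.0, bias=0.0):
    """Squash a scaled cosine similarity into (0, 1).

    scale and bias are illustrative assumptions, not trained values.
    """
    logit = scale * cosine_similarity(emb_a, emb_b) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy 3-dimensional embeddings; the real encoder produces 384-dimensional vectors.
token_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pooled = mean_pool(token_embeddings)
block_embedding = [1.0, 1.0, 0.0]
print(f"Match probability: {match_probability(pooled, block_embedding):.4f}")
```

Mean pooling turns a variable number of token embeddings into one fixed-size vector, so it can be compared directly with the single embedding of the reasoning block.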

	### 2. Block2Answer Matching Head (Model B)
	- Purpose: Verifies that each reasoning block is semantically relevant to the final answer
	- Architecture: `sentence-transformers/all-MiniLM-L6-v2` base encoder + cosine similarity matching head
	- Input: Reasoning block text embedding + answer text embedding
	- Output: Match probability (0-1)

	### 3. DeepSet Verifier
	- Purpose: Aggregates per-block matching scores into a final benign/malicious prediction
	- Architecture: DeepSet (permutation-invariant set encoding)
	- Input: Sequence of interleaved (score_a, score_b) pairs from Models A and B
	- Output: Probability of the sample being benign (0-1)
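
The permutation-invariant aggregation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the released verifier: the one-layer `phi`, the mean pooling, the linear read-out, the hidden size of 4, and the random parameters are all hypothetical, and the real architecture is defined by `model_cfg.py` and `deepset_weight.pt`.

```python
import math
import random


def phi(pair, weights, biases):
    """Per-element encoder: one ReLU layer applied to a (score_a, score_b) pair."""
    score_a, score_b = pair
    return [max(0.0, w_a * score_a + w_b * score_b + b)
            for (w_a, w_b), b in zip(weights, biases)]


def deepset_probability(pairs, params):
    """Encode each pair, mean-pool across the set, then map to a probability."""
    encoded = [phi(pair, params["phi_w"], params["phi_b"]) for pair in pairs]
    # Pooling over the set dimension is what removes the dependence on order.
    pooled = [sum(column) / len(encoded) for column in zip(*encoded)]
    logit = sum(w * h for w, h in zip(params["rho_w"], pooled)) + params["rho_b"]
    return 1.0 / (1.0 + math.exp(-logit))


# Random, untrained parameters: hypothetical stand-ins for the trained weights.
rng = random.Random(42)
hidden = 4
params = {
    "phi_w": [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(hidden)],
    "phi_b": [rng.uniform(-1, 1) for _ in range(hidden)],
    "rho_w": [rng.uniform(-1, 1) for _ in range(hidden)],
    "rho_b": 0.0,
}

pairs = [(0.91, 0.88), (0.12, 0.75), (0.67, 0.05)]  # made-up (score_a, score_b) values
shuffled = [pairs[2], pairs[0], pairs[1]]
print(deepset_probability(pairs, params))
print(deepset_probability(shuffled, params))  # same value up to float rounding
```

Because the pooled vector is an average over all pairs, reordering the blocks leaves the prediction unchanged, and samples with different numbers of blocks map to the same fixed-size representation.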

	## Training Details

	- Base Embedding Model: `sentence-transformers/all-MiniLM-L6-v2`
	- Matching Head Type: Cosine similarity head (`cos_sim`)
	- Loss Function: Focal Loss
	- Optimizer: Adam
	- Learning Rate: 2e-5
	- Batch Size: 128
	- Epochs: 3
	- Random Seed: 42
	- Training Data: [CoIn-Auditing-Dataset](https://huggingface.co/datasets/s1ghhh/CoIn-Auditing-Dataset)
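
For readers unfamiliar with the loss listed above, the following is a minimal sketch of binary focal loss for a single example. The `gamma=2.0` and `alpha=0.25` defaults are the commonly cited values from the original focal loss paper and are assumptions here; this README does not state which values were used for training, and `binary_focal_loss` is a hypothetical helper, not a function from the CoIn code.

```python
import math


def binary_focal_loss(logit, target, gamma=2.0, alpha=0.25):
    """Focal loss for one example; target is 1 (match) or 0 (no match)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha  # class-balancing weight
    # The (1 - p_t) ** gamma factor shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(logit=4.0, target=1)    # confident and correct
hard = binary_focal_loss(logit=-4.0, target=1)   # confident and wrong
print(f"easy: {easy:.6f}, hard: {hard:.6f}")
```

With `gamma=0` the modulating factor disappears and the expression reduces to class-weighted binary cross-entropy, which is a quick way to sanity-check an implementation.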

	## Usage

	### Quick Start

	```python
	import torch
	from sentence_transformers import SentenceTransformer

	# `heads` is assumed to be the module shipped with the CoIn GitHub repository;
	# run this script from a checkout of that code so the import resolves.
	from heads import get_matching_head

	# Load Model B (Block2Answer, block_size=256)
	model_dir = "./matching_head_BlockToAnswer/256/train_all-MiniLM-L6-v2_mixed_pos_merged_4_domain_0.5_hard_easy_mixed_neg_4_domain_limit0_cos_sim_focal_freeze"
	embedding_model = SentenceTransformer(f"{model_dir}/embedding_model", trust_remote_code=True)

	# Build the cosine-similarity matching head and load its weights
	embedding_dim = embedding_model.get_sentence_embedding_dimension()
	matching_head = get_matching_head("cos_sim", embedding_dim)
	matching_head.load_state_dict(torch.load(f"{model_dir}/matching_head.pt", map_location="cpu"))
	matching_head.eval()

	# Score a (reasoning_block, answer) pair
	emb_block = embedding_model.encode("The derivative of x^2 is 2x...", convert_to_tensor=True)
	emb_answer = embedding_model.encode("The answer is 2x.", convert_to_tensor=True)
	matching_head.to(emb_block.device)  # keep the head on the same device as the embeddings

	# Add a batch dimension: the head expects tensors of shape (batch, embedding_dim)
	features = {"embedding_a": emb_block.unsqueeze(0), "embedding_b": emb_answer.unsqueeze(0)}
	with torch.no_grad():
	    logits = matching_head(features)["logits"]
	score = torch.sigmoid(logits).item()
	print(f"Match score: {score:.4f}")
	```

	### Full Pipeline

	See the [GitHub repository](https://github.com/s1ghhh/LLM-Auditing-CoIn) for the complete CoIn pipeline usage.

	## File Structure

	```
	CoIn-Matching-Head/
	├── matching_head_TokensToBlock/          # Model A
	│   └── {256,512,1024}/                   # Block size variants
	│       └── train_.../
	│           ├── embedding_model/          # Sentence-transformers model
	│           ├── matching_head.pt          # Matching head weights
	│           └── tokenid_embedding_cache.pt
	├── matching_head_BlockToAnswer/          # Model B
	│   └── {256,512,1024}/
	│       └── train_.../
	│           ├── embedding_model/
	│           └── matching_head.pt
	└── learned_verifier/
	    ├── DeepSet/
	    │   ├── deepset_weight.pt             # DeepSet verifier weights
	    │   └── model_cfg.py                  # Model config
	    └── RNN/                              # RNN verifier variant
	```
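
Since the run directory names are long, it can be easier to locate checkpoints programmatically than to type the paths. The helper below is a small sketch that assumes the layout shown above and that run directories start with `train_`; `find_run_dirs` is a hypothetical name, not part of the CoIn code.

```python
from pathlib import Path


def find_run_dirs(root, head_name, block_size):
    """Return run directories under root/head_name/block_size that contain matching_head.pt."""
    base = Path(root) / head_name / str(block_size)
    if not base.is_dir():
        return []
    return sorted(path.parent for path in base.glob("train_*/matching_head.pt"))


# List the Model B checkpoints found for each block size in the current directory.
for block_size in (256, 512, 1024):
    runs = find_run_dirs(".", "matching_head_BlockToAnswer", block_size)
    print(block_size, [str(run) for run in runs])
```

Each returned directory can be used as `model_dir` in the Quick Start example.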

	## Citation

	```bibtex
	@article{sun2025coin,
	  title={{CoIn}: Counting the Invisible Reasoning Tokens in Commercial Opaque {LLM} {API}s},
	  author={Sun, Guoheng and Wang, Ziyao and Tian, Bowei and Liu, Meng and Shen, Zheyu and He, Shwai and He, Yexiao and Ye, Wanghao and Wang, Yiting and Li, Ang},
	  journal={arXiv preprint arXiv:2505.13778},
	  year={2025}
	}
	```