---
license: mit
tags:
- auditing
- llm
- reasoning-tokens
- sentence-transformers
- matching-head
datasets:
- s1ghhh/CoIn-Auditing-Dataset
language:
- en
pipeline_tag: feature-extraction
---

# CoIn-Matching-Head

Pre-trained matching head models for the **CoIn** framework, a system for auditing hidden reasoning tokens in commercial LLM APIs.

**Paper**: [CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs](https://arxiv.org/abs/2505.13778)

**Code**: [GitHub](https://github.com/s1ghhh/LLM-Auditing-CoIn)

## Model Description

This repository contains three pre-trained models used in the CoIn auditing pipeline:

### 1. Tokens2Block Matching Head (Model A)

- **Purpose**: Verifies that sampled token IDs match their corresponding reasoning blocks
- **Architecture**: `sentence-transformers/all-MiniLM-L6-v2` base encoder + cosine similarity matching head
- **Input**: Token ID embeddings (mean-pooled) + reasoning block text embedding
- **Output**: Match probability (0-1)

### 2. Block2Answer Matching Head (Model B)

- **Purpose**: Verifies that each reasoning block is semantically relevant to the final answer
- **Architecture**: `sentence-transformers/all-MiniLM-L6-v2` base encoder + cosine similarity matching head
- **Input**: Reasoning block text embedding + answer text embedding
- **Output**: Match probability (0-1)

### 3. DeepSet Verifier

- **Purpose**: Aggregates per-block matching scores into a final benign/malicious prediction
- **Architecture**: DeepSet (permutation-invariant set encoding)
- **Input**: Sequence of interleaved (score_a, score_b) pairs from Models A and B
- **Output**: Probability of the sample being benign (0-1)

## Training Details

- **Base Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2`
- **Matching Head Type**: Cosine similarity head (`cos_sim`)
- **Loss Function**: Focal Loss
- **Optimizer**: Adam
- **Learning Rate**: 2e-5
- **Batch Size**: 128
- **Epochs**: 3
- **Random Seed**: 42
- **Training Data**: [CoIn-Auditing-Dataset](https://huggingface.co/datasets/s1ghhh/CoIn-Auditing-Dataset)

## Usage

### Quick Start

```python
import torch
from sentence_transformers import SentenceTransformer

# `heads` is not a pip package; it is presumably the module shipped with the CoIn code
# (see the GitHub link above), so run this from a checkout of that repository.
from heads import get_matching_head

# Load Model B (Block2Answer, block_size=256)
model_dir = "./matching_head_BlockToAnswer/256/train_all-MiniLM-L6-v2_mixed_pos_merged_4_domain_0.5_hard_easy_mixed_neg_4_domain_limit0_cos_sim_focal_freeze"
embedding_model = SentenceTransformer(f"{model_dir}/embedding_model", trust_remote_code=True)

# Load matching head
embedding_dim = embedding_model.get_sentence_embedding_dimension()
matching_head = get_matching_head("cos_sim", embedding_dim)
matching_head.load_state_dict(torch.load(f"{model_dir}/matching_head.pt"))
matching_head.eval()

# Score a (reasoning_block, answer) pair
emb_block = embedding_model.encode("The derivative of x^2 is 2x...", convert_to_tensor=True)
emb_answer = embedding_model.encode("The answer is 2x.", convert_to_tensor=True)

# The head expects batched inputs, so add a leading batch dimension of 1
features = {"embedding_a": emb_block.unsqueeze(0), "embedding_b": emb_answer.unsqueeze(0)}
with torch.no_grad():
    logits = matching_head(features)["logits"]

# Convert the logit to a match probability in [0, 1]
score = torch.sigmoid(logits).item()
print(f"Match score: {score:.4f}")
```

### Full Pipeline

See the [GitHub repository](https://github.com/s1ghhh/LLM-Auditing-CoIn) for the complete CoIn pipeline usage.

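The Quick Start hard-codes one run directory for Model B at block size 256. To switch between Model A and Model B or between block sizes without copying the long `train_...` name by hand, a small lookup along the following lines may help. It is only a sketch: `find_model_dir` is a hypothetical helper that is not part of the CoIn codebase, and it assumes each block-size folder holds exactly one `train_*` directory, as the layout under File Structure suggests.

```python
from pathlib import Path

# Directory names and block sizes as listed under "File Structure".
HEAD_DIRS = {"A": "matching_head_TokensToBlock", "B": "matching_head_BlockToAnswer"}
BLOCK_SIZES = (256, 512, 1024)


def find_model_dir(repo_root, model, block_size):
    """Return the single train_* directory for one matching head and block size.

    Hypothetical helper for illustration; not part of the CoIn codebase.
    """
    if model not in HEAD_DIRS:
        raise ValueError(f"model must be one of {sorted(HEAD_DIRS)}, got {model!r}")
    if block_size not in BLOCK_SIZES:
        raise ValueError(f"block_size must be one of {BLOCK_SIZES}, got {block_size}")

    parent = Path(repo_root) / HEAD_DIRS[model] / str(block_size)
    # Glob for the run directory instead of hard-coding the name after "train_".
    candidates = sorted(p for p in parent.glob("train_*") if p.is_dir())
    if len(candidates) != 1:
        # Assumption: one run per block size; fail loudly if that does not hold.
        raise FileNotFoundError(
            f"expected exactly one train_* directory in {parent}, found {len(candidates)}"
        )
    return candidates[0]
```

With a local copy of this repository, `model_dir = find_model_dir(".", "B", 512)` would give a path that can be dropped into the Quick Start in place of the hard-coded string.
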
## File Structure

```
CoIn-Matching-Head/
├── matching_head_TokensToBlock/          # Model A
│   └── {256,512,1024}/                   # Block size variants
│       └── train_.../
│           ├── embedding_model/          # Sentence-transformers model
│           ├── matching_head.pt          # Matching head weights
│           └── tokenid_embedding_cache.pt
├── matching_head_BlockToAnswer/          # Model B
│   └── {256,512,1024}/
│       └── train_.../
│           ├── embedding_model/
│           └── matching_head.pt
└── learned_verifier/
    ├── DeepSet/
    │   ├── deepset_weight.pt             # DeepSet verifier weights
    │   └── model_cfg.py                  # Model config
    └── RNN/                              # RNN verifier variant
```

## Citation

```bibtex
@article{sun2025coin,
  title={Coin: Counting the invisible reasoning tokens in commercial opaque llm apis},
  author={Sun, Guoheng and Wang, Ziyao and Tian, Bowei and Liu, Meng and Shen, Zheyu and He, Shwai and He, Yexiao and Ye, Wanghao and Wang, Yiting and Li, Ang},
  journal={arXiv preprint arXiv:2505.13778},
  year={2025}
}
```