# Code Similarity Visualization with GraphCodeBERT

This interactive application visualizes token-level embeddings generated by [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base) for classical sorting algorithms. It supports pairwise comparison of algorithms based on their representation in the model's embedding space, using PCA for dimensionality reduction.

## ✒️ Reference

Martinez-Gil, J. (2025). **Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks**. *International Journal of Software Engineering and Knowledge Engineering*, 35(05), 657–678.

## 🚀 Features

- Selection of two classical sorting algorithms.
- Automatic tokenization and embedding via GraphCodeBERT.
- PCA-based projection into 2D space for visualization.
- Clean, static matplotlib plots showing token overlap and divergence.

## 🧠 Technical Overview

- **Model**: [`microsoft/graphcodebert-base`](https://huggingface.co/microsoft/graphcodebert-base)
- **Tokenizer**: RobertaTokenizer
- **Embeddings**: Last hidden layer of GraphCodeBERT
- **Reduction Technique**: Principal Component Analysis (PCA)
- **Interface**: Gradio
- **Languages**: Python 3.10+

## 🔬 Research Context

This tool supports research on code similarity, clone detection, and representation learning for source code. It offers insight into how GraphCodeBERT encodes common algorithmic patterns, providing a visual companion to embedding-based analysis.

## 🛠 Dependencies

All required libraries are listed in `requirements.txt`:

```
transformers
torch
scikit-learn
numpy
matplotlib
gradio
Pillow
```

## 🖥️ Intended Use

- Academic teaching and demonstration of code embeddings
- Qualitative evaluation of pretrained models for source code
- Supplementary visualization for software engineering publications

## 📬 Contact

**Jorge Martinez-Gil**
Senior Research Scientist in Computer Science
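
## Example: Embedding and Projection Sketch

The pipeline summarized in the Technical Overview (tokenize, take the last hidden layer, reduce with PCA) can be sketched roughly as follows. This is a minimal illustration and not the application's actual code: the function names `embed_tokens` and `project_pair_2d` are hypothetical, fitting a single PCA jointly on both snippets is an assumption, and the projection uses a plain NumPy SVD as a stand-in for scikit-learn's `PCA(n_components=2)`, which should give the same coordinates up to the sign of each axis.

```python
import numpy as np


def embed_tokens(code: str):
    """Return (tokens, embeddings) taken from GraphCodeBERT's last hidden layer."""
    # Imported lazily so the projection helper below works without these packages.
    import torch
    from transformers import RobertaModel, RobertaTokenizer

    name = "microsoft/graphcodebert-base"
    tokenizer = RobertaTokenizer.from_pretrained(name)
    model = RobertaModel.from_pretrained(name)
    model.eval()

    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens, hidden.numpy()


def project_pair_2d(emb_a: np.ndarray, emb_b: np.ndarray):
    """Project two sets of token embeddings into one shared 2D PCA space.

    Assumption: one PCA is fitted on the stacked embeddings of both snippets,
    so that their points are comparable on the same axes.
    """
    stacked = np.vstack([emb_a, emb_b])
    centered = stacked - stacked.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    return coords[: len(emb_a)], coords[len(emb_a):]
```

With the model downloaded, a pair of snippets could be compared by calling `embed_tokens` on each and passing the two embedding arrays to `project_pair_2d`; the resulting 2D points can then be drawn with a matplotlib scatter plot, one color per algorithm.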