# Code Similarity Visualization with GraphCodeBERT
This interactive application visualizes token-level embeddings generated by [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base) for classical sorting algorithms. It supports pairwise comparison of algorithms based on their representation in the model’s embedding space, using PCA for dimensionality reduction.
## ✒️ Reference
Martinez-Gil, J. (2025).
**Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks**.
*International Journal of Software Engineering and Knowledge Engineering*, 35(05), 657–678.
## 🚀 Features
- Selection of two classical sorting algorithms.
- Automatic tokenization and embedding via GraphCodeBERT.
- PCA-based projection into 2D space for visualization.
- Clean, static matplotlib plots showing token overlap and divergence.
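
The projection and plotting steps listed above can be sketched as follows. This is a minimal illustration, not the application's actual code: `project_pair` and `plot_pair` are hypothetical helper names, the random arrays stand in for real GraphCodeBERT token embeddings (768 dimensions, the hidden size of the base model), and fitting a single PCA on both token sets together is an assumed design choice so that both algorithms share the same axes.

```python
import io

import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, suitable for static images
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA


def project_pair(emb_a, emb_b):
    """Fit one PCA on both token sets so they share the same 2D axes."""
    stacked = np.vstack([emb_a, emb_b])
    coords = PCA(n_components=2).fit_transform(stacked)
    return coords[: len(emb_a)], coords[len(emb_a):]


def plot_pair(xy_a, xy_b, label_a, label_b):
    """Scatter both projected token sets on a single static figure."""
    fig, ax = plt.subplots(figsize=(6, 5))
    ax.scatter(xy_a[:, 0], xy_a[:, 1], alpha=0.6, label=label_a)
    ax.scatter(xy_b[:, 0], xy_b[:, 1], alpha=0.6, label=label_b)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.legend()
    return fig


# Random stand-ins for two algorithms' token embeddings (assumption:
# one 768-dimensional row per token; token counts are arbitrary).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(40, 768))
emb_b = rng.normal(size=(55, 768))

xy_a, xy_b = project_pair(emb_a, emb_b)
fig = plot_pair(xy_a, xy_b, "Bubble Sort", "Insertion Sort")

# Render to an in-memory PNG rather than writing a file.
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=150)
plt.close(fig)
png_bytes = buf.getvalue()
```

Fitting PCA jointly matters here: two separate fits would produce unrelated axes, so overlap or divergence between the point clouds would not be comparable.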
## 🧠 Technical Overview
- **Model**: [`microsoft/graphcodebert-base`](https://huggingface.co/microsoft/graphcodebert-base)
- **Tokenizer**: `RobertaTokenizer`
- **Embeddings**: Last hidden layer of GraphCodeBERT
- **Reduction Technique**: Principal Component Analysis (PCA)
- **Interface**: Gradio
- **Language**: Python 3.10+
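
As a rough sketch of how the model, tokenizer, and last-hidden-layer settings above fit together, the snippet below extracts one embedding vector per token. Several parts are assumptions rather than the application's confirmed code: `load_graphcodebert` and `embed_tokens` are hypothetical names, loading the checkpoint through `RobertaModel` is assumed, the 512-token truncation limit is a conventional choice, and only the source tokens are passed in (no data-flow inputs).

```python
import torch


def load_graphcodebert(name="microsoft/graphcodebert-base"):
    """Load tokenizer and model (downloads the checkpoint on first use)."""
    # Imported lazily so the helper below can be used without transformers.
    from transformers import RobertaModel, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained(name)
    model = RobertaModel.from_pretrained(name)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def embed_tokens(code, tokenizer, model, max_length=512):
    """Return (tokens, array of shape [num_tokens, hidden_size])."""
    inputs = tokenizer(
        code, return_tensors="pt", truncation=True, max_length=max_length
    )
    with torch.no_grad():  # no gradients needed for visualization
        outputs = model(**inputs)
    # Last hidden layer, first (and only) sequence in the batch.
    hidden = outputs.last_hidden_state[0]
    ids = inputs["input_ids"][0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(ids)
    return tokens, hidden.cpu().numpy()


# Example usage (requires network access on the first run):
# tokenizer, model = load_graphcodebert()
# tokens, vectors = embed_tokens("def f(xs): return sorted(xs)", tokenizer, model)
```

The returned array has one row per token, including the special start and end tokens, so callers may want to drop those rows before projecting.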
## 🔬 Research Context
This tool supports research on code similarity, clone detection, and representation learning for source code. It offers insight into how GraphCodeBERT encodes common algorithmic patterns, providing a visual companion to embedding-based analysis.
## 🛠 Dependencies
All required libraries are listed in `requirements.txt`:
```
transformers
torch
scikit-learn
numpy
matplotlib
gradio
Pillow
```
## 🖥️ Intended Use
- Academic teaching and demonstration of code embeddings
- Qualitative evaluation of pretrained models for source code
- Supplementary visualization for software engineering publications
## 📬 Contact
**Jorge Martinez-Gil**
Senior Research Scientist in Computer Science