Update README.md
README.md
CHANGED
@@ -19,87 +19,6 @@ Our mission is to combine **linguistics, machine learning, and software engineer
 
 We actively contribute to the **global AI community** through publications, open datasets, benchmarking platforms, and collaborative projects.
 
----
-
-## 🎯 Mission & Vision
-Our goal is to **advance the state of the art** in NLP and AI for low-resource languages by:
-1. **Developing state-of-the-art models** and tools tailored to Turkish and similar languages.
-2. **Creating and maintaining high-quality datasets** and benchmarks to improve transparency and evaluation.
-3. **Fostering collaboration** between academia, industry, and the open-source community.
-4. **Educating the next generation** of NLP researchers in Türkiye and beyond.
-5. **Promoting open science** to accelerate innovation and inclusivity in AI.
-
----
-
-## 🧠 Core Research Areas
-- **🔤 Tokenization Research** – Linguistically-informed hybrid tokenizers for agglutinative languages.
-- **🧠 Morphological Tokenizer** – Rule-based, phonetic-aware tokenization with ENCODE/DECODE logic.
-- **📊 Benchmarking & Evaluation** – Turkish MMLU with 6,200+ questions across 62 domains.
-- **🤖 AI Chat Platforms** – Interactive chat environments for LLM deployment in Turkish.
-- **📈 Machine Learning** – Novel algorithms, including data quality-based adaptive learning rates.
-- **📂 Data Science** – Large-scale dataset creation, preprocessing, and analysis for NLP tasks.
-
----
-
-## 🚀 Featured Projects
-
-### 📚 **Turkish MMLU Benchmark**
-- **6200+ questions**, **62 categories** from Turkish academic and professional exams.
-- Original content — *not translated from other languages*.
-- Available on [Hugging Face](https://huggingface.co/datasets/alibayram/turkish_mmlu) & [Zenodo](https://doi.org/10.5281/zenodo.13375018).
-
-### 🧩 **Hybrid Tokenizer Framework**
-- Morphological + semantic analysis for agglutinative languages.
-- Handles **phonetic transformations** and **shared token IDs** for similar morphemes.
-- Supports ENCODE/DECODE operations with linguistic accuracy.
-
-### 🏥 **Medical LLM Fine-Tuning**
-- Fine-tuned large language models using **167,000+ Turkish doctor–patient dialogues**.
-- Adaptive learning rate techniques based on **data quality scoring**.
-- Specialized for medical documentation, diagnosis support, and patient interaction.
-
-### 🐦 **Turkish BERT**
-- Pre-trained transformer for Turkish NLP.
-- Extensive dataset coverage, open-source release, strong downstream task performance.
-
-### 📊 **Turkish NLP Dataset**
-- High-quality multi-task annotated dataset.
-- Covers **NER**, **sentiment analysis**, **QA**, and **topic classification**.
-
----
-
-## 📑 Selected Publications
-- **Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark** — arXiv 2025.
-- **Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation** — arXiv 2025.
-- **Tokens with Meaning: A Hybrid Tokenization Approach for NLP** — Submitted to *Language Resources and Evaluation* (Springer Nature).
-- **Healthcare-Focused Turkish Medical LLM** — Under review at *ACM TALLIP*.
-- **Morphological Tokenization for Agglutinative Languages** — SIU 2025 Conference.
-
----
-
-## 🧑🤝🧑 Team
-Our interdisciplinary team includes:
-- **Ali Bayram** — PhD Candidate, Morphological Tokenizer & NLP Research.
-- **Ali Arda Fincan** — Undergraduate LLM/NLP Researcher.
-- **Ahmet Semih Gümüş** — NLP & AI Applications.
-- **Sercan Karakaş** — AI Reliability & Interpretability.
-- **Demircan Çelik** — NLP Model Deployment.
-- **Yusuf Özdil** — Data Science & Evaluation.
-- **Umut Ertuğrul Daşgın** — Tokenization Research.
-
-We collaborate with researchers from **Yıldız Technical University**, **Yeditepe University**, **University of Chicago**, **Istanbul Bilgi University**, and others.
-
----
-
-## 🌐 Community & Collaboration
-We believe in **open science** and **community-driven research**:
-- Public issue tracking & Kanban boards.
-- Wiki documentation for tools & datasets.
-- Pull request contributions and open peer review.
-- Hugging Face models, datasets, and Spaces.
-
----
-
 ## 📬 Contact
 🌐 **Website:** [https://magibu.web.app](https://magibu.web.app)
 🤗 **Hugging Face:** [https://huggingface.co/magibu](https://huggingface.co/magibu)