Commit 0651bdf by alibayram · verified · 1 Parent(s): 8c74eb2

Update README.md

Files changed (1)
  1. README.md +0 -81
README.md CHANGED
@@ -19,87 +19,6 @@ Our mission is to combine **linguistics, machine learning, and software engineer
 
  We actively contribute to the **global AI community** through publications, open datasets, benchmarking platforms, and collaborative projects.
 
- ---
-
- ## 🎯 Mission & Vision
- Our goal is to **advance the state of the art** in NLP and AI for low-resource languages by:
- 1. **Developing state-of-the-art models** and tools tailored to Turkish and similar languages.
- 2. **Creating and maintaining high-quality datasets** and benchmarks to improve transparency and evaluation.
- 3. **Fostering collaboration** between academia, industry, and the open-source community.
- 4. **Educating the next generation** of NLP researchers in Türkiye and beyond.
- 5. **Promoting open science** to accelerate innovation and inclusivity in AI.
-
- ---
-
- ## 🧠 Core Research Areas
- - **🔤 Tokenization Research** – Linguistically-informed hybrid tokenizers for agglutinative languages.
- - **🧠 Morphological Tokenizer** – Rule-based, phonetic-aware tokenization with ENCODE/DECODE logic.
- - **📊 Benchmarking & Evaluation** – Turkish MMLU with 6,200+ questions across 62 domains.
- - **🤖 AI Chat Platforms** – Interactive chat environments for LLM deployment in Turkish.
- - **📈 Machine Learning** – Novel algorithms, including data quality-based adaptive learning rates.
- - **📂 Data Science** – Large-scale dataset creation, preprocessing, and analysis for NLP tasks.
-
- ---
-
- ## 🚀 Featured Projects
-
- ### 📚 **Turkish MMLU Benchmark**
- - **6200+ questions**, **62 categories** from Turkish academic and professional exams.
- - Original content — *not translated from other languages*.
- - Available on [Hugging Face](https://huggingface.co/datasets/alibayram/turkish_mmlu) & [Zenodo](https://doi.org/10.5281/zenodo.13375018).
-
- ### 🧩 **Hybrid Tokenizer Framework**
- - Morphological + semantic analysis for agglutinative languages.
- - Handles **phonetic transformations** and **shared token IDs** for similar morphemes.
- - Supports ENCODE/DECODE operations with linguistic accuracy.
-
- ### 🏥 **Medical LLM Fine-Tuning**
- - Fine-tuned large language models using **167,000+ Turkish doctor–patient dialogues**.
- - Adaptive learning rate techniques based on **data quality scoring**.
- - Specialized for medical documentation, diagnosis support, and patient interaction.
-
- ### 🐦 **Turkish BERT**
- - Pre-trained transformer for Turkish NLP.
- - Extensive dataset coverage, open-source release, strong downstream task performance.
-
- ### 📊 **Turkish NLP Dataset**
- - High-quality multi-task annotated dataset.
- - Covers **NER**, **sentiment analysis**, **QA**, and **topic classification**.
-
- ---
-
- ## 📑 Selected Publications
- - **Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark** — arXiv 2025.
- - **Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation** — arXiv 2025.
- - **Tokens with Meaning: A Hybrid Tokenization Approach for NLP** — Submitted to *Language Resources and Evaluation* (Springer Nature).
- - **Healthcare-Focused Turkish Medical LLM** — Under review at *ACM TALLIP*.
- - **Morphological Tokenization for Agglutinative Languages** — SIU 2025 Conference.
-
- ---
-
- ## 🧑‍🤝‍🧑 Team
- Our interdisciplinary team includes:
- - **Ali Bayram** — PhD Candidate, Morphological Tokenizer & NLP Research.
- - **Ali Arda Fincan** — Undergraduate LLM/NLP Researcher.
- - **Ahmet Semih Gümüş** — NLP & AI Applications.
- - **Sercan Karakaş** — AI Reliability & Interpretability.
- - **Demircan Çelik** — NLP Model Deployment.
- - **Yusuf Özdil** — Data Science & Evaluation.
- - **Umut Ertuğrul Daşgın** — Tokenization Research.
-
- We collaborate with researchers from **Yıldız Technical University**, **Yeditepe University**, **University of Chicago**, **Istanbul Bilgi University**, and others.
-
- ---
-
- ## 🌐 Community & Collaboration
- We believe in **open science** and **community-driven research**:
- - Public issue tracking & Kanban boards.
- - Wiki documentation for tools & datasets.
- - Pull request contributions and open peer review.
- - Hugging Face models, datasets, and Spaces.
-
- ---
-
  ## 📬 Contact
  🌐 **Website:** [https://magibu.web.app](https://magibu.web.app)
  🤗 **Hugging Face:** [https://huggingface.co/magibu](https://huggingface.co/magibu)
 