| --- |
| title: README |
| emoji: 💻 |
| colorFrom: gray |
| colorTo: green |
| sdk: static |
| pinned: false |
| pinned_items: |
| - collection: [AhiskaAI/datasets] |
| --- |
| <style> |
| body, html, iframe, .cards { |
| background-color: #0b0f19 !important; |
| } |
| h1, h2, h3, strong { |
| color: #ffffff !important; |
| } |
| p, li, span, div { |
| color: #f3f4f6 !important; |
| } |
| hr { |
| border-top: 1px solid #1f2937 !important; |
| } |
| </style> |
| # AhıskaAI |
| |
| An independent, open-source AI research lab specializing in Small Language Models (SLMs), custom tokenizers, robust OCR architectures, and highly curated niche datasets. Built from the ground up with deep technical curiosity. |
|
|
| --- |
|
|
| ## 🧠 Our Approach: "Fail Forward" & Open Code |
|
|
| We believe true machine learning engineering happens through transparency. Instead of only showing perfected weights, AhıskaAI documents the entire lifecycle of model development. |
|
|
| Our workspace is organized into explicit tracks: |
| * **Base Models:** Architectures trained from scratch using our custom-built BPE tokenizers. |
| * **Fine-Tuned Models:** Production-ready SLMs optimized for specific context-driven tasks (translation, historical synthesis, and niche NLP). |
| * **Curated Datasets:** Cleaned, structured data pipelines (including synthetic optimization and conversational formatting). |
| * **Failed Models:** Our explicit log of failed training runs, gradient explosions, and alignment experiments. We publish our mistakes so the community can learn from them. |
|
|
| --- |
|
|
| ## 🛠️ Tech Stack & Focus |
| * **Architectures:** Custom Transformer SLMs (24M to 125M+ parameters) |
| * **NLP Pipelines:** Custom BPE Tokenization, Synthetically Enhanced Datasets (ShareGPT/Alpaca formats) |
| * **Computer Vision:** High-frequency OCR models trained for localized data extraction and captcha bypasses |
| * **Methodologies:** From-scratch Pre-training, Supervised Fine-Tuning (SFT) |
|
|
| --- |
|
|
| ## 🌍 Identity & Mission |
| Named in honor of the Ahıska Turks, our long-term roadmap focuses on bridging state-of-the-art deep learning with cultural and historical preservation—bringing heritage and accurate documentation into the open-source digital landscape. |
|
|
| *Driven by passion. Powered by local compute.* |