166 MB
3 files
Updated about 1 month ago

Dataset Description

Code-170k-sango is a dataset of 176,999 programming conversations, sourced from glaiveai/glaive-code-assistant-v2 and translated into Sango, with the aim of making coding education accessible to Sango speakers.

🌟 Key Features

  • 176,999 conversations about programming and coding
  • Written entirely in Sango, to widen access to coding education
  • Multi-turn dialogues covering various programming concepts
  • Diverse topics: algorithms, data structures, debugging, best practices, and more
  • Ready for instruction tuning of Large Language Models

🎯 Use Cases

  • Training Sango-language coding assistants
  • Building educational tools for Sango developers
  • Researching multilingual code generation
  • Creating programming tutorials in Sango
  • Supporting low-resource language AI development

Dataset Structure

Data Fields

  • conversations: A list of conversation turns, where each turn contains:
    • from: The speaker ("human" or "gpt")
    • value: The message content in Sango
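
The field layout above can be made concrete with a small structural check. This is a minimal sketch, assuming every record holds a non-empty `conversations` list and every turn carries the two string fields listed; the helper name `is_valid_record` is hypothetical and is not part of the dataset or of the `datasets` library.

```python
from typing import Any

# Speaker labels documented for the "from" field.
ALLOWED_SPEAKERS = {"human", "gpt"}


def is_valid_record(record: Any) -> bool:
    """Return True if `record` matches the documented field layout.

    Assumption (not stated in the dataset card): a record with an
    empty `conversations` list is treated as invalid.
    """
    if not isinstance(record, dict):
        return False
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return False
    for turn in turns:
        if not isinstance(turn, dict):
            return False
        speaker = turn.get("from")
        if not isinstance(speaker, str) or speaker not in ALLOWED_SPEAKERS:
            return False
        if not isinstance(turn.get("value"), str):
            return False
    return True
```

A check like this could be run over a sample of rows before training, to catch turns with unexpected speaker labels or missing text.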

Example

{
  "conversations": [
    {
      "from": "human",
      "value": "[Question in Sango]"
    },
    {
      "from": "gpt",
      "value": "[Answer in Sango]"
    }
  ]
}
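
Many instruction-tuning pipelines expect `role`/`content` messages rather than `from`/`value` turns. The sketch below shows one possible conversion. The mapping of `"human"` to `"user"` and `"gpt"` to `"assistant"` is an assumption about the target format, not something the dataset prescribes, and `to_messages` is a hypothetical helper name.

```python
from typing import Dict, List

# Assumed mapping from this dataset's speaker labels to common chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def to_messages(record: Dict[str, List[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Convert one record's `conversations` list to role/content messages."""
    messages = []
    for turn in record["conversations"]:
        speaker = turn["from"]
        if speaker not in ROLE_MAP:
            # Fail loudly instead of guessing a role for an unknown label.
            raise ValueError(f"unexpected speaker label: {speaker!r}")
        messages.append({"role": ROLE_MAP[speaker], "content": turn["value"]})
    return messages


# Placeholder record mirroring the example above.
sample = {
    "conversations": [
        {"from": "human", "value": "[Question in Sango]"},
        {"from": "gpt", "value": "[Answer in Sango]"},
    ]
}
messages = to_messages(sample)
```

Here `messages` holds two dicts, the first with role `"user"` and the second with role `"assistant"`, each carrying the original text unchanged.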

Usage

Loading the Dataset

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("michsethowusu/Code-170k-sango")

# Access training data
train_data = dataset['train']

# Example: Print first conversation
for turn in train_data[0]['conversations']:
    print(f"{turn['from']}: {turn['value']}")
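
Beyond printing a single conversation, it can help to summarize how many turns and speakers a split contains. The following is a rough sketch that works on any iterable of records shaped like the example above; `summarize` and `toy_records` are illustrative names, and the toy data is made up for the demonstration.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def summarize(records: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Count records, turns, and turns per speaker label."""
    speakers: Counter = Counter()
    n_records = 0
    n_turns = 0
    for record in records:
        n_records += 1
        for turn in record["conversations"]:
            speakers[turn["from"]] += 1
            n_turns += 1
    return {"records": n_records, "turns": n_turns, "speakers": dict(speakers)}


# Made-up records for illustration: one two-turn and one four-turn dialogue.
toy_records = [
    {"conversations": [
        {"from": "human", "value": "q1"},
        {"from": "gpt", "value": "a1"},
    ]},
    {"conversations": [
        {"from": "human", "value": "q2"},
        {"from": "gpt", "value": "a2"},
        {"from": "human", "value": "q3"},
        {"from": "gpt", "value": "a3"},
    ]},
]
stats = summarize(toy_records)
```

With the real data, `train_data` from the loading snippet could be passed in place of `toy_records`, since iterating over a loaded split yields one dict per row.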

Citation

@dataset{code170k_sango,
  title={Code-170k-sango: Programming Conversations in Sango},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/michsethowusu/Code-170k-sango}
}

License

This dataset is released under the Apache 2.0 License.


Thank you for using Code-170k-sango to advance programming education in Sango! 🌍✨

Last updated: May 25