arxiv:2605.08600

100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts

Published on May 9 · Submitted by Rustem Yeshpanov on May 12

Abstract

AI-generated summary: The paper introduces a new multilingual movie review dataset from Kazakhstan, manually annotated for language and sentiment, and evaluates classical and transformer-based models on polarity and score classification.

We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks -- three-way polarity classification and five-class score classification -- and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between adjacent rating levels.
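
The point about severe class imbalance in score classification can be illustrated with a small sketch. The rating distribution below is made up for illustration and is not taken from the paper, and the abstract does not state which metrics the authors report; accuracy and macro-averaged F1 are used here only as an assumed, common pairing. The sketch shows how a classifier that always predicts the most frequent score can look acceptable on accuracy while scoring poorly on macro-F1.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    f1_scores = []
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical, heavily skewed 1-5 score distribution (NOT from the paper).
y_true = [5] * 80 + [4] * 10 + [3] * 5 + [2] * 3 + [1] * 2

# Trivial baseline: always predict the most frequent score.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"majority class: {majority}")
print(f"accuracy: {acc:.3f}")
print(f"macro-F1: {mf1:.3f}")
```

On this toy data the baseline gets most reviews "right" by accuracy, yet four of the five classes have an F1 of zero, which pulls the macro average down sharply. This is one reason a five-class score task can stay hard even when a model's headline accuracy looks reasonable.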

Community

Accepted to NLP4DH 2026


