hç committed on
Upload 6 files
- README.md +67 -20
- app.py +33 -0
- project_description.txt +69 -0
- requirements.txt +5 -3
- ridge_model.pkl +3 -0
- tfidf_vectorizer.pkl +3 -0
README.md
CHANGED
@@ -1,20 +1,67 @@
# 📝 Feedback Prize - English Language Learning (Simplified Version)

This project offers a simplified solution to the Kaggle competition "Feedback Prize - English Language Learning". Six language skills are predicted from student essays:

- Cohesion
- Syntax
- Vocabulary
- Phraseology
- Grammar
- Conventions

---

## 📁 Dataset Used

- `train.csv`: Student essays and their scores
- `test.csv`: Essays to predict scores for
- `sample_submission.csv`: Sample output format

The data can be downloaded from the [Kaggle competition page](https://www.kaggle.com/competitions/feedback-prize-english-language-learning/data).

---

## 🔧 Methods Used

- **TF-IDF** for text vectorization
- **Ridge Regression** for multi-score prediction
- `MultiOutputRegressor` to learn all 6 targets at once
- A simple and effective approach (validation RMSE ≈ 0.56)

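
The methods listed above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up corpus, not the project's actual training script: the toy essays, their scores, and the `alpha=1.0` setting are all assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

TARGETS = ["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]

# Toy stand-in for train.csv (hypothetical essays and scores).
texts = [
    "I like to read books because they teach me new words.",
    "School is good but homework are too much for student.",
    "My family went to the beach and we swam all day.",
    "The teacher explain the lesson and we was listening.",
]
scores = np.array([
    [3.5, 3.5, 4.0, 3.5, 4.0, 3.5],
    [2.5, 2.5, 3.0, 2.5, 2.0, 2.5],
    [3.0, 3.5, 3.0, 3.0, 3.5, 3.0],
    [2.5, 2.0, 3.0, 2.5, 2.0, 3.0],
])

# Turn each essay into a sparse TF-IDF vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# MultiOutputRegressor fits one independent Ridge model per target column.
model = MultiOutputRegressor(Ridge(alpha=1.0))
model.fit(X, scores)

# Predict all six scores for a new essay.
new_vec = vectorizer.transform(["I enjoy learning English at school."])
pred = model.predict(new_vec)[0]
for name, value in zip(TARGETS, pred):
    print(f"{name}: {value:.2f}")
```

Because each target gets its own Ridge model, the six scores are learned independently; nothing in this setup shares information between skills beyond the common TF-IDF features.
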
---

## 💻 Streamlit App

```bash
streamlit run app.py
```

## 📦 Installation

```bash
pip install -r requirements.txt
```

## 🧠 Model and Vectorizer

- `ridge_model.pkl`: Trained regression model
- `tfidf_vectorizer.pkl`: TF-IDF vectorizer that produces the word representations

## 📤 Kaggle Submission

The model makes predictions on `test.csv` and produces a `submission.csv` file. This file can be uploaded directly to Kaggle.

## 📌 Possible Improvements

- Stronger NLP models (BERT, DeBERTa)
- Ensemble approaches
- Tokenizer-based embeddings
- LSTM/Transformer-based deep models

## 🧑‍🎓 Purpose

This project was developed to understand a simplified solution to a real competition, to learn the NLP modeling process, and to build a working prototype.

## 🏷️ License

MIT License
app.py
ADDED
@@ -0,0 +1,33 @@
# app.py
import streamlit as st
import pandas as pd
import numpy as np
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

# Title
st.title("📝 English Essay Skill Predictor")
st.markdown("Enter your essay and we will predict 6 language scores (cohesion, syntax, etc.)")

# Get the essay text from the user
user_text = st.text_area("✍️ Write your essay here", height=250)

# Load the pre-trained model and the fitted TF-IDF vectorizer
model = joblib.load("ridge_model.pkl")
tfidf = joblib.load("tfidf_vectorizer.pkl")

# Predict button
if st.button("📊 Predict"):
    if user_text.strip() == "":
        st.warning("Please enter some text.")
    else:
        # Vectorize the essay and predict all six scores
        text_vec = tfidf.transform([user_text])
        preds = model.predict(text_vec)[0]

        # Show the results
        labels = ['Cohesion', 'Syntax', 'Vocabulary', 'Phraseology', 'Grammar', 'Conventions']
        for label, score in zip(labels, preds):
            st.write(f"**{label}**: {round(score, 2)} / 5")
project_description.txt
ADDED
@@ -0,0 +1,69 @@
PROJECT TITLE: Feedback Prize - English Language Learning (Simplified Kaggle Project)

ASSIGNMENT OBJECTIVE: Choose a Kaggle competition, process the data, build a machine learning model, and visualize the results.

SELECTED COMPETITION:
Kaggle Challenge: https://www.kaggle.com/competitions/feedback-prize-english-language-learning

The goal of this competition is to predict 6 language proficiency scores from student-written essays:
1. Cohesion
2. Syntax
3. Vocabulary
4. Phraseology
5. Grammar
6. Conventions

---

STEPS COMPLETED:

1. DATA LOADING AND EXPLORATION
- Loaded the `train.csv` file.
- Inspected the content: student essays (`full_text`) and 6 target scores.
- Explored text length distributions and score histograms.

2. TEXT PROCESSING (Vectorization)
- Used `TfidfVectorizer` from scikit-learn to convert essays into numerical format.
- Removed English stopwords and limited features to 10,000.

3. MODEL TRAINING
- Chose Ridge Regression (with L2 regularization).
- Used `MultiOutputRegressor` to predict all 6 scores simultaneously.
- Split data into training and validation sets (80% / 20%).
- Achieved a validation RMSE: **0.5632**

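
For reference, the competition scores submissions with the mean columnwise RMSE (MCRMSE). Whether the 0.5632 reported above is that metric or a single RMSE pooled over all values is not stated, so the sketch below shows both on made-up numbers; the function names are illustrative.

```python
import numpy as np

def rmse_overall(y_true, y_pred):
    """RMSE pooled over every value in the score matrix."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mcrmse(y_true, y_pred):
    """Mean of the per-column (per-skill) RMSE values."""
    per_column = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return float(per_column.mean())

# Hypothetical validation targets and predictions (2 essays x 6 skills).
y_true = np.array([[3.0, 3.5, 3.0, 3.0, 4.0, 3.0],
                   [2.5, 2.5, 3.0, 2.0, 2.0, 2.5]])
y_pred = np.array([[3.5, 3.5, 3.0, 3.0, 3.0, 3.0],
                   [2.5, 2.5, 3.0, 2.0, 2.0, 2.5]])

print(rmse_overall(y_true, y_pred))
print(mcrmse(y_true, y_pred))
```

The two numbers generally differ, because the pooled version squares and averages all errors together while MCRMSE averages the six per-skill roots.
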
4. TEST PREDICTIONS AND SUBMISSION
- Applied TF-IDF on `test.csv` and made predictions.
- Created a `submission.csv` file matching the `sample_submission.csv` format for Kaggle.

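
A submission file of this kind could be assembled as in the sketch below. The column names (`text_id` followed by the six lowercase skill names) are assumed to match `sample_submission.csv`, and the IDs and predictions are placeholders.

```python
import io

import numpy as np
import pandas as pd

TARGETS = ["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]

# Placeholder IDs and predictions; in the project these would come from
# test.csv and from the model's predictions on its TF-IDF vectors.
text_ids = ["id_0001", "id_0002"]
preds = np.array([[3.1, 3.0, 3.2, 3.1, 3.0, 3.1],
                  [2.8, 2.7, 3.0, 2.9, 2.6, 2.8]])

# One row per essay: the ID first, then the six predicted scores.
submission = pd.DataFrame(preds, columns=TARGETS)
submission.insert(0, "text_id", text_ids)

# Write CSV text without the index column; an in-memory buffer
# stands in for submission.csv in this sketch.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing a file path such as `"submission.csv"` to `to_csv` instead of the buffer would write the file to disk.
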
5. STREAMLIT USER INTERFACE
- Built `app.py` with Streamlit to accept custom essays and predict scores.
- Used saved model: `ridge_model.pkl`
- Used saved TF-IDF vectorizer: `tfidf_vectorizer.pkl`

6. INCLUDED FILES
- `requirements.txt` includes all Python dependencies.
- Saved trained model and vectorizer as `.pkl` files.
- Project folder is ready for GitHub or ZIP submission.

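
Saving and reloading the fitted objects might look like the following sketch. The toy data, the two-target setup, and the temporary directory are assumptions made for brevity; only the file names mirror the ones listed above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

# Tiny stand-in corpus with two targets instead of six.
texts = ["a short essay", "another short essay", "a much longer essay about school"]
y = np.array([[3.0, 3.5], [2.5, 3.0], [4.0, 4.0]])

tfidf = TfidfVectorizer().fit(texts)
model = MultiOutputRegressor(Ridge()).fit(tfidf.transform(texts), y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "ridge_model.pkl")
    tfidf_path = os.path.join(tmp, "tfidf_vectorizer.pkl")

    # Persist both objects, then load them back.
    joblib.dump(model, model_path)
    joblib.dump(tfidf, tfidf_path)
    loaded_model = joblib.load(model_path)
    loaded_tfidf = joblib.load(tfidf_path)

# The reloaded pair should reproduce the original predictions.
same = np.allclose(
    model.predict(tfidf.transform(texts)),
    loaded_model.predict(loaded_tfidf.transform(texts)),
)
print(same)
```

The vectorizer has to be saved alongside the model because the model's coefficients are tied to the vocabulary and column order that this particular vectorizer learned.
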
---

LIBRARIES USED:
- pandas
- numpy
- scikit-learn
- joblib
- streamlit

---

PROJECT SUMMARY:
In this project:
- A real Kaggle NLP competition was selected.
- All key ML stages were covered: data cleaning, feature extraction, modeling, prediction, evaluation, and web UI.
- The project serves as both a practical learning experience and a simplified working prototype for multi-target regression in natural language processing.

---

COMPLETED BY: [Enter Your Name]
SUBMISSION DATE: [Enter Date]
requirements.txt
CHANGED
@@ -1,3 +1,5 @@
-pandas
+streamlit
+pandas
+scikit-learn
+joblib
+numpy
ridge_model.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:68c068bc0a684d581f4c350662cca089f2b7126a79c2ede0412fe075778b6743
size 481432
tfidf_vectorizer.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fa30b4afda71944c53a5f76c65fea2c987763fbf43bb796f22ba328e5a5dce07
size 371125