Commit 4db119a · verified · 1 Parent(s): 722e410

Upload 6 files

Files changed (6)
  1. README.md +67 -20
  2. app.py +33 -0
  3. project_description.txt +69 -0
  4. requirements.txt +5 -3
  5. ridge_model.pkl +3 -0
  6. tfidf_vectorizer.pkl +3 -0
README.md CHANGED
@@ -1,20 +1,67 @@
- ---
- title: Feedback Ell Streamlit
- emoji: 🚀
- colorFrom: red
- colorTo: red
- sdk: docker
- app_port: 8501
- tags:
- - streamlit
- pinned: false
- short_description: Streamlit template space
- license: mit
- ---
-
- # Welcome to Streamlit!
-
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).
+ # 📝 Feedback Prize - English Language Learning (Simplified Version)
+
+ This project offers a simplified solution to the Kaggle competition "Feedback Prize - English Language Learning". It predicts six language skills from student essays:
+
+ - Cohesion
+ - Syntax
+ - Vocabulary
+ - Phraseology
+ - Grammar
+ - Conventions
+
+ ---
+
+ ## 📁 Dataset
+
+ - `train.csv`: Student essays and scores
+ - `test.csv`: Essays to predict scores for
+ - `sample_submission.csv`: Sample output format
+
+ The data can be downloaded from the [Kaggle competition page](https://www.kaggle.com/competitions/feedback-prize-english-language-learning/data).
+
+ ---
+
+ ## 🔧 Methods Used
+
+ - **TF-IDF** for text vectorization
+ - **Ridge Regression** for predicting multiple scores
+ - `MultiOutputRegressor` to learn all 6 targets at once
+ - A simple and effective approach (RMSE ≈ 0.56)
+
+ ---
+
+ ## 💻 Streamlit App
+
+ ```bash
+ streamlit run app.py
+ ```
+
+ ## 📦 Installation
+
+ `pip install -r requirements.txt`
+
+ ## 🧠 Model and Vectorizer
+
+ - `ridge_model.pkl`: Trained regression model
+ - `tfidf_vectorizer.pkl`: TF-IDF word representations
+
+ ## 📤 Kaggle Submission
+
+ The model makes predictions on `test.csv` and produces a `submission.csv` file. This file can be uploaded directly to Kaggle.
+
+ ## 📌 Possible Improvements
+
+ - Stronger NLP models (BERT, DeBERTa)
+ - Ensemble approaches
+ - Tokenizer-based embeddings
+ - LSTM/Transformer-based deep models
+
+ ## 🧑‍🎓 Purpose
+
+ This project was developed to understand a simplified solution to a real competition, to learn the NLP modeling process, and to build a working prototype.
+
+ ## 🏷️ License
+
+ MIT License
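Note on the RMSE ≈ 0.56 figure in the README: the README does not say how the error is aggregated over the six targets. The competition's official metric is the mean columnwise RMSE (MCRMSE), so the sketch below shows that metric in plain Python, assuming the reported number is meant to be comparable to it. The function name `mcrmse` and the toy numbers are illustrative and are not part of this commit.

```python
import math

def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE: compute RMSE per target column, then average."""
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    per_column = []
    for j in range(n_cols):
        squared_error = sum((y_true[i][j] - y_pred[i][j]) ** 2 for i in range(n_rows))
        per_column.append(math.sqrt(squared_error / n_rows))
    return sum(per_column) / n_cols

# Toy example with two essays and two targets (made-up numbers).
y_true = [[3.0, 4.0], [2.0, 3.0]]
y_pred = [[3.0, 3.0], [2.0, 3.0]]
score = mcrmse(y_true, y_pred)
```

A single pooled RMSE over all six columns and the columnwise mean generally differ, so it helps to state which one a reported score refers to.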
app.py ADDED
@@ -0,0 +1,33 @@
+ # app.py
+ import streamlit as st
+ import pandas as pd
+ import numpy as np
+ import joblib
+ from sklearn.feature_extraction.text import TfidfVectorizer
+ from sklearn.linear_model import Ridge
+ from sklearn.multioutput import MultiOutputRegressor
+
+ # Title
+ st.title("📝 English Essay Skill Predictor")
+ st.markdown("Enter your essay and we will predict 6 language scores (cohesion, syntax, etc.)")
+
+ # Get text from the user
+ user_text = st.text_area("✍️ Write your essay here", height=250)
+
+ # Load the pre-trained model and TF-IDF vectorizer
+ model = joblib.load("ridge_model.pkl")
+ tfidf = joblib.load("tfidf_vectorizer.pkl")
+
+ # Predict button
+ if st.button("📊 Predict"):
+     if user_text.strip() == "":
+         st.warning("Please enter an essay.")
+     else:
+         # Vectorize the essay with the fitted TF-IDF vocabulary
+         text_vec = tfidf.transform([user_text])
+         preds = model.predict(text_vec)[0]
+
+         # Show the results
+         labels = ['Cohesion', 'Syntax', 'Vocabulary', 'Phraseology', 'Grammar', 'Conventions']
+         for label, score in zip(labels, preds):
+             st.write(f"**{label}**: {round(score, 2)} / 5")
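Review note on `app.py`: Ridge regression output is unbounded, so a prediction can fall outside the 1.0 to 5.0 range that the competition scores use, while the UI prints every value as "/ 5". A hedged sketch of a small helper that clamps values before display follows. `clip_scores` and its default bounds are assumptions for illustration, not code from this commit.

```python
def clip_scores(preds, low=1.0, high=5.0):
    """Clamp each predicted score into [low, high] and round to 2 decimals."""
    return [round(min(max(float(p), low), high), 2) for p in preds]

labels = ['Cohesion', 'Syntax', 'Vocabulary', 'Phraseology', 'Grammar', 'Conventions']
raw = [0.4, 2.5, 3.0, 5.3, 4.75, 1.0]  # made-up raw model outputs
display = dict(zip(labels, clip_scores(raw)))
```

In the app, the display loop could then iterate over `zip(labels, clip_scores(preds))` instead of the raw predictions.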
project_description.txt ADDED
@@ -0,0 +1,69 @@
+ PROJECT TITLE: Feedback Prize - English Language Learning (Simplified Kaggle Project)
+
+ ASSIGNMENT OBJECTIVE: Choose a Kaggle competition, process the data, build a machine learning model, and visualize the results.
+
+ SELECTED COMPETITION:
+ Kaggle Challenge: https://www.kaggle.com/competitions/feedback-prize-english-language-learning
+
+ The goal of this competition is to predict 6 language proficiency scores from student-written essays:
+ 1. Cohesion
+ 2. Syntax
+ 3. Vocabulary
+ 4. Phraseology
+ 5. Grammar
+ 6. Conventions
+
+ ---
+
+ STEPS COMPLETED:
+
+ 1. DATA LOADING AND EXPLORATION
+ - Loaded the `train.csv` file.
+ - Inspected the content: student essays (`full_text`) and 6 target scores.
+ - Explored text length distributions and score histograms.
+
+ 2. TEXT PROCESSING (Vectorization)
+ - Used `TfidfVectorizer` from scikit-learn to convert essays into numerical format.
+ - Removed English stopwords and limited features to 10,000.
+
+ 3. MODEL TRAINING
+ - Chose Ridge Regression (with L2 regularization).
+ - Used `MultiOutputRegressor` to predict all 6 scores simultaneously.
+ - Split data into training and validation sets (80% / 20%).
+ - Achieved a validation RMSE: **0.5632**
+
+ 4. TEST PREDICTIONS AND SUBMISSION
+ - Applied TF-IDF on `test.csv` and made predictions.
+ - Created a `submission.csv` file matching the `sample_submission.csv` format for Kaggle.
+
+ 5. STREAMLIT USER INTERFACE
+ - Built `app.py` with Streamlit to accept custom essays and predict scores.
+ - Used saved model: `ridge_model.pkl`
+ - Used saved TF-IDF vectorizer: `tfidf_vectorizer.pkl`
+
+ 6. INCLUDED FILES
+ - `requirements.txt` includes all Python dependencies.
+ - Saved trained model and vectorizer as `.pkl` files.
+ - Project folder is ready for GitHub or ZIP submission.
+
+ ---
+
+ LIBRARIES USED:
+ - pandas
+ - numpy
+ - scikit-learn
+ - joblib
+ - streamlit
+
+ ---
+
+ PROJECT SUMMARY:
+ In this project:
+ - A real Kaggle NLP competition was selected.
+ - All key ML stages were covered: data cleaning, feature extraction, modeling, prediction, evaluation, and web UI.
+ - The project serves as both a practical learning experience and a simplified working prototype for multi-target regression in natural language processing.
+
+ ---
+
+ COMPLETED BY: [Enter Your Name]
+ SUBMISSION DATE: [Enter Date]
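Review note on `project_description.txt`: the training script is not part of this commit, so steps 2 and 3 of the description can only be sketched. The version below is a minimal reconstruction under stated assumptions: it uses a tiny made-up corpus in place of `train.csv`, default `Ridge` hyperparameters, and a fixed `random_state`. None of these are confirmed by the source, and the resulting RMSE says nothing about the reported 0.5632.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

TARGETS = ["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]

# Stand-in data (assumption): the real project reads `full_text` and six scores from train.csv.
rng = np.random.default_rng(0)
texts = [f"student essay number {i} about school and learning topic {i % 5}" for i in range(50)]
scores = rng.uniform(1.0, 5.0, size=(50, len(TARGETS)))

# 80% / 20% train/validation split, as described in step 3.
X_train, X_val, y_train, y_val = train_test_split(
    texts, scores, test_size=0.2, random_state=42
)

# Step 2: TF-IDF with English stop words removed and at most 10,000 features.
tfidf = TfidfVectorizer(stop_words="english", max_features=10_000)
X_train_vec = tfidf.fit_transform(X_train)  # fit on training essays only
X_val_vec = tfidf.transform(X_val)          # reuse the fitted vocabulary

# Step 3: one Ridge regressor per target, wrapped in MultiOutputRegressor.
model = MultiOutputRegressor(Ridge())
model.fit(X_train_vec, y_train)

# Pooled RMSE over all validation rows and all six targets.
val_pred = model.predict(X_val_vec)
rmse = float(np.sqrt(np.mean((y_val - val_pred) ** 2)))
```

Under the same assumptions, the two artifacts in this commit would be written with `joblib.dump(model, "ridge_model.pkl")` and `joblib.dump(tfidf, "tfidf_vectorizer.pkl")`.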
requirements.txt CHANGED
@@ -1,3 +1,5 @@
- altair
- pandas
- streamlit
+ streamlit
+ pandas
+ scikit-learn
+ joblib
+ numpy
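Review note on `requirements.txt`: the new requirements are unpinned. Pickled scikit-learn objects such as `ridge_model.pkl` are not guaranteed to load under a different scikit-learn version than the one that wrote them, so it may be worth recording the versions used at training time. A small sketch using only the standard library follows. `installed_versions` is a hypothetical helper, not part of this commit.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIREMENTS = ["streamlit", "pandas", "scikit-learn", "joblib", "numpy"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(REQUIREMENTS)
# Lines suitable for a pinned requirements.txt, skipping anything not installed.
pinned = [f"{name}=={ver}" for name, ver in versions.items() if ver is not None]
```

Running this in the training environment and committing the `pinned` lines would tie the Space to the versions that produced the `.pkl` files.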
ridge_model.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:68c068bc0a684d581f4c350662cca089f2b7126a79c2ede0412fe075778b6743
+ size 481432
tfidf_vectorizer.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fa30b4afda71944c53a5f76c65fea2c987763fbf43bb796f22ba328e5a5dce07
+ size 371125