Upload 16 files
Browse files
- .gitattributes +1 -0
- README.md +73 -3
- UPLOAD_GUIDE.md +100 -0
- classifier.pkl +3 -0
- config.json +165 -0
- inference.py +157 -0
- label_embeddings.pkl +3 -0
- label_mapping.json +68 -0
- model.safetensors +3 -0
- model_card.md +62 -0
- preprocessor.py +127 -0
- requirements.txt +8 -0
- special_tokens_map.json +15 -0
- tokenizer.json +3 -0
- tokenizer_config.json +54 -0
- usage.md +86 -0
- μΉμ¬μ΄νΈ_μ λ‘λ_κ°μ΄λ.md +76 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -1,3 +1,73 @@
# Steel Industry Material Classification Model

This model is trained to classify steel industry materials and products based on text descriptions. It uses XLM-RoBERTa as the base model and can classify input text into 66 different steel-related categories.

## Model Details

- **Base Model**: XLM-RoBERTa
- **Task**: Sequence Classification
- **Number of Labels**: 66
- **Languages**: Korean, English (multilingual support)
- **Model Size**: ~1GB

## Supported Labels

The model can classify the following steel industry materials:

- Raw Materials: μ² κ΄μ, μνμ, μμ μ½ν¬μ€, 무μ°ν, κ°ν, μμμ²ν, νΌνΈ (Peat), μ€μΌ μ°μΌ
- Fuels: μ²μ°κ°μ€, μ‘νμ²μ°κ°μ€, κ²½μ , νλ°μ , λ±μ , λνν, ννΈλ‘€ λ° SBP, μλ₯ μ°λ£μ
- Gases: μΌμ°ννμ, λ©ν, μν, κ³ λ‘κ°μ€, μ½ν¬μ€ μ€λΈ κ°μ€, μ°μ μ κ°λ‘ κ°μ€, μμ±κ°μ€, κ°μ€κ³΅μ₯ κ°μ€
- Products: κ°μ² , μ μ² , μ² , μ΄κ°μ±νμ² (HBI), κ³ μ¨ μ±ν νμμ² , μ§μ νμμ²
- By-products: κ³ λ‘ μ¬λκ·Έ, μμ° μ€μΌμΌ, λΆμ§, μ¬λ¬μ§, μ μμΉ©
- Others: μ κΈ°, λκ°μ, μ€νμ , ν¬μ₯μ¬, μ΄μ μ, μ€λ¦¬λ©μ , ν λ

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "your-username/steel-material-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare input
text = "μ² κ΄μμ κ³ λ‘μμ νμνμ¬ μ μ² μ μ μ‘°νλ κ³Όμ "
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()

# Get label
label = model.config.id2label[predicted_class]
confidence = predictions[0][predicted_class].item()

print(f"Predicted: {label}")
print(f"Confidence: {confidence:.4f}")
```

## Training Data

The model was trained on steel industry material descriptions and technical documents, focusing on Korean and English text related to steel manufacturing processes.

## Performance

- **Label Independence**: Good (average similarity: 0.1166)
- **Orthogonality**: Good (average dot product: 0.2043)
- **Overall Assessment**: The model shows good separation between different material categories

## License

[Add your license information here]

## Citation

If you use this model in your research, please cite:

```bibtex
[Add citation information here]
```
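The confidence value printed in the snippet above is simply the softmax probability of the top logit. A minimal sketch of that computation in plain Python, using made-up logits rather than real model output:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-label slice of the classifier's output
logits = [2.0, 0.5, -1.0, 0.1]
probs = softmax(logits)

# argmax over probabilities gives the predicted class id
predicted_class = max(range(len(probs)), key=lambda i: probs[i])
confidence = probs[predicted_class]
print(predicted_class, round(confidence, 4))
```

This mirrors what `torch.nn.functional.softmax` followed by `torch.argmax` does in the README example, just without the torch dependency.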
UPLOAD_GUIDE.md
ADDED
@@ -0,0 +1,100 @@
# Steel Material Classification Model Upload Guide

## Step 1: Get Hugging Face Token

1. Go to https://huggingface.co/settings/tokens
2. Click "New token"
3. Give it a name (e.g., "model-upload-token")
4. Select "Write" role
5. Copy the token

## Step 2: Login to Hugging Face

```bash
huggingface-cli login
# Enter your token when prompted
```

## Step 3: Create Model Repository

```bash
huggingface-cli repo create steel-material-classifier --type model
```

## Step 4: Upload Model

```bash
# Clone the repository
git clone https://huggingface.co/YOUR_USERNAME/steel-material-classifier
cd steel-material-classifier

# Copy all files from the model_v24 directory
# Then commit and push
git add .
git commit -m "Initial commit: Steel material classification model"
git push
```

## Alternative: Direct Upload

```bash
# From the model_v24 directory
huggingface-cli upload YOUR_USERNAME/steel-material-classifier . --include "*.json,*.safetensors,*.pkl,*.md,*.txt,*.py"
```

## Files to Upload

### Required Files:
- ✅ config.json
- ✅ model.safetensors
- ✅ tokenizer.json
- ✅ tokenizer_config.json
- ✅ special_tokens_map.json
- ✅ label_mapping.json

### Optional Files:
- ✅ classifier.pkl
- ✅ label_embeddings.pkl
- ✅ label_embeddings.pkl.backup

### Documentation Files:
- ✅ README.md
- ✅ requirements.txt
- ✅ inference.py
- ✅ preprocessor.py
- ✅ model_card.md
- ✅ usage.md

## Model Information

- **Model Name**: steel-material-classifier
- **Base Model**: XLM-RoBERTa
- **Task**: Sequence Classification
- **Labels**: 66 steel industry materials
- **Languages**: Korean, English
- **Model Size**: ~1GB

## Usage After Upload

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model
model_name = "YOUR_USERNAME/steel-material-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Predict
text = "μ² κ΄μμ κ³ λ‘μμ νμνμ¬ μ μ² μ μ μ‘°νλ κ³Όμ "
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()

label = model.config.id2label[predicted_class]
confidence = predictions[0][predicted_class].item()
print(f"Predicted: {label} (Confidence: {confidence:.4f})")
```
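The `--include` argument in the direct-upload command is a comma-separated list of glob patterns. A small standard-library sketch of which files those patterns would pick up from the list above; `huggingface-cli`'s own pattern matching may differ in details, so treat this only as an approximation of the filtering:

```python
import fnmatch

# The patterns from the --include argument above
include = "*.json,*.safetensors,*.pkl,*.md,*.txt,*.py"
patterns = include.split(",")

files = [
    "config.json", "model.safetensors", "classifier.pkl",
    "README.md", "requirements.txt", "inference.py",
    "tokenizer.json", "label_embeddings.pkl.backup",
]

# Keep a file if any pattern matches its name
uploaded = [f for f in files if any(fnmatch.fnmatch(f, p) for p in patterns)]
print(uploaded)
```

Note that `label_embeddings.pkl.backup` does not match `*.pkl`, so the optional backup file would need its own pattern to be included.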
classifier.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cf2ac4313a1006caa5b470331fcddcf7dd2d368e5822b1c4df3d3926929c8a5e
size 204311
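Because `classifier.pkl` is tracked by Git LFS, what the repository actually stores is the small text pointer above: a version line, a sha256 object id, and the size in bytes. A minimal sketch of parsing such a pointer (the helper name is made up for illustration):

```python
def parse_lfs_pointer(text):
    # Each line of a Git LFS pointer file is "key value"
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "version": fields["version"],
        "oid": fields["oid"].removeprefix("sha256:"),
        "size": int(fields["size"]),
    }

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:cf2ac4313a1006caa5b470331fcddcf7dd2d368e5822b1c4df3d3926929c8a5e
size 204311
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # 204311
```

The `size` field here (204 KB) is why the diff shows only `+3` lines for a binary artifact.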
config.json
ADDED
@@ -0,0 +1,165 @@
{
  "_name_or_path": "xlm-roberta-base",
  "architectures": [
    "XLMRobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": 0.1,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 66,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.35.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002,
  "id2label": {
    "0": "μ κ²°ν",
    "1": "μ°νλ§κ·Έλ€μ",
    "2": "μ€λΈ μ½ν¬μ€",
    "3": "μ½νλ₯΄",
    "4": "μ§μ νμμ² ",
    "5": "μΌμ°ννμ",
    "6": "μ²μ°κ°μ€",
    "7": "κ°ν",
    "8": "ννΈλ‘€ λ° SBP",
    "9": "μμ²",
    "10": "λκ°μ",
    "11": "κ°μ² ",
    "12": "μνμ",
    "13": "μ°μνκΈ°λ¬Ό",
    "14": "λ©ν",
    "15": "κ³ λ‘ μ¬λκ·Έ",
    "16": "μ² μ€ν¬λ©",
    "17": "λΆμ§",
    "18": "μ€νμ ",
    "19": "μ‘νμμ κ°μ€",
    "20": "κ°μ² μ€ν¬λ©",
    "21": "νμ°λ¦¬ν¬",
    "22": "κ²½μ ",
    "23": "μλ₯ μ°λ£μ ",
    "24": "μ κΈ°",
    "25": "무μ°ν",
    "26": "μ€μΌ μ°μΌ",
    "27": "μ² κ΄μ",
    "28": "νμ°μμλνΈλ₯¨",
    "29": "νμ°λ°λ₯¨",
    "30": "ν¬μ₯μ¬",
    "31": "μ‘ν μ²μ°κ°μ€",
    "32": "μ¬λ¬μ§",
    "33": "μλ€ν",
    "34": "μ°νλ°λ₯¨",
    "35": "κ°μ€κ³΅μ₯ κ°μ€",
    "36": "νμ ",
    "37": "EAF νμ μ κ·Ή",
    "38": "μμ° μ€μΌμΌ",
    "39": "μ½ν¬μ€ μ€λΈ κ°μ€",
    "40": "EAF μΆ©μ νμ",
    "41": "κ³ λ‘κ°μ€",
    "42": "μ΄κ°μ±νμ² (HBI)",
    "43": "νΌνΈ (Peat)",
    "44": "μ μ² ",
    "45": "μμ ",
    "46": "μ°μ μ κ°λ‘ κ°μ€",
    "47": "μ΄μ μ",
    "48": "μ μμΉ©",
    "49": "μμμ²ν",
    "50": "λ§κ·Έλ€μ¬μ΄νΈ",
    "51": "μμ μ½ν¬μ€",
    "52": "ν λ ",
    "53": "μ€λ¦¬λ©μ ",
    "54": "μ‘ν μμ κ°μ€",
    "55": "λ±μ ",
    "56": "μμ±κ°μ€",
    "57": "μν",
    "58": "μ°νμΉΌμ",
    "59": "λνν",
    "60": "μ² ",
    "61": "λ₯μ² κ΄",
    "62": "μκ²°κ΄",
    "63": "κ³ μ¨ μ±ν νμμ² ",
    "64": "νλ°μ ",
    "65": "νμ°μ€νΈλ‘ ν¬"
  },
  "label2id": {
    "μ κ²°ν": 0,
    "μ°νλ§κ·Έλ€μ": 1,
    "μ€λΈ μ½ν¬μ€": 2,
    "μ½νλ₯΄": 3,
    "μ§μ νμμ² ": 4,
    "μΌμ°ννμ": 5,
    "μ²μ°κ°μ€": 6,
    "κ°ν": 7,
    "ννΈλ‘€ λ° SBP": 8,
    "μμ²": 9,
    "λκ°μ": 10,
    "κ°μ² ": 11,
    "μνμ": 12,
    "μ°μνκΈ°λ¬Ό": 13,
    "λ©ν": 14,
    "κ³ λ‘ μ¬λκ·Έ": 15,
    "μ² μ€ν¬λ©": 16,
    "λΆμ§": 17,
    "μ€νμ ": 18,
    "μ‘νμμ κ°μ€": 19,
    "κ°μ² μ€ν¬λ©": 20,
    "νμ°λ¦¬ν¬": 21,
    "κ²½μ ": 22,
    "μλ₯ μ°λ£μ ": 23,
    "μ κΈ°": 24,
    "무μ°ν": 25,
    "μ€μΌ μ°μΌ": 26,
    "μ² κ΄μ": 27,
    "νμ°μμλνΈλ₯¨": 28,
    "νμ°λ°λ₯¨": 29,
    "ν¬μ₯μ¬": 30,
    "μ‘ν μ²μ°κ°μ€": 31,
    "μ¬λ¬μ§": 32,
    "μλ€ν": 33,
    "μ°νλ°λ₯¨": 34,
    "κ°μ€κ³΅μ₯ κ°μ€": 35,
    "νμ ": 36,
    "EAF νμ μ κ·Ή": 37,
    "μμ° μ€μΌμΌ": 38,
    "μ½ν¬μ€ μ€λΈ κ°μ€": 39,
    "EAF μΆ©μ νμ": 40,
    "κ³ λ‘κ°μ€": 41,
    "μ΄κ°μ±νμ² (HBI)": 42,
    "νΌνΈ (Peat)": 43,
    "μ μ² ": 44,
    "μμ ": 45,
    "μ°μ μ κ°λ‘ κ°μ€": 46,
    "μ΄μ μ": 47,
    "μ μμΉ©": 48,
    "μμμ²ν": 49,
    "λ§κ·Έλ€μ¬μ΄νΈ": 50,
    "μμ μ½ν¬μ€": 51,
    "ν λ ": 52,
    "μ€λ¦¬λ©μ ": 53,
    "μ‘ν μμ κ°μ€": 54,
    "λ±μ ": 55,
    "μμ±κ°μ€": 56,
    "μν": 57,
    "μ°νμΉΌμ": 58,
    "λνν": 59,
    "μ² ": 60,
    "λ₯μ² κ΄": 61,
    "μκ²°κ΄": 62,
    "κ³ μ¨ μ±ν νμμ² ": 63,
    "νλ°μ ": 64,
    "νμ°μ€νΈλ‘ ν¬": 65
  }
}
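`id2label` and `label2id` must stay mutual inverses, or predictions will map to the wrong names. A quick consistency check over a toy stand-in mapping (the real config has 66 entries; note that JSON stores the `id2label` keys as strings, so they need casting before comparison):

```python
# Toy stand-ins for the two mappings in config.json
id2label = {"0": "iron ore", "1": "natural gas", "2": "crude steel"}
label2id = {"iron ore": 0, "natural gas": 1, "crude steel": 2}

def mappings_consistent(id2label, label2id):
    # Same number of entries, and every (id, label) pair round-trips
    if len(id2label) != len(label2id):
        return False
    return all(label2id.get(label) == int(idx) for idx, label in id2label.items())

print(mappings_consistent(id2label, label2id))  # True
```

Running the same check over the real `config.json` (and `label_mapping.json`, which duplicates `label2id`) is a cheap sanity test before uploading.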
inference.py
ADDED
@@ -0,0 +1,157 @@
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import pickle
import json
import os

class SteelMaterialClassifier:
    def __init__(self, model_path):
        """
        Initialize the steel material classifier

        Args:
            model_path: Path to the model directory
        """
        self.model_path = model_path
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
        self.model.to(self.device)
        self.model.eval()

        # Load additional components
        self._load_additional_components()

    def _load_additional_components(self):
        """Load classifier and label embeddings if they exist"""
        try:
            # Load classifier if exists
            classifier_path = os.path.join(self.model_path, "classifier.pkl")
            if os.path.exists(classifier_path):
                with open(classifier_path, 'rb') as f:
                    self.classifier = pickle.load(f)
            else:
                self.classifier = None

            # Load label embeddings if exists
            embeddings_path = os.path.join(self.model_path, "label_embeddings.pkl")
            if os.path.exists(embeddings_path):
                with open(embeddings_path, 'rb') as f:
                    self.label_embeddings = pickle.load(f)
            else:
                self.label_embeddings = None

        except Exception as e:
            print(f"Warning: Could not load additional components: {e}")
            self.classifier = None
            self.label_embeddings = None

    def predict(self, text, top_k=5):
        """
        Predict steel material classification

        Args:
            text: Input text to classify
            top_k: Number of top predictions to return

        Returns:
            dict: Prediction results with labels and probabilities
        """
        # Tokenize input
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=512,
            padding=True
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        # Get model predictions
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits
            probabilities = torch.nn.functional.softmax(logits, dim=-1)

        # Get top-k predictions
        top_probs, top_indices = torch.topk(probabilities, top_k, dim=1)

        # Convert to results
        results = []
        for i in range(top_k):
            label_id = top_indices[0][i].item()
            probability = top_probs[0][i].item()
            label = self.model.config.id2label[label_id]

            results.append({
                "label": label,
                "label_id": label_id,
                "probability": probability
            })

        return {
            "predictions": results,
            "input_text": text,
            "model_info": {
                "model_name": self.model.config._name_or_path,
                "num_labels": self.model.config.num_labels,
                "device": str(self.device)
            }
        }

    def predict_batch(self, texts, top_k=5):
        """
        Predict for multiple texts

        Args:
            texts: List of input texts
            top_k: Number of top predictions to return

        Returns:
            list: List of prediction results
        """
        results = []
        for text in texts:
            result = self.predict(text, top_k)
            results.append(result)
        return results

    def get_label_info(self):
        """
        Get information about all available labels

        Returns:
            dict: Label information
        """
        return {
            "num_labels": self.model.config.num_labels,
            "id2label": self.model.config.id2label,
            "label2id": self.model.config.label2id
        }

# Example usage
if __name__ == "__main__":
    # Initialize classifier
    model_path = "."  # Current directory
    classifier = SteelMaterialClassifier(model_path)

    # Example predictions
    test_texts = [
        "μ² κ΄μμ κ³ λ‘μμ νμνμ¬ μ μ² μ μ μ‘°νλ κ³Όμ ",
        "μ²μ°κ°μ€λ₯Ό μ°λ£λ‘ μ¬μ©νμ¬ κ³ λ‘λ₯Ό κ°μ΄",
        "μνμμ 첨κ°νμ¬ μ¬λκ·Έλ₯Ό νμ±"
    ]

    print("=== Steel Material Classification Results ===")
    for text in test_texts:
        result = classifier.predict(text)
        print(f"\nInput: {text}")
        print(f"Top prediction: {result['predictions'][0]['label']} ({result['predictions'][0]['probability']:.4f})")

        # Show top 3 predictions
        print("Top 3 predictions:")
        for i, pred in enumerate(result['predictions'][:3]):
            print(f"  {i+1}. {pred['label']}: {pred['probability']:.4f}")
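The top-k step in `predict()` relies on `torch.topk`; the same selection can be sketched in plain Python, which is handy for checking results independently of torch. The labels and probabilities below are illustrative, not real model output:

```python
def top_k(probabilities, id2label, k=3):
    # Pair each probability with its label id, sort descending, keep the first k
    ranked = sorted(enumerate(probabilities), key=lambda p: p[1], reverse=True)
    return [
        {"label": id2label[i], "label_id": i, "probability": prob}
        for i, prob in ranked[:k]
    ]

id2label = {0: "iron ore", 1: "coke", 2: "blast furnace gas", 3: "slag"}
probs = [0.05, 0.62, 0.03, 0.30]

results = top_k(probs, id2label)
print([e["label"] for e in results])  # ['coke', 'slag', 'iron ore']
```

Each entry carries the same `label`/`label_id`/`probability` keys as the dictionaries built inside `SteelMaterialClassifier.predict`.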
label_embeddings.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:80277db7a3eb26fca6c66c48e4410ca6f591cfc7242e698cddf8ed13ae583026
size 206147
label_mapping.json
ADDED
@@ -0,0 +1,68 @@
{
  "μ κ²°ν": 0,
  "μ°νλ§κ·Έλ€μ": 1,
  "μ€λΈ μ½ν¬μ€": 2,
  "μ½νλ₯΄": 3,
  "μ§μ νμμ² ": 4,
  "μΌμ°ννμ": 5,
  "μ²μ°κ°μ€": 6,
  "κ°ν": 7,
  "ννΈλ‘€ λ° SBP": 8,
  "μμ²": 9,
  "λκ°μ": 10,
  "κ°μ² ": 11,
  "μνμ": 12,
  "μ°μνκΈ°λ¬Ό": 13,
  "λ©ν": 14,
  "κ³ λ‘ μ¬λκ·Έ": 15,
  "μ² μ€ν¬λ©": 16,
  "λΆμ§": 17,
  "μ€νμ ": 18,
  "μ‘νμμ κ°μ€": 19,
  "κ°μ² μ€ν¬λ©": 20,
  "νμ°λ¦¬ν¬": 21,
  "κ²½μ ": 22,
  "μλ₯ μ°λ£μ ": 23,
  "μ κΈ°": 24,
  "무μ°ν": 25,
  "μ€μΌ μ°μΌ": 26,
  "μ² κ΄μ": 27,
  "νμ°μμλνΈλ₯¨": 28,
  "νμ°λ°λ₯¨": 29,
  "ν¬μ₯μ¬": 30,
  "μ‘ν μ²μ°κ°μ€": 31,
  "μ¬λ¬μ§": 32,
  "μλ€ν": 33,
  "μ°νλ°λ₯¨": 34,
  "κ°μ€κ³΅μ₯ κ°μ€": 35,
  "νμ ": 36,
  "EAF νμ μ κ·Ή": 37,
  "μμ° μ€μΌμΌ": 38,
  "μ½ν¬μ€ μ€λΈ κ°μ€": 39,
  "EAF μΆ©μ νμ": 40,
  "κ³ λ‘κ°μ€": 41,
  "μ΄κ°μ±νμ² (HBI)": 42,
  "νΌνΈ (Peat)": 43,
  "μ μ² ": 44,
  "μμ ": 45,
  "μ°μ μ κ°λ‘ κ°μ€": 46,
  "μ΄μ μ": 47,
  "μ μμΉ©": 48,
  "μμμ²ν": 49,
  "λ§κ·Έλ€μ¬μ΄νΈ": 50,
  "μμ μ½ν¬μ€": 51,
  "ν λ ": 52,
  "μ€λ¦¬λ©μ ": 53,
  "μ‘ν μμ κ°μ€": 54,
  "λ±μ ": 55,
  "μμ±κ°μ€": 56,
  "μν": 57,
  "μ°νμΉΌμ": 58,
  "λνν": 59,
  "μ² ": 60,
  "λ₯μ² κ΄": 61,
  "μκ²°κ΄": 62,
  "κ³ μ¨ μ±ν νμμ² ": 63,
  "νλ°μ ": 64,
  "νμ°μ€νΈλ‘ ν¬": 65
}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fa9f78463531db7ec98f441bf5676f517c701cc4554814198711c1b465e9c3b8
size 1112197096
model_card.md
ADDED
@@ -0,0 +1,62 @@
# Hugging Face Model Card for Steel Material Classification

## Model Description

This model is designed to classify steel industry materials and products based on text descriptions. It uses XLM-RoBERTa as the base model and can classify input text into 66 different steel-related categories.

- **Developed by:** [Your Name/Organization]
- **Model type:** Text Classification
- **Language(s):** Korean, English (multilingual)
- **License:** [Your License]
- **Finetuned from model:** xlm-roberta-base

## Intended Uses & Limitations

### Intended Uses

This model is intended to be used for:
- Classifying steel industry materials from text descriptions
- Supporting LCA (Life Cycle Assessment) analysis in steel manufacturing
- Automating material categorization in steel industry documentation

### Limitations

- The model is specifically trained for steel industry materials and may not perform well on other domains
- Performance may vary with different text styles or technical terminology
- The model requires Korean or English text input

## Training and Evaluation Data

### Training Data

The model was trained on steel industry material descriptions and technical documents, focusing on Korean and English text related to steel manufacturing processes.

### Evaluation Data

[Add information about evaluation data]

## Training Results

### Training Infrastructure

[Add training infrastructure details]

### Training Results

- **Label Independence**: Good (average similarity: 0.1166)
- **Orthogonality**: Good (average dot product: 0.2043)
- **Overall Assessment**: The model shows good separation between different material categories

## Environmental Impact

[Add environmental impact information]

## Citation

[Add citation information]

## Glossary

- **LCA**: Life Cycle Assessment
- **Steel Industry Materials**: Raw materials, fuels, gases, products, and by-products used in steel manufacturing
- **XLM-RoBERTa**: Cross-lingual language model based on the RoBERTa architecture
preprocessor.py
ADDED
@@ -0,0 +1,127 @@
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

def preprocess_function(examples, tokenizer, max_length=512):
    """
    Preprocess text data for the steel material classification model

    Args:
        examples: Dataset examples containing text
        tokenizer: Tokenizer instance
        max_length: Maximum sequence length

    Returns:
        dict: Tokenized inputs
    """
    # Tokenize the texts
    result = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt"
    )

    return result

def postprocess_function(predictions, id2label):
    """
    Postprocess model predictions

    Args:
        predictions: Raw model predictions
        id2label: Mapping from label IDs to label names

    Returns:
        dict: Processed predictions with labels and probabilities
    """
    # Convert logits to probabilities
    probabilities = torch.nn.functional.softmax(torch.tensor(predictions), dim=-1)

    # Get top predictions
    top_probs, top_indices = torch.topk(probabilities, k=5, dim=1)

    results = []
    for i in range(len(predictions)):
        sample_results = []
        for j in range(5):
            label_id = top_indices[i][j].item()
            probability = top_probs[i][j].item()
            label = id2label[label_id]

            sample_results.append({
                "label": label,
                "label_id": label_id,
                "probability": probability
            })
        results.append(sample_results)

    return results

def validate_input(text):
    """
    Validate input text for classification

    Args:
        text: Input text to validate

    Returns:
        bool: True if valid, False otherwise
    """
    if not isinstance(text, str):
        return False

    if len(text.strip()) == 0:
        return False

    if len(text) > 1000:  # Reasonable limit for steel material descriptions
        return False

    return True

def clean_text(text):
    """
    Clean and normalize input text

    Args:
        text: Raw input text

    Returns:
        str: Cleaned text
    """
    # Remove extra whitespace
    text = " ".join(text.split())

    # Normalize Korean characters (if needed)
    # Add any specific text cleaning rules here

    return text.strip()

# Example usage
if __name__ == "__main__":
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(".")

    # Example preprocessing
    example_texts = [
        "μ² κ΄μμ κ³ λ‘μμ νμνμ¬ μ μ² μ μ μ‘°νλ κ³Όμ ",
        "μ²μ°κ°μ€λ₯Ό μ°λ£λ‘ μ¬μ©νμ¬ κ³ λ‘λ₯Ό κ°μ΄",
        "μνμμ 첨κ°νμ¬ μ¬λκ·Έλ₯Ό νμ±"
    ]

    # Clean and validate texts
    cleaned_texts = []
    for text in example_texts:
        if validate_input(text):
            cleaned_text = clean_text(text)
            cleaned_texts.append(cleaned_text)

    # Preprocess
    examples = {"text": cleaned_texts}
    tokenized = preprocess_function(examples, tokenizer)

    print("=== Preprocessing Example ===")
    print(f"Input texts: {cleaned_texts}")
    print(f"Tokenized shape: {tokenized['input_ids'].shape}")
    print(f"Attention mask shape: {tokenized['attention_mask'].shape}")
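The validation and cleaning steps above are simple string operations; a short self-contained usage sketch (the two helpers are re-stated inline so the example runs without the rest of the file):

```python
def validate_input(text):
    # Mirrors preprocessor.validate_input: non-empty string of at most 1000 chars
    return isinstance(text, str) and len(text.strip()) > 0 and len(text) <= 1000

def clean_text(text):
    # Mirrors preprocessor.clean_text: collapse runs of whitespace (incl. newlines)
    return " ".join(text.split())

raw = "  iron   ore \n charged into the blast furnace  "
assert validate_input(raw)
cleaned = clean_text(raw)
print(cleaned)  # "iron ore charged into the blast furnace"
```

Note the 1000-character cap applies to raw characters, while the tokenizer separately truncates to 512 tokens, so both limits can apply to the same input.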
requirements.txt
ADDED
@@ -0,0 +1,8 @@
torch>=1.9.0
transformers>=4.35.0
numpy>=1.21.0
scikit-learn>=1.0.0
scipy>=1.7.0
matplotlib>=3.5.0
seaborn>=0.11.0
pandas>=1.3.0
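To check an environment against these lower bounds, the standard library's `importlib.metadata` reports installed versions. A quick sketch (the dictionary below covers only a subset of the file above):

```python
from importlib.metadata import version, PackageNotFoundError

# Subset of the pins in requirements.txt
minimums = {"torch": "1.9.0", "transformers": "4.35.0", "numpy": "1.21.0"}

for package, minimum in minimums.items():
    try:
        print(f"{package}: installed {version(package)}, requires >= {minimum}")
    except PackageNotFoundError:
        print(f"{package}: not installed (requires >= {minimum})")
```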
special_tokens_map.json
ADDED
@@ -0,0 +1,15 @@
{
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
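These entries tell the tokenizer which strings play which structural role. For XLM-RoBERTa, a single sequence is wrapped as `<s> … </s>` and right-padded with `<pad>`. A toy illustration of that layout (not the real tokenizer, which also handles subword splitting):

```python
CLS, SEP, PAD = "<s>", "</s>", "<pad>"

def wrap(tokens, max_len=8):
    # Wrap a token list in <s> ... </s> and right-pad to max_len,
    # mirroring how the tokenizer uses these special tokens.
    seq = [CLS, *tokens, SEP]
    return seq + [PAD] * (max_len - len(seq))

print(wrap(["steel", "plate"]))
# → ['<s>', 'steel', 'plate', '</s>', '<pad>', '<pad>', '<pad>', '<pad>']
```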
tokenizer.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f1cc44ad7faaeec47241864835473fd5403f2da94673f3f764a77ebcb0a803ec
size 17083009
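Note that this is not the tokenizer itself but a Git LFS pointer (the `.gitattributes` rule added in this commit routes `tokenizer.json` through LFS); the real ~17 MB file is fetched by LFS on checkout. The pointer's `key value` lines are trivial to parse:

```python
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:f1cc44ad7faaeec47241864835473fd5403f2da94673f3f764a77ebcb0a803ec
size 17083009"""

# Each pointer line is "key value"; split on the first space only.
meta = dict(line.split(" ", 1) for line in pointer.splitlines())
print(meta["size"])  # → 17083009
```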
tokenizer_config.json
ADDED
@@ -0,0 +1,54 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}
usage.md
ADDED
@@ -0,0 +1,86 @@
# Steel Material Classification Model

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model
model_name = "your-username/steel-material-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Predict
text = "철광석을 고로에서 환원하여 선철을 제조하는 과정"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()

label = model.config.id2label[predicted_class]
confidence = predictions[0][predicted_class].item()
print(f"Predicted: {label} (Confidence: {confidence:.4f})")
```
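The softmax call in the snippet above is what turns the raw logits into the reported confidence. As a standalone illustration with toy logits (three made-up labels, not the model's 66):

```python
import math

def softmax(logits):
    # Numerically stable softmax: shift by the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # toy logits
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, round(probs[predicted], 3))  # → 0 0.659
```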
## Model Information

- **Base Model**: XLM-RoBERTa
- **Task**: Sequence Classification
- **Labels**: 66 steel industry materials
- **Languages**: Korean, English
- **Model Size**: ~1GB
## Supported Labels

The model can classify 66 different steel industry materials, including:

- **Raw Materials**: 철광석, 석회석, 석유 코크스, 무연탄, 갈탄
- **Fuels**: 천연가스, 액화천연가스, 경유, 휘발유, 등유
- **Gases**: 일산화탄소, 메탄, 에탄, 고로가스, 코크스 오븐 가스
- **Products**: 강철, 선철, 철, 열간성형철(HBI), 고온 성형 환원철
- **By-products**: 고로 슬래그, 압연 스케일, 분진, 슬러지, 절삭칩
- **Others**: 전기, 냉각수, 윤활유, 포장재, etc.
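Label names are resolved through `model.config.id2label` (backed by `label_mapping.json` in this repository), a plain id-to-string dictionary. A hypothetical three-label excerpt to show the shape (the real config maps all 66 ids):

```python
# Hypothetical excerpt of the id2label mapping in config.json
id2label = {0: "철광석", 1: "석회석", 2: "천연가스"}
# The inverse mapping is used when encoding training labels
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[2], label2id["석회석"])  # → 천연가스 1
```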
## Performance

- **Label Independence**: Good (average similarity: 0.1166)
- **Orthogonality**: Good (average dot product: 0.2043)
- **Overall Assessment**: The model shows good separation between different material categories
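The independence figure above is an average pairwise similarity across label embeddings (stored in `label_embeddings.pkl`). A minimal sketch of how such an average is computed, using toy 3-d vectors rather than the real embeddings:

```python
import math
from itertools import combinations

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    # Cosine similarity: dot product over the product of norms.
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Toy embeddings for three hypothetical labels
embeddings = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.1, 1.0]]
pairs = list(combinations(embeddings, 2))
avg_sim = sum(cosine(a, b) for a, b in pairs) / len(pairs)
print(round(avg_sim, 4))  # → 0.1347
```

Lower averages mean the labels occupy more distinct directions in embedding space, which is the property the numbers above report.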
## Usage Examples

### Single Prediction
```python
text = "천연가스를 연료로 사용하여 고로를 가열"
# Returns: "천연가스" with a confidence score
```

### Batch Prediction
```python
texts = [
    "철광석을 고로에서 환원하여 선철을 제조하는 과정",
    "석회석을 첨가하여 슬래그를 형성"
]
# Returns: ["철광석", "석회석"] with confidence scores
```
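For batch prediction, tokenize the list with `padding=True`, run a single forward pass, and take a row-wise argmax over the logits. The argmax step, illustrated with a toy 2×3 logit matrix instead of real model output:

```python
# Toy batch of logits: 2 inputs x 3 hypothetical labels
batch_logits = [
    [2.0, 0.5, 0.1],
    [0.2, 0.1, 1.5],
]
# One predicted label index per row
predicted = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
print(predicted)  # → [0, 2]
```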
## Installation

```bash
pip install torch transformers
```

## License

[Add your license information]

## Citation

If you use this model in your research, please cite:

```bibtex
[Add citation information here]
```
웹사이트_업로드_가이드.md
ADDED
@@ -0,0 +1,76 @@
# How to Upload via the Hugging Face Website

## Step 1: Create a Hugging Face Account / Log In
1. Go to https://huggingface.co
2. Sign up or log in

## Step 2: Create a New Model Repository
1. Click the "New" button at the top right
2. Select "Model"
3. Enter the repository name: `steel-material-classifier`
4. Click "Create repository"

## Step 3: Upload Files
1. On the new repository's page, open the "Files and versions" tab
2. Click "Add file" → "Upload files"
3. Select and upload all of the following files:

### Required files:
- `config.json`
- `model.safetensors`
- `tokenizer.json`
- `tokenizer_config.json`
- `special_tokens_map.json`
- `label_mapping.json`

### Additional files:
- `classifier.pkl`
- `label_embeddings.pkl`
- `label_embeddings.pkl.backup`
- `README.md`
- `requirements.txt`
- `inference.py`
- `preprocessor.py`
- `model_card.md`
- `usage.md`
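As an alternative to the browser upload (useful for the ~1 GB `model.safetensors`), the `huggingface_hub` library can push a whole folder programmatically. A sketch, assuming `pip install huggingface_hub` and a prior `huggingface-cli login`:

```python
def upload_model(repo_id, folder="."):
    # Pushes every file in `folder` to the Hub repository in one commit.
    # Requires the huggingface_hub package and an authenticated session.
    from huggingface_hub import HfApi
    HfApi().upload_folder(folder_path=folder, repo_id=repo_id, repo_type="model")

# Example call (not run here):
# upload_model("YOUR_USERNAME/steel-material-classifier")
```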
## Step 4: Write a Commit Message
- Enter "Initial commit: Steel material classification model" in the "Commit message" field
- Click "Commit changes to main"

## Step 5: Configure Model Information
1. On the repository page, open the "Settings" tab
2. Edit the model information in the "Model Card" section:
   - License: select an appropriate license
   - Model Card: write it based on the contents of `model_card.md`

## Step 6: Test the Upload
Once the upload finishes, verify it with the following code:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model
model_name = "YOUR_USERNAME/steel-material-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Run a test prediction
text = "철광석을 고로에서 환원하여 선철을 제조하는 과정"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()

label = model.config.id2label[predicted_class]
confidence = predictions[0][predicted_class].item()
print(f"Predicted: {label} (Confidence: {confidence:.4f})")
```

## Notes
- Uploading may take a while when files are large
- `model.safetensors` is about 1 GB, so a stable internet connection is required
- Do not close your browser while the upload is in progress