---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-generation
---
# Purpose
As part of a project assignment in EPFL's CS-401 class, we needed a simple model that estimates how important a character is to a movie's plot. From the movie script data crawled from https://imsdb.com/, we calculated each character's gold portion of the entire script and joined it with movie plot datasets.
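As a rough illustration of the target label, a character's gold portion can be computed as their share of the script's dialogue lines. This is a minimal sketch under our own assumptions (the real pipeline parses crawled IMSDb scripts; the `(speaker, line)` representation and the line-count definition are illustrative, not from the model card):

```python
from collections import Counter

def gold_portion(dialogue_lines, character_name):
    """Share of dialogue lines spoken by `character_name`, as a percentage."""
    counts = Counter(speaker for speaker, _ in dialogue_lines)
    total = sum(counts.values())
    return 100.0 * counts[character_name] / total if total else 0.0

# Toy script: (speaker, line) pairs, as they might come out of a script parse.
script = [("RIPLEY", "Ash, can you hear me?"),
          ("ASH", "Yes, I can hear you."),
          ("RIPLEY", "Blow it out into space.")]
print(round(gold_portion(script, "RIPLEY"), 1))  # 66.7
```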
# Model info
- Input prompt format

  `f"Predict the percentage of a movie's plot that a character takes up.\nCharacter: {character_name} \nPlot: {plot}"`
- Output

  13.4

We used a maximum token length of 2048 for training.
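The prompt above can be assembled with a small helper (the helper name and the example inputs are ours, not part of the model card; the string format follows the training prompt exactly, including the space before `\nPlot:`):

```python
def build_prompt(character_name, plot):
    # Exactly the training prompt format described above.
    return (f"Predict the percentage of a movie's plot that a character takes up."
            f"\nCharacter: {character_name} \nPlot: {plot}")

# Illustrative inputs only.
prompt = build_prompt("Ripley", "A commercial crew intercepts a distress call from a derelict ship.")
print(prompt)
```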
Sample code:
```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained("Hyeongdon/t5-large-character_plot_portion")  # same as the default T5 tokenizer
model = T5ForConditionalGeneration.from_pretrained("Hyeongdon/t5-large-character_plot_portion").to(device)
model.eval()

model_inputs = tokenizer(prompts, max_length=2048, truncation=True, padding='max_length', return_tensors='pt')
with torch.no_grad():
    output_ids = model.generate(input_ids=model_inputs['input_ids'].to(device),
                                attention_mask=model_inputs['attention_mask'].to(device))
predictions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)  # e.g. ["13.4"]
```
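Since the model emits the percentage as free text, a small parsing step (our addition, not part of the model card) guards against the occasional non-numeric generation:

```python
def parse_portion(text):
    # The model is trained to emit a bare number such as "13.4";
    # return None for any non-numeric generation instead of raising.
    try:
        return float(text.strip())
    except ValueError:
        return None

print(parse_portion("13.4"))     # 13.4
print(parse_portion("unknown"))  # None
```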
# Limitation & Tips
ChatGPT shows better performance without any fine-tuning: based on our internal metric, T5-large slightly underperforms GPT-3.5 and GPT-4.

If you are interested in our research project, see https://margg00.github.io/