pierluigic committed on commit 5c07c21 (verified · 1 parent: 97226a7)

Update README.md

Files changed (1): README.md (+81 −3)
README.md CHANGED

Before:

```markdown
  - meta-llama/Meta-Llama-3-8B
pipeline_tag: text2text-generation
---
# Janus
(Built with Meta Llama 3)

A model for _dictionary example sentence generation_

More details will be provided soon.
```
After:

## Janus

(Built with Meta Llama 3)

### Model Details
- **Model Name**: Janus (Sense-Specific Historical Word Usage Generation)
- **Version**: 1.0
- **Developers**: Pierluigi Cassotti, Nina Tahmasebi
- **Affiliation**: University of Gothenburg
- **License**: MIT
- **Repository**: [Hugging Face Model Hub](https://huggingface.co/ChangeIsKey/llama3-janus)
- **Paper**: [Sense-specific Historical Word Usage Generation](https://arxiv.org/abs/XXXXXXX)
- **Contact**: pierluigi.cassotti@gu.se

### Model Description
Janus is a fine-tuned **Llama 3 8B** model designed to generate historically and semantically accurate word usages. It takes a word, its sense definition, and a year as input, and produces example sentences that reflect linguistic usage of the specified period. This makes it particularly useful for **semantic change detection**, **historical NLP**, and **linguistic research**.

### Intended Use
- **Semantic Change Detection**: Investigating how word meanings evolve over time.
- **Historical Text Processing**: Enhancing the understanding and modeling of historical texts.
- **Corpus Expansion**: Generating sense-annotated corpora for linguistic studies.

### Training Data
- **Dataset**: Extracted from the **Oxford English Dictionary (OED)**
- **Size**: Over **1.2 million** sense-annotated historical usages
- **Time Span**: **1700–2020**
- **Data Format**:
  ```
  <year><|t|><lemma><|t|><definition><|s|><historical usage sentence><|end|>
  ```
- **Janus (PoS) Format**:
  ```
  <year><|t|><lemma><|t|><definition><|p|><PoS><|p|><|s|><historical usage sentence><|end|>
  ```
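The two input formats above are plain strings, so prompts can be assembled programmatically. A minimal sketch (the `build_prompt` helper is illustrative, not part of the released code):

```python
def build_prompt(year, lemma, definition, pos=None):
    """Assemble a Janus prompt in the documented format.

    Without `pos`, the base format is produced; with a PoS tag, the
    Janus (PoS) variant. The usage sentence after <|s|> is what the
    model is expected to generate at inference time.
    """
    head = f"{year}<|t|>{lemma}<|t|>{definition}"
    if pos is not None:
        head += f"<|p|>{pos}<|p|>"
    return head + "<|s|>"

print(build_prompt(1800, "awful", "Exceedingly great."))
# 1800<|t|>awful<|t|>Exceedingly great.<|s|>
```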
### Training Procedure
- **Base Model**: `meta-llama/Meta-Llama-3-8B`
- **Optimization**: **QLoRA** (Quantized Low-Rank Adaptation)
- **Batch Size**: **4**
- **Learning Rate**: **2e-4**
- **Epochs**: **1**
- **Framework**: Hugging Face Transformers
- **Fine-tuning Script**: `finetuning.py`
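In outline, a QLoRA setup like the one above can be expressed with `transformers` + `peft`. The sketch below is an assumption-laden illustration: LoRA rank, alpha, dropout, target modules, and quantization settings are guesses, not the released `finetuning.py` configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights: the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections (rank/alpha are illustrative).
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hyperparameters stated in this card: batch size 4, learning rate 2e-4, 1 epoch.
args = TrainingArguments(
    output_dir="janus-qlora",
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    num_train_epochs=1,
)
```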
### Model Performance
- **Temporal Accuracy**: Root mean squared error (RMSE) of **~52.7 years**, close to that measured on OED ground-truth usages
- **Semantic Accuracy**: Comparable to human evaluations on OED test data
- **Context Variability**: Low lexical repetition, preserving natural linguistic diversity
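For reference, the RMSE figure above is the standard root mean squared error between the target year of each prompt and the year attributed to the generated sentence. A small illustrative computation (the year values below are made up, not the paper's data):

```python
import math

def rmse(target_years, attributed_years):
    # Root mean squared error between intended and attributed years.
    assert len(target_years) == len(attributed_years)
    squared_errors = [(t - a) ** 2 for t, a in zip(target_years, attributed_years)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse([1800, 1900, 2000], [1850, 1880, 2005]))  # ≈ 31.22
```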
### Usage Example
#### Generating Historical Usages
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "ChangeIsKey/llama3-janus"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Prompt format: <year><|t|><lemma><|t|><definition><|s|>
input_text = "1800<|t|>awful<|t|>Used to emphasize something unpleasant or negative; ‘such a’, ‘an absolute’.<|s|>"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Sampling must be enabled for temperature/top_p to take effect.
output = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
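The decoded string includes the prompt as well as the continuation. If only the generated usage is needed, one option (assuming the `<|s|>` and `<|end|>` markers survive decoding, e.g. with `skip_special_tokens=False`) is to split on those markers; the helper below is illustrative:

```python
def extract_usage(decoded: str) -> str:
    """Return only the generated sentence from a decoded Janus output.

    Assumes the decoded string still contains the <|s|> separator and,
    optionally, a trailing <|end|> marker.
    """
    # Keep everything after the last <|s|> separator...
    usage = decoded.rsplit("<|s|>", 1)[-1]
    # ...and drop anything from <|end|> onward.
    return usage.split("<|end|>", 1)[0].strip()

example = "1800<|t|>awful<|t|>Exceedingly great.<|s|>It was an awful long way.<|end|>"
print(extract_usage(example))  # It was an awful long way.
```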
For batch processing, refer to `predict_finetuned.py`.
### Limitations & Ethical Considerations
- **Historical Bias**: The model may reflect biases present in historical texts.
- **Time Granularity**: The temporal resolution is approximate (~50-year RMSE).
- **Modern Influence**: Despite fine-tuning, the model may still generate modern phrasing in older contexts.
- **Not Trained for Fairness**: The model has not been explicitly trained to be fair or unbiased and may produce sensitive, outdated, or culturally inappropriate content.
### Citation
If you use Janus, please cite:
```bibtex
@article{Cassotti2024Janus,
  author  = {Pierluigi Cassotti and Nina Tahmasebi},
  title   = {Sense-specific Historical Word Usage Generation},
  journal = {TACL},
  year    = {2025}
}
```