Sampath1987 commited on
Commit
438976f
·
verified ·
1 Parent(s): 00bf15b

fine-tuned EnergyEmbed-v2 2 epochs

Browse files
1_Pooling/config.json CHANGED
@@ -1,7 +1,7 @@
1
  {
2
  "word_embedding_dimension": 768,
3
- "pooling_mode_cls_token": false,
4
- "pooling_mode_mean_tokens": true,
5
  "pooling_mode_max_tokens": false,
6
  "pooling_mode_mean_sqrt_len_tokens": false,
7
  "pooling_mode_weightedmean_tokens": false,
 
1
  {
2
  "word_embedding_dimension": 768,
3
+ "pooling_mode_cls_token": true,
4
+ "pooling_mode_mean_tokens": false,
5
  "pooling_mode_max_tokens": false,
6
  "pooling_mode_mean_sqrt_len_tokens": false,
7
  "pooling_mode_weightedmean_tokens": false,
README.md CHANGED
@@ -7,7 +7,7 @@ tags:
7
  - generated_from_trainer
8
  - dataset_size:89129
9
  - loss:MultipleNegativesRankingLoss
10
- base_model: google/embeddinggemma-300m
11
  widget:
12
  - source_sentence: How does vendor-specific data acquisition affect DTS profile interpretation?
13
  sentences:
@@ -485,7 +485,7 @@ library_name: sentence-transformers
485
  metrics:
486
  - cosine_accuracy
487
  model-index:
488
- - name: SentenceTransformer based on google/embeddinggemma-300m
489
  results:
490
  - task:
491
  type: triplet
@@ -495,20 +495,20 @@ model-index:
495
  type: ai-job-validation
496
  metrics:
497
  - type: cosine_accuracy
498
- value: 0.7938246130943298
499
  name: Cosine Accuracy
500
  ---
501
 
502
- # SentenceTransformer based on google/embeddinggemma-300m
503
 
504
- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m) on the [offshore_energy](https://huggingface.co/datasets/Sampath1987/offshore_energy) dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
505
 
506
  ## Model Details
507
 
508
  ### Model Description
509
  - **Model Type:** Sentence Transformer
510
- - **Base model:** [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m) <!-- at revision 57c266a740f537b4dc058e1b0cda161fd15afa75 -->
511
- - **Maximum Sequence Length:** 2048 tokens
512
  - **Output Dimensionality:** 768 dimensions
513
  - **Similarity Function:** Cosine Similarity
514
  - **Training Dataset:**
@@ -526,11 +526,9 @@ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [g
526
 
527
  ```
528
  SentenceTransformer(
529
- (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
530
- (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
531
- (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
532
- (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
533
- (4): Normalize()
534
  )
535
  ```
536
 
@@ -551,23 +549,21 @@ from sentence_transformers import SentenceTransformer
551
  # Download from the 🤗 Hub
552
  model = SentenceTransformer("Sampath1987/EnergyEmbed-v2")
553
  # Run inference
554
- queries = [
555
- "What role did anti-collision analysis play in the drilling of the dual lateral well?",
556
- ]
557
- documents = [
558
  'This paper aims to analyze the impact of appraising and developing marginal fields with multiple stacked reservoirs which is quite challenging in terms of techno commercial value. The development of such marginal reservoirs using conventional single horizontal wells drilling and completion is uneconomical. Therefore, it was necessary to engineer a solution that can enhance the commercial value of the project by reducing CAPEX and OPEX. This paper will present the first comprehensive business case, where multiple stacked reservoirs with marginal reserves were studied to produce independently using multilateral completions, granting full accessibility of the laterals while achieving production monitoring and reservoir surveillance.',
559
  "The most common challenge in horizontal drilling is depth uncertainty which can be due to poor seismic data or interpretation. It is arguable that a successful landing of the wellbore in the reservoir optimally and within the desired zone is the most challenging in most geosteering operation. The presence of fluid contacts such as oil-water-contact (OWC) and gas-oil-contact (GOC) complicates the whole drilling process, most especially if these fluid contacts are not well defined or known. Additionally, the ability to map the boundaries of the reservoir as the BHA drills the lateral section is an added advantage to remaining within the desired reservoir section.\nThe success of any reservoir navigation service where seismic uncertainty at the reservoir top is high will rely largely on how effective the geosteering system is and how the geosteering engineer is able to react promptly to changes while landing the well in the reservoir and drilling the lateral section with without exiting the reservoir.\nReservoir Navigation Service (RNS) provides the means for the drilling near horizontal or horizontal wells for the purpose of increasing hydrocarbon extraction from the earth's subsurface. This involves the use of a pre-defined bottom hole assembly (BHA) with inbuilt downhole logging while drilling (LWD) and measurement while drilling (MWD) sensors. The measurements from these downhole sensors are uplinked to the surface of the wellbore where they are converted to meaningful petrophysical data. The goal is to use the downhole petrophysical data such as gamma ray, propagation resistivity and so on, to update an existing pre-well geological model of a section of the earth in such a way that the final result depicts the true model picture of the earth subsurface.\nThis paper focuses on using well CBH-44L to showcase how the use of real-time distance-to-boundary (D2B) measurement from a deep reading azimuthal propagation resistivity tool is use to correct for depth uncertainty in seismic, thereby, improving the chance of successfully landing and drilling a horizontal well.",
560
- 'This paper is a comprehensive analytic driven study on the use and sizing of membrane filters to improve the injected water quality for maintaining injectivity in tight carbonate reservoirs. Out of the different mechanisms of formation damage, the pore plugging with the migration of particles within the injectant fluids by bridging at the pore throat junctions and/or by pore filling can lead to the buildup of an internal filter cake away from the wellbore that limits the well’s injectivity and can affect the vertical and lateral sweep.\nThis type of formation damage is very difficult to treat with any kind of stimulation and the impact will be manifested especially in tight formations with interbedded stylolites layers with a total range of permeabilities from 2 to less than 1 milli-Darcy and a median pore throat size ranging from 2.5 to 0.3 micron meters. The study comprises several parts starting with a geological analysis that was conducted to identify areas and layers most prone to formation pore plugging by analyzing thin-sections and MICP data. Second, in the lack of core flood tests, a reservoir and well study analyzed existing water injectors situated in similar or slightly higher quality rock areas through the analysis of injectivity index behavior to estimate the impact of damage and the expected injector’s half-life.\nAs a result, through the application of an analytical mathematical model for defining deep bed filtration parameter, a correlation was established based on average injected particle size and reservoir rock quality to aid in selecting the proper water injection filter size. In order to confirm that, a dedicated injectivity test in a horizontal well utilizing membrane filters was carried out to assess eventual formation damage and the filters efficiency by conducting a series of multiple pressure fall-off tests coupled with injection profile logging to monitor any induced damage within the wellbore region.\nFinally, the operational aspects and the integration within field development plans were addressed, especially with the recommended well placement and completion. This culminated in a field development strategy for formation damage mitigation in tight carbonate reservoirs during production and injection phase that can be used in other similar fields.',
561
  ]
562
- query_embeddings = model.encode_query(queries)
563
- document_embeddings = model.encode_document(documents)
564
- print(query_embeddings.shape, document_embeddings.shape)
565
- # [1, 768] [3, 768]
566
 
567
  # Get the similarity scores for the embeddings
568
- similarities = model.similarity(query_embeddings, document_embeddings)
569
  print(similarities)
570
- # tensor([[0.2598, 0.1674, 0.3299]])
 
 
571
  ```
572
 
573
  <!--
@@ -605,7 +601,7 @@ You can finetune this model on your own dataset.
605
 
606
  | Metric | Value |
607
  |:--------------------|:-----------|
608
- | **cosine_accuracy** | **0.7938** |
609
 
610
  <!--
611
  ## Bias, Risks and Limitations
@@ -629,10 +625,10 @@ You can finetune this model on your own dataset.
629
  * Size: 89,129 training samples
630
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
631
  * Approximate statistics based on the first 1000 samples:
632
- | | anchor | positive | negative |
633
- |:--------|:-----------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
634
- | type | string | string | string |
635
- | details | <ul><li>min: 11 tokens</li><li>mean: 21.04 tokens</li><li>max: 51 tokens</li></ul> | <ul><li>min: 35 tokens</li><li>mean: 399.5 tokens</li><li>max: 833 tokens</li></ul> | <ul><li>min: 29 tokens</li><li>mean: 379.46 tokens</li><li>max: 986 tokens</li></ul> |
636
  * Samples:
637
  | anchor | positive | negative |
638
  |:--------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
@@ -656,10 +652,10 @@ You can finetune this model on your own dataset.
656
  * Size: 11,141 evaluation samples
657
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
658
  * Approximate statistics based on the first 1000 samples:
659
- | | anchor | positive | negative |
660
- |:--------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|
661
- | type | string | string | string |
662
- | details | <ul><li>min: 11 tokens</li><li>mean: 20.75 tokens</li><li>max: 46 tokens</li></ul> | <ul><li>min: 39 tokens</li><li>mean: 393.73 tokens</li><li>max: 948 tokens</li></ul> | <ul><li>min: 38 tokens</li><li>mean: 374.6 tokens</li><li>max: 953 tokens</li></ul> |
663
  * Samples:
664
  | anchor | positive | negative |
665
  |:------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
@@ -679,7 +675,10 @@ You can finetune this model on your own dataset.
679
  #### Non-Default Hyperparameters
680
 
681
  - `eval_strategy`: steps
 
 
682
  - `learning_rate`: 2e-05
 
683
  - `warmup_ratio`: 0.1
684
 
685
  #### All Hyperparameters
@@ -689,8 +688,8 @@ You can finetune this model on your own dataset.
689
  - `do_predict`: False
690
  - `eval_strategy`: steps
691
  - `prediction_loss_only`: True
692
- - `per_device_train_batch_size`: 8
693
- - `per_device_eval_batch_size`: 8
694
  - `per_gpu_train_batch_size`: None
695
  - `per_gpu_eval_batch_size`: None
696
  - `gradient_accumulation_steps`: 1
@@ -702,7 +701,7 @@ You can finetune this model on your own dataset.
702
  - `adam_beta2`: 0.999
703
  - `adam_epsilon`: 1e-08
704
  - `max_grad_norm`: 1.0
705
- - `num_train_epochs`: 3
706
  - `max_steps`: -1
707
  - `lr_scheduler_type`: linear
708
  - `lr_scheduler_kwargs`: {}
@@ -808,39 +807,17 @@ You can finetune this model on your own dataset.
808
  ### Training Logs
809
  | Epoch | Step | Training Loss | Validation Loss | ai-job-validation_cosine_accuracy |
810
  |:------:|:-----:|:-------------:|:---------------:|:---------------------------------:|
811
- | 0.0898 | 1000 | - | 1.0302 | 0.6678 |
812
- | 0.1795 | 2000 | - | 1.0889 | 0.6811 |
813
- | 0.2693 | 3000 | - | 1.1934 | 0.6205 |
814
- | 0.3590 | 4000 | - | 1.2005 | 0.6324 |
815
- | 0.4488 | 5000 | 1.1715 | 1.1797 | 0.6148 |
816
- | 0.5385 | 6000 | - | 1.1440 | 0.6334 |
817
- | 0.6283 | 7000 | - | 1.1464 | 0.6311 |
818
- | 0.7180 | 8000 | - | 1.1275 | 0.6468 |
819
- | 0.8078 | 9000 | - | 1.0974 | 0.6652 |
820
- | 0.8975 | 10000 | 1.1407 | 1.0905 | 0.6423 |
821
- | 0.9873 | 11000 | - | 1.0600 | 0.6574 |
822
- | 1.0770 | 12000 | - | 1.0500 | 0.6718 |
823
- | 1.1668 | 13000 | - | 1.0548 | 0.6770 |
824
- | 1.2565 | 14000 | - | 0.9979 | 0.6871 |
825
- | 1.3463 | 15000 | 0.9591 | 0.9846 | 0.6877 |
826
- | 1.4360 | 16000 | - | 0.9691 | 0.7066 |
827
- | 1.5258 | 17000 | - | 0.9458 | 0.7059 |
828
- | 1.6155 | 18000 | - | 0.9298 | 0.7215 |
829
- | 1.7053 | 19000 | - | 0.9101 | 0.7230 |
830
- | 1.7950 | 20000 | 0.869 | 0.9103 | 0.7333 |
831
- | 1.8848 | 21000 | - | 0.8882 | 0.7280 |
832
- | 1.9745 | 22000 | - | 0.8878 | 0.7346 |
833
- | 2.0643 | 23000 | - | 0.9356 | 0.7624 |
834
- | 2.1540 | 24000 | - | 0.9204 | 0.7587 |
835
- | 2.2438 | 25000 | 0.6282 | 0.9052 | 0.7653 |
836
- | 2.3335 | 26000 | - | 0.8738 | 0.7686 |
837
- | 2.4233 | 27000 | - | 0.8665 | 0.7791 |
838
- | 2.5130 | 28000 | - | 0.8532 | 0.7761 |
839
- | 2.6028 | 29000 | - | 0.8280 | 0.7867 |
840
- | 2.6925 | 30000 | 0.4219 | 0.8481 | 0.7779 |
841
- | 2.7823 | 31000 | - | 0.8139 | 0.7938 |
842
- | 2.8720 | 32000 | - | 0.7972 | 0.7929 |
843
- | 2.9618 | 33000 | - | 0.7943 | 0.7938 |
844
 
845
 
846
  ### Framework Versions
 
7
  - generated_from_trainer
8
  - dataset_size:89129
9
  - loss:MultipleNegativesRankingLoss
10
+ base_model: Alibaba-NLP/gte-multilingual-base
11
  widget:
12
  - source_sentence: How does vendor-specific data acquisition affect DTS profile interpretation?
13
  sentences:
 
485
  metrics:
486
  - cosine_accuracy
487
  model-index:
488
+ - name: SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
489
  results:
490
  - task:
491
  type: triplet
 
495
  type: ai-job-validation
496
  metrics:
497
  - type: cosine_accuracy
498
+ value: 0.7581904530525208
499
  name: Cosine Accuracy
500
  ---
501
 
502
+ # SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
503
 
504
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) on the [offshore_energy](https://huggingface.co/datasets/Sampath1987/offshore_energy) dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
505
 
506
  ## Model Details
507
 
508
  ### Model Description
509
  - **Model Type:** Sentence Transformer
510
+ - **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) <!-- at revision 9bbca17d9273fd0d03d5725c7a4b0f6b45142062 -->
511
+ - **Maximum Sequence Length:** 8192 tokens
512
  - **Output Dimensionality:** 768 dimensions
513
  - **Similarity Function:** Cosine Similarity
514
  - **Training Dataset:**
 
526
 
527
  ```
528
  SentenceTransformer(
529
+ (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'NewModel'})
530
+ (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
531
+ (2): Normalize()
 
 
532
  )
533
  ```
534
 
 
549
  # Download from the 🤗 Hub
550
  model = SentenceTransformer("Sampath1987/EnergyEmbed-v2")
551
  # Run inference
552
+ sentences = [
553
+ 'What role did anti-collision analysis play in the drilling of the dual lateral well?',
 
 
554
  'This paper aims to analyze the impact of appraising and developing marginal fields with multiple stacked reservoirs which is quite challenging in terms of techno commercial value. The development of such marginal reservoirs using conventional single horizontal wells drilling and completion is uneconomical. Therefore, it was necessary to engineer a solution that can enhance the commercial value of the project by reducing CAPEX and OPEX. This paper will present the first comprehensive business case, where multiple stacked reservoirs with marginal reserves were studied to produce independently using multilateral completions, granting full accessibility of the laterals while achieving production monitoring and reservoir surveillance.',
555
  "The most common challenge in horizontal drilling is depth uncertainty which can be due to poor seismic data or interpretation. It is arguable that a successful landing of the wellbore in the reservoir optimally and within the desired zone is the most challenging in most geosteering operation. The presence of fluid contacts such as oil-water-contact (OWC) and gas-oil-contact (GOC) complicates the whole drilling process, most especially if these fluid contacts are not well defined or known. Additionally, the ability to map the boundaries of the reservoir as the BHA drills the lateral section is an added advantage to remaining within the desired reservoir section.\nThe success of any reservoir navigation service where seismic uncertainty at the reservoir top is high will rely largely on how effective the geosteering system is and how the geosteering engineer is able to react promptly to changes while landing the well in the reservoir and drilling the lateral section with without exiting the reservoir.\nReservoir Navigation Service (RNS) provides the means for the drilling near horizontal or horizontal wells for the purpose of increasing hydrocarbon extraction from the earth's subsurface. This involves the use of a pre-defined bottom hole assembly (BHA) with inbuilt downhole logging while drilling (LWD) and measurement while drilling (MWD) sensors. The measurements from these downhole sensors are uplinked to the surface of the wellbore where they are converted to meaningful petrophysical data. The goal is to use the downhole petrophysical data such as gamma ray, propagation resistivity and so on, to update an existing pre-well geological model of a section of the earth in such a way that the final result depicts the true model picture of the earth subsurface.\nThis paper focuses on using well CBH-44L to showcase how the use of real-time distance-to-boundary (D2B) measurement from a deep reading azimuthal propagation resistivity tool is use to correct for depth uncertainty in seismic, thereby, improving the chance of successfully landing and drilling a horizontal well.",
 
556
  ]
557
+ embeddings = model.encode(sentences)
558
+ print(embeddings.shape)
559
+ # [3, 768]
 
560
 
561
  # Get the similarity scores for the embeddings
562
+ similarities = model.similarity(embeddings, embeddings)
563
  print(similarities)
564
+ # tensor([[1.0000, 0.5285, 0.4524],
565
+ # [0.5285, 1.0000, 0.5421],
566
+ # [0.4524, 0.5421, 1.0000]])
567
  ```
568
 
569
  <!--
 
601
 
602
  | Metric | Value |
603
  |:--------------------|:-----------|
604
+ | **cosine_accuracy** | **0.7582** |
605
 
606
  <!--
607
  ## Bias, Risks and Limitations
 
625
  * Size: 89,129 training samples
626
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
627
  * Approximate statistics based on the first 1000 samples:
628
+ | | anchor | positive | negative |
629
+ |:--------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|
630
+ | type | string | string | string |
631
+ | details | <ul><li>min: 12 tokens</li><li>mean: 24.68 tokens</li><li>max: 77 tokens</li></ul> | <ul><li>min: 37 tokens</li><li>mean: 437.61 tokens</li><li>max: 983 tokens</li></ul> | <ul><li>min: 28 tokens</li><li>mean: 410.96 tokens</li><li>max: 1188 tokens</li></ul> |
632
  * Samples:
633
  | anchor | positive | negative |
634
  |:--------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 
652
  * Size: 11,141 evaluation samples
653
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
654
  * Approximate statistics based on the first 1000 samples:
655
+ | | anchor | positive | negative |
656
+ |:--------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
657
+ | type | string | string | string |
658
+ | details | <ul><li>min: 12 tokens</li><li>mean: 24.37 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 38 tokens</li><li>mean: 428.35 tokens</li><li>max: 978 tokens</li></ul> | <ul><li>min: 29 tokens</li><li>mean: 405.3 tokens</li><li>max: 1111 tokens</li></ul> |
659
  * Samples:
660
  | anchor | positive | negative |
661
  |:------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 
675
  #### Non-Default Hyperparameters
676
 
677
  - `eval_strategy`: steps
678
+ - `per_device_train_batch_size`: 16
679
+ - `per_device_eval_batch_size`: 16
680
  - `learning_rate`: 2e-05
681
+ - `num_train_epochs`: 2
682
  - `warmup_ratio`: 0.1
683
 
684
  #### All Hyperparameters
 
688
  - `do_predict`: False
689
  - `eval_strategy`: steps
690
  - `prediction_loss_only`: True
691
+ - `per_device_train_batch_size`: 16
692
+ - `per_device_eval_batch_size`: 16
693
  - `per_gpu_train_batch_size`: None
694
  - `per_gpu_eval_batch_size`: None
695
  - `gradient_accumulation_steps`: 1
 
701
  - `adam_beta2`: 0.999
702
  - `adam_epsilon`: 1e-08
703
  - `max_grad_norm`: 1.0
704
+ - `num_train_epochs`: 2
705
  - `max_steps`: -1
706
  - `lr_scheduler_type`: linear
707
  - `lr_scheduler_kwargs`: {}
 
807
  ### Training Logs
808
  | Epoch | Step | Training Loss | Validation Loss | ai-job-validation_cosine_accuracy |
809
  |:------:|:-----:|:-------------:|:---------------:|:---------------------------------:|
810
+ | 0.1795 | 1000 | - | 1.1499 | 0.6699 |
811
+ | 0.3590 | 2000 | - | 1.0897 | 0.6842 |
812
+ | 0.5385 | 3000 | - | 1.0492 | 0.7062 |
813
+ | 0.7180 | 4000 | - | 1.0260 | 0.7209 |
814
+ | 0.8975 | 5000 | 1.1918 | 1.0042 | 0.7281 |
815
+ | 1.0770 | 6000 | - | 1.0018 | 0.7334 |
816
+ | 1.2565 | 7000 | - | 0.9779 | 0.7456 |
817
+ | 1.4360 | 8000 | - | 0.9708 | 0.7486 |
818
+ | 1.6155 | 9000 | - | 0.9583 | 0.7560 |
819
+ | 1.7950 | 10000 | 0.9342 | 0.9509 | 0.7606 |
820
+ | 1.9745 | 11000 | - | 0.9479 | 0.7582 |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
821
 
822
 
823
  ### Framework Versions
config.json CHANGED
@@ -1,61 +1,49 @@
1
  {
2
- "_sliding_window_pattern": 6,
3
  "architectures": [
4
- "Gemma3TextModel"
5
  ],
6
- "attention_bias": false,
7
- "attention_dropout": 0.0,
8
- "attn_logit_softcapping": null,
9
- "bos_token_id": 2,
10
- "dtype": "float32",
11
- "eos_token_id": 1,
12
- "final_logit_softcapping": null,
13
- "head_dim": 256,
14
- "hidden_activation": "gelu_pytorch_tanh",
 
 
 
 
15
  "hidden_size": 768,
 
 
 
16
  "initializer_range": 0.02,
17
- "intermediate_size": 1152,
18
- "layer_types": [
19
- "sliding_attention",
20
- "sliding_attention",
21
- "sliding_attention",
22
- "sliding_attention",
23
- "sliding_attention",
24
- "full_attention",
25
- "sliding_attention",
26
- "sliding_attention",
27
- "sliding_attention",
28
- "sliding_attention",
29
- "sliding_attention",
30
- "full_attention",
31
- "sliding_attention",
32
- "sliding_attention",
33
- "sliding_attention",
34
- "sliding_attention",
35
- "sliding_attention",
36
- "full_attention",
37
- "sliding_attention",
38
- "sliding_attention",
39
- "sliding_attention",
40
- "sliding_attention",
41
- "sliding_attention",
42
- "full_attention"
43
- ],
44
- "max_position_embeddings": 2048,
45
- "model_type": "gemma3_text",
46
- "num_attention_heads": 3,
47
- "num_hidden_layers": 24,
48
- "num_key_value_heads": 1,
49
- "pad_token_id": 0,
50
- "query_pre_attn_scalar": 256,
51
- "rms_norm_eps": 1e-06,
52
- "rope_local_base_freq": 10000.0,
53
- "rope_scaling": null,
54
- "rope_theta": 1000000.0,
55
- "sliding_window": 512,
56
  "torch_dtype": "float32",
57
  "transformers_version": "4.53.3",
58
- "use_bidirectional_attention": true,
59
- "use_cache": true,
60
- "vocab_size": 262144
 
61
  }
 
1
  {
 
2
  "architectures": [
3
+ "NewModel"
4
  ],
5
+ "attention_probs_dropout_prob": 0.0,
6
+ "auto_map": {
7
+ "AutoConfig": "configuration.NewConfig",
8
+ "AutoModel": "modeling.NewModel",
9
+ "AutoModelForMaskedLM": "Alibaba-NLP/new-impl--modeling.NewForMaskedLM",
10
+ "AutoModelForMultipleChoice": "Alibaba-NLP/new-impl--modeling.NewForMultipleChoice",
11
+ "AutoModelForQuestionAnswering": "Alibaba-NLP/new-impl--modeling.NewForQuestionAnswering",
12
+ "AutoModelForSequenceClassification": "Alibaba-NLP/new-impl--modeling.NewForSequenceClassification",
13
+ "AutoModelForTokenClassification": "Alibaba-NLP/new-impl--modeling.NewForTokenClassification"
14
+ },
15
+ "classifier_dropout": 0.0,
16
+ "hidden_act": "gelu",
17
+ "hidden_dropout_prob": 0.1,
18
  "hidden_size": 768,
19
+ "id2label": {
20
+ "0": "LABEL_0"
21
+ },
22
  "initializer_range": 0.02,
23
+ "intermediate_size": 3072,
24
+ "label2id": {
25
+ "LABEL_0": 0
26
+ },
27
+ "layer_norm_eps": 1e-12,
28
+ "layer_norm_type": "layer_norm",
29
+ "logn_attention_clip1": false,
30
+ "logn_attention_scale": false,
31
+ "max_position_embeddings": 8192,
32
+ "model_type": "new",
33
+ "num_attention_heads": 12,
34
+ "num_hidden_layers": 12,
35
+ "pack_qkv": true,
36
+ "pad_token_id": 1,
37
+ "position_embedding_type": "rope",
38
+ "rope_scaling": {
39
+ "factor": 8.0,
40
+ "type": "ntk"
41
+ },
42
+ "rope_theta": 20000,
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
43
  "torch_dtype": "float32",
44
  "transformers_version": "4.53.3",
45
+ "type_vocab_size": 1,
46
+ "unpad_inputs": false,
47
+ "use_memory_efficient_attention": false,
48
+ "vocab_size": 250048
49
  }
config_sentence_transformers.json CHANGED
@@ -6,20 +6,8 @@
6
  "pytorch": "2.8.0+cu128"
7
  },
8
  "prompts": {
9
- "query": "task: search result | query: ",
10
- "document": "title: none | text: ",
11
- "BitextMining": "task: search result | query: ",
12
- "Clustering": "task: clustering | query: ",
13
- "Classification": "task: classification | query: ",
14
- "InstructionRetrieval": "task: code retrieval | query: ",
15
- "MultilabelClassification": "task: classification | query: ",
16
- "PairClassification": "task: sentence similarity | query: ",
17
- "Reranking": "task: search result | query: ",
18
- "Retrieval": "task: search result | query: ",
19
- "Retrieval-query": "task: search result | query: ",
20
- "Retrieval-document": "title: none | text: ",
21
- "STS": "task: sentence similarity | query: ",
22
- "Summarization": "task: summarization | query: "
23
  },
24
  "default_prompt_name": null,
25
  "similarity_fn_name": "cosine"
 
6
  "pytorch": "2.8.0+cu128"
7
  },
8
  "prompts": {
9
+ "query": "",
10
+ "document": ""
 
 
 
 
 
 
 
 
 
 
 
 
11
  },
12
  "default_prompt_name": null,
13
  "similarity_fn_name": "cosine"
configuration.py ADDED
@@ -0,0 +1,145 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # coding=utf-8
2
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """ NEW model configuration"""
17
+ from transformers.configuration_utils import PretrainedConfig
18
+ from transformers.utils import logging
19
+
20
+ logger = logging.get_logger(__name__)
21
+
22
+
23
+ class NewConfig(PretrainedConfig):
24
+ r"""
25
+ This is the configuration class to store the configuration of a [`NewModel`] or a [`TFNewModel`]. It is used to
26
+ instantiate a NEW model according to the specified arguments, defining the model architecture. Instantiating a
27
+ configuration with the defaults will yield a similar configuration to that of the NEW
28
+ [izhx/new-base-en](https://huggingface.co/izhx/new-base-en) architecture.
29
+
30
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
31
+ documentation from [`PretrainedConfig`] for more information.
32
+
33
+
34
+ Args:
35
+ vocab_size (`int`, *optional*, defaults to 30522):
36
+ Vocabulary size of the NEW model. Defines the number of different tokens that can be represented by the
37
+ `inputs_ids` passed when calling [`NewModel`] or [`TFNewModel`].
38
+ hidden_size (`int`, *optional*, defaults to 768):
39
+ Dimensionality of the encoder layers and the pooler layer.
40
+ num_hidden_layers (`int`, *optional*, defaults to 12):
41
+ Number of hidden layers in the Transformer encoder.
42
+ num_attention_heads (`int`, *optional*, defaults to 12):
43
+ Number of attention heads for each attention layer in the Transformer encoder.
44
+ intermediate_size (`int`, *optional*, defaults to 3072):
45
+ Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
46
+ hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
47
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
48
+ `"relu"`, `"silu"` and `"gelu_new"` are supported.
49
+ hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
50
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
51
+ attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
52
+ The dropout ratio for the attention probabilities.
53
+ max_position_embeddings (`int`, *optional*, defaults to 512):
54
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
55
+ just in case (e.g., 512 or 1024 or 2048).
56
+ type_vocab_size (`int`, *optional*, defaults to 2):
57
+ The vocabulary size of the `token_type_ids` passed when calling [`NewModel`] or [`TFNewModel`].
58
+ initializer_range (`float`, *optional*, defaults to 0.02):
59
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
60
+ layer_norm_eps (`float`, *optional*, defaults to 1e-12):
61
+ The epsilon used by the layer normalization layers.
62
+ position_embedding_type (`str`, *optional*, defaults to `"rope"`):
63
+ Type of position embedding. Choose one of `"absolute"`, `"rope"`.
64
+ rope_theta (`float`, *optional*, defaults to 10000.0):
65
+ The base period of the RoPE embeddings.
66
+ rope_scaling (`Dict`, *optional*):
67
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
68
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
69
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
70
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
71
+ these scaling strategies behave:
72
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
73
+ experimental feature, subject to breaking API changes in future versions.
74
+ classifier_dropout (`float`, *optional*):
75
+ The dropout ratio for the classification head.
76
+
77
+ Examples:
78
+
79
+ ```python
80
+ >>> from transformers import NewConfig, NewModel
81
+
82
+ >>> # Initializing a NEW izhx/new-base-en style configuration
83
+ >>> configuration = NewConfig()
84
+
85
+ >>> # Initializing a model (with random weights) from the izhx/new-base-en style configuration
86
+ >>> model = NewModel(configuration)
87
+
88
+ >>> # Accessing the model configuration
89
+ >>> configuration = model.config
90
+ ```"""
91
+
92
+ model_type = "new"
93
+
94
+ def __init__(
95
+ self,
96
+ vocab_size=30528,
97
+ hidden_size=768,
98
+ num_hidden_layers=12,
99
+ num_attention_heads=12,
100
+ intermediate_size=3072,
101
+ hidden_act="gelu",
102
+ hidden_dropout_prob=0.1,
103
+ attention_probs_dropout_prob=0.0,
104
+ max_position_embeddings=2048,
105
+ type_vocab_size=1,
106
+ initializer_range=0.02,
107
+ layer_norm_type='layer_norm',
108
+ layer_norm_eps=1e-12,
109
+ # pad_token_id=0,
110
+ position_embedding_type="rope",
111
+ rope_theta=10000.0,
112
+ rope_scaling=None,
113
+ classifier_dropout=None,
114
+ pack_qkv=True,
115
+ unpad_inputs=False,
116
+ use_memory_efficient_attention=False,
117
+ logn_attention_scale=False,
118
+ logn_attention_clip1=False,
119
+ **kwargs,
120
+ ):
121
+ super().__init__(**kwargs)
122
+
123
+ self.vocab_size = vocab_size
124
+ self.hidden_size = hidden_size
125
+ self.num_hidden_layers = num_hidden_layers
126
+ self.num_attention_heads = num_attention_heads
127
+ self.hidden_act = hidden_act
128
+ self.intermediate_size = intermediate_size
129
+ self.hidden_dropout_prob = hidden_dropout_prob
130
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
131
+ self.max_position_embeddings = max_position_embeddings
132
+ self.type_vocab_size = type_vocab_size
133
+ self.initializer_range = initializer_range
134
+ self.layer_norm_type = layer_norm_type
135
+ self.layer_norm_eps = layer_norm_eps
136
+ self.position_embedding_type = position_embedding_type
137
+ self.rope_theta = rope_theta
138
+ self.rope_scaling = rope_scaling
139
+ self.classifier_dropout = classifier_dropout
140
+
141
+ self.pack_qkv = pack_qkv
142
+ self.unpad_inputs = unpad_inputs
143
+ self.use_memory_efficient_attention = use_memory_efficient_attention
144
+ self.logn_attention_scale = logn_attention_scale
145
+ self.logn_attention_clip1 = logn_attention_clip1
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:154c8df9198511f81bb3de5566d9dc164c54553f709650fbe24846f313a672c0
3
- size 1211486072
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2d66f6643071d9573074c7a4691839ad92b784214f89d319e67c4f7ab524fc3e
3
+ size 1221487872
modeling.py ADDED
@@ -0,0 +1,1418 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # coding=utf-8
2
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """PyTorch NEW model."""
17
+
18
+ import math
19
+ from dataclasses import dataclass
20
+ from typing import List, Optional, Tuple, Union
21
+
22
+ import torch
23
+ import torch.utils.checkpoint
24
+ from torch import nn
25
+
26
+ from transformers.activations import ACT2FN
27
+ from transformers.modeling_outputs import (
28
+ BaseModelOutput,
29
+ BaseModelOutputWithPooling,
30
+ MaskedLMOutput,
31
+ MultipleChoiceModelOutput,
32
+ QuestionAnsweringModelOutput,
33
+ SequenceClassifierOutput,
34
+ ModelOutput,
35
+ )
36
+ from transformers.modeling_utils import PreTrainedModel
37
+ from transformers.utils import logging
38
+
39
+ try:
40
+ import xformers.ops as xops
41
+ except ImportError as e:
42
+ xops = None
43
+
44
+ from .configuration import NewConfig
45
+
46
+
47
+ logger = logging.get_logger(__name__)
48
+
49
+
50
+ # Adapted from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/bert_padding.py
51
+ # Which was adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
52
+ class IndexFirstAxis(torch.autograd.Function):
53
+ @staticmethod
54
+ def forward(ctx, input, indices):
55
+ ctx.save_for_backward(indices)
56
+ assert input.ndim >= 2
57
+ ctx.first_axis_dim, other_shape = input.shape[0], input.shape[1:]
58
+ second_dim = other_shape.numel()
59
+ # TD [2022-03-04] For some reason torch.gather is a bit faster than indexing.
60
+ # return input[indices]
61
+ # return torch.gather(
62
+ # rearrange(input, "b ... -> b (...)"), 0, repeat(indices, "z -> z d", d=second_dim)
63
+ # ).reshape(-1, *other_shape)
64
+ return torch.gather(
65
+ input.view(ctx.first_axis_dim, second_dim),
66
+ 0,
67
+ indices.unsqueeze(-1).expand(indices.size(0), second_dim)
68
+ ).reshape(-1, *other_shape)
69
+
70
+ @staticmethod
71
+ def backward(ctx, grad_output):
72
+ (indices,) = ctx.saved_tensors
73
+ assert grad_output.ndim >= 2
74
+ other_shape = grad_output.shape[1:]
75
+ # grad_output = rearrange(grad_output, "b ... -> b (...)")
76
+ grad_output = grad_output.view(grad_output.size(0), other_shape.numel())
77
+ grad_input = torch.zeros(
78
+ [ctx.first_axis_dim, grad_output.shape[1]],
79
+ device=grad_output.device,
80
+ dtype=grad_output.dtype,
81
+ )
82
+ # TD [2022-03-04] For some reason torch.scatter is a bit faster than indexing.
83
+ # grad_input[indices] = grad_output
84
+ # grad_input.scatter_(0, repeat(indices, "z -> z d", d=grad_output.shape[1]), grad_output)
85
+ grad_input.scatter_(
86
+ 0, indices.unsqueeze(-1).expand(indices.size(0), grad_output.size(1)), grad_output
87
+ )
88
+ return grad_input.reshape(ctx.first_axis_dim, *other_shape), None
89
+
90
+
91
+ index_first_axis = IndexFirstAxis.apply
92
+
93
+
94
+ def unpad_input(hidden_states, attention_mask=None, indices=None):
95
+ """
96
+ Arguments:
97
+ hidden_states: (batch, seqlen, ...)
98
+ attention_mask: (batch, seqlen), bool / int, 1 means valid and 0 means not valid.
99
+ indices: (total_nnz), the indices of non-masked tokens from the flattened input sequence.
100
+ Return:
101
+ hidden_states: (total_nnz, ...), where total_nnz = number of tokens in selected in attention_mask.
102
+ """
103
+ if indices is None:
104
+ assert attention_mask is not None
105
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
106
+
107
+ # TD [2022-03-04] We don't want to index with a bool mask, because Pytorch will expand the
108
+ # bool mask, then call nonzero to get the indices, then index with those. The indices is @dim
109
+ # times larger than it needs to be, wasting memory. It's faster and more memory-efficient to
110
+ # index with integer indices. Moreover, torch's index is a bit slower than it needs to be,
111
+ # so we write custom forward and backward to make it a bit faster.
112
+ hidden_states = hidden_states.view(-1, *hidden_states.shape[2:])
113
+ return index_first_axis(hidden_states, indices)
114
+
115
+
116
+ class IndexPutFirstAxis(torch.autograd.Function):
117
+ @staticmethod
118
+ def forward(
119
+ ctx,
120
+ values: torch.Tensor,
121
+ indices: torch.Tensor,
122
+ first_axis_dim
123
+ ) -> torch.Tensor:
124
+ ctx.save_for_backward(indices)
125
+ assert indices.ndim == 1
126
+ assert values.ndim >= 2
127
+ output = torch.zeros(
128
+ first_axis_dim, *values.shape[1:], device=values.device, dtype=values.dtype
129
+ )
130
+ output[indices] = values
131
+ return output
132
+
133
+ @staticmethod
134
+ def backward(ctx, grad_output: torch.Tensor) -> Tuple[torch.Tensor, None, None]:
135
+ indices, = ctx.saved_tensors
136
+ grad_values = grad_output[indices]
137
+ return grad_values, None, None
138
+
139
+
140
+ index_put_first_axis = IndexPutFirstAxis.apply
141
+
142
+
143
+ def pad_input(inputs: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int) -> torch.Tensor:
144
+ """Add padding to sequences.
145
+
146
+ Arguments:
147
+ inputs: (total_nnz, ...), where total_nnz = number of tokens in selected in attention_mask.
148
+ indices: (total_nnz), `indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()`
149
+ batch: int batch_size
150
+ seqlen: int max sequence length
151
+
152
+ Returns:
153
+ inputs: (batch, seqlen, ...)
154
+ """
155
+ output = index_put_first_axis(inputs, indices, batch * seqlen)
156
+ return output.view(batch, seqlen, *inputs.shape[1:])
157
+
158
+
159
+ def rotate_half(x):
160
+ """Rotates half the hidden dims of the input."""
161
+ x1 = x[..., : x.shape[-1] // 2]
162
+ x2 = x[..., x.shape[-1] // 2 :]
163
+ return torch.cat((-x2, x1), dim=-1)
164
+
165
+
166
+ def apply_rotary_pos_emb(q, k, cos, sin):
167
+ """Applies Rotary Position Embedding to the query and key tensors.
168
+
169
+ Args:
170
+ q (`torch.Tensor`): The query tensor.
171
+ k (`torch.Tensor`): The key tensor.
172
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
173
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
174
+ Returns:
175
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
176
+ """
177
+ cos, sin = cos.to(q.dtype), sin.to(q.dtype)
178
+ q_embed = (q * cos) + (rotate_half(q) * sin)
179
+ k_embed = (k * cos) + (rotate_half(k) * sin)
180
+ return q_embed, k_embed
181
+
182
+
183
+ class RotaryEmbedding(torch.nn.Module):
184
+ def __init__(self, dim, max_position_embeddings=512, base=10000.0, device=None):
185
+ super().__init__()
186
+
187
+ self.dim = dim
188
+ self.max_position_embeddings = max_position_embeddings
189
+ self.base = base
190
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
191
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
192
+
193
+ # Build here to make `torch.jit.trace` work.
194
+ self._set_cos_sin_cache(
195
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
196
+ )
197
+
198
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
199
+ self.max_seq_len_cached = seq_len
200
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
201
+
202
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
203
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
204
+ emb = torch.cat((freqs, freqs), dim=-1)
205
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
206
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
207
+
208
+ def forward(self, x, seq_len=None):
209
+ # x: [bs, num_attention_heads, seq_len, head_size]
210
+ if seq_len > self.max_seq_len_cached:
211
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
212
+
213
+ return (
214
+ self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
215
+ self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
216
+ )
217
+
218
+
219
+ class NTKScalingRotaryEmbedding(RotaryEmbedding):
220
+ """RotaryEmbedding extended with fixed and mixed NTK scaling. https://kexue.fm/archives/9706 """
221
+
222
+ def __init__(self, dim, max_position_embeddings=512, base=10000, device=None, scaling_factor=1.0, mixed_b=None):
223
+ self.scaling_factor = scaling_factor
224
+ self.mixed_b = mixed_b
225
+ super().__init__(dim, max_position_embeddings, base, device)
226
+ max_position_embeddings = max_position_embeddings * self.scaling_factor
227
+ self._set_cos_sin_cache(max_position_embeddings, self.inv_freq.device, torch.get_default_dtype())
228
+
229
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
230
+ self.max_seq_len_cached = seq_len
231
+
232
+ if seq_len > self.max_position_embeddings:
233
+ base = self.base * (self.scaling_factor if self.mixed_b is None else 1)
234
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
235
+
236
+ if self.mixed_b is None:
237
+ inv_freq = inv_freq / self.scaling_factor ** (2 / self.dim) # (6)
238
+ else:
239
+ a = torch.tensor(self.scaling_factor).log() / (self.dim / 2) ** self.mixed_b # (13)
240
+ lambda_1_m = (a * torch.arange(1, self.dim // 2 + 1).float().to(device) ** self.mixed_b).exp() # (12)
241
+ inv_freq = inv_freq / lambda_1_m # (10)
242
+
243
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
244
+
245
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
246
+
247
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
248
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
249
+ emb = torch.cat((freqs, freqs), dim=-1)
250
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
251
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
252
+
253
+
254
+ class RMSNorm(nn.Module):
255
+ def __init__(self, hidden_size, eps=1e-6):
256
+ """
257
+ RMSNorm is equivalent to T5LayerNorm
258
+ """
259
+ super().__init__()
260
+ self.weight = nn.Parameter(torch.ones(hidden_size))
261
+ self.variance_epsilon = eps
262
+
263
+ def forward(self, hidden_states):
264
+ input_dtype = hidden_states.dtype
265
+ hidden_states = hidden_states.to(torch.float32)
266
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
267
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
268
+ return self.weight * hidden_states.to(input_dtype)
269
+
270
+
271
+ LAYER_NORM = {
272
+ 'layer_norm': nn.LayerNorm,
273
+ 'rms_norm': RMSNorm
274
+ }
275
+
276
+
277
+ class NewEmbeddings(nn.Module):
278
+ """
279
+ Embedding and Unpadding.
280
+ """
281
+
282
+ def __init__(self, config: NewConfig):
283
+ super().__init__()
284
+ self.padding_idx = config.pad_token_id
285
+ self.word_embeddings = nn.Embedding(
286
+ config.vocab_size, config.hidden_size, padding_idx=self.padding_idx
287
+ )
288
+
289
+ self.position_embedding_type = config.position_embedding_type
290
+ if self.position_embedding_type == 'absolute':
291
+ self.position_embeddings = nn.Embedding(
292
+ config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
293
+ )
294
+ elif self.position_embedding_type == 'rope':
295
+ self._init_rope(config)
296
+ else:
297
+ raise ValueError
298
+
299
+ self.type_vocab_size = config.type_vocab_size
300
+ if self.type_vocab_size > 0:
301
+ self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
302
+
303
+ # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
304
+ # any TensorFlow checkpoint file
305
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
306
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
307
+ # position_ids is contiguous in memory and excluded when serialized
308
+ self.register_buffer(
309
+ "position_ids", torch.arange(config.max_position_embeddings), persistent=False
310
+ )
311
+
312
+ def _init_rope(self, config):
313
+ kwargs = dict(
314
+ dim=int(config.hidden_size / config.num_attention_heads),
315
+ max_position_embeddings=config.max_position_embeddings,
316
+ base=config.rope_theta
317
+ )
318
+ if config.rope_scaling is None:
319
+ self.rotary_emb = RotaryEmbedding(**kwargs)
320
+ else:
321
+ kwargs.update(scaling_factor=config.rope_scaling["factor"])
322
+ scaling_type = config.rope_scaling["type"]
323
+ if scaling_type == 'ntk':
324
+ kwargs.update(mixed_b=config.rope_scaling.get('mixed_b', None))
325
+ self.rotary_emb = NTKScalingRotaryEmbedding(**kwargs)
326
+ # elif scaling_type == "linear":
327
+ # self.rotary_emb = LinearScalingRotaryEmbedding(**kwargs)
328
+ # elif scaling_type == "dynamic":
329
+ # self.rotary_emb = DynamicNTKScalingRotaryEmbedding(**kwargs)
330
+ else:
331
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
332
+
333
+ def forward(
334
+ self,
335
+ unpad_inputs: bool,
336
+ input_ids: Optional[torch.Tensor] = None,
337
+ attention_mask: Optional[torch.Tensor] = None,
338
+ length: Optional[List[int]] = None,
339
+ token_type_ids: Optional[torch.Tensor] = None,
340
+ position_ids: Optional[torch.Tensor] = None,
341
+ inputs_embeds: Optional[torch.Tensor] = None,
342
+ ) -> Tuple[torch.Tensor, torch.Tensor, Optional[Tuple], Optional[List[int]]]:
343
+ """
344
+ """
345
+ if inputs_embeds is None:
346
+ device, input_shape = input_ids.device, input_ids.shape
347
+ else:
348
+ device, input_shape = inputs_embeds.device, inputs_embeds.shape[:2]
349
+ batch_size, seq_length = input_shape
350
+
351
+ # Set attention_mask if it's None
352
+ if attention_mask is None:
353
+ attention_mask = torch.ones(input_shape, device=device)
354
+ if length is not None:
355
+ for i, l in enumerate(length):
356
+ attention_mask[i, l:] = 0
357
+
358
+ # Set attention_mask_bool for unpadding
359
+ if unpad_inputs:
360
+ attention_mask_bool = attention_mask.bool()
361
+ if length is None:
362
+ length = attention_mask.sum(-1).tolist()
363
+
364
+ # Get word embeddings
365
+ if inputs_embeds is None:
366
+ if unpad_inputs:
367
+ input_ids = input_ids[attention_mask_bool].unsqueeze(0)
368
+ inputs_embeds = self.word_embeddings(input_ids)
369
+ else:
370
+ if unpad_inputs:
371
+ inputs_embeds = inputs_embeds[attention_mask_bool].unsqueeze(0)
372
+ embeddings = inputs_embeds
373
+
374
+ # Set and unpad position_ids
375
+ if position_ids is None:
376
+ if seq_length > self.position_ids.size(0):
377
+ self.register_buffer(
378
+ "position_ids", torch.arange(seq_length, device=embeddings.device), persistent=False
379
+ )
380
+ if unpad_inputs:
381
+ # [1, cumsum_seq_len]
382
+ position_ids = torch.cat([self.position_ids[:l] for l in length]).unsqueeze(0)
383
+ else:
384
+ # [bs, seq_len]
385
+ position_ids = self.position_ids[:seq_length].expand(batch_size, -1)
386
+ elif unpad_inputs:
387
+ position_ids = position_ids[attention_mask_bool].unsqueeze(0) # [1, cumsum_seq_len]
388
+
389
+ # Compute rotary embedding
390
+ if self.position_embedding_type == 'rope':
391
+ rope_cos, rope_sin = self.rotary_emb(inputs_embeds, seq_len=seq_length)
392
+ rope_cos = rope_cos[position_ids].unsqueeze(2) # [bs, seq_len, 1, dim]
393
+ rope_sin = rope_sin[position_ids].unsqueeze(2) # [bs, seq_len, 1, dim]
394
+ rope_embeds = rope_cos, rope_sin
395
+ else:
396
+ rope_embeds = None
397
+
398
+ if self.type_vocab_size > 0:
399
+ if token_type_ids is None:
400
+ token_type_ids = position_ids.mul(0)
401
+ else:
402
+ if self.type_vocab_size < 2:
403
+ token_type_ids.mul_(0)
404
+ if unpad_inputs:
405
+ token_type_ids = token_type_ids[attention_mask_bool].unsqueeze(0)
406
+
407
+ token_type_embeddings = self.token_type_embeddings(token_type_ids)
408
+ embeddings = embeddings + token_type_embeddings
409
+
410
+ # BERT position
411
+ if self.position_embedding_type == "absolute":
412
+ position_embeddings = self.position_embeddings(position_ids)
413
+ embeddings = embeddings + position_embeddings
414
+
415
+ embeddings = self.LayerNorm(embeddings)
416
+ embeddings = self.dropout(embeddings)
417
+
418
+ return embeddings, attention_mask, rope_embeds, length
419
+
420
+
421
+ class NewAttention(nn.Module):
422
+ def __init__(self, config: NewConfig, pack_qkv=None, use_memory_efficient_attention=None):
423
+ super().__init__()
424
+ self.config = config
425
+ if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
426
+ raise ValueError(
427
+ f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
428
+ f"heads ({config.num_attention_heads})"
429
+ )
430
+
431
+ self.hidden_size = config.hidden_size
432
+ self.num_attention_heads = config.num_attention_heads
433
+ self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
434
+ self.all_head_size = self.num_attention_heads * self.attention_head_size
435
+
436
+ if pack_qkv is None:
437
+ pack_qkv = config.pack_qkv
438
+ self.pack_qkv = pack_qkv
439
+
440
+ if self.pack_qkv:
441
+ self.qkv_proj = nn.Linear(config.hidden_size, self.all_head_size * 3, bias=True)
442
+ else:
443
+ self.q_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
444
+ self.k_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
445
+ self.v_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
446
+
447
+ self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
448
+ self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)
449
+
450
+ if use_memory_efficient_attention is None:
451
+ use_memory_efficient_attention = self.config.use_memory_efficient_attention
452
+ self.use_memory_efficient_attention = use_memory_efficient_attention
453
+ self.memory_efficient_attention = None if xops is None else xops.memory_efficient_attention
454
+ if self.use_memory_efficient_attention:
455
+ assert self.memory_efficient_attention is not None, 'please install xformers'
456
+
457
+ def forward(
458
+ self,
459
+ hidden_states: torch.Tensor,
460
+ attention_bias: torch.FloatTensor,
461
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
462
+ padding_inputs: Optional[Tuple] = None, # indices, batch, seqlen
463
+ attention_scale: Optional[torch.FloatTensor] = None,
464
+ head_mask: Optional[torch.FloatTensor] = None,
465
+ output_attentions: Optional[bool] = False,
466
+ qkv_inputs: Optional[Tuple] = None, # For RetroMAE
467
+ ) -> Tuple[torch.Tensor, ...]:
468
+ shape_hd = (self.num_attention_heads, self.attention_head_size)
469
+ # qkv
470
+ if self.pack_qkv and qkv_inputs is None:
471
+ qkv_pack = self.qkv_proj(hidden_states).split(self.all_head_size, dim=-1)
472
+ else:
473
+ if qkv_inputs is None:
474
+ qkv_inputs = (hidden_states, hidden_states, hidden_states)
475
+ qkv_pack = [
476
+ getattr(self, n + '_proj')(s) for s, n in zip(qkv_inputs, 'qkv')
477
+ ]
478
+ query_states, key_states, value_states = [t.view(t.shape[:-1] + shape_hd) for t in qkv_pack]
479
+
480
+ if self.config.position_embedding_type == 'rope':
481
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, *rope_embeds)
482
+
483
+ dtype = query_states.dtype
484
+
485
+ if self.config.logn_attention_scale and attention_scale is not None:
486
+ # https://kexue.fm/archives/8823
487
+ query_states = query_states * attention_scale.to(dtype)
488
+
489
+ if padding_inputs is not None:
490
+ query_states = pad_input(query_states.squeeze(), *padding_inputs)
491
+ key_states = pad_input(key_states.squeeze(), *padding_inputs)
492
+ value_states = pad_input(value_states.squeeze(), *padding_inputs)
493
+
494
+ if self.use_memory_efficient_attention:
495
+ assert self.memory_efficient_attention is not None, "xformers is not loaded"
496
+ assert output_attentions is False, "memory_efficient_attention do not output attentions"
497
+ assert head_mask is None, "Not support yet"
498
+ attention_probs = None
499
+ if torch.is_tensor(attention_bias):
500
+ attention_bias = attention_bias.to(dtype)
501
+ context_layer = self.memory_efficient_attention(
502
+ query_states,
503
+ key_states,
504
+ value_states,
505
+ attn_bias=attention_bias,
506
+ p=self.dropout.p
507
+ )
508
+ else:
509
+ if output_attentions and isinstance(self, NewSdpaAttention):
510
+ raise RuntimeError("SDPA do not output attentions")
511
+ context_layer, attention_probs = self._attention(
512
+ query_states, key_states, value_states, attention_bias, head_mask
513
+ )
514
+
515
+ if padding_inputs is not None:
516
+ context_layer = unpad_input(context_layer, indices=padding_inputs[0])
517
+
518
+ new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
519
+ context_layer = context_layer.view(new_context_layer_shape)
520
+
521
+ # output proj
522
+ attn_output = self.o_proj(context_layer)
523
+
524
+ # add attentions if we output them
525
+ outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)
526
+ return outputs
527
+
528
+ def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
529
+ """
530
+ Args:
531
+ q/k/v: (B, L, n_head, head_dim),
532
+ Returns:
533
+ attn_output: (B L, n_head, head_dim)
534
+ """
535
+ query_states = query_states.transpose(1, 2)
536
+ key_states = key_states.transpose(1, 2)
537
+ value_states = value_states.transpose(1, 2)
538
+ # Take the dot product between "query" and "key" to get the raw attention scores.
539
+ attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2))
540
+
541
+ attention_scores = attention_scores / math.sqrt(self.attention_head_size)
542
+ if attention_bias is not None:
543
+ # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
544
+ attention_scores = attention_scores + attention_bias
545
+
546
+ # Normalize the attention scores to probabilities.
547
+ attention_probs = nn.functional.softmax(attention_scores, dim=-1)
548
+
549
+ # This is actually dropping out entire tokens to attend to, which might
550
+ # seem a bit unusual, but is taken from the original Transformer paper.
551
+ if self.dropout.p > 0:
552
+ attention_probs = self.dropout(attention_probs)
553
+
554
+ # Mask heads if we want to
555
+ if head_mask is not None:
556
+ attention_probs = attention_probs * head_mask
557
+
558
+ context_layer = torch.matmul(attention_probs, value_states)
559
+
560
+ context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
561
+ return context_layer, attention_probs
562
+
563
+
564
+ class NewSdpaAttention(NewAttention):
565
+ """
566
+ New attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
567
+ `NewAttention` as the weights of the module stays untouched. The only changes are on the forward pass to adapt to
568
+ SDPA API.
569
+ """
570
+ def __init__(self, config: NewConfig, **kwargs):
571
+ super().__init__(config, **kwargs)
572
+ # torch.backends.cuda.enable_mem_efficient_sdp(False)
573
+ # logger.warning(
574
+ # "Disable memory efficient attention kernel for `NewSdpaAttention`, you can set "
575
+ # "`use_memory_efficient_attention=True` if it expected to use."
576
+ # )
577
+
578
+ def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
579
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
580
+ query_states.transpose(1, 2),
581
+ key_states.transpose(1, 2),
582
+ value_states.transpose(1, 2),
583
+ attn_mask=attention_bias,
584
+ dropout_p=self.dropout.p if self.training else 0.0,
585
+ )
586
+ attn_output = attn_output.permute(0, 2, 1, 3).contiguous()
587
+ return attn_output, None
588
+
589
+
590
+ NEW_ATTENTION_CLASSES = {
591
+ "eager": NewAttention,
592
+ # "flash_attention_2": , # TODO
593
+ "sdpa": NewSdpaAttention,
594
+ }
595
+
596
+
597
+ class NewGatedMLP(nn.Module):
598
+ """
599
+ GLU Variants Improve Transformer.
600
+ """
601
+
602
+ def __init__(self, config: NewConfig):
603
+ super().__init__()
604
+ self.intermediate_size = config.intermediate_size
605
+ self.up_gate_proj = nn.Linear(config.hidden_size, self.intermediate_size * 2, bias=False)
606
+ self.down_proj = nn.Linear(self.intermediate_size, config.hidden_size, bias=True)
607
+ self.act_fn = ACT2FN[config.hidden_act]
608
+ if config.hidden_dropout_prob > 0:
609
+ self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
610
+ else:
611
+ self.hidden_dropout = None
612
+
613
+ def forward(self, hidden_states):
614
+ up_gate = self.up_gate_proj(hidden_states)
615
+ up_states, gate = torch.split(up_gate, self.intermediate_size, dim=-1)
616
+ gate = self.act_fn(gate)
617
+ gated_states = gate * up_states
618
+ if self.hidden_dropout is not None:
619
+ gated_states = self.hidden_dropout(gated_states)
620
+ down_states = self.down_proj(gated_states)
621
+ return down_states
622
+
623
+
624
+ class NewLayer(nn.Module):
625
+ def __init__(
626
+ self,
627
+ config: NewConfig,
628
+ pack_qkv=None,
629
+ use_memory_efficient_attention=None,
630
+ attn_implementation=None
631
+ ):
632
+ super().__init__()
633
+ if attn_implementation is None:
634
+ attn_implementation = config._attn_implementation
635
+ if use_memory_efficient_attention is None:
636
+ use_memory_efficient_attention = config.use_memory_efficient_attention
637
+ if use_memory_efficient_attention:
638
+ if attn_implementation != 'eager':
639
+ logger.warning_once(f"Override {attn_implementation=} to 'eager' as {use_memory_efficient_attention=}")
640
+ attn_implementation = 'eager' # Since it will be SDPA by default for torch>=2.1.1
641
+ self.attention = NEW_ATTENTION_CLASSES[attn_implementation](
642
+ config, pack_qkv=pack_qkv, use_memory_efficient_attention=use_memory_efficient_attention
643
+ )
644
+ self.mlp = NewGatedMLP(config)
645
+
646
+ ln_class = LAYER_NORM[config.layer_norm_type]
647
+ self.attn_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
648
+ self.mlp_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
649
+
650
+ if config.hidden_dropout_prob > 0:
651
+ self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
652
+ else:
653
+ self.hidden_dropout = None
654
+
655
+ def forward(
656
+ self,
657
+ hidden_states: torch.Tensor,
658
+ attention_bias: torch.FloatTensor,
659
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
660
+ padding_inputs: Optional[Tuple] = None, # indices, batch, seqlen
661
+ attention_scale: Optional[torch.FloatTensor] = None,
662
+ subset_indices: Optional[torch.LongTensor] = None,
663
+ head_mask: Optional[torch.FloatTensor] = None,
664
+ output_attentions: Optional[bool] = False,
665
+ qkv_inputs: Optional[Tuple] = None, # For RetroMAE
666
+ ) -> Tuple[torch.Tensor, ...]:
667
+ # Multi head self attention
668
+ residual = hidden_states if qkv_inputs is None else qkv_inputs[0]
669
+ attention_outputs = self.attention(
670
+ hidden_states,
671
+ attention_bias,
672
+ rope_embeds,
673
+ padding_inputs,
674
+ attention_scale,
675
+ head_mask,
676
+ output_attentions=output_attentions,
677
+ qkv_inputs=qkv_inputs,
678
+ )
679
+ hidden_states = attention_outputs[0]
680
+ if self.hidden_dropout is not None:
681
+ hidden_states = self.hidden_dropout(hidden_states)
682
+ hidden_states = residual + hidden_states
683
+
684
+ # In pretraining, after the attention of last layer, we only need the masked tokens.
685
+ if subset_indices is not None:
686
+ hidden_states = hidden_states[subset_indices]
687
+
688
+ hidden_states = self.attn_ln(hidden_states)
689
+
690
+ # Fully Connected
691
+ residual = hidden_states
692
+ hidden_states = self.mlp(hidden_states)
693
+ if self.hidden_dropout is not None:
694
+ hidden_states = self.hidden_dropout(hidden_states)
695
+ hidden_states = residual + hidden_states
696
+ hidden_states = self.mlp_ln(hidden_states)
697
+
698
+ # add self attentions if we output attention weights
699
+ outputs = (hidden_states,) + attention_outputs[1:]
700
+ return outputs
701
+
702
+
703
+ class NewEncoder(nn.Module):
704
+ def __init__(self, config):
705
+ super().__init__()
706
+ self.config = config
707
+ self.layer = nn.ModuleList([NewLayer(config) for _ in range(config.num_hidden_layers)])
708
+ self.gradient_checkpointing = False
709
+
710
+ def forward(
711
+ self,
712
+ hidden_states: torch.Tensor,
713
+ attention_bias: Optional[torch.FloatTensor] = None,
714
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
715
+ padding_inputs: Optional[Tuple] = None, # indices, batch, seqlen
716
+ attention_scale: Optional[torch.FloatTensor] = None,
717
+ subset_indices: Optional[torch.LongTensor] = None,
718
+ head_mask: Optional[torch.FloatTensor] = None,
719
+ output_attentions: Optional[bool] = False,
720
+ output_hidden_states: Optional[bool] = False,
721
+ return_dict: Optional[bool] = True,
722
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutput]:
723
+ all_hidden_states = () if output_hidden_states else None
724
+ all_self_attentions = () if output_attentions else None
725
+
726
+ for i, layer_module in enumerate(self.layer):
727
+ if output_hidden_states:
728
+ all_hidden_states = all_hidden_states + (hidden_states,)
729
+
730
+ if i >= len(self.layer) - 1:
731
+ layer_subset_indices = subset_indices
732
+ else:
733
+ layer_subset_indices = None
734
+
735
+ layer_head_mask = head_mask[i] if head_mask is not None else None
736
+
737
+ if self.gradient_checkpointing and self.training:
738
+ layer_outputs = self._gradient_checkpointing_func(
739
+ layer_module.__call__,
740
+ hidden_states,
741
+ attention_bias,
742
+ rope_embeds,
743
+ padding_inputs,
744
+ attention_scale,
745
+ layer_subset_indices,
746
+ layer_head_mask,
747
+ )
748
+ else:
749
+ layer_outputs = layer_module(
750
+ hidden_states,
751
+ attention_bias,
752
+ rope_embeds,
753
+ padding_inputs,
754
+ attention_scale,
755
+ layer_subset_indices,
756
+ layer_head_mask,
757
+ output_attentions,
758
+ )
759
+
760
+ hidden_states = layer_outputs[0]
761
+ if output_attentions:
762
+ all_self_attentions = all_self_attentions + (layer_outputs[1],)
763
+
764
+ if output_hidden_states:
765
+ all_hidden_states = all_hidden_states + (hidden_states,)
766
+
767
+ if not return_dict:
768
+ return tuple(
769
+ v
770
+ for v in [
771
+ hidden_states,
772
+ all_hidden_states,
773
+ all_self_attentions,
774
+ ]
775
+ if v is not None
776
+ )
777
+ return BaseModelOutput(
778
+ last_hidden_state=hidden_states,
779
+ hidden_states=all_hidden_states,
780
+ attentions=all_self_attentions,
781
+ )
782
+
783
+
784
+ # Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->New
785
+ class NewPooler(nn.Module):
786
+ def __init__(self, config):
787
+ super().__init__()
788
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
789
+ self.activation = nn.Tanh()
790
+
791
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
792
+ # We "pool" the model by simply taking the hidden state corresponding
793
+ # to the first token.
794
+ first_token_tensor = hidden_states[:, 0]
795
+ pooled_output = self.dense(first_token_tensor)
796
+ pooled_output = self.activation(pooled_output)
797
+ return pooled_output
798
+
799
+
800
+ class NewPreTrainedModel(PreTrainedModel):
801
+ """
802
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
803
+ models.
804
+ """
805
+
806
+ config_class = NewConfig
807
+ base_model_prefix = "new"
808
+ supports_gradient_checkpointing = True
809
+ _supports_sdpa = True
810
+
811
+ def _init_weights(self, module):
812
+ """Initialize the weights"""
813
+ if isinstance(module, nn.Linear):
814
+ # Slightly different from the TF version which uses truncated_normal for initialization
815
+ # cf https://github.com/pytorch/pytorch/pull/5617
816
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
817
+ if module.bias is not None:
818
+ module.bias.data.zero_()
819
+ elif isinstance(module, nn.Embedding):
820
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
821
+ if module.padding_idx is not None:
822
+ module.weight.data[module.padding_idx].zero_()
823
+ elif isinstance(module, nn.LayerNorm):
824
+ module.bias.data.zero_()
825
+ module.weight.data.fill_(1.0)
826
+
827
+
828
+ class NewModel(NewPreTrainedModel):
829
+ """
830
+ The bare New Model transformer outputting raw hidden-states without any specific head on top.
831
+ """
832
+
833
+ def __init__(self, config: NewConfig, add_pooling_layer=False):
834
+ super().__init__(config)
835
+ self.config = config
836
+
837
+ self.embeddings = NewEmbeddings(config)
838
+ self.encoder = NewEncoder(config)
839
+
840
+ self.pooler = NewPooler(config) if add_pooling_layer else None
841
+
842
+ # Initialize weights and apply final processing
843
+ self.post_init()
844
+
845
+ def get_input_embeddings(self):
846
+ return self.embeddings.word_embeddings
847
+
848
+ def set_input_embeddings(self, value):
849
+ self.embeddings.word_embeddings = value
850
+
851
+ def forward(
852
+ self,
853
+ input_ids: Optional[torch.Tensor] = None,
854
+ attention_mask: Optional[torch.Tensor] = None,
855
+ length: Optional[List[int]] = None,
856
+ subset_indices: Optional[torch.LongTensor] = None,
857
+ token_type_ids: Optional[torch.Tensor] = None,
858
+ position_ids: Optional[torch.Tensor] = None,
859
+ head_mask: Optional[torch.Tensor] = None,
860
+ inputs_embeds: Optional[torch.Tensor] = None,
861
+ output_attentions: Optional[bool] = None,
862
+ output_hidden_states: Optional[bool] = None,
863
+ return_dict: Optional[bool] = None,
864
+ unpad_inputs: Optional[bool] = None,
865
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
866
+ r"""
867
+ length (`list` of length `batch_size`, *optional*):
868
+ If is `None`, return padded `last_hidden_state`.
869
+ subset_indices ():
870
+ pass
871
+ unpad_inputs (`bool`, *optional*):
872
+ pass
873
+ """
874
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
875
+ output_hidden_states = (
876
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
877
+ )
878
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
879
+ unpad_inputs = unpad_inputs if unpad_inputs is not None else self.config.unpad_inputs
880
+ output_padded = length is None
881
+
882
+ if input_ids is not None and inputs_embeds is not None:
883
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
884
+ elif input_ids is not None:
885
+ self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
886
+ input_shape = input_ids.size()
887
+ elif inputs_embeds is not None:
888
+ input_shape = inputs_embeds.size()[:-1]
889
+ else:
890
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
891
+
892
+ # TODO: not used
893
+ # # Prepare head mask if needed
894
+ # # 1.0 in head_mask indicate we keep the head
895
+ # # attention_probs has shape bsz x n_heads x N x N
896
+ # # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
897
+ # # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
898
+ # head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
899
+
900
+ # Get embeddings, may unpad them
901
+ (embedding_output, attention_mask, rope_embeds, length) = self.embeddings(
902
+ unpad_inputs,
903
+ input_ids=input_ids,
904
+ attention_mask=attention_mask,
905
+ length=length,
906
+ token_type_ids=token_type_ids,
907
+ position_ids=position_ids,
908
+ inputs_embeds=inputs_embeds
909
+ )
910
+
911
+ batch_size, seq_length = input_shape
912
+ if unpad_inputs and self.config.use_memory_efficient_attention:
913
+ attention_bias = xops.fmha.attn_bias.BlockDiagonalMask.from_seqlens(length)
914
+ else:
915
+ # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
916
+ # ourselves in which case we just need to make it broadcastable to all heads.
917
+ attention_bias = self.get_extended_attention_mask(attention_mask, input_shape)
918
+ if self.config.use_memory_efficient_attention:
919
+ # Invalid shape for attention bias: torch.Size([48, 1, 1, 512]) (expected (48, 12, 512, 512))
920
+ attention_bias = attention_bias.expand(-1, self.config.num_attention_heads, seq_length, -1)
921
+
922
+ padding_inputs = None
923
+ if unpad_inputs and (output_padded or not self.config.use_memory_efficient_attention):
924
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
925
+ if not self.config.use_memory_efficient_attention:
926
+ padding_inputs = (indices, *input_shape)
927
+
928
+ attention_scale = None
929
+ if self.config.logn_attention_scale:
930
+ logger.warning_once("TODO: logn_attention_scale")
931
+ # # attention scale log_512(input_len)
932
+ # attention_scale = attention_mask.sum(1).log() / torch.tensor(self.config.max_position_embeddings).log()
933
+ # # inference-time logn scale need clip 1
934
+ # if self.config.logn_attention_clip1:
935
+ # attention_scale.clip_(1)
936
+ # attention_scale = attention_scale[:, None, None, None]
937
+ # else:
938
+ # attention_scale = None
939
+
940
+ encoder_outputs = self.encoder(
941
+ embedding_output,
942
+ attention_bias=attention_bias,
943
+ rope_embeds=rope_embeds,
944
+ padding_inputs=padding_inputs,
945
+ attention_scale=attention_scale,
946
+ subset_indices=subset_indices,
947
+ head_mask=head_mask,
948
+ output_attentions=output_attentions,
949
+ output_hidden_states=output_hidden_states,
950
+ return_dict=return_dict,
951
+ )
952
+ sequence_output = encoder_outputs[0]
953
+ if unpad_inputs and output_padded:
954
+ sequence_output = pad_input(
955
+ sequence_output.squeeze(), indices, batch_size, seq_length
956
+ )
957
+
958
+ pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
959
+
960
+ if not return_dict:
961
+ return (sequence_output, pooled_output) + encoder_outputs[1:]
962
+
963
+ return BaseModelOutputWithPooling(
964
+ last_hidden_state=sequence_output,
965
+ pooler_output=pooled_output,
966
+ hidden_states=encoder_outputs.hidden_states,
967
+ attentions=encoder_outputs.attentions,
968
+ )
969
+
970
+
971
+ class NewLMPredictionHead(nn.Module):
972
+ def __init__(self, config):
973
+ super().__init__()
974
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
975
+ self.transform_act_fn = ACT2FN[config.hidden_act]
976
+ self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
977
+
978
+ # The output weights are the same as the input embeddings, but there is
979
+ # an output-only bias for each token.
980
+ self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
981
+
982
+ def forward(self, hidden_states):
983
+ hidden_states = self.dense(hidden_states)
984
+ hidden_states = self.transform_act_fn(hidden_states)
985
+ hidden_states = self.norm(hidden_states)
986
+ hidden_states = self.decoder(hidden_states)
987
+ return hidden_states
988
+
989
+
990
+ class NewForMaskedLM(NewPreTrainedModel):
991
+ _tied_weights_keys = ["lm_head.decoder.bias", "lm_head.decoder.weight"]
992
+
993
+ def __init__(self, config: NewConfig):
994
+ super().__init__(config)
995
+ self.new = NewModel(config, add_pooling_layer=False)
996
+ self.lm_head = NewLMPredictionHead(config)
997
+ self.loss_fct = nn.CrossEntropyLoss()
998
+
999
+ # Initialize weights and apply final processing
1000
+ self.post_init()
1001
+
1002
+ def get_output_embeddings(self):
1003
+ return self.lm_head.decoder
1004
+
1005
+ def set_output_embeddings(self, new_embeddings):
1006
+ self.lm_head.decoder = new_embeddings
1007
+
1008
+ def forward(
1009
+ self,
1010
+ input_ids: Optional[torch.Tensor] = None,
1011
+ attention_mask: Optional[torch.Tensor] = None,
1012
+ token_type_ids: Optional[torch.Tensor] = None,
1013
+ position_ids: Optional[torch.Tensor] = None,
1014
+ head_mask: Optional[torch.Tensor] = None,
1015
+ inputs_embeds: Optional[torch.Tensor] = None,
1016
+ labels: Optional[torch.Tensor] = None,
1017
+ output_attentions: Optional[bool] = None,
1018
+ output_hidden_states: Optional[bool] = None,
1019
+ return_dict: Optional[bool] = None,
1020
+ unpad_inputs: Optional[bool] = None,
1021
+ ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
1022
+ r"""
1023
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1024
+ Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
1025
+ config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
1026
+ loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
1027
+ """
1028
+
1029
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1030
+
1031
+ if labels is None or not self.new.config.unpad_inputs:
1032
+ length = None
1033
+ subset_indices = None
1034
+ else:
1035
+ length = attention_mask.sum(-1).tolist()
1036
+ labels = labels[attention_mask.bool()].unsqueeze(0)
1037
+ subset_indices = labels > -100
1038
+
1039
+ outputs = self.new(
1040
+ input_ids,
1041
+ attention_mask=attention_mask,
1042
+ length=length,
1043
+ subset_indices=subset_indices,
1044
+ token_type_ids=token_type_ids,
1045
+ position_ids=position_ids,
1046
+ head_mask=head_mask,
1047
+ inputs_embeds=inputs_embeds,
1048
+ output_attentions=output_attentions,
1049
+ output_hidden_states=output_hidden_states,
1050
+ return_dict=return_dict,
1051
+ unpad_inputs=unpad_inputs,
1052
+ )
1053
+
1054
+ sequence_output = outputs[0]
1055
+ prediction_scores = self.lm_head(sequence_output)
1056
+
1057
+ masked_lm_loss = None
1058
+ if labels is not None:
1059
+ if subset_indices is None:
1060
+ mask = attention_mask.bool()
1061
+ prediction_scores = prediction_scores[mask]
1062
+ labels = labels[mask]
1063
+ else:
1064
+ labels = labels[subset_indices]
1065
+ masked_lm_loss = self.loss_fct(prediction_scores, labels)
1066
+
1067
+ if not return_dict:
1068
+ output = (prediction_scores,) + outputs[2:]
1069
+ return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
1070
+
1071
+ return MaskedLMOutput(
1072
+ loss=masked_lm_loss,
1073
+ logits=prediction_scores,
1074
+ hidden_states=outputs.hidden_states,
1075
+ attentions=outputs.attentions,
1076
+ )
1077
+
1078
+
1079
+ class NewForSequenceClassification(NewPreTrainedModel):
1080
+ def __init__(self, config):
1081
+ super().__init__(config)
1082
+ self.num_labels = config.num_labels
1083
+ self.config = config
1084
+
1085
+ self.new = NewModel(config, add_pooling_layer=True)
1086
+ classifier_dropout = (
1087
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1088
+ )
1089
+ self.dropout = nn.Dropout(classifier_dropout)
1090
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1091
+
1092
+ # Initialize weights and apply final processing
1093
+ self.post_init()
1094
+
1095
+ def forward(
1096
+ self,
1097
+ input_ids: Optional[torch.Tensor] = None,
1098
+ attention_mask: Optional[torch.Tensor] = None,
1099
+ token_type_ids: Optional[torch.Tensor] = None,
1100
+ position_ids: Optional[torch.Tensor] = None,
1101
+ head_mask: Optional[torch.Tensor] = None,
1102
+ inputs_embeds: Optional[torch.Tensor] = None,
1103
+ labels: Optional[torch.Tensor] = None,
1104
+ output_attentions: Optional[bool] = None,
1105
+ output_hidden_states: Optional[bool] = None,
1106
+ return_dict: Optional[bool] = None,
1107
+ unpad_inputs: Optional[bool] = None,
1108
+ ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
1109
+ r"""
1110
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1111
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1112
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1113
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1114
+ """
1115
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1116
+
1117
+ outputs = self.new(
1118
+ input_ids,
1119
+ attention_mask=attention_mask,
1120
+ token_type_ids=token_type_ids,
1121
+ position_ids=position_ids,
1122
+ head_mask=head_mask,
1123
+ inputs_embeds=inputs_embeds,
1124
+ output_attentions=output_attentions,
1125
+ output_hidden_states=output_hidden_states,
1126
+ return_dict=return_dict,
1127
+ unpad_inputs=unpad_inputs,
1128
+ )
1129
+
1130
+ pooled_output = outputs[1]
1131
+
1132
+ pooled_output = self.dropout(pooled_output)
1133
+ logits = self.classifier(pooled_output)
1134
+
1135
+ loss = None
1136
+ if labels is not None:
1137
+ if self.config.problem_type is None:
1138
+ if self.num_labels == 1:
1139
+ self.config.problem_type = "regression"
1140
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1141
+ self.config.problem_type = "single_label_classification"
1142
+ else:
1143
+ self.config.problem_type = "multi_label_classification"
1144
+
1145
+ if self.config.problem_type == "regression":
1146
+ loss_fct = nn.MSELoss()
1147
+ if self.num_labels == 1:
1148
+ loss = loss_fct(logits.squeeze(), labels.squeeze())
1149
+ else:
1150
+ loss = loss_fct(logits, labels)
1151
+ elif self.config.problem_type == "single_label_classification":
1152
+ loss_fct = nn.CrossEntropyLoss()
1153
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1154
+ elif self.config.problem_type == "multi_label_classification":
1155
+ loss_fct = nn.BCEWithLogitsLoss()
1156
+ loss = loss_fct(logits, labels)
1157
+
1158
+ if not return_dict:
1159
+ output = (logits,) + outputs[2:]
1160
+ return ((loss,) + output) if loss is not None else output
1161
+
1162
+ return SequenceClassifierOutput(
1163
+ loss=loss,
1164
+ logits=logits,
1165
+ hidden_states=outputs.hidden_states,
1166
+ attentions=outputs.attentions,
1167
+ )
1168
+
1169
+
1170
+ class NewForMultipleChoice(NewPreTrainedModel):
1171
+ def __init__(self, config):
1172
+ super().__init__(config)
1173
+
1174
+ self.new = NewModel(config, add_pooling_layer=True)
1175
+ classifier_dropout = (
1176
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1177
+ )
1178
+ self.dropout = nn.Dropout(classifier_dropout)
1179
+ self.classifier = nn.Linear(config.hidden_size, 1)
1180
+
1181
+ # Initialize weights and apply final processing
1182
+ self.post_init()
1183
+
1184
+ def forward(
1185
+ self,
1186
+ input_ids: Optional[torch.Tensor] = None,
1187
+ attention_mask: Optional[torch.Tensor] = None,
1188
+ token_type_ids: Optional[torch.Tensor] = None,
1189
+ position_ids: Optional[torch.Tensor] = None,
1190
+ head_mask: Optional[torch.Tensor] = None,
1191
+ inputs_embeds: Optional[torch.Tensor] = None,
1192
+ labels: Optional[torch.Tensor] = None,
1193
+ output_attentions: Optional[bool] = None,
1194
+ output_hidden_states: Optional[bool] = None,
1195
+ return_dict: Optional[bool] = None,
1196
+ unpad_inputs: Optional[bool] = None,
1197
+ ) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
1198
+ r"""
1199
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1200
+ Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
1201
+ num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
1202
+ `input_ids` above)
1203
+ """
1204
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1205
+ num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
1206
+
1207
+ input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
1208
+ attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
1209
+ token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
1210
+ position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
1211
+ inputs_embeds = (
1212
+ inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
1213
+ if inputs_embeds is not None
1214
+ else None
1215
+ )
1216
+
1217
+ outputs = self.new(
1218
+ input_ids,
1219
+ attention_mask=attention_mask,
1220
+ token_type_ids=token_type_ids,
1221
+ position_ids=position_ids,
1222
+ head_mask=head_mask,
1223
+ inputs_embeds=inputs_embeds,
1224
+ output_attentions=output_attentions,
1225
+ output_hidden_states=output_hidden_states,
1226
+ return_dict=return_dict,
1227
+ unpad_inputs=unpad_inputs,
1228
+ )
1229
+
1230
+ pooled_output = outputs[1]
1231
+
1232
+ pooled_output = self.dropout(pooled_output)
1233
+ logits = self.classifier(pooled_output)
1234
+ reshaped_logits = logits.view(-1, num_choices)
1235
+
1236
+ loss = None
1237
+ if labels is not None:
1238
+ loss_fct = nn.CrossEntropyLoss()
1239
+ loss = loss_fct(reshaped_logits, labels)
1240
+
1241
+ if not return_dict:
1242
+ output = (reshaped_logits,) + outputs[2:]
1243
+ return ((loss,) + output) if loss is not None else output
1244
+
1245
+ return MultipleChoiceModelOutput(
1246
+ loss=loss,
1247
+ logits=reshaped_logits,
1248
+ hidden_states=outputs.hidden_states,
1249
+ attentions=outputs.attentions,
1250
+ )
1251
+
1252
+
1253
+ @dataclass
1254
+ class NewTokenClassifierOutput(ModelOutput):
1255
+ loss: Optional[torch.FloatTensor] = None
1256
+ logits: torch.FloatTensor = None
1257
+ last_hidden_state: torch.FloatTensor = None
1258
+ hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
1259
+ attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
1260
+
1261
+
1262
+ class NewForTokenClassification(NewPreTrainedModel):
1263
+ def __init__(self, config):
1264
+ super().__init__(config)
1265
+ self.num_labels = config.num_labels
1266
+
1267
+ self.new = NewModel(config, add_pooling_layer=False)
1268
+ classifier_dropout = (
1269
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1270
+ )
1271
+ self.dropout = nn.Dropout(classifier_dropout)
1272
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1273
+
1274
+ # Initialize weights and apply final processing
1275
+ self.post_init()
1276
+
1277
+ def forward(
1278
+ self,
1279
+ input_ids: Optional[torch.Tensor] = None,
1280
+ attention_mask: Optional[torch.Tensor] = None,
1281
+ token_type_ids: Optional[torch.Tensor] = None,
1282
+ position_ids: Optional[torch.Tensor] = None,
1283
+ head_mask: Optional[torch.Tensor] = None,
1284
+ inputs_embeds: Optional[torch.Tensor] = None,
1285
+ labels: Optional[torch.Tensor] = None,
1286
+ output_attentions: Optional[bool] = None,
1287
+ output_hidden_states: Optional[bool] = None,
1288
+ return_dict: Optional[bool] = None,
1289
+ unpad_inputs: Optional[bool] = None,
1290
+ ) -> Union[Tuple[torch.Tensor], NewTokenClassifierOutput]:
1291
+ r"""
1292
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1293
+ Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
1294
+ """
1295
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1296
+
1297
+ outputs = self.new(
1298
+ input_ids,
1299
+ attention_mask=attention_mask,
1300
+ token_type_ids=token_type_ids,
1301
+ position_ids=position_ids,
1302
+ head_mask=head_mask,
1303
+ inputs_embeds=inputs_embeds,
1304
+ output_attentions=output_attentions,
1305
+ output_hidden_states=output_hidden_states,
1306
+ return_dict=return_dict,
1307
+ unpad_inputs=unpad_inputs,
1308
+ )
1309
+
1310
+ sequence_output = outputs[0]
1311
+
1312
+ sequence_output = self.dropout(sequence_output)
1313
+ logits = self.classifier(sequence_output)
1314
+
1315
+ loss = None
1316
+ if labels is not None:
1317
+ loss_fct = nn.CrossEntropyLoss()
1318
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1319
+
1320
+ if not return_dict:
1321
+ output = (logits,) + outputs[2:]
1322
+ return ((loss,) + output) if loss is not None else output
1323
+
1324
+ return NewTokenClassifierOutput(
1325
+ loss=loss,
1326
+ logits=logits,
1327
+ last_hidden_state=sequence_output,
1328
+ hidden_states=outputs.hidden_states,
1329
+ attentions=outputs.attentions,
1330
+ )
1331
+
1332
+
1333
+ class NewForQuestionAnswering(NewPreTrainedModel):
1334
+ def __init__(self, config):
1335
+ super().__init__(config)
1336
+ self.num_labels = config.num_labels
1337
+
1338
+ self.new = NewModel(config, add_pooling_layer=False)
1339
+ self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
1340
+
1341
+ # Initialize weights and apply final processing
1342
+ self.post_init()
1343
+
1344
+ def forward(
1345
+ self,
1346
+ input_ids: Optional[torch.Tensor] = None,
1347
+ attention_mask: Optional[torch.Tensor] = None,
1348
+ token_type_ids: Optional[torch.Tensor] = None,
1349
+ position_ids: Optional[torch.Tensor] = None,
1350
+ head_mask: Optional[torch.Tensor] = None,
1351
+ inputs_embeds: Optional[torch.Tensor] = None,
1352
+ start_positions: Optional[torch.Tensor] = None,
1353
+ end_positions: Optional[torch.Tensor] = None,
1354
+ output_attentions: Optional[bool] = None,
1355
+ output_hidden_states: Optional[bool] = None,
1356
+ return_dict: Optional[bool] = None,
1357
+ unpad_inputs: Optional[bool] = None,
1358
+ ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
1359
+ r"""
1360
+ start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1361
+ Labels for position (index) of the start of the labelled span for computing the token classification loss.
1362
+ Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1363
+ are not taken into account for computing the loss.
1364
+ end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1365
+ Labels for position (index) of the end of the labelled span for computing the token classification loss.
1366
+ Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1367
+ are not taken into account for computing the loss.
1368
+ """
1369
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1370
+
1371
+ outputs = self.new(
1372
+ input_ids,
1373
+ attention_mask=attention_mask,
1374
+ token_type_ids=token_type_ids,
1375
+ position_ids=position_ids,
1376
+ head_mask=head_mask,
1377
+ inputs_embeds=inputs_embeds,
1378
+ output_attentions=output_attentions,
1379
+ output_hidden_states=output_hidden_states,
1380
+ return_dict=return_dict,
1381
+ unpad_inputs=unpad_inputs,
1382
+ )
1383
+
1384
+ sequence_output = outputs[0]
1385
+
1386
+ logits = self.qa_outputs(sequence_output)
1387
+ start_logits, end_logits = logits.split(1, dim=-1)
1388
+ start_logits = start_logits.squeeze(-1).contiguous()
1389
+ end_logits = end_logits.squeeze(-1).contiguous()
1390
+
1391
+ total_loss = None
1392
+ if start_positions is not None and end_positions is not None:
1393
+ # If we are on multi-GPU, split add a dimension
1394
+ if len(start_positions.size()) > 1:
1395
+ start_positions = start_positions.squeeze(-1)
1396
+ if len(end_positions.size()) > 1:
1397
+ end_positions = end_positions.squeeze(-1)
1398
+ # sometimes the start/end positions are outside our model inputs, we ignore these terms
1399
+ ignored_index = start_logits.size(1)
1400
+ start_positions = start_positions.clamp(0, ignored_index)
1401
+ end_positions = end_positions.clamp(0, ignored_index)
1402
+
1403
+ loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
1404
+ start_loss = loss_fct(start_logits, start_positions)
1405
+ end_loss = loss_fct(end_logits, end_positions)
1406
+ total_loss = (start_loss + end_loss) / 2
1407
+
1408
+ if not return_dict:
1409
+ output = (start_logits, end_logits) + outputs[2:]
1410
+ return ((total_loss,) + output) if total_loss is not None else output
1411
+
1412
+ return QuestionAnsweringModelOutput(
1413
+ loss=total_loss,
1414
+ start_logits=start_logits,
1415
+ end_logits=end_logits,
1416
+ hidden_states=outputs.hidden_states,
1417
+ attentions=outputs.attentions,
1418
+ )
modules.json CHANGED
@@ -14,19 +14,7 @@
14
  {
15
  "idx": 2,
16
  "name": "2",
17
- "path": "2_Dense",
18
- "type": "sentence_transformers.models.Dense"
19
- },
20
- {
21
- "idx": 3,
22
- "name": "3",
23
- "path": "3_Dense",
24
- "type": "sentence_transformers.models.Dense"
25
- },
26
- {
27
- "idx": 4,
28
- "name": "4",
29
- "path": "4_Normalize",
30
  "type": "sentence_transformers.models.Normalize"
31
  }
32
  ]
 
14
  {
15
  "idx": 2,
16
  "name": "2",
17
+ "path": "2_Normalize",
 
 
 
 
 
 
 
 
 
 
 
 
18
  "type": "sentence_transformers.models.Normalize"
19
  }
20
  ]
sentence_bert_config.json CHANGED
@@ -1,4 +1,4 @@
1
  {
2
- "max_seq_length": 2048,
3
  "do_lower_case": false
4
  }
 
1
  {
2
+ "max_seq_length": 8192,
3
  "do_lower_case": false
4
  }
special_tokens_map.json CHANGED
@@ -1,21 +1,32 @@
1
  {
2
- "boi_token": "<start_of_image>",
3
  "bos_token": {
4
- "content": "<bos>",
 
 
 
 
 
 
 
5
  "lstrip": false,
6
  "normalized": false,
7
  "rstrip": false,
8
  "single_word": false
9
  },
10
- "eoi_token": "<end_of_image>",
11
  "eos_token": {
12
- "content": "<eos>",
13
  "lstrip": false,
14
  "normalized": false,
15
  "rstrip": false,
16
  "single_word": false
17
  },
18
- "image_token": "<image_soft_token>",
 
 
 
 
 
 
19
  "pad_token": {
20
  "content": "<pad>",
21
  "lstrip": false,
@@ -23,6 +34,13 @@
23
  "rstrip": false,
24
  "single_word": false
25
  },
 
 
 
 
 
 
 
26
  "unk_token": {
27
  "content": "<unk>",
28
  "lstrip": false,
 
1
  {
 
2
  "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<s>",
11
  "lstrip": false,
12
  "normalized": false,
13
  "rstrip": false,
14
  "single_word": false
15
  },
 
16
  "eos_token": {
17
+ "content": "</s>",
18
  "lstrip": false,
19
  "normalized": false,
20
  "rstrip": false,
21
  "single_word": false
22
  },
23
+ "mask_token": {
24
+ "content": "<mask>",
25
+ "lstrip": true,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
  "pad_token": {
31
  "content": "<pad>",
32
  "lstrip": false,
 
34
  "rstrip": false,
35
  "single_word": false
36
  },
37
+ "sep_token": {
38
+ "content": "</s>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ },
44
  "unk_token": {
45
  "content": "<unk>",
46
  "lstrip": false,
tokenizer.json CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:216e2a79606fe879c9f17c529c71cd241338407fd5646b595ffd3c4b9ea1d503
3
- size 33385262
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:aa7a6ad87a7ce8fe196787355f6af7d03aee94d19c54a5eb1392ed18c8ef451a
3
+ size 17082988
tokenizer_config.json CHANGED
The diff for this file is too large to render. See raw diff