fine-tuned EnergyEmbed-nv1 1 epochs
Browse files- .gitattributes +1 -0
- 1_Pooling/config.json +10 -0
- README.md +814 -0
- config.json +49 -0
- config_sentence_transformers.json +14 -0
- configuration.py +145 -0
- model.safetensors +3 -0
- modeling.py +1418 -0
- modules.json +20 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +51 -0
- tokenizer.json +3 -0
- tokenizer_config.json +55 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
tokenizer.json filter=lfs diff=lfs merge=lfs -text
|
1_Pooling/config.json
ADDED
|
@@ -0,0 +1,10 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"word_embedding_dimension": 768,
|
| 3 |
+
"pooling_mode_cls_token": true,
|
| 4 |
+
"pooling_mode_mean_tokens": false,
|
| 5 |
+
"pooling_mode_max_tokens": false,
|
| 6 |
+
"pooling_mode_mean_sqrt_len_tokens": false,
|
| 7 |
+
"pooling_mode_weightedmean_tokens": false,
|
| 8 |
+
"pooling_mode_lasttoken": false,
|
| 9 |
+
"include_prompt": true
|
| 10 |
+
}
|
README.md
ADDED
|
@@ -0,0 +1,814 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
tags:
|
| 3 |
+
- sentence-transformers
|
| 4 |
+
- sentence-similarity
|
| 5 |
+
- feature-extraction
|
| 6 |
+
- dense
|
| 7 |
+
- generated_from_trainer
|
| 8 |
+
- dataset_size:44838
|
| 9 |
+
- loss:MultipleNegativesRankingLoss
|
| 10 |
+
base_model: Alibaba-NLP/gte-multilingual-base
|
| 11 |
+
widget:
|
| 12 |
+
- source_sentence: How does the volume and flow rate of cement affect the cementing
|
| 13 |
+
process in oil and gas wells?
|
| 14 |
+
sentences:
|
| 15 |
+
- "Overview of International Offshore Decommissioning Regulations: Volume 1 – Facilities\
|
| 16 |
+
\ \nThe Petroleum Code does not make any specific requirements in relation to\
|
| 17 |
+
\ whether\noffshore facilities need to be removed following cessation of production.\
|
| 18 |
+
\ However, as a\nsignatory to UNCLOS III/IMO and the Abidjan Convention, the Republic\
|
| 19 |
+
\ of Guinea is bound\nby these international and regional agreements. \nThe Environment\
|
| 20 |
+
\ Code is enforced by the Ministry of Natural Resources, Energy and\nEnvironment.\
|
| 21 |
+
\ Its key aims are to protect the environment while promoting the use of\nnatural\
|
| 22 |
+
\ resources. Title 2/Chapter III of the Environment Code deals with maritime waters\n\
|
| 23 |
+
and their resources and Title 5 deals with EIA requirements for major projects."
|
| 24 |
+
- 'Well Cementing design is a critical component of Well engineering, as efficient
|
| 25 |
+
cement design ensures the protection of the casing assemblies from fluid corrosion,
|
| 26 |
+
and ensures the mechanical support of the well. It also ensures that hydraulic
|
| 27 |
+
communication between different zones is prevented.
|
| 28 |
+
|
| 29 |
+
Well abandonment is also critical as the design of the slurry required needs to
|
| 30 |
+
be designed to efficiently keep hydrocarbons in the wellbore and prevent any immediate,
|
| 31 |
+
short term or long term migration of hydrocarbons to surface.
|
| 32 |
+
|
| 33 |
+
There are numerous studies and publications discussing the causes of gas migration
|
| 34 |
+
after primary cement jobs and well abandonment, some of the causes of gas migration
|
| 35 |
+
have been linked to poor fluid loss control, poor drilling fluid displacement
|
| 36 |
+
(reduces seal efficiency at the interfaces), and long cement setting times which
|
| 37 |
+
allows time for gas to percolate through the partially set cement slurry.
|
| 38 |
+
|
| 39 |
+
This paper highlights the engineering methods, and how they can be used to properly
|
| 40 |
+
evaluate the cement slurry design to ensure that gas flow through the cement lattice
|
| 41 |
+
is completely prevented. It assumes that all other issues which involving poor
|
| 42 |
+
execution (mud displacement, poor slurry mixing, use of low quality materials
|
| 43 |
+
and chemicals, human errors), are annulled.
|
| 44 |
+
|
| 45 |
+
The correlations/equations discussed and used for the evaluation of the abandoned
|
| 46 |
+
case study well (Well XRT) are the Gas Flow Potential, Slurry Performance Number,
|
| 47 |
+
Hydrostatic Number and Pressure Decay Limit Parameter. Results from critical evaluation
|
| 48 |
+
with these equations confirmed that the Well XRT was efficiently abandoned.
|
| 49 |
+
|
| 50 |
+
The paper further recommends that these equations should be used by Well Engineers
|
| 51 |
+
be used to evaluate slurry designs for casing cementing and abandonment operations,
|
| 52 |
+
as they will help ensure that the mechanical and hydraulic isolation is efficiently
|
| 53 |
+
designed for and achieved.'
|
| 54 |
+
- 'This article discusses the big volume top job of oil and gas wells, specifically
|
| 55 |
+
wells A and B which were drilled in Kuwait. The process involves pumping a larger
|
| 56 |
+
volume of mixture of cement, water, and other additives into the annulus to seal
|
| 57 |
+
the wellbore, prevent fluid migration and provide structural support.
|
| 58 |
+
|
| 59 |
+
The article highlights the need for precision and control to ensure proper placement.
|
| 60 |
+
The conventional methods like two stage method and lightweight systems used for
|
| 61 |
+
the wells A and B were not sufficient to get the good zonal isolation throughout
|
| 62 |
+
the well bore due to the lower fracture gradient observed in this well. The successful
|
| 63 |
+
zonal isolation was achieved due to pumping large volumes from the annulus.
|
| 64 |
+
|
| 65 |
+
The wells were under losses before and during the primary cementing process, which
|
| 66 |
+
was difficult to achieve the desired top of cement (up to surface). To overcome
|
| 67 |
+
these challenges, the well was cemented in unique unconventional method which
|
| 68 |
+
is pumping the bigger volumes from the annulus to cover up to loss zone and eliminate
|
| 69 |
+
any other fluid column in between. Cement Bond Log (CBL) and Variable Density
|
| 70 |
+
Log (VDL) were taken after a 24 Hrs wait on cement and the results were good,
|
| 71 |
+
indicating that the wellbore is properly sealed, and the well is structurally
|
| 72 |
+
stable.
|
| 73 |
+
|
| 74 |
+
Pumping large volumes of cement through the annulus can be challenging, as it
|
| 75 |
+
requires a high level of precision and control to ensure that the cement is properly
|
| 76 |
+
placed. This process is different to that of conventional top jobs carried out
|
| 77 |
+
by installing cement baskets. The intention of conventional top job methods is
|
| 78 |
+
to just seal the annulus at surface without paying any attention to mud caps left
|
| 79 |
+
in the open hole. This has resulted in remedial jobs which has increased the cost
|
| 80 |
+
or reduced the life span of wells.
|
| 81 |
+
|
| 82 |
+
One of the key considerations when pumping cement through the annulus is the volumes
|
| 83 |
+
considered and thickening time. The rate of flow must be carefully controlled
|
| 84 |
+
to ensure that the cement is properly mixed along with the additives and that
|
| 85 |
+
it does not become too thick or too thin. In addition, the rate of flow must be
|
| 86 |
+
adjusted to account for the variations in pressure and temperature that occur
|
| 87 |
+
as the cement moves through the well.
|
| 88 |
+
|
| 89 |
+
Cementing also plays an important role in preventing fluid migration. If the well
|
| 90 |
+
is not properly sealed, there might be inter communication of the fluids which
|
| 91 |
+
affects the life of the well. The extremely lower frac gradient wells undergo
|
| 92 |
+
losses Inspite of using the conventional methods (light weight systems and two
|
| 93 |
+
stage method) and is the reason to follow the unconventional method of cementing
|
| 94 |
+
from the annulus so that entire well bore from shoe to the surface is properly
|
| 95 |
+
sealed with cement. This will result in reducing the unnecessary remedial jobs
|
| 96 |
+
during the life of the well.'
|
| 97 |
+
- source_sentence: How do the various water cut measurement techniques compare for
|
| 98 |
+
suitability in permanent downhole deployment?
|
| 99 |
+
sentences:
|
| 100 |
+
- Optimization of hole cleaning remains a vital challenge when planning and drilling
|
| 101 |
+
deviated, high angle and extended reach wells. Hole cleaning depends on a number
|
| 102 |
+
of factors and as to date most existing models have been deployed in solving hole
|
| 103 |
+
cleaning problems. However, the flow rate predicted by these models may not be
|
| 104 |
+
feasible to apply practically in field operations because it gives a pressure
|
| 105 |
+
exceeding allowable limits of the pop-up valves on the mud pump. This is the major
|
| 106 |
+
cause of downtime during drilling operations. This research is aimed at adding
|
| 107 |
+
value to the existing models in achieving better hole cleaning and reduced down
|
| 108 |
+
time. This was made possible through the use of cutting monitoring model which
|
| 109 |
+
is a real time and quantitative tool. A case study on a well being drilled in
|
| 110 |
+
the Niger Delta was conducted whose from which it was observed that within 5800ft
|
| 111 |
+
to 11500ft, the hole was not properly clean as less cuttings were recovered. This
|
| 112 |
+
information was used to initiate hole cleaning procedure. From the validation,
|
| 113 |
+
the results shows Non-Productive Time associated with hole cleaning has a significant
|
| 114 |
+
drop of 2-5 days when the cutting monitoring model is used in conjunction with
|
| 115 |
+
the existing models.
|
| 116 |
+
- Exhumation describes vertical displacements of rocks from maximum depth of burial
|
| 117 |
+
that results from the removal of overburden material. In this study we invert
|
| 118 |
+
seismic velocity profiles from 2D and 3D seismic reflection datasets to constrain
|
| 119 |
+
the distribution and the magnitude of exhumation within the Slyne Basin, offshore
|
| 120 |
+
NW Ireland. The method has already been successfully applied to 2D datasets offshore
|
| 121 |
+
Britain and Africa; this study is the first attempt to extract exhumation estimates
|
| 122 |
+
from 3D seismic data. Inversion of 3D seismic velocity data yields a continuous
|
| 123 |
+
map of exhumation across the entire 3D footprint. Exhumation estimates from 2D
|
| 124 |
+
seismic sections agree with estimates from co-located 3D data. However, there
|
| 125 |
+
is greater scatter in the 2D-derived exhumation estimates, most easily seen at
|
| 126 |
+
line ties. This scatter in the 2D measurements arises because 2D seismic stacking
|
| 127 |
+
velocities are less well constrained than 3D velocities. Together, the 2D and
|
| 128 |
+
3D seismic stacking velocity profiles can be used to estimate exhumation patterns
|
| 129 |
+
on spatial scales >10 km to an accuracy of ±200 m. Many estimated changes in exhumation
|
| 130 |
+
are associated with geological structures, suggesting confidence in the results.
|
| 131 |
+
The margins of Slyne Basin have undergone about 1 km more erosion than the basin
|
| 132 |
+
centre to form the Jurassic-Miocene composite unconformity. Inversion anticlines
|
| 133 |
+
in the centre of the basin have undergone a few hundred metres more erosion at
|
| 134 |
+
their crests than at their flanks. There is good agreement between 3D seismic-derived
|
| 135 |
+
exhumation estimates and existing exhumation estimates using traditional techniques
|
| 136 |
+
applied to borehole data. Overall, our results show that regional exhumation can
|
| 137 |
+
be mapped in hitherto unprecedented detail using good quality seismic stacking
|
| 138 |
+
velocity data.
|
| 139 |
+
- This paper addresses the need and challenges associated with the permanent downhole
|
| 140 |
+
water cut measurement in multiphase flow at an individual lateral level for efficient
|
| 141 |
+
and reliable water cut management in a multilateral horizontal well environment.
|
| 142 |
+
Furthermore, it reviews the available water cut measurement techniques and evaluates
|
| 143 |
+
their suitability for permanent downhole deployment in multilateral horizontal
|
| 144 |
+
wells. A comprehensive analysis of the state-of-the-art water cut measurement
|
| 145 |
+
techniques is presented for the first time in this paper to evaluate their suitability
|
| 146 |
+
for permanent downhole deployment. Downhole water cut measurement challenges are
|
| 147 |
+
described in detail and a table is presented comparing various techniques against
|
| 148 |
+
a set of requirements suitable for permanent downhole water cut measurement.
|
| 149 |
+
- source_sentence: What role does AI play in the integrated logistics process in the
|
| 150 |
+
offshore sector?
|
| 151 |
+
sentences:
|
| 152 |
+
- Sustainability has become a pivotal point in the maritime industry, encompassing
|
| 153 |
+
environmental, economic, and social dimensions. This study investigates the impact
|
| 154 |
+
of Industry 4.0 technologies on improving maritime logistics sustainability. An
|
| 155 |
+
extensive literature review will identify key technologies and sustainability
|
| 156 |
+
goals across these dimensions. Using advanced decision-making frameworks like
|
| 157 |
+
AI and ML-enabled decision intelligence or Neutrosophic-TOPSIS methods, the impact
|
| 158 |
+
of these technologies will be quantified and ranked. The results will yield a
|
| 159 |
+
prioritization of technologies and a strategic roadmap for their implementation,
|
| 160 |
+
aimed at optimizing resource allocation and enhancing sustainability. This research
|
| 161 |
+
provides an integrated approach to sustainability and technological adoption,
|
| 162 |
+
offering a novel, industry-specific roadmap.
|
| 163 |
+
- Detection of production and well events is crucial for planning of production
|
| 164 |
+
and operational strategies. Event detection is especially challenging in mature
|
| 165 |
+
fields in which various off-normal events might occur simultaneously. Manual detection
|
| 166 |
+
of these events by an engineer is a tedious task and prone to errors. On the other
|
| 167 |
+
hand, abundance of data in mature fields provides an opportunity to employ data-driven
|
| 168 |
+
methods for an accurate and robust production event detection. In this study a
|
| 169 |
+
data-driven workflow to automatically detect production events based on signatures
|
| 170 |
+
of events provided by experts is demonstrated. In the developed workflow, state-of-the-art
|
| 171 |
+
data-driven methods were integrated with the domain knowledge for an accurate
|
| 172 |
+
and robust detection. The methodology was applied on several case studies of mature
|
| 173 |
+
fields suffering from production issues, such as scaling and liquid loading. It
|
| 174 |
+
was found that the workflow is accurate, robust and computationally efficient
|
| 175 |
+
which could detect new events (verified by the expert). The demonstrated method
|
| 176 |
+
could be implemented both in the real-time or offline fashion. Such a workflow
|
| 177 |
+
is sufficiently generic which can be applied for detection of different events
|
| 178 |
+
and anomalies than tested and verified in this paper, such as leakage, production
|
| 179 |
+
losses, …
|
| 180 |
+
- 'This case study aims to showcase how integrated logistics in the offshore sector
|
| 181 |
+
streamline the supply chain process, reduce costs, and improves efficiency. The
|
| 182 |
+
scope of integrated logistics includes planning, transportation, warehousing,
|
| 183 |
+
inventory management, and information management, focusing on collaboration and
|
| 184 |
+
transparency between all stakeholders in the offshore supply chain.
|
| 185 |
+
|
| 186 |
+
The process of integrated logistics in the offshore sector begins with the cargo
|
| 187 |
+
booking. A detailed logistics plan and schedule are then developed, outlining
|
| 188 |
+
the supply chain network, transportation modes, and inventory management strategies.
|
| 189 |
+
The process is managed by an AI-based platform that automatically creates short
|
| 190 |
+
and long-term schedules using various cargo and telemetric data. During the execution
|
| 191 |
+
phase, real-time tracking and monitoring of the supply chain process are crucial
|
| 192 |
+
to managing disruptions. Continuous improvement is key to optimising the integrated
|
| 193 |
+
logistics process with a machine learning element to the logistics tool, resulting
|
| 194 |
+
in increased efficiency, reduced costs, and improved safety and reliability.
|
| 195 |
+
|
| 196 |
+
Implementing integrated logistics in the offshore sector has yielded several positive
|
| 197 |
+
results. Firstly, it has improved efficiency in the supply chain process, reducing
|
| 198 |
+
the time and cost required to move goods and equipment from the point of origin
|
| 199 |
+
to the point of consumption. Delivery time has been reduced by 23%, achieved by
|
| 200 |
+
using an AI planning system, real-time tracking, and optimised transportation
|
| 201 |
+
modes.
|
| 202 |
+
|
| 203 |
+
Secondly, integrated logistics has helped to maintain high levels of safety by
|
| 204 |
+
reducing the number of entries into the 500M zone by consolidating cargo and increasing
|
| 205 |
+
back deck utilisation. Standardised procedures for logistics operations have been
|
| 206 |
+
established, minimising the risk of errors and improving overall safety.
|
| 207 |
+
|
| 208 |
+
Thirdly, the implementation of integrated logistics has led to increased collaboration
|
| 209 |
+
and communication between stakeholders involved in offshore operations, resulting
|
| 210 |
+
in improved decision-making and reduced delays, as well as better transparency
|
| 211 |
+
between all elements of the supply chain.
|
| 212 |
+
|
| 213 |
+
Real-time tracking and monitoring of the supply chain process have been crucial
|
| 214 |
+
for effectively managing disruptions and addressing issues, which is made possible
|
| 215 |
+
by automating the process using AI, which is more efficient than manual processes.
|
| 216 |
+
|
| 217 |
+
The use of integrated logistics in the offshore sector has resulted in an overall
|
| 218 |
+
cost reduction of 23% on the shipment of goods and a reduction of CO2 emissions
|
| 219 |
+
by 32%, enabling effective management of the movement of goods and equipment while
|
| 220 |
+
promoting sustainability.
|
| 221 |
+
|
| 222 |
+
This approach to integrated offshore logistics will enable effective management
|
| 223 |
+
of the movement of goods and equipment from the point of origin to the point of
|
| 224 |
+
consumption and reduce costs for the oil and gas sector while ensuring compliance
|
| 225 |
+
with regulatory requirements.'
|
| 226 |
+
- source_sentence: How does the incorporation of polyamine and encapsulation polymer
|
| 227 |
+
in the HPWBM contribute to clay stabilization?
|
| 228 |
+
sentences:
|
| 229 |
+
- Clay bearing shale formations tend to swell upon contact with water-based drilling
|
| 230 |
+
fluid. The migration of hydrogen ions into the nano-spacing of shale platelets
|
| 231 |
+
is mainly responsible for its disintegration and swelling. To mitigate the clay
|
| 232 |
+
swelling problem, various shale stabilization materials are added in the water-based
|
| 233 |
+
muds (WBMs). Before adding these additives, it is crucial to understand their
|
| 234 |
+
physical and chemical interactions with clay minerals as well as within fluid.
|
| 235 |
+
In this study, Taro Root Mucilage (TRM) is used as a green chemical in WBM to
|
| 236 |
+
decrease the shale swelling characteristics. Taro root was boiled in distilled
|
| 237 |
+
water at 40°C for 24 h and mucilage was prepared, which was characterized by FTIR
|
| 238 |
+
and XRD pattern. It was then made part of a mud system, which then interacted
|
| 239 |
+
with the shale sample collected from the western zone of Pakistan. Moreover, this
|
| 240 |
+
mucilage was compared with sodium alginate mud system, a biopolymer commonly used
|
| 241 |
+
in industry. The results of the experimental studies showed that TRM appreciably
|
| 242 |
+
reduces clay swelling characteristics compared with the distilled water and sodium
|
| 243 |
+
alginate. Moreover, all the rheological parameters fall under the recommended
|
| 244 |
+
API range for TRM samples. Furthermore, it was found that the TRM produces a thin
|
| 245 |
+
filter cake and minimizes fluid loss volume. In addition, during the shale cutting
|
| 246 |
+
recovery test, 50%, 80% and 100% recoveries were obtained from base mud, whereas
|
| 247 |
+
10% and 20% were obtained from TRM based WBM respectively. TRM encapsulates the
|
| 248 |
+
drilled cutting and preserves it from breaking into smaller fragments. In addition,
|
| 249 |
+
TRM concentration in drilling mud increases the hydrophobicity of the shale sample.
|
| 250 |
+
The adsorption of TRM over the surface of shale allows less penetration of water
|
| 251 |
+
in the nano-spacing of shale structure and improves the shale stability. Hence,
|
| 252 |
+
the finding in this article implies that TRM can be used as a green and sustainable
|
| 253 |
+
substitute for traditional clay stabilizers in drilling operations to reduce formation
|
| 254 |
+
damage. It has all the desired properties that help it to become an alternate
|
| 255 |
+
solution in the form of a clay swelling inhibitor.
|
| 256 |
+
- 'Exploration drilling obviously requires a robust drilling fluid system to be
|
| 257 |
+
a key factor in overcoming both the known and unexpected challenges of a structure
|
| 258 |
+
that consists of reactive clay and lost circulation zones. Extra consideration
|
| 259 |
+
has to be given to regulatory environmental requirements and complications resulting
|
| 260 |
+
from regional politics. A High-Performance Water Based Mud (HPWBM) system was
|
| 261 |
+
selected to address the aforementioned issues.
|
| 262 |
+
|
| 263 |
+
The HPWBM was customized to respond to the subsurface conditions with the main
|
| 264 |
+
requirement to provide maximum shale inhibition through a non-dispersed environment.
|
| 265 |
+
Polyamine was utilized to stabilize all types of clay; an encapsulation polymer
|
| 266 |
+
and a non-ionic polymer were included to prevent dispersion and to seal micro-fractures.
|
| 267 |
+
A complete shale study was performed to determine the optimum concentration of
|
| 268 |
+
the base fluid and each shale inhibitor. Then hydraulic behaviour of the mud was
|
| 269 |
+
simulated with contractor proprietary software to understand the parameters for
|
| 270 |
+
optimal hole cleaning as well as Equivalent Circulating Density (ECD) simulation.
|
| 271 |
+
|
| 272 |
+
The HPWBM system successfully facilitated the execution of the exploration well
|
| 273 |
+
and provided highly effective clay stabilization. No Non-Productive Time (NPT)
|
| 274 |
+
was recorded as a result of reactive clay issues. The mud system also facilitated
|
| 275 |
+
a good rate of penetration (ROP), formation stability, and lubricity. Waste cuttings
|
| 276 |
+
transportation was not required. In addition, there is also no requirement for
|
| 277 |
+
costly base oil including its associated transportation costs. The successful
|
| 278 |
+
implementation of the HPWBM yielded an estimating saving of 25% compared to invert
|
| 279 |
+
emulsion fluids, prior to considering costs associated with an expensive Liquid
|
| 280 |
+
Mud Plant (LMP), environmental, and freight costs. Significant cost savings were
|
| 281 |
+
achieved by eliminating the need for LMP rental, mobilization and demobilization.
|
| 282 |
+
Another notable saving was realized from the reduced system maintenance of the
|
| 283 |
+
HPWBM as less dilution was required compared to a regular Water Based Mud.
|
| 284 |
+
|
| 285 |
+
Thinking outside of the box and embracing the departure from the default consideration
|
| 286 |
+
of an invert system with a thorough risk assessment augmented value to wellbore
|
| 287 |
+
construction. A smartly designed HPWBM system provided performance comparable
|
| 288 |
+
to an invert emulsion system but with superior benefits with respect to environmental
|
| 289 |
+
protection, simplified logistics and lower costs.'
|
| 290 |
+
- Business Process Outsourcing can be aptly described as the process of forging
|
| 291 |
+
a contractual relationship with external supplier for the provision of capacity
|
| 292 |
+
that has been previously undertaken within an organization. In the global oil
|
| 293 |
+
and gas industry, Business Process Outsourcing (BPO) has emerged in contemporary
|
| 294 |
+
times as a potent tool in their operational mix. This is particularly hinged on
|
| 295 |
+
the imperatives to find a delicate balance between rising global demand, diminishing
|
| 296 |
+
reserves in some of the world's major oil fields, while managing distribution
|
| 297 |
+
and operating costs. The collapse of crude oil prices from US$100.00 in May 2014
|
| 298 |
+
to about US$30.00 and even below in early 2016 has reinforced outsourcing. Empirical
|
| 299 |
+
studies reveal that outsourcing of non-core activities may result in 25% cost
|
| 300 |
+
saving associated with on-/near-site operations and as much as 50-75% for offshore
|
| 301 |
+
operations compared to the cost of engaging in same activities in-house. Apart
|
| 302 |
+
from cost-cutting, other benefits associated with BPO include a stronger focus
|
| 303 |
+
on core competencies; improved regulatory conformity and compliance; as well as
|
| 304 |
+
access to a larger talent pool and novel technologies. The oil and gas industry
|
| 305 |
+
has emerged as the cornerstone of Nigeria's economy, accounting for about 70%
|
| 306 |
+
of annual government revenue and more than 90% of the nation's foreign exchange
|
| 307 |
+
reserves. Since the 1990s, outsourcing has assumed an increasing dimension in
|
| 308 |
+
the nation's oil and gas industry. Empirical studies reveal, for example, that
|
| 309 |
+
up until the early 1990s, employees in the oil industry comprised about 70% and
|
| 310 |
+
30% of permanent and temporary employees, respectively. The temporary employees
|
| 311 |
+
were initially focused on non-core activities. However, in recent times core activities
|
| 312 |
+
are increasingly contracted to service providers, reversing the structure of employment
|
| 313 |
+
in the industry by 2010, with 40% of permanent employees, while 60% were temporary
|
| 314 |
+
employees. The increasing replacement of permanent employees with temporary ones
|
| 315 |
+
has fueled concern in the industry, led by labour unions, which have expressed
|
| 316 |
+
concern about the sub-standard welfare of contract workers. This development has
|
| 317 |
+
led the Federal government of Nigeria to issue guidelines on staff contracting
|
| 318 |
+
and outsourcing in the Nigerian oil and gas industry.
|
| 319 |
+
- source_sentence: How does the predictive reservoir effectiveness model aid in the
|
| 320 |
+
exploration of the Winduck Interval?
|
| 321 |
+
sentences:
|
| 322 |
+
- 'In recent years, the challenge of reducing accident costs, the results of inquiries
|
| 323 |
+
into large-scale disasters have highlighted the important role of a proactive approach
|
| 324 |
+
to safety management.
|
| 325 |
+
|
| 326 |
+
This has led to many organizations assigning high priority to improve an organization''s
|
| 327 |
+
safety culture. Safety Culture of any organization has an impact on organization
|
| 328 |
+
image, productivity and profitability.
|
| 329 |
+
|
| 330 |
+
This paper describes the importance of applying safety culture into the company
|
| 331 |
+
business and provide a practical knowledge required to put safety culture characteristics
|
| 332 |
+
in place. Many organizations have realized that this provides the perfect opportunity
|
| 333 |
+
for them to streamline their operational process and optimize the associated management
|
| 334 |
+
and control system.
|
| 335 |
+
|
| 336 |
+
It is also true to say that people do not really know what a "safety culture"
|
| 337 |
+
is.
|
| 338 |
+
|
| 339 |
+
Busy Managers asked ‘what does an identifiable safety culture look like?’
|
| 340 |
+
|
| 341 |
+
Definition saying that it is the product of people''s values and beliefs, their
|
| 342 |
+
behavior, and their commitment to Health and Safety programs.
|
| 343 |
+
|
| 344 |
+
Different levels of efforts are concerned with developing strategic plans, converting
|
| 345 |
+
these into action plans and implementing these so that the organization can fully
|
| 346 |
+
integrate safety into all of its systems. Then the most important indicator of
|
| 347 |
+
a positive safety culture is the extent to which employees are actively involved in safety
|
| 348 |
+
on a daily basis.
|
| 349 |
+
|
| 350 |
+
In many organizational endeavors, one of the most salient features that affects
|
| 351 |
+
people''s motivation is the total commitment of senior management and line management.
|
| 352 |
+
This feature in particular has been shown to account for much of the variation
|
| 353 |
+
in safety performance at many different levels in an organization. Since the development
|
| 354 |
+
of a proactive safety culture is an empowering process that aims to win people''s
|
| 355 |
+
hearts and minds, it is absolutely vital that senior management actively demonstrate
|
| 356 |
+
their commitment by providing the necessary leadership.'
|
| 357 |
+
- 'In this multi-Tcf subsea gas development off the North West coast of Australia,
|
| 358 |
+
reservoir simulation supports the key business decisions and processes. An important
|
| 359 |
+
factor when providing production forecasts is ensuring that a range of possible
|
| 360 |
+
outcomes (low-mid-high) are captured accurately by the models. The output from
|
| 361 |
+
these models may then be used by decision makers for evaluating different developments
|
| 362 |
+
and scenarios. The design of experiments (DoE) is commonly employed to aid the
|
| 363 |
+
evaluation of subsurface uncertainties and characterise the impact and influence
|
| 364 |
+
to key model outcomes supporting development decisions.
|
| 365 |
+
|
| 366 |
+
Field production performance is often driven by uncertainty in reservoir outcome.
|
| 367 |
+
This paper is helpful to practitioners involved in any computer modelling of petroleum
|
| 368 |
+
reservoirs who are interested in capturing the uncertainty inherent in a field
|
| 369 |
+
and building an appropriate workflow for the development and sensitivity of a
|
| 370 |
+
range of models. Both model building and using DoE to evaluate developments and
|
| 371 |
+
Value of Information (VoI) studies for reservoir management will be shared. Integrated
|
| 372 |
+
DoE focusing on static, dynamic and well-based uncertainties will be illustrated.
|
| 373 |
+
|
| 374 |
+
Results will cover:
|
| 375 |
+
|
| 376 |
+
–
|
| 377 |
+
|
| 378 |
+
Lessons learned and best practices using ED (Experimental Design) to generate
|
| 379 |
+
low-mid-high reservoir simulation models
|
| 380 |
+
|
| 381 |
+
–
|
| 382 |
+
|
| 383 |
+
Understanding reservoir and well based uncertainties separately
|
| 384 |
+
|
| 385 |
+
–
|
| 386 |
+
|
| 387 |
+
Evaluating incremental field developments using ED
|
| 388 |
+
|
| 389 |
+
–
|
| 390 |
+
|
| 391 |
+
Utilizing ED to anticipate range of surveillance responses
|
| 392 |
+
|
| 393 |
+
Few papers exist on the integrated application of ED to giant gas fields using
|
| 394 |
+
reservoir simulation. Firstly, this case study will highlight some pitfalls to
|
| 395 |
+
avoid during the workflow. Secondly, the authors will discuss the important issue
|
| 396 |
+
of how to integrate or separate static, dynamic, well and facility based uncertainties.
|
| 397 |
+
Thirdly, the work will show the unique application of ED in VoI and field development
|
| 398 |
+
scoping.'
|
| 399 |
+
- The latest Silurian to Early Devonian Winduck Interval of the extensive but poorly
|
| 400 |
+
exposed Neckarboo Sub-basin, consists of several thousands of metres of a quartzose
|
| 401 |
+
siliciclastic sandstone succession that has been divided into three sequence divisions
|
| 402 |
+
called (in ascending parasequence order) parasequence A (coarse-grained quartz
|
| 403 |
+
sandstone), parasequence B (fining-upward succession of sandstone with siltstone
|
| 404 |
+
and sandstone beds thicken upward) and parasequence C (coarse-grained quartz sandstone
|
| 405 |
+
with siltstone and interbedded calcareous sandstones). These three geophysically
|
| 406 |
+
defined parasequences are separated by slightly discordant erosion surfaces. The
|
| 407 |
+
erosion surfaces are characterised by abrupt breaks at the top of parasequences
|
| 408 |
+
A and B and the surface at the top of parasequence B represents relatively local
|
| 409 |
+
erosion. The top of parasequence C is marked by a major unconformity with the
|
| 410 |
+
Snake Cave Interval. Gamma ray and self-potential signatures within the parasequences
|
| 411 |
+
can be correlated throughout the Neckarboo Sub-basin. The three sequence divisions
|
| 412 |
+
are further subdivided into depositional parasequences, which are readily recognised
|
| 413 |
+
from core sedimentology and electrofacies analysis. The parasequences provide
|
| 414 |
+
the framework for a detailed sedimentological analysis, which focuses on the identification
|
| 415 |
+
of lithofacies successions and parasequences. Petrophysical data are recorded
|
| 416 |
+
and their relationships to the depositional parasequences are discussed. This
|
| 417 |
+
paper presents a predictive reservoir effectiveness model that has been developed
|
| 418 |
+
to aid exploration of the Winduck Interval. The aim is to find the distribution
|
| 419 |
+
of parasequences (based on variations in porosity, net effective thickness and
|
| 420 |
+
lithofacies with burial depth) and to provide a dataset for lithostratigraphic
|
| 421 |
+
units within the Winduck Interval and parameter input for exploration prospect
|
| 422 |
+
evaluation. Parasequence stratigraphic analyses were obtained where there is good
|
| 423 |
+
lithofacies control. The porosity and permeability results have been analyzed
|
| 424 |
+
in a number of parasequences and poor reservoir quality may be due to the effects
|
| 425 |
+
of structure and fluid flow. This approach provides for better and more precise
|
| 426 |
+
stratigraphic trap analysis.
|
| 427 |
+
datasets:
|
| 428 |
+
- Sampath1987/offshore_energy_v1
|
| 429 |
+
pipeline_tag: sentence-similarity
|
| 430 |
+
library_name: sentence-transformers
|
| 431 |
+
metrics:
|
| 432 |
+
- cosine_accuracy
|
| 433 |
+
model-index:
|
| 434 |
+
- name: SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
|
| 435 |
+
results:
|
| 436 |
+
- task:
|
| 437 |
+
type: triplet
|
| 438 |
+
name: Triplet
|
| 439 |
+
dataset:
|
| 440 |
+
name: ai job validation
|
| 441 |
+
type: ai-job-validation
|
| 442 |
+
metrics:
|
| 443 |
+
- type: cosine_accuracy
|
| 444 |
+
value: 0.9800142645835876
|
| 445 |
+
name: Cosine Accuracy
|
| 446 |
+
---
|
| 447 |
+
|
| 448 |
+
# SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
|
| 449 |
+
|
| 450 |
+
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) on the [offshore_energy_v1](https://huggingface.co/datasets/Sampath1987/offshore_energy_v1) dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
|
| 451 |
+
|
| 452 |
+
## Model Details
|
| 453 |
+
|
| 454 |
+
### Model Description
|
| 455 |
+
- **Model Type:** Sentence Transformer
|
| 456 |
+
- **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) <!-- at revision 9bbca17d9273fd0d03d5725c7a4b0f6b45142062 -->
|
| 457 |
+
- **Maximum Sequence Length:** 8192 tokens
|
| 458 |
+
- **Output Dimensionality:** 768 dimensions
|
| 459 |
+
- **Similarity Function:** Cosine Similarity
|
| 460 |
+
- **Training Dataset:**
|
| 461 |
+
- [offshore_energy_v1](https://huggingface.co/datasets/Sampath1987/offshore_energy_v1)
|
| 462 |
+
<!-- - **Language:** Unknown -->
|
| 463 |
+
<!-- - **License:** Unknown -->
|
| 464 |
+
|
| 465 |
+
### Model Sources
|
| 466 |
+
|
| 467 |
+
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
|
| 468 |
+
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
|
| 469 |
+
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
|
| 470 |
+
|
| 471 |
+
### Full Model Architecture
|
| 472 |
+
|
| 473 |
+
```
|
| 474 |
+
SentenceTransformer(
|
| 475 |
+
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'NewModel'})
|
| 476 |
+
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
|
| 477 |
+
(2): Normalize()
|
| 478 |
+
)
|
| 479 |
+
```
|
| 480 |
+
|
| 481 |
+
## Usage
|
| 482 |
+
|
| 483 |
+
### Direct Usage (Sentence Transformers)
|
| 484 |
+
|
| 485 |
+
First install the Sentence Transformers library:
|
| 486 |
+
|
| 487 |
+
```bash
|
| 488 |
+
pip install -U sentence-transformers
|
| 489 |
+
```
|
| 490 |
+
|
| 491 |
+
Then you can load this model and run inference.
|
| 492 |
+
```python
|
| 493 |
+
from sentence_transformers import SentenceTransformer
|
| 494 |
+
|
| 495 |
+
# Download from the 🤗 Hub
|
| 496 |
+
model = SentenceTransformer("Sampath1987/EnergyEmbed-nv1")
|
| 497 |
+
# Run inference
|
| 498 |
+
sentences = [
|
| 499 |
+
'How does the predictive reservoir effectiveness model aid in the exploration of the Winduck Interval?',
|
| 500 |
+
'The latest Silurian to Early Devonian Winduck Interval of the extensive but poorly exposed Neckarboo Sub-basin, consists of several thousands of metres of a quartzose siliciclastic sandstone succession that has been divided into three sequence divisions called (in ascending parasequence order) parasequence A (coarse-grained quartz sandstone), parasequence B (fining-upward succession of sandstone with siltstone and sandstone beds thicken upward) and parasequence C (coarse-grained quartz sandstone with siltstone and interbedded calcareous sandstones). These three geophysically defined parasequences are separated by slightly discordant erosion surfaces. The erosion surfaces are characterised by abrupt breaks at the top of parasequences A and B and the surface at the top of parasequence B represents relatively local erosion. The top of parasequence C is marked by a major unconformity with the Snake Cave Interval. Gamma ray and self-potential signatures within the parasequences can be correlated throughout the Neckarboo Sub-basin. The three sequence divisions are further subdivided into depositional parasequences, which are readily recognised from core sedimentology and electrofacies analysis. The parasequences provide the framework for a detailed sedimentological analysis, which focuses on the identification of lithofacies successions and parasequences. Petrophysical data are recorded and their relationships to the depositional parasequences are discussed. This paper presents a predictive reservoir effectiveness model that has been developed to aid exploration of the Winduck Interval. The aim is to find the distribution of parasequences (based on variations in porosity, net effective thickness and lithofacies with burial depth) and to provide a dataset for lithostratigraphic units within the Winduck Interval and parameter input for exploration prospect evaluation. Parasequence stratigraphic analyses were obtained where there is good lithofacies control. 
The porosity and permeability results have been analyzed in a number of parasequences and poor reservoir quality may be due to the effects of structure and fluid flow. This approach provides for better and more precise stratigraphic trap analysis.',
|
| 501 |
+
'In this multi-Tcf subsea gas development off the North West coast of Australia, reservoir simulation supports the key business decisions and processes. An important factor when providing production forecasts is ensuring that a range of possible outcomes (low-mid-high) are captured accurately by the models. The output from these models may then be used by decision makers for evaluating different developments and scenarios. The design of experiments (DoE) is commonly employed to aid the evaluation of subsurface uncertainties and characterise the impact and influence to key model outcomes supporting development decisions.\nField production performance is often driven by uncertainty in reservoir outcome. This paper is helpful to practitioners involved in any computer modelling of petroleum reservoirs who are interested in capturing the uncertainty inherent in a field and building an appropriate workflow for the development and sensitivity of a range of models. Both model building and using DoE to evaluate developments and Value of Information (VoI) studies for reservoir management will be shared. Integrated DoE focusing on static, dynamic and well-based uncertainties will be illustrated.\nResults will cover:\n–\nLessons learned and best practices using ED (Experimental Design) to generate low-mid-high reservoir simulation models\n–\nUnderstanding reservoir and well based uncertainties separately\n–\nEvaluating incremental field developments using ED\n–\nUtilizing ED to anticipate range of surveillance responses\nFew papers exist on the integrated application of ED to giant gas fields using reservoir simulation. Firstly, this case study will highlight some pitfalls to avoid during the workflow. Secondly, the authors will discuss the important issue of how to integrate or separate static, dynamic, well and facility based uncertainties. Thirdly, the work will show the unique application of ED in VoI and field development scoping.',
|
| 502 |
+
]
|
| 503 |
+
embeddings = model.encode(sentences)
|
| 504 |
+
print(embeddings.shape)
|
| 505 |
+
# (3, 768)
|
| 506 |
+
|
| 507 |
+
# Get the similarity scores for the embeddings
|
| 508 |
+
similarities = model.similarity(embeddings, embeddings)
|
| 509 |
+
print(similarities)
|
| 510 |
+
# tensor([[1.0000, 0.6207, 0.1418],
|
| 511 |
+
# [0.6207, 1.0000, 0.0860],
|
| 512 |
+
# [0.1418, 0.0860, 1.0000]])
|
| 513 |
+
```
|
| 514 |
+
|
| 515 |
+
<!--
|
| 516 |
+
### Direct Usage (Transformers)
|
| 517 |
+
|
| 518 |
+
<details><summary>Click to see the direct usage in Transformers</summary>
|
| 519 |
+
|
| 520 |
+
</details>
|
| 521 |
+
-->
|
| 522 |
+
|
| 523 |
+
<!--
|
| 524 |
+
### Downstream Usage (Sentence Transformers)
|
| 525 |
+
|
| 526 |
+
You can finetune this model on your own dataset.
|
| 527 |
+
|
| 528 |
+
<details><summary>Click to expand</summary>
|
| 529 |
+
|
| 530 |
+
</details>
|
| 531 |
+
-->
|
| 532 |
+
|
| 533 |
+
<!--
|
| 534 |
+
### Out-of-Scope Use
|
| 535 |
+
|
| 536 |
+
*List how the model may foreseeably be misused and address what users ought not to do with the model.*
|
| 537 |
+
-->
|
| 538 |
+
|
| 539 |
+
## Evaluation
|
| 540 |
+
|
| 541 |
+
### Metrics
|
| 542 |
+
|
| 543 |
+
#### Triplet
|
| 544 |
+
|
| 545 |
+
* Dataset: `ai-job-validation`
|
| 546 |
+
* Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)
|
| 547 |
+
|
| 548 |
+
| Metric | Value |
|
| 549 |
+
|:--------------------|:---------|
|
| 550 |
+
| **cosine_accuracy** | **0.98** |
|
| 551 |
+
|
| 552 |
+
<!--
|
| 553 |
+
## Bias, Risks and Limitations
|
| 554 |
+
|
| 555 |
+
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
|
| 556 |
+
-->
|
| 557 |
+
|
| 558 |
+
<!--
|
| 559 |
+
### Recommendations
|
| 560 |
+
|
| 561 |
+
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
|
| 562 |
+
-->
|
| 563 |
+
|
| 564 |
+
## Training Details
|
| 565 |
+
|
| 566 |
+
### Training Dataset
|
| 567 |
+
|
| 568 |
+
#### offshore_energy_v1
|
| 569 |
+
|
| 570 |
+
* Dataset: [offshore_energy_v1](https://huggingface.co/datasets/Sampath1987/offshore_energy_v1) at [d4682d4](https://huggingface.co/datasets/Sampath1987/offshore_energy_v1/tree/d4682d4c446c51dfc8da8976e83e9499ef082de5)
|
| 571 |
+
* Size: 44,838 training samples
|
| 572 |
+
* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
|
| 573 |
+
* Approximate statistics based on the first 1000 samples:
|
| 574 |
+
| | anchor | positive | negative |
|
| 575 |
+
|:--------|:-----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|
|
| 576 |
+
| type | string | string | string |
|
| 577 |
+
| details | <ul><li>min: 13 tokens</li><li>mean: 24.54 tokens</li><li>max: 46 tokens</li></ul> | <ul><li>min: 33 tokens</li><li>mean: 430.25 tokens</li><li>max: 1027 tokens</li></ul> | <ul><li>min: 45 tokens</li><li>mean: 423.92 tokens</li><li>max: 1204 tokens</li></ul> |
|
| 578 |
+
* Samples:
|
| 579 |
+
| anchor | positive | negative |
|
| 580 |
+
|:---------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
| 581 |
+
| <code>What benefits were realized through the adoption of remote operations services in the North Sea?</code> | <code>The North Sea has always been a pioneer for the adoption of remote operations services (ROS) in offshore drilling applications. Drilling services such as Measurement While Drilling (MWD), Logging While Drilling (LWD) and/or mud logging (ML) have been performed with an element of ROS for over the last two decades. Early adoption of these remote services delivered initial benefits to operators such as reducing HSE risks related to the travel and accommodation of field service employees at offshore rig sites. Meanwhile service companies were able to explore the added efficiencies gained by having multi-skilled employees providing a higher level of support to customers while also gaining additional agility to manage their personnel through tighter market cycles. The mutual benefit of this early adoption created a solid foundation for ROS to expand the scope of influence in drilling operations to include Directional Drilling (DD).<br>Despite the maturity of ROS within a select community of ope...</code> | <code>A new program for the development of graduate engineers has been implemented in Denmark on a stimulation vessel in the North Sea. It is designed to provide graduate engineers with a three-year period of extensive experience in offshore operations, knowledge of equipment and designing effective stimulation jobs. There are many components to the program that address training, skills, demonstration of capabilities and evidence of competence. These are essential components that ultimately lead to improved operational performance and highlights.<br>The North Sea oil and gas industry requires a constant effort to maintain the engineering skills of its offshore workers so vital to continued success. Paradoxically, there are numerous factors that hinder on site development of young engineering talent in the North Sea. 
There is a lack of offshore accommodation that often restricts onsite time for trainees. This is exacerbated by a low frequency of many operations compared to other provinces in the...</code> |
|
| 582 |
+
| <code>What is the estimated storage capacity for CO2 in the analyzed study area?</code> | <code>The oil and gas industry is a significant contributor to carbon dioxide (CO2) emissions, which have a major impact on climate change. Geoscientists in the industry play a crucial role in mitigating climate change by identifying and evaluating potential CO2 storage sites, monitoring CO2 behavior after injection, and exploring CO2 enhanced oil recovery (EOR) techniques. CO2 -EOR involves injecting CO2 into depleted oil reservoirs to increase oil production. Reservoir characterization using well log and seismic data analysis helps determine storage capacity, containment, and injectivity of reservoirs for CO2 sequestration and EOR. In this study, two sand reservoirs (RES 1 and RES 2) were analyzed, with RES 2 being considered more suitable for CO2 sequestration and CO2 -EOR. The estimated storage capacity of the study area was approximately 40 million metric tons (MT). Assessments of fault sealing capacity and reservoir properties were conducted to validate storage potential. Further inves...</code> | <code>Transported and geologically stored CO2 contains several impurities that depend on its source and associated capture technology. Impurities in anthropogenic CO2 can have damaging impacts on the different elements of a CCS system, which must be considered when developing a CO2 specification (Table 1). Thus, characterising all the impurities and determining the required purity of the CO2 mixture is critically important for the safe design and operation of CCS transport and storage systems.<br>It is important to note that CO2 specifications relate to normal operations. Short-term excursions outside of the recommended maximum concentrations for each impurity may be permissible provided they do not lead to health and safety risks and / or risks to the mechanical integrity of the asset.</code> |
|
| 583 |
+
| <code>What is the role of a Preventive Maintenance Program (PMP) in enhancing the reliability of Electrical Submersible Pumps (ESPs)?</code> | <code>The reliability of Electrical Submersible Pumps (ESPs) is a critical target for companies managing artificially lifted fields. While efforts to continuously improve the reliability in the downhole system are crucial, it is necessary to focus on the health and long-term reliability of the ESP surface equipment. One effective approach toward achieving this goal is through conducting a comprehensive Preventive Maintenance Program (PMP) for the different components of the ESP surface system.<br>An ESP PMP should be managed without jeopardizing production strategy. The design of the PMP must meet the production demand while maintaining the best-in-class PMP practices. The well operating condition, frequency, weather, well location, required periodic inspection and preemptive servicing and replacement of surface equipment components must be considered, based on studied criterion. The design of the PMP considers equipment upgrades and thermal imaging surveillance to guarantee healthy electrical ...</code> | <code>A family of exciting new Electric Submersible Pump (ESP) technologies promises to radically improve the development economics of many oilfields and field extensions. This technology is particularly relevant to prospects in the range 5-100 million barrels reserves, which are located greater than 15 kilometres from existing platforms and often suffer uncertainties on reservoir performance (pressure, sweep, heterogeneities inflow performance etc.). Prospects in that category generally offer mediocre to inadequate economics or unacceptable risks of ‘downside’ potential. Platform development entails untenable capex exposure, whereas conventional subsea development (e.g. 
by gas lift) will result in very inferior production performance.<br>The new technologies which ‘unlock’ the economics of such fields are:<br>Viable subsea ESP technology is available now and will be field proven during 1994/95.<br>Proven high reliability pump systems are now available, underwritten by performance contract.<br>Bottom di...</code> |
|
| 584 |
+
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
|
| 585 |
+
```json
|
| 586 |
+
{
|
| 587 |
+
"scale": 20.0,
|
| 588 |
+
"similarity_fct": "cos_sim",
|
| 589 |
+
"gather_across_devices": false
|
| 590 |
+
}
|
| 591 |
+
```
|
| 592 |
+
|
| 593 |
+
### Evaluation Dataset
|
| 594 |
+
|
| 595 |
+
#### offshore_energy_v1
|
| 596 |
+
|
| 597 |
+
* Dataset: [offshore_energy_v1](https://huggingface.co/datasets/Sampath1987/offshore_energy_v1) at [d4682d4](https://huggingface.co/datasets/Sampath1987/offshore_energy_v1/tree/d4682d4c446c51dfc8da8976e83e9499ef082de5)
|
| 598 |
+
* Size: 5,604 evaluation samples
|
| 599 |
+
* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
|
| 600 |
+
* Approximate statistics based on the first 1000 samples:
|
| 601 |
+
| | anchor | positive | negative |
|
| 602 |
+
|:--------|:-----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|
|
| 603 |
+
| type | string | string | string |
|
| 604 |
+
| details | <ul><li>min: 14 tokens</li><li>mean: 24.45 tokens</li><li>max: 41 tokens</li></ul> | <ul><li>min: 47 tokens</li><li>mean: 440.51 tokens</li><li>max: 1091 tokens</li></ul> | <ul><li>min: 56 tokens</li><li>mean: 426.21 tokens</li><li>max: 1152 tokens</li></ul> |
|
| 605 |
+
* Samples:
|
| 606 |
+
| anchor | positive | negative |
|
| 607 |
+
|:--------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
| 608 |
+
| <code>What is the role of nanocrystalline cellulose (NCC) in the formulation of hydraulic fracturing fluids?</code> | <code>Guar gum and its derivative based-gels cross-linked with boron have been used in hydraulic fracturing for decades. In order to achieve gel strength requirements, conventional fracturing requires the use of a large amount of thickener and cross-linking agent, which results in more residue and difficulty in the recovery of permeability. At the same time, the gel can be used to achieve the best thermal stability in a high pH environment. Therefore, we proposed a highly efficient organoboron nanocellulose cross-linker for low polymer loading fracturing fluids.<br>Nanocrystalline cellulose (NCC) resulted from sulfuric acid hydrolysis of cellulose microciystalline. Boron-modified nanoparticles were synthesized by one-pot reaction as nano boron cross-linker (NBC). Nanocrystalline cellulose (NCC), (3-Aminopropyl) triethoxysilane, Organic boron (OBC) was mixed at a ratio of 1:4:4 and stirred at a constant temperature of 85°C for 5 hours. The presence of surface modification was shown with FTIR spe...</code> | <code>The unstable wellbore created by the infiltration of drilling fluids into the reservoir formation is a great challenge in drilling operations. Reducing the fluid infiltration using nanoparticles (NPs) brings about a significant improvement in drilling operation. Herein, a mixture of iron oxide nanoparticle (IONP) and polyanionic cellulose nanoparticle (nano-PAC) additives were added to water-based mud (WBM) to determine their impact on rheological and filtration properties measured at 80 °F, 100 °F, and 250 °F. Polyanionic cellulose (PAC-R) was processed into nano-PAC by wet ball-milling process. The rheological behaviour, low-pressure low-temperature (LPLT), and high-pressure high-temperature (HPHT) filtration properties performance of IONP, nano-PAC, and IONP and nano-PAC mixtures were compared in the WBM. 
The results showed that IONP, nano-PAC, and synergy effect of IONP and nano-PAC in WBM at temperatures of 80 °F and 250 °F improved the density, 10-s and 10-min gel strength (10-s ...</code> |
|
| 609 |
+
| <code>What is the definition of tail gas in oil and gas engineering processes?</code> | <code>#### T <br>**Tail gas** <br>Effluent gas at the end of a process. <br>**Technical Potential** <br>The amount by which it is possible to reduce greenhouse gas emissions by implementing a<br>technology or practice that has reached the demonstration phase. <br>**Tectonically active area** <br>Area of the Earth where deformation is presently causing structural changes. <br>**Thermocline** <br>The ocean phenomenon characterized by a sharp change in temperature with depth. <br>**Thermohaline** <br>The vertical overturning of water masses due to seasonal heating, evaporation, and cooling. <br>**Third party** <br>Entity that is independent of the parties involved with the issues in question Top-down model.<br>A model based on applying macro-economic theory and econometric techniques to historical<br>data about consumption, prices, etc. <br>**Tracer** <br>A chemical compound or isotope added in small quantities to trace flow patterns. <br>36</code> | <code>SUSTAINABILITY REPORTING GUIDANCE FOR THE OIL AND GAS INDUSTRY <br>**Particulate matter:** A complex mixture of small particles or droplets such as salts, organic<br>chemicals, metals and soil particles [ENV-5]. <br>**Petrochemicals:** Chemical products derived from oil and gas. <br>**Pipelines:** Construction and use of facilities to transport liquid or gaseous hydrocarbons<br>over long distances in above-ground, below-ground or underwater pipes. <br>**Primary containment:** The vessel, pipe, barrel, equipment or other barrier that is designed<br>to keep a material within it [ENV-6, ENV-7, SHS-6]. <br>**Primary energy:** The energy content of a hydrocarbon fuel or other energy source used to<br>produce power, usually in the form of electricity, heat or steam [CCE-6]. 
<br>**Process safety:** A systematic approach to ensuring the safe containment of hazardous<br>materials or energy by applying good design, construction and operating principles [SHS-6].<br>In this Guidance, this term is used synonymously with Asset i...</code> |
|
| 610 |
+
| <code>How is dense phase acid gas injected back into the formation to mitigate environmental impacts?</code> | <code>A systematic hazard management approach was used to identify, assess and mitigate hazards at the conceptual design stage of a large onshore sour gas development in Abu Dhabi. The potential environmental impact of sulphur block production and poor prospects of a sulphur market led to a concept involving injection of dense phase acid gas back into the formation. Significant Health, Safety and Environmental (HSE) challenges were addressed relating to the scale of the sour gas development which included the gathering, processing and injection of sour/acid gas containing 33% – 80% H2S. Quantitative Risk Assessment and H2S dispersion calculations were performed to evaluate the risk reduction effectiveness of specific HSE design considerations including material selection, pipeline design, pipeline routing, well design and the location of the processing facility and sour/acid gas wells. These HSE design considerations were integrated into the concept selection. Best industry practices in desi...</code> | <code>Nowadays, as the deep gas reservoirs in Daqing are explored, the complex volcanic reservoirs have been the major reservoirs in deep natural gas exploration and production. The reserves of volcanic gas reservoirs take up 88% of the total gas reserves. However, the deep complex gas reservoirs may cause heavy pollution during the drilling completion, and some of the barriers between target zones of the wells are very thin, leading to a poor stability. Additionally, because of the complex water/gas relations in the formation, such as appearance of bottom water and water and gas sharing the same formation in some wells, the fracturing operations will induce water channeling. 
All these facts may cause the failure of the fracturing operations.<br>Especially, when the fractured formation is close to the water/gas interface, the fractures will easily extend into the water layer. The existence of water in the gas wells directly leads to the reduction of production and recovery rate of the gas reser...</code> |
|
| 611 |
+
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
|
| 612 |
+
```json
|
| 613 |
+
{
|
| 614 |
+
"scale": 20.0,
|
| 615 |
+
"similarity_fct": "cos_sim",
|
| 616 |
+
"gather_across_devices": false
|
| 617 |
+
}
|
| 618 |
+
```
|
| 619 |
+
|
| 620 |
+
### Training Hyperparameters
|
| 621 |
+
#### Non-Default Hyperparameters
|
| 622 |
+
|
| 623 |
+
- `eval_strategy`: steps
|
| 624 |
+
- `per_device_train_batch_size`: 16
|
| 625 |
+
- `per_device_eval_batch_size`: 16
|
| 626 |
+
- `learning_rate`: 2e-05
|
| 627 |
+
- `num_train_epochs`: 1
|
| 628 |
+
- `warmup_ratio`: 0.1
|
| 629 |
+
|
| 630 |
+
#### All Hyperparameters
|
| 631 |
+
<details><summary>Click to expand</summary>
|
| 632 |
+
|
| 633 |
+
- `overwrite_output_dir`: False
|
| 634 |
+
- `do_predict`: False
|
| 635 |
+
- `eval_strategy`: steps
|
| 636 |
+
- `prediction_loss_only`: True
|
| 637 |
+
- `per_device_train_batch_size`: 16
|
| 638 |
+
- `per_device_eval_batch_size`: 16
|
| 639 |
+
- `per_gpu_train_batch_size`: None
|
| 640 |
+
- `per_gpu_eval_batch_size`: None
|
| 641 |
+
- `gradient_accumulation_steps`: 1
|
| 642 |
+
- `eval_accumulation_steps`: None
|
| 643 |
+
- `torch_empty_cache_steps`: None
|
| 644 |
+
- `learning_rate`: 2e-05
|
| 645 |
+
- `weight_decay`: 0.0
|
| 646 |
+
- `adam_beta1`: 0.9
|
| 647 |
+
- `adam_beta2`: 0.999
|
| 648 |
+
- `adam_epsilon`: 1e-08
|
| 649 |
+
- `max_grad_norm`: 1.0
|
| 650 |
+
- `num_train_epochs`: 1
|
| 651 |
+
- `max_steps`: -1
|
| 652 |
+
- `lr_scheduler_type`: linear
|
| 653 |
+
- `lr_scheduler_kwargs`: {}
|
| 654 |
+
- `warmup_ratio`: 0.1
|
| 655 |
+
- `warmup_steps`: 0
|
| 656 |
+
- `log_level`: passive
|
| 657 |
+
- `log_level_replica`: warning
|
| 658 |
+
- `log_on_each_node`: True
|
| 659 |
+
- `logging_nan_inf_filter`: True
|
| 660 |
+
- `save_safetensors`: True
|
| 661 |
+
- `save_on_each_node`: False
|
| 662 |
+
- `save_only_model`: False
|
| 663 |
+
- `restore_callback_states_from_checkpoint`: False
|
| 664 |
+
- `no_cuda`: False
|
| 665 |
+
- `use_cpu`: False
|
| 666 |
+
- `use_mps_device`: False
|
| 667 |
+
- `seed`: 42
|
| 668 |
+
- `data_seed`: None
|
| 669 |
+
- `jit_mode_eval`: False
|
| 670 |
+
- `use_ipex`: False
|
| 671 |
+
- `bf16`: False
|
| 672 |
+
- `fp16`: False
|
| 673 |
+
- `fp16_opt_level`: O1
|
| 674 |
+
- `half_precision_backend`: auto
|
| 675 |
+
- `bf16_full_eval`: False
|
| 676 |
+
- `fp16_full_eval`: False
|
| 677 |
+
- `tf32`: None
|
| 678 |
+
- `local_rank`: 0
|
| 679 |
+
- `ddp_backend`: None
|
| 680 |
+
- `tpu_num_cores`: None
|
| 681 |
+
- `tpu_metrics_debug`: False
|
| 682 |
+
- `debug`: []
|
| 683 |
+
- `dataloader_drop_last`: False
|
| 684 |
+
- `dataloader_num_workers`: 0
|
| 685 |
+
- `dataloader_prefetch_factor`: None
|
| 686 |
+
- `past_index`: -1
|
| 687 |
+
- `disable_tqdm`: False
|
| 688 |
+
- `remove_unused_columns`: True
|
| 689 |
+
- `label_names`: None
|
| 690 |
+
- `load_best_model_at_end`: False
|
| 691 |
+
- `ignore_data_skip`: False
|
| 692 |
+
- `fsdp`: []
|
| 693 |
+
- `fsdp_min_num_params`: 0
|
| 694 |
+
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
|
| 695 |
+
- `fsdp_transformer_layer_cls_to_wrap`: None
|
| 696 |
+
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
|
| 697 |
+
- `deepspeed`: None
|
| 698 |
+
- `label_smoothing_factor`: 0.0
|
| 699 |
+
- `optim`: adamw_torch
|
| 700 |
+
- `optim_args`: None
|
| 701 |
+
- `adafactor`: False
|
| 702 |
+
- `group_by_length`: False
|
| 703 |
+
- `length_column_name`: length
|
| 704 |
+
- `ddp_find_unused_parameters`: None
|
| 705 |
+
- `ddp_bucket_cap_mb`: None
|
| 706 |
+
- `ddp_broadcast_buffers`: False
|
| 707 |
+
- `dataloader_pin_memory`: True
|
| 708 |
+
- `dataloader_persistent_workers`: False
|
| 709 |
+
- `skip_memory_metrics`: True
|
| 710 |
+
- `use_legacy_prediction_loop`: False
|
| 711 |
+
- `push_to_hub`: False
|
| 712 |
+
- `resume_from_checkpoint`: None
|
| 713 |
+
- `hub_model_id`: None
|
| 714 |
+
- `hub_strategy`: every_save
|
| 715 |
+
- `hub_private_repo`: None
|
| 716 |
+
- `hub_always_push`: False
|
| 717 |
+
- `hub_revision`: None
|
| 718 |
+
- `gradient_checkpointing`: False
|
| 719 |
+
- `gradient_checkpointing_kwargs`: None
|
| 720 |
+
- `include_inputs_for_metrics`: False
|
| 721 |
+
- `include_for_metrics`: []
|
| 722 |
+
- `eval_do_concat_batches`: True
|
| 723 |
+
- `fp16_backend`: auto
|
| 724 |
+
- `push_to_hub_model_id`: None
|
| 725 |
+
- `push_to_hub_organization`: None
|
| 726 |
+
- `mp_parameters`:
|
| 727 |
+
- `auto_find_batch_size`: False
|
| 728 |
+
- `full_determinism`: False
|
| 729 |
+
- `torchdynamo`: None
|
| 730 |
+
- `ray_scope`: last
|
| 731 |
+
- `ddp_timeout`: 1800
|
| 732 |
+
- `torch_compile`: False
|
| 733 |
+
- `torch_compile_backend`: None
|
| 734 |
+
- `torch_compile_mode`: None
|
| 735 |
+
- `include_tokens_per_second`: False
|
| 736 |
+
- `include_num_input_tokens_seen`: False
|
| 737 |
+
- `neftune_noise_alpha`: None
|
| 738 |
+
- `optim_target_modules`: None
|
| 739 |
+
- `batch_eval_metrics`: False
|
| 740 |
+
- `eval_on_start`: False
|
| 741 |
+
- `use_liger_kernel`: False
|
| 742 |
+
- `liger_kernel_config`: None
|
| 743 |
+
- `eval_use_gather_object`: False
|
| 744 |
+
- `average_tokens_across_devices`: False
|
| 745 |
+
- `prompts`: None
|
| 746 |
+
- `batch_sampler`: batch_sampler
|
| 747 |
+
- `multi_dataset_batch_sampler`: proportional
|
| 748 |
+
- `router_mapping`: {}
|
| 749 |
+
- `learning_rate_mapping`: {}
|
| 750 |
+
|
| 751 |
+
</details>
|
| 752 |
+
|
| 753 |
+
### Training Logs
|
| 754 |
+
| Epoch | Step | Validation Loss | ai-job-validation_cosine_accuracy |
|
| 755 |
+
|:------:|:----:|:---------------:|:---------------------------------:|
|
| 756 |
+
| 0.3568 | 1000 | 0.0982 | 0.9764 |
|
| 757 |
+
| 0.7135 | 2000 | 0.0870 | 0.9800 |
|
| 758 |
+
|
| 759 |
+
|
| 760 |
+
### Framework Versions
|
| 761 |
+
- Python: 3.10.12
|
| 762 |
+
- Sentence Transformers: 5.1.0
|
| 763 |
+
- Transformers: 4.53.3
|
| 764 |
+
- PyTorch: 2.8.0+cu128
|
| 765 |
+
- Accelerate: 1.9.0
|
| 766 |
+
- Datasets: 4.0.0
|
| 767 |
+
- Tokenizers: 0.21.2
|
| 768 |
+
|
| 769 |
+
## Citation
|
| 770 |
+
|
| 771 |
+
### BibTeX
|
| 772 |
+
|
| 773 |
+
#### Sentence Transformers
|
| 774 |
+
```bibtex
|
| 775 |
+
@inproceedings{reimers-2019-sentence-bert,
|
| 776 |
+
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
|
| 777 |
+
author = "Reimers, Nils and Gurevych, Iryna",
|
| 778 |
+
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
|
| 779 |
+
month = "11",
|
| 780 |
+
year = "2019",
|
| 781 |
+
publisher = "Association for Computational Linguistics",
|
| 782 |
+
url = "https://arxiv.org/abs/1908.10084",
|
| 783 |
+
}
|
| 784 |
+
```
|
| 785 |
+
|
| 786 |
+
#### MultipleNegativesRankingLoss
|
| 787 |
+
```bibtex
|
| 788 |
+
@misc{henderson2017efficient,
|
| 789 |
+
title={Efficient Natural Language Response Suggestion for Smart Reply},
|
| 790 |
+
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
|
| 791 |
+
year={2017},
|
| 792 |
+
eprint={1705.00652},
|
| 793 |
+
archivePrefix={arXiv},
|
| 794 |
+
primaryClass={cs.CL}
|
| 795 |
+
}
|
| 796 |
+
```
|
| 797 |
+
|
| 798 |
+
<!--
|
| 799 |
+
## Glossary
|
| 800 |
+
|
| 801 |
+
*Clearly define terms in order to be accessible across audiences.*
|
| 802 |
+
-->
|
| 803 |
+
|
| 804 |
+
<!--
|
| 805 |
+
## Model Card Authors
|
| 806 |
+
|
| 807 |
+
*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
|
| 808 |
+
-->
|
| 809 |
+
|
| 810 |
+
<!--
|
| 811 |
+
## Model Card Contact
|
| 812 |
+
|
| 813 |
+
*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
|
| 814 |
+
-->
|
config.json
ADDED
|
@@ -0,0 +1,49 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"architectures": [
|
| 3 |
+
"NewModel"
|
| 4 |
+
],
|
| 5 |
+
"attention_probs_dropout_prob": 0.0,
|
| 6 |
+
"auto_map": {
|
| 7 |
+
"AutoConfig": "configuration.NewConfig",
|
| 8 |
+
"AutoModel": "modeling.NewModel",
|
| 9 |
+
"AutoModelForMaskedLM": "Alibaba-NLP/new-impl--modeling.NewForMaskedLM",
|
| 10 |
+
"AutoModelForMultipleChoice": "Alibaba-NLP/new-impl--modeling.NewForMultipleChoice",
|
| 11 |
+
"AutoModelForQuestionAnswering": "Alibaba-NLP/new-impl--modeling.NewForQuestionAnswering",
|
| 12 |
+
"AutoModelForSequenceClassification": "Alibaba-NLP/new-impl--modeling.NewForSequenceClassification",
|
| 13 |
+
"AutoModelForTokenClassification": "Alibaba-NLP/new-impl--modeling.NewForTokenClassification"
|
| 14 |
+
},
|
| 15 |
+
"classifier_dropout": 0.0,
|
| 16 |
+
"hidden_act": "gelu",
|
| 17 |
+
"hidden_dropout_prob": 0.1,
|
| 18 |
+
"hidden_size": 768,
|
| 19 |
+
"id2label": {
|
| 20 |
+
"0": "LABEL_0"
|
| 21 |
+
},
|
| 22 |
+
"initializer_range": 0.02,
|
| 23 |
+
"intermediate_size": 3072,
|
| 24 |
+
"label2id": {
|
| 25 |
+
"LABEL_0": 0
|
| 26 |
+
},
|
| 27 |
+
"layer_norm_eps": 1e-12,
|
| 28 |
+
"layer_norm_type": "layer_norm",
|
| 29 |
+
"logn_attention_clip1": false,
|
| 30 |
+
"logn_attention_scale": false,
|
| 31 |
+
"max_position_embeddings": 8192,
|
| 32 |
+
"model_type": "new",
|
| 33 |
+
"num_attention_heads": 12,
|
| 34 |
+
"num_hidden_layers": 12,
|
| 35 |
+
"pack_qkv": true,
|
| 36 |
+
"pad_token_id": 1,
|
| 37 |
+
"position_embedding_type": "rope",
|
| 38 |
+
"rope_scaling": {
|
| 39 |
+
"factor": 8.0,
|
| 40 |
+
"type": "ntk"
|
| 41 |
+
},
|
| 42 |
+
"rope_theta": 20000,
|
| 43 |
+
"torch_dtype": "float32",
|
| 44 |
+
"transformers_version": "4.53.3",
|
| 45 |
+
"type_vocab_size": 1,
|
| 46 |
+
"unpad_inputs": false,
|
| 47 |
+
"use_memory_efficient_attention": false,
|
| 48 |
+
"vocab_size": 250048
|
| 49 |
+
}
|
config_sentence_transformers.json
ADDED
|
@@ -0,0 +1,14 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"model_type": "SentenceTransformer",
|
| 3 |
+
"__version__": {
|
| 4 |
+
"sentence_transformers": "5.1.0",
|
| 5 |
+
"transformers": "4.53.3",
|
| 6 |
+
"pytorch": "2.8.0+cu128"
|
| 7 |
+
},
|
| 8 |
+
"prompts": {
|
| 9 |
+
"query": "",
|
| 10 |
+
"document": ""
|
| 11 |
+
},
|
| 12 |
+
"default_prompt_name": null,
|
| 13 |
+
"similarity_fn_name": "cosine"
|
| 14 |
+
}
|
configuration.py
ADDED
|
@@ -0,0 +1,145 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# coding=utf-8
|
| 2 |
+
# Copyright 2024 The GTE Team Authors and Alibaba Group.
|
| 3 |
+
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
| 4 |
+
#
|
| 5 |
+
# Licensed under the Apache License, Version 2.0 (the "License");
|
| 6 |
+
# you may not use this file except in compliance with the License.
|
| 7 |
+
# You may obtain a copy of the License at
|
| 8 |
+
#
|
| 9 |
+
# http://www.apache.org/licenses/LICENSE-2.0
|
| 10 |
+
#
|
| 11 |
+
# Unless required by applicable law or agreed to in writing, software
|
| 12 |
+
# distributed under the License is distributed on an "AS IS" BASIS,
|
| 13 |
+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
| 14 |
+
# See the License for the specific language governing permissions and
|
| 15 |
+
# limitations under the License.
|
| 16 |
+
""" NEW model configuration"""
|
| 17 |
+
from transformers.configuration_utils import PretrainedConfig
|
| 18 |
+
from transformers.utils import logging
|
| 19 |
+
|
| 20 |
+
logger = logging.get_logger(__name__)
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
class NewConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`NewModel`] or a [`TFNewModel`]. It is used to
    instantiate a NEW model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the NEW
    [izhx/new-base-en](https://huggingface.co/izhx/new-base-en) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 30522):
            Vocabulary size of the NEW model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`NewModel`] or [`TFNewModel`].
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (`int`, *optional*, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (`int`, *optional*, defaults to 2):
            The vocabulary size of the `token_type_ids` passed when calling [`NewModel`] or [`TFNewModel`].
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        position_embedding_type (`str`, *optional*, defaults to `"rope"`):
            Type of position embedding. Choose one of `"absolute"`, `"rope"`.
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        rope_scaling (`Dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
            `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
            `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
            these scaling strategies behave:
            https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
            experimental feature, subject to breaking API changes in future versions.
        classifier_dropout (`float`, *optional*):
            The dropout ratio for the classification head.

    Examples:

    ```python
    >>> from transformers import NewConfig, NewModel

    >>> # Initializing a NEW izhx/new-base-en style configuration
    >>> configuration = NewConfig()

    >>> # Initializing a model (with random weights) from the izhx/new-base-en style configuration
    >>> model = NewModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    # `model_type` keys the AutoConfig registry entry for this architecture.
    model_type = "new"

    def __init__(
        self,
        vocab_size=30528,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.0,
        max_position_embeddings=2048,
        type_vocab_size=1,
        initializer_range=0.02,
        layer_norm_type='layer_norm',
        layer_norm_eps=1e-12,
        # pad_token_id=0,
        position_embedding_type="rope",
        rope_theta=10000.0,
        rope_scaling=None,
        classifier_dropout=None,
        pack_qkv=True,
        unpad_inputs=False,
        use_memory_efficient_attention=False,
        logn_attention_scale=False,
        logn_attention_clip1=False,
        **kwargs,
    ):
        # Forward unrecognised kwargs (pad_token_id, id2label, ...) to PretrainedConfig.
        super().__init__(**kwargs)

        # Standard transformer-encoder hyperparameters.
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.layer_norm_type = layer_norm_type
        self.layer_norm_eps = layer_norm_eps
        self.position_embedding_type = position_embedding_type
        # RoPE settings; `rope_scaling` follows the {"type": ..., "factor": ...} schema.
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self.classifier_dropout = classifier_dropout

        # Implementation switches specific to this architecture:
        # pack_qkv: fuse Q/K/V projections into one matmul;
        # unpad_inputs: strip padding tokens before attention;
        # use_memory_efficient_attention: route through xformers if available;
        # logn_attention_*: log-length attention scaling options.
        self.pack_qkv = pack_qkv
        self.unpad_inputs = unpad_inputs
        self.use_memory_efficient_attention = use_memory_efficient_attention
        self.logn_attention_scale = logn_attention_scale
        self.logn_attention_clip1 = logn_attention_clip1
|
model.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ef910e1bc051c7278474399f603dc788cfdcf73b75c5e41128bd990822828f7f
|
| 3 |
+
size 1221487872
|
modeling.py
ADDED
|
@@ -0,0 +1,1418 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# coding=utf-8
|
| 2 |
+
# Copyright 2024 The GTE Team Authors and Alibaba Group.
|
| 3 |
+
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
| 4 |
+
#
|
| 5 |
+
# Licensed under the Apache License, Version 2.0 (the "License");
|
| 6 |
+
# you may not use this file except in compliance with the License.
|
| 7 |
+
# You may obtain a copy of the License at
|
| 8 |
+
#
|
| 9 |
+
# http://www.apache.org/licenses/LICENSE-2.0
|
| 10 |
+
#
|
| 11 |
+
# Unless required by applicable law or agreed to in writing, software
|
| 12 |
+
# distributed under the License is distributed on an "AS IS" BASIS,
|
| 13 |
+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
| 14 |
+
# See the License for the specific language governing permissions and
|
| 15 |
+
# limitations under the License.
|
| 16 |
+
"""PyTorch NEW model."""
|
| 17 |
+
|
| 18 |
+
import math
|
| 19 |
+
from dataclasses import dataclass
|
| 20 |
+
from typing import List, Optional, Tuple, Union
|
| 21 |
+
|
| 22 |
+
import torch
|
| 23 |
+
import torch.utils.checkpoint
|
| 24 |
+
from torch import nn
|
| 25 |
+
|
| 26 |
+
from transformers.activations import ACT2FN
|
| 27 |
+
from transformers.modeling_outputs import (
|
| 28 |
+
BaseModelOutput,
|
| 29 |
+
BaseModelOutputWithPooling,
|
| 30 |
+
MaskedLMOutput,
|
| 31 |
+
MultipleChoiceModelOutput,
|
| 32 |
+
QuestionAnsweringModelOutput,
|
| 33 |
+
SequenceClassifierOutput,
|
| 34 |
+
ModelOutput,
|
| 35 |
+
)
|
| 36 |
+
from transformers.modeling_utils import PreTrainedModel
|
| 37 |
+
from transformers.utils import logging
|
| 38 |
+
|
| 39 |
+
try:
|
| 40 |
+
import xformers.ops as xops
|
| 41 |
+
except ImportError as e:
|
| 42 |
+
xops = None
|
| 43 |
+
|
| 44 |
+
from .configuration import NewConfig
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
logger = logging.get_logger(__name__)
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
# Adapted from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/bert_padding.py
|
| 51 |
+
# Which was adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
|
| 52 |
+
class IndexFirstAxis(torch.autograd.Function):
    """Differentiable row-gather along the first axis.

    Equivalent to ``input[indices]``, but implemented with ``torch.gather``
    (forward) and ``Tensor.scatter_`` (backward), which the flash-attention
    authors measured to be slightly faster than fancy indexing.
    """

    @staticmethod
    def forward(ctx, input, indices):
        ctx.save_for_backward(indices)
        assert input.ndim >= 2
        leading = input.shape[0]
        trailing_shape = input.shape[1:]
        ctx.first_axis_dim = leading
        flat_width = trailing_shape.numel()
        # Flatten trailing dims, gather the requested rows, restore the shape.
        gathered = torch.gather(
            input.view(leading, flat_width),
            0,
            indices.unsqueeze(-1).expand(indices.size(0), flat_width),
        )
        return gathered.reshape(-1, *trailing_shape)

    @staticmethod
    def backward(ctx, grad_output):
        (indices,) = ctx.saved_tensors
        assert grad_output.ndim >= 2
        trailing_shape = grad_output.shape[1:]
        flat = grad_output.view(grad_output.size(0), trailing_shape.numel())
        # Scatter incoming gradients back into a zero tensor of the original
        # first-axis size; rows that were never gathered get zero gradient.
        grad_input = torch.zeros(
            [ctx.first_axis_dim, flat.shape[1]],
            device=flat.device,
            dtype=flat.dtype,
        )
        grad_input.scatter_(
            0, indices.unsqueeze(-1).expand(indices.size(0), flat.size(1)), flat
        )
        return grad_input.reshape(ctx.first_axis_dim, *trailing_shape), None
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
index_first_axis = IndexFirstAxis.apply
|
| 92 |
+
|
| 93 |
+
|
| 94 |
+
def unpad_input(hidden_states, attention_mask=None, indices=None):
    """Drop padded positions, flattening (batch, seqlen, ...) to (total_nnz, ...).

    Arguments:
        hidden_states: (batch, seqlen, ...)
        attention_mask: (batch, seqlen) bool/int mask where 1 means valid and
            0 means padding. Only consulted when `indices` is not provided.
        indices: (total_nnz,) positions of the valid tokens inside the
            flattened (batch * seqlen) sequence.
    Return:
        hidden_states: (total_nnz, ...) containing only the unmasked tokens.
    """
    if indices is None:
        assert attention_mask is not None
        indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()

    # Gather with integer indices through the custom autograd op rather than
    # boolean-mask indexing: avoids the expanded bool mask PyTorch would
    # materialize internally, saving memory and time.
    flattened = hidden_states.view(-1, *hidden_states.shape[2:])
    return index_first_axis(flattened, indices)
|
| 114 |
+
|
| 115 |
+
|
| 116 |
+
class IndexPutFirstAxis(torch.autograd.Function):
    """Differentiable row-scatter along the first axis (inverse of IndexFirstAxis).

    Produces a zero tensor of ``first_axis_dim`` rows with ``values`` written
    at ``indices``: ``output[indices] = values``.
    """

    @staticmethod
    def forward(
        ctx,
        values: torch.Tensor,
        indices: torch.Tensor,
        first_axis_dim
    ) -> torch.Tensor:
        ctx.save_for_backward(indices)
        assert indices.ndim == 1
        assert values.ndim >= 2
        padded = torch.zeros(
            first_axis_dim, *values.shape[1:], device=values.device, dtype=values.dtype
        )
        padded[indices] = values
        return padded

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor) -> Tuple[torch.Tensor, None, None]:
        (indices,) = ctx.saved_tensors
        # Only the rows that received values carry gradient back to `values`;
        # `indices` and `first_axis_dim` are non-differentiable.
        return grad_output[indices], None, None
|
| 138 |
+
|
| 139 |
+
|
| 140 |
+
index_put_first_axis = IndexPutFirstAxis.apply
|
| 141 |
+
|
| 142 |
+
|
| 143 |
+
def pad_input(inputs: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int) -> torch.Tensor:
    """Re-insert padding: scatter unpadded rows back into a (batch, seqlen, ...) grid.

    Arguments:
        inputs: (total_nnz, ...) unpadded token tensor.
        indices: (total_nnz,), i.e.
            `indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()`
        batch: int batch_size
        seqlen: int max sequence length

    Returns:
        inputs: (batch, seqlen, ...) with zeros at the padded positions.
    """
    # Scatter into a flat (batch * seqlen) buffer, then fold back to 2-D layout.
    return index_put_first_axis(inputs, indices, batch * seqlen).view(
        batch, seqlen, *inputs.shape[1:]
    )
|
| 157 |
+
|
| 158 |
+
|
| 159 |
+
def rotate_half(x):
    """Rotate the two halves of the last dimension: (x1, x2) -> (-x2, x1)."""
    half = x.shape[-1] // 2
    first, second = x[..., :half], x[..., half:]
    return torch.cat((-second, first), dim=-1)
|
| 164 |
+
|
| 165 |
+
|
| 166 |
+
def apply_rotary_pos_emb(q, k, cos, sin):
    """Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`torch.Tensor`): The query tensor.
        k (`torch.Tensor`): The key tensor.
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
    Returns:
        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
    # Align the cached tables' dtype with the activations before mixing.
    cos = cos.to(q.dtype)
    sin = sin.to(q.dtype)
    rotated_q = q * cos + rotate_half(q) * sin
    rotated_k = k * cos + rotate_half(k) * sin
    return rotated_q, rotated_k
|
| 181 |
+
|
| 182 |
+
|
| 183 |
+
class RotaryEmbedding(torch.nn.Module):
    """Rotary position embedding (RoPE) with a lazily grown cos/sin cache.

    The cache covers `max_position_embeddings` positions at construction and
    is rebuilt on demand when `forward` is asked for a longer sequence.
    """

    def __init__(self, dim, max_position_embeddings=512, base=10000.0, device=None):
        super().__init__()

        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        # Inverse frequencies: base^(-2i/dim) for i in [0, dim/2).
        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

        # Build here to make `torch.jit.trace` work.
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
        )

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        # (Re)build the cos/sin lookup tables for positions [0, seq_len).
        self.max_seq_len_cached = seq_len
        # Angles computed in float32 for precision; cast to `dtype` at the end.
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)

    def forward(self, x, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        # Grow the cache when a longer sequence than precomputed is requested.
        # NOTE(review): `seq_len` defaults to None, which would raise on the
        # comparison below — callers appear to always pass it; confirm.
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)

        return (
            self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
            self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
        )
|
| 217 |
+
|
| 218 |
+
|
| 219 |
+
class NTKScalingRotaryEmbedding(RotaryEmbedding):
    """RotaryEmbedding extended with fixed and mixed NTK scaling. https://kexue.fm/archives/9706 """

    def __init__(self, dim, max_position_embeddings=512, base=10000, device=None, scaling_factor=1.0, mixed_b=None):
        # scaling_factor / mixed_b must be set before super().__init__ because the
        # parent constructor calls _set_cos_sin_cache, which reads them.
        self.scaling_factor = scaling_factor
        self.mixed_b = mixed_b
        super().__init__(dim, max_position_embeddings, base, device)
        # Pre-build the cache for the NTK-extended context length.
        # NOTE(review): scaling_factor is a float, so max_seq_len_cached becomes a
        # float here; torch.arange accepts that, but an int cast would be cleaner.
        max_position_embeddings = max_position_embeddings * self.scaling_factor
        self._set_cos_sin_cache(max_position_embeddings, self.inv_freq.device, torch.get_default_dtype())

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        """Rebuild the cos/sin tables, applying NTK scaling once seq_len exceeds
        the original trained length. Equation numbers refer to
        https://kexue.fm/archives/9706."""
        self.max_seq_len_cached = seq_len

        if seq_len > self.max_position_embeddings:
            # Fixed NTK enlarges the base; mixed NTK keeps the base and instead
            # rescales each frequency individually below.
            base = self.base * (self.scaling_factor if self.mixed_b is None else 1)
            inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))

            if self.mixed_b is None:
                inv_freq = inv_freq / self.scaling_factor ** (2 / self.dim)  # (6)
            else:
                # Per-frequency scaling exponents for mixed NTK.
                a = torch.tensor(self.scaling_factor).log() / (self.dim / 2) ** self.mixed_b  # (13)
                lambda_1_m = (a * torch.arange(1, self.dim // 2 + 1).float().to(device) ** self.mixed_b).exp()  # (12)
                inv_freq = inv_freq / lambda_1_m  # (10)

            self.register_buffer("inv_freq", inv_freq, persistent=False)

        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
|
| 252 |
+
|
| 253 |
+
|
| 254 |
+
class RMSNorm(nn.Module):
    """Root-mean-square layer normalization (equivalent to T5LayerNorm).

    Normalizes by the RMS over the last dimension (no mean subtraction, no
    bias) and applies a learned per-feature scale.
    """

    def __init__(self, hidden_size, eps=1e-6):
        """
        RMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        # Compute the statistics in float32 for numerical stability, then cast back.
        orig_dtype = hidden_states.dtype
        x = hidden_states.to(torch.float32)
        mean_square = x.pow(2).mean(dim=-1, keepdim=True)
        normalized = x * torch.rsqrt(mean_square + self.variance_epsilon)
        return self.weight * normalized.to(orig_dtype)
|
| 269 |
+
|
| 270 |
+
|
| 271 |
+
# Maps config.layer_norm_type to the normalization module class used by NewLayer.
LAYER_NORM = {
    'layer_norm': nn.LayerNorm,
    'rms_norm': RMSNorm
}
|
| 275 |
+
|
| 276 |
+
|
| 277 |
+
class NewEmbeddings(nn.Module):
    """
    Embedding and Unpadding.

    Produces token embeddings plus optional token-type and position
    information. When ``unpad_inputs`` is True, padding positions are stripped
    and all sequences in the batch are concatenated into a single row of shape
    [1, total_tokens].
    """

    def __init__(self, config: NewConfig):
        super().__init__()
        self.padding_idx = config.pad_token_id
        self.word_embeddings = nn.Embedding(
            config.vocab_size, config.hidden_size, padding_idx=self.padding_idx
        )

        self.position_embedding_type = config.position_embedding_type
        if self.position_embedding_type == 'absolute':
            # Learned absolute position embeddings (BERT-style).
            self.position_embeddings = nn.Embedding(
                config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
            )
        elif self.position_embedding_type == 'rope':
            self._init_rope(config)
        else:
            # Only 'absolute' and 'rope' are supported.
            raise ValueError

        self.type_vocab_size = config.type_vocab_size
        if self.type_vocab_size > 0:
            self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        # position_ids is contiguous in memory and excluded when serialized
        self.register_buffer(
            "position_ids", torch.arange(config.max_position_embeddings), persistent=False
        )

    def _init_rope(self, config):
        """Build the rotary-embedding module; config.rope_scaling selects NTK scaling."""
        kwargs = dict(
            dim=int(config.hidden_size / config.num_attention_heads),
            max_position_embeddings=config.max_position_embeddings,
            base=config.rope_theta
        )
        if config.rope_scaling is None:
            self.rotary_emb = RotaryEmbedding(**kwargs)
        else:
            kwargs.update(scaling_factor=config.rope_scaling["factor"])
            scaling_type = config.rope_scaling["type"]
            if scaling_type == 'ntk':
                kwargs.update(mixed_b=config.rope_scaling.get('mixed_b', None))
                self.rotary_emb = NTKScalingRotaryEmbedding(**kwargs)
            # elif scaling_type == "linear":
            #     self.rotary_emb = LinearScalingRotaryEmbedding(**kwargs)
            # elif scaling_type == "dynamic":
            #     self.rotary_emb = DynamicNTKScalingRotaryEmbedding(**kwargs)
            else:
                raise ValueError(f"Unknown RoPE scaling type {scaling_type}")

    def forward(
        self,
        unpad_inputs: bool,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        length: Optional[List[int]] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor, Optional[Tuple], Optional[List[int]]]:
        """
        Returns ``(embeddings, attention_mask, rope_embeds, length)``.
        ``rope_embeds`` is a (cos, sin) pair when RoPE is used, else None.
        When ``unpad_inputs`` is True, ``embeddings`` is [1, total_tokens, hidden].
        """
        if inputs_embeds is None:
            device, input_shape = input_ids.device, input_ids.shape
        else:
            device, input_shape = inputs_embeds.device, inputs_embeds.shape[:2]
        batch_size, seq_length = input_shape

        # Set attention_mask if it's None
        if attention_mask is None:
            attention_mask = torch.ones(input_shape, device=device)
            if length is not None:
                # Zero out positions beyond each sequence's true length.
                for i, l in enumerate(length):
                    attention_mask[i, l:] = 0

        # Set attention_mask_bool for unpadding
        if unpad_inputs:
            attention_mask_bool = attention_mask.bool()
            if length is None:
                length = attention_mask.sum(-1).tolist()

        # Get word embeddings
        if inputs_embeds is None:
            if unpad_inputs:
                # Keep only non-pad tokens: [1, total_tokens]
                input_ids = input_ids[attention_mask_bool].unsqueeze(0)
            inputs_embeds = self.word_embeddings(input_ids)
        else:
            if unpad_inputs:
                inputs_embeds = inputs_embeds[attention_mask_bool].unsqueeze(0)
        embeddings = inputs_embeds

        # Set and unpad position_ids
        if position_ids is None:
            if seq_length > self.position_ids.size(0):
                # Grow the cached position_ids buffer on demand.
                self.register_buffer(
                    "position_ids", torch.arange(seq_length, device=embeddings.device), persistent=False
                )
            if unpad_inputs:
                # [1, cumsum_seq_len]
                position_ids = torch.cat([self.position_ids[:l] for l in length]).unsqueeze(0)
            else:
                # [bs, seq_len]
                position_ids = self.position_ids[:seq_length].expand(batch_size, -1)
        elif unpad_inputs:
            position_ids = position_ids[attention_mask_bool].unsqueeze(0)  # [1, cumsum_seq_len]

        # Compute rotary embedding
        if self.position_embedding_type == 'rope':
            rope_cos, rope_sin = self.rotary_emb(inputs_embeds, seq_len=seq_length)
            rope_cos = rope_cos[position_ids].unsqueeze(2)  # [bs, seq_len, 1, dim]
            rope_sin = rope_sin[position_ids].unsqueeze(2)  # [bs, seq_len, 1, dim]
            rope_embeds = rope_cos, rope_sin
        else:
            rope_embeds = None

        if self.type_vocab_size > 0:
            if token_type_ids is None:
                # Default: all tokens are segment 0 (same shape as position_ids).
                token_type_ids = position_ids.mul(0)
            else:
                if self.type_vocab_size < 2:
                    # NOTE(review): this mutates the caller's token_type_ids tensor
                    # in place; a non-in-place `mul(0)` would avoid the side effect.
                    token_type_ids.mul_(0)
                if unpad_inputs:
                    token_type_ids = token_type_ids[attention_mask_bool].unsqueeze(0)

            token_type_embeddings = self.token_type_embeddings(token_type_ids)
            embeddings = embeddings + token_type_embeddings

        # BERT position
        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            embeddings = embeddings + position_embeddings

        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)

        return embeddings, attention_mask, rope_embeds, length
|
| 419 |
+
|
| 420 |
+
|
| 421 |
+
class NewAttention(nn.Module):
    """Multi-head self-attention with optional packed QKV projection, RoPE,
    logn attention scaling, and an optional xFormers memory-efficient backend.
    """

    def __init__(self, config: NewConfig, pack_qkv=None, use_memory_efficient_attention=None):
        super().__init__()
        self.config = config
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
                f"heads ({config.num_attention_heads})"
            )

        self.hidden_size = config.hidden_size
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        # Constructor args override the config defaults when given.
        if pack_qkv is None:
            pack_qkv = config.pack_qkv
        self.pack_qkv = pack_qkv

        # Packed QKV uses a single fused projection; otherwise three separate ones.
        if self.pack_qkv:
            self.qkv_proj = nn.Linear(config.hidden_size, self.all_head_size * 3, bias=True)
        else:
            self.q_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
            self.k_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
            self.v_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)

        if use_memory_efficient_attention is None:
            use_memory_efficient_attention = self.config.use_memory_efficient_attention
        self.use_memory_efficient_attention = use_memory_efficient_attention
        # xops is the optional xformers.ops module; None when not installed.
        self.memory_efficient_attention = None if xops is None else xops.memory_efficient_attention
        if self.use_memory_efficient_attention:
            assert self.memory_efficient_attention is not None, 'please install xformers'

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_bias: torch.FloatTensor,
        rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
        padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
        attention_scale: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        output_attentions: Optional[bool] = False,
        qkv_inputs: Optional[Tuple] = None,  # For RetroMAE
    ) -> Tuple[torch.Tensor, ...]:
        """Returns (attn_output,) or (attn_output, attention_probs) when
        output_attentions is True."""
        shape_hd = (self.num_attention_heads, self.attention_head_size)
        # qkv: project, then reshape last dim to (n_head, head_dim).
        if self.pack_qkv and qkv_inputs is None:
            qkv_pack = self.qkv_proj(hidden_states).split(self.all_head_size, dim=-1)
        else:
            if qkv_inputs is None:
                qkv_inputs = (hidden_states, hidden_states, hidden_states)
            qkv_pack = [
                getattr(self, n + '_proj')(s) for s, n in zip(qkv_inputs, 'qkv')
            ]
        query_states, key_states, value_states = [t.view(t.shape[:-1] + shape_hd) for t in qkv_pack]

        if self.config.position_embedding_type == 'rope':
            # rope_embeds is the (cos, sin) pair produced by NewEmbeddings.
            query_states, key_states = apply_rotary_pos_emb(query_states, key_states, *rope_embeds)

        dtype = query_states.dtype

        if self.config.logn_attention_scale and attention_scale is not None:
            # https://kexue.fm/archives/8823
            query_states = query_states * attention_scale.to(dtype)

        if padding_inputs is not None:
            # Inputs arrived unpadded ([1, total_tokens, ...]); re-pad them to
            # [batch, seqlen, ...] for the eager/SDPA attention path.
            query_states = pad_input(query_states.squeeze(), *padding_inputs)
            key_states = pad_input(key_states.squeeze(), *padding_inputs)
            value_states = pad_input(value_states.squeeze(), *padding_inputs)

        if self.use_memory_efficient_attention:
            assert self.memory_efficient_attention is not None, "xformers is not loaded"
            assert output_attentions is False, "memory_efficient_attention do not output attentions"
            assert head_mask is None, "Not support yet"
            attention_probs = None
            # attention_bias may also be an xformers BlockDiagonalMask (not a tensor).
            if torch.is_tensor(attention_bias):
                attention_bias = attention_bias.to(dtype)
            context_layer = self.memory_efficient_attention(
                query_states,
                key_states,
                value_states,
                attn_bias=attention_bias,
                p=self.dropout.p
            )
        else:
            if output_attentions and isinstance(self, NewSdpaAttention):
                raise RuntimeError("SDPA do not output attentions")
            context_layer, attention_probs = self._attention(
                query_states, key_states, value_states, attention_bias, head_mask
            )

        if padding_inputs is not None:
            # Strip padding again so downstream sees [1, total_tokens, ...].
            context_layer = unpad_input(context_layer, indices=padding_inputs[0])

        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(new_context_layer_shape)

        # output proj
        attn_output = self.o_proj(context_layer)

        # add attentions if we output them
        outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)
        return outputs

    def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
        """
        Eager (materialized-scores) scaled dot-product attention.

        Args:
            q/k/v: (B, L, n_head, head_dim),
        Returns:
            attn_output: (B L, n_head, head_dim)
        """
        # Move heads before sequence: (B, n_head, L, head_dim).
        query_states = query_states.transpose(1, 2)
        key_states = key_states.transpose(1, 2)
        value_states = value_states.transpose(1, 2)
        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2))

        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_bias is not None:
            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
            attention_scores = attention_scores + attention_bias

        # Normalize the attention scores to probabilities.
        attention_probs = nn.functional.softmax(attention_scores, dim=-1)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        if self.dropout.p > 0:
            attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_states)

        # Back to (B, L, n_head, head_dim).
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        return context_layer, attention_probs
|
| 562 |
+
|
| 563 |
+
|
| 564 |
+
class NewSdpaAttention(NewAttention):
    """
    New attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
    `NewAttention` as the weights of the module stays untouched. The only changes are on the forward pass to adapt to
    SDPA API.
    """

    def __init__(self, config: NewConfig, **kwargs):
        super().__init__(config, **kwargs)
        # torch.backends.cuda.enable_mem_efficient_sdp(False)
        # logger.warning(
        #     "Disable memory efficient attention kernel for `NewSdpaAttention`, you can set "
        #     "`use_memory_efficient_attention=True` if it expected to use."
        # )

    def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
        """Run fused SDPA on (B, L, n_head, d) inputs.

        Returns (context of shape (B, L, n_head, d), None) — SDPA never
        materializes the attention probabilities.
        """
        # SDPA expects (B, n_head, L, d); dropout is only active while training.
        q, k, v = (t.transpose(1, 2) for t in (query_states, key_states, value_states))
        drop_p = self.dropout.p if self.training else 0.0
        context = torch.nn.functional.scaled_dot_product_attention(
            q,
            k,
            v,
            attn_mask=attention_bias,
            dropout_p=drop_p,
        )
        # Restore the (B, L, n_head, d) layout expected by the caller.
        return context.transpose(1, 2).contiguous(), None
|
| 588 |
+
|
| 589 |
+
|
| 590 |
+
# Dispatch table from config._attn_implementation to the attention class.
NEW_ATTENTION_CLASSES = {
    "eager": NewAttention,
    # "flash_attention_2": , # TODO
    "sdpa": NewSdpaAttention,
}
|
| 595 |
+
|
| 596 |
+
|
| 597 |
+
class NewGatedMLP(nn.Module):
    """
    GLU Variants Improve Transformer.

    Gated MLP: ``down_proj(act(gate) * up)``, with the up and gate projections
    fused into one linear layer and split along the last dimension.
    """

    def __init__(self, config: NewConfig):
        super().__init__()
        self.intermediate_size = config.intermediate_size
        self.up_gate_proj = nn.Linear(config.hidden_size, self.intermediate_size * 2, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, config.hidden_size, bias=True)
        self.act_fn = ACT2FN[config.hidden_act]
        # Dropout is optional; skipped entirely when the probability is zero.
        self.hidden_dropout = (
            nn.Dropout(config.hidden_dropout_prob) if config.hidden_dropout_prob > 0 else None
        )

    def forward(self, hidden_states):
        fused = self.up_gate_proj(hidden_states)
        # First half of the fused projection is "up", second half is "gate".
        up, gate = fused.split(self.intermediate_size, dim=-1)
        activated = self.act_fn(gate) * up
        if self.hidden_dropout is not None:
            activated = self.hidden_dropout(activated)
        return self.down_proj(activated)
|
| 622 |
+
|
| 623 |
+
|
| 624 |
+
class NewLayer(nn.Module):
    """One transformer block: self-attention then gated MLP, each followed by a
    residual connection and a post-norm (LayerNorm or RMSNorm per config)."""

    def __init__(
        self,
        config: NewConfig,
        pack_qkv=None,
        use_memory_efficient_attention=None,
        attn_implementation=None
    ):
        super().__init__()
        # Constructor args override the config defaults when given.
        if attn_implementation is None:
            attn_implementation = config._attn_implementation
        if use_memory_efficient_attention is None:
            use_memory_efficient_attention = config.use_memory_efficient_attention
        if use_memory_efficient_attention:
            # xformers path lives in the eager class; force it.
            if attn_implementation != 'eager':
                logger.warning_once(f"Override {attn_implementation=} to 'eager' as {use_memory_efficient_attention=}")
                attn_implementation = 'eager'  # Since it will be SDPA by default for torch>=2.1.1
        self.attention = NEW_ATTENTION_CLASSES[attn_implementation](
            config, pack_qkv=pack_qkv, use_memory_efficient_attention=use_memory_efficient_attention
        )
        self.mlp = NewGatedMLP(config)

        ln_class = LAYER_NORM[config.layer_norm_type]
        self.attn_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
        self.mlp_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)

        if config.hidden_dropout_prob > 0:
            self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
        else:
            self.hidden_dropout = None

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_bias: torch.FloatTensor,
        rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
        padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
        attention_scale: Optional[torch.FloatTensor] = None,
        subset_indices: Optional[torch.LongTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        output_attentions: Optional[bool] = False,
        qkv_inputs: Optional[Tuple] = None,  # For RetroMAE
    ) -> Tuple[torch.Tensor, ...]:
        """Returns (hidden_states,) plus attention probs when output_attentions."""
        # Multi head self attention
        # For RetroMAE, the residual comes from the first qkv input stream.
        residual = hidden_states if qkv_inputs is None else qkv_inputs[0]
        attention_outputs = self.attention(
            hidden_states,
            attention_bias,
            rope_embeds,
            padding_inputs,
            attention_scale,
            head_mask,
            output_attentions=output_attentions,
            qkv_inputs=qkv_inputs,
        )
        hidden_states = attention_outputs[0]
        if self.hidden_dropout is not None:
            hidden_states = self.hidden_dropout(hidden_states)
        hidden_states = residual + hidden_states

        # In pretraining, after the attention of last layer, we only need the masked tokens.
        if subset_indices is not None:
            hidden_states = hidden_states[subset_indices]

        # Post-norm: normalization is applied after the residual add.
        hidden_states = self.attn_ln(hidden_states)

        # Fully Connected
        residual = hidden_states
        hidden_states = self.mlp(hidden_states)
        if self.hidden_dropout is not None:
            hidden_states = self.hidden_dropout(hidden_states)
        hidden_states = residual + hidden_states
        hidden_states = self.mlp_ln(hidden_states)

        # add self attentions if we output attention weights
        outputs = (hidden_states,) + attention_outputs[1:]
        return outputs
|
| 701 |
+
|
| 702 |
+
|
| 703 |
+
class NewEncoder(nn.Module):
    """Stack of NewLayer blocks with optional gradient checkpointing."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.layer = nn.ModuleList([NewLayer(config) for _ in range(config.num_hidden_layers)])
        # Toggled externally (e.g. by PreTrainedModel.gradient_checkpointing_enable).
        self.gradient_checkpointing = False

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_bias: Optional[torch.FloatTensor] = None,
        rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
        padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
        attention_scale: Optional[torch.FloatTensor] = None,
        subset_indices: Optional[torch.LongTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        output_attentions: Optional[bool] = False,
        output_hidden_states: Optional[bool] = False,
        return_dict: Optional[bool] = True,
    ) -> Union[Tuple[torch.Tensor], BaseModelOutput]:
        """Run all layers; returns BaseModelOutput (or a tuple when return_dict=False)."""
        all_hidden_states = () if output_hidden_states else None
        all_self_attentions = () if output_attentions else None

        for i, layer_module in enumerate(self.layer):
            if output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)

            # subset_indices only applies to the final layer (pretraining:
            # keep just the masked-token positions after the last attention).
            if i >= len(self.layer) - 1:
                layer_subset_indices = subset_indices
            else:
                layer_subset_indices = None

            layer_head_mask = head_mask[i] if head_mask is not None else None

            if self.gradient_checkpointing and self.training:
                # Recompute activations in backward to save memory.
                layer_outputs = self._gradient_checkpointing_func(
                    layer_module.__call__,
                    hidden_states,
                    attention_bias,
                    rope_embeds,
                    padding_inputs,
                    attention_scale,
                    layer_subset_indices,
                    layer_head_mask,
                )
            else:
                layer_outputs = layer_module(
                    hidden_states,
                    attention_bias,
                    rope_embeds,
                    padding_inputs,
                    attention_scale,
                    layer_subset_indices,
                    layer_head_mask,
                    output_attentions,
                )

            hidden_states = layer_outputs[0]
            if output_attentions:
                all_self_attentions = all_self_attentions + (layer_outputs[1],)

        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        if not return_dict:
            return tuple(
                v
                for v in [
                    hidden_states,
                    all_hidden_states,
                    all_self_attentions,
                ]
                if v is not None
            )
        return BaseModelOutput(
            last_hidden_state=hidden_states,
            hidden_states=all_hidden_states,
            attentions=all_self_attentions,
        )
|
| 782 |
+
|
| 783 |
+
|
| 784 |
+
# Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->New
class NewPooler(nn.Module):
    """Pools a sequence by passing the first ([CLS]) token's hidden state
    through a tanh-activated dense layer."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Pooling here is simply selecting the first token's representation.
        cls_state = hidden_states[:, 0]
        return self.activation(self.dense(cls_state))
|
| 798 |
+
|
| 799 |
+
|
| 800 |
+
class NewPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = NewConfig
    base_model_prefix = "new"
    supports_gradient_checkpointing = True
    _supports_sdpa = True

    def _init_weights(self, module):
        """Initialize the weights"""
        if isinstance(module, nn.Linear):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                # Keep the padding-token embedding at zero.
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            # NOTE(review): RMSNorm modules are not handled here; they keep the
            # ones-initialized weight from their own constructor.
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
|
| 826 |
+
|
| 827 |
+
|
| 828 |
+
class NewModel(NewPreTrainedModel):
    """
    The bare New Model transformer outputting raw hidden-states without any specific head on top.
    """

    def __init__(self, config: NewConfig, add_pooling_layer=False):
        super().__init__(config)
        self.config = config

        self.embeddings = NewEmbeddings(config)
        self.encoder = NewEncoder(config)

        self.pooler = NewPooler(config) if add_pooling_layer else None

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.embeddings.word_embeddings

    def set_input_embeddings(self, value):
        self.embeddings.word_embeddings = value

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        length: Optional[List[int]] = None,
        subset_indices: Optional[torch.LongTensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        unpad_inputs: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
        r"""
        length (`list` of length `batch_size`, *optional*):
            If is `None`, return padded `last_hidden_state`.
        subset_indices ():
            pass
        unpad_inputs (`bool`, *optional*):
            pass
        """
        # Resolve per-call options against config defaults.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        unpad_inputs = unpad_inputs if unpad_inputs is not None else self.config.unpad_inputs
        # When no explicit lengths are given, the output is re-padded at the end.
        output_padded = length is None

        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
            input_shape = input_ids.size()
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        # TODO: not used
        # # Prepare head mask if needed
        # # 1.0 in head_mask indicate we keep the head
        # # attention_probs has shape bsz x n_heads x N x N
        # # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        # head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

        # Get embeddings, may unpad them
        (embedding_output, attention_mask, rope_embeds, length) = self.embeddings(
            unpad_inputs,
            input_ids=input_ids,
            attention_mask=attention_mask,
            length=length,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds
        )

        batch_size, seq_length = input_shape
        if unpad_inputs and self.config.use_memory_efficient_attention:
            # Block-diagonal bias lets xformers attend within each original
            # sequence of the concatenated (unpadded) batch.
            attention_bias = xops.fmha.attn_bias.BlockDiagonalMask.from_seqlens(length)
        else:
            # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
            # ourselves in which case we just need to make it broadcastable to all heads.
            attention_bias = self.get_extended_attention_mask(attention_mask, input_shape)
            if self.config.use_memory_efficient_attention:
                # Invalid shape for attention bias: torch.Size([48, 1, 1, 512]) (expected (48, 12, 512, 512))
                attention_bias = attention_bias.expand(-1, self.config.num_attention_heads, seq_length, -1)

        padding_inputs = None
        if unpad_inputs and (output_padded or not self.config.use_memory_efficient_attention):
            # Flat indices of real (non-pad) tokens; used to re-pad later.
            indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
            if not self.config.use_memory_efficient_attention:
                padding_inputs = (indices, *input_shape)

        attention_scale = None
        if self.config.logn_attention_scale:
            logger.warning_once("TODO: logn_attention_scale")
            # # attention scale log_512(input_len)
            # attention_scale = attention_mask.sum(1).log() / torch.tensor(self.config.max_position_embeddings).log()
            # # inference-time logn scale need clip 1
            # if self.config.logn_attention_clip1:
            #     attention_scale.clip_(1)
            # attention_scale = attention_scale[:, None, None, None]
            # else:
            #     attention_scale = None

        encoder_outputs = self.encoder(
            embedding_output,
            attention_bias=attention_bias,
            rope_embeds=rope_embeds,
            padding_inputs=padding_inputs,
            attention_scale=attention_scale,
            subset_indices=subset_indices,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]
        if unpad_inputs and output_padded:
            # `indices` is always defined here: output_padded=True implies the
            # unpad branch above that computed it was taken.
            sequence_output = pad_input(
                sequence_output.squeeze(), indices, batch_size, seq_length
            )

        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

        if not return_dict:
            return (sequence_output, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPooling(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
        )
|
| 969 |
+
|
| 970 |
+
|
| 971 |
+
class NewLMPredictionHead(nn.Module):
|
| 972 |
+
def __init__(self, config):
|
| 973 |
+
super().__init__()
|
| 974 |
+
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
|
| 975 |
+
self.transform_act_fn = ACT2FN[config.hidden_act]
|
| 976 |
+
self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
| 977 |
+
|
| 978 |
+
# The output weights are the same as the input embeddings, but there is
|
| 979 |
+
# an output-only bias for each token.
|
| 980 |
+
self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
|
| 981 |
+
|
| 982 |
+
def forward(self, hidden_states):
|
| 983 |
+
hidden_states = self.dense(hidden_states)
|
| 984 |
+
hidden_states = self.transform_act_fn(hidden_states)
|
| 985 |
+
hidden_states = self.norm(hidden_states)
|
| 986 |
+
hidden_states = self.decoder(hidden_states)
|
| 987 |
+
return hidden_states
|
| 988 |
+
|
| 989 |
+
|
| 990 |
+
class NewForMaskedLM(NewPreTrainedModel):
|
| 991 |
+
_tied_weights_keys = ["lm_head.decoder.bias", "lm_head.decoder.weight"]
|
| 992 |
+
|
| 993 |
+
def __init__(self, config: NewConfig):
|
| 994 |
+
super().__init__(config)
|
| 995 |
+
self.new = NewModel(config, add_pooling_layer=False)
|
| 996 |
+
self.lm_head = NewLMPredictionHead(config)
|
| 997 |
+
self.loss_fct = nn.CrossEntropyLoss()
|
| 998 |
+
|
| 999 |
+
# Initialize weights and apply final processing
|
| 1000 |
+
self.post_init()
|
| 1001 |
+
|
| 1002 |
+
def get_output_embeddings(self):
|
| 1003 |
+
return self.lm_head.decoder
|
| 1004 |
+
|
| 1005 |
+
def set_output_embeddings(self, new_embeddings):
|
| 1006 |
+
self.lm_head.decoder = new_embeddings
|
| 1007 |
+
|
| 1008 |
+
def forward(
|
| 1009 |
+
self,
|
| 1010 |
+
input_ids: Optional[torch.Tensor] = None,
|
| 1011 |
+
attention_mask: Optional[torch.Tensor] = None,
|
| 1012 |
+
token_type_ids: Optional[torch.Tensor] = None,
|
| 1013 |
+
position_ids: Optional[torch.Tensor] = None,
|
| 1014 |
+
head_mask: Optional[torch.Tensor] = None,
|
| 1015 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
| 1016 |
+
labels: Optional[torch.Tensor] = None,
|
| 1017 |
+
output_attentions: Optional[bool] = None,
|
| 1018 |
+
output_hidden_states: Optional[bool] = None,
|
| 1019 |
+
return_dict: Optional[bool] = None,
|
| 1020 |
+
unpad_inputs: Optional[bool] = None,
|
| 1021 |
+
) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
|
| 1022 |
+
r"""
|
| 1023 |
+
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
|
| 1024 |
+
Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
|
| 1025 |
+
config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
|
| 1026 |
+
loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
|
| 1027 |
+
"""
|
| 1028 |
+
|
| 1029 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
| 1030 |
+
|
| 1031 |
+
if labels is None or not self.new.config.unpad_inputs:
|
| 1032 |
+
length = None
|
| 1033 |
+
subset_indices = None
|
| 1034 |
+
else:
|
| 1035 |
+
length = attention_mask.sum(-1).tolist()
|
| 1036 |
+
labels = labels[attention_mask.bool()].unsqueeze(0)
|
| 1037 |
+
subset_indices = labels > -100
|
| 1038 |
+
|
| 1039 |
+
outputs = self.new(
|
| 1040 |
+
input_ids,
|
| 1041 |
+
attention_mask=attention_mask,
|
| 1042 |
+
length=length,
|
| 1043 |
+
subset_indices=subset_indices,
|
| 1044 |
+
token_type_ids=token_type_ids,
|
| 1045 |
+
position_ids=position_ids,
|
| 1046 |
+
head_mask=head_mask,
|
| 1047 |
+
inputs_embeds=inputs_embeds,
|
| 1048 |
+
output_attentions=output_attentions,
|
| 1049 |
+
output_hidden_states=output_hidden_states,
|
| 1050 |
+
return_dict=return_dict,
|
| 1051 |
+
unpad_inputs=unpad_inputs,
|
| 1052 |
+
)
|
| 1053 |
+
|
| 1054 |
+
sequence_output = outputs[0]
|
| 1055 |
+
prediction_scores = self.lm_head(sequence_output)
|
| 1056 |
+
|
| 1057 |
+
masked_lm_loss = None
|
| 1058 |
+
if labels is not None:
|
| 1059 |
+
if subset_indices is None:
|
| 1060 |
+
mask = attention_mask.bool()
|
| 1061 |
+
prediction_scores = prediction_scores[mask]
|
| 1062 |
+
labels = labels[mask]
|
| 1063 |
+
else:
|
| 1064 |
+
labels = labels[subset_indices]
|
| 1065 |
+
masked_lm_loss = self.loss_fct(prediction_scores, labels)
|
| 1066 |
+
|
| 1067 |
+
if not return_dict:
|
| 1068 |
+
output = (prediction_scores,) + outputs[2:]
|
| 1069 |
+
return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
|
| 1070 |
+
|
| 1071 |
+
return MaskedLMOutput(
|
| 1072 |
+
loss=masked_lm_loss,
|
| 1073 |
+
logits=prediction_scores,
|
| 1074 |
+
hidden_states=outputs.hidden_states,
|
| 1075 |
+
attentions=outputs.attentions,
|
| 1076 |
+
)
|
| 1077 |
+
|
| 1078 |
+
|
| 1079 |
+
class NewForSequenceClassification(NewPreTrainedModel):
|
| 1080 |
+
def __init__(self, config):
|
| 1081 |
+
super().__init__(config)
|
| 1082 |
+
self.num_labels = config.num_labels
|
| 1083 |
+
self.config = config
|
| 1084 |
+
|
| 1085 |
+
self.new = NewModel(config, add_pooling_layer=True)
|
| 1086 |
+
classifier_dropout = (
|
| 1087 |
+
config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
|
| 1088 |
+
)
|
| 1089 |
+
self.dropout = nn.Dropout(classifier_dropout)
|
| 1090 |
+
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
|
| 1091 |
+
|
| 1092 |
+
# Initialize weights and apply final processing
|
| 1093 |
+
self.post_init()
|
| 1094 |
+
|
| 1095 |
+
def forward(
|
| 1096 |
+
self,
|
| 1097 |
+
input_ids: Optional[torch.Tensor] = None,
|
| 1098 |
+
attention_mask: Optional[torch.Tensor] = None,
|
| 1099 |
+
token_type_ids: Optional[torch.Tensor] = None,
|
| 1100 |
+
position_ids: Optional[torch.Tensor] = None,
|
| 1101 |
+
head_mask: Optional[torch.Tensor] = None,
|
| 1102 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
| 1103 |
+
labels: Optional[torch.Tensor] = None,
|
| 1104 |
+
output_attentions: Optional[bool] = None,
|
| 1105 |
+
output_hidden_states: Optional[bool] = None,
|
| 1106 |
+
return_dict: Optional[bool] = None,
|
| 1107 |
+
unpad_inputs: Optional[bool] = None,
|
| 1108 |
+
) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
|
| 1109 |
+
r"""
|
| 1110 |
+
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
|
| 1111 |
+
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
|
| 1112 |
+
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
|
| 1113 |
+
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
|
| 1114 |
+
"""
|
| 1115 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
| 1116 |
+
|
| 1117 |
+
outputs = self.new(
|
| 1118 |
+
input_ids,
|
| 1119 |
+
attention_mask=attention_mask,
|
| 1120 |
+
token_type_ids=token_type_ids,
|
| 1121 |
+
position_ids=position_ids,
|
| 1122 |
+
head_mask=head_mask,
|
| 1123 |
+
inputs_embeds=inputs_embeds,
|
| 1124 |
+
output_attentions=output_attentions,
|
| 1125 |
+
output_hidden_states=output_hidden_states,
|
| 1126 |
+
return_dict=return_dict,
|
| 1127 |
+
unpad_inputs=unpad_inputs,
|
| 1128 |
+
)
|
| 1129 |
+
|
| 1130 |
+
pooled_output = outputs[1]
|
| 1131 |
+
|
| 1132 |
+
pooled_output = self.dropout(pooled_output)
|
| 1133 |
+
logits = self.classifier(pooled_output)
|
| 1134 |
+
|
| 1135 |
+
loss = None
|
| 1136 |
+
if labels is not None:
|
| 1137 |
+
if self.config.problem_type is None:
|
| 1138 |
+
if self.num_labels == 1:
|
| 1139 |
+
self.config.problem_type = "regression"
|
| 1140 |
+
elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
|
| 1141 |
+
self.config.problem_type = "single_label_classification"
|
| 1142 |
+
else:
|
| 1143 |
+
self.config.problem_type = "multi_label_classification"
|
| 1144 |
+
|
| 1145 |
+
if self.config.problem_type == "regression":
|
| 1146 |
+
loss_fct = nn.MSELoss()
|
| 1147 |
+
if self.num_labels == 1:
|
| 1148 |
+
loss = loss_fct(logits.squeeze(), labels.squeeze())
|
| 1149 |
+
else:
|
| 1150 |
+
loss = loss_fct(logits, labels)
|
| 1151 |
+
elif self.config.problem_type == "single_label_classification":
|
| 1152 |
+
loss_fct = nn.CrossEntropyLoss()
|
| 1153 |
+
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
|
| 1154 |
+
elif self.config.problem_type == "multi_label_classification":
|
| 1155 |
+
loss_fct = nn.BCEWithLogitsLoss()
|
| 1156 |
+
loss = loss_fct(logits, labels)
|
| 1157 |
+
|
| 1158 |
+
if not return_dict:
|
| 1159 |
+
output = (logits,) + outputs[2:]
|
| 1160 |
+
return ((loss,) + output) if loss is not None else output
|
| 1161 |
+
|
| 1162 |
+
return SequenceClassifierOutput(
|
| 1163 |
+
loss=loss,
|
| 1164 |
+
logits=logits,
|
| 1165 |
+
hidden_states=outputs.hidden_states,
|
| 1166 |
+
attentions=outputs.attentions,
|
| 1167 |
+
)
|
| 1168 |
+
|
| 1169 |
+
|
| 1170 |
+
class NewForMultipleChoice(NewPreTrainedModel):
|
| 1171 |
+
def __init__(self, config):
|
| 1172 |
+
super().__init__(config)
|
| 1173 |
+
|
| 1174 |
+
self.new = NewModel(config, add_pooling_layer=True)
|
| 1175 |
+
classifier_dropout = (
|
| 1176 |
+
config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
|
| 1177 |
+
)
|
| 1178 |
+
self.dropout = nn.Dropout(classifier_dropout)
|
| 1179 |
+
self.classifier = nn.Linear(config.hidden_size, 1)
|
| 1180 |
+
|
| 1181 |
+
# Initialize weights and apply final processing
|
| 1182 |
+
self.post_init()
|
| 1183 |
+
|
| 1184 |
+
def forward(
|
| 1185 |
+
self,
|
| 1186 |
+
input_ids: Optional[torch.Tensor] = None,
|
| 1187 |
+
attention_mask: Optional[torch.Tensor] = None,
|
| 1188 |
+
token_type_ids: Optional[torch.Tensor] = None,
|
| 1189 |
+
position_ids: Optional[torch.Tensor] = None,
|
| 1190 |
+
head_mask: Optional[torch.Tensor] = None,
|
| 1191 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
| 1192 |
+
labels: Optional[torch.Tensor] = None,
|
| 1193 |
+
output_attentions: Optional[bool] = None,
|
| 1194 |
+
output_hidden_states: Optional[bool] = None,
|
| 1195 |
+
return_dict: Optional[bool] = None,
|
| 1196 |
+
unpad_inputs: Optional[bool] = None,
|
| 1197 |
+
) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
|
| 1198 |
+
r"""
|
| 1199 |
+
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
|
| 1200 |
+
Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
|
| 1201 |
+
num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
|
| 1202 |
+
`input_ids` above)
|
| 1203 |
+
"""
|
| 1204 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
| 1205 |
+
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
|
| 1206 |
+
|
| 1207 |
+
input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
|
| 1208 |
+
attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
|
| 1209 |
+
token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
|
| 1210 |
+
position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
|
| 1211 |
+
inputs_embeds = (
|
| 1212 |
+
inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
|
| 1213 |
+
if inputs_embeds is not None
|
| 1214 |
+
else None
|
| 1215 |
+
)
|
| 1216 |
+
|
| 1217 |
+
outputs = self.new(
|
| 1218 |
+
input_ids,
|
| 1219 |
+
attention_mask=attention_mask,
|
| 1220 |
+
token_type_ids=token_type_ids,
|
| 1221 |
+
position_ids=position_ids,
|
| 1222 |
+
head_mask=head_mask,
|
| 1223 |
+
inputs_embeds=inputs_embeds,
|
| 1224 |
+
output_attentions=output_attentions,
|
| 1225 |
+
output_hidden_states=output_hidden_states,
|
| 1226 |
+
return_dict=return_dict,
|
| 1227 |
+
unpad_inputs=unpad_inputs,
|
| 1228 |
+
)
|
| 1229 |
+
|
| 1230 |
+
pooled_output = outputs[1]
|
| 1231 |
+
|
| 1232 |
+
pooled_output = self.dropout(pooled_output)
|
| 1233 |
+
logits = self.classifier(pooled_output)
|
| 1234 |
+
reshaped_logits = logits.view(-1, num_choices)
|
| 1235 |
+
|
| 1236 |
+
loss = None
|
| 1237 |
+
if labels is not None:
|
| 1238 |
+
loss_fct = nn.CrossEntropyLoss()
|
| 1239 |
+
loss = loss_fct(reshaped_logits, labels)
|
| 1240 |
+
|
| 1241 |
+
if not return_dict:
|
| 1242 |
+
output = (reshaped_logits,) + outputs[2:]
|
| 1243 |
+
return ((loss,) + output) if loss is not None else output
|
| 1244 |
+
|
| 1245 |
+
return MultipleChoiceModelOutput(
|
| 1246 |
+
loss=loss,
|
| 1247 |
+
logits=reshaped_logits,
|
| 1248 |
+
hidden_states=outputs.hidden_states,
|
| 1249 |
+
attentions=outputs.attentions,
|
| 1250 |
+
)
|
| 1251 |
+
|
| 1252 |
+
|
| 1253 |
+
@dataclass
|
| 1254 |
+
class NewTokenClassifierOutput(ModelOutput):
|
| 1255 |
+
loss: Optional[torch.FloatTensor] = None
|
| 1256 |
+
logits: torch.FloatTensor = None
|
| 1257 |
+
last_hidden_state: torch.FloatTensor = None
|
| 1258 |
+
hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
|
| 1259 |
+
attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
|
| 1260 |
+
|
| 1261 |
+
|
| 1262 |
+
class NewForTokenClassification(NewPreTrainedModel):
|
| 1263 |
+
def __init__(self, config):
|
| 1264 |
+
super().__init__(config)
|
| 1265 |
+
self.num_labels = config.num_labels
|
| 1266 |
+
|
| 1267 |
+
self.new = NewModel(config, add_pooling_layer=False)
|
| 1268 |
+
classifier_dropout = (
|
| 1269 |
+
config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
|
| 1270 |
+
)
|
| 1271 |
+
self.dropout = nn.Dropout(classifier_dropout)
|
| 1272 |
+
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
|
| 1273 |
+
|
| 1274 |
+
# Initialize weights and apply final processing
|
| 1275 |
+
self.post_init()
|
| 1276 |
+
|
| 1277 |
+
def forward(
|
| 1278 |
+
self,
|
| 1279 |
+
input_ids: Optional[torch.Tensor] = None,
|
| 1280 |
+
attention_mask: Optional[torch.Tensor] = None,
|
| 1281 |
+
token_type_ids: Optional[torch.Tensor] = None,
|
| 1282 |
+
position_ids: Optional[torch.Tensor] = None,
|
| 1283 |
+
head_mask: Optional[torch.Tensor] = None,
|
| 1284 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
| 1285 |
+
labels: Optional[torch.Tensor] = None,
|
| 1286 |
+
output_attentions: Optional[bool] = None,
|
| 1287 |
+
output_hidden_states: Optional[bool] = None,
|
| 1288 |
+
return_dict: Optional[bool] = None,
|
| 1289 |
+
unpad_inputs: Optional[bool] = None,
|
| 1290 |
+
) -> Union[Tuple[torch.Tensor], NewTokenClassifierOutput]:
|
| 1291 |
+
r"""
|
| 1292 |
+
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
|
| 1293 |
+
Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
|
| 1294 |
+
"""
|
| 1295 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
| 1296 |
+
|
| 1297 |
+
outputs = self.new(
|
| 1298 |
+
input_ids,
|
| 1299 |
+
attention_mask=attention_mask,
|
| 1300 |
+
token_type_ids=token_type_ids,
|
| 1301 |
+
position_ids=position_ids,
|
| 1302 |
+
head_mask=head_mask,
|
| 1303 |
+
inputs_embeds=inputs_embeds,
|
| 1304 |
+
output_attentions=output_attentions,
|
| 1305 |
+
output_hidden_states=output_hidden_states,
|
| 1306 |
+
return_dict=return_dict,
|
| 1307 |
+
unpad_inputs=unpad_inputs,
|
| 1308 |
+
)
|
| 1309 |
+
|
| 1310 |
+
sequence_output = outputs[0]
|
| 1311 |
+
|
| 1312 |
+
sequence_output = self.dropout(sequence_output)
|
| 1313 |
+
logits = self.classifier(sequence_output)
|
| 1314 |
+
|
| 1315 |
+
loss = None
|
| 1316 |
+
if labels is not None:
|
| 1317 |
+
loss_fct = nn.CrossEntropyLoss()
|
| 1318 |
+
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
|
| 1319 |
+
|
| 1320 |
+
if not return_dict:
|
| 1321 |
+
output = (logits,) + outputs[2:]
|
| 1322 |
+
return ((loss,) + output) if loss is not None else output
|
| 1323 |
+
|
| 1324 |
+
return NewTokenClassifierOutput(
|
| 1325 |
+
loss=loss,
|
| 1326 |
+
logits=logits,
|
| 1327 |
+
last_hidden_state=sequence_output,
|
| 1328 |
+
hidden_states=outputs.hidden_states,
|
| 1329 |
+
attentions=outputs.attentions,
|
| 1330 |
+
)
|
| 1331 |
+
|
| 1332 |
+
|
| 1333 |
+
class NewForQuestionAnswering(NewPreTrainedModel):
|
| 1334 |
+
def __init__(self, config):
|
| 1335 |
+
super().__init__(config)
|
| 1336 |
+
self.num_labels = config.num_labels
|
| 1337 |
+
|
| 1338 |
+
self.new = NewModel(config, add_pooling_layer=False)
|
| 1339 |
+
self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
|
| 1340 |
+
|
| 1341 |
+
# Initialize weights and apply final processing
|
| 1342 |
+
self.post_init()
|
| 1343 |
+
|
| 1344 |
+
def forward(
|
| 1345 |
+
self,
|
| 1346 |
+
input_ids: Optional[torch.Tensor] = None,
|
| 1347 |
+
attention_mask: Optional[torch.Tensor] = None,
|
| 1348 |
+
token_type_ids: Optional[torch.Tensor] = None,
|
| 1349 |
+
position_ids: Optional[torch.Tensor] = None,
|
| 1350 |
+
head_mask: Optional[torch.Tensor] = None,
|
| 1351 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
| 1352 |
+
start_positions: Optional[torch.Tensor] = None,
|
| 1353 |
+
end_positions: Optional[torch.Tensor] = None,
|
| 1354 |
+
output_attentions: Optional[bool] = None,
|
| 1355 |
+
output_hidden_states: Optional[bool] = None,
|
| 1356 |
+
return_dict: Optional[bool] = None,
|
| 1357 |
+
unpad_inputs: Optional[bool] = None,
|
| 1358 |
+
) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
|
| 1359 |
+
r"""
|
| 1360 |
+
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
|
| 1361 |
+
Labels for position (index) of the start of the labelled span for computing the token classification loss.
|
| 1362 |
+
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
|
| 1363 |
+
are not taken into account for computing the loss.
|
| 1364 |
+
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
|
| 1365 |
+
Labels for position (index) of the end of the labelled span for computing the token classification loss.
|
| 1366 |
+
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
|
| 1367 |
+
are not taken into account for computing the loss.
|
| 1368 |
+
"""
|
| 1369 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
| 1370 |
+
|
| 1371 |
+
outputs = self.new(
|
| 1372 |
+
input_ids,
|
| 1373 |
+
attention_mask=attention_mask,
|
| 1374 |
+
token_type_ids=token_type_ids,
|
| 1375 |
+
position_ids=position_ids,
|
| 1376 |
+
head_mask=head_mask,
|
| 1377 |
+
inputs_embeds=inputs_embeds,
|
| 1378 |
+
output_attentions=output_attentions,
|
| 1379 |
+
output_hidden_states=output_hidden_states,
|
| 1380 |
+
return_dict=return_dict,
|
| 1381 |
+
unpad_inputs=unpad_inputs,
|
| 1382 |
+
)
|
| 1383 |
+
|
| 1384 |
+
sequence_output = outputs[0]
|
| 1385 |
+
|
| 1386 |
+
logits = self.qa_outputs(sequence_output)
|
| 1387 |
+
start_logits, end_logits = logits.split(1, dim=-1)
|
| 1388 |
+
start_logits = start_logits.squeeze(-1).contiguous()
|
| 1389 |
+
end_logits = end_logits.squeeze(-1).contiguous()
|
| 1390 |
+
|
| 1391 |
+
total_loss = None
|
| 1392 |
+
if start_positions is not None and end_positions is not None:
|
| 1393 |
+
# If we are on multi-GPU, split add a dimension
|
| 1394 |
+
if len(start_positions.size()) > 1:
|
| 1395 |
+
start_positions = start_positions.squeeze(-1)
|
| 1396 |
+
if len(end_positions.size()) > 1:
|
| 1397 |
+
end_positions = end_positions.squeeze(-1)
|
| 1398 |
+
# sometimes the start/end positions are outside our model inputs, we ignore these terms
|
| 1399 |
+
ignored_index = start_logits.size(1)
|
| 1400 |
+
start_positions = start_positions.clamp(0, ignored_index)
|
| 1401 |
+
end_positions = end_positions.clamp(0, ignored_index)
|
| 1402 |
+
|
| 1403 |
+
loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
|
| 1404 |
+
start_loss = loss_fct(start_logits, start_positions)
|
| 1405 |
+
end_loss = loss_fct(end_logits, end_positions)
|
| 1406 |
+
total_loss = (start_loss + end_loss) / 2
|
| 1407 |
+
|
| 1408 |
+
if not return_dict:
|
| 1409 |
+
output = (start_logits, end_logits) + outputs[2:]
|
| 1410 |
+
return ((total_loss,) + output) if total_loss is not None else output
|
| 1411 |
+
|
| 1412 |
+
return QuestionAnsweringModelOutput(
|
| 1413 |
+
loss=total_loss,
|
| 1414 |
+
start_logits=start_logits,
|
| 1415 |
+
end_logits=end_logits,
|
| 1416 |
+
hidden_states=outputs.hidden_states,
|
| 1417 |
+
attentions=outputs.attentions,
|
| 1418 |
+
)
|
modules.json
ADDED
|
@@ -0,0 +1,20 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[
|
| 2 |
+
{
|
| 3 |
+
"idx": 0,
|
| 4 |
+
"name": "0",
|
| 5 |
+
"path": "",
|
| 6 |
+
"type": "sentence_transformers.models.Transformer"
|
| 7 |
+
},
|
| 8 |
+
{
|
| 9 |
+
"idx": 1,
|
| 10 |
+
"name": "1",
|
| 11 |
+
"path": "1_Pooling",
|
| 12 |
+
"type": "sentence_transformers.models.Pooling"
|
| 13 |
+
},
|
| 14 |
+
{
|
| 15 |
+
"idx": 2,
|
| 16 |
+
"name": "2",
|
| 17 |
+
"path": "2_Normalize",
|
| 18 |
+
"type": "sentence_transformers.models.Normalize"
|
| 19 |
+
}
|
| 20 |
+
]
|
sentence_bert_config.json
ADDED
|
@@ -0,0 +1,4 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"max_seq_length": 8192,
|
| 3 |
+
"do_lower_case": false
|
| 4 |
+
}
|
special_tokens_map.json
ADDED
|
@@ -0,0 +1,51 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"bos_token": {
|
| 3 |
+
"content": "<s>",
|
| 4 |
+
"lstrip": false,
|
| 5 |
+
"normalized": false,
|
| 6 |
+
"rstrip": false,
|
| 7 |
+
"single_word": false
|
| 8 |
+
},
|
| 9 |
+
"cls_token": {
|
| 10 |
+
"content": "<s>",
|
| 11 |
+
"lstrip": false,
|
| 12 |
+
"normalized": false,
|
| 13 |
+
"rstrip": false,
|
| 14 |
+
"single_word": false
|
| 15 |
+
},
|
| 16 |
+
"eos_token": {
|
| 17 |
+
"content": "</s>",
|
| 18 |
+
"lstrip": false,
|
| 19 |
+
"normalized": false,
|
| 20 |
+
"rstrip": false,
|
| 21 |
+
"single_word": false
|
| 22 |
+
},
|
| 23 |
+
"mask_token": {
|
| 24 |
+
"content": "<mask>",
|
| 25 |
+
"lstrip": true,
|
| 26 |
+
"normalized": false,
|
| 27 |
+
"rstrip": false,
|
| 28 |
+
"single_word": false
|
| 29 |
+
},
|
| 30 |
+
"pad_token": {
|
| 31 |
+
"content": "<pad>",
|
| 32 |
+
"lstrip": false,
|
| 33 |
+
"normalized": false,
|
| 34 |
+
"rstrip": false,
|
| 35 |
+
"single_word": false
|
| 36 |
+
},
|
| 37 |
+
"sep_token": {
|
| 38 |
+
"content": "</s>",
|
| 39 |
+
"lstrip": false,
|
| 40 |
+
"normalized": false,
|
| 41 |
+
"rstrip": false,
|
| 42 |
+
"single_word": false
|
| 43 |
+
},
|
| 44 |
+
"unk_token": {
|
| 45 |
+
"content": "<unk>",
|
| 46 |
+
"lstrip": false,
|
| 47 |
+
"normalized": false,
|
| 48 |
+
"rstrip": false,
|
| 49 |
+
"single_word": false
|
| 50 |
+
}
|
| 51 |
+
}
|
tokenizer.json
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:aa7a6ad87a7ce8fe196787355f6af7d03aee94d19c54a5eb1392ed18c8ef451a
|
| 3 |
+
size 17082988
|
tokenizer_config.json
ADDED
|
@@ -0,0 +1,55 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"added_tokens_decoder": {
|
| 3 |
+
"0": {
|
| 4 |
+
"content": "<s>",
|
| 5 |
+
"lstrip": false,
|
| 6 |
+
"normalized": false,
|
| 7 |
+
"rstrip": false,
|
| 8 |
+
"single_word": false,
|
| 9 |
+
"special": true
|
| 10 |
+
},
|
| 11 |
+
"1": {
|
| 12 |
+
"content": "<pad>",
|
| 13 |
+
"lstrip": false,
|
| 14 |
+
"normalized": false,
|
| 15 |
+
"rstrip": false,
|
| 16 |
+
"single_word": false,
|
| 17 |
+
"special": true
|
| 18 |
+
},
|
| 19 |
+
"2": {
|
| 20 |
+
"content": "</s>",
|
| 21 |
+
"lstrip": false,
|
| 22 |
+
"normalized": false,
|
| 23 |
+
"rstrip": false,
|
| 24 |
+
"single_word": false,
|
| 25 |
+
"special": true
|
| 26 |
+
},
|
| 27 |
+
"3": {
|
| 28 |
+
"content": "<unk>",
|
| 29 |
+
"lstrip": false,
|
| 30 |
+
"normalized": false,
|
| 31 |
+
"rstrip": false,
|
| 32 |
+
"single_word": false,
|
| 33 |
+
"special": true
|
| 34 |
+
},
|
| 35 |
+
"250001": {
|
| 36 |
+
"content": "<mask>",
|
| 37 |
+
"lstrip": true,
|
| 38 |
+
"normalized": false,
|
| 39 |
+
"rstrip": false,
|
| 40 |
+
"single_word": false,
|
| 41 |
+
"special": true
|
| 42 |
+
}
|
| 43 |
+
},
|
| 44 |
+
"bos_token": "<s>",
|
| 45 |
+
"clean_up_tokenization_spaces": true,
|
| 46 |
+
"cls_token": "<s>",
|
| 47 |
+
"eos_token": "</s>",
|
| 48 |
+
"extra_special_tokens": {},
|
| 49 |
+
"mask_token": "<mask>",
|
| 50 |
+
"model_max_length": 8192,
|
| 51 |
+
"pad_token": "<pad>",
|
| 52 |
+
"sep_token": "</s>",
|
| 53 |
+
"tokenizer_class": "XLMRobertaTokenizerFast",
|
| 54 |
+
"unk_token": "<unk>"
|
| 55 |
+
}
|