Sampath1987 committed
Commit 28c7271 · verified · Parent(s): 7e4a9fa

fine-tuned model-v1
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "word_embedding_dimension": 768,
+ "pooling_mode_cls_token": true,
+ "pooling_mode_mean_tokens": false,
+ "pooling_mode_max_tokens": false,
+ "pooling_mode_mean_sqrt_len_tokens": false,
+ "pooling_mode_weightedmean_tokens": false,
+ "pooling_mode_lasttoken": false,
+ "include_prompt": true
+ }
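The pooling config above enables only CLS-token pooling over 768-dimensional token embeddings (all mean/max/last-token modes are disabled). As a rough illustration of what that setting means — a sketch, not the actual sentence-transformers implementation — CLS pooling reduces a `(batch, seq_len, dim)` array of token embeddings to `(batch, dim)` by keeping each sequence's first token vector:

```python
import numpy as np

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Illustrative CLS pooling, mirroring pooling_mode_cls_token = true.

    token_embeddings has shape (batch, seq_len, word_embedding_dimension);
    the sentence embedding is the first (CLS) token's vector per sequence.
    """
    return token_embeddings[:, 0, :]

# Example: 2 sequences of 16 tokens, word_embedding_dimension = 768.
batch = np.zeros((2, 16, 768))
print(cls_pool(batch).shape)  # (2, 768)
```

In normal use none of this is written by hand: loading the checkpoint with `SentenceTransformer(...)` reads `1_Pooling/config.json` and applies this pooling automatically.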
README.md ADDED
@@ -0,0 +1,881 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - dense
+ - generated_from_trainer
+ - dataset_size:89129
+ - loss:MultipleNegativesRankingLoss
+ base_model: Alibaba-NLP/gte-multilingual-base
+ widget:
+ - source_sentence: How does vendor-specific data acquisition affect DTS profile interpretation?
+ sentences:
+ - 'Bridging data management gap by gathering all well integrity data in one unique
+ data base. The aim of ADNOC Offshore in-house Well Integrity Data Management System
+ (WIDMS) is to comply with the 3A rule: Accessibility of the data, Accuracy by
+ performing regular quality check and Analysis. The analysis allows to maintain
+ wells barriers robust, to ensure personnel safety and to quickly identify integrity
+ issues to make qualified decisions about appropriate mitigations measures and
+ avoid risk escalation. WIDMS has been developed in-house with inputs and collaboration
+ of various stake holders. An enhancement list has been established selecting the
+ most relevant features that will be added value to the system. Therefore, Automation
+ for sub processes like thresholds calculations and Risk Assessment which gives
+ input for Well Passports that contains all the required information to evaluate
+ the well risks and implement the required mitigation measures.
+
+ End users are following a RACI Chart to keep WIDMS database on track and to ensure
+ no data falls through the cracks as all the data workflow is defined through the
+ different steps such as providing data, entering it in the system, informing relevant
+ stakeholders and providing technical clarifications if needed. The result of data
+ acquisition in WIDMS is that data flows across the entire organization, with defined
+ access rights in line with ADNOC Offshore policies. This data is collected from
+ various sources, is a robust data base, essential for evaluating and maintaining
+ well integrity.
+
+ It is enhancing well barriers system management by allowing to have full overview
+ of well''s barriers performance. Moreover, it allows to have reliable and continuously
+ available data such as annulus pressure data that is critical for well integrity
+ assurance, to avoid the uncontrolled release of hydrocarbons to the atmosphere.
+ Notifications have been implemented so alerts can be sent for engineers to inform
+ about any abnormality and non-compliance. As technology evolves, using paper-based
+ processes, excel spreadsheets, time-based equipment inspection and testing become
+ less effective. Well diagnostics are expensive so utilizing well data analytics
+ through this digital hub project will ease having detailed real time data and
+ quick analysis for early detection of failures and anticipation and reduction
+ of risk escalation.'
+ - "##### 2.3.1 Site characterization - secondary seal \nSecondary seals might have\
+ \ a significant relevance in ensuring CO 2 containment, acting\nas additional\
+ \ barrier to flow, although it is not clear if it is considered a requirement\
+ \ for\nstandards. Two documents show some contradiction: \nISO 27914 [36] is\
+ \ silent on secondary seal as a requirement until section 5.4.3.2 that describes\n\
+ its characterization. Moreover, if it is a requirement, characterization should\
+ \ include not\nonly geometry and lithology, but also integrity evaluation, which\
+ \ is not mentioned. \nISO/TR 27915 [37] section 5.2.6 and Figure 2 state that\
+ \ the geological storage complex is\ncomposed of the reservoirs where CO 2 is\
+ \ injected and the caprock (or seals); it then states\nthat additional geologic\
+ \ layers are outside complex."
+ - 'Geothermal energy is considered a reliable, sustainable and abundant source of
+ energy with minimized environmental impact. The extracted geothermal energy may
+ be utilized for direct heating, or electricity generation. The main challenge
+ to access this energy is tremendous capital expenditures required for drilling
+ and completion. Therefore, this work discusses and evaluates retrofitting abandoned
+ petroleum wells to geothermal as a commonly proposed solution to the mentioned
+ challenge.
+
+ There are many oil and gas wells globally which are not used for production, injection
+ or other purposes. Well abandonment is commonly considered as an essential measure
+ to ensure safety and integrity of these wells, bearing huge costs and concerns
+ for the petroleum industry. By converting abandoned or non-activated oil and gas
+ wells to geothermal wells, it is claimed to be possible to produce geothermal
+ energy and generate power. As a crucial stage for the claim verification and evaluation
+ of feasibility or efficiency of this conversion, it is important to be aware of
+ the practical and simulation case studies.
+
+ Therefore, in this work, this work presents a comprehensive overview and analysis
+ of 20 case studies published from different countries, followed by important downhole
+ and surface parameters. As for the downhole characteristics, production scenarios
+ either open-loop or closed-loop, optimization of open-loop systems, borehole heat
+ exchangers with their different types and dimensions, and insulations are covered.
+ Next, surface cycles including organic Rankine cycle (ORCs), selection of circulation
+ fluids, flow rates, and working fluids are covered, followed by produced and net
+ powers with evaluation of coefficient of performance (COP) and thermal efficiency.
+ This investigation shows there is good potential for producing geothermal energy
+ from abandoned and non-activated petroleum wells.'
+ - source_sentence: Why must welding consumables be limited to specific classifications
+ and manufacturers for EGW?
+ sentences:
+ - "8 API R ECOMMENDED P RACTICE 582 \n**5.2.6 EGW** \nThe use of EGW shall be\
+ \ limited by the following conditions: \na) EGW shall be used only with filler\
+ \ materials specifically intended for the EGW process (ASME/AWS SFA/A5.26/\nSFA/A5.26M),\
+ \ \nb) welding consumables shall be limited to the classification and the manufacturer’s\
+ \ trade name used in the PQR, \nc) only filler materials having classifications\
+ \ with specified minimum impact test requirements should be used. \n**5.2.7 SAW**\
+ \ \n**5.2.7.1** SAW procedures shall be requalified whenever the welding flux\
+ \ is changed from one manufacturer’s trade\nname to another. Equivalence under\
+ \ ASME _BPVC_ Section II, Part C, or AWS filler metal specifications shall not\
+ \ be\nconsidered adequate for substitution without requalification. \nCOMMENTARY\
+ \ It is recognized that fluxes having the same classification can be very different\
+ \ in their\ncomposition. However, nominal flux composition is not included in\
+ \ AWS or ASME specifications/codes, and flux\nsuppliers do not normally provide\
+ \ this information. Differences among fluxes of the same classification can result\
+ \ in\ndifferent and unanticipated weld properties when these fluxes are used interchangeably\
+ \ over the range of variables\ntypically stated in weld procedure specifications.\
+ \ \n**5.2.7.2** Manually held (semiautomatic) SAW is not permitted for welding\
+ \ pressure-containing parts, unless approved\nby the purchaser. \n**5.2.7.3**\
+ \ A separate qualification is required for SAW welds in which any pass is greater\
+ \ than [1] / 2 in. \n**5.3 Single-sided Welded Joints** \nFor single-sided welded\
+ \ joints where process side corrosion is a concern, welding processes using coatings\
+ \ or fluxes\nshall not be used for root pass welding of austenitic stainless steels,\
+ \ non-ferrous alloys and nickel-base alloys unless\nslag can be removed from the\
+ \ process side of root passes and the area inspected for slag removal. \n**5.4\
+ \ Combining Welding Processes** \nCombining two or more welding processes that\
+ \ use alloy filler metals of different nominal compositions, other than A1\nthro\
+ \ ~~ug~~ h A5, requires qualification as a combination procedure."
+ - 'Following multi-disciplinary reviews, an opportunity was identified to restore
+ production and unlock incremental reserves from well X, a dual completion in two
+ different reservoirs but subsequently deserted due to long term community crisis
+ that led to over 25years of non-production with complete vandalization of well
+ head and flowlines.
+
+ The method employed involved the strategic resolution of long-term crisis between
+ two communities where well X is located, via a multi-disciplinary effort involving
+ the operating company’s Community Relations, HSSE, Production Engineering, Operations
+ Support and Portfolio Management functions. The installation of a retrofitted
+ well head was done with first line and second line maintenance carried out. Wireline
+ drifting and static bottom hole pressures were acquired for both strings using
+ slickline equipment and a preliminary well test was conducted for both strings
+ with production to a flowback tank.
+
+ The preliminary result for the long string (LS) indicated a high water cut (>80%),
+ while the result from short string (SS) was in line with expectation (<57%). The
+ test result from the short string informed the decision to construct a new flowline
+ for restoring its production, while further subsurface evaluation is required
+ for the long string (LS). The significance of the short string (SS) result is
+ the unlocking of additional reserves ca. 1.0MMSTB from a reservoir with remaining
+ oil in place estimated at ca. 18MMSTB, where the short string (SS) is the only
+ drainage point currently completed on the reservoir.
+
+ This solution provides a cost effective and efficient way to increasing production
+ and reserves at minimal expenditure leveraging on multi-disciplinary expertise,
+ using existing infrastructures as well as resolving community crisis, where applicable.'
+ - "**Exploration & Production** \n**General Specification** Date: 10/2007 \n**GS\
+ \ EP STR 301** Rev: 07 \nsuccessful practice of the process in previous similar\
+ \ jobs to the satisfaction of the\nCOMPANY. \nb) Only Extra Low Hydrogen processes\
+ \ (max. 5 ml H2/100 g) shall be used for welding and tack \nwelding of Special\
+ \ and First Category members or materials having specified YS above\n262 MPa (38,000\
+ \ psi). The same requirement shall apply for any welding on castings and\nforgings.\
+ \ \nc) For Second Category members and Non-Structural members, welding processes\
+ \ other than\nExtra Low Hydrogen processes may be used, subject to prior approval\
+ \ by COMPANY, for\nmaterials having specified YS up to 262 MPa together with thickness\
+ \ up to 12.70 mm\n(0.500”). \nd) The number of different welding processes shall\
+ \ be minimized. \ne) Different welding consumables qualities (basic extra low,\
+ \ basic low,) in a same type of\nconsumables shall be avoided. \n**8.5 Welding\
+ \ consumables** \n**8.5.1 Selection of consumables** \na) Consumables shall\
+ \ conform to ANSI/AWS D 1.1 code and shall have been approved by an\ninternational\
+ \ recognized certification body (e.g. DNV, LLOYD’s, etc.). \nb) If classification\
+ \ of the structure is required, welding consumables shall conform to rules of\
+ \ the\nClassification Society. \nc) Cellulosic electrodes are strictly forbidden\
+ \ for structural use \nd) Welds forming connections between steels of different\
+ \ grades of material shall develop the\nminimum specified tensile properties of\
+ \ the lower steel grades being joined, unless otherwise\npreviously approved by\
+ \ the COMPANY. \nWelds forming connections between steels of different grades\
+ \ of material shall develop the\nminimum specified notch impact properties at\
+ \ the lowest temperature of steel grades being\njoined, unless otherwise previously\
+ \ approved by the COMPANY. \ne) For repair welding or multiple repairs, “extra\
+ \ low hydrogen” electrodes are required\n(i.e. maximum specified hydrogen content\
+ \ of 5 ml per 100 gram of weld metal). \nf) For welding castings or forgings,\
+ \ “extra low hydrogen” electrodes are required (i.e. maximum\nspecified hydrogen\
+ \ content of 5 ml per 100 gram of weld metal)."
+ - source_sentence: What is the recommended use of blank samples in sampling procedures
+ involving the trapping or precipitation of components?
+ sentences:
+ - "elements. We do this by performing a similarity transformation on the matrix\
+ \ _k_ . The coordinate systems _x_ = { _x_ 1, _x_ 2 } and _y_ = { _y_ 1,, _y_\
+ \ 2 } are related by the similarity transformation matrix \n_A_ such that \n\
+ _y_ = _Ax_ . ................................................................\
+ \ (2.127) \nThe two coordinate systems are shown in Fig. 2.6.\nAn angle ( _θ_\
+ \ ) is associated with the transformation in Eq. 2.127 by writing the 2D coordinate\
+ \ transformation as \n_y_ 1 \n_y_ 2 \n_x_ 1 \n_x_ 2 \n= \ncos _θ_ sin _θ_\
+ \ \n−sin _θ_ cos _θ_ \n. ........................................... (2.128)\
+ \ \nThe coordinate systems _x_ = { _x_ 1, _x_ 2 } and _y_ = { _y_ 1,, _y_ 2 }\
+ \ are related by the counterclockwise rotation shown in Fig. 2.6. We have an aligned\
+ \ coordinate system _y_ = { _y_ 1,, _y_ 2 } with the\nprincipal axes of the permeability\
+ \ tensor. The diagonal tensor in the coordinate system \n_y_ = { _y_ 1,, _y_\
+ \ 2 } has the form \n_k_ ′=\n( \n_k_ max 0 \n0 _k_ _T_\n) [, .........................................................\
+ \ (2.129)] \n**Print** **Search** **Chapter 1** **Home** **Chapter 3** **Bookmarks**\
+ \ **Help**"
+ - "**© 2010 COPYRIGHT MERCADO NEGRO, LAS PLAYITAS. MARACAIBO-EDO. ZULIA, VENEZUELA.**\
+ \ \n**PARA COMPRAR AL DETAL O AL MAYOR, ESTE Y OTROS PRODUCTOS, FAVOR PREGUNTAR\
+ \ POR EL GÖAJIRO BLANCO, EN EL MERCADO LAS PLAYITAS.** \n**ADVERTENCIA: \"EL\
+ \ DERECHO DE AUTOR NO ES UNA FORMA DE PROPIEDAD SINO UN DERECHO CULTURAL. EXIGE\
+ \ TU DERECHO\"** \nI-208 Petroleum Engineering Handbook—Vol. I \n**Fig. 4.10—Chromatogram\
+ \ showing broad OBM peak.** \n**Fig. 4.11—Chromatogram showing narrow OBM peak.**\
+ \ \nSpecial correction techniques are increasingly used within the oil industry,\
+ \ and because\nthese techniques vary between organizations and laboratories, sample\
+ \ selection should be done\nonly after considering which method to use. Many companies\
+ \ are forced to use oil-based\ndrilling muds to manage drilling costs in water-sensitive\
+ \ formations, and the added expense of\nhandling contaminated samples (and the\
+ \ risk associated with poorer-quality data) must be used\nto evaluate the overall\
+ \ economic balance. \nFor water samples, comparisons of duplicates also give\
+ \ a good indication of quality. Where\nfluid concentration may be stabilizing\
+ \ (e.g., at the end of a cleanup), sequential samples should\nbe used to look\
+ \ for compositional trends and thus to help decide if representative fluid has\n\
+ been sampled. For some sampling procedures involving trapping or precipitation\
+ \ of particular\ncomponents, it is highly recommended to use blank “samples,”\
+ \ which undergo exactly the\nsame treatment and storage as the actual sample and\
+ \ provide a reference measurement to assist\nwith the interpretation of laboratory\
+ \ measurements. More details are available in API _RP 45._ [10] \n**Print** **Search**\
+ \ **Chapter 3** **Home** **Chapter 5** **Bookmarks** **Help**"
+ - 'the ideal time to take samples? (6) Will on-site analyses be required? (7) Who
+ will perform
+
+ sampling and analysis duties?
+
+ Fluid-sampling operations are often left to service-company personnel, but because
+ significant variation in levels of competence exists within the industry and within
+ service companies
+
+ themselves, it is recommended either to use specialist laboratory personnel or
+ to supervise the
+
+ service-company operations closely.
+
+ General guidelines for choosing reservoir-fluid-sampling methods and sample quantities
+ required are summarized in **Table 4.2.** Regardless of the actual volumes mentioned,
+ you should
+
+ collect at least two separate samples of each fluid, referred to as duplicate
+ or replicate samples.
+
+ This reduces the chance of losing information if one of the samples leaks or is
+ accidentally
+
+ damaged during laboratory operations, and it allows a comparison between the samples
+ as part
+
+ of the quality-control procedures.
+
+ Surface-separator sampling is the most common technique, but the reservoir-fluid
+ sample
+
+ recombined in the laboratory is subject to errors in the measured GOR and any
+ imprecision in
+
+ the laboratory recombination procedure. Downhole samples (or wellhead samples)
+ are not affected by such inaccuracies but require the fluid to be in monophasic
+ condition when sampled;
+
+ this can be confirmed definitively only afterward in the laboratory. Also, there
+ is general reluctance to attempt downhole sampling in gas/condensate reservoirs
+ because many are saturated,
+
+ and the phases are likely to segregate in the wellbore. The ideal situation for
+ a laboratory is to
+
+ receive both surface and downhole samples because a choice is then available,
+ and a good idea
+
+ can be obtained of how representative the resulting fluid is.
+
+ In certain circumstances, it can be good practice to collect “backup” fluid samples
+ at the
+
+ earliest opportunity during a production test, even if a well has not cleaned
+ up properly. If the
+
+ test has to be aborted for some reason [well bridging, unexpected levels of hydrogen
+ sulfide
+
+ (H 2 S), etc.], the backup samples may be of great value, even if they are not
+ 100% representative. If the test is completed successfully, the backup samples
+ can be discarded to avoid the
+
+ cost of unnecessary shipment and testing.
+
+ If sampling is part of a long-term monitoring program, such as those required
+ by government authorities or those forming part of custody-transfer contracts,
+ the methods defined in the
+
+ appropriate documentation or contracts must be followed as closely as possible,
+ even if this'
+ - source_sentence: What is the significance of implementing a centralized, web-based
+ integrated surveillance tool for production optimization?
+ sentences:
+ - "P RESSURE - RELIEVING AND D EPRESSURING S YSTEMS 141 \n**5.7.11.2.3\
+ \ Flare Gas Characteristics** \nFlare gases can have widely varying compositions\
+ \ that shall be evaluated during specification of recovery systems.\nThe potential\
+ \ for materials that are not compatible with the flare gas treating systems or\
+ \ ultimate destinations shall be\ndetermined. For example, relief streams containing\
+ \ acid gases typically are routed directly to the flare, thereby\nbypassing the\
+ \ recovery system. Highly inert streams can also be incompatible with recovery\
+ \ systems. \n**5.7.11.3 Design Considerations** \n**5.7.11.3.1 Sizing** \n\
+ Figure 13 shows a conceptual design for a flare gas recovery system. Typically,\
+ \ the system consists of one or more\nreciprocating compressors whose suction\
+ \ is directly connected to the flare header. The compressed gas is usually\nrouted\
+ \ to some type of treating system appropriate for the gas composition, then to\
+ \ fuel gas or processing systems. \n3 \n**Key**\n1 compressor load control\n\
+ 2 flare gas treating\n3 from process unit flare knockout drums \n4 flare header\
+ \ \n5 flare knockout drum (if used) \n6 water seal \n7 flare \na Compressor\
+ \ shutdown. \n**Figure 13—Typical Flare Gas Recovery System** \nCopyright American\
+ \ Petroleum Institute\nProvided by IHS under license with API Licensee=Petrofac\
+ \ International Ltd/5954785001, User=McNicol, William\nNo reproduction or networking\
+ \ permitted without license from IHS Not for Resale, 01/29/2014 03:10:03 MST"
+ - 'Inorganic scale precipitation and deposition in oil and gas wells can cause significant
+ production loss, which results in additional operational expenditure (OPEX) and
+ health safety and environmental (HSE) risks. Scale management requires a detailed
+ understanding of production rates, hydrocarbon and produced water compositions
+ as well as reservoir conditions. Accurate real-time analysis of produced water
+ compositions can immediately identifiy scaling risks in a production well and
+ can lead to significantly reduced production loss, optimized chemical dosages,
+ and fewer workovers, consequently lowering OPEX and mitigating HSE risk. This
+ paper introduces development of a device capable of measuring the most critical
+ parameters associated with inorganic scale in flowing produced water including
+ pH, alkalinity, strontium, barium, sulfate, total hardness, total dissolve solids
+ (TDS) and others.
+
+ In order to measure these water properties with the device, different methods
+ were tested, but eventually, a combination of spectrophotometric and other methods
+ were determined effective. One of the challenges of using spectrophotometric methods
+ is the reagent stability over time. Hence, customized reagents were prepared for
+ this application and the stability of these reagents was tested over time. Specific
+ calibration methods were designed in order to extend the usage of the reagents.
+
+ Static measurements were initially performed and the results showed precise measurements
+ of all the parameters. Results from dynamic tests utilizing real time flow and
+ static test were in agreement and the accuracy was confirmed by traditional methods.
+ Once the device prototype was built in our laboratories, production fluids were
+ used to test the complete device. This device can be placed at various attachment
+ points from the wellhead to the separator. This automated device is capable of
+ collecting a discrete production fluid sample, separating produced water from
+ the bulk phase and measuring various properties of produced water. These properties
+ are reported electronically and used as part of a combined real time scale risk
+ prevention system. In addition, this device measures parameters while maintaining
+ wellhead pressure and temperature in order to eliminate the potentials errors
+ in measurements, for instance pH of water changes due to degassing and precipitation
+ as a result of changes in pressure and temperature.
+
+ A field trial is planned to test the device under full flowing conditions. This
+ will be the first automated real-time produced water composition monitoring device
+ with high measurement accuracy while maintaining pressure and temperature of samples,
+ which can be attached at various points from wellhead to separator. This can be
+ beneficial to identify the scaling risk in production wells before severe scaling
+ occurs. The device is designed to enhance reliability of water properties measurements,
+ provide real-time measurements, and reduce downtime and costs associated with
+ scale problems and sampling.'
+ - 'In 2009, the Kuwait Integrated Digital Field (KwIDF) project was established
+ in the Sabriyah field in north Kuwait to boost production and reserves (Al-Jasmi
+ et al. 2014). The goal was to help realize the vision of sustained oil production
+ in Kuwait of four million barrels of oil equivalent per day (BOE/D) by 2030 (Goel
+ et al. 2013). The project involved the creation of 11 integrated, automated workflows,
+ and a real-time collaborative environment to help optimize production, reduce
+ downtime, and improve reservoir management:
+
+ Key performance monitoring—calculates and displays key parameters to monitor and
+ assess asset performance at the field and well levels (Al-Jasmi et al. 2013).
+
+ Well performance evaluation (WPE)—allows users to model and evaluate any well
+ in real time, from completion face to wellhead (Cullick et al. 2013).
+
+ Smart production surveillance (SPS)—helps enable users to control production and
+ make surveillance decisions in real time (Villamizar et al. 2013).
+
+ Production loss—an advance workflow for users to compare current oil production
+ to pre-established allowable rates (Villamizar et al. 2013).
+
+ Electric submersible pump (ESP) diagnostic and optimization—helps enable users
+ to interactively monitor and optimize ESP operated well operations (Velasquez
+ et al. 2013).
+
+ Production allocation—integrates the allocation process within the KwIDF environment,
+ increases the frequency of the allocation cycle, and improves the accuracy of
+ allocated volumes (Al-Jasmi et al. 2013).
+
+ Gas lift (GL) diagnostic and optimization—uses a smart real-time control that
+ provided proactive recommendations for GL systems optimization (Al-Jasmi et al.
+ 2013).
+
+ Reporting and distribution—displays system generated alarms from all KwIDF workflows,
+ generates tickets, and reports ticket status (Al-Jasmi et al. 2013).
+
+ Simulation model update and ranking—an automated workflow for reservoir history
+ matching (Carvajal et al. 2013).
+
+ Reservoir visualization and analysis, and subsurface waterflood optimizer—helps
+ enable the monitoring of subsurface health during the waterflooding process, and
+ provides predictive reservoir optimization analysis and actions (Ranjan et al.
+ 2013).
+
+ By 2012, KwIDF had been deployed on 49 wells, representing a pilot that served
+ as a proof of concept. By 2013, cumulative production gains of 756,000 barrels
+ of oil were reported (Singh et al. 2013). While the gains were impressive, and
+ management wanted to expand KwIDF, it was recognized that full deployment would
+ pose significant challenges and, without a set of necessary changes, the value
+ of KwIDF would not be realized.
+
+ The key challenge facing management was to identify the appropriate operating
+ model to deliver on the KwIDF vision and scale the program to accommodate future
+ expansion across the rest of the organization. A transition and deployment assessment
+ team was established by management to address this challenge.
+
+ The transition and deployment assessment project produced a recommended operating
+ model, a transition road map, change management strategy, risk and mitigation
+ plan, and project charters to assist the program team and steering group in the
+ deployment of KwIDF across the rest of North Kuwait.'
405
+ - source_sentence: What role did anti-collision analysis play in the drilling of the
406
+ dual lateral well?
407
+ sentences:
408
+ - This paper aims to analyze the impact of appraising and developing marginal fields
409
+ with multiple stacked reservoirs which is quite challenging in terms of techno
410
+ commercial value. The development of such marginal reservoirs using conventional
411
+ single horizontal wells drilling and completion is uneconomical. Therefore, it
412
+ was necessary to engineer a solution that can enhance the commercial value of
413
+ the project by reducing CAPEX and OPEX. This paper will present the first comprehensive
414
+ business case, where multiple stacked reservoirs with marginal reserves were studied
415
+ to produce independently using multilateral completions, granting full accessibility
416
+ of the laterals while achieving production monitoring and reservoir surveillance.
  - 'This paper is a comprehensive analytic driven study on the use and sizing of
    membrane filters to improve the injected water quality for maintaining injectivity
    in tight carbonate reservoirs. Out of the different mechanisms of formation damage,
    the pore plugging with the migration of particles within the injectant fluids
    by bridging at the pore throat junctions and/or by pore filling can lead to the
    buildup of an internal filter cake away from the wellbore that limits the well’s
    injectivity and can affect the vertical and lateral sweep.

    This type of formation damage is very difficult to treat with any kind of stimulation
    and the impact will be manifested especially in tight formations with interbedded
    stylolites layers with a total range of permeabilities from 2 to less than 1 milli-Darcy
    and a median pore throat size ranging from 2.5 to 0.3 micron meters. The study
    comprises several parts starting with a geological analysis that was conducted
    to identify areas and layers most prone to formation pore plugging by analyzing
    thin-sections and MICP data. Second, in the lack of core flood tests, a reservoir
    and well study analyzed existing water injectors situated in similar or slightly
    higher quality rock areas through the analysis of injectivity index behavior to
    estimate the impact of damage and the expected injector’s half-life.

    As a result, through the application of an analytical mathematical model for defining
    deep bed filtration parameter, a correlation was established based on average
    injected particle size and reservoir rock quality to aid in selecting the proper
    water injection filter size. In order to confirm that, a dedicated injectivity
    test in a horizontal well utilizing membrane filters was carried out to assess
    eventual formation damage and the filters efficiency by conducting a series of
    multiple pressure fall-off tests coupled with injection profile logging to monitor
    any induced damage within the wellbore region.

    Finally, the operational aspects and the integration within field development
    plans were addressed, especially with the recommended well placement and completion.
    This culminated in a field development strategy for formation damage mitigation
    in tight carbonate reservoirs during production and injection phase that can be
    used in other similar fields.'
  - 'The most common challenge in horizontal drilling is depth uncertainty which can
    be due to poor seismic data or interpretation. It is arguable that a successful
    landing of the wellbore in the reservoir optimally and within the desired zone
    is the most challenging in most geosteering operation. The presence of fluid contacts
    such as oil-water-contact (OWC) and gas-oil-contact (GOC) complicates the whole
    drilling process, most especially if these fluid contacts are not well defined
    or known. Additionally, the ability to map the boundaries of the reservoir as
    the BHA drills the lateral section is an added advantage to remaining within the
    desired reservoir section.

    The success of any reservoir navigation service where seismic uncertainty at the
    reservoir top is high will rely largely on how effective the geosteering system
    is and how the geosteering engineer is able to react promptly to changes while
    landing the well in the reservoir and drilling the lateral section with without
    exiting the reservoir.

    Reservoir Navigation Service (RNS) provides the means for the drilling near horizontal
    or horizontal wells for the purpose of increasing hydrocarbon extraction from
    the earth''s subsurface. This involves the use of a pre-defined bottom hole assembly
    (BHA) with inbuilt downhole logging while drilling (LWD) and measurement while
    drilling (MWD) sensors. The measurements from these downhole sensors are uplinked
    to the surface of the wellbore where they are converted to meaningful petrophysical
    data. The goal is to use the downhole petrophysical data such as gamma ray, propagation
    resistivity and so on, to update an existing pre-well geological model of a section
    of the earth in such a way that the final result depicts the true model picture
    of the earth subsurface.

    This paper focuses on using well CBH-44L to showcase how the use of real-time
    distance-to-boundary (D2B) measurement from a deep reading azimuthal propagation
    resistivity tool is use to correct for depth uncertainty in seismic, thereby,
    improving the chance of successfully landing and drilling a horizontal well.'
datasets:
- Sampath1987/offshore_energy
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy
model-index:
- name: SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
  results:
  - task:
      type: triplet
      name: Triplet
    dataset:
      name: ai job validation
      type: ai-job-validation
    metrics:
    - type: cosine_accuracy
      value: 0.7850282788276672
      name: Cosine Accuracy
---

# SentenceTransformer based on Alibaba-NLP/gte-multilingual-base

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) on the [offshore_energy](https://huggingface.co/datasets/Sampath1987/offshore_energy) dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) <!-- at revision 9bbca17d9273fd0d03d5725c7a4b0f6b45142062 -->
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
    - [offshore_energy](https://huggingface.co/datasets/Sampath1987/offshore_energy)
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'NewModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
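
The last two modules do the embedding post-processing: CLS-token pooling followed by L2 normalization. The following toy sketch (not the actual model, just illustrative 4-dimensional numbers) shows what those two steps compute:

```python
# Illustrative sketch of CLS pooling + Normalize(): take the first (CLS) token
# vector from the transformer output and scale it to unit length, so that
# cosine similarity downstream reduces to a plain dot product.
import math

def cls_pool_and_normalize(token_embeddings):
    """Take the first (CLS) token vector and L2-normalize it."""
    cls = token_embeddings[0]
    norm = math.sqrt(sum(x * x for x in cls))
    return [x / norm for x in cls]

# Toy 4-dimensional "token embeddings" for a 3-token sequence
tokens = [[3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.0, 0.0]]
embedding = cls_pool_and_normalize(tokens)
print(embedding)                       # [0.6, 0.8, 0.0, 0.0]
print(sum(x * x for x in embedding))   # 1.0 (unit norm)
```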

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Sampath1987/EnergyEmbed-v1")
# Run inference
sentences = [
    'What role did anti-collision analysis play in the drilling of the dual lateral well?',
    'This paper aims to analyze the impact of appraising and developing marginal fields with multiple stacked reservoirs which is quite challenging in terms of techno commercial value. The development of such marginal reservoirs using conventional single horizontal wells drilling and completion is uneconomical. Therefore, it was necessary to engineer a solution that can enhance the commercial value of the project by reducing CAPEX and OPEX. This paper will present the first comprehensive business case, where multiple stacked reservoirs with marginal reserves were studied to produce independently using multilateral completions, granting full accessibility of the laterals while achieving production monitoring and reservoir surveillance.',
    "The most common challenge in horizontal drilling is depth uncertainty which can be due to poor seismic data or interpretation. It is arguable that a successful landing of the wellbore in the reservoir optimally and within the desired zone is the most challenging in most geosteering operation. The presence of fluid contacts such as oil-water-contact (OWC) and gas-oil-contact (GOC) complicates the whole drilling process, most especially if these fluid contacts are not well defined or known. Additionally, the ability to map the boundaries of the reservoir as the BHA drills the lateral section is an added advantage to remaining within the desired reservoir section.\nThe success of any reservoir navigation service where seismic uncertainty at the reservoir top is high will rely largely on how effective the geosteering system is and how the geosteering engineer is able to react promptly to changes while landing the well in the reservoir and drilling the lateral section with without exiting the reservoir.\nReservoir Navigation Service (RNS) provides the means for the drilling near horizontal or horizontal wells for the purpose of increasing hydrocarbon extraction from the earth's subsurface. This involves the use of a pre-defined bottom hole assembly (BHA) with inbuilt downhole logging while drilling (LWD) and measurement while drilling (MWD) sensors. The measurements from these downhole sensors are uplinked to the surface of the wellbore where they are converted to meaningful petrophysical data. The goal is to use the downhole petrophysical data such as gamma ray, propagation resistivity and so on, to update an existing pre-well geological model of a section of the earth in such a way that the final result depicts the true model picture of the earth subsurface.\nThis paper focuses on using well CBH-44L to showcase how the use of real-time distance-to-boundary (D2B) measurement from a deep reading azimuthal propagation resistivity tool is use to correct for depth uncertainty in seismic, thereby, improving the chance of successfully landing and drilling a horizontal well.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.4457, 0.3235],
#         [0.4457, 1.0000, 0.3388],
#         [0.3235, 0.3388, 1.0000]])
```
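
Because the model ends with a `Normalize()` module, its embeddings are unit-length and cosine similarity is a plain dot product, which is what makes simple semantic-search ranking cheap. The sketch below illustrates the ranking step on hypothetical toy vectors; with the real model you would obtain the vectors via `model.encode(...)` as shown above:

```python
# Minimal semantic-search ranking sketch on toy unit vectors (stand-ins for
# real embeddings). For unit-norm vectors, dot product == cosine similarity.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

query = [1.0, 0.0, 0.0]
corpus = {
    "doc_a": [0.8, 0.6, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.96, 0.0, 0.28],
}
# Rank corpus documents by similarity to the query, most similar first
ranked = sorted(corpus, key=lambda name: dot(query, corpus[name]), reverse=True)
print(ranked)  # ['doc_c', 'doc_a', 'doc_b']
```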

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Evaluation

### Metrics

#### Triplet

* Dataset: `ai-job-validation`
* Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)

| Metric              | Value     |
|:--------------------|:----------|
| **cosine_accuracy** | **0.785** |
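
Cosine accuracy is the fraction of (anchor, positive, negative) triplets for which the anchor is closer, in cosine terms, to its positive than to its negative. A small sketch of that definition on toy unit-norm vectors (illustrative data, not the validation set):

```python
# Sketch of the triplet cosine-accuracy metric: count how often
# sim(anchor, positive) > sim(anchor, negative).

def cos(u, v):
    # Vectors below are already unit-norm, so a dot product suffices.
    return sum(a * b for a, b in zip(u, v))

triplets = [
    # (anchor, positive, negative)
    ([1.0, 0.0], [0.9, 0.436], [0.0, 1.0]),   # correct: 0.9  > 0.0
    ([0.0, 1.0], [0.6, 0.8],   [0.8, 0.6]),   # correct: 0.8  > 0.6
    ([1.0, 0.0], [0.0, 1.0],   [1.0, 0.0]),   # wrong:   0.0  < 1.0
    ([0.6, 0.8], [0.8, 0.6],   [0.0, 1.0]),   # correct: 0.96 > 0.8
]
correct = sum(cos(a, p) > cos(a, n) for a, p, n in triplets)
accuracy = correct / len(triplets)
print(accuracy)  # 0.75
```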

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### offshore_energy

* Dataset: [offshore_energy](https://huggingface.co/datasets/Sampath1987/offshore_energy) at [0ebbfc6](https://huggingface.co/datasets/Sampath1987/offshore_energy/tree/0ebbfc615bc7c9bbd3d58315bc2e14e91f291fa1)
* Size: 89,129 training samples
* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
* Approximate statistics based on the first 1000 samples:
  |         | anchor | positive | negative |
  |:--------|:-------|:---------|:---------|
  | type    | string | string   | string   |
  | details | <ul><li>min: 12 tokens</li><li>mean: 24.68 tokens</li><li>max: 77 tokens</li></ul> | <ul><li>min: 37 tokens</li><li>mean: 437.61 tokens</li><li>max: 983 tokens</li></ul> | <ul><li>min: 28 tokens</li><li>mean: 410.96 tokens</li><li>max: 1188 tokens</li></ul> |
* Samples:
  | anchor | positive | negative |
  |:-------|:---------|:---------|
  | <code>What is the significance of end point relative permeability of the oil phase in the productivity of oil reservoirs below bubble point pressure?</code> | <code>In contrast with what is followed for Offshore Oil Operations the majority of the Onshore Oil Operations in the world do not have a Minimum and Mandatory required HSE training program for all personnel including contractors and subcontractors.<br>A comparison is drawn between the Minimum and Mandatory HSE Training Programmes applied offshore in developed areas, mainly North Sea and Gulf of Mexico and the benefits that similar programs can bring to the ME onshore oil operations are addressed by estimating the risk reduction and potential economic benefits.<br>The applicability of such Minimum and Mandatory HSE Training Programs is analyzed against the scenario of heavy utilization of contractors and subcontractors with different approach and standards in HSE training and also the increasing complexity of the onshore oil operations<br>An estimation of how many lives can potentially be saved by the introduction of such programs is provided in global and generic terms.<br>The HR Impact, in different a...</code> | <code>The knowledge of relative permeability is key in oil production mechanism as it affects multiphase flow which is vital to producible reserves in petroleum reservoirs. In this study, the impact of altering end point saturation on relative permeability curve and how it influences oil recovery was investigated on field X in Niger Delta, Nigeria. The saturation end points obtained after a simulation study was used as a start point to predict oil production. These end points saturation of water and oil were altered and varied according to facies. The eclipse simulation tool was used in conducting the prediction runs. The result obtained showed wide variation from actual production forecast (i.e. ≥ 25%) when end points were varied with no guided limit from experimental data. This study reveals the need for an accurate determination of residual oil saturation as it was seen to have an impact on forecast and history match.</code> |
  | <code>What role does the effective coefficient of discharge (_Kd_) play in calculating the required effective discharge area?</code> | <code>96 API S TANDARD 520, P ART I—S IZING AND S ELECTION <br>**B.2.3.3** Using the theoretical mass flux obtained from numerical integration above, one may determine the<br>required effective discharge area: <br>In USC units: <br>_Q_ × ρ 1<br>× <br>sec gal _G_ ×<br>60 × 7 4805 .<br>min ft 3 <br>_A_ = _W_ = _Q_ × ρ × 1<br>_G_ × _K_ d 60 sec × 7 4805 . gal _G_ × _K_ <br>d 60 × 7 4805 . d<br>3 <br>528 62 2 × . 1 2 2 <br>_A_ = × = 0 0148 ft . = 2 135 in. . (B.8) <br>60 7 4805 × . 7 592 14 0 65, . × . <br>In SI units: <br>_Q_ ×ρ 1<br>× <br>sec liter _G_ ×<br>60 min × 1 000, m 3 <br>_A_ = _W_ = _Q_ ×ρ × 1<br>_G_ × _K_ d 60 sec × 1 000, liter 3 _G_ × _K_ <br>, <br>_A_ = 2 000, × 996 9 . × 1 = 1 379 . × 10 − 3 m 2 = 1 379 mm, 2 (B.9)<br>60 × 1 000, 37 068, × 0 65 . <br>where <br>_G_ is the theoretical mass flux through the nozzle, lb/s·ft [2] (kg/s·m [2] ); <br>_W_ is the required relief rate, lb/s (kg/s); <br>_Q_ is the required relief rate, gal/min (L/min); <br>ρ = 1 _v_ is the fluid density, lb/ft [3] (kg/m [3] ); <br>_K_ d is the effective coefficient of discharge...</code> | <code>S IZING, S ELECTION, AND I NSTALLATION OF P RESSURE - RELIEVING D EVICES 59 <br>**5.6.3 Sizing for Critical Flow** <br>**5.6.3.1 General** <br>**5.6.3.1.1** Pressure-relief devices in gas or vapor service that operate at critical flow conditions (see 5.6.2)<br>may be sized using Equation (2) through Equation (7). Each of the equations may be used to calculate the<br>effective discharge area, _A_, required to achieve a required flow rate through a pressure-relief device. A PRV<br>that has an effective discharge area equal to or greater than the calculated value of _A_ is then chosen for the<br>application. <br>In USC units: <br>_A_ = (2) <br>_A_ = (3) <br>6 32 . _CK P K K_ d 1 b c <br>_A_ = (4) <br>1 175 . _CK P K K_ <br>1 175 . _CK P K K_ d 1 b c <br>. <br>In SI units: <br>_A_ = (5) <br>_A_ = (6)<br>_CK P K K_ <br>d 1 b c <br>_A_ =<br>_CK P K K_ <br>=<br>(7) <br>d 1 b c <br>where <br>_A_ is the required effective discharge area of the device, in. [2] (mm [2] ) (see 3.20); <br>_W_ is the required flow through the device, lb/h (kg/h); <br>_C...</code> |
  | <code>How many swellable packers were required to be run in the horizontal hole part for the AICV trial, and what was the purpose of this requirement?</code> | <code>Removing fluid from a wellbore column, allowing a well to flow initially, or bringing a previous well back online, nitrogen lifting is commonly used in north Iraq wells. Due to the inability of coiled tubing units to be delivered on time and their high cost, operators are forced to seek for an alternative method of unloading drilling fluid. A hydraulic Jet Pump is a technology used to complete the task.<br>A newly drilled well DB-H was chosen, and the drilling fluid volume calculated was 12,000 bbl. to pump to the surface and begin production, assuming nonstop operation between unloading and producing. The deployment of the hydraulic lift Jet Pump for both stages was planned. Well data from the operator was collected, the process design was initiated, and Jet Evaluation Modeling Software (JEMS) was used to run the design models. A Proper pump size was set up based on available data to meet operator expectations. A Reverse Circulating Jet Pump (RCJP) was chosen to be installed inside a Sli...</code> | <code>This development, predominantly from four artificial islands, of a giant offshore field in the United Arab Emirates (UAE) requires lateral compartmentalization with open hole packers of the 6 5/8" horizontal lower completions with lateral lengths greater than 16,000ft and total well lengths greater than 30,000ft MD. Swell Packer technology has enabled cost effective compartmentalization in horizontal laterals and is the preferred OH packer solution for the development.<br>Deploying swell packers is regarded as being a simple solution to compartmentalizing any lateral where typically the deployment fluid differs from the fluids in which it will swell in; this application prevents the elastomer from swelling during deployment and swelling upon contact with produced or injected fluids. The use of an extended delayed oil swell packer with no delay systems in this particular application enables the packers to be deployed in a Non Aqueous Reservoir Drill in Fluid (RDFNAF) where the packer is re...</code> |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim",
      "gather_across_devices": false
  }
  ```
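
With these parameters, MultipleNegativesRankingLoss treats every other positive in the batch as an in-batch negative: the cosine similarities between an anchor and all positives are scaled by 20.0 and fed to a cross-entropy loss whose target for anchor *i* is positive *i*. A stripped-down sketch on toy unit vectors (ignoring the explicit `negative` column and batching details for brevity):

```python
# Illustrative sketch of MultipleNegativesRankingLoss with in-batch negatives:
# logits are scaled cosine similarities; the label for anchor i is index i.
import math

def mnrl(anchors, positives, scale=20.0):
    total = 0.0
    for i, a in enumerate(anchors):
        # Cosine similarity of anchor i against every positive in the batch
        logits = [scale * sum(x * y for x, y in zip(a, p)) for p in positives]
        # Cross-entropy with target class i (log-softmax of the correct logit)
        log_softmax = logits[i] - math.log(sum(math.exp(l) for l in logits))
        total += -log_softmax
    return total / len(anchors)

anchors   = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.8, 0.6], [0.6, 0.8]]
loss = mnrl(anchors, positives)
print(loss)  # small, since each anchor is closest to its own positive
```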
646
+
647
+ ### Evaluation Dataset
648
+
649
+ #### offshore_energy
650
+
651
+ * Dataset: [offshore_energy](https://huggingface.co/datasets/Sampath1987/offshore_energy) at [0ebbfc6](https://huggingface.co/datasets/Sampath1987/offshore_energy/tree/0ebbfc615bc7c9bbd3d58315bc2e14e91f291fa1)
652
+ * Size: 11,141 evaluation samples
653
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
654
+ * Approximate statistics based on the first 1000 samples:
655
+ | | anchor | positive | negative |
656
+ |:--------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
657
+ | type | string | string | string |
658
+ | details | <ul><li>min: 12 tokens</li><li>mean: 24.37 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 38 tokens</li><li>mean: 428.35 tokens</li><li>max: 978 tokens</li></ul> | <ul><li>min: 29 tokens</li><li>mean: 405.3 tokens</li><li>max: 1111 tokens</li></ul> |
659
+ * Samples:
660
+ | anchor | positive | negative |
661
+ |:------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
662
+ | <code>How does partial jacket construction differ for vessels that cannot use staybolt construction?</code> | <code>**9-7 – 9-10** **ASME BPVC.VIII.1-2019** <br>**Figure 9-7** <br>_(2)_ Partial jackets that by virtue of their service or<br>configuration do not lend themselves to staybolt construction may be fabricated by other means providing<br>they are designed using appropriate stress values and<br>are proof tested in accordance with UG-101(p). <br>444 <br>**9-8** **FABRICATION** <br>_(a)_ Fabrication of vessels shall be in accordance with<br>applicable Parts of Subsection A and Subsection B, Part<br>UW. The requirements of UW-13(e) do not apply to closure rings.<br>_(b)_ This Appendix covers fabrication of jacketed vessels<br>by welding. Other methods of fabrication are permitted,<br>provided the requirements of applicable parts of this Di<br>vision are met. <br>_(c)_ Where only the inner vessel is subjected to lethal<br>service, the requirements of UW-2 shall apply only to<br>welds in the inner vessel and those welds attaching the<br>jacket to the inner vessel. Welds attaching the jacket to<br>the inner vessel need not be radiographed and may b...</code> | <code>**9-5 – 9-7** **ASME BPVC.VIII.1-2019** <br>‐ ‐<br>(g 5), and (g 6), may be used on any of the types of<br>jacketed vessels shown in Figure 9-2 where _t_ _rj_ does not<br>exceed [5] / 8 in. (16 mm).<br>_(7)_ Closures shown in Figure 9-5, sketch (h) used on<br>Type 3 jacketed vessels shown in Figure 9-2 shall have attachment welds in accordance with Figure 9-5, sketch <br>‐ ‐<br>(i 1) or (i 2). This construction is limited to jackets where<br>_t_ _rj_ does not exceed [5] / 8 in. (16 mm).<br>_(8)_ Closures for conical or toriconical jackets shown<br>in Figure 9-5, sketches (k) and (l) shall comply with the<br>requirements for Type 2 jacketed vessels shown in Figure<br>9-2. 
<br>_(d)_ Any radial welds in closure members shall be buttwelded joints penetrating through the full thickness of the<br>member and shall be ground flush where attachment<br>welds are to be made. <br>_(e)_ Where the inner vessel must meet the requirements<br>of UW-2, the attachment welds of the jacket to the inner<br>vessel need not be welded for their full thickness no...</code> |
663
+ | <code>What dimensions must fins and studs conform to as stipulated in Section 17.4.4?</code> | <code>**17.4 Examination of other components** <br>**17.4.1** Examination of heater steelwork shall be in accordance with the structural design code. <br>**17.4.2** Refractory linings shall be examined throughout for thickness variations during application and for cracks<br>after curing. Thickness tolerance is limited to a range of minus 6 mm (1/4 in) to plus 13 mm (1/2 in). Cracks which<br>are 3 mm (1/8 in) or greater in width and penetrate more than 50 % of the castable thickness shall be repaired.<br>Repairs shall be made by chipping out the unsound refractory to the backup layer interface or casing and<br>exposing a minimum of three tieback anchors, or to the sound metal, making a joint between sound refractory that<br>has a minimum slope of 25 mm (1 in) to the base metal (dove-tail construction) and then gunning, casting or<br>hand-packing the area to be repaired. <br>**17.4.3** Finned extended surface shall be examined to ensure fins are perpendicular to the tube within 15°. The<br>maximum discontinuity of the w...</code> | <code>**16.1** -112 STEEL ANCHORS [Sect. I8. <br>**3e.** **Detailing Requirements in Composite Components** <br>Steel anchors in composite components shall meet the following requirements: <br>(a) Minimum concrete cover to steel anchors shall be in accordance with ACI 318<br>provisions for concrete protection of headed shear stud reinforcement. <br>(b) Minimum center-to-center spacing of steel headed stud anchors shall be four<br>diameters in any direction. <br>(c) The maximum center-to-center spacing of steel headed stud anchors shall not <br>exceed 32 times the shank diameter. <br>(d) The maximum center-to-center spacing of steel channel anchors shall be 24 in.<br>(600 mm). 
<br>**User Note:** Detailing requirements provided in this section are absolute limits.<br>See Sections I8.3a, I8.3b and I8.3c for additional limitations required to preclude<br>edge and group effect considerations. <br>_Specification for Structural Steel Buildings,_ July 7, 2016<br>A MERICAN I NSTITUTE OF S TEEL C ONSTRUCTION</code> |
664
+ | <code>What are some common mistakes in oil and gas project execution that lead to financial losses?</code> | <code>Dozens of deepwater wells have been drilled in western South China Sea with about 30 percent have characteristics of high temperature and high pressure, which brought a series of difficulties and challenges to field operations. After incorporating the analysis of engineering and geological environment for deepwater HTHP wells in Lingshui block of western South China Sea, it is suggested that the solution of drilling problems for deepwater HTHP wells should start from drilling fluid. Several major technical problems are required to be addressed by drilling fluid, such as co-exist of low temperature and high temperature that lead to difficulty of drilling fluid maintenance and narrow density margin caused by deepwater and high pressure. Based on the above problems, combining with geological features of HTHP wells, researchers developed a novel water based drilling fluid system compatible with deepwater HTHP wells in Lingshui block on the basis of conventional HEM drilling fluid and furth...</code> | <code>The lack of availability of required skills and experience in most if not all parts of the oil and gas value chain is well documented so, rather than trying to make the case, we will summarise the challenge thus: the industry in all parts of the world can't find the capability it needs to safely get its work done in the timeframes it would like.<br>However or wherever the situation is measured, the consequence is that in days when the oil price might suggest that the industry has "never had it so good", many companies are falling seriously short of stakeholder expectations with projects of all types not being completed as planned or failing to deliver anticipated returns.<br>Close to home we see producers consistently missing quarterly production targets and a seemingly constant downgrading of forecasts and year-on-year plans.<br>This leads to a constant stream of bad news and criticism in the media, greater stress through all levels of management and an inevitable "knee jerk" towards a more sh...</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim",
+       "gather_across_devices": false
+   }
+   ```
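+ For readers unfamiliar with this loss: each anchor is paired with its matching positive, and every other positive in the batch serves as an in-batch negative, scored by scaled cosine similarity and pushed through cross-entropy. A plain-Python sketch of that objective (an illustration of the idea, not the sentence-transformers implementation; the toy embeddings are invented):
+
+ ```python
+ import math
+
+ def cos_sim(a, b):
+     # cosine similarity between two plain-list vectors
+     dot = sum(x * y for x, y in zip(a, b))
+     na = math.sqrt(sum(x * x for x in a))
+     nb = math.sqrt(sum(x * x for x in b))
+     return dot / (na * nb)
+
+ def mnr_loss(anchors, positives, scale=20.0):
+     """Multiple-negatives ranking sketch: anchor i should rank positive i
+     above every other positive in the batch (the in-batch negatives)."""
+     losses = []
+     for i, a in enumerate(anchors):
+         logits = [scale * cos_sim(a, p) for p in positives]
+         log_z = math.log(sum(math.exp(l) for l in logits))
+         losses.append(log_z - logits[i])  # cross-entropy with target class i
+     return sum(losses) / len(losses)
+
+ # toy 2-d embeddings: anchor i is closest to positive i
+ anchors = [[1.0, 0.0], [0.0, 1.0]]
+ positives = [[0.9, 0.1], [0.1, 0.9]]
+ print(mnr_loss(anchors, positives))  # small: pairs are well aligned
+ ```
+
+ Swapping the positives (`positives[::-1]`) misaligns every pair and makes the same function return a much larger loss, which is exactly the gradient signal the trainer exploits.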
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `learning_rate`: 2e-05
+ - `warmup_ratio`: 0.1
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 2e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 3
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `hub_revision`: None
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `liger_kernel_config`: None
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: proportional
+ - `router_mapping`: {}
+ - `learning_rate_mapping`: {}
+
+ </details>
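+ The `linear` scheduler together with `warmup_ratio: 0.1` means the learning rate ramps linearly from 0 to `2e-05` over the first 10% of optimizer steps, then decays linearly back to 0. A plain-Python sketch of that schedule (the total step count below is an illustrative estimate from the logs, not a value read from the trainer):
+
+ ```python
+ def linear_schedule_lr(step, total_steps, base_lr=2e-05, warmup_ratio=0.1):
+     """Linear warmup to base_lr, then linear decay to 0 (sketch of the
+     `linear` scheduler with `warmup_ratio: 0.1`)."""
+     warmup_steps = int(total_steps * warmup_ratio)
+     if step < warmup_steps:
+         return base_lr * step / max(1, warmup_steps)
+     return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
+
+ total = 16_710  # roughly 3 epochs at the ~5,570 steps/epoch implied by the logs
+ print(linear_schedule_lr(0, total))      # 0.0 (start of warmup)
+ print(linear_schedule_lr(1_671, total))  # 2e-05 (peak, end of warmup)
+ print(linear_schedule_lr(total, total))  # 0.0 (fully decayed)
+ ```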
+
+ ### Training Logs
+ | Epoch | Step | Training Loss | Validation Loss | ai-job-validation_cosine_accuracy |
+ |:------:|:-----:|:-------------:|:---------------:|:---------------------------------:|
+ | 0.1795 | 1000 | - | 1.1634 | 0.6597 |
+ | 0.3590 | 2000 | - | 1.0971 | 0.6821 |
+ | 0.5385 | 3000 | - | 1.0596 | 0.7050 |
+ | 0.7180 | 4000 | - | 1.0336 | 0.7193 |
+ | 0.8975 | 5000 | 1.2066 | 1.0073 | 0.7312 |
+ | 1.0770 | 6000 | - | 1.0060 | 0.7331 |
+ | 1.2565 | 7000 | - | 0.9794 | 0.7465 |
+ | 1.4360 | 8000 | - | 0.9657 | 0.7580 |
+ | 1.6155 | 9000 | - | 0.9498 | 0.7593 |
+ | 1.7950 | 10000 | 0.935 | 0.9387 | 0.7678 |
+ | 1.9745 | 11000 | - | 0.9293 | 0.7623 |
+ | 2.1540 | 12000 | - | 0.9313 | 0.7769 |
+ | 2.3335 | 13000 | - | 0.9245 | 0.7794 |
+ | 2.5130 | 14000 | - | 0.9190 | 0.7787 |
+ | 2.6925 | 15000 | 0.7607 | 0.9139 | 0.7782 |
+ | 2.8720 | 16000 | - | 0.9094 | 0.7850 |
+
+
+ ### Framework Versions
+ - Python: 3.10.12
+ - Sentence Transformers: 5.1.0
+ - Transformers: 4.53.3
+ - PyTorch: 2.8.0+cu128
+ - Accelerate: 1.9.0
+ - Datasets: 4.0.0
+ - Tokenizers: 0.21.2
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,49 @@
+ {
+   "architectures": [
+     "NewModel"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration.NewConfig",
+     "AutoModel": "modeling.NewModel",
+     "AutoModelForMaskedLM": "Alibaba-NLP/new-impl--modeling.NewForMaskedLM",
+     "AutoModelForMultipleChoice": "Alibaba-NLP/new-impl--modeling.NewForMultipleChoice",
+     "AutoModelForQuestionAnswering": "Alibaba-NLP/new-impl--modeling.NewForQuestionAnswering",
+     "AutoModelForSequenceClassification": "Alibaba-NLP/new-impl--modeling.NewForSequenceClassification",
+     "AutoModelForTokenClassification": "Alibaba-NLP/new-impl--modeling.NewForTokenClassification"
+   },
+   "classifier_dropout": 0.0,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "layer_norm_type": "layer_norm",
+   "logn_attention_clip1": false,
+   "logn_attention_scale": false,
+   "max_position_embeddings": 8192,
+   "model_type": "new",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pack_qkv": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "rope",
+   "rope_scaling": {
+     "factor": 8.0,
+     "type": "ntk"
+   },
+   "rope_theta": 20000,
+   "torch_dtype": "float32",
+   "transformers_version": "4.53.3",
+   "type_vocab_size": 1,
+   "unpad_inputs": false,
+   "use_memory_efficient_attention": false,
+   "vocab_size": 250048
+ }
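+ A few quantities implied by this config can be sanity-checked with plain arithmetic (the dict below copies only the fields used; `pooling_dim` restates the `word_embedding_dimension` from `1_Pooling/config.json` above):
+
+ ```python
+ config = {
+     "hidden_size": 768,
+     "num_attention_heads": 12,
+     "intermediate_size": 3072,
+ }
+
+ # each attention head works on hidden_size / num_attention_heads dimensions
+ head_dim = config["hidden_size"] // config["num_attention_heads"]
+ # the feed-forward layer uses the classic 4x expansion
+ ffn_ratio = config["intermediate_size"] // config["hidden_size"]
+
+ pooling_dim = 768  # word_embedding_dimension from 1_Pooling/config.json
+ assert head_dim * config["num_attention_heads"] == config["hidden_size"] == pooling_dim
+
+ print(head_dim)   # 64
+ print(ffn_ratio)  # 4
+ ```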
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "model_type": "SentenceTransformer",
+   "__version__": {
+     "sentence_transformers": "5.1.0",
+     "transformers": "4.53.3",
+     "pytorch": "2.8.0+cu128"
+   },
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
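+ Both prompts here are empty strings, so queries and documents are encoded without any prefix, and `similarity_fn_name` selects cosine scoring. A plain-Python sketch of both behaviors (illustrative only, not the library code):
+
+ ```python
+ import math
+
+ prompts = {"query": "", "document": ""}  # values from the config above
+
+ def apply_prompt(text, prompt_name):
+     # sentence-transformers prepends the named prompt string to the raw text;
+     # with an empty prompt the text passes through unchanged
+     return prompts[prompt_name] + text
+
+ def cosine(a, b):
+     dot = sum(x * y for x, y in zip(a, b))
+     return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
+
+ print(apply_prompt("find ML jobs", "query"))         # unchanged: prompt is ""
+ print(round(cosine([1.0, 2.0], [2.0, 4.0]), 6))      # 1.0 (parallel vectors)
+ ```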
configuration.py ADDED
@@ -0,0 +1,145 @@
+ # coding=utf-8
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ NEW model configuration"""
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import logging
+ 
+ logger = logging.get_logger(__name__)
+ 
+ 
+ class NewConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`NewModel`] or a [`TFNewModel`]. It is used to
+     instantiate a NEW model according to the specified arguments, defining the model architecture. Instantiating a
+     configuration with the defaults will yield a similar configuration to that of the NEW
+     [izhx/new-base-en](https://huggingface.co/izhx/new-base-en) architecture.
+ 
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+ 
+ 
+     Args:
+         vocab_size (`int`, *optional*, defaults to 30522):
+             Vocabulary size of the NEW model. Defines the number of different tokens that can be represented by the
+             `inputs_ids` passed when calling [`NewModel`] or [`TFNewModel`].
+         hidden_size (`int`, *optional*, defaults to 768):
+             Dimensionality of the encoder layers and the pooler layer.
+         num_hidden_layers (`int`, *optional*, defaults to 12):
+             Number of hidden layers in the Transformer encoder.
+         num_attention_heads (`int`, *optional*, defaults to 12):
+             Number of attention heads for each attention layer in the Transformer encoder.
+         intermediate_size (`int`, *optional*, defaults to 3072):
+             Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
+         hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
+             The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
+             `"relu"`, `"silu"` and `"gelu_new"` are supported.
+         hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
+             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+         attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
+             The dropout ratio for the attention probabilities.
+         max_position_embeddings (`int`, *optional*, defaults to 512):
+             The maximum sequence length that this model might ever be used with. Typically set this to something large
+             just in case (e.g., 512 or 1024 or 2048).
+         type_vocab_size (`int`, *optional*, defaults to 2):
+             The vocabulary size of the `token_type_ids` passed when calling [`NewModel`] or [`TFNewModel`].
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         layer_norm_eps (`float`, *optional*, defaults to 1e-12):
+             The epsilon used by the layer normalization layers.
+         position_embedding_type (`str`, *optional*, defaults to `"rope"`):
+             Type of position embedding. Choose one of `"absolute"`, `"rope"`.
+         rope_theta (`float`, *optional*, defaults to 10000.0):
+             The base period of the RoPE embeddings.
+         rope_scaling (`Dict`, *optional*):
+             Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
+             strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
+             `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
+             `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
+             these scaling strategies behave:
+             https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
+             experimental feature, subject to breaking API changes in future versions.
+         classifier_dropout (`float`, *optional*):
+             The dropout ratio for the classification head.
+ 
+     Examples:
+ 
+     ```python
+     >>> from transformers import NewConfig, NewModel
+ 
+     >>> # Initializing a NEW izhx/new-base-en style configuration
+     >>> configuration = NewConfig()
+ 
+     >>> # Initializing a model (with random weights) from the izhx/new-base-en style configuration
+     >>> model = NewModel(configuration)
+ 
+     >>> # Accessing the model configuration
+     >>> configuration = model.config
+     ```"""
+ 
+     model_type = "new"
+ 
+     def __init__(
+         self,
+         vocab_size=30528,
+         hidden_size=768,
+         num_hidden_layers=12,
+         num_attention_heads=12,
+         intermediate_size=3072,
+         hidden_act="gelu",
+         hidden_dropout_prob=0.1,
+         attention_probs_dropout_prob=0.0,
+         max_position_embeddings=2048,
+         type_vocab_size=1,
+         initializer_range=0.02,
+         layer_norm_type='layer_norm',
+         layer_norm_eps=1e-12,
+         # pad_token_id=0,
+         position_embedding_type="rope",
+         rope_theta=10000.0,
+         rope_scaling=None,
+         classifier_dropout=None,
+         pack_qkv=True,
+         unpad_inputs=False,
+         use_memory_efficient_attention=False,
+         logn_attention_scale=False,
+         logn_attention_clip1=False,
+         **kwargs,
+     ):
+         super().__init__(**kwargs)
+ 
+         self.vocab_size = vocab_size
+         self.hidden_size = hidden_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.hidden_act = hidden_act
+         self.intermediate_size = intermediate_size
+         self.hidden_dropout_prob = hidden_dropout_prob
+         self.attention_probs_dropout_prob = attention_probs_dropout_prob
+         self.max_position_embeddings = max_position_embeddings
+         self.type_vocab_size = type_vocab_size
+         self.initializer_range = initializer_range
+         self.layer_norm_type = layer_norm_type
+         self.layer_norm_eps = layer_norm_eps
+         self.position_embedding_type = position_embedding_type
+         self.rope_theta = rope_theta
+         self.rope_scaling = rope_scaling
+         self.classifier_dropout = classifier_dropout
+ 
+         self.pack_qkv = pack_qkv
+         self.unpad_inputs = unpad_inputs
+         self.use_memory_efficient_attention = use_memory_efficient_attention
+         self.logn_attention_scale = logn_attention_scale
+         self.logn_attention_clip1 = logn_attention_clip1
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:10e213517c8ea188d9928352e4f840c7dc99f2a01e0ca935229048b0d46df0e8
+ size 1221487872
modeling.py ADDED
@@ -0,0 +1,1418 @@
+ # coding=utf-8
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """PyTorch NEW model."""
+ 
+ import math
+ from dataclasses import dataclass
+ from typing import List, Optional, Tuple, Union
+ 
+ import torch
+ import torch.utils.checkpoint
+ from torch import nn
+ 
+ from transformers.activations import ACT2FN
+ from transformers.modeling_outputs import (
+     BaseModelOutput,
+     BaseModelOutputWithPooling,
+     MaskedLMOutput,
+     MultipleChoiceModelOutput,
+     QuestionAnsweringModelOutput,
+     SequenceClassifierOutput,
+     ModelOutput,
+ )
+ from transformers.modeling_utils import PreTrainedModel
+ from transformers.utils import logging
+ 
+ try:
+     import xformers.ops as xops
+ except ImportError as e:
+     xops = None
+ 
+ from .configuration import NewConfig
+ 
+ 
+ logger = logging.get_logger(__name__)
+ 
+ 
+ # Adapted from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/bert_padding.py
+ # Which was adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
+ class IndexFirstAxis(torch.autograd.Function):
+     @staticmethod
+     def forward(ctx, input, indices):
+         ctx.save_for_backward(indices)
+         assert input.ndim >= 2
+         ctx.first_axis_dim, other_shape = input.shape[0], input.shape[1:]
+         second_dim = other_shape.numel()
+         # TD [2022-03-04] For some reason torch.gather is a bit faster than indexing.
+         # return input[indices]
+         # return torch.gather(
+         #     rearrange(input, "b ... -> b (...)"), 0, repeat(indices, "z -> z d", d=second_dim)
+         # ).reshape(-1, *other_shape)
+         return torch.gather(
+             input.view(ctx.first_axis_dim, second_dim),
+             0,
+             indices.unsqueeze(-1).expand(indices.size(0), second_dim)
+         ).reshape(-1, *other_shape)
+ 
+     @staticmethod
+     def backward(ctx, grad_output):
+         (indices,) = ctx.saved_tensors
+         assert grad_output.ndim >= 2
+         other_shape = grad_output.shape[1:]
+         # grad_output = rearrange(grad_output, "b ... -> b (...)")
+         grad_output = grad_output.view(grad_output.size(0), other_shape.numel())
+         grad_input = torch.zeros(
+             [ctx.first_axis_dim, grad_output.shape[1]],
+             device=grad_output.device,
+             dtype=grad_output.dtype,
+         )
+         # TD [2022-03-04] For some reason torch.scatter is a bit faster than indexing.
+         # grad_input[indices] = grad_output
+         # grad_input.scatter_(0, repeat(indices, "z -> z d", d=grad_output.shape[1]), grad_output)
+         grad_input.scatter_(
+             0, indices.unsqueeze(-1).expand(indices.size(0), grad_output.size(1)), grad_output
+         )
+         return grad_input.reshape(ctx.first_axis_dim, *other_shape), None
+ 
+ 
+ index_first_axis = IndexFirstAxis.apply
+ 
+ 
+ def unpad_input(hidden_states, attention_mask=None, indices=None):
+     """
+     Arguments:
+         hidden_states: (batch, seqlen, ...)
+         attention_mask: (batch, seqlen), bool / int, 1 means valid and 0 means not valid.
+         indices: (total_nnz), the indices of non-masked tokens from the flattened input sequence.
+     Return:
+         hidden_states: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
+     """
+     if indices is None:
+         assert attention_mask is not None
+         indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+ 
+     # TD [2022-03-04] We don't want to index with a bool mask, because Pytorch will expand the
+     # bool mask, then call nonzero to get the indices, then index with those. The indices are @dim
+     # times larger than they need to be, wasting memory. It's faster and more memory-efficient to
+     # index with integer indices. Moreover, torch's index is a bit slower than it needs to be,
+     # so we write custom forward and backward to make it a bit faster.
+     hidden_states = hidden_states.view(-1, *hidden_states.shape[2:])
+     return index_first_axis(hidden_states, indices)
+ 
+ 
+ class IndexPutFirstAxis(torch.autograd.Function):
+     @staticmethod
+     def forward(
+         ctx,
+         values: torch.Tensor,
+         indices: torch.Tensor,
+         first_axis_dim
+     ) -> torch.Tensor:
+         ctx.save_for_backward(indices)
+         assert indices.ndim == 1
+         assert values.ndim >= 2
+         output = torch.zeros(
+             first_axis_dim, *values.shape[1:], device=values.device, dtype=values.dtype
+         )
+         output[indices] = values
+         return output
+ 
+     @staticmethod
+     def backward(ctx, grad_output: torch.Tensor) -> Tuple[torch.Tensor, None, None]:
+         indices, = ctx.saved_tensors
+         grad_values = grad_output[indices]
+         return grad_values, None, None
+ 
+ 
+ index_put_first_axis = IndexPutFirstAxis.apply
+ 
+ 
+ def pad_input(inputs: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int) -> torch.Tensor:
+     """Add padding to sequences.
+ 
+     Arguments:
+         inputs: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
+         indices: (total_nnz), `indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()`
+         batch: int batch_size
+         seqlen: int max sequence length
+ 
+     Returns:
+         inputs: (batch, seqlen, ...)
+     """
+     output = index_put_first_axis(inputs, indices, batch * seqlen)
+     return output.view(batch, seqlen, *inputs.shape[1:])
+ 
+ 
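+ # The gather/scatter pair above can be illustrated without tensors: unpadding
+ # keeps only the positions the attention mask marks valid, and padding scatters
+ # them back into a rectangular batch. A toy-list sketch (no autograd, not the
+ # flash-attention implementation):
+ #
+ # def unpad(tokens_2d, mask_2d):
+ #     flat_tokens = [t for row in tokens_2d for t in row]
+ #     flat_mask = [m for row in mask_2d for m in row]
+ #     indices = [i for i, m in enumerate(flat_mask) if m]
+ #     return [flat_tokens[i] for i in indices], indices
+ #
+ # def pad(values, indices, batch, seqlen, fill=0):
+ #     flat = [fill] * (batch * seqlen)
+ #     for v, i in zip(values, indices):
+ #         flat[i] = v
+ #     return [flat[b * seqlen:(b + 1) * seqlen] for b in range(batch)]
+ #
+ # tokens = [[5, 6, 0], [7, 0, 0]]
+ # mask = [[1, 1, 0], [1, 0, 0]]
+ # values, idx = unpad(tokens, mask)   # -> [5, 6, 7], indices [0, 1, 3]
+ # pad(values, idx, 2, 3)              # -> [[5, 6, 0], [7, 0, 0]] (round trip)
+ 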
+ def rotate_half(x):
+     """Rotates half the hidden dims of the input."""
+     x1 = x[..., : x.shape[-1] // 2]
+     x2 = x[..., x.shape[-1] // 2 :]
+     return torch.cat((-x2, x1), dim=-1)
+ 
+ 
+ def apply_rotary_pos_emb(q, k, cos, sin):
+     """Applies Rotary Position Embedding to the query and key tensors.
+ 
+     Args:
+         q (`torch.Tensor`): The query tensor.
+         k (`torch.Tensor`): The key tensor.
+         cos (`torch.Tensor`): The cosine part of the rotary embedding.
+         sin (`torch.Tensor`): The sine part of the rotary embedding.
+     Returns:
+         `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
+     """
+     cos, sin = cos.to(q.dtype), sin.to(q.dtype)
+     q_embed = (q * cos) + (rotate_half(q) * sin)
+     k_embed = (k * cos) + (rotate_half(k) * sin)
+     return q_embed, k_embed
+ 
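+ # The formula above splits each head dimension in half and rotates the two
+ # halves as 2-D pairs, which leaves vector norms unchanged. A plain-Python
+ # sketch with one shared angle (illustrative; the real code applies
+ # per-position cos/sin caches):
+ #
+ # import math
+ #
+ # def rotate_half_list(x):
+ #     half = len(x) // 2
+ #     x1, x2 = x[:half], x[half:]
+ #     return [-v for v in x2] + x1
+ #
+ # def apply_rope(vec, theta):
+ #     c, s = math.cos(theta), math.sin(theta)
+ #     return [v * c + r * s for v, r in zip(vec, rotate_half_list(vec))]
+ #
+ # rotate_half_list([1.0, 2.0, 3.0, 4.0])  # -> [-3.0, -4.0, 1.0, 2.0]
+ # # apply_rope preserves the Euclidean norm of the input vector
+ 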
+ 
+ class RotaryEmbedding(torch.nn.Module):
+     def __init__(self, dim, max_position_embeddings=512, base=10000.0, device=None):
+         super().__init__()
+ 
+         self.dim = dim
+         self.max_position_embeddings = max_position_embeddings
+         self.base = base
+         inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+ 
+         # Build here to make `torch.jit.trace` work.
+         self._set_cos_sin_cache(
+             seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
+         )
+ 
+     def _set_cos_sin_cache(self, seq_len, device, dtype):
+         self.max_seq_len_cached = seq_len
+         t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
+ 
+         freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+         # Different from paper, but it uses a different permutation in order to obtain the same calculation
+         emb = torch.cat((freqs, freqs), dim=-1)
+         self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+         self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+ 
+     def forward(self, x, seq_len=None):
+         # x: [bs, num_attention_heads, seq_len, head_size]
+         if seq_len > self.max_seq_len_cached:
+             self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
+ 
+         return (
+             self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
+             self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
+         )
+ 
+ 
+ class NTKScalingRotaryEmbedding(RotaryEmbedding):
+     """RotaryEmbedding extended with fixed and mixed NTK scaling. https://kexue.fm/archives/9706 """
+ 
+     def __init__(self, dim, max_position_embeddings=512, base=10000, device=None, scaling_factor=1.0, mixed_b=None):
+         self.scaling_factor = scaling_factor
+         self.mixed_b = mixed_b
+         super().__init__(dim, max_position_embeddings, base, device)
+         max_position_embeddings = max_position_embeddings * self.scaling_factor
+         self._set_cos_sin_cache(max_position_embeddings, self.inv_freq.device, torch.get_default_dtype())
+ 
+     def _set_cos_sin_cache(self, seq_len, device, dtype):
+         self.max_seq_len_cached = seq_len
+ 
+         if seq_len > self.max_position_embeddings:
+             base = self.base * (self.scaling_factor if self.mixed_b is None else 1)
+             inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
+ 
+             if self.mixed_b is None:
+                 inv_freq = inv_freq / self.scaling_factor ** (2 / self.dim)  # (6)
+             else:
+                 a = torch.tensor(self.scaling_factor).log() / (self.dim / 2) ** self.mixed_b  # (13)
+                 lambda_1_m = (a * torch.arange(1, self.dim // 2 + 1).float().to(device) ** self.mixed_b).exp()  # (12)
+                 inv_freq = inv_freq / lambda_1_m  # (10)
+ 
+             self.register_buffer("inv_freq", inv_freq, persistent=False)
+ 
+         t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
+ 
+         freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+         # Different from paper, but it uses a different permutation in order to obtain the same calculation
+         emb = torch.cat((freqs, freqs), dim=-1)
+         self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+         self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+ 
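+ # Numerically, the fixed-NTK branch (the `# (6)` line above) raises the base
+ # and then divides by scaling_factor ** (2 / dim), which shrinks every inverse
+ # frequency so the same rotary cache stretches over longer sequences. A
+ # plain-Python sketch (the dim and factor values are illustrative):
+ #
+ # def inv_freq(dim, base):
+ #     return [base ** (-(2 * i) / dim) for i in range(dim // 2)]
+ #
+ # def ntk_scaled_inv_freq(dim, base, scaling_factor):
+ #     scaled = inv_freq(dim, base * scaling_factor)
+ #     return [f / scaling_factor ** (2 / dim) for f in scaled]
+ #
+ # plain = inv_freq(64, 10000.0)
+ # scaled = ntk_scaled_inv_freq(64, 10000.0, 8.0)
+ # all(s < p for s, p in zip(scaled, plain))  # True: every frequency shrinks
+ 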
+ 
+ class RMSNorm(nn.Module):
+     def __init__(self, hidden_size, eps=1e-6):
+         """
+         RMSNorm is equivalent to T5LayerNorm
+         """
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(hidden_size))
+         self.variance_epsilon = eps
+ 
+     def forward(self, hidden_states):
+         input_dtype = hidden_states.dtype
+         hidden_states = hidden_states.to(torch.float32)
+         variance = hidden_states.pow(2).mean(-1, keepdim=True)
+         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+         return self.weight * hidden_states.to(input_dtype)
+ 
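+ # As the docstring notes, this normalizes by the root mean square with no
+ # mean-centering, unlike LayerNorm. The same computation in plain Python
+ # (illustrative sketch, learned weight left as an optional list):
+ #
+ # import math
+ #
+ # def rms_norm(vec, weight=None, eps=1e-6):
+ #     rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
+ #     out = [v / rms for v in vec]
+ #     if weight is not None:
+ #         out = [w * v for w, v in zip(weight, out)]
+ #     return out
+ #
+ # # after normalization the output's own RMS is (up to eps) exactly 1
+ # rms_norm([3.0, 4.0])  # each element divided by sqrt(12.5 + eps)
+ 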
270
+
271
+ LAYER_NORM = {
272
+ 'layer_norm': nn.LayerNorm,
273
+ 'rms_norm': RMSNorm
274
+ }
275
+
276
+
277
+ class NewEmbeddings(nn.Module):
+     """
+     Embedding and Unpadding.
+     """
+ 
+     def __init__(self, config: NewConfig):
+         super().__init__()
+         self.padding_idx = config.pad_token_id
+         self.word_embeddings = nn.Embedding(
+             config.vocab_size, config.hidden_size, padding_idx=self.padding_idx
+         )
+ 
+         self.position_embedding_type = config.position_embedding_type
+         if self.position_embedding_type == 'absolute':
+             self.position_embeddings = nn.Embedding(
+                 config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
+             )
+         elif self.position_embedding_type == 'rope':
+             self._init_rope(config)
+         else:
+             raise ValueError(f"Unknown position embedding type {self.position_embedding_type}")
+ 
+         self.type_vocab_size = config.type_vocab_size
+         if self.type_vocab_size > 0:
+             self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
+ 
+         # self.LayerNorm is not snake-cased to stick with the TensorFlow model variable name and be able to load
+         # any TensorFlow checkpoint file
+         self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+         self.dropout = nn.Dropout(config.hidden_dropout_prob)
+         # position_ids is contiguous in memory and excluded when serialized
+         self.register_buffer(
+             "position_ids", torch.arange(config.max_position_embeddings), persistent=False
+         )
+ 
+     def _init_rope(self, config):
+         kwargs = dict(
+             dim=int(config.hidden_size / config.num_attention_heads),
+             max_position_embeddings=config.max_position_embeddings,
+             base=config.rope_theta
+         )
+         if config.rope_scaling is None:
+             self.rotary_emb = RotaryEmbedding(**kwargs)
+         else:
+             kwargs.update(scaling_factor=config.rope_scaling["factor"])
+             scaling_type = config.rope_scaling["type"]
+             if scaling_type == 'ntk':
+                 kwargs.update(mixed_b=config.rope_scaling.get('mixed_b', None))
+                 self.rotary_emb = NTKScalingRotaryEmbedding(**kwargs)
+             # elif scaling_type == "linear":
+             #     self.rotary_emb = LinearScalingRotaryEmbedding(**kwargs)
+             # elif scaling_type == "dynamic":
+             #     self.rotary_emb = DynamicNTKScalingRotaryEmbedding(**kwargs)
+             else:
+                 raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
+ 
+     def forward(
+         self,
+         unpad_inputs: bool,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         length: Optional[List[int]] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+     ) -> Tuple[torch.Tensor, torch.Tensor, Optional[Tuple], Optional[List[int]]]:
+         if inputs_embeds is None:
+             device, input_shape = input_ids.device, input_ids.shape
+         else:
+             device, input_shape = inputs_embeds.device, inputs_embeds.shape[:2]
+         batch_size, seq_length = input_shape
+ 
+         # Set attention_mask if it's None
+         if attention_mask is None:
+             attention_mask = torch.ones(input_shape, device=device)
+             if length is not None:
+                 for i, l in enumerate(length):
+                     attention_mask[i, l:] = 0
+ 
+         # Set attention_mask_bool for unpadding
+         if unpad_inputs:
+             attention_mask_bool = attention_mask.bool()
+             if length is None:
+                 length = attention_mask.sum(-1).tolist()
+ 
+         # Get word embeddings
+         if inputs_embeds is None:
+             if unpad_inputs:
+                 input_ids = input_ids[attention_mask_bool].unsqueeze(0)
+             inputs_embeds = self.word_embeddings(input_ids)
+         else:
+             if unpad_inputs:
+                 inputs_embeds = inputs_embeds[attention_mask_bool].unsqueeze(0)
+         embeddings = inputs_embeds
+ 
+         # Set and unpad position_ids
+         if position_ids is None:
+             if seq_length > self.position_ids.size(0):
+                 self.register_buffer(
+                     "position_ids", torch.arange(seq_length, device=embeddings.device), persistent=False
+                 )
+             if unpad_inputs:
+                 # [1, cumsum_seq_len]
+                 position_ids = torch.cat([self.position_ids[:l] for l in length]).unsqueeze(0)
+             else:
+                 # [bs, seq_len]
+                 position_ids = self.position_ids[:seq_length].expand(batch_size, -1)
+         elif unpad_inputs:
+             position_ids = position_ids[attention_mask_bool].unsqueeze(0)  # [1, cumsum_seq_len]
+ 
+         # Compute rotary embedding
+         if self.position_embedding_type == 'rope':
+             rope_cos, rope_sin = self.rotary_emb(inputs_embeds, seq_len=seq_length)
+             rope_cos = rope_cos[position_ids].unsqueeze(2)  # [bs, seq_len, 1, dim]
+             rope_sin = rope_sin[position_ids].unsqueeze(2)  # [bs, seq_len, 1, dim]
+             rope_embeds = rope_cos, rope_sin
+         else:
+             rope_embeds = None
+ 
+         if self.type_vocab_size > 0:
+             if token_type_ids is None:
+                 token_type_ids = position_ids.mul(0)
+             else:
+                 if self.type_vocab_size < 2:
+                     token_type_ids.mul_(0)
+                 if unpad_inputs:
+                     token_type_ids = token_type_ids[attention_mask_bool].unsqueeze(0)
+ 
+             token_type_embeddings = self.token_type_embeddings(token_type_ids)
+             embeddings = embeddings + token_type_embeddings
+ 
+         # BERT position
+         if self.position_embedding_type == "absolute":
+             position_embeddings = self.position_embeddings(position_ids)
+             embeddings = embeddings + position_embeddings
+ 
+         embeddings = self.LayerNorm(embeddings)
+         embeddings = self.dropout(embeddings)
+ 
+         return embeddings, attention_mask, rope_embeds, length
+ 
+ 
421
+ class NewAttention(nn.Module):
+     def __init__(self, config: NewConfig, pack_qkv=None, use_memory_efficient_attention=None):
+         super().__init__()
+         self.config = config
+         if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
+             raise ValueError(
+                 f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
+                 f"heads ({config.num_attention_heads})"
+             )
+ 
+         self.hidden_size = config.hidden_size
+         self.num_attention_heads = config.num_attention_heads
+         self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
+         self.all_head_size = self.num_attention_heads * self.attention_head_size
+ 
+         if pack_qkv is None:
+             pack_qkv = config.pack_qkv
+         self.pack_qkv = pack_qkv
+ 
+         if self.pack_qkv:
+             self.qkv_proj = nn.Linear(config.hidden_size, self.all_head_size * 3, bias=True)
+         else:
+             self.q_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
+             self.k_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
+             self.v_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
+ 
+         self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
+         self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)
+ 
+         if use_memory_efficient_attention is None:
+             use_memory_efficient_attention = self.config.use_memory_efficient_attention
+         self.use_memory_efficient_attention = use_memory_efficient_attention
+         self.memory_efficient_attention = None if xops is None else xops.memory_efficient_attention
+         if self.use_memory_efficient_attention:
+             assert self.memory_efficient_attention is not None, 'please install xformers'
+ 
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_bias: torch.FloatTensor,
+         rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
+         padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
+         attention_scale: Optional[torch.FloatTensor] = None,
+         head_mask: Optional[torch.FloatTensor] = None,
+         output_attentions: Optional[bool] = False,
+         qkv_inputs: Optional[Tuple] = None,  # For RetroMAE
+     ) -> Tuple[torch.Tensor, ...]:
+         shape_hd = (self.num_attention_heads, self.attention_head_size)
+         # qkv
+         if self.pack_qkv and qkv_inputs is None:
+             qkv_pack = self.qkv_proj(hidden_states).split(self.all_head_size, dim=-1)
+         else:
+             if qkv_inputs is None:
+                 qkv_inputs = (hidden_states, hidden_states, hidden_states)
+             qkv_pack = [
+                 getattr(self, n + '_proj')(s) for s, n in zip(qkv_inputs, 'qkv')
+             ]
+         query_states, key_states, value_states = [t.view(t.shape[:-1] + shape_hd) for t in qkv_pack]
+ 
+         if self.config.position_embedding_type == 'rope':
+             query_states, key_states = apply_rotary_pos_emb(query_states, key_states, *rope_embeds)
+ 
+         dtype = query_states.dtype
+ 
+         if self.config.logn_attention_scale and attention_scale is not None:
+             # https://kexue.fm/archives/8823
+             query_states = query_states * attention_scale.to(dtype)
+ 
+         if padding_inputs is not None:
+             query_states = pad_input(query_states.squeeze(), *padding_inputs)
+             key_states = pad_input(key_states.squeeze(), *padding_inputs)
+             value_states = pad_input(value_states.squeeze(), *padding_inputs)
+ 
+         if self.use_memory_efficient_attention:
+             assert self.memory_efficient_attention is not None, "xformers is not loaded"
+             assert output_attentions is False, "memory_efficient_attention does not output attentions"
+             assert head_mask is None, "head_mask is not supported yet"
+             attention_probs = None
+             if torch.is_tensor(attention_bias):
+                 attention_bias = attention_bias.to(dtype)
+             context_layer = self.memory_efficient_attention(
+                 query_states,
+                 key_states,
+                 value_states,
+                 attn_bias=attention_bias,
+                 p=self.dropout.p
+             )
+         else:
+             if output_attentions and isinstance(self, NewSdpaAttention):
+                 raise RuntimeError("SDPA does not output attentions")
+             context_layer, attention_probs = self._attention(
+                 query_states, key_states, value_states, attention_bias, head_mask
+             )
+ 
+         if padding_inputs is not None:
+             context_layer = unpad_input(context_layer, indices=padding_inputs[0])
+ 
+         new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
+         context_layer = context_layer.view(new_context_layer_shape)
+ 
+         # output proj
+         attn_output = self.o_proj(context_layer)
+ 
+         # add attentions if we output them
+         outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)
+         return outputs
+ 
+     def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
+         """
+         Args:
+             q/k/v: (B, L, n_head, head_dim)
+         Returns:
+             attn_output: (B, L, n_head, head_dim)
+         """
+         query_states = query_states.transpose(1, 2)
+         key_states = key_states.transpose(1, 2)
+         value_states = value_states.transpose(1, 2)
+         # Take the dot product between "query" and "key" to get the raw attention scores.
+         attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2))
+ 
+         attention_scores = attention_scores / math.sqrt(self.attention_head_size)
+         if attention_bias is not None:
+             # Apply the attention mask (precomputed for all layers in the model's forward() function)
+             attention_scores = attention_scores + attention_bias
+ 
+         # Normalize the attention scores to probabilities.
+         attention_probs = nn.functional.softmax(attention_scores, dim=-1)
+ 
+         # This is actually dropping out entire tokens to attend to, which might
+         # seem a bit unusual, but is taken from the original Transformer paper.
+         if self.dropout.p > 0:
+             attention_probs = self.dropout(attention_probs)
+ 
+         # Mask heads if we want to
+         if head_mask is not None:
+             attention_probs = attention_probs * head_mask
+ 
+         context_layer = torch.matmul(attention_probs, value_states)
+ 
+         context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
+         return context_layer, attention_probs
+ 
+ 
564
+ class NewSdpaAttention(NewAttention):
+     """
+     New attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
+     `NewAttention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
+     the SDPA API.
+     """
+     def __init__(self, config: NewConfig, **kwargs):
+         super().__init__(config, **kwargs)
+         # torch.backends.cuda.enable_mem_efficient_sdp(False)
+         # logger.warning(
+         #     "Disable memory efficient attention kernel for `NewSdpaAttention`, you can set "
+         #     "`use_memory_efficient_attention=True` if it is expected to be used."
+         # )
+ 
+     def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
+         attn_output = torch.nn.functional.scaled_dot_product_attention(
+             query_states.transpose(1, 2),
+             key_states.transpose(1, 2),
+             value_states.transpose(1, 2),
+             attn_mask=attention_bias,
+             dropout_p=self.dropout.p if self.training else 0.0,
+         )
+         attn_output = attn_output.permute(0, 2, 1, 3).contiguous()
+         return attn_output, None
+ 
+ 
+ NEW_ATTENTION_CLASSES = {
+     "eager": NewAttention,
+     # "flash_attention_2": ,  # TODO
+     "sdpa": NewSdpaAttention,
+ }
+ 
+ 
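The "eager" and "sdpa" paths registered above compute the same function; SDPA only changes the kernel. A minimal sanity sketch showing that, with no mask and no dropout, the eager `softmax(QK^T / sqrt(d)) V` path matches `torch.nn.functional.scaled_dot_product_attention` (shapes here are illustrative):

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, L, D = 2, 4, 8, 16  # batch, heads, seq_len, head_dim
q = torch.randn(B, H, L, D)
k = torch.randn(B, H, L, D)
v = torch.randn(B, H, L, D)

# Eager path: explicit score matrix, scaling, softmax, weighted sum.
scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(D)
eager_out = torch.matmul(F.softmax(scores, dim=-1), v)

# Fused path: same math, dispatched to an optimized kernel.
sdpa_out = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.0)

outputs_match = torch.allclose(eager_out, sdpa_out, atol=1e-4)
```

The trade-off is visibility: the fused path never materializes the `L x L` probability matrix, which is exactly why `output_attentions` cannot be honored on the SDPA path above.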
597
+ class NewGatedMLP(nn.Module):
+     """
+     GLU Variants Improve Transformer.
+     """
+ 
+     def __init__(self, config: NewConfig):
+         super().__init__()
+         self.intermediate_size = config.intermediate_size
+         self.up_gate_proj = nn.Linear(config.hidden_size, self.intermediate_size * 2, bias=False)
+         self.down_proj = nn.Linear(self.intermediate_size, config.hidden_size, bias=True)
+         self.act_fn = ACT2FN[config.hidden_act]
+         if config.hidden_dropout_prob > 0:
+             self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
+         else:
+             self.hidden_dropout = None
+ 
+     def forward(self, hidden_states):
+         up_gate = self.up_gate_proj(hidden_states)
+         up_states, gate = torch.split(up_gate, self.intermediate_size, dim=-1)
+         gate = self.act_fn(gate)
+         gated_states = gate * up_states
+         if self.hidden_dropout is not None:
+             gated_states = self.hidden_dropout(gated_states)
+         down_states = self.down_proj(gated_states)
+         return down_states
+ 
+ 
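A standalone sketch of the gated-MLP computation above: one fused projection is split into "up" and "gate" halves, the gate is passed through the activation, and the elementwise product is projected back down. Dimensions and the GELU choice are illustrative (the real activation comes from `config.hidden_act`):

```python
import torch
from torch import nn

hidden_size, intermediate_size = 32, 64
up_gate_proj = nn.Linear(hidden_size, intermediate_size * 2, bias=False)
down_proj = nn.Linear(intermediate_size, hidden_size, bias=True)
act_fn = nn.GELU()  # stand-in for ACT2FN[config.hidden_act]

x = torch.randn(3, 7, hidden_size)

# One matmul produces both branches; split recovers them in (up, gate) order.
up_states, gate = torch.split(up_gate_proj(x), intermediate_size, dim=-1)
out = down_proj(act_fn(gate) * up_states)

shape_preserved = out.shape == x.shape
```

Fusing the two input projections into a single `up_gate_proj` halves the number of GEMM launches versus separate `up_proj`/`gate_proj` layers while keeping the GLU formulation.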
624
+ class NewLayer(nn.Module):
+     def __init__(
+         self,
+         config: NewConfig,
+         pack_qkv=None,
+         use_memory_efficient_attention=None,
+         attn_implementation=None
+     ):
+         super().__init__()
+         if attn_implementation is None:
+             attn_implementation = config._attn_implementation
+         if use_memory_efficient_attention is None:
+             use_memory_efficient_attention = config.use_memory_efficient_attention
+         if use_memory_efficient_attention:
+             if attn_implementation != 'eager':
+                 logger.warning_once(f"Override {attn_implementation=} to 'eager' as {use_memory_efficient_attention=}")
+             attn_implementation = 'eager'  # Since it will be SDPA by default for torch>=2.1.1
+         self.attention = NEW_ATTENTION_CLASSES[attn_implementation](
+             config, pack_qkv=pack_qkv, use_memory_efficient_attention=use_memory_efficient_attention
+         )
+         self.mlp = NewGatedMLP(config)
+ 
+         ln_class = LAYER_NORM[config.layer_norm_type]
+         self.attn_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
+         self.mlp_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
+ 
+         if config.hidden_dropout_prob > 0:
+             self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
+         else:
+             self.hidden_dropout = None
+ 
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_bias: torch.FloatTensor,
+         rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
+         padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
+         attention_scale: Optional[torch.FloatTensor] = None,
+         subset_indices: Optional[torch.LongTensor] = None,
+         head_mask: Optional[torch.FloatTensor] = None,
+         output_attentions: Optional[bool] = False,
+         qkv_inputs: Optional[Tuple] = None,  # For RetroMAE
+     ) -> Tuple[torch.Tensor, ...]:
+         # Multi-head self attention
+         residual = hidden_states if qkv_inputs is None else qkv_inputs[0]
+         attention_outputs = self.attention(
+             hidden_states,
+             attention_bias,
+             rope_embeds,
+             padding_inputs,
+             attention_scale,
+             head_mask,
+             output_attentions=output_attentions,
+             qkv_inputs=qkv_inputs,
+         )
+         hidden_states = attention_outputs[0]
+         if self.hidden_dropout is not None:
+             hidden_states = self.hidden_dropout(hidden_states)
+         hidden_states = residual + hidden_states
+ 
+         # In pretraining, after the attention of the last layer, we only need the masked tokens.
+         if subset_indices is not None:
+             hidden_states = hidden_states[subset_indices]
+ 
+         hidden_states = self.attn_ln(hidden_states)
+ 
+         # Fully Connected
+         residual = hidden_states
+         hidden_states = self.mlp(hidden_states)
+         if self.hidden_dropout is not None:
+             hidden_states = self.hidden_dropout(hidden_states)
+         hidden_states = residual + hidden_states
+         hidden_states = self.mlp_ln(hidden_states)
+ 
+         # add self attentions if we output attention weights
+         outputs = (hidden_states,) + attention_outputs[1:]
+         return outputs
+ 
+ 
703
+ class NewEncoder(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.layer = nn.ModuleList([NewLayer(config) for _ in range(config.num_hidden_layers)])
+         self.gradient_checkpointing = False
+ 
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_bias: Optional[torch.FloatTensor] = None,
+         rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
+         padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
+         attention_scale: Optional[torch.FloatTensor] = None,
+         subset_indices: Optional[torch.LongTensor] = None,
+         head_mask: Optional[torch.FloatTensor] = None,
+         output_attentions: Optional[bool] = False,
+         output_hidden_states: Optional[bool] = False,
+         return_dict: Optional[bool] = True,
+     ) -> Union[Tuple[torch.Tensor], BaseModelOutput]:
+         all_hidden_states = () if output_hidden_states else None
+         all_self_attentions = () if output_attentions else None
+ 
+         for i, layer_module in enumerate(self.layer):
+             if output_hidden_states:
+                 all_hidden_states = all_hidden_states + (hidden_states,)
+ 
+             if i >= len(self.layer) - 1:
+                 layer_subset_indices = subset_indices
+             else:
+                 layer_subset_indices = None
+ 
+             layer_head_mask = head_mask[i] if head_mask is not None else None
+ 
+             if self.gradient_checkpointing and self.training:
+                 layer_outputs = self._gradient_checkpointing_func(
+                     layer_module.__call__,
+                     hidden_states,
+                     attention_bias,
+                     rope_embeds,
+                     padding_inputs,
+                     attention_scale,
+                     layer_subset_indices,
+                     layer_head_mask,
+                 )
+             else:
+                 layer_outputs = layer_module(
+                     hidden_states,
+                     attention_bias,
+                     rope_embeds,
+                     padding_inputs,
+                     attention_scale,
+                     layer_subset_indices,
+                     layer_head_mask,
+                     output_attentions,
+                 )
+ 
+             hidden_states = layer_outputs[0]
+             if output_attentions:
+                 all_self_attentions = all_self_attentions + (layer_outputs[1],)
+ 
+         if output_hidden_states:
+             all_hidden_states = all_hidden_states + (hidden_states,)
+ 
+         if not return_dict:
+             return tuple(
+                 v
+                 for v in [
+                     hidden_states,
+                     all_hidden_states,
+                     all_self_attentions,
+                 ]
+                 if v is not None
+             )
+         return BaseModelOutput(
+             last_hidden_state=hidden_states,
+             hidden_states=all_hidden_states,
+             attentions=all_self_attentions,
+         )
+ 
+ 
784
+ # Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->New
+ class NewPooler(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+         self.activation = nn.Tanh()
+ 
+     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+         # We "pool" the model by simply taking the hidden state corresponding
+         # to the first token.
+         first_token_tensor = hidden_states[:, 0]
+         pooled_output = self.dense(first_token_tensor)
+         pooled_output = self.activation(pooled_output)
+         return pooled_output
+ 
+ 
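The pooler above is plain CLS pooling: take the hidden state at position 0, then apply a dense layer and tanh. A minimal standalone sketch (sizes are illustrative):

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden = torch.randn(4, 12, 16)  # (batch, seq_len, hidden_size)
dense = nn.Linear(16, 16)

# Select the first-token representation per sequence, then dense + tanh.
pooled = torch.tanh(dense(hidden[:, 0]))

shape_ok = pooled.shape == (4, 16)
bounded = pooled.abs().max().item() <= 1.0  # tanh keeps outputs in [-1, 1]
```

Because only position 0 is read, the quality of this output depends on the model having been trained to concentrate sequence-level information in the first token.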
800
+ class NewPreTrainedModel(PreTrainedModel):
+     """
+     An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+     models.
+     """
+ 
+     config_class = NewConfig
+     base_model_prefix = "new"
+     supports_gradient_checkpointing = True
+     _supports_sdpa = True
+ 
+     def _init_weights(self, module):
+         """Initialize the weights"""
+         if isinstance(module, nn.Linear):
+             # Slightly different from the TF version which uses truncated_normal for initialization
+             # cf https://github.com/pytorch/pytorch/pull/5617
+             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+             if module.padding_idx is not None:
+                 module.weight.data[module.padding_idx].zero_()
+         elif isinstance(module, nn.LayerNorm):
+             module.bias.data.zero_()
+             module.weight.data.fill_(1.0)
+ 
+ 
828
+ class NewModel(NewPreTrainedModel):
+     """
+     The bare New Model transformer outputting raw hidden-states without any specific head on top.
+     """
+ 
+     def __init__(self, config: NewConfig, add_pooling_layer=False):
+         super().__init__(config)
+         self.config = config
+ 
+         self.embeddings = NewEmbeddings(config)
+         self.encoder = NewEncoder(config)
+ 
+         self.pooler = NewPooler(config) if add_pooling_layer else None
+ 
+         # Initialize weights and apply final processing
+         self.post_init()
+ 
+     def get_input_embeddings(self):
+         return self.embeddings.word_embeddings
+ 
+     def set_input_embeddings(self, value):
+         self.embeddings.word_embeddings = value
+ 
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         length: Optional[List[int]] = None,
+         subset_indices: Optional[torch.LongTensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
+         r"""
+         length (`list` of length `batch_size`, *optional*):
+             If `None`, return the padded `last_hidden_state`.
+         subset_indices ():
+             pass
+         unpad_inputs (`bool`, *optional*):
+             pass
+         """
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+         unpad_inputs = unpad_inputs if unpad_inputs is not None else self.config.unpad_inputs
+         output_padded = length is None
+ 
+         if input_ids is not None and inputs_embeds is not None:
+             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
+         elif input_ids is not None:
+             self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
+             input_shape = input_ids.size()
+         elif inputs_embeds is not None:
+             input_shape = inputs_embeds.size()[:-1]
+         else:
+             raise ValueError("You have to specify either input_ids or inputs_embeds")
+ 
+         # TODO: not used
+         # # Prepare head mask if needed
+         # # 1.0 in head_mask indicates we keep the head
+         # # attention_probs has shape bsz x n_heads x N x N
+         # # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
+         # # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
+         # head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
+ 
+         # Get embeddings, may unpad them
+         (embedding_output, attention_mask, rope_embeds, length) = self.embeddings(
+             unpad_inputs,
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             length=length,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             inputs_embeds=inputs_embeds
+         )
+ 
+         batch_size, seq_length = input_shape
+         if unpad_inputs and self.config.use_memory_efficient_attention:
+             attention_bias = xops.fmha.attn_bias.BlockDiagonalMask.from_seqlens(length)
+         else:
+             # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
+             # ourselves in which case we just need to make it broadcastable to all heads.
+             attention_bias = self.get_extended_attention_mask(attention_mask, input_shape)
+             if self.config.use_memory_efficient_attention:
+                 # Invalid shape for attention bias: torch.Size([48, 1, 1, 512]) (expected (48, 12, 512, 512))
+                 attention_bias = attention_bias.expand(-1, self.config.num_attention_heads, seq_length, -1)
+ 
+         padding_inputs = None
+         if unpad_inputs and (output_padded or not self.config.use_memory_efficient_attention):
+             indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+             if not self.config.use_memory_efficient_attention:
+                 padding_inputs = (indices, *input_shape)
+ 
+         attention_scale = None
+         if self.config.logn_attention_scale:
+             logger.warning_once("TODO: logn_attention_scale")
+             # # attention scale log_512(input_len)
+             # attention_scale = attention_mask.sum(1).log() / torch.tensor(self.config.max_position_embeddings).log()
+             # # inference-time logn scale needs clip 1
+             # if self.config.logn_attention_clip1:
+             #     attention_scale.clip_(1)
+             # attention_scale = attention_scale[:, None, None, None]
+         # else:
+         #     attention_scale = None
+ 
+         encoder_outputs = self.encoder(
+             embedding_output,
+             attention_bias=attention_bias,
+             rope_embeds=rope_embeds,
+             padding_inputs=padding_inputs,
+             attention_scale=attention_scale,
+             subset_indices=subset_indices,
+             head_mask=head_mask,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+         )
+         sequence_output = encoder_outputs[0]
+         if unpad_inputs and output_padded:
+             sequence_output = pad_input(
+                 sequence_output.squeeze(), indices, batch_size, seq_length
+             )
+ 
+         pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
+ 
+         if not return_dict:
+             return (sequence_output, pooled_output) + encoder_outputs[1:]
+ 
+         return BaseModelOutputWithPooling(
+             last_hidden_state=sequence_output,
+             pooler_output=pooled_output,
+             hidden_states=encoder_outputs.hidden_states,
+             attentions=encoder_outputs.attentions,
+         )
+ 
+ 
971
+ class NewLMPredictionHead(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+         self.transform_act_fn = ACT2FN[config.hidden_act]
+         self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+ 
+         # The output weights are the same as the input embeddings, but there is
+         # an output-only bias for each token.
+         self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
+ 
+     def forward(self, hidden_states):
+         hidden_states = self.dense(hidden_states)
+         hidden_states = self.transform_act_fn(hidden_states)
+         hidden_states = self.norm(hidden_states)
+         hidden_states = self.decoder(hidden_states)
+         return hidden_states
+ 
+ 
990
+ class NewForMaskedLM(NewPreTrainedModel):
+     _tied_weights_keys = ["lm_head.decoder.bias", "lm_head.decoder.weight"]
+ 
+     def __init__(self, config: NewConfig):
+         super().__init__(config)
+         self.new = NewModel(config, add_pooling_layer=False)
+         self.lm_head = NewLMPredictionHead(config)
+         self.loss_fct = nn.CrossEntropyLoss()
+ 
+         # Initialize weights and apply final processing
+         self.post_init()
+ 
+     def get_output_embeddings(self):
+         return self.lm_head.decoder
+ 
+     def set_output_embeddings(self, new_embeddings):
+         self.lm_head.decoder = new_embeddings
+ 
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         labels: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
+         r"""
+         labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+             Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
+             config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked);
+             the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+         """
+ 
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+ 
+         if labels is None or not self.new.config.unpad_inputs:
+             length = None
+             subset_indices = None
+         else:
+             length = attention_mask.sum(-1).tolist()
+             labels = labels[attention_mask.bool()].unsqueeze(0)
+             subset_indices = labels > -100
+ 
+         outputs = self.new(
+             input_ids,
+             attention_mask=attention_mask,
+             length=length,
+             subset_indices=subset_indices,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             head_mask=head_mask,
+             inputs_embeds=inputs_embeds,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             unpad_inputs=unpad_inputs,
+         )
+ 
+         sequence_output = outputs[0]
+         prediction_scores = self.lm_head(sequence_output)
+ 
+         masked_lm_loss = None
+         if labels is not None:
+             if subset_indices is None:
+                 mask = attention_mask.bool()
+                 prediction_scores = prediction_scores[mask]
+                 labels = labels[mask]
+             else:
+                 labels = labels[subset_indices]
+             masked_lm_loss = self.loss_fct(prediction_scores, labels)
+ 
+         if not return_dict:
+             output = (prediction_scores,) + outputs[2:]
+             return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
+ 
+         return MaskedLMOutput(
+             loss=masked_lm_loss,
+             logits=prediction_scores,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+ 
+ 
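The `-100` convention in the labels docstring above leans on `nn.CrossEntropyLoss`, whose `ignore_index` defaults to `-100`: non-masked positions are skipped and the loss is averaged only over tokens that actually carry a label. A small illustration:

```python
import torch
from torch import nn

loss_fct = nn.CrossEntropyLoss()  # ignore_index defaults to -100

torch.manual_seed(0)
vocab_size = 10
logits = torch.randn(6, vocab_size)
labels = torch.tensor([-100, 3, -100, 7, -100, -100])

# Loss over the full sequence; -100 positions contribute nothing.
loss_all = loss_fct(logits, labels)

# Equivalent: compute the loss only on the labeled subset.
keep = labels != -100
loss_subset = loss_fct(logits[keep], labels[keep])

losses_match = torch.allclose(loss_all, loss_subset)
```

This equivalence is what lets the unpadded path above gather only the `subset_indices` positions before the loss without changing the result.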
1079
+class NewForSequenceClassification(NewPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.config = config
+
+        self.new = NewModel(config, add_pooling_layer=True)
+        classifier_dropout = (
+            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+        )
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1`, a regression loss is computed (Mean-Square loss).
+            If `config.num_labels > 1`, a classification loss is computed (Cross-Entropy).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+
+        pooled_output = outputs[1]
+
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+
+        loss = None
+        if labels is not None:
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+
+            if self.config.problem_type == "regression":
+                loss_fct = nn.MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = nn.CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = nn.BCEWithLogitsLoss()
+                loss = loss_fct(logits, labels)
+
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return SequenceClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+
+class NewForMultipleChoice(NewPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+
+        self.new = NewModel(config, add_pooling_layer=True)
+        classifier_dropout = (
+            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+        )
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.hidden_size, 1)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
+            num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
+            `input_ids` above)
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
+
+        input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
+        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
+        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
+        position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
+        inputs_embeds = (
+            inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
+            if inputs_embeds is not None
+            else None
+        )
+
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+
+        pooled_output = outputs[1]
+
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+        reshaped_logits = logits.view(-1, num_choices)
+
+        loss = None
+        if labels is not None:
+            loss_fct = nn.CrossEntropyLoss()
+            loss = loss_fct(reshaped_logits, labels)
+
+        if not return_dict:
+            output = (reshaped_logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return MultipleChoiceModelOutput(
+            loss=loss,
+            logits=reshaped_logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+
+@dataclass
+class NewTokenClassifierOutput(ModelOutput):
+    loss: Optional[torch.FloatTensor] = None
+    logits: torch.FloatTensor = None
+    last_hidden_state: torch.FloatTensor = None
+    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
+
+
+class NewForTokenClassification(NewPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+
+        self.new = NewModel(config, add_pooling_layer=False)
+        classifier_dropout = (
+            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+        )
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], NewTokenClassifierOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+
+        sequence_output = outputs[0]
+
+        sequence_output = self.dropout(sequence_output)
+        logits = self.classifier(sequence_output)
+
+        loss = None
+        if labels is not None:
+            loss_fct = nn.CrossEntropyLoss()
+            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return NewTokenClassifierOutput(
+            loss=loss,
+            logits=logits,
+            last_hidden_state=sequence_output,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+
+class NewForQuestionAnswering(NewPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+
+        self.new = NewModel(config, add_pooling_layer=False)
+        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        start_positions: Optional[torch.Tensor] = None,
+        end_positions: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
+        r"""
+        start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for the position (index) of the start of the labelled span, used for computing the token
+            classification loss. Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions outside of the sequence are not taken into account for computing the loss.
+        end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for the position (index) of the end of the labelled span, used for computing the token
+            classification loss. Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions outside of the sequence are not taken into account for computing the loss.
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+
+        sequence_output = outputs[0]
+
+        logits = self.qa_outputs(sequence_output)
+        start_logits, end_logits = logits.split(1, dim=-1)
+        start_logits = start_logits.squeeze(-1).contiguous()
+        end_logits = end_logits.squeeze(-1).contiguous()
+
+        total_loss = None
+        if start_positions is not None and end_positions is not None:
+            # If we are on multi-GPU, splitting adds a dimension
+            if len(start_positions.size()) > 1:
+                start_positions = start_positions.squeeze(-1)
+            if len(end_positions.size()) > 1:
+                end_positions = end_positions.squeeze(-1)
+            # Sometimes the start/end positions are outside our model inputs; we ignore these terms
+            ignored_index = start_logits.size(1)
+            start_positions = start_positions.clamp(0, ignored_index)
+            end_positions = end_positions.clamp(0, ignored_index)
+
+            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
+            start_loss = loss_fct(start_logits, start_positions)
+            end_loss = loss_fct(end_logits, end_positions)
+            total_loss = (start_loss + end_loss) / 2
+
+        if not return_dict:
+            output = (start_logits, end_logits) + outputs[2:]
+            return ((total_loss,) + output) if total_loss is not None else output
+
+        return QuestionAnsweringModelOutput(
+            loss=total_loss,
+            start_logits=start_logits,
+            end_logits=end_logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
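The `NewForSequenceClassification` head above infers `config.problem_type` from `num_labels` and the label dtype before choosing MSE, cross-entropy, or BCE-with-logits. That dispatch can be summarized as a small sketch; the helper name `infer_problem_type` is illustrative, not part of the model code:

```python
# Minimal sketch of the problem_type dispatch in NewForSequenceClassification.forward
# when config.problem_type is unset. "labels_are_int" stands in for the
# labels.dtype == torch.long / torch.int check in the real code.
def infer_problem_type(num_labels: int, labels_are_int: bool) -> str:
    if num_labels == 1:
        return "regression"  # -> nn.MSELoss
    if labels_are_int:
        return "single_label_classification"  # -> nn.CrossEntropyLoss
    return "multi_label_classification"  # -> nn.BCEWithLogitsLoss

print(infer_problem_type(1, True))   # regression
print(infer_problem_type(3, True))   # single_label_classification
print(infer_problem_type(3, False))  # multi_label_classification
```

Once set, the real head caches the result on `self.config.problem_type`, so the inference only happens on the first labelled batch.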
modules.json ADDED
@@ -0,0 +1,20 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  },
+  {
+    "idx": 2,
+    "name": "2",
+    "path": "2_Normalize",
+    "type": "sentence_transformers.models.Normalize"
+  }
+]
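This `modules.json` declares the SentenceTransformer pipeline for the repo: Transformer, then Pooling (configured for CLS pooling in `1_Pooling/config.json`), then Normalize. A plain-Python sketch of what the last two stages do to the transformer's token embeddings (toy 4-dimensional vectors; the real `word_embedding_dimension` is 768):

```python
import math

# Illustrative sketch of the Pooling (CLS mode) and Normalize modules
# from this repo's modules.json; not the sentence-transformers implementation.
def cls_pool(token_embeddings):
    # CLS pooling: the first token's vector becomes the sentence embedding.
    return token_embeddings[0]

def l2_normalize(vec):
    # Scale to unit L2 norm, as the Normalize module does.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Toy "hidden states" for a 3-token sequence.
hidden = [[3.0, 0.0, 4.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, 0.0]]
embedding = l2_normalize(cls_pool(hidden))
print(embedding)  # [0.6, 0.0, 0.8, 0.0]
```

Because the final module normalizes, dot product and cosine similarity coincide for embeddings produced by this pipeline.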
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": 8192,
+  "do_lower_case": false
+}
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:aa7a6ad87a7ce8fe196787355f6af7d03aee94d19c54a5eb1392ed18c8ef451a
+size 17082988
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "250001": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "model_max_length": 8192,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "tokenizer_class": "XLMRobertaTokenizerFast",
+  "unk_token": "<unk>"
+}
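The `added_tokens_decoder` in this `tokenizer_config.json` pins the XLM-RoBERTa special tokens to fixed ids (`<s>`=0, `<pad>`=1, `</s>`=2, `<unk>`=3, `<mask>`=250001). A small sketch of that id-to-token mapping; the `decode_specials` helper is illustrative, not a tokenizer API:

```python
# The special-token id mapping declared by added_tokens_decoder above.
SPECIAL_TOKENS = {0: "<s>", 1: "<pad>", 2: "</s>", 3: "<unk>", 250001: "<mask>"}

def decode_specials(ids):
    # Replace known special-token ids with their string form;
    # other ids would be resolved by the real tokenizer's vocabulary.
    return [SPECIAL_TOKENS.get(i, f"id:{i}") for i in ids]

print(decode_specials([0, 42, 250001, 2]))  # ['<s>', 'id:42', '<mask>', '</s>']
```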