tuandunghcmut committed (verified)
Commit e9cd0c7 · Parent(s): 0d2c90e

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. Ovis/docs/license/GEMMA_LICENSE.txt +77 -0
  2. Ovis/docs/license/LLAMA3_LICENSE +84 -0
  3. Ovis/ovis/__pycache__/__init__.cpython-310.pyc +0 -0
  4. Ovis/ovis/__pycache__/__init__.cpython-311.pyc +0 -0
  5. Ovis/ovis/model/__pycache__/__init__.cpython-310.pyc +0 -0
  6. Ovis/ovis/model/__pycache__/__init__.cpython-311.pyc +0 -0
  7. Ovis/ovis/model/__pycache__/configuration_ovis.cpython-311.pyc +0 -0
  8. Ovis/ovis/model/__pycache__/modeling_ovis.cpython-311.pyc +0 -0
  9. Ovis/ovis/model/configuration_ovis.py +41 -0
  10. Ovis/ovis/model/conversation_formatter.py +233 -0
  11. Ovis/ovis/model/visual_tokenizer/__pycache__/base_visual_tokenizer.cpython-310.pyc +0 -0
  12. Ovis/ovis/model/visual_tokenizer/__pycache__/base_visual_tokenizer.cpython-311.pyc +0 -0
  13. Ovis/ovis/model/visual_tokenizer/__pycache__/clip_visual_tokenizer.cpython-310.pyc +0 -0
  14. Ovis/ovis/model/visual_tokenizer/__pycache__/clip_visual_tokenizer.cpython-311.pyc +0 -0
  15. Ovis/ovis/model/visual_tokenizer/__pycache__/siglip_visual_tokenizer.cpython-310.pyc +0 -0
  16. Ovis/ovis/model/visual_tokenizer/__pycache__/siglip_visual_tokenizer.cpython-311.pyc +0 -0
  17. Ovis/ovis/serve/runner.py +105 -0
  18. Ovis/ovis/serve/server.py +41 -0
  19. Ovis/ovis/train/__init__.py +0 -0
  20. Ovis/ovis/train/arguments.py +48 -0
  21. Ovis/ovis/train/callback.py +37 -0
  22. Ovis/ovis/train/train.py +206 -0
  23. Ovis/ovis/util/constants.py +11 -0
  24. Ovis/ovis/util/utils.py +26 -0
  25. llm2vec/docs/.gitignore +9 -0
  26. llm2vec/docs/Gemfile +18 -0
  27. llm2vec/docs/README.md +104 -0
  28. llm2vec/docs/_config.yml +110 -0
  29. llm2vec/docs/_data/navigation.yml +17 -0
  30. llm2vec/docs/_includes/head/custom.html +48 -0
  31. llm2vec/docs/_sass/custom/header-footer.scss +19 -0
  32. llm2vec/docs/_sass/custom/no-sidebar.scss +9 -0
  33. llm2vec/docs/_sass/custom/splash.scss +5 -0
  34. llm2vec/docs/_sass/skins/dark.scss +30 -0
  35. llm2vec/docs/_sass/skins/light.scss +12 -0
  36. llm2vec/docs/assets/images/logo/favicon.png +0 -0
  37. llm2vec/docs/assets/images/logo/logo.png +0 -0
  38. llm2vec/docs/assets/images/logo/logo.svg +0 -0
  39. llm2vec/examples/classification.py +62 -0
  40. llm2vec/examples/clustering.py +58 -0
  41. llm2vec/examples/retrieval.py +177 -0
  42. llm2vec/examples/sts.py +57 -0
  43. llm2vec/experiments/mteb_eval.py +31 -0
  44. llm2vec/experiments/mteb_eval_custom.py +98 -0
  45. llm2vec/experiments/run_mntp.py +997 -0
  46. llm2vec/experiments/run_simcse.py +388 -0
  47. llm2vec/experiments/run_supervised.py +482 -0
  48. llm2vec/experiments/run_word_task.py +905 -0
  49. llm2vec/experiments/test_word_task.py +393 -0
  50. llm2vec/images/sample_efficient.png +0 -0
Ovis/docs/license/GEMMA_LICENSE.txt ADDED
@@ -0,0 +1,77 @@
1
+ Gemma Terms of Use
2
+
3
+ Last modified: April 1, 2024
4
+
5
+ By using, reproducing, modifying, distributing, performing or displaying any portion or element of Gemma, Model Derivatives including via any Hosted Service, (each as defined below) (collectively, the "Gemma Services") or otherwise accepting the terms of this Agreement, you agree to be bound by this Agreement.
6
+
7
+ Section 1: DEFINITIONS
8
+ 1.1 Definitions
9
+ (a) "Agreement" or "Gemma Terms of Use" means these terms and conditions that govern the use, reproduction, Distribution or modification of the Gemma Services and any terms and conditions incorporated by reference.
10
+
11
+ (b) "Distribution" or "Distribute" means any transmission, publication, or other sharing of Gemma or Model Derivatives to a third party, including by providing or making Gemma or its functionality available as a hosted service via API, web access, or any other electronic or remote means ("Hosted Service").
12
+
13
+ (c) "Gemma" means the set of machine learning language models, trained model weights and parameters identified at ai.google.dev/gemma, regardless of the source that you obtained it from.
14
+
15
+ (d) "Google" means Google LLC.
16
+
17
+ (e) "Model Derivatives" means all (i) modifications to Gemma, (ii) works based on Gemma, or (iii) any other machine learning model which is created by transfer of patterns of the weights, parameters, operations, or Output of Gemma, to that model in order to cause that model to perform similarly to Gemma, including distillation methods that use intermediate data representations or methods based on the generation of synthetic data Outputs by Gemma for training that model. For clarity, Outputs are not deemed Model Derivatives.
18
+
19
+ (f) "Output" means the information content output of Gemma or a Model Derivative that results from operating or otherwise using Gemma or the Model Derivative, including via a Hosted Service.
20
+
21
+ 1.2
22
+ As used in this Agreement, "including" means "including without limitation".
23
+
24
+ Section 2: ELIGIBILITY AND USAGE
25
+ 2.1 Eligibility
26
+ You represent and warrant that you have the legal capacity to enter into this Agreement (including being of sufficient age of consent). If you are accessing or using any of the Gemma Services for or on behalf of a legal entity, (a) you are entering into this Agreement on behalf of yourself and that legal entity, (b) you represent and warrant that you have the authority to act on behalf of and bind that entity to this Agreement and (c) references to "you" or "your" in the remainder of this Agreement refers to both you (as an individual) and that entity.
27
+
28
+ 2.2 Use
29
+ You may use, reproduce, modify, Distribute, perform or display any of the Gemma Services only in accordance with the terms of this Agreement, and must not violate (or encourage or permit anyone else to violate) any term of this Agreement.
30
+
31
+ Section 3: DISTRIBUTION AND RESTRICTIONS
32
+ 3.1 Distribution and Redistribution
33
+ You may reproduce or Distribute copies of Gemma or Model Derivatives if you meet all of the following conditions:
34
+
35
+ You must include the use restrictions referenced in Section 3.2 as an enforceable provision in any agreement (e.g., license agreement, terms of use, etc.) governing the use and/or distribution of Gemma or Model Derivatives and you must provide notice to subsequent users you Distribute to that Gemma or Model Derivatives are subject to the use restrictions in Section 3.2.
36
+ You must provide all third party recipients of Gemma or Model Derivatives a copy of this Agreement.
37
+ You must cause any modified files to carry prominent notices stating that you modified the files.
38
+ All Distributions (other than through a Hosted Service) must be accompanied by a "Notice" text file that contains the following notice: "Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms".
39
+ You may add your own intellectual property statement to your modifications and, except as set forth in this Section, may provide additional or different terms and conditions for use, reproduction, or Distribution of your modifications, or for any such Model Derivatives as a whole, provided your use, reproduction, modification, Distribution, performance, and display of Gemma otherwise complies with the terms and conditions of this Agreement. Any additional or different terms and conditions you impose must not conflict with the terms of this Agreement.
40
+
41
+ 3.2 Use Restrictions
42
+ You must not use any of the Gemma Services:
43
+
44
+ for the restricted uses set forth in the Gemma Prohibited Use Policy at ai.google.dev/gemma/prohibited_use_policy ("Prohibited Use Policy"), which is hereby incorporated by reference into this Agreement; or
45
+ in violation of applicable laws and regulations.
46
+ To the maximum extent permitted by law, Google reserves the right to restrict (remotely or otherwise) usage of any of the Gemma Services that Google reasonably believes are in violation of this Agreement.
47
+
48
+ 3.3 Generated Output
49
+ Google claims no rights in Outputs you generate using Gemma. You and your users are solely responsible for Outputs and their subsequent uses.
50
+
51
+ Section 4: ADDITIONAL PROVISIONS
52
+ 4.1 Updates
53
+ Google may update Gemma from time to time.
54
+
55
+ 4.2 Trademarks
56
+ Nothing in this Agreement grants you any rights to use Google's trademarks, trade names, logos or to otherwise suggest endorsement or misrepresent the relationship between you and Google. Google reserves any rights not expressly granted herein.
57
+
58
+ 4.3 DISCLAIMER OF WARRANTY
59
+ UNLESS REQUIRED BY APPLICABLE LAW, THE GEMMA SERVICES, AND OUTPUTS, ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING, REPRODUCING, MODIFYING, PERFORMING, DISPLAYING OR DISTRIBUTING ANY OF THE GEMMA SERVICES OR OUTPUTS AND ASSUME ANY AND ALL RISKS ASSOCIATED WITH YOUR USE OR DISTRIBUTION OF ANY OF THE GEMMA SERVICES OR OUTPUTS AND YOUR EXERCISE OF RIGHTS AND PERMISSIONS UNDER THIS AGREEMENT.
60
+
61
+ 4.4 LIMITATION OF LIABILITY
62
+ TO THE FULLEST EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), PRODUCT LIABILITY, CONTRACT, OR OTHERWISE, UNLESS REQUIRED BY APPLICABLE LAW, SHALL GOOGLE OR ITS AFFILIATES BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, CONSEQUENTIAL, OR PUNITIVE DAMAGES, OR LOST PROFITS OF ANY KIND ARISING FROM THIS AGREEMENT OR RELATED TO, ANY OF THE GEMMA SERVICES OR OUTPUTS EVEN IF GOOGLE OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
63
+
64
+ 4.5 Term, Termination, and Survival
65
+ The term of this Agreement will commence upon your acceptance of this Agreement (including acceptance by your use, modification, or Distribution, reproduction, performance or display of any portion or element of the Gemma Services) and will continue in full force and effect until terminated in accordance with the terms of this Agreement. Google may terminate this Agreement if you are in breach of any term of this Agreement. Upon termination of this Agreement, you must delete and cease use and Distribution of all copies of Gemma and Model Derivatives in your possession or control. Sections 1, 2.1, 3.3, 4.2 to 4.9 shall survive the termination of this Agreement.
66
+
67
+ 4.6 Governing Law and Jurisdiction
68
+ This Agreement will be governed by the laws of the State of California without regard to choice of law principles. The UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. The state and federal courts of Santa Clara County, California shall have exclusive jurisdiction of any dispute arising out of this Agreement.
69
+
70
+ 4.7 Severability
71
+ If any provision of this Agreement is held to be invalid, illegal or unenforceable, the remaining provisions shall be unaffected thereby and remain valid as if such provision had not been set forth herein.
72
+
73
+ 4.8 Entire Agreement
74
+ This Agreement states all the terms agreed between the parties and supersedes all other agreements between the parties as of the date of acceptance relating to its subject matter.
75
+
76
+ 4.9 No Waiver
77
+ Google will not be treated as having waived any rights by not exercising (or delaying the exercise of) any rights under this Agreement.
Ovis/docs/license/LLAMA3_LICENSE ADDED
@@ -0,0 +1,84 @@
1
+ META LLAMA 3 COMMUNITY LICENSE AGREEMENT
2
+
3
+ Meta Llama 3 Version Release Date: April 18, 2024
4
+ “Agreement” means the terms and conditions for use, reproduction, distribution and modification of the Llama Materials set forth herein.
5
+
6
+ “Documentation” means the specifications, manuals and documentation accompanying Meta Llama 3 distributed by Meta at https://llama.meta.com/get-started/.
7
+
8
+ “Licensee” or “you” means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity’s behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.
9
+
10
+ “Meta Llama 3” means the foundational large language models and software and algorithms, including machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Meta at https://llama.meta.com/llama-downloads.
11
+
12
+ “Llama Materials” means, collectively, Meta’s proprietary Meta Llama 3 and Documentation (and any portion thereof) made available under this Agreement.
13
+
14
+ “Meta” or “we” means Meta Platforms Ireland Limited (if you are located in or, if you are an entity, your principal place of business is in the EEA or Switzerland) and Meta Platforms, Inc. (if you are located outside of the EEA or Switzerland).
15
+
16
+ By clicking “I Accept” below or by using or distributing any portion or element of the Llama Materials, you agree to be bound by this Agreement.
17
+
18
+ 1. License Rights and Redistribution.
19
+
20
+ a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta’s intellectual property or other rights owned by Meta embodied in the Llama Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Llama Materials.
21
+ b. Redistribution and Use.
22
+ i. If you distribute or make available the Llama Materials (or any derivative works thereof), or a product or service that uses any of them, including another AI model, you shall (A) provide a copy of this Agreement with any such Llama Materials; and (B) prominently display “Built with Meta Llama 3” on a related website, user interface, blogpost, about page, or product documentation. If you use the Llama Materials to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama 3” at the beginning of any such AI model name.
23
+ ii. If you receive Llama Materials, or any derivative works thereof, from a Licensee as part of an integrated end user product, then Section 2 of this Agreement will not apply to you.
24
+ iii. You must retain in all copies of the Llama Materials that you distribute the following attribution notice within a “Notice” text file distributed as a part of such copies: “Meta Llama 3 is licensed under the Meta Llama 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.”
25
+ iv. Your use of the Llama Materials must comply with applicable laws and regulations (including trade compliance laws and regulations) and adhere to the Acceptable Use Policy for the Llama Materials (available at https://llama.meta.com/llama3/use-policy), which is hereby incorporated by reference into this Agreement.
26
+ v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Meta Llama 3 or derivative works thereof).
27
+
28
+ 2. Additional Commercial Terms. If, on the Meta Llama 3 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.
29
+
30
+ 3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, AND META DISCLAIMS ALL WARRANTIES OF ANY KIND, BOTH EXPRESS AND IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.
31
+
32
+ 4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
33
+
34
+ 5. Intellectual Property.
35
+ a. No trademark licenses are granted under this Agreement, and in connection with the Llama Materials, neither Meta nor Licensee may use any name or mark owned by or associated with the other or any of its affiliates, except as required for reasonable and customary use in describing and redistributing the Llama Materials or as set forth in this Section 5(a). Meta hereby grants you a license to use “Llama 3” (the “Mark”) solely as required to comply with the last sentence of Section 1.b.i. You will comply with Meta’s brand guidelines (currently accessible at https://about.meta.com/brand/resources/meta/company-brand/ ). All goodwill arising out of your use of the Mark will inure to the benefit of Meta.
36
+ b. Subject to Meta’s ownership of Llama Materials and derivatives made by or for Meta, with respect to any derivative works and modifications of the Llama Materials that are made by you, as between you and Meta, you are and will be the owner of such derivative works and modifications.
37
+ c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Meta Llama 3 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials.
38
+
39
+ 6. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the Llama Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Meta may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the termination of this Agreement.
40
+
41
+ 7. Governing Law and Jurisdiction. This Agreement will be governed and construed under the laws of the State of California without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. The courts of California shall have exclusive jurisdiction of any dispute arising out of this Agreement.
42
+
43
+
44
+ Meta Llama 3 Acceptable Use Policy
45
+ Meta is committed to promoting safe and fair use of its tools and features, including Meta Llama 3. If you access or use Meta Llama 3, you agree to this Acceptable Use Policy (“Policy”). The most recent copy of this policy can be found at https://llama.meta.com/llama3/use-policy
46
+ Prohibited Uses
47
+ We want everyone to use Meta Llama 3 safely and responsibly. You agree you will not use, or allow others to use, Meta Llama 3 to:
48
+ 1. Violate the law or others’ rights, including to:
49
+ a. Engage in, promote, generate, contribute to, encourage, plan, incite, or further illegal or unlawful activity or content, such as:
50
+ i. Violence or terrorism
51
+ ii. Exploitation or harm to children, including the solicitation, creation, acquisition, or dissemination of child exploitative content or failure to report Child Sexual Abuse Material
52
+ iii. Human trafficking, exploitation, and sexual violence
53
+ iv. The illegal distribution of information or materials to minors, including obscene materials, or failure to employ legally required age-gating in connection with such information or materials.
54
+ v. Sexual solicitation
55
+ vi. Any other criminal activity
56
+ b. Engage in, promote, incite, or facilitate the harassment, abuse, threatening, or bullying of individuals or groups of individuals
57
+ c. Engage in, promote, incite, or facilitate discrimination or other unlawful or harmful conduct in the provision of employment, employment benefits, credit, housing, other economic benefits, or other essential goods and services
58
+ d. Engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or related professional practices
59
+ e. Collect, process, disclose, generate, or infer health, demographic, or other sensitive personal or private information about individuals without rights and consents required by applicable laws
60
+ f. Engage in or facilitate any action or generate any content that infringes, misappropriates, or otherwise violates any third-party rights, including the outputs or results of any products or services using the Llama Materials
61
+ g. Create, generate, or facilitate the creation of malicious code, malware, computer viruses or do anything else that could disable, overburden, interfere with or impair the proper working, integrity, operation or appearance of a website or computer system
62
+
63
+ 2. Engage in, promote, incite, facilitate, or assist in the planning or development of activities that present a risk of death or bodily harm to individuals, including use of Meta Llama 3 related to the following:
64
+ a. Military, warfare, nuclear industries or applications, espionage, use for materials or activities that are subject to the International Traffic Arms Regulations (ITAR) maintained by the United States Department of State
65
+ b. Guns and illegal weapons (including weapon development)
66
+ c. Illegal drugs and regulated/controlled substances
67
+ d. Operation of critical infrastructure, transportation technologies, or heavy machinery
68
+ e. Self-harm or harm to others, including suicide, cutting, and eating disorders
69
+ f. Any content intended to incite or promote violence, abuse, or any infliction of bodily harm to an individual
70
+
71
+ 3. Intentionally deceive or mislead others, including use of Meta Llama 3 related to the following:
72
+ a. Generating, promoting, or furthering fraud or the creation or promotion of disinformation
73
+ b. Generating, promoting, or furthering defamatory content, including the creation of defamatory statements, images, or other content
74
+ c. Generating, promoting, or further distributing spam
75
+ d. Impersonating another individual without consent, authorization, or legal right
76
+ e. Representing that the use of Meta Llama 3 or outputs are human-generated
77
+ f. Generating or facilitating false online engagement, including fake reviews and other means of fake online engagement
78
+ g. Fail to appropriately disclose to end users any known dangers of your AI system
79
+
80
+ Please report any violation of this Policy, software “bug,” or other problems that could lead to a violation of this Policy through one of the following means:
81
+ * Reporting issues with the model: https://github.com/meta-llama/llama3
82
+ * Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
83
+ * Reporting bugs and security concerns: facebook.com/whitehat/info
84
+ * Reporting violations of the Acceptable Use Policy or unlicensed uses of Meta Llama 3: LlamaUseReport@meta.com
Ovis/ovis/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (206 Bytes).
 
Ovis/ovis/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (248 Bytes).
 
Ovis/ovis/model/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (388 Bytes).
 
Ovis/ovis/model/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (446 Bytes).
 
Ovis/ovis/model/__pycache__/configuration_ovis.cpython-311.pyc ADDED
Binary file (2.53 kB).
 
Ovis/ovis/model/__pycache__/modeling_ovis.cpython-311.pyc ADDED
Binary file (29.4 kB).
 
Ovis/ovis/model/configuration_ovis.py ADDED
@@ -0,0 +1,41 @@
+ from typing import Union, Optional
+
+ from transformers import PretrainedConfig, AutoConfig
+
+
+ class OvisConfig(PretrainedConfig):
+     model_type = "ovis"
+
+     def __init__(
+         self,
+         llm_config: Optional[Union[PretrainedConfig, dict]] = None,
+         visual_tokenizer_config: Optional[Union[PretrainedConfig, dict]] = None,
+         multimodal_max_length=8192,
+         hidden_size=None,
+         conversation_formatter_class=None,
+         llm_attn_implementation=None,
+         disable_tie_weight=False,
+         **kwargs
+     ):
+         super().__init__(**kwargs)
+         if llm_config is not None:
+             assert isinstance(llm_config, (PretrainedConfig, dict)), \
+                 f"expect `llm_config` to be instance of PretrainedConfig or dict, but got {type(llm_config)} type"
+             if not isinstance(llm_config, PretrainedConfig):
+                 model_type = llm_config['model_type']
+                 llm_config.pop('model_type')
+                 llm_config = AutoConfig.for_model(model_type, **llm_config)
+         self.llm_config = llm_config
+         if visual_tokenizer_config is not None:
+             assert isinstance(visual_tokenizer_config, (PretrainedConfig, dict)), \
+                 f"expect `visual_tokenizer_config` to be instance of PretrainedConfig or dict, but got {type(visual_tokenizer_config)} type"
+             if not isinstance(visual_tokenizer_config, PretrainedConfig):
+                 model_type = visual_tokenizer_config['model_type']
+                 visual_tokenizer_config.pop('model_type')
+                 visual_tokenizer_config = AutoConfig.for_model(model_type, **visual_tokenizer_config)
+         self.visual_tokenizer_config = visual_tokenizer_config
+         self.multimodal_max_length = multimodal_max_length
+         self.hidden_size = hidden_size
+         self.conversation_formatter_class = conversation_formatter_class
+         self.llm_attn_implementation = llm_attn_implementation
+         self.disable_tie_weight = disable_tie_weight
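
For reference, a minimal sketch of how this configuration class converts a plain dict into a nested `PretrainedConfig` via `AutoConfig.for_model`; the dict values below are illustrative assumptions, not part of this commit:

```python
# Sketch only: assumes the `ovis` package is importable.
from ovis.model.configuration_ovis import OvisConfig

config = OvisConfig(
    # a dict with a `model_type` key is resolved to the matching config class
    llm_config={"model_type": "llama", "num_hidden_layers": 2},  # illustrative values
    multimodal_max_length=8192,
    conversation_formatter_class="Llama3ConversationFormatter",
)
print(type(config.llm_config))  # a LlamaConfig instance, no longer a dict
# `visual_tokenizer_config` follows the same dict-to-config path.
```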
Ovis/ovis/model/conversation_formatter.py ADDED
@@ -0,0 +1,233 @@
+ from abc import ABC, abstractmethod
+ from typing import List, Dict
+
+ from ovis.util.constants import IMAGE_TOKEN_ID, IGNORE_ID, IMAGE_TOKEN
+
+
+ class ConversationFormatter(ABC):
+     support_tokenizer_types = None
+
+     def __init__(self, tokenizer):
+         tokenizer_type = type(tokenizer).__name__
+         assert tokenizer_type in self.support_tokenizer_types, \
+             f'Invalid tokenizer type, expected one from `{self.support_tokenizer_types}`, but got `{tokenizer_type}`'
+         self.tokenizer = tokenizer
+         self.image_token = IMAGE_TOKEN
+         self.image_token_id = IMAGE_TOKEN_ID
+         self.ignore_id = IGNORE_ID
+
+     def _tokenize_with_image_symbol(self, text):
+         text_chunks = [self.tokenizer(chunk, add_special_tokens=False).input_ids for chunk in
+                        text.split(self.image_token)]
+         token_ids = []
+         num_chunks = len(text_chunks)
+         for i, chunk in enumerate(text_chunks):
+             token_ids.extend(chunk)
+             if i < num_chunks - 1:
+                 token_ids.append(self.image_token_id)
+         return token_ids
+
+     @abstractmethod
+     def format(self, conversations: List[Dict], generation_preface=None):
+         pass
+
+     @abstractmethod
+     def format_query(self, query, generation_preface=""):
+         pass
+
+
+ class QwenConversationFormatter(ConversationFormatter):
+     support_tokenizer_types = ['QWenTokenizer', 'Qwen2TokenizerFast']
+
+     def __init__(self, tokenizer):
+         super().__init__(tokenizer)
+         self.from2role = {
+             "system": "<|im_start|>system\n",
+             "human": "<|im_start|>user\n",
+             "gpt": "<|im_start|>assistant\n",
+         }
+         self.gpt_token_num = None
+         self.im_end = "<|im_end|>\n"
+         self.default_system_prompt = "You are a helpful assistant."
+
+     def format(self, conversations: List[Dict], generation_preface=None):
+         if self.gpt_token_num is None:
+             self.gpt_token_num = len(self.tokenizer(self.from2role["gpt"], add_special_tokens=False).input_ids)
+
+         if conversations[0]["from"] != "system":
+             conversations.insert(0, {
+                 "from": "system",
+                 "value": self.default_system_prompt
+             })
+
+         if generation_preface is not None:
+             conversations.append({
+                 "from": "gpt",
+                 "value": generation_preface
+             })
+
+         prompt = ""
+         input_ids = []
+         labels = []
+         num_conversation = len(conversations)
+         for i, conversation in enumerate(conversations):
+             frm = conversation["from"]
+             role = self.from2role[frm]
+             message = conversation["value"]
+             text = role + message
+             if i < num_conversation - 1 or generation_preface is None:
+                 text += self.im_end
+             prompt += text
+             token_ids = self._tokenize_with_image_symbol(text)
+             input_ids.extend(token_ids)
+             label_ids = [self.ignore_id] * len(token_ids)
+             if frm == "gpt" and generation_preface is None:
+                 # learning `\n` following `im_end` is meaningless, so the last `\n` token is ignored in label
+                 label_ids[self.gpt_token_num:-1] = token_ids[self.gpt_token_num:-1]
+             labels.extend(label_ids)
+
+         assert self._tokenize_with_image_symbol(prompt) == input_ids
+         assert len(input_ids) == len(labels)
+
+         return prompt, input_ids, labels
+
+     def format_query(self, query, generation_preface=""):
+         prompt, input_ids, _ = self.format([{
+             "from": "human",
+             "value": query
+         }], generation_preface=generation_preface)
+
+         return prompt, input_ids
+
+
+ class Llama3ConversationFormatter(ConversationFormatter):
+     support_tokenizer_types = ['PreTrainedTokenizerFast']
+
+     def __init__(self, tokenizer):
+         super().__init__(tokenizer)
+         self.from2role = {
+             "system": "<|start_header_id|>system<|end_header_id|>\n\n",
+             "human": "<|start_header_id|>user<|end_header_id|>\n\n",
+             "gpt": "<|start_header_id|>assistant<|end_header_id|>\n\n",
+         }
+         self.gpt_token_num = None
+         self.im_end = "<|eot_id|>"
+         self.default_system_prompt = "You are a helpful and honest multimodal assistant."
+         self.bos_token = "<|begin_of_text|>"
+         self.bos_token_ids = None
+
+     def format(self, conversations: List[Dict], generation_preface=None):
+         if self.gpt_token_num is None:
+             self.gpt_token_num = len(self.tokenizer(self.from2role["gpt"], add_special_tokens=False).input_ids)
+
+         if self.bos_token_ids is None:
+             self.bos_token_ids = self.tokenizer(self.bos_token, add_special_tokens=False).input_ids
+
+         if conversations[0]["from"] != "system":
+             conversations.insert(0, {
+                 "from": "system",
+                 "value": self.default_system_prompt
+             })
+
+         if generation_preface is not None:
+             conversations.append({
+                 "from": "gpt",
+                 "value": generation_preface
+             })
+
+         prompt = "" + self.bos_token
+         input_ids = [] + self.bos_token_ids
+         labels = [] + [IGNORE_ID] * len(input_ids)
+         num_conversation = len(conversations)
+         for i, conversation in enumerate(conversations):
+             frm = conversation["from"]
+             role = self.from2role[frm]
+             message = conversation["value"].strip()
+             text = role + message
+             if i < num_conversation - 1 or generation_preface is None:
+                 text += self.im_end
+             prompt += text
+             token_ids = self._tokenize_with_image_symbol(text)
+             input_ids.extend(token_ids)
+             label_ids = [self.ignore_id] * len(token_ids)
+             if frm == "gpt":
+                 label_ids[self.gpt_token_num:] = token_ids[self.gpt_token_num:]
+             labels.extend(label_ids)
+
+         assert self._tokenize_with_image_symbol(prompt) == input_ids
+         assert len(input_ids) == len(labels)
+
+         return prompt, input_ids, labels
+
+     def format_query(self, query, generation_preface=""):
+         prompt, input_ids, _ = self.format([{
+             "from": "human",
+             "value": query
+         }], generation_preface=generation_preface)
+
+         return prompt, input_ids
+
+
+ class GemmaConversationFormatter(ConversationFormatter):
+     support_tokenizer_types = ['GemmaTokenizer', 'GemmaTokenizerFast']
+
+     def __init__(self, tokenizer):
+         super().__init__(tokenizer)
+         # Gemma does not support system prompt
+         self.from2role = {
+             "human": "<start_of_turn>user\n",
+             "gpt": "<start_of_turn>model\n",
+         }
+         self.gpt_token_num = None
+         self.im_end = "<end_of_turn>\n"
+         self.bos_token = "<bos>"
+         self.bos_token_ids = None
+
+     def format(self, conversations: List[Dict], generation_preface=None):
+         if self.gpt_token_num is None:
+             self.gpt_token_num = len(self.tokenizer(self.from2role["gpt"], add_special_tokens=False).input_ids)
+
+         if self.bos_token_ids is None:
+             self.bos_token_ids = self.tokenizer(self.bos_token, add_special_tokens=False).input_ids
+
+         if conversations[0]["from"] == "system":
+             raise ValueError("Gemma does not support system prompt")
+
+         if generation_preface is not None:
+             conversations.append({
+                 "from": "gpt",
+                 "value": generation_preface
+             })
+
+         prompt = "" + self.bos_token
+         input_ids = [] + self.bos_token_ids
+         labels = [] + [IGNORE_ID] * len(input_ids)
+         num_conversation = len(conversations)
+         for i, conversation in enumerate(conversations):
+             frm = conversation["from"]
+             role = self.from2role[frm]
+             message = conversation["value"].strip()
+             text = role + message
+             if i < num_conversation - 1 or generation_preface is None:
+                 text += self.im_end
+             prompt += text
+             token_ids = self._tokenize_with_image_symbol(text)
+             input_ids.extend(token_ids)
+             label_ids = [self.ignore_id] * len(token_ids)
+             if frm == "gpt":
+                 # learning `\n` following `im_end` is meaningless, so the last `\n` token is ignored in label
+                 label_ids[self.gpt_token_num:-1] = token_ids[self.gpt_token_num:-1]
+             labels.extend(label_ids)
+
+         assert self._tokenize_with_image_symbol(prompt) == input_ids
+         assert len(input_ids) == len(labels)
+
+         return prompt, input_ids, labels
+
+     def format_query(self, query, generation_preface=""):
+         prompt, input_ids, _ = self.format([{
+             "from": "human",
+             "value": query
+         }], generation_preface=generation_preface)
+
+         return prompt, input_ids
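
A hedged usage sketch for the formatters above: given a tokenizer whose class name appears in `support_tokenizer_types`, `format` returns the rendered prompt plus aligned `input_ids`/`labels`, with non-assistant tokens masked to `IGNORE_ID` and each `<image>` mapped to `IMAGE_TOKEN_ID`. The checkpoint name is an illustrative assumption:

```python
# Sketch, not part of the commit: assumes a Qwen2 chat tokenizer is available locally or on the Hub.
from transformers import AutoTokenizer
from ovis.model.conversation_formatter import QwenConversationFormatter

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")  # loads a Qwen2TokenizerFast
formatter = QwenConversationFormatter(tokenizer)

conversations = [
    {"from": "human", "value": "<image>\nDescribe the image."},
    {"from": "gpt", "value": "A cat sitting on a windowsill."},
]
prompt, input_ids, labels = formatter.format(conversations)
# input_ids contains IMAGE_TOKEN_ID (-200) where <image> appeared;
# labels are IGNORE_ID (-100) everywhere except the assistant reply tokens.

# For inference, format_query appends an (empty) assistant turn so generation can continue:
prompt, input_ids = formatter.format_query("<image>\nWhat is shown here?")
```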
Ovis/ovis/model/visual_tokenizer/__pycache__/base_visual_tokenizer.cpython-310.pyc ADDED
Binary file (10.1 kB).
 
Ovis/ovis/model/visual_tokenizer/__pycache__/base_visual_tokenizer.cpython-311.pyc ADDED
Binary file (18.3 kB).
 
Ovis/ovis/model/visual_tokenizer/__pycache__/clip_visual_tokenizer.cpython-310.pyc ADDED
Binary file (2.03 kB).
 
Ovis/ovis/model/visual_tokenizer/__pycache__/clip_visual_tokenizer.cpython-311.pyc ADDED
Binary file (3.03 kB).
 
Ovis/ovis/model/visual_tokenizer/__pycache__/siglip_visual_tokenizer.cpython-310.pyc ADDED
Binary file (2.05 kB).
 
Ovis/ovis/model/visual_tokenizer/__pycache__/siglip_visual_tokenizer.cpython-311.pyc ADDED
Binary file (3.06 kB).
 
Ovis/ovis/serve/runner.py ADDED
@@ -0,0 +1,105 @@
+ from dataclasses import field, dataclass
+ from typing import Optional, Union, List
+
+ import torch
+ from PIL import Image
+
+ from ovis.model.modeling_ovis import Ovis
+ from ovis.util.constants import IMAGE_TOKEN
+
+
+ @dataclass
+ class RunnerArguments:
+     model_path: str
+     max_new_tokens: int = field(default=512)
+     do_sample: bool = field(default=False)
+     top_p: Optional[float] = field(default=None)
+     top_k: Optional[int] = field(default=None)
+     temperature: Optional[float] = field(default=None)
+     max_partition: int = field(default=9)
+
+
+ class OvisRunner:
+     def __init__(self, args: RunnerArguments):
+         self.model_path = args.model_path
+         self.device = torch.cuda.current_device()
+         self.dtype = torch.bfloat16
+         self.model = Ovis.from_pretrained(self.model_path, torch_dtype=self.dtype, multimodal_max_length=8192)
+         self.model = self.model.eval().to(device=self.device)
+         self.eos_token_id = self.model.generation_config.eos_token_id
+         self.text_tokenizer = self.model.get_text_tokenizer()
+         self.pad_token_id = self.text_tokenizer.pad_token_id
+         self.visual_tokenizer = self.model.get_visual_tokenizer()
+         self.conversation_formatter = self.model.get_conversation_formatter()
+         self.image_placeholder = IMAGE_TOKEN
+         self.max_partition = args.max_partition
+         self.gen_kwargs = dict(
+             max_new_tokens=args.max_new_tokens,
+             do_sample=args.do_sample,
+             top_p=args.top_p,
+             top_k=args.top_k,
+             temperature=args.temperature,
+             repetition_penalty=None,
+             eos_token_id=self.eos_token_id,
+             pad_token_id=self.pad_token_id,
+             use_cache=True
+         )
+
+     def preprocess(self, inputs: List[Union[Image.Image, str]]):
+         # for single image and single text inputs, ensure image ahead
+         if len(inputs) == 2 and isinstance(inputs[0], str) and isinstance(inputs[1], Image.Image):
+             inputs = reversed(inputs)
+
+         # build query
+         query = ''
+         images = []
+         for data in inputs:
+             if isinstance(data, Image.Image):
+                 query += self.image_placeholder + '\n'
+                 images.append(data)
+             elif isinstance(data, str):
+                 query += data.replace(self.image_placeholder, '')
+             elif data is not None:
+                 raise RuntimeError(f'Invalid input type, expected `PIL.Image.Image` or `str`, but got {type(data)}')
+
+         # format conversation
+         prompt, input_ids, pixel_values = self.model.preprocess_inputs(
+             query, images, max_partition=self.max_partition)
+         attention_mask = torch.ne(input_ids, self.text_tokenizer.pad_token_id)
+         input_ids = input_ids.unsqueeze(0).to(device=self.device)
+         attention_mask = attention_mask.unsqueeze(0).to(device=self.device)
+         if pixel_values is not None:
+             pixel_values = [pixel_values.to(device=self.device, dtype=self.dtype)]
+         else:
+             pixel_values = [None]
+
+         return prompt, input_ids, attention_mask, pixel_values
+
+     def run(self, inputs: List[Union[Image.Image, str]]):
+         prompt, input_ids, attention_mask, pixel_values = self.preprocess(inputs)
+         output_ids = self.model.generate(
+             input_ids,
+             pixel_values=pixel_values,
+             attention_mask=attention_mask,
+             **self.gen_kwargs
+         )
+         output = self.text_tokenizer.decode(output_ids[0], skip_special_tokens=True)
+         input_token_len = input_ids.shape[1]
+         output_token_len = output_ids.shape[1]
+         response = dict(
+             prompt=prompt,
+             output=output,
+             prompt_tokens=input_token_len,
+             total_tokens=input_token_len + output_token_len
+         )
+         return response
+
+
+ if __name__ == '__main__':
+     runner_args = RunnerArguments(model_path='<model_path>')
+     runner = OvisRunner(runner_args)
+     image = Image.open('<image_path>')
+     text = '<prompt>'
+     response = runner.run([image, text])
+     print(response['output'])
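
Beyond the `__main__` block above, the fields returned by `run` can be inspected directly; a small sketch with placeholder paths in the same style as the file:

```python
# Sketch: the response dict carries the rendered prompt, the decoded answer, and token counts.
from PIL import Image
from ovis.serve.runner import RunnerArguments, OvisRunner

runner = OvisRunner(RunnerArguments(model_path='<model_path>', max_new_tokens=256))
response = runner.run([Image.open('<image_path>'), 'What objects are visible in this image?'])
print(response['output'])                                     # decoded answer text
print(response['prompt_tokens'], response['total_tokens'])    # prompt tokens vs. prompt + generated
```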
Ovis/ovis/serve/server.py ADDED
@@ -0,0 +1,41 @@
+ import argparse
+ import os.path
+
+ import gradio as gr
+ from gradio.components import Textbox, Image
+
+ from ovis.serve.runner import RunnerArguments, OvisRunner
+
+
+ class Server:
+     def __init__(self, runner: OvisRunner):
+         self.runner = runner
+
+     def __call__(self, image, text):
+         response = self.runner.run([image, text])
+         output = response["output"]
+         return output
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser(description='Ovis Server')
+     parser.add_argument('--model_path', type=str, required=True)
+     parser.add_argument('--flagging_dir', type=str, default=os.path.expanduser('~/ovis-flagged'))
+     parser.add_argument('--max_partition', type=int, default=9)
+     parser.add_argument('--port', type=int, required=True)
+     args = parser.parse_args()
+
+     os.makedirs(args.flagging_dir, exist_ok=True)
+     runner_args = RunnerArguments(
+         model_path=args.model_path,
+         max_partition=args.max_partition
+     )
+     demo = gr.Interface(
+         fn=Server(OvisRunner(runner_args)),
+         inputs=[Image(type='pil', label='image'),
+                 Textbox(placeholder='Enter your text here...', label='prompt')],
+         outputs=gr.Markdown(),
+         title=args.model_path.split('/')[-1],
+         flagging_dir=args.flagging_dir
+     )
+     demo.launch(server_port=args.port)
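
The Gradio app is launched from the command line with the flags defined above (e.g. `python ovis/serve/server.py --model_path <model_path> --port <port>`). The `Server` callable can also be exercised without Gradio, as in this sketch (placeholder paths as in the repo's own examples):

```python
# Sketch: drive the Server callable directly, reusing OvisRunner defined earlier in this commit.
from PIL import Image
from ovis.serve.runner import RunnerArguments, OvisRunner
from ovis.serve.server import Server

server = Server(OvisRunner(RunnerArguments(model_path='<model_path>')))
answer = server(Image.open('<image_path>'), 'Describe this image.')
print(answer)
```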
Ovis/ovis/train/__init__.py ADDED
File without changes
Ovis/ovis/train/arguments.py ADDED
@@ -0,0 +1,48 @@
+ from dataclasses import dataclass, field
+ from typing import Optional
+
+ import transformers
+
+
+ @dataclass
+ class ModelArguments:
+     llm_name_or_path: Optional[str] = field(default=None)
+     visual_tokenizer_type: str = field(default=None)
+     visual_vocab_size: int = field(default=8192)
+     visual_drop_cls_token: bool = field(default=False)
+     visual_tokenize_function: str = field(default='softmax')
+     visual_tau: float = field(default=1.0)
+     visual_depths: Optional[str] = field(default=None)
+     visual_hidden_stride: int = field(default=1)
+     multimodal_max_length: int = field(default=2048)
+     conversation_formatter_class: str = field(default=None)
+     pad_token_id: Optional[int] = field(default=None)
+     llm_attn_implementation: Optional[str] = field(default=None)
+     disable_tie_weight: bool = field(default=False)
+
+
+ @dataclass
+ class TrainingArguments(transformers.TrainingArguments):
+     dataset_names: Optional[str] = field(default=None)  # a|b|c
+     dataset_info: Optional[str] = field(default='dataset_info_v1_6')
+     ovis_pretrained_path: Optional[str] = field(default=None)
+     visual_tokenizer_pretrained_path: Optional[str] = field(default=None)
+     caption_template: Optional[str] = field(default=None)
+     stage: Optional[int] = field(default=None)
+     train_modules: Optional[str] = field(default=None)
+     cache_dir: Optional[str] = field(default=None)
+     optim: str = field(default="adamw_torch")
+     visual_max_tau: float = field(default=5.0)
+     visual_min_tau: float = field(default=0.05)
+     save_safetensors: bool = field(default=True)
+     monitor_step: int = field(default=100)
+     vte_re_init: bool = field(default=False)
+     text_max_length: int = field(default=1024)
+     max_partitions: str = field(default="9|1|1")
+
+     def __post_init__(self):
+         if self.gradient_checkpointing:
+             self.gradient_checkpointing_kwargs = {"use_reentrant": False}
+         if self.stage < 3:
+             self.save_safetensors = False
+         super().__post_init__()
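
These dataclasses are consumed by `HfArgumentParser` in `train.py` below; a minimal sketch of that parsing flow, with illustrative flag values (the dataset names are placeholders):

```python
# Sketch: parse the training dataclasses from a synthetic argv list.
import transformers
from ovis.train.arguments import ModelArguments, TrainingArguments

parser = transformers.HfArgumentParser((ModelArguments, TrainingArguments))
model_args, training_args = parser.parse_args_into_dataclasses(args=[
    "--llm_name_or_path", "meta-llama/Meta-Llama-3-8B-Instruct",
    "--visual_tokenizer_type", "siglip",
    "--conversation_formatter_class", "Llama3ConversationFormatter",
    "--output_dir", "./checkpoints/stage3",
    "--stage", "3",
    "--train_modules", "all",
    "--dataset_names", "caption_a|conversation_b",   # '|'-separated, per the comment above
])
print(model_args.visual_vocab_size)   # 8192 (default)
print(training_args.max_partitions)   # "9|1|1" (default)
```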
Ovis/ovis/train/callback.py ADDED
@@ -0,0 +1,37 @@
+ import deepspeed
+ import torch
+ from transformers import TrainerCallback, TrainingArguments, TrainerState, TrainerControl
+
+ from ovis.util.constants import END_LINE, BEGIN_LINE
+ from ovis.util.utils import rank0_print
+
+
+ class TuneTauCallback(TrainerCallback):
+     def on_step_begin(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
+         visual_tokenizer = kwargs['model'].get_visual_tokenizer()
+         current_step = state.global_step
+         max_step = state.max_steps
+         ratio = current_step / max_step
+         visual_tokenizer.config.tau = args.visual_max_tau - (args.visual_max_tau - args.visual_min_tau) * ratio
+
+
+ class MonitorCallback(TrainerCallback):
+     def _monitoring(self, model, step):
+         with torch.no_grad():
+             with deepspeed.zero.GatheredParameters(model.get_monitor_tensors().values()):
+                 for k, v in model.get_monitor_tensors().items():
+                     rank0_print(BEGIN_LINE)
+                     rank0_print(f'{k} @ step {step} with sum: {v.sum().item()} and content: ')
+                     rank0_print(v)
+                     rank0_print(END_LINE)
+
+     def on_step_begin(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
+         model = kwargs['model']
+         step = state.global_step
+         if step % args.monitor_step == 0 or step == 10:  # monitor at step 10 for fast check
+             self._monitoring(model, step)
+
+     def on_epoch_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
+         model = kwargs['model']
+         step = state.global_step
+         self._monitoring(model, step)
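
`TuneTauCallback` linearly anneals the Gumbel temperature from `visual_max_tau` down to `visual_min_tau` over training; the schedule reduces to this small function (a restatement for clarity, using the defaults from `arguments.py`):

```python
# The tau schedule implemented by TuneTauCallback.on_step_begin, restated.
def scheduled_tau(step: int, max_steps: int, max_tau: float = 5.0, min_tau: float = 0.05) -> float:
    ratio = step / max_steps
    return max_tau - (max_tau - min_tau) * ratio

assert scheduled_tau(0, 1000) == 5.0                  # start of training
assert abs(scheduled_tau(1000, 1000) - 0.05) < 1e-9   # end of training
```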
Ovis/ovis/train/train.py ADDED
@@ -0,0 +1,206 @@
+ import json
+ import os
+ import pathlib
+
+ import deepspeed
+ import torch
+ import transformers
+ from deepspeed import get_accelerator
+ from torch.utils.data import ConcatDataset
+ from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModel, AutoConfig
+ from transformers import Trainer
+ from transformers.integrations.deepspeed import unset_hf_deepspeed_config, set_hf_deepspeed_config
+
+ from ovis.train.callback import TuneTauCallback, MonitorCallback
+ from ovis.model.configuration_ovis import OvisConfig
+ from ovis.model.modeling_ovis import Ovis
+ from ovis.train.arguments import ModelArguments, TrainingArguments
+ from ovis.train.dataset.caption_dataset import CaptionDataset
+ from ovis.train.dataset.conversation_dataset import ConversationDataset
+ from ovis.train.dataset.multimodal_dataset import DataCollatorForMultimodalDataset
+ from ovis.util.constants import BEGIN_LINE, END_LINE
+ from ovis.util.utils import smart_unit, rank0_print
+
+
+ def train():
+     # parse args
+     parser = transformers.HfArgumentParser(
+         (ModelArguments, TrainingArguments))
+     model_args, training_args = parser.parse_args_into_dataclasses()
+
+     # save args to checkpoint dir
+     with training_args.main_process_first(local=False):
+         if training_args.process_index == 0:
+             def args2dict(args):
+                 return {k: str(v) for k, v in args.__dict__.items()}
+
+             args_log = json.dumps(dict(
+                 model_args=args2dict(model_args),
+                 training_args=args2dict(training_args)
+             ), ensure_ascii=False, indent=2)
+             print(args_log)
+             os.makedirs(training_args.output_dir, exist_ok=True)
+             with open(os.path.join(training_args.output_dir, 'model_training_args.json'), 'w',
+                       encoding='utf-8') as f:
+                 f.write(args_log + '\n')
+
+     # construct or load ovis model
+     if not training_args.ovis_pretrained_path:  # construct model (S1)
+         # 1. construct ovis config
+         ovis_config = OvisConfig(
+             multimodal_max_length=model_args.multimodal_max_length,
+             conversation_formatter_class=model_args.conversation_formatter_class,
+             llm_attn_implementation=model_args.llm_attn_implementation
+         )
+         # 2. load pretrained llm and text tokenizer
+         attn_kwargs = dict()
+         if model_args.llm_attn_implementation:
+             attn_kwargs['attn_implementation'] = model_args.llm_attn_implementation
+         llm = AutoModelForCausalLM.from_pretrained(model_args.llm_name_or_path, **attn_kwargs)
+         text_tokenizer = AutoTokenizer.from_pretrained(model_args.llm_name_or_path)
+         if text_tokenizer.pad_token_id is None and model_args.pad_token_id is not None:
+             text_tokenizer.pad_token_id = model_args.pad_token_id
+         # 3. construct visual tokenizer
+         # deepspeed zero.Init with bfloat16 fails for visual_tokenizer, so temporarily disable zero.Init here
+         unset_hf_deepspeed_config()
+         if training_args.visual_tokenizer_pretrained_path is not None:
+             visual_tokenizer = AutoModel.from_pretrained(
+                 training_args.visual_tokenizer_pretrained_path,
+                 image_processor_name_or_path=training_args.visual_tokenizer_pretrained_path
+             )
+         else:
+             visual_tokenizer_config = AutoConfig.for_model(
+                 model_type=model_args.visual_tokenizer_type + "_visual_tokenizer",
+                 vocab_size=model_args.visual_vocab_size,
+                 tokenize_function=model_args.visual_tokenize_function,
+                 tau=model_args.visual_tau,
+                 depths=model_args.visual_depths,
+                 drop_cls_token=model_args.visual_drop_cls_token,
+                 hidden_stride=model_args.visual_hidden_stride,
+             )
+             visual_tokenizer = AutoModel.from_config(visual_tokenizer_config, train_from_scratch=True)
+         visual_tokenizer = visual_tokenizer.to(
+             device=torch.device(get_accelerator().device_name(os.getenv("LOCAL_RANK"))))
+         if getattr(training_args, 'hf_deepspeed_config', None) is not None:
+             set_hf_deepspeed_config(training_args.hf_deepspeed_config)
+         # 4. construct ovis model
+         model = Ovis(ovis_config, llm=llm, text_tokenizer=text_tokenizer, visual_tokenizer=visual_tokenizer,
+                      train_from_scratch=True)
+     else:  # load pretrained ovis model
+         model, loading_info = Ovis.from_pretrained(training_args.ovis_pretrained_path,
+                                                    multimodal_max_length=model_args.multimodal_max_length,
+                                                    output_loading_info=True)
+         rank0_print(BEGIN_LINE)
+         rank0_print(f'Loading info of Ovis:\n{loading_info}')
+         rank0_print(END_LINE)
+         training_args.vte_re_init = False
+
+     model.get_llm().config.use_cache = False
+     model.config.use_cache = False
+     text_tokenizer = model.get_text_tokenizer()
+
+     rank0_print(BEGIN_LINE)
+     rank0_print(f'model.config:\n{model.config}')
+     rank0_print(END_LINE)
+
+     # maybe re-init vte
+     if training_args.vte_re_init:
+         with deepspeed.zero.GatheredParameters([model.get_wte().weight]):
+             mean = model.get_wte().weight.mean().item()
+             std = model.get_wte().weight.std().item()
+             rank0_print(f'Statistics of embedding table of LLM: {mean=}, {std=}')
+             model.re_init_vte(mean, std)
+
+     # select train modules
+     model.requires_grad_(False)
+     for module in training_args.train_modules.split('|'):
+         if module == 'all':
+             model.requires_grad_(True)
+         elif module == 'llm':
+             model.get_llm().requires_grad_(True)
+         elif module == 'visual_tokenizer':
+             model.get_visual_tokenizer().requires_grad_(True)
+         elif module == 'visual_tokenizer.backbone':
+             model.get_visual_tokenizer().get_backbone().requires_grad_(True)
+         elif module.startswith('visual_tokenizer.backbone.layer.'):
+             layer_index = int(module[len('visual_tokenizer.backbone.layer.'):])
+             layer = model.get_visual_tokenizer().get_backbone_layer(layer_index)
+             layer.requires_grad_(True)
+         elif module == 'visual_tokenizer.head':
+             model.get_visual_tokenizer().get_head().requires_grad_(True)
+         elif module == 'vte':
+             model.get_vte().requires_grad_(True)
+         else:
+             raise ValueError(f'Invalid train module name: {module}')
+
+     rank0_print(BEGIN_LINE)
+     rank0_print('Parameters to train:')
+     for name, param in model.named_parameters():
+         if param.requires_grad:
+             rank0_print(name)
+     rank0_print(f'LLM\'s attn implementation: {model.get_llm().config._attn_implementation}')
+     rank0_print(END_LINE)
+
+     # construct data module
+     datasets = []
+     dataset_info_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
+                                      f'dataset/{training_args.dataset_info}.json')
+     with open(dataset_info_path, 'r', encoding='utf-8') as f:
+         dataset_info = json.load(f)
+     for name in training_args.dataset_names.split('|'):
+         info = dataset_info[name]
+         data_format = info['data_format']
+         if data_format == 'caption':
+             dataset = CaptionDataset(name, info, model, training_args)
+         elif data_format == 'conversation':
+             dataset = ConversationDataset(name, info, model, training_args)
+         else:
+             raise ValueError(f'Invalid data format `{data_format}` for dataset `{name}`')
+         datasets.append(dataset)
+     data_module = dict(
+         train_dataset=ConcatDataset(datasets),
+         data_collator=DataCollatorForMultimodalDataset(text_tokenizer)
+     )
+
+     # train
+     train_callbacks = [MonitorCallback]
+     if model_args.visual_tokenize_function == 'gumbel_argmax':
+         train_callbacks.append(TuneTauCallback)
+     trainer = Trainer(
+         model=model,
+         args=training_args,
+         callbacks=train_callbacks,
+         **data_module
+     )
+     rank0_print(BEGIN_LINE)
+     rank0_print('Dataset sample tensor:')
+     rank0_print(data_module['train_dataset'][0])
+     rank0_print(END_LINE)
+     rank0_print(BEGIN_LINE)
+     rank0_print('Dataset sample input_ids decoding:')
+     rank0_print(text_tokenizer.decode([x for x in data_module['train_dataset'][0]['input_ids'] if x >= 0]))
+     rank0_print(END_LINE)
+     rank0_print(BEGIN_LINE)
+     rank0_print('Dataset sample labels decoding:')
+     rank0_print(text_tokenizer.decode([x for x in data_module['train_dataset'][0]['labels'] if x >= 0]))
+     rank0_print(END_LINE)
+     rank0_print(BEGIN_LINE)
+     rank0_print(f'#param of model: {smart_unit(model.num_parameters())}')
+     rank0_print(f'#param of llm: {smart_unit(model.get_llm().num_parameters())}')
+     rank0_print(f'#param of visual_tokenizer: {smart_unit(model.get_visual_tokenizer().num_parameters())}')
+     rank0_print(f'#param of vte: {smart_unit(model.get_vte().weight.numel())}')
+     rank0_print(END_LINE)
+     if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
+         trainer.train(resume_from_checkpoint=True)
+     else:
+         trainer.train()
+     trainer.save_state()
+
+     # save model
+     model.get_llm().config.use_cache = True
+     model.config.use_cache = True
+     trainer.save_model()
+
+
+ if __name__ == '__main__':
+     train()
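
The dataset registry read above (`dataset/<dataset_info>.json`) maps each name in `--dataset_names` to an entry whose `data_format` selects `CaptionDataset` or `ConversationDataset`. A hedged sketch of what such an entry might look like; only the `data_format` key is actually required by this script, and the remaining field names are assumptions, since the dataset classes are not part of this diff:

```python
# Hypothetical structure of entries in dataset/dataset_info_v1_6.json (expressed as a Python dict).
dataset_info = {
    "coco_caption_example": {
        "data_format": "caption",              # routed to CaptionDataset
        "meta_file": "/data/coco/meta.json",   # assumed field
        "image_dir": "/data/coco/images",      # assumed field
    },
    "llava_conversation_example": {
        "data_format": "conversation",         # routed to ConversationDataset
        "meta_file": "/data/llava/meta.json",  # assumed field
        "image_dir": "/data/llava/images",     # assumed field
    },
}
```

Similarly, `--train_modules` takes a `|`-separated list drawn from the branches above, e.g. `visual_tokenizer.head|vte` unfreezes only the tokenizer head and the visual token embedding.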
Ovis/ovis/util/constants.py ADDED
@@ -0,0 +1,11 @@
+ # Model Constants
+ IGNORE_ID = -100
+ IMAGE_TOKEN_ID = -200
+ IMAGE_TOKEN = "<image>"
+
+ IMAGE_ATOM_ID = -300
+ IMAGE_INDICATOR_IDS = [-301, -302, -303, -304, -305]
+
+ # Log & Print
+ BEGIN_LINE = '========================************========================'
+ END_LINE = '------------------------------------------------------------'
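
All of these placeholder ids are negative, so they can never collide with real tokenizer vocabulary ids; `train.py` above relies on this when it decodes samples by filtering `x >= 0`. A one-line illustration (the positive ids are arbitrary example values):

```python
input_ids = [128000, 9906, -200, 11, 1917]        # token ids with an <image> placeholder (-200) mixed in
text_only = [x for x in input_ids if x >= 0]      # drops every placeholder, leaving a decodable sequence
```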
Ovis/ovis/util/utils.py ADDED
@@ -0,0 +1,26 @@
1
+ import os
2
+ from importlib import import_module
3
+
4
+
5
+ def rank0_print(*args):
6
+ if int(os.getenv("LOCAL_PROCESS_RANK", os.getenv("LOCAL_RANK", 0))) == 0:
7
+ print(*args)
8
+
9
+
10
+ def smart_unit(num):
11
+ if num / 1.0e9 >= 1:
12
+ return f'{num / 1.0e9:.2f}B'
13
+ else:
14
+ return f'{num / 1.0e6:.2f}M'
15
+
16
+
17
+ def import_class_from_string(full_class_string):
18
+ # Split the path to get separate module and class names
19
+ module_path, _, class_name = full_class_string.rpartition('.')
20
+
21
+ # Import the module using the module path
22
+ module = import_module(module_path)
23
+
24
+ # Get the class from the imported module
25
+ cls = getattr(module, class_name)
26
+ return cls
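As a quick illustration of these helpers (a sketch, not part of the repository), `smart_unit` formats parameter counts the way the training log above prints them, and `import_class_from_string` resolves a dotted path to a class object; this assumes the Ovis package is on the Python path:

```python
from ovis.util.utils import smart_unit, import_class_from_string

print(smart_unit(7_241_000_000))   # "7.24B" -- billions when >= 1e9
print(smart_unit(86_000_000))      # "86.00M" -- otherwise millions

# Resolve a fully qualified class name at runtime; any importable dotted path works.
OrderedDict = import_class_from_string("collections.OrderedDict")
assert OrderedDict().__class__.__name__ == "OrderedDict"
```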
llm2vec/docs/.gitignore ADDED
@@ -0,0 +1,9 @@
+ # Jekyll
+ _site
+ .sass-cache
+ .jekyll-metadata
+ Gemfile.lock
+ vendor/*
+ .bundle
+
+ *.vscode
llm2vec/docs/Gemfile ADDED
@@ -0,0 +1,18 @@
+ source "https://rubygems.org"
+
+ gem "webrick"
+ gem "github-pages", group: :jekyll_plugins
+
+ gem "tzinfo-data"
+ gem "wdm", "~> 0.1.0" if Gem.win_platform?
+
+ # If you have any plugins, put them here!
+ group :jekyll_plugins do
+   gem "jekyll-paginate"
+   gem "jekyll-sitemap"
+   gem "jekyll-gist"
+   gem "jekyll-feed"
+   gem "jemoji"
+   gem "jekyll-include-cache"
+   gem "jekyll-algolia"
+ end
llm2vec/docs/README.md ADDED
@@ -0,0 +1,104 @@
1
+ # Project Page Template
2
+
3
+ This is a project page template for McGill-NLP projects.
4
+
5
+ ![Demo of project page](images/demo.jpg)
6
+
7
+ ## Getting started
8
+
9
+ You can follow one of the following ways to get started.
10
+
11
+ ### Copy from template
12
+
13
+ If you have not yet created a repo for your project, you can copy the template from the following link. Simply click on the "Use this template" button, or [click here](https://github.com/McGill-NLP/project-page-template/generate).
14
+
15
+ ### Cloning with git
16
+
17
+ If you have already created a repo for your project, you can clone the template and copy the files to your project. Let's assume you are in the root of your project directory, e.g. `my-project`.
18
+
19
+ ```bash
20
+ cd .. # Go to parent directory
21
+ git clone https://github.com/McGill-NLP/project-page-template
22
+ cp -r project-page-template/docs my-project/
23
+ cp project-page-template/README.md my-project/docs/
24
+ cd my-project/
25
+ ```
26
+
27
+ ## Activate GitHub Pages
28
+
29
+ Once the template is copied to your project, you need to activate GitHub Pages for your project.
30
+
31
+ 1. Click on the "Settings" button in the top right corner of the page.
32
+ 2. Click on the "Pages" tab.
33
+ 3. In "Source", select branch to be "main" and folder to "/docs".
34
+ 4. Click on "Save"
35
+ 5. Go to the "Actions" tab (on the right of "Pull requests" tab) and wait for the action to finish.
36
+ 6. Visit your project page at mcgill-nlp.github.io/<your-project-name>
37
+
38
+
39
+ ### Why `docs/`?
40
+
41
+ You might be wondering why all the files for the webpage are in `/docs`. The page is not really "docs" per se; it's because `github-pages` only allows us to use either the root folder or `/docs`, so we are forced to use the latter in order to clearly separate this page from the rest of the project. Maybe in the future GitHub will allow other names like `/page`, but until then there's nothing we can do...
42
+
43
+ > Well technically, you *can* push everything to a separate branch, but the original author of this repo is in the school of thought that branches are meant for alternate or historical versions of the `main` branch, or serve as a way to create pull requests. The original author will also vehemently defend this philosophy if one tries to argue otherwise :)
44
+
45
+ But what if you actually want to write some docs for your library? You can just edit `docs/_pages/docs.md`, which is the real page for documentations.
46
+
47
+
48
+ ## Navigation bar
49
+
50
+ You can already find links to different pages in the navigation bar. To add, remove, or modify links, you can edit the [`docs/_data/navigation.yml`](docs/_data/navigation.yml) file. The `title` corresponds to the text that appears on the navbar, and the `url` corresponds to the relative URL of the page. It is not recommended to include an external URL, as that should be on the `/home` page.
51
+
52
+ ## Modifying a page
53
+
54
+ The files are located in [`docs/_pages/`](docs/_pages/). For example, if you want to modify the `/home` page, you would edit `docs/_pages/home.md`.
55
+
56
+ All of the pages are markdown files with something called a [front matter](https://jekyllrb.com/docs/front-matter/) at the top, which uses YAML syntax. Generally, all you need to worry about is the title and the permalink (the latter is the relative URL of the page). In the case of `/home`, you also need to specify external links (`header.actions`) and author names (`excerpt`). However, a template is already provided for you; you only need to modify the content.
57
+
58
+ ## Adding and removing pages
59
+
60
+ To add a page:
61
+ 1. Create a new file in `docs/_pages/`, with the desired `permalink` to be your relative URL. So for example, `/contact/` links to `mcgill-nlp.github.io/my-project/contact`.
62
+ 2. In `docs/_data/navigation.yml`, add a new entry with the `title` and `url` from the previous step.
63
+
64
+ To remove a page:
65
+ 1. Delete the file in `docs/_pages/`.
66
+ 2. In `docs/_data/navigation.yml`, remove the entry with the same `url` as the deleted file.
67
+
68
+
69
+ ## Documentations and API for your project
70
+
71
+ Note that there's a tab that says `docs`, and you can see that it links to other pages. This is a standalone doc page inside your webpage. Note also that, due to the GitHub Pages caveat mentioned above, we were forced to put the webpage in `/docs`, while the actual docs live in `/docs/_docs`. Now that that's cleared up, you can head to [`/docs/_docs/README.md`](/docs/_docs/README.md) to read the instructions.
72
+
73
+ > Do you feel writing documentation is too complicated or time-consuming, and you'd like something more straightforward? Check out the [template for using MkDocs](https://github.com/McGill-NLP/mkdocs-template) instead. However, the simplicity comes at the cost of more repositories and different frameworks to maintain.
74
+
75
+ ## Advanced
76
+
77
+ For any advanced modification, it is recommended to look in the advanced section of the readme of the [group website](https://github.com/McGill-NLP/mcgill-nlp.github.io). Below are a few extra tips, included for convenience.
78
+
79
+ ### Setup
80
+
81
+ Please refer to setup instructions in the readme of the [group website](https://github.com/McGill-NLP/mcgill-nlp.github.io).
82
+
83
+ ### Running locally
84
+
85
+ ```bash
86
+ cd docs/
87
+ bundle exec jekyll serve
88
+ ```
89
+
90
+ ### Removing dark mode
91
+
92
+ To remove dark mode, inside [`docs/_config.yml`](docs/_config.yml) file, remove `dark_theme_css`. The dark mode should automatically turn off.
93
+
94
+ ### Updating footer
95
+
96
+ Inside [`docs/_config.yml`](docs/_config.yml) file, you can modify the footer.
97
+
98
+ ### Modify `excerpt` in a splash page (`/home`)
99
+
100
+ If you want to modify the excerpt in the `/home` page, you can do so in [`docs/_sass_/splash.scss`](docs/_sass_/splash.scss). Note that `splash.scss` was added specifically for this template, not for the group website.
101
+
102
+ ### Modify or remove icons in splash page buttons
103
+
104
+ This is handled in [`docs/_includes/page__hero.html`](docs/_includes/page__hero.html). That file was added specifically for this template, not for the group website. You can modify that file to add, modify or remove icons.
llm2vec/docs/_config.yml ADDED
@@ -0,0 +1,110 @@
1
+ # Welcome to Jekyll!
2
+ #
3
+ # This config file is meant for settings that affect your whole blog, values
4
+ # which you are expected to set up once and rarely edit after that. If you find
5
+ # yourself editing this file very often, consider using Jekyll's data files
6
+ # feature for the data you need to update frequently.
7
+ #
8
+ # For technical reasons, this file is *NOT* reloaded automatically when you use
9
+ # 'bundle exec jekyll serve'. If you change this file, please restart the server process.
10
+
11
+ # Site settings
12
+ # These are used to personalize your new site. If you look in the HTML files,
13
+ # you will see them accessed via {{ site.title }}, {{ site.email }}, and so on.
14
+ # You can create any custom variable you would like, and they will be accessible
15
+ # in the templates via {{ site.myvariable }}.
16
+ title: McGill NLP
17
+ email:
18
+ description: >- # this means to ignore newlines until "baseurl:"
19
+ McGill NLP is a research group within McGill University and Mila focusing on various topics of natural language processing.
20
+ twitter_username: McGill_NLP
21
+ github_username: McGill-NLP
22
+ logo: "/assets/images/logo/logo.png"
23
+ dark_theme_css: "/assets/css/main-dark.css"
24
+ future: true
25
+
26
+ # Build settings
27
+ markdown: kramdown
28
+ remote_theme: mmistakes/minimal-mistakes@4.24.0
29
+ # Outputting
30
+ permalink: /:categories/:title/
31
+ timezone: America/Montreal
32
+
33
+ include:
34
+ - _pages
35
+ - _docs
36
+
37
+ # Exclude from processing.
38
+ # The following items will not be processed, by default. Create a custom list
39
+ # to override the default setting.
40
+ # exclude:
41
+ # - Gemfile
42
+ # - Gemfile.lock
43
+ # - node_modules
44
+ # - vendor/bundle/
45
+ # - vendor/cache/
46
+ # - vendor/gems/
47
+ # - vendor/ruby/
48
+
49
+ # Plugins (previously gems:)
50
+ plugins:
51
+ - jekyll-sitemap
52
+ - jekyll-gist
53
+ - jemoji
54
+ - jekyll-include-cache
55
+
56
+ author:
57
+ name : "McGill NLP Member(s)"
58
+ avatar : "/assets/images/bio/default.jpg"
59
+ bio : "Current or former lab member(s) worked on this."
60
+ links:
61
+ - label: "Website"
62
+ icon: "fas fa-fw fa-link"
63
+ url: "https://mcgill-nlp.github.io"
64
+ - label: "GitHub"
65
+ icon: "fab fa-fw fa-github"
66
+ url: "https://github.com/McGill-NLP"
67
+ - label: "Twitter"
68
+ icon: "fab fa-fw fa-twitter-square"
69
+ url: "https://twitter.com/McGill_NLP"
70
+
71
+ analytics:
72
+ provider: "google-gtag"
73
+ google:
74
+ tracking_id: "G-MEDG9XN4VP"
75
+ anonymize_ip: false # default
76
+
77
+ atom_feed:
78
+ hide: true
79
+
80
+ footer:
81
+ links:
82
+ - label: "GitHub"
83
+ icon: "fab fa-fw fa-github"
84
+ url: "https://github.com/McGill-NLP"
85
+ - label: "Twitter"
86
+ icon: "fab fa-fw fa-twitter-square"
87
+ url: "https://twitter.com/McGill_NLP"
88
+
89
+ defaults:
90
+ # /docs/_pages
91
+ - scope:
92
+ path: "_pages"
93
+ type: pages
94
+ values:
95
+ layout: single
96
+ classes:
97
+ - no-sidebar
98
+ - wide
99
+ author_profile: false
100
+ # /docs/_docs
101
+ - scope:
102
+ path: "_docs"
103
+ type: pages
104
+ values:
105
+ layout: single
106
+ sidebar:
107
+ title: "Doc Pages"
108
+ nav: sidebar-docs # See /docs/_data/navigation.yml
109
+ toc: true
110
+ toc_label: "Table of Contents"
llm2vec/docs/_data/navigation.yml ADDED
@@ -0,0 +1,17 @@
+ main:
+   - title: "Home"
+     url: /
+   - title: "Leaderboard"
+     url: /leaderboard/
+   - title: "Docs"
+     url: /docs/
+   - title: "Contact"
+     url: /contact/
+
+ sidebar-docs: # See "include" in /_config.yml and /docs/_docs
+   - title: "Home"
+     url: /docs/
+   - title: "API"
+     url: /docs/api
+   - title: "Training"
+     url: /docs/training
llm2vec/docs/_includes/head/custom.html ADDED
@@ -0,0 +1,48 @@
1
+ <!-- Add favicon -->
2
+ <link rel="icon" type="image/png" href="{{ site.baseurl }}/assets/images/logo/favicon.png">
3
+
4
+ {% if site.dark_theme_css %}
5
+ <!-- Dark Mode -->
6
+ <link rel="stylesheet" href="{{ '/assets/css/main.css' | relative_url }}" id="theme-css">
7
+ <link rel="stylesheet alternate" href="{{ site.dark_theme_css | relative_url }}" id="theme-css-dark">
8
+
9
+ <script type="text/javascript">
10
+ const updateNodesRel = theme => {
11
+ const node_light = document.getElementById('theme-css');
12
+ const node_dark = document.getElementById('theme-css-dark');
13
+
14
+ if (theme === "dark") {
15
+ node_light.setAttribute('rel', 'stylesheet alternate');
16
+ node_dark.setAttribute('rel', 'stylesheet');
17
+ }
18
+ else if (theme === "light") {
19
+ node_light.setAttribute('rel', 'stylesheet');
20
+ node_dark.setAttribute('rel', 'stylesheet alternate');
21
+ }
22
+ }
23
+
24
+ const changeTheme = () => {
25
+ let theme = sessionStorage.getItem('theme');
26
+
27
+ // Change the theme to the other option
28
+ if (theme === "light") {
29
+ theme = "dark";
30
+ } else {
31
+ theme = "light";
32
+ }
33
+
34
+ // Update the stored session and the nodes' rel attribute
35
+ sessionStorage.setItem('theme', theme);
36
+ updateNodesRel(theme);
37
+
38
+ return false;
39
+ }
40
+
41
+ if (sessionStorage.getItem('theme') === null) {
42
+ sessionStorage.setItem('theme', "light");
43
+ }
44
+
45
+ const theme = sessionStorage.getItem('theme');
46
+ updateNodesRel(theme);
47
+ </script>
48
+ {% endif %}
llm2vec/docs/_sass/custom/header-footer.scss ADDED
@@ -0,0 +1,19 @@
+ a.site-title {
+   @media (min-width: 601px) {
+     font-size: xx-large;
+   }
+   @media (max-width: 600px) {
+     font-size: large;
+   }
+   color: $primary-color;
+
+   &:hover {
+     color: mix($background-color, $primary-color, 25%);
+   }
+ }
+
+ .theme-toggle {
+   @media (min-width: 601px) {
+     margin: 0px;
+   }
+ }
llm2vec/docs/_sass/custom/no-sidebar.scss ADDED
@@ -0,0 +1,9 @@
+ .no-sidebar article.page {
+   float: left;
+   width: 100%;
+ }
+
+ .no-sidebar .archive {
+   float: left;
+   width: 100%;
+ }
llm2vec/docs/_sass/custom/splash.scss ADDED
@@ -0,0 +1,5 @@
+ // This contains the styles for the "excerpt" on a splash page
+ div.wrapper > p.page__lead {
+   font-size: x-large;
+   max-width: 100%;
+ }
llm2vec/docs/_sass/skins/dark.scss ADDED
@@ -0,0 +1,30 @@
+ /* ==========================================================================
+    Dark skin
+    Imported in /assets/css/main-light.scss
+    ========================================================================== */
+
+ /* Colors */
+ $background-color: #000000 !default;
+ $text-color: #eaeaea !default;
+ $primary-color: #ED1B2F !default;
+ $border-color: mix(#fff, $background-color, 20%) !default;
+ $code-background-color: mix(#000, $background-color, 15%) !default;
+ $code-background-color-dark: mix(#000, $background-color, 20%) !default;
+ $form-background-color: mix(#000, $background-color, 15%) !default;
+ $footer-background-color: mix($text-color, $background-color, 5%) !default;
+ $link-color: mix($primary-color, $text-color, 100%) !default;
+ $link-color-hover: mix($background-color, $link-color, 15%) !default;
+ $link-color-visited: mix(#000, $link-color, 0%) !default;
+ $masthead-link-color: $text-color !default;
+ $masthead-link-color-hover: mix(#000, $text-color, 20%) !default;
+
+ .author__urls.social-icons i,
+ .author__urls.social-icons .svg-inline--fa,
+ .page__footer-follow .social-icons i,
+ .page__footer-follow .social-icons .svg-inline--fa {
+   color: inherit;
+ }
+
+ .ais-search-box .ais-search-box--input {
+   background-color: $form-background-color;
+ }
llm2vec/docs/_sass/skins/light.scss ADDED
@@ -0,0 +1,12 @@
+ /*
+   Imported in /assets/css/main-light.scss
+ */
+ $background-color: #fff !default;
+ $text-color: #000 !default;
+ $primary-color: #ED1B2F !default;
+ // $footer-background-color: mix($primary-color, $background-color, 100%) !default;
+ $link-color: #ED1B2F !default;
+ $link-color-hover: mix(#fff, $link-color, 25%) !default;
+ $link-color-visited: mix(#000, $link-color, 10%) !default;
+ $masthead-link-color: $text-color !default;
+ $masthead-link-color-hover: mix($background-color, $text-color, 25%) !default;
llm2vec/docs/assets/images/logo/favicon.png ADDED
llm2vec/docs/assets/images/logo/logo.png ADDED
llm2vec/docs/assets/images/logo/logo.svg ADDED
llm2vec/examples/classification.py ADDED
@@ -0,0 +1,62 @@
1
+ from sklearn.metrics import accuracy_score, f1_score
2
+ from sklearn.linear_model import LogisticRegression
3
+ import datasets
4
+ import numpy as np
5
+
6
+ import torch
7
+ from llm2vec import LLM2Vec
8
+
9
+ dataset = "mteb/amazon_counterfactual"
10
+ instruction = "Classify a given Amazon customer review text as either counterfactual or notcounterfactual: "
11
+
12
+ dataset = datasets.load_dataset(dataset, "en")
13
+
14
+ sentences_train, y_train = dataset["train"]["text"], dataset["train"]["label"]
15
+ sentences_test, y_test = dataset["test"]["text"], dataset["test"]["label"]
16
+ max_iter = 100
17
+ batch_size = 8
18
+
19
+ scores = {}
20
+ clf = LogisticRegression(
21
+ random_state=42,
22
+ n_jobs=1,
23
+ max_iter=max_iter,
24
+ verbose=0,
25
+ )
26
+
27
+ print("Loading model...")
28
+ model = LLM2Vec.from_pretrained(
29
+ "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
30
+ peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
31
+ device_map="cuda" if torch.cuda.is_available() else "cpu",
32
+ torch_dtype=torch.bfloat16,
33
+ )
34
+
35
+
36
+ def append_instruction(instruction, sentences):
37
+ new_sentences = []
38
+ for s in sentences:
39
+ new_sentences.append([instruction, s, 0])
40
+ return new_sentences
41
+
42
+
43
+ print(f"Encoding {len(sentences_train)} training sentences...")
44
+ sentences_train = append_instruction(instruction, sentences_train)
45
+ X_train = np.asarray(model.encode(sentences_train, batch_size=batch_size))
46
+
47
+ print(f"Encoding {len(sentences_test)} test sentences...")
48
+ sentences_test = append_instruction(instruction, sentences_test)
49
+ X_test = np.asarray(model.encode(sentences_test, batch_size=batch_size))
50
+
51
+ print("Fitting logistic regression classifier...")
52
+ clf.fit(X_train, y_train)
53
+ print("Evaluating...")
54
+ y_pred = clf.predict(X_test)
55
+
56
+ accuracy = accuracy_score(y_test, y_pred)
57
+ scores["accuracy"] = accuracy
58
+ f1 = f1_score(y_test, y_pred, average="macro")
59
+ scores["f1"] = f1
60
+
61
+ print(scores)
62
+ # {'accuracy': 0.891044776119403, 'f1': 0.8283106625713033}
llm2vec/examples/clustering.py ADDED
@@ -0,0 +1,58 @@
1
+ import sklearn
2
+ import sklearn.cluster
3
+
4
+ import datasets
5
+ import tqdm
6
+ import numpy as np
7
+
8
+ import torch
9
+ from llm2vec import LLM2Vec
10
+
11
+ dataset = "mteb/twentynewsgroups-clustering"
12
+ instruction = "Identify the topic or theme of the given news articles: "
13
+
14
+ dataset = datasets.load_dataset(dataset)
15
+ batch_size = 32
16
+
17
+ print("Loading model...")
18
+ model = LLM2Vec.from_pretrained(
19
+ "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
20
+ peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
21
+ device_map="cuda" if torch.cuda.is_available() else "cpu",
22
+ torch_dtype=torch.bfloat16,
23
+ )
24
+
25
+
26
+ def append_instruction(instruction, sentences):
27
+ new_sentences = []
28
+ for s in sentences:
29
+ new_sentences.append([instruction, s, 0])
30
+ return new_sentences
31
+
32
+
33
+ v_measures = []
34
+ for cluster_set in tqdm.tqdm(dataset["test"], desc="Clustering"):
35
+ sentences = cluster_set["sentences"]
36
+ labels = cluster_set["labels"]
37
+ clustering_batch_size = 500
38
+
39
+ print(f"Encoding {len(sentences)} sentences...")
40
+ new_sentences = append_instruction(instruction, sentences)
41
+ corpus_embeddings = np.asarray(model.encode(new_sentences, batch_size=batch_size))
42
+
43
+ print("Fitting Mini-Batch K-Means model...")
44
+ clustering_model = sklearn.cluster.MiniBatchKMeans(
45
+ n_clusters=len(set(labels)), batch_size=clustering_batch_size
46
+ )
47
+ clustering_model.fit(corpus_embeddings)
48
+ cluster_assignment = clustering_model.labels_
49
+
50
+ print("Evaluating...")
51
+ v_measure = sklearn.metrics.cluster.v_measure_score(labels, cluster_assignment)
52
+ v_measures.append(v_measure)
53
+
54
+ v_mean = np.mean(v_measures)
55
+ v_std = np.std(v_measures)
56
+
57
+ print(v_mean)
58
+ # 0.5137461051538426
llm2vec/examples/retrieval.py ADDED
@@ -0,0 +1,177 @@
1
+ import datasets
2
+ import torch
3
+ from llm2vec import LLM2Vec
4
+ from beir import util
5
+ from beir.datasets.data_loader import GenericDataLoader as BeirDataLoader
6
+ import os
7
+ from typing import Dict, List
8
+
9
+ from beir.retrieval.evaluation import EvaluateRetrieval
10
+
11
+ dataset = "arguana"
12
+ instruction = "Given a claim, find documents that refute the claim: "
13
+
14
+ print("Loading dataset...")
15
+ url = (
16
+ f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
17
+ )
18
+ download_path = os.path.join(datasets.config.HF_DATASETS_CACHE, "BeIR")
19
+ data_path = util.download_and_unzip(url, download_path)
20
+ corpus, queries, relevant_docs = BeirDataLoader(data_folder=data_path).load(
21
+ split="test"
22
+ )
23
+ batch_size = 8
24
+
25
+ print("Loading model...")
26
+ model = LLM2Vec.from_pretrained(
27
+ "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
28
+ peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
29
+ device_map="cuda" if torch.cuda.is_available() else "cpu",
30
+ torch_dtype=torch.bfloat16,
31
+ )
32
+
33
+
34
+ def append_instruction(instruction, sentences):
35
+ new_sentences = []
36
+ for s in sentences:
37
+ new_sentences.append([instruction, s, 0])
38
+ return new_sentences
39
+
40
+
41
+ def cos_sim(a: torch.Tensor, b: torch.Tensor):
42
+ if not isinstance(a, torch.Tensor):
43
+ a = torch.tensor(a)
44
+
45
+ if not isinstance(b, torch.Tensor):
46
+ b = torch.tensor(b)
47
+
48
+ if len(a.shape) == 1:
49
+ a = a.unsqueeze(0)
50
+
51
+ if len(b.shape) == 1:
52
+ b = b.unsqueeze(0)
53
+
54
+ a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
55
+ b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
56
+ return torch.mm(a_norm, b_norm.transpose(0, 1))
57
+
58
+
59
+ def encode_queries(queries: List[str], batch_size: int, **kwargs):
60
+ new_sentences = append_instruction(instruction, queries)
61
+
62
+ kwargs["show_progress_bar"] = False
63
+ return model.encode(new_sentences, batch_size=batch_size, **kwargs)
64
+
65
+
66
+ def encode_corpus(corpus: List[Dict[str, str]], batch_size: int, **kwargs):
67
+ if type(corpus) is dict:
68
+ sentences = [
69
+ (
70
+ (corpus["title"][i] + " " + corpus["text"][i]).strip()
71
+ if "title" in corpus
72
+ else corpus["text"][i].strip()
73
+ )
74
+ for i in range(len(corpus["text"]))
75
+ ]
76
+ else:
77
+ sentences = [
78
+ (
79
+ (doc["title"] + " " + doc["text"]).strip()
80
+ if "title" in doc
81
+ else doc["text"].strip()
82
+ )
83
+ for doc in corpus
84
+ ]
85
+ new_sentences = append_instruction("", sentences)
86
+ return model.encode(new_sentences, batch_size=batch_size, **kwargs)
87
+
88
+
89
+ print("Encoding Queries...")
90
+ query_ids = list(queries.keys())
91
+ results = {qid: {} for qid in query_ids}
92
+ queries = [queries[qid] for qid in queries]
93
+ query_embeddings = encode_queries(
94
+ queries, batch_size=batch_size, show_progress_bar=True, convert_to_tensor=True
95
+ )
96
+
97
+ print("Sorting Corpus by document length (Longest first)...")
98
+ corpus_ids = sorted(
99
+ corpus,
100
+ key=lambda k: len(corpus[k].get("title", "") + corpus[k].get("text", "")),
101
+ reverse=True,
102
+ )
103
+ corpus = [corpus[cid] for cid in corpus_ids]
104
+
105
+ print("Encoding Corpus ... Warning: This might take a while!")
106
+ corpus_embeddings = encode_corpus(
107
+ corpus, batch_size=batch_size, show_progress_bar=True, convert_to_tensor=True
108
+ )
109
+
110
+ print("Scoring Function: {} ({})".format("Cosine Similarity", "cos_sim"))
111
+ cos_scores = cos_sim(query_embeddings, corpus_embeddings)
112
+ cos_scores[torch.isnan(cos_scores)] = -1
113
+
114
+ # Get top-k values
115
+ top_k = 1000
116
+ cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(
117
+ cos_scores, min(top_k + 1, len(cos_scores[0])), dim=1, largest=True, sorted=False
118
+ )
119
+ cos_scores_top_k_values = cos_scores_top_k_values.cpu().tolist()
120
+ cos_scores_top_k_idx = cos_scores_top_k_idx.cpu().tolist()
121
+
122
+ for query_itr in range(len(query_embeddings)):
123
+ query_id = query_ids[query_itr]
124
+ for sub_corpus_id, score in zip(
125
+ cos_scores_top_k_idx[query_itr], cos_scores_top_k_values[query_itr]
126
+ ):
127
+ corpus_id = corpus_ids[sub_corpus_id]
128
+ if corpus_id != query_id:
129
+ results[query_id][corpus_id] = score
130
+
131
+ retriever = EvaluateRetrieval(model, score_function="cos_sim")
132
+ ndcg, _map, recall, precision = retriever.evaluate(
133
+ relevant_docs, results, retriever.k_values
134
+ )
135
+ mrr = retriever.evaluate_custom(relevant_docs, results, retriever.k_values, "mrr")
136
+
137
+ scores = {
138
+ **{f"ndcg_at_{k.split('@')[1]}": v for (k, v) in ndcg.items()},
139
+ **{f"map_at_{k.split('@')[1]}": v for (k, v) in _map.items()},
140
+ **{f"recall_at_{k.split('@')[1]}": v for (k, v) in recall.items()},
141
+ **{f"precision_at_{k.split('@')[1]}": v for (k, v) in precision.items()},
142
+ **{f"mrr_at_{k.split('@')[1]}": v for (k, v) in mrr.items()},
143
+ }
144
+ print(scores)
145
+ """
146
+ {
147
+ 'ndcg_at_1': 0.32788,
148
+ 'ndcg_at_3': 0.47534,
149
+ 'ndcg_at_5': 0.52296,
150
+ 'ndcg_at_10': 0.57505,
151
+ 'ndcg_at_100': 0.6076,
152
+ 'ndcg_at_1000': 0.60801,
153
+ 'map_at_1': 0.32788,
154
+ 'map_at_3': 0.43883,
155
+ 'map_at_5': 0.46518,
156
+ 'map_at_10': 0.48675,
157
+ 'map_at_100': 0.49506,
158
+ 'map_at_1000': 0.49509,
159
+ 'recall_at_1': 0.32788,
160
+ 'recall_at_3': 0.58108,
161
+ 'recall_at_5': 0.69701,
162
+ 'recall_at_10': 0.85775,
163
+ 'recall_at_100': 0.9936,
164
+ 'recall_at_1000': 0.99644,
165
+ 'precision_at_1': 0.32788,
166
+ 'precision_at_3': 0.19369,
167
+ 'precision_at_5': 0.1394,
168
+ 'precision_at_10': 0.08578,
169
+ 'precision_at_100': 0.00994,
170
+ 'precision_at_1000': 0.001,
171
+ 'mrr_at_1': 0.33357,
172
+ 'mrr_at_3': 0.44085,
173
+ 'mrr_at_5': 0.46745,
174
+ 'mrr_at_10': 0.4888,
175
+ 'mrr_at_100': 0.49718,
176
+ 'mrr_at_1000': 0.49721}
177
+ """
llm2vec/examples/sts.py ADDED
@@ -0,0 +1,57 @@
1
+ import datasets
2
+ import numpy as np
3
+ from sklearn.metrics.pairwise import paired_cosine_distances
4
+ from scipy.stats import spearmanr
5
+
6
+ import torch
7
+ from llm2vec import LLM2Vec
8
+
9
+
10
+ dataset = "mteb/sts17-crosslingual-sts"
11
+ instruction = "Retrieve semantically similar text: "
12
+
13
+ dataset = datasets.load_dataset(dataset, "en-en")
14
+
15
+ min_score, max_score = 0, 5
16
+ normalize = lambda x: (x - min_score) / (max_score - min_score)
17
+ normalized_scores = list(map(normalize, dataset["test"]["score"]))
18
+ batch_size = 8
19
+
20
+ sentences1, sentences2 = dataset["test"]["sentence1"], dataset["test"]["sentence2"]
21
+
22
+ print("Loading model...")
23
+ model = LLM2Vec.from_pretrained(
24
+ "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
25
+ peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
26
+ device_map="cuda" if torch.cuda.is_available() else "cpu",
27
+ torch_dtype=torch.bfloat16,
28
+ )
29
+
30
+
31
+ def append_instruction(instruction, sentences):
32
+ new_sentences = []
33
+ for s in sentences:
34
+ new_sentences.append([instruction, s, 0])
35
+ return new_sentences
36
+
37
+
38
+ print(f"Encoding {len(sentences1)} sentences1...")
39
+ sentences1 = append_instruction(instruction, sentences1)
40
+ embeddings1 = np.asarray(model.encode(sentences1, batch_size=batch_size))
41
+
42
+ print(f"Encoding {len(sentences2)} sentences2...")
43
+ sentences2 = append_instruction(instruction, sentences2)
44
+ embeddings2 = np.asarray(model.encode(sentences2, batch_size=batch_size))
45
+
46
+ print("Evaluating...")
47
+ cosine_scores = 1 - (paired_cosine_distances(embeddings1, embeddings2))
48
+ cosine_spearman, _ = spearmanr(normalized_scores, cosine_scores)
49
+
50
+ results = {
51
+ "cos_sim": {
52
+ "spearman": cosine_spearman,
53
+ }
54
+ }
55
+
56
+ print(results)
57
+ # {'cos_sim': {'spearman': 0.9021906216635642}}
llm2vec/experiments/mteb_eval.py ADDED
@@ -0,0 +1,31 @@
+ import argparse
+ import mteb
+ import json
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument(
+         "--model_name",
+         type=str,
+         default="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
+     )
+     parser.add_argument("--task_name", type=str, default="STS16")
+     parser.add_argument(
+         "--task_to_instructions_fp",
+         type=str,
+         default="test_configs/mteb/task_to_instructions.json",
+     )
+     parser.add_argument("--output_dir", type=str, default="results")
+
+     args = parser.parse_args()
+
+     model_kwargs = {}
+     if args.task_to_instructions_fp is not None:
+         with open(args.task_to_instructions_fp, "r") as f:
+             task_to_instructions = json.load(f)
+         model_kwargs["task_to_instructions"] = task_to_instructions
+
+     model = mteb.get_model(args.model_name, **model_kwargs)
+     tasks = mteb.get_tasks(tasks=[args.task_name])
+     evaluation = mteb.MTEB(tasks=tasks)
+     results = evaluation.run(model, output_folder=args.output_dir)
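The `--task_to_instructions_fp` flag points at a JSON object mapping MTEB task names to instruction strings, which the script forwards to `mteb.get_model` as `task_to_instructions`. A condensed sketch of the same calls with the mapping inlined; the instruction text here is a hypothetical placeholder, the real mapping lives in `test_configs/mteb/task_to_instructions.json`:

```python
import mteb

# Hypothetical in-memory equivalent of task_to_instructions.json.
task_to_instructions = {
    "STS16": "Retrieve semantically similar text: ",
}

model = mteb.get_model(
    "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    task_to_instructions=task_to_instructions,
)
evaluation = mteb.MTEB(tasks=mteb.get_tasks(tasks=["STS16"]))
results = evaluation.run(model, output_folder="results")
```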
llm2vec/experiments/mteb_eval_custom.py ADDED
@@ -0,0 +1,98 @@
1
+ import argparse
2
+ from typing import Any
3
+ import mteb
4
+ import json
5
+ import torch
6
+
7
+ import numpy as np
8
+ from mteb.models.instructions import task_to_instruction
9
+ from mteb.models.text_formatting_utils import corpus_to_texts
10
+
11
+ from llm2vec import LLM2Vec
12
+
13
+ def llm2vec_instruction(instruction):
14
+ if len(instruction) > 0 and instruction[-1] != ":":
15
+ instruction = instruction.strip(".") + ":"
16
+ return instruction
17
+
18
+
19
+ class LLM2VecWrapper:
20
+ def __init__(self, model=None, task_to_instructions=None):
21
+
22
+ self.task_to_instructions = task_to_instructions
23
+ self.model = model
24
+
25
+ def encode(
26
+ self,
27
+ sentences: list[str],
28
+ *,
29
+ prompt_name: str = None,
30
+ **kwargs: Any, # noqa
31
+ ) -> np.ndarray:
32
+ if prompt_name is not None:
33
+ instruction = (
34
+ self.task_to_instructions[prompt_name]
35
+ if self.task_to_instructions
36
+ and prompt_name in self.task_to_instructions
37
+ else llm2vec_instruction(task_to_instruction(prompt_name))
38
+ )
39
+ else:
40
+ instruction = ""
41
+
42
+ sentences = [[instruction, sentence] for sentence in sentences]
43
+ return self.model.encode(sentences, **kwargs)
44
+
45
+ def encode_corpus(
46
+ self,
47
+ corpus: list[dict[str, str]] | dict[str, list[str]] | list[str],
48
+ prompt_name: str = None,
49
+ **kwargs: Any,
50
+ ) -> np.ndarray:
51
+ sentences = corpus_to_texts(corpus, sep=" ")
52
+ sentences = [["", sentence] for sentence in sentences]
53
+ if "request_qid" in kwargs:
54
+ kwargs.pop("request_qid")
55
+ return self.model.encode(sentences, **kwargs)
56
+
57
+ def encode_queries(self, queries: list[str], **kwargs: Any) -> np.ndarray:
58
+ return self.encode(queries, **kwargs)
59
+
60
+
61
+ if __name__ == "__main__":
62
+ parser = argparse.ArgumentParser()
63
+ parser.add_argument(
64
+ "--base_model_name_or_path",
65
+ type=str,
66
+ default="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
67
+ )
68
+ parser.add_argument(
69
+ "--peft_model_name_or_path",
70
+ type=str,
71
+ default="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
72
+ )
73
+ parser.add_argument("--task_name", type=str, default="STS16")
74
+ parser.add_argument(
75
+ "--task_to_instructions_fp",
76
+ type=str,
77
+ default="test_configs/mteb/task_to_instructions.json",
78
+ )
79
+ parser.add_argument("--output_dir", type=str, default="results")
80
+
81
+ args = parser.parse_args()
82
+
83
+ task_to_instructions = None
84
+ if args.task_to_instructions_fp is not None:
85
+ with open(args.task_to_instructions_fp, "r") as f:
86
+ task_to_instructions = json.load(f)
87
+
88
+ l2v_model = LLM2Vec.from_pretrained(
89
+ args.base_model_name_or_path,
90
+ peft_model_name_or_path=args.peft_model_name_or_path,
91
+ device_map="cuda" if torch.cuda.is_available() else "cpu",
92
+ torch_dtype=torch.bfloat16,
93
+ )
94
+
95
+ model = LLM2VecWrapper(model=l2v_model, task_to_instructions=task_to_instructions)
96
+ tasks = mteb.get_tasks(tasks=[args.task_name])
97
+ evaluation = mteb.MTEB(tasks=tasks)
98
+ results = evaluation.run(model, output_folder=args.output_dir)
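The wrapper above prepends an instruction to every sentence before calling `LLM2Vec.encode`, and `llm2vec_instruction` normalizes instructions so they end with a colon. A small standalone check of that normalization (the example strings are illustrative only):

```python
def llm2vec_instruction(instruction):
    # Same normalization as in mteb_eval_custom.py: a trailing period becomes a colon.
    if len(instruction) > 0 and instruction[-1] != ":":
        instruction = instruction.strip(".") + ":"
    return instruction

print(llm2vec_instruction("Retrieve semantically similar text."))
# -> "Retrieve semantically similar text:"
print(llm2vec_instruction(""))
# -> ""  (empty instructions are left untouched)
```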
llm2vec/experiments/run_mntp.py ADDED
@@ -0,0 +1,997 @@
1
+ #!/usr/bin/env python
2
+ # coding=utf-8
3
+ # Copyright 2020 The HuggingFace Team All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """
17
+ The script is adapted from https://github.com/huggingface/transformers/blob/51bcadc10a569847b93a30dbe3a077037ae63bad/examples/pytorch/language-modeling/run_mlm.py
18
+ """
19
+
20
+ import logging
21
+ import math
22
+ import os
23
+ import sys
24
+ import warnings
25
+ from dataclasses import dataclass, field
26
+ from itertools import chain
27
+ from typing import Optional, Any, Tuple, List
28
+ import numpy as np
29
+
30
+ import datasets
31
+ import evaluate
32
+ from datasets import load_dataset
33
+
34
+ import torch
35
+ import transformers
36
+ from transformers import (
37
+ CONFIG_MAPPING,
38
+ MODEL_FOR_MASKED_LM_MAPPING,
39
+ AutoConfig,
40
+ AutoTokenizer,
41
+ DataCollatorForLanguageModeling,
42
+ HfArgumentParser,
43
+ Trainer,
44
+ TrainingArguments,
45
+ TrainerCallback,
46
+ is_torch_tpu_available,
47
+ set_seed,
48
+ )
49
+ from transformers.trainer_utils import get_last_checkpoint
50
+ from transformers.utils import send_example_telemetry
51
+ from transformers.utils.versions import require_version
52
+
53
+ from peft import LoraConfig, get_peft_model
54
+
55
+ from llm2vec.models import (
56
+ MistralBiForMNTP,
57
+ LlamaBiForMNTP,
58
+ GemmaBiForMNTP,
59
+ Qwen2BiForMNTP,
60
+ )
61
+
62
+ # Will error if the minimal version of Transformers is not installed. Remove at your own risks.
63
+ # check_min_version("4.38.0.dev0")
64
+
65
+ require_version(
66
+ "datasets>=1.8.0",
67
+ "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt",
68
+ )
69
+
70
+ logger = logging.getLogger(__name__)
71
+ MODEL_CONFIG_CLASSES = list(MODEL_FOR_MASKED_LM_MAPPING.keys())
72
+ MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
73
+
74
+
75
+ def get_model_class(config):
76
+ config_class_name = config.__class__.__name__
77
+ if config_class_name == "MistralConfig":
78
+ return MistralBiForMNTP
79
+ elif config_class_name == "LlamaConfig":
80
+ return LlamaBiForMNTP
81
+ elif config_class_name == "GemmaConfig":
82
+ return GemmaBiForMNTP
83
+ elif config_class_name == "Qwen2Config":
84
+ return Qwen2BiForMNTP
85
+ else:
86
+ raise ValueError(f"Model class {config_class_name} not supported.")
87
+
88
+
89
+ def initialize_peft(
90
+ model,
91
+ lora_r: int = 8,
92
+ lora_alpha: int = 16,
93
+ lora_dropout: float = 0.05,
94
+ lora_modules: Optional[List[str]] = None,
95
+ ):
96
+ if lora_modules is None and model.config.__class__.__name__ in [
97
+ "LlamaConfig",
98
+ "MistralConfig",
99
+ "GemmaConfig",
100
+ "Qwen2Config",
101
+ ]:
102
+ lora_modules = [
103
+ "q_proj",
104
+ "v_proj",
105
+ "k_proj",
106
+ "o_proj",
107
+ "gate_proj",
108
+ "up_proj",
109
+ "down_proj",
110
+ ]
111
+ elif lora_modules is None:
112
+ raise ValueError("lora_modules must be specified for this model.")
113
+
114
+ config = LoraConfig(
115
+ r=lora_r,
116
+ lora_alpha=lora_alpha,
117
+ target_modules=lora_modules,
118
+ lora_dropout=lora_dropout,
119
+ bias="none",
120
+ task_type=None,
121
+ )
122
+
123
+ model = get_peft_model(model, config)
124
+ print(f"Model's Lora trainable parameters:")
125
+ model.print_trainable_parameters()
126
+ return model
127
+
128
+
129
+ @dataclass
130
+ class ModelArguments:
131
+ """
132
+ Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
133
+ """
134
+
135
+ model_name_or_path: Optional[str] = field(
136
+ default=None,
137
+ metadata={
138
+ "help": (
139
+ "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch."
140
+ )
141
+ },
142
+ )
143
+ model_type: Optional[str] = field(
144
+ default=None,
145
+ metadata={
146
+ "help": "If training from scratch, pass a model type from the list: "
147
+ + ", ".join(MODEL_TYPES)
148
+ },
149
+ )
150
+ config_overrides: Optional[str] = field(
151
+ default=None,
152
+ metadata={
153
+ "help": (
154
+ "Override some existing default config settings when a model is trained from scratch. Example: "
155
+ "n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index"
156
+ )
157
+ },
158
+ )
159
+ config_name: Optional[str] = field(
160
+ default=None,
161
+ metadata={
162
+ "help": "Pretrained config name or path if not the same as model_name"
163
+ },
164
+ )
165
+ tokenizer_name: Optional[str] = field(
166
+ default=None,
167
+ metadata={
168
+ "help": "Pretrained tokenizer name or path if not the same as model_name"
169
+ },
170
+ )
171
+ cache_dir: Optional[str] = field(
172
+ default=None,
173
+ metadata={
174
+ "help": "Where do you want to store the pretrained models downloaded from huggingface.co"
175
+ },
176
+ )
177
+ use_fast_tokenizer: bool = field(
178
+ default=True,
179
+ metadata={
180
+ "help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."
181
+ },
182
+ )
183
+ model_revision: str = field(
184
+ default="main",
185
+ metadata={
186
+ "help": "The specific model version to use (can be a branch name, tag name or commit id)."
187
+ },
188
+ )
189
+ token: str = field(
190
+ default=None,
191
+ metadata={
192
+ "help": (
193
+ "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token "
194
+ "generated when running `huggingface-cli login` (stored in `~/.huggingface`)."
195
+ )
196
+ },
197
+ )
198
+ use_auth_token: bool = field(
199
+ default=None,
200
+ metadata={
201
+ "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead."
202
+ },
203
+ )
204
+ trust_remote_code: bool = field(
205
+ default=False,
206
+ metadata={
207
+ "help": (
208
+ "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option "
209
+ "should only be set to `True` for repositories you trust and in which you have read the code, as it will "
210
+ "execute code present on the Hub on your local machine."
211
+ )
212
+ },
213
+ )
214
+ torch_dtype: Optional[str] = field(
215
+ default=None,
216
+ metadata={
217
+ "help": (
218
+ "Override the default `torch.dtype` and load the model under this dtype. If `auto` is passed, the "
219
+ "dtype will be automatically derived from the model's weights."
220
+ ),
221
+ "choices": ["auto", "bfloat16", "float16", "float32"],
222
+ },
223
+ )
224
+ attn_implementation: Optional[str] = field(
225
+ default="sdpa",
226
+ metadata={
227
+ "help": ("The attention implementation to use in the model."),
228
+ "choices": ["eager", "sdpa", "flash_attention_2"],
229
+ },
230
+ )
231
+ low_cpu_mem_usage: bool = field(
232
+ default=False,
233
+ metadata={
234
+ "help": (
235
+ "It is an option to create the model as an empty shell, then only materialize its parameters when the pretrained weights are loaded. "
236
+ "set True will benefit LLM loading time and RAM consumption."
237
+ )
238
+ },
239
+ )
240
+
241
+ def __post_init__(self):
242
+ if self.config_overrides is not None and (
243
+ self.config_name is not None or self.model_name_or_path is not None
244
+ ):
245
+ raise ValueError(
246
+ "--config_overrides can't be used in combination with --config_name or --model_name_or_path"
247
+ )
248
+
249
+
250
+ @dataclass
251
+ class DataTrainingArguments:
252
+ """
253
+ Arguments pertaining to what data we are going to input our model for training and eval.
254
+ """
255
+
256
+ dataset_name: Optional[str] = field(
257
+ default=None,
258
+ metadata={"help": "The name of the dataset to use (via the datasets library)."},
259
+ )
260
+ dataset_config_name: Optional[str] = field(
261
+ default=None,
262
+ metadata={
263
+ "help": "The configuration name of the dataset to use (via the datasets library)."
264
+ },
265
+ )
266
+ train_file: Optional[str] = field(
267
+ default=None, metadata={"help": "The input training data file (a text file)."}
268
+ )
269
+ validation_file: Optional[str] = field(
270
+ default=None,
271
+ metadata={
272
+ "help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."
273
+ },
274
+ )
275
+ overwrite_cache: bool = field(
276
+ default=True,
277
+ metadata={"help": "Overwrite the cached training and evaluation sets"},
278
+ )
279
+ validation_split_percentage: Optional[int] = field(
280
+ default=5,
281
+ metadata={
282
+ "help": "The percentage of the train set used as validation set in case there's no validation split"
283
+ },
284
+ )
285
+ max_seq_length: Optional[int] = field(
286
+ default=None,
287
+ metadata={
288
+ "help": (
289
+ "The maximum total input sequence length after tokenization. Sequences longer "
290
+ "than this will be truncated."
291
+ )
292
+ },
293
+ )
294
+ preprocessing_num_workers: Optional[int] = field(
295
+ default=None,
296
+ metadata={"help": "The number of processes to use for the preprocessing."},
297
+ )
298
+ mlm_probability: float = field(
299
+ default=0.15,
300
+ metadata={"help": "Ratio of tokens to mask for masked language modeling loss"},
301
+ )
302
+ line_by_line: bool = field(
303
+ default=False,
304
+ metadata={
305
+ "help": "Whether distinct lines of text in the dataset are to be handled as distinct sequences."
306
+ },
307
+ )
308
+ pad_to_max_length: bool = field(
309
+ default=False,
310
+ metadata={
311
+ "help": (
312
+ "Whether to pad all samples to `max_seq_length`. "
313
+ "If False, will pad the samples dynamically when batching to the maximum length in the batch."
314
+ )
315
+ },
316
+ )
317
+ max_train_samples: Optional[int] = field(
318
+ default=None,
319
+ metadata={
320
+ "help": (
321
+ "For debugging purposes or quicker training, truncate the number of training examples to this "
322
+ "value if set."
323
+ )
324
+ },
325
+ )
326
+ max_eval_samples: Optional[int] = field(
327
+ default=None,
328
+ metadata={
329
+ "help": (
330
+ "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
331
+ "value if set."
332
+ )
333
+ },
334
+ )
335
+ streaming: bool = field(default=False, metadata={"help": "Enable streaming mode"})
336
+
337
+ def __post_init__(self):
338
+ if self.streaming:
339
+ require_version(
340
+ "datasets>=2.0.0", "The streaming feature requires `datasets>=2.0.0`"
341
+ )
342
+
343
+ if (
344
+ self.dataset_name is None
345
+ and self.train_file is None
346
+ and self.validation_file is None
347
+ ):
348
+ raise ValueError(
349
+ "Need either a dataset name or a training/validation file."
350
+ )
351
+ else:
352
+ if self.train_file is not None:
353
+ extension = self.train_file.split(".")[-1]
354
+ if extension not in ["csv", "json", "txt"]:
355
+ raise ValueError(
356
+ "`train_file` should be a csv, a json or a txt file."
357
+ )
358
+ if self.validation_file is not None:
359
+ extension = self.validation_file.split(".")[-1]
360
+ if extension not in ["csv", "json", "txt"]:
361
+ raise ValueError(
362
+ "`validation_file` should be a csv, a json or a txt file."
363
+ )
364
+
365
+
366
+ # add more arguments
367
+ @dataclass
368
+ class CustomArguments:
369
+ """
370
+ Custom arguments for the script
371
+ """
372
+
373
+ lora_dropout: float = field(
374
+ default=0.05, metadata={"help": "The dropout rate for lora"}
375
+ )
376
+
377
+ lora_r: int = field(default=8, metadata={"help": "The r value for lora"})
378
+
379
+ mask_token_type: str = field(
380
+ default="blank",
381
+ metadata={"help": "The type of mask token. Options: blank, eos, mask"},
382
+ )
383
+
384
+ stop_after_n_steps: int = field(
385
+ default=10000, metadata={"help": "Stop training after n steps"}
386
+ )
387
+
388
+ data_collator_type: str = field(
389
+ default="default",
390
+ metadata={"help": "The type of data collator. Options: default, all_mask"},
391
+ )
392
+
393
+
394
+ class DataCollatorForLanguageModelingWithFullMasking(DataCollatorForLanguageModeling):
395
+ def torch_mask_tokens(
396
+ self,
397
+ inputs: Any,
398
+ special_tokens_mask: Optional[Any] = None,
399
+ ) -> Tuple[Any, Any]:
400
+ """
401
+ Prepare masked tokens inputs/labels for masked language modeling: 100% MASK, 0% random, 0% original.
402
+ """
403
+ import torch
404
+
405
+ labels = inputs.clone()
406
+ # We sample a few tokens in each sequence for MLM training (with probability `self.mlm_probability`)
407
+ probability_matrix = torch.full(labels.shape, self.mlm_probability)
408
+ if special_tokens_mask is None:
409
+ special_tokens_mask = [
410
+ self.tokenizer.get_special_tokens_mask(
411
+ val, already_has_special_tokens=True
412
+ )
413
+ for val in labels.tolist()
414
+ ]
415
+ special_tokens_mask = torch.tensor(special_tokens_mask, dtype=torch.bool)
416
+ else:
417
+ special_tokens_mask = special_tokens_mask.bool()
418
+
419
+ probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
420
+ masked_indices = torch.bernoulli(probability_matrix).bool()
421
+ labels[~masked_indices] = -100 # We only compute loss on masked tokens
422
+
423
+ # 100% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
424
+ inputs[masked_indices] = self.tokenizer.convert_tokens_to_ids(
425
+ self.tokenizer.mask_token
426
+ )
427
+
428
+ return inputs, labels
429
+
430
+
431
+ class StopTrainingCallback(TrainerCallback):
432
+ def __init__(self, stop_after_n_steps: int):
433
+ self.stop_after_n_steps = stop_after_n_steps
434
+
435
+ def on_step_end(self, args, state, control, **kwargs):
436
+ if state.global_step >= self.stop_after_n_steps:
437
+ control.should_training_stop = True
438
+
439
+
440
+ class MNTPTrainer(Trainer):
441
+ def __init__(self, *args, **kwargs):
442
+ super().__init__(*args, **kwargs)
443
+ self.label_names = ["labels"]
444
+
445
+ def _remove_unused_columns(
446
+ self, dataset: "datasets.Dataset", description: Optional[str] = None
447
+ ):
448
+ return dataset
449
+
450
+ # We need a custom save function as we have to save the inner model
451
+ def _save(self, output_dir: Optional[str] = None, state_dict=None):
452
+ # If we are executing this function, we are the process zero, so we don't check for that.
453
+ output_dir = output_dir if output_dir is not None else self.args.output_dir
454
+ os.makedirs(output_dir, exist_ok=True)
455
+ logger.info(f"Saving model checkpoint to {output_dir}")
456
+
457
+ # model organization is MODEL_TYPEBiForMNTP.model -> MODEL_TYPELBiModel, we have to save the inner model, handled by save_peft_model function of the outer model
458
+ self.model.save_peft_model(output_dir)
459
+ self.tokenizer.save_pretrained(output_dir)
460
+
461
+ # Good practice: save your training arguments together with the trained model
462
+ torch.save(self.args, os.path.join(output_dir, "training_args.bin"))
463
+
464
+
465
+ def main():
466
+ # See all possible arguments in src/transformers/training_args.py
467
+ # or by passing the --help flag to this script.
468
+ # We now keep distinct sets of args, for a cleaner separation of concerns.
469
+
470
+ parser = HfArgumentParser(
471
+ (ModelArguments, DataTrainingArguments, TrainingArguments, CustomArguments)
472
+ )
473
+ if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
474
+ # If we pass only one argument to the script and it's the path to a json file,
475
+ # let's parse it to get our arguments.
476
+ model_args, data_args, training_args, custom_args = parser.parse_json_file(
477
+ json_file=os.path.abspath(sys.argv[1])
478
+ )
479
+ else:
480
+ (
481
+ model_args,
482
+ data_args,
483
+ training_args,
484
+ custom_args,
485
+ ) = parser.parse_args_into_dataclasses()
486
+
487
+ if training_args.gradient_checkpointing:
488
+ training_args.gradient_checkpointing_kwargs = {"use_reentrant": False}
489
+
490
+ if model_args.use_auth_token is not None:
491
+ warnings.warn(
492
+ "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.",
493
+ FutureWarning,
494
+ )
495
+ if model_args.token is not None:
496
+ raise ValueError(
497
+ "`token` and `use_auth_token` are both specified. Please set only the argument `token`."
498
+ )
499
+ model_args.token = model_args.use_auth_token
500
+
501
+ # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The
502
+ # information sent is the one passed as arguments along with your Python/PyTorch versions.
503
+ send_example_telemetry("run_mlm", model_args, data_args)
504
+
505
+ # Setup logging
506
+ logging.basicConfig(
507
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
508
+ datefmt="%m/%d/%Y %H:%M:%S",
509
+ handlers=[logging.StreamHandler(sys.stdout)],
510
+ )
511
+
512
+ if training_args.should_log:
513
+ # The default of training_args.log_level is passive, so we set log level at info here to have that default.
514
+ transformers.utils.logging.set_verbosity_info()
515
+
516
+ log_level = training_args.get_process_log_level()
517
+ logger.setLevel(log_level)
518
+ datasets.utils.logging.set_verbosity(log_level)
519
+ transformers.utils.logging.set_verbosity(log_level)
520
+ transformers.utils.logging.enable_default_handler()
521
+ transformers.utils.logging.enable_explicit_format()
522
+
523
+ # Log on each process the small summary:
524
+ logger.warning(
525
+ f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, "
526
+ + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}"
527
+ )
528
+ # Set the verbosity to info of the Transformers logger (on main process only):
529
+ logger.info(f"Training/evaluation parameters {training_args}")
530
+
531
+ # Detecting last checkpoint.
532
+ last_checkpoint = None
533
+ if (
534
+ os.path.isdir(training_args.output_dir)
535
+ and training_args.do_train
536
+ and not training_args.overwrite_output_dir
537
+ ):
538
+ last_checkpoint = get_last_checkpoint(training_args.output_dir)
539
+ if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
540
+ raise ValueError(
541
+ f"Output directory ({training_args.output_dir}) already exists and is not empty. "
542
+ "Use --overwrite_output_dir to overcome."
543
+ )
544
+ elif (
545
+ last_checkpoint is not None and training_args.resume_from_checkpoint is None
546
+ ):
547
+ logger.info(
548
+ f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
549
+ "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
550
+ )
551
+
552
+ # Set seed before initializing model.
553
+ set_seed(training_args.seed)
554
+
555
+ # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below)
556
+ # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/
557
+ # (the dataset will be downloaded automatically from the datasets Hub
558
+ #
559
+ # For CSV/JSON files, this script will use the column called 'text' or the first column. You can easily tweak this
560
+ # behavior (see below)
561
+ #
562
+ # In distributed training, the load_dataset function guarantee that only one local process can concurrently
563
+ # download the dataset.
564
+ if data_args.dataset_name is not None:
565
+ # Downloading and loading a dataset from the hub.
566
+ raw_datasets = load_dataset(
567
+ data_args.dataset_name,
568
+ data_args.dataset_config_name,
569
+ cache_dir=model_args.cache_dir,
570
+ token=model_args.token,
571
+ streaming=data_args.streaming,
572
+ )
573
+ if "validation" not in raw_datasets.keys():
574
+ raw_datasets["validation"] = load_dataset(
575
+ data_args.dataset_name,
576
+ data_args.dataset_config_name,
577
+ split=f"train[:{data_args.validation_split_percentage}%]",
578
+ cache_dir=model_args.cache_dir,
579
+ token=model_args.token,
580
+ streaming=data_args.streaming,
581
+ )
582
+ raw_datasets["train"] = load_dataset(
583
+ data_args.dataset_name,
584
+ data_args.dataset_config_name,
585
+ split=f"train[{data_args.validation_split_percentage}%:]",
586
+ cache_dir=model_args.cache_dir,
587
+ token=model_args.token,
588
+ streaming=data_args.streaming,
589
+ )
590
+ else:
591
+ data_files = {}
592
+ if data_args.train_file is not None:
593
+ data_files["train"] = data_args.train_file
594
+ extension = data_args.train_file.split(".")[-1]
595
+ if data_args.validation_file is not None:
596
+ data_files["validation"] = data_args.validation_file
597
+ extension = data_args.validation_file.split(".")[-1]
598
+ if extension == "txt":
599
+ extension = "text"
600
+ raw_datasets = load_dataset(
601
+ extension,
602
+ data_files=data_files,
603
+ cache_dir=model_args.cache_dir,
604
+ token=model_args.token,
605
+ )
606
+
607
+ # If no validation data is there, validation_split_percentage will be used to divide the dataset.
608
+ if "validation" not in raw_datasets.keys():
609
+ raw_datasets["validation"] = load_dataset(
610
+ extension,
611
+ data_files=data_files,
612
+ split=f"train[:{data_args.validation_split_percentage}%]",
613
+ cache_dir=model_args.cache_dir,
614
+ token=model_args.token,
615
+ )
616
+ raw_datasets["train"] = load_dataset(
617
+ extension,
618
+ data_files=data_files,
619
+ split=f"train[{data_args.validation_split_percentage}%:]",
620
+ cache_dir=model_args.cache_dir,
621
+ token=model_args.token,
622
+ )
623
+
624
+ # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at
625
+ # https://huggingface.co/docs/datasets/loading_datasets.
626
+
627
+ # Load pretrained model and tokenizer
628
+ #
629
+ # Distributed training:
630
+ # The .from_pretrained methods guarantee that only one local process can concurrently
631
+ # download model & vocab.
632
+ config_kwargs = {
633
+ "cache_dir": model_args.cache_dir,
634
+ "revision": model_args.model_revision,
635
+ "token": model_args.token,
636
+ "trust_remote_code": model_args.trust_remote_code,
637
+ }
638
+ if model_args.config_name:
639
+ config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs)
640
+ elif model_args.model_name_or_path:
641
+ config = AutoConfig.from_pretrained(
642
+ model_args.model_name_or_path, **config_kwargs
643
+ )
644
+ else:
645
+ config = CONFIG_MAPPING[model_args.model_type]()
646
+ logger.warning("You are instantiating a new config instance from scratch.")
647
+ if model_args.config_overrides is not None:
648
+ logger.info(f"Overriding config: {model_args.config_overrides}")
649
+ config.update_from_string(model_args.config_overrides)
650
+ logger.info(f"New config: {config}")
651
+
652
+ tokenizer_kwargs = {
653
+ "cache_dir": model_args.cache_dir,
654
+ "use_fast": model_args.use_fast_tokenizer,
655
+ "revision": model_args.model_revision,
656
+ "token": model_args.token,
657
+ "trust_remote_code": model_args.trust_remote_code,
658
+ }
659
+ if model_args.tokenizer_name:
660
+ tokenizer = AutoTokenizer.from_pretrained(
661
+ model_args.tokenizer_name, **tokenizer_kwargs
662
+ )
663
+ elif model_args.model_name_or_path:
664
+ tokenizer = AutoTokenizer.from_pretrained(
665
+ model_args.model_name_or_path, **tokenizer_kwargs
666
+ )
667
+ else:
668
+ raise ValueError(
669
+ "You are instantiating a new tokenizer from scratch. This is not supported by this script. "
670
+ "You can do it from another script, save it, and load it from here, using --tokenizer_name."
671
+ )
672
+
673
+ # If the tokenizer has no mask token, pick one according to --mask_token_type: a blank ("_"), the EOS token, or a newly added <mask> token.
674
+ if tokenizer.mask_token is None:
675
+ if custom_args.mask_token_type == "blank":
676
+ tokenizer.mask_token = "_"
677
+ elif custom_args.mask_token_type == "eos":
678
+ tokenizer.mask_token = tokenizer.eos_token
679
+ elif custom_args.mask_token_type == "mask":
680
+ tokenizer.add_tokens(["<mask>"])
681
+ tokenizer.mask_token = "<mask>"
682
+ else:
683
+ raise ValueError(
684
+ f"mask_token_type {custom_args.mask_token_type} is not supported."
685
+ )
686
+
687
+ if tokenizer.pad_token is None:
688
+ tokenizer.pad_token = tokenizer.eos_token
689
+
690
+ # Loading bidirectional model using LLM2Vec package
691
+ model_class = get_model_class(config)
692
+ torch_dtype = (
693
+ model_args.torch_dtype
694
+ if model_args.torch_dtype in ["auto", None]
695
+ else getattr(torch, model_args.torch_dtype)
696
+ )
697
+ model = model_class.from_pretrained(
698
+ model_args.model_name_or_path,
699
+ from_tf=bool(".ckpt" in model_args.model_name_or_path),
700
+ config=config,
701
+ cache_dir=model_args.cache_dir,
702
+ revision=model_args.model_revision,
703
+ token=model_args.token,
704
+ trust_remote_code=model_args.trust_remote_code,
705
+ torch_dtype=torch_dtype,
706
+ low_cpu_mem_usage=model_args.low_cpu_mem_usage,
707
+ attn_implementation=model_args.attn_implementation,
708
+ )
709
+
710
+ # model organization is MODEL_TYPEBiForMNTP.model -> MODEL_TYPEBiModel, we have to apply PEFT to the inner model
711
+ model.model = initialize_peft(
712
+ model.model,
713
+ lora_r=custom_args.lora_r,
714
+ lora_alpha=2 * custom_args.lora_r,
715
+ lora_dropout=custom_args.lora_dropout,
716
+ )
717
+
718
+ # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch
719
+ # on a small vocab and want a smaller embedding size, remove this test.
720
+ embedding_size = model.get_input_embeddings().weight.shape[0]
721
+ if len(tokenizer) > embedding_size:
722
+ model.resize_token_embeddings(len(tokenizer))
723
+
724
+ # Preprocessing the datasets.
725
+ # First we tokenize all the texts.
726
+ if training_args.do_train:
727
+ column_names = list(raw_datasets["train"].features)
728
+ else:
729
+ column_names = list(raw_datasets["validation"].features)
730
+ text_column_name = "text" if "text" in column_names else column_names[0]
731
+
732
+ if data_args.max_seq_length is None:
733
+ max_seq_length = tokenizer.model_max_length
734
+ if max_seq_length > 1024:
735
+ logger.warning(
736
+ "The chosen tokenizer supports a `model_max_length` that is longer than the default `block_size` value"
737
+ " of 1024. If you would like to use a longer `block_size` up to `tokenizer.model_max_length` you can"
738
+ " override this default with `--block_size xxx`."
739
+ )
740
+ max_seq_length = 1024
741
+ else:
742
+ if data_args.max_seq_length > tokenizer.model_max_length:
743
+ logger.warning(
744
+ f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the "
745
+ f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}."
746
+ )
747
+ max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length)
748
+
749
+ if data_args.line_by_line:
750
+ # When using line_by_line, we just tokenize each nonempty line.
751
+ padding = "max_length" if data_args.pad_to_max_length else False
752
+
753
+ def tokenize_function(examples):
754
+ # Remove empty lines
755
+ examples[text_column_name] = [
756
+ line
757
+ for line in examples[text_column_name]
758
+ if len(line) > 0 and not line.isspace()
759
+ ]
760
+ return tokenizer(
761
+ examples[text_column_name],
762
+ padding=padding,
763
+ truncation=True,
764
+ max_length=max_seq_length,
765
+ # We use this option because DataCollatorForLanguageModeling (see below) is more efficient when it
766
+ # receives the `special_tokens_mask`.
767
+ return_special_tokens_mask=True,
768
+ )
769
+
770
+ with training_args.main_process_first(desc="dataset map tokenization"):
771
+ if not data_args.streaming:
772
+ tokenized_datasets = raw_datasets.map(
773
+ tokenize_function,
774
+ batched=True,
775
+ num_proc=data_args.preprocessing_num_workers,
776
+ remove_columns=[text_column_name],
777
+ load_from_cache_file=not data_args.overwrite_cache,
778
+ desc="Running tokenizer on dataset line_by_line",
779
+ )
780
+ else:
781
+ tokenized_datasets = raw_datasets.map(
782
+ tokenize_function,
783
+ batched=True,
784
+ remove_columns=[text_column_name],
785
+ )
786
+ else:
787
+ # Otherwise, we tokenize every text, then concatenate them together before splitting them in smaller parts.
788
+ # We use `return_special_tokens_mask=True` because DataCollatorForLanguageModeling (see below) is more
789
+ # efficient when it receives the `special_tokens_mask`.
790
+ def tokenize_function(examples):
791
+ return tokenizer(
792
+ examples[text_column_name], return_special_tokens_mask=True
793
+ )
794
+
795
+ with training_args.main_process_first(desc="dataset map tokenization"):
796
+ if not data_args.streaming:
797
+ tokenized_datasets = raw_datasets.map(
798
+ tokenize_function,
799
+ batched=True,
800
+ num_proc=data_args.preprocessing_num_workers,
801
+ remove_columns=column_names,
802
+ load_from_cache_file=not data_args.overwrite_cache,
803
+ desc="Running tokenizer on every text in dataset",
804
+ )
805
+ else:
806
+ tokenized_datasets = raw_datasets.map(
807
+ tokenize_function,
808
+ batched=True,
809
+ remove_columns=column_names,
810
+ )
811
+
812
+ # Main data processing function that will concatenate all texts from our dataset and generate chunks of
813
+ # max_seq_length.
814
+ def group_texts(examples):
815
+ # Concatenate all texts.
816
+ concatenated_examples = {
817
+ k: list(chain(*examples[k])) for k in examples.keys()
818
+ }
819
+ total_length = len(concatenated_examples[list(examples.keys())[0]])
820
+ # We drop the small remainder, and if the total_length < max_seq_length we exclude this batch and return an empty dict.
821
+ # We could add padding if the model supported it instead of this drop, you can customize this part to your needs.
822
+ total_length = (total_length // max_seq_length) * max_seq_length
823
+ # Split by chunks of max_len.
824
+ result = {
825
+ k: [
826
+ t[i : i + max_seq_length]
827
+ for i in range(0, total_length, max_seq_length)
828
+ ]
829
+ for k, t in concatenated_examples.items()
830
+ }
831
+ return result
832
+
833
+ # Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a
834
+ # remainder for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value
835
+ # might be slower to preprocess.
836
+ #
837
+ # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
838
+ # https://huggingface.co/docs/datasets/process#map
839
+
840
+ with training_args.main_process_first(desc="grouping texts together"):
841
+ if not data_args.streaming:
842
+ tokenized_datasets = tokenized_datasets.map(
843
+ group_texts,
844
+ batched=True,
845
+ num_proc=data_args.preprocessing_num_workers,
846
+ load_from_cache_file=not data_args.overwrite_cache,
847
+ desc=f"Grouping texts in chunks of {max_seq_length}",
848
+ )
849
+ else:
850
+ tokenized_datasets = tokenized_datasets.map(
851
+ group_texts,
852
+ batched=True,
853
+ )
854
+
855
+ if training_args.do_train:
856
+ if "train" not in tokenized_datasets:
857
+ raise ValueError("--do_train requires a train dataset")
858
+ train_dataset = tokenized_datasets["train"]
859
+ if data_args.max_train_samples is not None:
860
+ max_train_samples = min(len(train_dataset), data_args.max_train_samples)
861
+ train_dataset = train_dataset.select(range(max_train_samples))
862
+
863
+ if training_args.do_eval:
864
+ if "validation" not in tokenized_datasets:
865
+ raise ValueError("--do_eval requires a validation dataset")
866
+ eval_dataset = tokenized_datasets["validation"]
867
+ if data_args.max_eval_samples is not None:
868
+ max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples)
869
+ eval_dataset = eval_dataset.select(range(max_eval_samples))
870
+
871
+ def preprocess_logits_for_metrics(logits, labels):
872
+ if isinstance(logits, tuple):
873
+ # Depending on the model and config, logits may contain extra tensors,
874
+ # like past_key_values, but logits always come first
875
+ logits = logits[0]
876
+ return logits.argmax(dim=-1)
877
+
878
+ metric = evaluate.load("accuracy", cache_dir=model_args.cache_dir)
879
+
880
+ def compute_metrics(eval_preds):
881
+ preds, labels = eval_preds
882
+ preds = preds[:, :-1]
883
+ labels = labels[:, 1:]
884
+ # preds have the same shape as the labels, after the argmax(-1) has been calculated
885
+ # by preprocess_logits_for_metrics
886
+ labels = labels.reshape(-1)
887
+ preds = preds.reshape(-1)
888
+ mask = labels != -100
889
+ labels = labels[mask]
890
+ preds = preds[mask]
891
+ return metric.compute(predictions=preds, references=labels)
892
+
893
+ # Data collator
894
+ # This one will take care of randomly masking the tokens.
895
+ pad_to_multiple_of_8 = (
896
+ data_args.line_by_line
897
+ and training_args.fp16
898
+ and not data_args.pad_to_max_length
899
+ )
900
+ data_collator_cls = None
901
+ if custom_args.data_collator_type == "all_mask":
902
+ data_collator_cls = DataCollatorForLanguageModelingWithFullMasking
903
+ elif custom_args.data_collator_type == "default":
904
+ data_collator_cls = DataCollatorForLanguageModeling
905
+ else:
906
+ raise ValueError(
907
+ f"data_collator_type {custom_args.data_collator_type} is not supported."
908
+ )
909
+
910
+ data_collator = data_collator_cls(
911
+ tokenizer=tokenizer,
912
+ mlm_probability=data_args.mlm_probability,
913
+ pad_to_multiple_of=8 if pad_to_multiple_of_8 else None,
914
+ )
915
+
916
+ # Initialize our Trainer
917
+ trainer = MNTPTrainer(
918
+ model=model,
919
+ args=training_args,
920
+ train_dataset=train_dataset if training_args.do_train else None,
921
+ eval_dataset=eval_dataset if training_args.do_eval else None,
922
+ tokenizer=tokenizer,
923
+ data_collator=data_collator,
924
+ compute_metrics=(
925
+ compute_metrics
926
+ if training_args.do_eval and not is_torch_tpu_available()
927
+ else None
928
+ ),
929
+ preprocess_logits_for_metrics=(
930
+ preprocess_logits_for_metrics
931
+ if training_args.do_eval and not is_torch_tpu_available()
932
+ else None
933
+ ),
934
+ )
935
+
936
+ trainer.add_callback(StopTrainingCallback(custom_args.stop_after_n_steps))
937
+
938
+ # Training
939
+ if training_args.do_train:
940
+ checkpoint = None
941
+ if training_args.resume_from_checkpoint is not None:
942
+ checkpoint = training_args.resume_from_checkpoint
943
+ elif last_checkpoint is not None:
944
+ checkpoint = last_checkpoint
945
+ train_result = trainer.train(resume_from_checkpoint=checkpoint)
946
+ trainer.save_model() # Saves the tokenizer too for easy upload
947
+ metrics = train_result.metrics
948
+
949
+ max_train_samples = (
950
+ data_args.max_train_samples
951
+ if data_args.max_train_samples is not None
952
+ else len(train_dataset)
953
+ )
954
+ metrics["train_samples"] = min(max_train_samples, len(train_dataset))
955
+
956
+ trainer.log_metrics("train", metrics)
957
+ trainer.save_metrics("train", metrics)
958
+ trainer.save_state()
959
+
960
+ # Evaluation
961
+ if training_args.do_eval:
962
+ logger.info("*** Evaluate ***")
963
+
964
+ metrics = trainer.evaluate()
965
+
966
+ max_eval_samples = (
967
+ data_args.max_eval_samples
968
+ if data_args.max_eval_samples is not None
969
+ else len(eval_dataset)
970
+ )
971
+ metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
972
+ try:
973
+ perplexity = math.exp(metrics["eval_loss"])
974
+ except OverflowError:
975
+ perplexity = float("inf")
976
+ metrics["perplexity"] = perplexity
977
+
978
+ trainer.log_metrics("eval", metrics)
979
+ trainer.save_metrics("eval", metrics)
980
+
981
+ kwargs = {"finetuned_from": model_args.model_name_or_path, "tasks": "fill-mask"}
982
+ if data_args.dataset_name is not None:
983
+ kwargs["dataset_tags"] = data_args.dataset_name
984
+ if data_args.dataset_config_name is not None:
985
+ kwargs["dataset_args"] = data_args.dataset_config_name
986
+ kwargs["dataset"] = (
987
+ f"{data_args.dataset_name} {data_args.dataset_config_name}"
988
+ )
989
+ else:
990
+ kwargs["dataset"] = data_args.dataset_name
991
+
992
+ if training_args.push_to_hub:
993
+ trainer.push_to_hub(**kwargs)
994
+
995
+
996
+ if __name__ == "__main__":
997
+ main()
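
A quick illustration of the evaluation alignment used in compute_metrics above: because the model predicts a masked token from the position just before it, the last prediction column and the first label column are dropped before the -100 padding labels are masked out. The toy values below are invented purely for illustration.

import torch

# Toy batch: one sequence of 5 positions; -100 marks positions that were not masked.
preds = torch.tensor([[12, 99, 99, 15, 99]])          # argmax of the logits at each position
labels = torch.tensor([[-100, 12, -100, -100, 15]])   # only positions 1 and 4 were masked

# Same alignment as compute_metrics: the token at position i is predicted
# from the logits at position i - 1.
preds = preds[:, :-1].reshape(-1)
labels = labels[:, 1:].reshape(-1)
mask = labels != -100
print((preds[mask] == labels[mask]).float().mean())   # tensor(1.) for this toy example
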
llm2vec/experiments/run_simcse.py ADDED
@@ -0,0 +1,388 @@
1
+ import logging
2
+ from dataclasses import dataclass, field
3
+ import os
4
+ import sys
5
+ from typing import Any, Dict, List, Optional, Tuple, Union
6
+
7
+ import torch
8
+ from torch import nn
9
+
10
+ from accelerate import Accelerator, DistributedDataParallelKwargs
11
+ from accelerate.logging import get_logger
12
+
13
+ import transformers
14
+ from transformers import (
15
+ MODEL_FOR_MASKED_LM_MAPPING,
16
+ HfArgumentParser,
17
+ TrainingArguments,
18
+ Trainer,
19
+ TrainerCallback,
20
+ set_seed,
21
+ )
22
+ from transformers.trainer_utils import seed_worker
23
+
24
+ from peft import LoraConfig, get_peft_model
25
+
26
+ from llm2vec import LLM2Vec
27
+ from llm2vec.dataset.utils import load_dataset
28
+ from llm2vec.loss.utils import load_loss
29
+
30
+ from tqdm import tqdm
31
+
32
+ transformers.logging.set_verbosity_error()
33
+
34
+ logging.basicConfig(
35
+ format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
36
+ datefmt="%Y-%m-%d %H:%M:%S",
37
+ level=logging.INFO,
38
+ )
39
+ logger = get_logger(__name__, log_level="INFO")
40
+ MODEL_CONFIG_CLASSES = list(MODEL_FOR_MASKED_LM_MAPPING.keys())
41
+ MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
42
+
43
+
44
+ def initialize_peft(
45
+ model,
46
+ lora_r: int = 8,
47
+ lora_alpha: int = 16,
48
+ lora_dropout: float = 0.05,
49
+ lora_modules: Optional[List[str]] = None,
50
+ ):
51
+ if lora_modules is None and model.config.__class__.__name__ in [
52
+ "LlamaConfig",
53
+ "MistralConfig",
54
+ "GemmaConfig",
55
+ "Qwen2Config",
56
+ ]:
57
+ lora_modules = [
58
+ "q_proj",
59
+ "v_proj",
60
+ "k_proj",
61
+ "o_proj",
62
+ "gate_proj",
63
+ "up_proj",
64
+ "down_proj",
65
+ ]
66
+ elif lora_modules is None:
67
+ raise ValueError("lora_modules must be specified for this model.")
68
+
69
+ config = LoraConfig(
70
+ r=lora_r,
71
+ lora_alpha=lora_alpha,
72
+ target_modules=lora_modules,
73
+ lora_dropout=lora_dropout,
74
+ bias="none",
75
+ task_type=None,
76
+ )
77
+
78
+ model = get_peft_model(model, config)
79
+ print(f"Model's Lora trainable parameters:")
80
+ model.print_trainable_parameters()
81
+ return model
82
+
83
+
84
+ @dataclass
85
+ class ModelArguments:
86
+ """
87
+ Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
88
+ """
89
+
90
+ model_name_or_path: Optional[str] = field(
91
+ default=None,
92
+ metadata={
93
+ "help": (
94
+ "The base model checkpoint for weights initialization. Don't set if you want to train a model from scratch."
95
+ )
96
+ },
97
+ )
98
+ peft_model_name_or_path: Optional[str] = field(
99
+ default=None,
100
+ metadata={"help": ("The PEFT model checkpoint to add on top of base model.")},
101
+ )
102
+ bidirectional: Optional[bool] = field(
103
+ default=False,
104
+ metadata={
105
+ "help": (
106
+ "Whether to enable bidirectional attention in the model. If set to False, the model will use unidirectional attention."
107
+ )
108
+ },
109
+ )
110
+ max_seq_length: Optional[int] = field(
111
+ default=None,
112
+ metadata={
113
+ "help": (
114
+ "The maximum total input sequence length after tokenization. Sequences longer "
115
+ "than this will be truncated."
116
+ )
117
+ },
118
+ )
119
+ torch_dtype: Optional[str] = field(
120
+ default=None,
121
+ metadata={
122
+ "help": (
123
+ "Override the default `torch.dtype` and load the model under this dtype. If `auto` is passed, the "
124
+ "dtype will be automatically derived from the model's weights."
125
+ ),
126
+ "choices": ["auto", "bfloat16", "float16", "float32"],
127
+ },
128
+ )
129
+ attn_implementation: Optional[str] = field(
130
+ default="sdpa",
131
+ metadata={
132
+ "help": ("The attention implementation to use in the model."),
133
+ "choices": ["eager", "sdpa", "flash_attention_2"],
134
+ },
135
+ )
136
+ pooling_mode: Optional[str] = field(
137
+ default="mean",
138
+ metadata={
139
+ "help": ("The pooling mode to use in the model."),
140
+ "choices": ["mean", "weighted_mean", "eos_token"],
141
+ },
142
+ )
143
+
144
+
145
+ @dataclass
146
+ class DataTrainingArguments:
147
+ """
148
+ Arguments pertaining to what data we are going to input our model for training and eval.
149
+ """
150
+
151
+ dataset_name: Optional[str] = field(
152
+ default=None,
153
+ metadata={"help": "The name of the dataset to use. Options: E5"},
154
+ )
155
+ dataset_file_path: Optional[str] = field(
156
+ default=None, metadata={"help": "The input training data file or folder."}
157
+ )
158
+ # TODO: implement this
159
+ max_train_samples: Optional[int] = field(
160
+ default=None,
161
+ metadata={
162
+ "help": (
163
+ "For debugging purposes or quicker training, truncate the number of training examples to this "
164
+ "value if set."
165
+ )
166
+ },
167
+ )
168
+
169
+
170
+ @dataclass
171
+ class CustomArguments:
172
+ """
173
+ Custom arguments for the script
174
+ """
175
+
176
+ simcse_dropout: float = field(
177
+ default=0.1, metadata={"help": "The SimCSE dropout rate for the model"}
178
+ )
179
+
180
+ lora_dropout: float = field(
181
+ default=0.05, metadata={"help": "The dropout rate for lora"}
182
+ )
183
+
184
+ lora_r: int = field(default=8, metadata={"help": "The r value for lora"})
185
+
186
+ stop_after_n_steps: int = field(
187
+ default=10000, metadata={"help": "Stop training after n steps"}
188
+ )
189
+
190
+ experiment_id: Optional[str] = field(
191
+ default=None, metadata={"help": "The experiment id"}
192
+ )
193
+
194
+ loss_class: Optional[str] = field(
195
+ default="HardNegativeNLLLoss",
196
+ metadata={
197
+ "help": "The loss class to use for training. Options: HardNegativeNLLLoss"
198
+ },
199
+ )
200
+
201
+ loss_scale: float = field(
202
+ default=50.0, metadata={"help": "The loss scale for the loss function"}
203
+ )
204
+
205
+
206
+ @dataclass
207
+ class DefaultCollator:
208
+ model: LLM2Vec
209
+
210
+ def __init__(self, model: LLM2Vec) -> None:
211
+ self.model = model
212
+
213
+ def __call__(self, features: List[Dict[str, Any]]) -> Tuple[List[Dict[str, torch.Tensor]], torch.Tensor]:
214
+ batch = features
215
+ num_texts = len(batch[0].texts)
216
+ texts = [[] for _ in range(num_texts)]
217
+ labels = []
218
+
219
+ for example in batch:
220
+ for idx, text in enumerate(example.texts):
221
+ # TODO: Add prepare_for_tokenization here similar to supervised training and see if it impacts performance
222
+ texts[idx].append(text)
223
+ labels.append(example.label)
224
+ labels = torch.tensor(labels)
225
+
226
+ sentence_features = []
227
+ for idx in range(num_texts):
228
+ tokenized = self.model.tokenize(texts[idx])
229
+ sentence_features.append(tokenized)
230
+
231
+ return sentence_features, labels
232
+
233
+
234
+ class StopTrainingCallback(TrainerCallback):
235
+ def __init__(self, stop_after_n_steps: int):
236
+ self.stop_after_n_steps = stop_after_n_steps
237
+
238
+ def on_step_end(self, args, state, control, **kwargs):
239
+ if state.global_step >= self.stop_after_n_steps:
240
+ control.should_training_stop = True
241
+
242
+
243
+ class SimCSETrainer(Trainer):
244
+ def __init__(
245
+ self,
246
+ *args,
247
+ loss_function=None,
248
+ **kwargs,
249
+ ) -> None:
250
+ super().__init__(*args, **kwargs)
251
+ self.loss_function = loss_function
252
+
253
+ def compute_loss(
254
+ self,
255
+ model: nn.Module,
256
+ inputs: Dict[str, Union[torch.Tensor, Any]],
257
+ return_outputs: bool = False,
258
+ ) -> Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
259
+ features, labels = inputs
260
+ q_reps = self.model(features[0])
261
+ d_reps = self.model(features[1])
262
+
263
+ d_reps_neg = None
264
+ if len(features) > 2:
265
+ d_reps_neg = self.model(features[2])
266
+
267
+ loss = self.loss_function(q_reps, d_reps, d_reps_neg)
268
+
269
+ if return_outputs:
270
+ output = torch.cat(
271
+ [model(row)["sentence_embedding"][:, None] for row in features], dim=1
272
+ )
273
+ return loss, output
274
+
275
+ return loss
276
+
277
+ def _save(self, output_dir: Optional[str] = None, state_dict=None):
278
+ # If we are executing this function, we are the process zero, so we don't check for that.
279
+ output_dir = output_dir if output_dir is not None else self.args.output_dir
280
+ os.makedirs(output_dir, exist_ok=True)
281
+ logger.info(f"Saving model checkpoint to {output_dir}")
282
+
283
+ self.model.save(output_dir)
284
+
285
+ # Good practice: save your training arguments together with the trained model
286
+ torch.save(self.args, os.path.join(output_dir, "training_args.bin"))
287
+
288
+
289
+ def main():
290
+ parser = HfArgumentParser(
291
+ (ModelArguments, DataTrainingArguments, TrainingArguments, CustomArguments)
292
+ )
293
+ if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
294
+ # If we pass only one argument to the script and it's the path to a json file,
295
+ # let's parse it to get our arguments.
296
+ model_args, data_args, training_args, custom_args = parser.parse_json_file(
297
+ json_file=os.path.abspath(sys.argv[1])
298
+ )
299
+ else:
300
+ (
301
+ model_args,
302
+ data_args,
303
+ training_args,
304
+ custom_args,
305
+ ) = parser.parse_args_into_dataclasses()
306
+ if training_args.ddp_find_unused_parameters:
307
+ kwargs = [
308
+ DistributedDataParallelKwargs(
309
+ dim=0,
310
+ broadcast_buffers=True,
311
+ bucket_cap_mb=25,
312
+ find_unused_parameters=True,
313
+ check_reduction=False,
314
+ gradient_as_bucket_view=False,
315
+ )
316
+ ]
317
+ else:
318
+ kwargs = []
319
+ accelerator = Accelerator(kwargs_handlers=kwargs)
320
+
321
+ set_seed(training_args.seed)
322
+
323
+ if training_args.gradient_checkpointing:
324
+ training_args.gradient_checkpointing_kwargs = {"use_reentrant": False}
325
+
326
+ train_dataset = load_dataset(
327
+ data_args.dataset_name,
328
+ split="train",
329
+ file_path=data_args.dataset_file_path,
330
+ )
331
+
332
+ train_examples = [
333
+ train_dataset[i]
334
+ for i in tqdm(
335
+ range(len(train_dataset)),
336
+ desc="Loading train examples...",
337
+ disable=not accelerator.is_main_process,
338
+ )
339
+ ]
340
+
341
+ torch_dtype = (
342
+ model_args.torch_dtype
343
+ if model_args.torch_dtype in ["auto", None]
344
+ else getattr(torch, model_args.torch_dtype)
345
+ )
346
+ model = LLM2Vec.from_pretrained(
347
+ base_model_name_or_path=model_args.model_name_or_path,
348
+ enable_bidirectional=model_args.bidirectional,
349
+ peft_model_name_or_path=model_args.peft_model_name_or_path,
350
+ merge_peft=True,
351
+ pooling_mode=model_args.pooling_mode,
352
+ max_length=model_args.max_seq_length,
353
+ torch_dtype=torch_dtype,
354
+ attn_implementation=model_args.attn_implementation,
355
+ attention_dropout=custom_args.simcse_dropout,
356
+ )
357
+
358
+ # model organization is LLM2VecModel.model -> HF Model, we have to apply PEFT to the inner model
359
+ model.model = initialize_peft(
360
+ model.model,
361
+ lora_r=custom_args.lora_r,
362
+ lora_alpha=2 * custom_args.lora_r,
363
+ lora_dropout=custom_args.lora_dropout,
364
+ )
365
+
366
+ tokenizer = model.tokenizer
367
+
368
+ train_loss = load_loss(custom_args.loss_class, scale=custom_args.loss_scale)
369
+
370
+ data_collator = DefaultCollator(model)
371
+
372
+ trainer = SimCSETrainer(
373
+ model=model,
374
+ args=training_args,
375
+ train_dataset=train_examples,
376
+ data_collator=data_collator,
377
+ tokenizer=tokenizer,
378
+ loss_function=train_loss,
379
+ )
380
+
381
+ if custom_args.stop_after_n_steps is not None:
382
+ trainer.add_callback(StopTrainingCallback(custom_args.stop_after_n_steps))
383
+
384
+ trainer.train()
385
+
386
+
387
+ if __name__ == "__main__":
388
+ main()
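
For orientation, compute_loss above passes the query, positive, and optional hard-negative embeddings to the loss loaded by load_loss(custom_args.loss_class, scale=custom_args.loss_scale). The snippet below is only a rough sketch of the usual in-batch-negatives contrastive formulation that a HardNegativeNLLLoss-style objective follows, not the llm2vec implementation itself; the scale plays the role of an inverse temperature.

import torch
import torch.nn.functional as F

def contrastive_loss_sketch(q_reps, d_reps, d_reps_neg=None, scale=50.0):
    # Candidates are all positives in the batch plus any hard negatives.
    candidates = d_reps if d_reps_neg is None else torch.cat([d_reps, d_reps_neg], dim=0)
    # Scaled cosine-similarity scores; the positive for query i is candidate i.
    scores = scale * F.normalize(q_reps, dim=-1) @ F.normalize(candidates, dim=-1).T
    labels = torch.arange(q_reps.size(0), device=q_reps.device)
    return F.cross_entropy(scores, labels)

# Toy usage with random embeddings (batch of 4, hidden size 8).
print(contrastive_loss_sketch(torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)))
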
llm2vec/experiments/run_supervised.py ADDED
@@ -0,0 +1,482 @@
1
+ import logging
2
+ from dataclasses import dataclass, field
3
+ import os
4
+ import sys
5
+ from typing import Any, Dict, List, Optional, Tuple, Union
6
+
7
+ import torch
8
+ from torch import nn
9
+ from torch.utils.data import DataLoader, SequentialSampler
10
+
11
+ from accelerate import Accelerator, DistributedDataParallelKwargs
12
+ from accelerate.logging import get_logger
13
+
14
+ import transformers
15
+ from transformers import (
16
+ MODEL_FOR_MASKED_LM_MAPPING,
17
+ HfArgumentParser,
18
+ TrainingArguments,
19
+ Trainer,
20
+ TrainerCallback,
21
+ LlamaConfig,
22
+ MistralConfig,
23
+ GemmaConfig,
24
+ Qwen2Config,
25
+ set_seed,
26
+ )
27
+ from transformers.trainer_utils import seed_worker
28
+
29
+ from peft import LoraConfig, get_peft_model
30
+
31
+ from llm2vec import LLM2Vec
32
+ from llm2vec.dataset.utils import load_dataset
33
+ from llm2vec.loss.utils import load_loss
34
+ from llm2vec.experiment_utils import generate_experiment_id
35
+
36
+ from tqdm import tqdm
37
+
38
+ transformers.logging.set_verbosity_error()
39
+
40
+ logging.basicConfig(
41
+ format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
42
+ datefmt="%Y-%m-%d %H:%M:%S",
43
+ level=logging.INFO,
44
+ )
45
+ logger = get_logger(__name__, log_level="INFO")
46
+ MODEL_CONFIG_CLASSES = list(MODEL_FOR_MASKED_LM_MAPPING.keys())
47
+ MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
48
+
49
+
50
+ def prepare_for_tokenization(model, text, pooling_mode="mean"):
51
+ if model.config._name_or_path == "meta-llama/Meta-Llama-3-8B-Instruct":
52
+ text = (
53
+ "<|start_header_id|>user<|end_header_id|>\n\n" + text.strip() + "<|eot_id|>"
54
+ )
55
+ return text
56
+ if model.config._name_or_path in [
57
+ "mistralai/Mistral-7B-Instruct-v0.2",
58
+ "meta-llama/Llama-2-7b-chat-hf",
59
+ ]:
60
+ text = "[INST] " + text.strip() + " [/INST]"
61
+ if model.config._name_or_path in [
62
+ "google/gemma-2-9b-it",
63
+ ]:
64
+ text = "<bos><start_of_turn>user\n" + text.strip() + "<end_of_turn>"
65
+ if model.config._name_or_path in [
66
+ "Qwen/Qwen2-1.5B-Instruct",
67
+ "Qwen/Qwen2-7B-Instruct",
68
+ ]:
69
+ text = "<|im_start|>user\n" + text.strip() + "<|im_end|>"
70
+ if pooling_mode == "eos_token":
71
+ if model.config._name_or_path == "meta-llama/Meta-Llama-3-8B":
72
+ text = text.strip() + "<|end_of_text|>"
73
+ elif isinstance(model.config, LlamaConfig) or isinstance(
74
+ model.config, MistralConfig
75
+ ):
76
+ text = text.strip() + " </s>"
77
+ elif isinstance(model.config, GemmaConfig):
78
+ text = text.strip() + "<eos>"
79
+ elif isinstance(model.config, Qwen2Config):
80
+ text = text.strip() + "<|endoftext|>"
81
+ return text
82
+
83
+
84
+ def initialize_peft(
85
+ model,
86
+ lora_r: int = 8,
87
+ lora_alpha: int = 16,
88
+ lora_dropout: float = 0.05,
89
+ lora_modules: Optional[List[str]] = None,
90
+ ):
91
+ if lora_modules is None and model.config.__class__.__name__ in [
92
+ "LlamaConfig",
93
+ "MistralConfig",
94
+ "GemmaConfig",
95
+ "Qwen2Config",
96
+ ]:
97
+ lora_modules = [
98
+ "q_proj",
99
+ "v_proj",
100
+ "k_proj",
101
+ "o_proj",
102
+ "gate_proj",
103
+ "up_proj",
104
+ "down_proj",
105
+ ]
106
+ elif lora_modules is None:
107
+ raise ValueError("lora_modules must be specified for this model.")
108
+
109
+ config = LoraConfig(
110
+ r=lora_r,
111
+ lora_alpha=lora_alpha,
112
+ target_modules=lora_modules,
113
+ lora_dropout=lora_dropout,
114
+ bias="none",
115
+ task_type=None,
116
+ )
117
+
118
+ model = get_peft_model(model, config)
119
+ print(f"Model's Lora trainable parameters:")
120
+ model.print_trainable_parameters()
121
+ return model
122
+
123
+
124
+ @dataclass
125
+ class ModelArguments:
126
+ """
127
+ Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
128
+ """
129
+
130
+ model_name_or_path: Optional[str] = field(
131
+ default=None,
132
+ metadata={
133
+ "help": (
134
+ "The base model checkpoint for weights initialization. Don't set if you want to train a model from scratch."
135
+ )
136
+ },
137
+ )
138
+ peft_model_name_or_path: Optional[str] = field(
139
+ default=None,
140
+ metadata={"help": ("The PEFT model checkpoint to add on top of base model.")},
141
+ )
142
+ bidirectional: Optional[bool] = field(
143
+ default=False,
144
+ metadata={
145
+ "help": (
146
+ "Whether to enable bidirectional attention in the model. If set to False, the model will use unidirectional attention."
147
+ )
148
+ },
149
+ )
150
+ max_seq_length: Optional[int] = field(
151
+ default=None,
152
+ metadata={
153
+ "help": (
154
+ "The maximum total input sequence length after tokenization. Sequences longer "
155
+ "than this will be truncated."
156
+ )
157
+ },
158
+ )
159
+ torch_dtype: Optional[str] = field(
160
+ default=None,
161
+ metadata={
162
+ "help": (
163
+ "Override the default `torch.dtype` and load the model under this dtype. If `auto` is passed, the "
164
+ "dtype will be automatically derived from the model's weights."
165
+ ),
166
+ "choices": ["auto", "bfloat16", "float16", "float32"],
167
+ },
168
+ )
169
+ attn_implementation: Optional[str] = field(
170
+ default="sdpa",
171
+ metadata={
172
+ "help": ("The attention implementation to use in the model."),
173
+ "choices": ["eager", "sdpa", "flash_attention_2"],
174
+ },
175
+ )
176
+ pooling_mode: Optional[str] = field(
177
+ default="mean",
178
+ metadata={
179
+ "help": ("The pooling mode to use in the model."),
180
+ "choices": ["mean", "weighted_mean", "eos_token"],
181
+ },
182
+ )
183
+
184
+
185
+ @dataclass
186
+ class DataTrainingArguments:
187
+ """
188
+ Arguments pertaining to what data we are going to input our model for training and eval.
189
+ """
190
+
191
+ dataset_name: Optional[str] = field(
192
+ default=None,
193
+ metadata={"help": "The name of the dataset to use. Options: E5"},
194
+ )
195
+ dataset_file_path: Optional[str] = field(
196
+ default=None, metadata={"help": "The input training data file or folder."}
197
+ )
198
+ # TODO: implement this
199
+ max_train_samples: Optional[int] = field(
200
+ default=None,
201
+ metadata={
202
+ "help": (
203
+ "For debugging purposes or quicker training, truncate the number of training examples to this "
204
+ "value if set."
205
+ )
206
+ },
207
+ )
208
+
209
+
210
+ @dataclass
211
+ class CustomArguments:
212
+ """
213
+ Custom arguments for the script
214
+ """
215
+
216
+ lora_dropout: float = field(
217
+ default=0.05, metadata={"help": "The dropout rate for lora"}
218
+ )
219
+
220
+ lora_r: int = field(default=8, metadata={"help": "The r value for lora"})
221
+
222
+ stop_after_n_steps: int = field(
223
+ default=10000, metadata={"help": "Stop training after n steps"}
224
+ )
225
+
226
+ experiment_id: Optional[str] = field(
227
+ default=None, metadata={"help": "The experiment id"}
228
+ )
229
+
230
+ loss_class: Optional[str] = field(
231
+ default="HardNegativeNLLLoss",
232
+ metadata={
233
+ "help": "The loss class to use for training. Options: HardNegativeNLLLoss"
234
+ },
235
+ )
236
+
237
+ loss_scale: float = field(
238
+ default=50.0, metadata={"help": "The loss scale for the loss function"}
239
+ )
240
+
241
+
242
+ @dataclass
243
+ class DefaultCollator:
244
+ model: LLM2Vec
245
+
246
+ def __init__(self, model: LLM2Vec) -> None:
247
+ self.model = model
248
+
249
+ def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
250
+ batch = features
251
+ num_texts = len(batch[0].texts)
252
+ texts = [[] for _ in range(num_texts)]
253
+ labels = []
254
+
255
+ for example in batch:
256
+ for idx, text in enumerate(example.texts):
257
+ text = prepare_for_tokenization(
258
+ self.model, text, pooling_mode=self.model.pooling_mode
259
+ )
260
+ texts[idx].append(text)
261
+ labels.append(example.label)
262
+ labels = torch.tensor(labels)
263
+
264
+ sentence_features = []
265
+ for idx in range(num_texts):
266
+ tokenized = self.model.tokenize(texts[idx])
267
+ sentence_features.append(tokenized)
268
+
269
+ return sentence_features, labels
270
+
271
+
272
+ class StopTrainingCallback(TrainerCallback):
273
+ def __init__(self, stop_after_n_steps: int):
274
+ self.stop_after_n_steps = stop_after_n_steps
275
+
276
+ def on_step_end(self, args, state, control, **kwargs):
277
+ if state.global_step >= self.stop_after_n_steps:
278
+ control.should_training_stop = True
279
+
280
+
281
+ class LLM2VecSupervisedTrainer(Trainer):
282
+ def __init__(
283
+ self,
284
+ *args,
285
+ loss_function=None,
286
+ **kwargs,
287
+ ) -> None:
288
+ super().__init__(*args, **kwargs)
289
+ self.loss_function = loss_function
290
+
291
+ def compute_loss(
292
+ self,
293
+ model: nn.Module,
294
+ inputs: Dict[str, Union[torch.Tensor, Any]],
295
+ return_outputs: bool = False,
296
+ ) -> Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
297
+ features, labels = inputs
298
+ q_reps = self.model(features[0])
299
+ d_reps = self.model(features[1])
300
+
301
+ d_reps_neg = None
302
+ if len(features) > 2:
303
+ d_reps_neg = self.model(features[2])
304
+
305
+ loss = self.loss_function(q_reps, d_reps, d_reps_neg)
306
+
307
+ if return_outputs:
308
+ output = torch.cat(
309
+ [model(row)["sentence_embedding"][:, None] for row in features], dim=1
310
+ )
311
+ return loss, output
312
+
313
+ return loss
314
+
315
+ def get_train_dataloader(self) -> DataLoader:
316
+ # Copying most of the code from the parent class, changing the sampler to SequentialSampler
317
+ if self.train_dataset is None:
318
+ raise ValueError("Trainer: training requires a train_dataset.")
319
+
320
+ train_dataset = self.train_dataset
321
+ data_collator = self.data_collator
322
+
323
+ data_collator = self._get_collator_with_removed_columns(
324
+ data_collator, description="training"
325
+ )
326
+
327
+ dataloader_params = {
328
+ "batch_size": self._train_batch_size,
329
+ "collate_fn": data_collator,
330
+ "num_workers": self.args.dataloader_num_workers,
331
+ "pin_memory": self.args.dataloader_pin_memory,
332
+ "persistent_workers": self.args.dataloader_persistent_workers,
333
+ }
334
+
335
+ if not isinstance(train_dataset, torch.utils.data.IterableDataset):
336
+ # Changing from random sampler to sequential sampler
337
+ dataloader_params["sampler"] = SequentialSampler(train_dataset)
338
+ dataloader_params["drop_last"] = self.args.dataloader_drop_last
339
+ dataloader_params["worker_init_fn"] = seed_worker
340
+
341
+ return self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))
342
+
343
+ def _save(self, output_dir: Optional[str] = None, state_dict=None):
344
+ # If we are executing this function, we are the process zero, so we don't check for that.
345
+ output_dir = output_dir if output_dir is not None else self.args.output_dir
346
+ os.makedirs(output_dir, exist_ok=True)
347
+ logger.info(f"Saving model checkpoint to {output_dir}")
348
+
349
+ self.model.save(output_dir)
350
+
351
+ # Good practice: save your training arguments together with the trained model
352
+ torch.save(self.args, os.path.join(output_dir, "training_args.bin"))
353
+
354
+
355
+ def main():
356
+ parser = HfArgumentParser(
357
+ (ModelArguments, DataTrainingArguments, TrainingArguments, CustomArguments)
358
+ )
359
+ if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
360
+ # If we pass only one argument to the script and it's the path to a json file,
361
+ # let's parse it to get our arguments.
362
+ model_args, data_args, training_args, custom_args = parser.parse_json_file(
363
+ json_file=os.path.abspath(sys.argv[1])
364
+ )
365
+ else:
366
+ (
367
+ model_args,
368
+ data_args,
369
+ training_args,
370
+ custom_args,
371
+ ) = parser.parse_args_into_dataclasses()
372
+ if training_args.ddp_find_unused_parameters:
373
+ kwargs = [
374
+ DistributedDataParallelKwargs(
375
+ dim=0,
376
+ broadcast_buffers=True,
377
+ bucket_cap_mb=25,
378
+ find_unused_parameters=True,
379
+ check_reduction=False,
380
+ gradient_as_bucket_view=False,
381
+ )
382
+ ]
383
+ else:
384
+ kwargs = []
385
+ accelerator = Accelerator(kwargs_handlers=kwargs)
386
+
387
+ set_seed(training_args.seed)
388
+
389
+ if training_args.gradient_checkpointing:
390
+ training_args.gradient_checkpointing_kwargs = {"use_reentrant": False}
391
+
392
+ if custom_args.experiment_id is not None:
393
+ experiment_id = custom_args.experiment_id
394
+ else:
395
+ experiment_id = generate_experiment_id(
396
+ name=data_args.dataset_name,
397
+ split="train",
398
+ model_name=(
399
+ model_args.model_name_or_path
400
+ if "/" not in model_args.model_name_or_path
401
+ else model_args.model_name_or_path.split("/")[-1]
402
+ ),
403
+ pooling_mode=model_args.pooling_mode,
404
+ train_batch_size=training_args.per_device_train_batch_size
405
+ * accelerator.num_processes
406
+ * training_args.gradient_accumulation_steps,
407
+ max_seq_length=model_args.max_seq_length,
408
+ bidirectional=model_args.bidirectional,
409
+ epochs=training_args.num_train_epochs,
410
+ seed=training_args.seed,
411
+ warmup_steps=training_args.warmup_steps,
412
+ lr=training_args.learning_rate,
413
+ lora_r=custom_args.lora_r,
414
+ )
415
+
416
+ training_args.output_dir = f"{training_args.output_dir}/{experiment_id}"
417
+
418
+ # TODO: can also pass separator arg here
419
+ train_dataset = load_dataset(
420
+ data_args.dataset_name,
421
+ split="train",
422
+ file_path=data_args.dataset_file_path,
423
+ effective_batch_size=training_args.per_device_train_batch_size
424
+ * accelerator.num_processes,
425
+ )
426
+
427
+ train_examples = [
428
+ train_dataset[i]
429
+ for i in tqdm(
430
+ range(len(train_dataset)),
431
+ desc="Loading train examples...",
432
+ disable=not accelerator.is_main_process,
433
+ )
434
+ ]
435
+
436
+ torch_dtype = (
437
+ model_args.torch_dtype
438
+ if model_args.torch_dtype in ["auto", None]
439
+ else getattr(torch, model_args.torch_dtype)
440
+ )
441
+ model = LLM2Vec.from_pretrained(
442
+ base_model_name_or_path=model_args.model_name_or_path,
443
+ enable_bidirectional=model_args.bidirectional,
444
+ peft_model_name_or_path=model_args.peft_model_name_or_path,
445
+ merge_peft=True,
446
+ pooling_mode=model_args.pooling_mode,
447
+ max_length=model_args.max_seq_length,
448
+ torch_dtype=torch_dtype,
449
+ attn_implementation=model_args.attn_implementation,
450
+ )
451
+
452
+ # model organization is LLM2VecModel.model -> HF Model, we have to apply PEFT to the inner model
453
+ model.model = initialize_peft(
454
+ model.model,
455
+ lora_r=custom_args.lora_r,
456
+ lora_alpha=2 * custom_args.lora_r,
457
+ lora_dropout=custom_args.lora_dropout,
458
+ )
459
+
460
+ tokenizer = model.tokenizer
461
+
462
+ train_loss = load_loss(custom_args.loss_class, scale=custom_args.loss_scale)
463
+
464
+ data_collator = DefaultCollator(model)
465
+
466
+ trainer = LLM2VecSupervisedTrainer(
467
+ model=model,
468
+ args=training_args,
469
+ train_dataset=train_examples,
470
+ data_collator=data_collator,
471
+ tokenizer=tokenizer,
472
+ loss_function=train_loss,
473
+ )
474
+
475
+ if custom_args.stop_after_n_steps is not None:
476
+ trainer.add_callback(StopTrainingCallback(custom_args.stop_after_n_steps))
477
+
478
+ trainer.train()
479
+
480
+
481
+ if __name__ == "__main__":
482
+ main()
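
Both this script and run_simcse.py accept a single *.json argument, which HfArgumentParser.parse_json_file maps onto the dataclasses defined above. A minimal launch sketch follows; every path and value is a placeholder, and the keys simply mirror the argument fields.

import json, subprocess, sys

# Placeholder configuration; keys correspond to ModelArguments, DataTrainingArguments,
# TrainingArguments and CustomArguments fields of run_supervised.py.
config = {
    "model_name_or_path": "mistralai/Mistral-7B-Instruct-v0.2",   # placeholder base model
    "peft_model_name_or_path": "path/to/mntp-lora-checkpoint",    # placeholder path
    "bidirectional": True,
    "pooling_mode": "mean",
    "dataset_name": "E5",
    "dataset_file_path": "path/to/training-data",                 # placeholder path
    "output_dir": "output/supervised",
    "per_device_train_batch_size": 32,
    "learning_rate": 2e-4,
    "num_train_epochs": 1,
    "lora_r": 16,
    "stop_after_n_steps": 1000,
    "seed": 42,
}
with open("supervised_config.json", "w") as f:
    json.dump(config, f, indent=2)

# Equivalent to: python experiments/run_supervised.py supervised_config.json
subprocess.run([sys.executable, "experiments/run_supervised.py", "supervised_config.json"], check=True)
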
llm2vec/experiments/run_word_task.py ADDED
@@ -0,0 +1,905 @@
1
+ """
2
+ The script is adapted from https://huggingface.co/docs/transformers/en/tasks/token_classification
3
+ """
4
+
5
+ import logging
6
+ import os
7
+ import sys
8
+ import warnings
9
+ from dataclasses import dataclass, field
10
+ import numpy as np
11
+ from typing import List, Optional, Tuple, Union
12
+
13
+ import datasets
14
+ import evaluate
15
+ from datasets import load_dataset
16
+
17
+ import torch
18
+ from torch import nn
19
+ from torch.nn import CrossEntropyLoss
20
+ import transformers
21
+ from transformers import (
22
+ PreTrainedModel,
23
+ MODEL_FOR_MASKED_LM_MAPPING,
24
+ AutoConfig,
25
+ AutoTokenizer,
26
+ HfArgumentParser,
27
+ Trainer,
28
+ TrainingArguments,
29
+ TrainerCallback,
30
+ set_seed,
31
+ AutoModelForTokenClassification,
32
+ DataCollatorForTokenClassification,
33
+ )
34
+
35
+ from transformers.modeling_outputs import TokenClassifierOutput
36
+ from transformers.utils import send_example_telemetry
37
+ from transformers.utils.versions import require_version
38
+
39
+ from llm2vec import LLM2Vec
40
+
41
+ require_version(
42
+ "datasets>=1.8.0",
43
+ "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt",
44
+ )
45
+
46
+
47
+ class ModelForWordTask(PreTrainedModel):
48
+ def __init__(self, config, model, merge_subwords=False, **model_args):
49
+ PreTrainedModel.__init__(self, config)
50
+ self.model = model
51
+ self.merge_subwords = merge_subwords
52
+
53
+ if (
54
+ hasattr(config, "classifier_dropout")
55
+ and config.classifier_dropout is not None
56
+ ):
57
+ classifier_dropout = config.classifier_dropout
58
+ elif hasattr(config, "hidden_dropout") and config.hidden_dropout is not None:
59
+ classifier_dropout = config.hidden_dropout
60
+ else:
61
+ classifier_dropout = 0.1
62
+
63
+ self.dropout = nn.Dropout(classifier_dropout)
64
+ self.num_labels = config.num_labels
65
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels).to(
66
+ model_args.get("torch_dtype")
67
+ )
68
+
69
+ # Initialize weights and apply final processing
70
+ self.post_init()
71
+
72
+ def _merge_subwords(self, hidden_states, token_type_ids, attention_mask):
73
+ new_hidden_states = hidden_states.clone()
74
+ for b in range(hidden_states.shape[0]):
75
+ for w in torch.arange(0, token_type_ids[b].max() + 1):
76
+ words_w = (token_type_ids[b] == w) * (attention_mask[b] > 0)
77
+ new_hidden_states[b][words_w] = torch.mean(
78
+ hidden_states[b][words_w], dim=0
79
+ ).repeat(sum(words_w), 1)
80
+ return new_hidden_states
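
As a toy illustration of the subword merging above (all numbers invented): token_type_ids is assumed to carry a word index for every subword, and every subword of a word is overwritten with that word's mean hidden state.

import torch

hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]])   # 1 sequence, 3 subwords, dim 2
token_type_ids = torch.tensor([[0, 0, 1]])                       # subwords 0-1 form word 0
attention_mask = torch.ones(1, 3, dtype=torch.long)

merged = hidden.clone()
for b in range(hidden.shape[0]):
    for w in range(int(token_type_ids[b].max()) + 1):
        words_w = (token_type_ids[b] == w) * (attention_mask[b] > 0)
        merged[b][words_w] = torch.mean(hidden[b][words_w], dim=0).repeat(int(words_w.sum()), 1)
print(merged)   # word 0's two subwords both become [2., 2.]; word 1 keeps [5., 5.]
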
81
+
82
+ def forward(
83
+ self,
84
+ input_ids: torch.LongTensor = None,
85
+ attention_mask: Optional[torch.Tensor] = None,
86
+ position_ids: Optional[torch.LongTensor] = None,
87
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
88
+ inputs_embeds: Optional[torch.FloatTensor] = None,
89
+ use_cache: Optional[bool] = None,
90
+ output_attentions: Optional[bool] = None,
91
+ output_hidden_states: Optional[bool] = None,
92
+ return_dict: Optional[bool] = None,
93
+ token_type_ids: Optional[torch.LongTensor] = None,
94
+ head_mask: Optional[torch.FloatTensor] = None,
95
+ labels: Optional[torch.LongTensor] = None,
96
+ ) -> Union[Tuple, TokenClassifierOutput]:
97
+ output_attentions = (
98
+ output_attentions
99
+ if output_attentions is not None
100
+ else self.config.output_attentions
101
+ )
102
+ output_hidden_states = (
103
+ output_hidden_states
104
+ if output_hidden_states is not None
105
+ else self.config.output_hidden_states
106
+ )
107
+
108
+ return_dict = (
109
+ return_dict if return_dict is not None else self.config.use_return_dict
110
+ )
111
+
112
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
113
+ outputs = self.model(
114
+ input_ids=input_ids,
115
+ attention_mask=attention_mask,
116
+ position_ids=position_ids,
117
+ past_key_values=past_key_values,
118
+ inputs_embeds=inputs_embeds,
119
+ use_cache=use_cache,
120
+ output_attentions=output_attentions,
121
+ output_hidden_states=output_hidden_states,
122
+ return_dict=return_dict,
123
+ )
124
+
125
+ hidden_states = outputs[0]
126
+
127
+ if self.merge_subwords:
128
+ hidden_states = self._merge_subwords(
129
+ hidden_states, token_type_ids, attention_mask
130
+ )
131
+
132
+ hidden_states = self.dropout(hidden_states)
133
+ logits = self.classifier(hidden_states)
134
+
135
+ loss = None
136
+ if labels is not None:
137
+ labels = labels.to(logits.device)
138
+ loss_fct = CrossEntropyLoss()
139
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
140
+
141
+ if not return_dict:
142
+ output = (logits,) + outputs.hidden_states
143
+ return ((loss,) + output) if loss is not None else output
144
+
145
+ return TokenClassifierOutput(
146
+ loss=loss,
147
+ logits=logits,
148
+ hidden_states=hidden_states,
149
+ attentions=outputs.attentions,
150
+ )
151
+
152
+
153
+ logger = logging.getLogger(__name__)
154
+ MODEL_CONFIG_CLASSES = list(MODEL_FOR_MASKED_LM_MAPPING.keys())
155
+ MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
156
+ LABELS = {
157
+ "conll2003": {
158
+ "pos_tags": {
159
+ '"': 0,
160
+ "''": 1,
161
+ "#": 2,
162
+ "$": 3,
163
+ "(": 4,
164
+ ")": 5,
165
+ ",": 6,
166
+ ".": 7,
167
+ ":": 8,
168
+ "``": 9,
169
+ "CC": 10,
170
+ "CD": 11,
171
+ "DT": 12,
172
+ "EX": 13,
173
+ "FW": 14,
174
+ "IN": 15,
175
+ "JJ": 16,
176
+ "JJR": 17,
177
+ "JJS": 18,
178
+ "LS": 19,
179
+ "MD": 20,
180
+ "NN": 21,
181
+ "NNP": 22,
182
+ "NNPS": 23,
183
+ "NNS": 24,
184
+ "NN|SYM": 25,
185
+ "PDT": 26,
186
+ "POS": 27,
187
+ "PRP": 28,
188
+ "PRP$": 29,
189
+ "RB": 30,
190
+ "RBR": 31,
191
+ "RBS": 32,
192
+ "RP": 33,
193
+ "SYM": 34,
194
+ "TO": 35,
195
+ "UH": 36,
196
+ "VB": 37,
197
+ "VBD": 38,
198
+ "VBG": 39,
199
+ "VBN": 40,
200
+ "VBP": 41,
201
+ "VBZ": 42,
202
+ "WDT": 43,
203
+ "WP": 44,
204
+ "WP$": 45,
205
+ "WRB": 46,
206
+ },
207
+ "chunk_tags": {
208
+ "O": 0,
209
+ "B-ADJP": 1,
210
+ "I-ADJP": 2,
211
+ "B-ADVP": 3,
212
+ "I-ADVP": 4,
213
+ "B-CONJP": 5,
214
+ "I-CONJP": 6,
215
+ "B-INTJ": 7,
216
+ "I-INTJ": 8,
217
+ "B-LST": 9,
218
+ "I-LST": 10,
219
+ "B-NP": 11,
220
+ "I-NP": 12,
221
+ "B-PP": 13,
222
+ "I-PP": 14,
223
+ "B-PRT": 15,
224
+ "I-PRT": 16,
225
+ "B-SBAR": 17,
226
+ "I-SBAR": 18,
227
+ "B-UCP": 19,
228
+ "I-UCP": 20,
229
+ "B-VP": 21,
230
+ "I-VP": 22,
231
+ },
232
+ "ner_tags": {
233
+ "O": 0,
234
+ "B-PER": 1,
235
+ "I-PER": 2,
236
+ "B-ORG": 3,
237
+ "I-ORG": 4,
238
+ "B-LOC": 5,
239
+ "I-LOC": 6,
240
+ "B-MISC": 7,
241
+ "I-MISC": 8,
242
+ },
243
+ }
244
+ }
245
+
246
+
247
+ @dataclass
248
+ class ModelArguments:
249
+ """
250
+ Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
251
+ """
252
+
253
+ model_name_or_path: Optional[str] = field(
254
+ default=None,
255
+ metadata={},
256
+ )
257
+ config_overrides: Optional[str] = field(
258
+ default=None,
259
+ metadata={
260
+ "help": (
261
+ "Override some existing default config settings when a model is trained from scratch. Example: "
262
+ "n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index"
263
+ )
264
+ },
265
+ )
266
+ config_name: Optional[str] = field(
267
+ default=None,
268
+ metadata={
269
+ "help": "Pretrained config name or path if not the same as model_name"
270
+ },
271
+ )
272
+ tokenizer_name: Optional[str] = field(
273
+ default=None,
274
+ metadata={
275
+ "help": "Pretrained tokenizer name or path if not the same as model_name"
276
+ },
277
+ )
278
+ cache_dir: Optional[str] = field(
279
+ default=None,
280
+ metadata={
281
+ "help": "Where do you want to store the pretrained models downloaded from huggingface.co"
282
+ },
283
+ )
284
+ use_fast_tokenizer: bool = field(
285
+ default=True,
286
+ metadata={
287
+ "help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."
288
+ },
289
+ )
290
+ model_revision: str = field(
291
+ default="main",
292
+ metadata={
293
+ "help": "The specific model version to use (can be a branch name, tag name or commit id)."
294
+ },
295
+ )
296
+ token: str = field(
297
+ default=None,
298
+ metadata={
299
+ "help": (
300
+ "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token "
301
+ "generated when running `huggingface-cli login` (stored in `~/.huggingface`)."
302
+ )
303
+ },
304
+ )
305
+ use_auth_token: bool = field(
306
+ default=None,
307
+ metadata={
308
+ "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead."
309
+ },
310
+ )
311
+ trust_remote_code: bool = field(
312
+ default=False,
313
+ metadata={
314
+ "help": (
315
+ "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option "
316
+ "should only be set to `True` for repositories you trust and in which you have read the code, as it will "
317
+ "execute code present on the Hub on your local machine."
318
+ )
319
+ },
320
+ )
321
+ low_cpu_mem_usage: bool = field(
322
+ default=False,
323
+ metadata={
324
+ "help": (
325
+ "It is an option to create the model as an empty shell, then only materialize its parameters when the pretrained weights are loaded. "
326
+ "set True will benefit LLM loading time and RAM consumption."
327
+ )
328
+ },
329
+ )
330
+ torch_dtype: Optional[str] = field(
331
+ default=None,
332
+ metadata={
333
+ "help": (
334
+ "Override the default `torch.dtype` and load the model under this dtype. If `auto` is passed, the "
335
+ "dtype will be automatically derived from the model's weights."
336
+ ),
337
+ "choices": ["auto", "bfloat16", "float16", "float32"],
338
+ },
339
+ )
340
+ attn_implementation: Optional[str] = field(
341
+ default="sdpa",
342
+ metadata={
343
+ "help": ("The attention implementation to use in the model."),
344
+ "choices": ["eager", "sdpa", "flash_attention_2"],
345
+ },
346
+ )
347
+ classifier_dropout: Optional[float] = field(
348
+ default=0.1, metadata={"help": "The dropout rate for models"}
349
+ )
350
+ peft_addr: Optional[str] = field(
351
+ default=None, metadata={"help": "addr of lora adapter weights"}
352
+ )
353
+ model_class: str = field(
354
+ default="custom",
355
+ metadata={
356
+ "help": "One of the items 'custom' or 'auto'. 'custom' for LLM2Vec models and 'auto' for pretrained encoders such as BERT.",
357
+ "choices": ["custom", "auto"],
358
+ },
359
+ )
360
+ merge_subwords: bool = field(
361
+ default=True,
362
+ metadata={"help": "Whether the representations of the subtokens get averaged."},
363
+ )
364
+ bidirectional: bool = field(
365
+ default=True, metadata={"help": "Whether to use bidirectional attention."}
366
+ )
367
+
368
+ def __post_init__(self):
369
+ if self.config_overrides is not None and (
370
+ self.config_name is not None or self.model_name_or_path is not None
371
+ ):
372
+ raise ValueError(
373
+ "--config_overrides can't be used in combination with --config_name or --model_name_or_path"
374
+ )
375
+
376
+
377
+ @dataclass
378
+ class DataTrainingArguments:
379
+ """
380
+ Arguments pertaining to what data we are going to input our model for training and eval.
381
+ """
382
+
383
+ dataset_name: Optional[str] = field(
384
+ default=None,
385
+ metadata={"help": "The name of the dataset to use (via the datasets library)."},
386
+ )
387
+ dataset_config_name: Optional[str] = field(
388
+ default=None,
389
+ metadata={
390
+ "help": "The configuration name of the dataset to use (via the datasets library)."
391
+ },
392
+ )
393
+ train_file: Optional[str] = field(
394
+ default=None, metadata={"help": "The input training data file (a text file)."}
395
+ )
396
+ validation_file: Optional[str] = field(
397
+ default=None,
398
+ metadata={
399
+ "help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."
400
+ },
401
+ )
402
+ overwrite_cache: bool = field(
403
+ default=True,
404
+ metadata={"help": "Overwrite the cached training and evaluation sets"},
405
+ )
406
+ validation_split_percentage: Optional[int] = field(
407
+ default=5,
408
+ metadata={
409
+ "help": "The percentage of the train set used as validation set in case there's no validation split"
410
+ },
411
+ )
412
+ max_seq_length: Optional[int] = field(
413
+ default=None,
414
+ metadata={
415
+ "help": (
416
+ "The maximum total input sequence length after tokenization. Sequences longer "
417
+ "than this will be truncated."
418
+ )
419
+ },
420
+ )
421
+ preprocessing_num_workers: Optional[int] = field(
422
+ default=None,
423
+ metadata={"help": "The number of processes to use for the preprocessing."},
424
+ )
425
+ mlm_probability: float = field(
426
+ default=0.15,
427
+ metadata={"help": "Ratio of tokens to mask for masked language modeling loss"},
428
+ )
429
+ line_by_line: bool = field(
430
+ default=False,
431
+ metadata={
432
+ "help": "Whether distinct lines of text in the dataset are to be handled as distinct sequences."
433
+ },
434
+ )
435
+ pad_to_max_length: bool = field(
436
+ default=False,
437
+ metadata={
438
+ "help": (
439
+ "Whether to pad all samples to `max_seq_length`. "
440
+ "If False, will pad the samples dynamically when batching to the maximum length in the batch."
441
+ )
442
+ },
443
+ )
444
+ max_train_samples: Optional[int] = field(
445
+ default=None,
446
+ metadata={
447
+ "help": (
448
+ "For debugging purposes or quicker training, truncate the number of training examples to this "
449
+ "value if set."
450
+ )
451
+ },
452
+ )
453
+ max_eval_samples: Optional[int] = field(
454
+ default=None,
455
+ metadata={
456
+ "help": (
457
+ "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
458
+ "value if set."
459
+ )
460
+ },
461
+ )
462
+ streaming: bool = field(default=False, metadata={"help": "Enable streaming mode"})
463
+
464
+ def __post_init__(self):
465
+ if self.streaming:
466
+ require_version(
467
+ "datasets>=2.0.0", "The streaming feature requires `datasets>=2.0.0`"
468
+ )
469
+
470
+ if (
471
+ self.dataset_name is None
472
+ and self.train_file is None
473
+ and self.validation_file is None
474
+ ):
475
+ raise ValueError(
476
+ "Need either a dataset name or a training/validation file."
477
+ )
478
+ else:
479
+ if self.train_file is not None:
480
+ extension = self.train_file.split(".")[-1]
481
+ if extension not in ["csv", "json", "txt"]:
482
+ raise ValueError(
483
+ "`train_file` should be a csv, a json or a txt file."
484
+ )
485
+ if self.validation_file is not None:
486
+ extension = self.validation_file.split(".")[-1]
487
+ if extension not in ["csv", "json", "txt"]:
488
+ raise ValueError(
489
+ "`validation_file` should be a csv, a json or a txt file."
490
+ )
491
+
492
+
493
+ # Additional script-specific arguments
494
+ @dataclass
495
+ class CustomArguments:
496
+ """
497
+ Custom arguments for the script
498
+ """
499
+
500
+ stop_after_n_steps: int = field(
501
+ default=10000, metadata={"help": "Stop training after n steps"}
502
+ )
503
+ data_collator_type: str = field(
504
+ default="custom",
505
+ metadata={
506
+ "help": "The type of data collator. Options: custom, default, custom_no_random"
507
+ },
508
+ )
509
+ task: Optional[str] = field(
510
+ default="pos_tags",
511
+ metadata={
512
+ "help": "One of the 'pos_tags', 'chunk_tags', and 'ner_tags' choices",
513
+ "choices": ["pos_tags", "ner_tags", "chunk_tags"],
514
+ },
515
+ )
516
+ retroactive_labels: str = field(
517
+ default="next_token",
518
+ metadata={
519
+ "help": "Whether a token's representation is used to predict its own label or the next token's label. Options: same_token, next_token.",
520
+ "choices": ["next_token", "same_token"],
521
+ },
522
+ )
523
+
524
+
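+ # Callback that stops training once global_step reaches stop_after_n_steps, regardless of num_train_epochs.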
525
+ class StopTrainingCallback(TrainerCallback):
526
+ def __init__(self, stop_after_n_steps: int):
527
+ self.stop_after_n_steps = stop_after_n_steps
528
+
529
+ def on_step_end(self, args, state, control, **kwargs):
530
+ if state.global_step >= self.stop_after_n_steps:
531
+ control.should_training_stop = True
532
+
533
+
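+ # Trainer variant whose checkpoints contain only the classifier head, the tokenizer, and the training args; the frozen backbone weights are not saved.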
534
+ class WordTaskTrainer(Trainer):
535
+ def _save(self, output_dir: Optional[str] = None, state_dict=None):
536
+ # If we are executing this function, we are the process zero, so we don't check for that.
537
+ output_dir = output_dir if output_dir is not None else self.args.output_dir
538
+ os.makedirs(output_dir, exist_ok=True)
539
+ logger.info(f"Saving model checkpoint to {output_dir}")
540
+
541
+ torch.save(self.model.classifier, os.path.join(output_dir, "classifier.pt"))
542
+ self.tokenizer.save_pretrained(output_dir)
543
+
544
+ # Good practice: save your training arguments together with the trained model
545
+ torch.save(self.args, os.path.join(output_dir, "training_args.bin"))
546
+
547
+
548
+ def main():
549
+ parser = HfArgumentParser(
550
+ (ModelArguments, DataTrainingArguments, TrainingArguments, CustomArguments)
551
+ )
552
+ # model_args, data_args, training_args, custom_args = parser.parse_args_into_dataclasses()
553
+ if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
554
+ # If we pass only one argument to the script and it's the path to a json file,
555
+ # let's parse it to get our arguments.
556
+ model_args, data_args, training_args, custom_args = parser.parse_json_file(
557
+ json_file=os.path.abspath(sys.argv[1])
558
+ )
559
+ else:
560
+ (
561
+ model_args,
562
+ data_args,
563
+ training_args,
564
+ custom_args,
565
+ ) = parser.parse_args_into_dataclasses()
566
+
567
+ if training_args.gradient_checkpointing:
568
+ training_args.gradient_checkpointing_kwargs = {"use_reentrant": False}
569
+
570
+ if model_args.use_auth_token is not None:
571
+ warnings.warn(
572
+ "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.",
573
+ FutureWarning,
574
+ )
575
+ if model_args.token is not None:
576
+ raise ValueError(
577
+ "`token` and `use_auth_token` are both specified. Please set only the argument `token`."
578
+ )
579
+ model_args.token = model_args.use_auth_token
580
+
581
+ # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The
582
+ # information sent is the one passed as arguments along with your Python/PyTorch versions.
583
+ send_example_telemetry("run_word_task", model_args, data_args)
584
+
585
+ # Setup logging
586
+ logging.basicConfig(
587
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
588
+ datefmt="%m/%d/%Y %H:%M:%S",
589
+ handlers=[logging.StreamHandler(sys.stdout)],
590
+ )
591
+
592
+ if training_args.should_log:
593
+ # The default of training_args.log_level is passive, so we set log level at info here to have that default.
594
+ transformers.utils.logging.set_verbosity_info()
595
+
596
+ log_level = training_args.get_process_log_level()
597
+ logger.setLevel(log_level)
598
+ datasets.utils.logging.set_verbosity(log_level)
599
+ transformers.utils.logging.set_verbosity(log_level)
600
+ transformers.utils.logging.enable_default_handler()
601
+ transformers.utils.logging.enable_explicit_format()
602
+
603
+ # Log on each process the small summary:
604
+ logger.warning(
605
+ f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, "
606
+ + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}"
607
+ )
608
+ # Set the verbosity to info of the Transformers logger (on main process only):
609
+ logger.info(f"Training/evaluation parameters {training_args}")
610
+
611
+ # Set seed before initializing model.
612
+ set_seed(training_args.seed)
613
+
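+ # Load the data either from the Hub (dataset_name) or from local csv/json/txt files; when no validation split exists, a validation_split_percentage slice of the train split is carved out instead.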
614
+ if data_args.dataset_name is not None:
615
+ # Downloading and loading a dataset from the hub.
616
+ raw_datasets = load_dataset(
617
+ data_args.dataset_name,
618
+ data_args.dataset_config_name,
619
+ cache_dir=model_args.cache_dir,
620
+ token=model_args.token,
621
+ streaming=data_args.streaming,
622
+ )
623
+ if "validation" not in raw_datasets.keys():
624
+ raw_datasets["validation"] = load_dataset(
625
+ data_args.dataset_name,
626
+ data_args.dataset_config_name,
627
+ split=f"train[:{data_args.validation_split_percentage}%]",
628
+ cache_dir=model_args.cache_dir,
629
+ token=model_args.token,
630
+ streaming=data_args.streaming,
631
+ )
632
+ raw_datasets["train"] = load_dataset(
633
+ data_args.dataset_name,
634
+ data_args.dataset_config_name,
635
+ split=f"train[{data_args.validation_split_percentage}%:]",
636
+ cache_dir=model_args.cache_dir,
637
+ token=model_args.token,
638
+ streaming=data_args.streaming,
639
+ )
640
+ else:
641
+ data_files = {}
642
+ if data_args.train_file is not None:
643
+ data_files["train"] = data_args.train_file
644
+ extension = data_args.train_file.split(".")[-1]
645
+ if data_args.validation_file is not None:
646
+ data_files["validation"] = data_args.validation_file
647
+ extension = data_args.validation_file.split(".")[-1]
648
+ if extension == "txt":
649
+ extension = "text"
650
+ raw_datasets = load_dataset(
651
+ extension,
652
+ data_files=data_files,
653
+ cache_dir=model_args.cache_dir,
654
+ token=model_args.token,
655
+ )
656
+
657
+ # If no validation data is there, validation_split_percentage will be used to divide the dataset.
658
+ if "validation" not in raw_datasets.keys():
659
+ raw_datasets["validation"] = load_dataset(
660
+ extension,
661
+ data_files=data_files,
662
+ split=f"train[:{data_args.validation_split_percentage}%]",
663
+ cache_dir=model_args.cache_dir,
664
+ token=model_args.token,
665
+ )
666
+ raw_datasets["train"] = load_dataset(
667
+ extension,
668
+ data_files=data_files,
669
+ split=f"train[{data_args.validation_split_percentage}%:]",
670
+ cache_dir=model_args.cache_dir,
671
+ token=model_args.token,
672
+ )
673
+
674
+ assert (
675
+ data_args.dataset_name in LABELS
676
+ and custom_args.task in LABELS[data_args.dataset_name]
677
+ ), f"LABELS[{data_args.dataset_name}][{custom_args.task}] is not defined."
678
+
679
+ config_kwargs = {
680
+ "num_labels": len(LABELS[data_args.dataset_name][custom_args.task]),
681
+ "id2label": {
682
+ i: lab
683
+ for (lab, i) in LABELS[data_args.dataset_name][custom_args.task].items()
684
+ },
685
+ "label2id": LABELS[data_args.dataset_name][custom_args.task],
686
+ "classifier_dropout": model_args.classifier_dropout,
687
+ }
688
+
689
+ tokenizer_kwargs = {
690
+ "cache_dir": model_args.cache_dir,
691
+ "use_fast": model_args.use_fast_tokenizer,
692
+ "revision": model_args.model_revision,
693
+ "token": model_args.token,
694
+ "trust_remote_code": model_args.trust_remote_code,
695
+ }
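+ # GPT-style BPE tokenizers need add_prefix_space=True to accept pre-tokenized input (is_split_into_words=True).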
696
+ if model_args.tokenizer_name:
697
+ if "gpt" in model_args.tokenizer_name:
698
+ tokenizer_kwargs["add_prefix_space"] = True
699
+ tokenizer = AutoTokenizer.from_pretrained(
700
+ model_args.tokenizer_name, **tokenizer_kwargs
701
+ )
702
+ elif model_args.model_name_or_path:
703
+ if "gpt" in model_args.model_name_or_path:
704
+ tokenizer_kwargs["add_prefix_space"] = True
705
+ tokenizer = AutoTokenizer.from_pretrained(
706
+ model_args.model_name_or_path, **tokenizer_kwargs
707
+ )
708
+ else:
709
+ raise ValueError(
710
+ "You are instantiating a new tokenizer from scratch. This is not supported by this script. "
711
+ "You can do it from another script, save it, and load it from here, using --tokenizer_name."
712
+ )
713
+
714
+ if tokenizer.pad_token is None:
715
+ tokenizer.pad_token = tokenizer.eos_token
716
+ if model_args.model_class == "custom":
717
+ tokenizer.model_input_names.append("token_type_ids")
718
+ if model_args.model_class == "auto":
719
+ assert not model_args.merge_subwords
720
+
721
+ if model_args.model_class == "custom":
722
+ if model_args.config_name:
723
+ config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs)
724
+ elif model_args.model_name_or_path:
725
+ config = AutoConfig.from_pretrained(
726
+ model_args.model_name_or_path, **config_kwargs
727
+ )
728
+ else:
729
+ raise ValueError("Invalid config loading")
730
+
731
+ for k, v in config_kwargs.items():
732
+ config.__setattr__(k, v)
733
+
734
+ torch_dtype = (
735
+ model_args.torch_dtype
736
+ if model_args.torch_dtype in ["auto", None]
737
+ else getattr(torch, model_args.torch_dtype)
738
+ )
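+ # Load the LLM2Vec backbone with bidirectional attention controlled by --bidirectional and, if given, the LoRA adapter from peft_addr attached without merging it into the base weights.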
739
+ l2v = LLM2Vec.from_pretrained(
740
+ base_model_name_or_path=model_args.model_name_or_path,
741
+ enable_bidirectional=model_args.bidirectional,
742
+ peft_model_name_or_path=model_args.peft_addr,
743
+ merge_peft=False,
744
+ torch_dtype=torch_dtype,
745
+ attn_implementation=model_args.attn_implementation,
746
+ )
747
+
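+ # Wrap the backbone with a token-classification head; merge_subwords controls whether subtoken representations are averaged into word-level representations.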
748
+ model = ModelForWordTask(
749
+ model=l2v.model,
750
+ merge_subwords=model_args.merge_subwords,
751
+ config=config,
752
+ torch_dtype=torch_dtype,
753
+ )
754
+
755
+ MyTrainer = WordTaskTrainer
756
+
757
+ elif model_args.model_class == "auto":
758
+ model = AutoModelForTokenClassification.from_pretrained(
759
+ model_args.model_name_or_path,
760
+ num_labels=config_kwargs["num_labels"],
761
+ id2label=config_kwargs["id2label"],
762
+ label2id=config_kwargs["label2id"],
763
+ )
764
+ MyTrainer = Trainer
765
+
766
+ else:
767
+ raise ValueError(
768
+ f"{model_args.model_class} is not implemented. Only 'auto' and 'custom' model_class options are valid."
769
+ )
770
+
771
+ # only train classifier
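+ # (the LLM2Vec backbone, including any LoRA adapters, stays frozen; only the linear head receives gradients)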
772
+ for n, p in list(model.named_parameters()):
773
+ if "classifier" in n:
774
+ p.requires_grad = True
775
+ else:
776
+ p.requires_grad = False
777
+
778
+ if data_args.max_seq_length is None:
779
+ max_seq_length = tokenizer.model_max_length
780
+ if max_seq_length > 1024:
781
+ logger.warning(
782
+ "The chosen tokenizer supports a `model_max_length` that is longer than the default `block_size` value"
783
+ " of 1024. If you would like to use a longer `block_size` up to `tokenizer.model_max_length` you can"
784
+ " override this default with `--block_size xxx`."
785
+ )
786
+ max_seq_length = 1024
787
+ else:
788
+ if data_args.max_seq_length > tokenizer.model_max_length:
789
+ logger.warning(
790
+ f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the "
791
+ f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}."
792
+ )
793
+ max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length)
794
+
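+ # Align word-level tags with subword tokens: only the first subword of each word keeps the word's label, every other position gets -100 and is ignored by the loss.
+ # With retroactive_labels="next_token", labels and word ids are shifted one position to the left, matching a causal next-token prediction setup.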
795
+ def tokenize_and_align_labels(examples):
796
+ task = custom_args.task
797
+ padding = "max_length" if data_args.pad_to_max_length else False
798
+ tokenized_inputs = tokenizer(
799
+ examples["tokens"],
800
+ truncation=True,
801
+ is_split_into_words=True,
802
+ padding=padding,
803
+ max_length=max_seq_length,
804
+ )
805
+
806
+ labels = []
807
+ words = []
808
+ for i, label in enumerate(examples[task]):
809
+ if custom_args.retroactive_labels in ["same_token"]:
810
+ word_ids = tokenized_inputs.word_ids(batch_index=i)
811
+ previous_word_idx = None
812
+ label_ids = []
813
+ for word_idx in word_ids:
814
+ if word_idx is None:
815
+ label_ids.append(-100)
816
+ elif word_idx != previous_word_idx:
817
+ label_ids.append(label[word_idx])
818
+ else:
819
+ label_ids.append(-100)
820
+ previous_word_idx = word_idx
821
+ labels.append(label_ids)
822
+ word_ids = [-1 if w is None else w for w in word_ids]
823
+ words.append(word_ids)
824
+ elif custom_args.retroactive_labels == "next_token":
825
+ word_ids = tokenized_inputs.word_ids(batch_index=i)
826
+ previous_word_idx = None
827
+ label_ids = []
828
+ for word_idx in word_ids:
829
+ if word_idx is None:
830
+ label_ids.append(-100)
831
+ elif word_idx != previous_word_idx:
832
+ label_ids.append(label[word_idx])
833
+ else:
834
+ label_ids.append(-100)
835
+ previous_word_idx = word_idx
836
+ label_ids.append(-100)
837
+ labels.append(label_ids[1:])
838
+ word_ids = word_ids[1:] + [None]
839
+ word_ids = [-1 if w is None else w for w in word_ids]
840
+ words.append(word_ids)
841
+ else:
842
+ raise ValueError(
843
+ f"retroactive_labels {custom_args.retroactive_labels} is not implemented."
844
+ )
845
+
846
+ tokenized_inputs["labels"] = labels
847
+ if model_args.model_class == "custom":
848
+ tokenized_inputs["token_type_ids"] = words
849
+ return tokenized_inputs
850
+
851
+ tokenized_dataset = raw_datasets.map(
852
+ tokenize_and_align_labels,
853
+ batched=True,
854
+ remove_columns=list(LABELS[data_args.dataset_name].keys()) + ["tokens", "id"],
855
+ load_from_cache_file=not data_args.overwrite_cache,
856
+ )
857
+ data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
858
+ seqeval = evaluate.load("seqeval")
859
+
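+ # The first element of p.predictions holds the logits; argmax over the label dimension, map ids back to tag strings while dropping -100 positions, then score with seqeval.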
860
+ def compute_metrics(p):
861
+ predictions, labels = p
862
+ predictions = predictions[0]
863
+ predictions = np.argmax(predictions, axis=2)
864
+
865
+ true_predictions = [
866
+ [
867
+ config_kwargs["id2label"][p]
868
+ for (p, l) in zip(prediction, label)
869
+ if l != -100
870
+ ]
871
+ for prediction, label in zip(predictions, labels)
872
+ ]
873
+ true_labels = [
874
+ [
875
+ config_kwargs["id2label"][l]
876
+ for (p, l) in zip(prediction, label)
877
+ if l != -100
878
+ ]
879
+ for prediction, label in zip(predictions, labels)
880
+ ]
881
+
882
+ results = seqeval.compute(predictions=true_predictions, references=true_labels)
883
+ return {
884
+ "precision": results["overall_precision"],
885
+ "recall": results["overall_recall"],
886
+ "f1": results["overall_f1"],
887
+ "accuracy": results["overall_accuracy"],
888
+ }
889
+
890
+ trainer = MyTrainer(
891
+ model=model,
892
+ args=training_args,
893
+ train_dataset=tokenized_dataset["train"],
894
+ eval_dataset=tokenized_dataset["validation"],
895
+ tokenizer=tokenizer,
896
+ data_collator=data_collator,
897
+ compute_metrics=compute_metrics,
898
+ )
899
+ trainer.add_callback(StopTrainingCallback(custom_args.stop_after_n_steps))
900
+
901
+ trainer.train()
902
+
903
+
904
+ if __name__ == "__main__":
905
+ main()
llm2vec/experiments/test_word_task.py ADDED
@@ -0,0 +1,393 @@
1
+ import os
2
+ import sys
3
+ import logging
4
+ import argparse
5
+ from transformers import (
6
+ AutoTokenizer,
7
+ AutoConfig,
8
+ AutoModelForTokenClassification,
9
+ set_seed,
10
+ HfArgumentParser,
11
+ )
12
+ import torch
13
+ from datasets import load_dataset
14
+ import evaluate
15
+ import json
16
+ from tqdm import tqdm
17
+ from run_word_task import ModelForWordTask
18
+ from llm2vec import LLM2Vec
19
+
20
+
21
+ LABELS = {
22
+ "conll2003": {
23
+ "pos_tags": {
24
+ '"': 0,
25
+ "''": 1,
26
+ "#": 2,
27
+ "$": 3,
28
+ "(": 4,
29
+ ")": 5,
30
+ ",": 6,
31
+ ".": 7,
32
+ ":": 8,
33
+ "``": 9,
34
+ "CC": 10,
35
+ "CD": 11,
36
+ "DT": 12,
37
+ "EX": 13,
38
+ "FW": 14,
39
+ "IN": 15,
40
+ "JJ": 16,
41
+ "JJR": 17,
42
+ "JJS": 18,
43
+ "LS": 19,
44
+ "MD": 20,
45
+ "NN": 21,
46
+ "NNP": 22,
47
+ "NNPS": 23,
48
+ "NNS": 24,
49
+ "NN|SYM": 25,
50
+ "PDT": 26,
51
+ "POS": 27,
52
+ "PRP": 28,
53
+ "PRP$": 29,
54
+ "RB": 30,
55
+ "RBR": 31,
56
+ "RBS": 32,
57
+ "RP": 33,
58
+ "SYM": 34,
59
+ "TO": 35,
60
+ "UH": 36,
61
+ "VB": 37,
62
+ "VBD": 38,
63
+ "VBG": 39,
64
+ "VBN": 40,
65
+ "VBP": 41,
66
+ "VBZ": 42,
67
+ "WDT": 43,
68
+ "WP": 44,
69
+ "WP$": 45,
70
+ "WRB": 46,
71
+ },
72
+ "chunk_tags": {
73
+ "O": 0,
74
+ "B-ADJP": 1,
75
+ "I-ADJP": 2,
76
+ "B-ADVP": 3,
77
+ "I-ADVP": 4,
78
+ "B-CONJP": 5,
79
+ "I-CONJP": 6,
80
+ "B-INTJ": 7,
81
+ "I-INTJ": 8,
82
+ "B-LST": 9,
83
+ "I-LST": 10,
84
+ "B-NP": 11,
85
+ "I-NP": 12,
86
+ "B-PP": 13,
87
+ "I-PP": 14,
88
+ "B-PRT": 15,
89
+ "I-PRT": 16,
90
+ "B-SBAR": 17,
91
+ "I-SBAR": 18,
92
+ "B-UCP": 19,
93
+ "I-UCP": 20,
94
+ "B-VP": 21,
95
+ "I-VP": 22,
96
+ },
97
+ "ner_tags": {
98
+ "O": 0,
99
+ "B-PER": 1,
100
+ "I-PER": 2,
101
+ "B-ORG": 3,
102
+ "I-ORG": 4,
103
+ "B-LOC": 5,
104
+ "I-LOC": 6,
105
+ "B-MISC": 7,
106
+ "I-MISC": 8,
107
+ },
108
+ }
109
+ }
110
+
111
+
112
+ def str2bool(v):
113
+ if isinstance(v, bool):
114
+ return v
115
+ if v.lower() in ("yes", "true", "t", "y", "1"):
116
+ return True
117
+ elif v.lower() in ("no", "false", "f", "n", "0"):
118
+ return False
119
+ else:
120
+ raise argparse.ArgumentTypeError("Boolean value expected.")
121
+
122
+
123
+ if __name__ == "__main__":
124
+ logging.basicConfig(level=logging.INFO)
125
+ parser = argparse.ArgumentParser()
126
+ parser.add_argument("--model_class", default="custom", type=str)
127
+ parser.add_argument("--model_name_or_path", default=None, type=str)
128
+ parser.add_argument(
129
+ "--peft_addr",
130
+ default=None,
131
+ type=str,
132
+ help="The dir address where adapter_model.bin is saved.",
133
+ )
134
+ parser.add_argument(
135
+ "--cls_addr",
136
+ default=None,
137
+ type=str,
138
+ help="The dir address where classifier is saved.",
139
+ )
140
+ parser.add_argument("--bidirectional", default=True, type=str2bool)
141
+ parser.add_argument("--merge_subwords", default=True, type=str2bool)
142
+ parser.add_argument("--output_dir", default=None, type=str)
143
+ parser.add_argument("--classifier_dropout", default=0.1, type=float)
144
+ parser.add_argument(
145
+ "--attn_implementation",
146
+ default="sdpa",
147
+ type=str,
148
+ choices=["sdpa", "eager", "flash_attention_2"],
149
+ )
150
+ parser.add_argument(
151
+ "--torch_dtype",
152
+ default=None,
153
+ type=str,
154
+ choices=["auto", "bfloat16", "float16", "float32"],
155
+ )
156
+
157
+ parser.add_argument(
158
+ "--retroactive_labels",
159
+ default="next_token",
160
+ type=str,
161
+ choices=["next_token", "same_token"],
162
+ )
163
+ parser.add_argument("--dataset_name", default=None, type=str)
164
+ parser.add_argument(
165
+ "--task", default=None, type=str, choices=["pos_tags", "chunk_tags", "ner_tags"]
166
+ )
167
+ parser.add_argument("--max_seq_length", default=1024, type=int)
168
+ parser.add_argument("--batch_size", default=32, type=int)
169
+ parser.add_argument("--seed", default=32, type=int)
170
+
171
+ parser.add_argument("--config_file", default=None, type=str)
172
+
173
+ args = parser.parse_args()
174
+
175
+ if args.config_file is not None:
176
+ # If --config_file points to a JSON file, load it and let its values
177
+ # override the parsed command-line arguments.
178
+ from pathlib import Path
179
+ import json
180
+
181
+ json_text = json.load(open(os.path.abspath(args.config_file)))
182
+ argparse_dict = vars(args)
183
+ argparse_dict.update(json_text)
184
+ # args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
185
+ else:
186
+ args = parser.parse_args()
187
+
188
+ path_to_check = args.peft_addr if args.peft_addr else args.model_name_or_path
189
+ assert (
190
+ args.output_dir is not None
191
+ ), "If you want to evaluate a model, you have to provide the output_dir"
192
+ os.makedirs(args.output_dir, exist_ok=True)
193
+
194
+ set_seed(args.seed)
195
+
196
+ tokenizer_kwargs = {}
197
+ if "gpt" in args.model_name_or_path:
198
+ tokenizer_kwargs["add_prefix_space"] = True
199
+ tokenizer = AutoTokenizer.from_pretrained(
200
+ args.model_name_or_path, **tokenizer_kwargs
201
+ )
202
+ if tokenizer.pad_token_id is None:
203
+ tokenizer.pad_token = tokenizer.eos_token
204
+
205
+ if args.model_class == "custom":
206
+ tokenizer.model_input_names.append("token_type_ids")
207
+
208
+ if args.model_class == "auto":
209
+ assert not args.merge_subwords
210
+
211
+ assert (
212
+ args.dataset_name in LABELS and args.task in LABELS[args.dataset_name]
213
+ ), f"LABELS[{args.dataset_name}][{args.task}] is not defined."
214
+
215
+ config_kwargs = {
216
+ "num_labels": len(LABELS[args.dataset_name][args.task]),
217
+ "id2label": {
218
+ i: lab for (lab, i) in LABELS[args.dataset_name][args.task].items()
219
+ },
220
+ "label2id": LABELS[args.dataset_name][args.task],
221
+ "classifier_dropout": args.classifier_dropout,
222
+ }
223
+
224
+ if args.model_class == "custom":
225
+ if args.model_name_or_path:
226
+ config = AutoConfig.from_pretrained(
227
+ args.model_name_or_path, **config_kwargs
228
+ )
229
+ else:
230
+ raise ValueError("Invalid config loading")
231
+
232
+ for k, v in config_kwargs.items():
233
+ config.__setattr__(k, v)
234
+
235
+ torch_dtype = (
236
+ args.torch_dtype
237
+ if args.torch_dtype in ["auto", None]
238
+ else getattr(torch, args.torch_dtype)
239
+ )
240
+ l2v = LLM2Vec.from_pretrained(
241
+ base_model_name_or_path=args.model_name_or_path,
242
+ enable_bidirectional=args.bidirectional,
243
+ peft_model_name_or_path=args.peft_addr,
244
+ merge_peft=False,
245
+ torch_dtype=torch_dtype,
246
+ attn_implementation=args.attn_implementation,
247
+ )
248
+ model = ModelForWordTask(
249
+ model=l2v.model,
250
+ merge_subwords=args.merge_subwords,
251
+ config=config,
252
+ torch_dtype=torch_dtype,
253
+ )
254
+
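+ # WordTaskTrainer._save stores the trained head as classifier.pt; load it on top of the frozen backbone.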
255
+ classifier_path = os.path.join(args.cls_addr, "classifier.pt")
256
+ if os.path.exists(classifier_path):
257
+ print(f"Loading classifier from {classifier_path}")
258
+ model.classifier = torch.load(classifier_path)
259
+ else:
260
+ raise ValueError(f"classifier does not exist in {classifier_path}")
261
+
262
+ elif args.model_class == "auto":
263
+ model = AutoModelForTokenClassification.from_pretrained(
264
+ args.model_name_or_path,
265
+ num_labels=len(LABELS[args.dataset_name][args.task]),
266
+ id2label={
267
+ i: lab for (lab, i) in LABELS[args.dataset_name][args.task].items()
268
+ },
269
+ label2id=LABELS[args.dataset_name][args.task],
270
+ )
271
+ else:
272
+ raise ValueError(
273
+ f"{args.model_class} is not implemented. Only auto and custom model_class options are valid."
274
+ )
275
+
276
+ model = model.cuda()
277
+
278
+ raw_datasets = load_dataset(args.dataset_name, split="test")
279
+
280
+ def tokenize_and_align_labels(examples):
281
+ task = args.task
282
+ tokenized_inputs = tokenizer(
283
+ examples["tokens"],
284
+ truncation=True,
285
+ is_split_into_words=True,
286
+ padding="max_length",
287
+ max_length=args.max_seq_length,
288
+ return_tensors="pt",
289
+ )
290
+
291
+ labels = []
292
+ words = []
293
+ for i, label in enumerate(examples[task]):
294
+ if args.retroactive_labels in ["same_token"]:
295
+ # if args.retroactive_labels == "next_word":
296
+ # label = label[1:] + [-100]
297
+ word_ids = tokenized_inputs.word_ids(batch_index=i)
298
+ previous_word_idx = None
299
+ label_ids = []
300
+ for word_idx in word_ids:
301
+ if word_idx is None:
302
+ label_ids.append(-100)
303
+ elif word_idx != previous_word_idx:
304
+ label_ids.append(label[word_idx])
305
+ else:
306
+ label_ids.append(-100)
307
+ previous_word_idx = word_idx
308
+ labels.append(label_ids)
309
+ word_ids = [-1 if w is None else w for w in word_ids]
310
+ words.append(word_ids)
311
+ elif args.retroactive_labels == "next_token":
312
+ word_ids = tokenized_inputs.word_ids(batch_index=i)
313
+ previous_word_idx = None
314
+ label_ids = []
315
+ for word_idx in word_ids:
316
+ if word_idx is None:
317
+ label_ids.append(-100)
318
+ elif word_idx != previous_word_idx:
319
+ label_ids.append(label[word_idx])
320
+ else:
321
+ label_ids.append(-100)
322
+ previous_word_idx = word_idx
323
+ label_ids.append(-100)
324
+ labels.append(label_ids[1:])
325
+ word_ids = word_ids[1:] + [None]
326
+ word_ids = [-1 if w is None else w for w in word_ids]
327
+ words.append(word_ids)
328
+ else:
329
+ raise ValueError(
330
+ f"retroactive_labels {args.retroactive_labels} is not implemented."
331
+ )
332
+
333
+ tokenized_inputs["labels"] = torch.tensor(labels)
334
+ if args.model_class == "custom":
335
+ tokenized_inputs["token_type_ids"] = words
336
+ return tokenized_inputs
337
+
338
+ tokenized_dataset = raw_datasets.map(
339
+ tokenize_and_align_labels,
340
+ batched=True,
341
+ remove_columns=list(LABELS[args.dataset_name].keys()) + ["tokens", "id"],
342
+ )
343
+ with torch.no_grad():
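+ # Manual batched inference over the test split: argmax the per-token logits and accumulate predictions and labels; only positions whose label is not -100 are scored below.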
344
+ predictions = None
345
+ labels = None
346
+ for batch_begin in tqdm(
347
+ torch.arange(0, len(tokenized_dataset), args.batch_size)
348
+ ):
349
+ features = {
350
+ "input_ids": torch.tensor(
351
+ tokenized_dataset[batch_begin : batch_begin + args.batch_size][
352
+ "input_ids"
353
+ ]
354
+ ).to(model.device),
355
+ "attention_mask": torch.tensor(
356
+ tokenized_dataset[batch_begin : batch_begin + args.batch_size][
357
+ "attention_mask"
358
+ ]
359
+ ).to(model.device),
360
+ }
361
+ if (
362
+ "token_type_ids"
363
+ in tokenized_dataset[batch_begin : batch_begin + args.batch_size]
364
+ ):
365
+ features["token_type_ids"] = torch.tensor(
366
+ tokenized_dataset[batch_begin : batch_begin + args.batch_size][
367
+ "token_type_ids"
368
+ ]
369
+ ).to(model.device)
370
+
371
+ labs = torch.tensor(
372
+ tokenized_dataset[batch_begin : batch_begin + args.batch_size]["labels"]
373
+ )
374
+
375
+ logits = model(**features).logits
376
+ preds = torch.argmax(logits, dim=-1)
377
+ if predictions is None:
378
+ predictions = preds
379
+ labels = labs
380
+ else:
381
+ predictions = torch.concatenate((predictions, preds))
382
+ labels = torch.concatenate((labels, labs))
383
+
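+ # Micro-averaged precision over all non-ignored positions; for single-label token classification this is equivalent to token-level accuracy.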
384
+ precision_metric = evaluate.load("precision")
385
+ metrics = precision_metric.compute(
386
+ references=labels[labels != -100],
387
+ predictions=predictions[labels != -100],
388
+ average="micro",
389
+ )
390
+
391
+ with open(os.path.join(args.output_dir, "result_summary.json"), "w") as f:
392
+ json.dump(metrics, f)
393
+ print(metrics)
llm2vec/images/sample_efficient.png ADDED