## Hydra

[Hydra](https://github.com/facebookresearch/hydra) is an open-source Python
framework that simplifies the development of research and other complex
applications. The key feature is the ability to dynamically create a
hierarchical configuration by composition and override it through config files
and the command line. The name Hydra comes from its ability to run multiple
similar jobs - much like a Hydra with multiple heads.
## Motivation

Until recently, all components in fairseq were configured through a shared
`args` namespace that was created at application startup. Components declared
their own `add_args` method to update the argparse parser, hoping that the names
would not clash with arguments from other components. While this model works for
smaller applications, it became problematic as fairseq grew and was integrated
into other applications. To determine how to configure each component, one
needed to a) examine what args were added by that component, and b) read the
code to figure out which shared arguments it used that were added elsewhere.
Reproducing models involved sharing commands that often contained dozens of
command line switches.

The model described above is still supported by fairseq for backward
compatibility, but will be deprecated at some point in the future.
New components in fairseq should now create a dataclass that encapsulates all
parameters required to configure the component. The dataclass is registered
along with the component, and fairseq takes care of constructing and providing
this configuration object to the component's constructor. Note that sharing
parameters can optionally still work, but one has to explicitly point to the
"source of truth" (see the inheritance example below). These changes make
components in fairseq more independent and re-usable by other applications: all
that is needed to create a component is to initialize its dataclass and
overwrite some of the defaults.
While configuring fairseq through the command line (using either the legacy
argparse-based or the new Hydra-based entry points) is still fully supported,
you can now take advantage of configuring fairseq completely or piece-by-piece
through hierarchical YAML configuration files. These files can also be shipped
as examples that others can use to run an identically configured job.

Additionally, Hydra has a rich and growing [library of
plugins](https://github.com/facebookresearch/hydra/tree/master/plugins) that
provide functionality such as hyperparameter sweeping (including Bayesian
optimization through the [Ax](https://github.com/facebook/Ax) library), job
launching across various platforms, and more.
## Creating or migrating components

In general, each new (or updated) component should provide a companion
[dataclass](https://www.python.org/dev/peps/pep-0557/). These dataclasses are
typically located in the same file as the component and are passed as arguments
to the `register_*()` functions. Top-level configs that should be present in
every fairseq application are placed in the
[global](fairseq/dataclass/configs.py) config file and added to the
`FairseqConfig` object.

Each dataclass is a plain-old-data object, similar to a `NamedTuple`. These
classes are decorated with a `@dataclass` decorator and typically inherit from
`FairseqDataclass` (which adds some functionality for backward compatibility).
Each field must have a type, and generally has metadata (such as a help string)
and a default value. Only primitive types or other config objects are allowed as
data types for each field.
#### Example:

```python
from dataclasses import dataclass, field
from fairseq.dataclass import FairseqDataclass


@dataclass
class InteractiveConfig(FairseqDataclass):
    buffer_size: int = field(
        default=0,
        metadata={
            "help": "read this many sentences into a buffer before processing them"
        },
    )
    input: str = field(
        default="-",
        metadata={"help": "file to read from; use - for stdin"},
    )
```
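To illustrate "initialize its dataclass and overwrite some of the defaults," here is a minimal, self-contained sketch that uses a plain `dataclass` as a stand-in for `FairseqDataclass` (which is not imported here):

```python
from dataclasses import dataclass, field


# Plain dataclass standing in for FairseqDataclass in this sketch.
@dataclass
class InteractiveConfig:
    buffer_size: int = field(
        default=0,
        metadata={"help": "read this many sentences into a buffer before processing them"},
    )
    input: str = field(
        default="-",
        metadata={"help": "file to read from; use - for stdin"},
    )


# Creating a component's config is just instantiating the dataclass
# and overriding whichever defaults you care about.
cfg = InteractiveConfig(buffer_size=64)
```

Any field not passed explicitly (here, `input`) keeps its declared default.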
### Inheriting values

Some components require sharing a value. For example, a learning rate scheduler
and an optimizer may both need to know the initial learning rate value. One can
declare a field that, by default, will inherit its value from another config
node in the same hierarchy:
```python
from typing import List

from omegaconf import II


@dataclass
class FairseqAdamConfig(FairseqDataclass):
    ...
    lr: List[float] = II("optimization.lr")
    ...
```
`II("optimization.lr")` is syntactic sugar for `"${optimization.lr}"`, which is
the value one can use in a YAML config file or on the command line to achieve
the same effect. Note that this assumes that there is an "optimization" config
object in the root config and that it has a field called "lr".
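The resolution semantics can be illustrated with a small stdlib-only sketch; in fairseq this is actually performed by OmegaConf, not by hand-written code like this:

```python
# Illustrative sketch of how a "${dotted.path}" reference resolves against a
# root config; OmegaConf implements this for real.
def resolve(value, root):
    """Resolve a "${a.b.c}" reference by walking the nested root config."""
    if isinstance(value, str) and value.startswith("${") and value.endswith("}"):
        node = root
        for key in value[2:-1].split("."):
            node = node[key]
        return node
    return value


root = {
    "optimization": {"lr": [0.25]},
    # II("optimization.lr") is sugar for the string below:
    "optimizer": {"lr": "${optimization.lr}"},
}
lr = resolve(root["optimizer"]["lr"], root)  # [0.25]
```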
### Tasks and Models

Creating Tasks and Models works the same as before, except that legacy
implementations now inherit from `LegacyFairseq*` base classes, while new
components inherit from `FairseqTask` and `FairseqModel` and provide a dataclass
to the `register_*()` functions.
#### Task example:

```python
@dataclass
class LanguageModelingConfig(FairseqDataclass):
    data: Optional[str] = field(
        default=None, metadata={"help": "path to data directory"}
    )
    ...


@register_task("language_modeling", dataclass=LanguageModelingConfig)
class LanguageModelingTask(FairseqTask):
    ...

    @classmethod
    def setup_task(cls, cfg: LanguageModelingConfig):
        ...
```
#### Model example:

```python
@dataclass
class TransformerLanguageModelConfig(FairseqDataclass):
    activation_fn: ChoiceEnum(utils.get_available_activation_fns()) = field(
        default="relu", metadata={"help": "activation function to use"}
    )
    dropout: float = field(default=0.1, metadata={"help": "dropout probability"})
    ...


@register_model("transformer_lm", dataclass=TransformerLanguageModelConfig)
class TransformerLanguageModel(FairseqLanguageModel):
    ...

    @classmethod
    def build_model(cls, cfg: TransformerLanguageModelConfig, task: FairseqTask):
        ...
```
### Other components

Other components work as before, but they now take their configuration dataclass
as the only constructor argument:

```python
@dataclass
class MosesTokenizerConfig(FairseqDataclass):
    source_lang: str = field(default="en", metadata={"help": "source language"})
    ...


@register_tokenizer("moses", dataclass=MosesTokenizerConfig)
class MosesTokenizer(object):
    def __init__(self, cfg: MosesTokenizerConfig):
        ...
```
Note that if you are adding a new registry for a new set of components, you need
to add it to the `FairseqConfig` object in `fairseq/dataclass/configs.py`:

```python
@dataclass
class FairseqConfig(object):
    ...
    my_new_registry: Any = None
```
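The `register_*()` pattern above can be sketched as a decorator that records both the component class and its dataclass in a registry. This is only an illustration of the pattern; fairseq's real registration machinery does more work (argument parsing hooks, config construction, etc.):

```python
from dataclasses import dataclass

# Illustrative registry, not fairseq's actual implementation.
TOKENIZER_REGISTRY = {}


def register_tokenizer(name, dataclass=None):
    def wrapper(cls):
        # Remember both the component class and its companion dataclass.
        TOKENIZER_REGISTRY[name] = (cls, dataclass)
        return cls
    return wrapper


@dataclass
class MosesTokenizerConfig:
    source_lang: str = "en"


@register_tokenizer("moses", dataclass=MosesTokenizerConfig)
class MosesTokenizer:
    def __init__(self, cfg):
        self.cfg = cfg


# fairseq similarly looks up the class, builds its config object,
# and passes it to the constructor:
tok_cls, cfg_cls = TOKENIZER_REGISTRY["moses"]
tokenizer = tok_cls(cfg_cls())
```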
## Training with `fairseq-hydra-train`

To fully take advantage of the configuration flexibility offered by Hydra, you
may want to train new models using the `fairseq-hydra-train` entry point. Legacy
CLI tools such as `fairseq-train` will remain supported for the foreseeable
future but will be deprecated eventually.

On startup, Hydra will create a configuration object that contains a hierarchy
of all the necessary dataclasses populated with their default values in the
code. The default values are overwritten by values found in YAML files in the
`fairseq/config` directory (which currently sets minimal defaults) and then
further overwritten by values provided through command line arguments.

Some of the most common use cases are shown below:
### 1. Override default values through command line:

```shell script
$ fairseq-hydra-train \
    distributed_training.distributed_world_size=1 \
    dataset.batch_size=2 \
    task.data=data-bin \
    model=transformer_lm/transformer_lm_gpt \
    task=language_modeling \
    optimization.max_update=5000
```
Note that along with explicitly providing values for parameters such as
`dataset.batch_size`, this also tells Hydra to overlay configuration found in
`fairseq/config/model/transformer_lm/transformer_lm_gpt.yaml` over the default
values in the dataclass. If you want to train a model without specifying a
particular architecture you can simply specify `model=transformer_lm`. This only
works for migrated tasks and models.
### 2. Replace bundled configs with an external config:

```shell script
$ fairseq-hydra-train \
    --config-dir /path/to/external/configs \
    --config-name wiki103
```
where `/path/to/external/configs/wiki103.yaml` contains:

```yaml
# @package _group_

model:
  _name: transformer_lm
distributed_training:
  distributed_world_size: 1
dataset:
  batch_size: 2
task:
  _name: language_modeling
  data: /path/to/data
  add_bos_token: false
  max_target_positions: 1024
optimization:
  max_update: 50000
  lr: [ 0.25 ]
criterion: cross_entropy
optimizer: adam
lr_scheduler:
  _name: cosine
```
Note that the bundled configs from the `fairseq/config` directory are not used
here; however, the defaults from each dataclass will still be used (unless
overwritten by your external config).
Additionally, you can choose to break up your configs by creating a directory
structure in the same location as your main config file, with directories named
after the top-level fields (such as "model", "dataset", etc.), and placing
config files with meaningful names that would populate that specific section of
your top-level config file (for example, you might have
`model/small_transformer_lm.yaml`, `model/big_transformer_lm.yaml`, etc.). You
can then specify the correct configuration via the command line, defaults in the
main config, or even launch all of them as a sweep (see the Hydra documentation
on how to do this).
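As an illustration, a hypothetical `model/small_transformer_lm.yaml` in such a directory might look like the following (the specific field names and values here are assumptions for the sake of the example, not copied from fairseq):

```yaml
# @package _group_
_name: transformer_lm
decoder_layers: 6
decoder_embed_dim: 512
```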
### 3. Add an external config directory to Hydra search path:

This allows combining default configuration (including using any bundled config
files), while specifying your own config files for some parts of the
configuration.
```shell script
$ fairseq-hydra-train \
    distributed_training.distributed_world_size=1 \
    dataset.batch_size=2 \
    task.data=/path/to/data/ \
    model=transformer_lm/2_layers \
    task=language_modeling \
    optimization.max_update=5000 \
    --config-dir /path/to/external/configs
```
where `/path/to/external/configs` has the following structure:

```
.
+-- model
|   +-- transformer_lm
|   |   +-- 2_layers.yaml
```
and `2_layers.yaml` contains a copy of `transformer_lm_gpt.yaml` but with
`decoder_layers` set to 2. You can add other configs to configure other
components as well.
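Sketched out, such a `2_layers.yaml` would contain everything from `transformer_lm_gpt.yaml` unchanged, with a single field overridden (the elided fields are not reproduced here):

```yaml
# @package _group_
# ... all other fields copied verbatim from transformer_lm_gpt.yaml ...
decoder_layers: 2
```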