|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
""" |
|
|
The elastic agent is the control plane of torchelastic. It is a process |
|
|
that launches and manages underlying worker processes. The agent is |
|
|
responsible for: |
|
|
|
|
|
1. Working with distributed torch: the workers are started with all the |
|
|
necessary information to successfully and trivially call |
|
|
``torch.distributed.init_process_group()``. |
|
|
|
|
|
2. Fault tolerance: monitors workers and upon detecting worker failures |
|
|
or unhealthiness, tears down all workers and restarts everyone. |
|
|
|
|
|
3. Elasticity: Reacts to membership changes and restarts workers with the new |
|
|
members. |
|
|
|
|
|
The simplest agents are deployed per node and works with local processes. |
|
|
A more advanced agent can launch and manage workers remotely. Agents can |
|
|
be completely decentralized, making decisions based on the workers it manages. |
|
|
Or can be coordinated, communicating to other agents (that manage workers |
|
|
in the same job) to make a collective decision. |
|
|
""" |
|
|
|
|
|
from .api import ( |
|
|
ElasticAgent, |
|
|
RunResult, |
|
|
SimpleElasticAgent, |
|
|
Worker, |
|
|
WorkerGroup, |
|
|
WorkerSpec, |
|
|
WorkerState, |
|
|
) |
|
|
from .local_elastic_agent import TORCHELASTIC_ENABLE_FILE_TIMER, TORCHELASTIC_TIMER_FILE |
|
|
|