---
title: README
emoji: 🏢
colorFrom: yellow
colorTo: indigo
sdk: static
pinned: false
---
# **TheStage AI Platform**

Inference optimization for LLMs, diffusion, and voice. Self-hosted or cloud. Works on NVIDIA GPUs, Apple Silicon, and edge devices.

**Links:**

[Web App](https://app.thestage.ai/) • [Docs](https://docs.thestage.ai/) • [Hugging Face](https://huggingface.co/TheStageAI) • [X](https://x.com/TheStageAI) • [LinkedIn](https://www.linkedin.com/company/thestageai) • [Discord](mailto:sergey@thestage.ai) (request invite) • [Email](mailto:support@thestage.ai)

---

# **What is TheStage AI**

TheStage AI is an inference optimization stack. It helps you compress, compile, and serve models while keeping control of the accuracy-versus-performance trade-off.
---

# **Products / Components**

- [**ANNA (Automatic Neural Network Acceleration)**](https://docs.thestage.ai/qlip/docs/source/anna_api.html)
  Automated compression analysis under user-defined constraints (size, MACs, latency, memory). Outputs a QlipConfig for compile and serve.
- [**Qlip**](https://docs.thestage.ai/qlip/docs/source/index.html)
  Full-stack optimization and inference framework. Quantization, sparsification, and compilation for NVIDIA GPUs (Apple Silicon supported). Produces pre-compiled (non-JIT) artifacts with dynamic shapes and mixed precision. Triton-based serving.
- [**Elastic Models**](https://docs.thestage.ai/tutorials/source/elastic_transformers.html)
  Qlip-optimized models with S / M / L / XL performance tiers (availability varies). L/M/S may include quantization or pruning for faster inference.
- [**TheStage CLI**](https://docs.thestage.ai/platform/src/thestage-ai-cli.html)
  Manage projects, tokens, and hardware from the terminal. Launch/monitor jobs, rent instances, and stream logs.
- [**TheStage Platform**](https://app.thestage.ai/)
  Web UI and APIs for instances, models, and deployments. Includes the [**Playground**](https://app.thestage.ai/) to test Elastic Models, switch hardware, and compare tiers before deployment.

---
# **Key features**

- **Elastic Models with S/M/L/XL tiers per model** (choose cost, quality, and memory balance; availability varies).
- **ANNA constraint-driven compression analysis** (outputs a QlipConfig for compile and serve).
- **Qlip compiler and runtime** (pre-compiled engines; no runtime JIT; dynamic shapes; mixed precision).
- **OpenAI-compatible HTTP serving** (deploy and scale models through a standard API).
- **Playground to test models and hardware** (compare performance and tiers before deployment).
- **Self-host or run in the cloud** (use your own infrastructure; keep data private).
- **Hardware support: NVIDIA (incl. Jetson), Apple Silicon, and edge targets** (NPUs, DSPs, and MCUs per model).
- **Comprehensive tutorials and documentation** (from setup to evaluation and production).

---
# **Quickstart**

- Install the CLI: `pip install thestage`
- Set your token: `thestage config set --api-token <YOUR_API_TOKEN>` (get it in the web app)
- Use `elastic_models` in your code and choose a tier (S/M/L/XL); see the docs for full snippets.
- Diffusion and voice examples are in the docs.

---
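The tier-selection step above can be sketched in Python. This is a hedged sketch, not the definitive API: the import path `elastic_models.transformers`, the `mode` argument, and the model id are assumptions based on TheStage AI's published model cards and may differ per model, so confirm the exact usage in the docs.

```python
# Hedged sketch: the import path, the `mode` argument, and the model id below
# are assumptions -- confirm the exact API in the TheStage AI docs.
MODEL_ID = "TheStageAI/Elastic-Llama-3.1-8B-Instruct"  # hypothetical example id
TIER = "S"  # one of "S", "M", "L", "XL" (availability varies per model)

try:
    # Assumed import path, following the pattern shown on Elastic model cards.
    from elastic_models.transformers import AutoModelForCausalLM

    # `mode` is assumed to select the performance tier of the pre-compiled engine.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, mode=TIER)
except ImportError:
    # elastic_models is not installed; set up the tooling first:
    #   pip install thestage
    #   thestage config set --api-token <YOUR_API_TOKEN>
    model = None
```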
# **Serving**

Serving through an OpenAI-compatible API is documented, including a Modal-based flow for single- and multi-GPU deployments.

Start here: [Docs](https://docs.thestage.ai/)

---
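Because the serving layer is OpenAI-compatible, any OpenAI-style client can talk to a deployment. The sketch below builds a standard chat-completion request body using only the Python standard library; the base URL and model name are placeholders for your own deployment, not values from TheStage AI.

```python
import json

# Placeholders: substitute your deployment's endpoint and served model name.
BASE_URL = "https://your-deployment.example.com/v1"  # hypothetical endpoint
ENDPOINT = f"{BASE_URL}/chat/completions"

def build_chat_request(model: str, user_message: str, max_tokens: int = 128) -> bytes:
    """Serialize an OpenAI-compatible chat-completion request body."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload).encode("utf-8")

body = build_chat_request("elastic-model-s", "Hello!")
# POST `body` to ENDPOINT with an `Authorization: Bearer <token>` header,
# e.g. via urllib.request, requests, or the openai client with base_url=BASE_URL.
```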
# **Supported hardware**

- NVIDIA GPUs (incl. Jetson where applicable)
- Apple Silicon
- Edge/embedded devices