Instructions for using FINAL-Bench/Aether-14B-5Attn-prev with libraries and local apps.
- Libraries
- Transformers
How to use FINAL-Bench/Aether-14B-5Attn-prev with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="FINAL-Bench/Aether-14B-5Attn-prev", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("FINAL-Bench/Aether-14B-5Attn-prev", trust_remote_code=True, dtype="auto")
```
- Local Apps
- vLLM
How to use FINAL-Bench/Aether-14B-5Attn-prev with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server (custom architecture, so remote code must be trusted):
vllm serve "FINAL-Bench/Aether-14B-5Attn-prev" --trust-remote-code
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Aether-14B-5Attn-prev",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker
```bash
docker model run hf.co/FINAL-Bench/Aether-14B-5Attn-prev
```
- SGLang
How to use FINAL-Bench/Aether-14B-5Attn-prev with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server (custom architecture, so remote code must be trusted):
python3 -m sglang.launch_server \
  --model-path "FINAL-Bench/Aether-14B-5Attn-prev" \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Aether-14B-5Attn-prev",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "FINAL-Bench/Aether-14B-5Attn-prev" \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Aether-14B-5Attn-prev",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

- Docker Model Runner
How to use FINAL-Bench/Aether-14B-5Attn-prev with Docker Model Runner:
```bash
docker model run hf.co/FINAL-Bench/Aether-14B-5Attn-prev
```
AETHER-Pilot-14B-5Attn (prev)
This model is a deliverable produced under the Government Advanced GPU Support Project and is currently undergoing further refinement.
The -prev suffix marks a preview checkpoint taken before this refinement. Additional English and math training (SFT v4) is being carried out separately.
Overview
AETHER is a domestically developed Korean foundation model built on a hybrid architecture designed from scratch. Its key idea is to use not a single attention mechanism but five different ones, arranged in a 5×5 Latin Square.
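The card does not publish which attention type sits at which layer, so the following is a hypothetical sketch of one valid layout. It uses the cyclic construction L[i][j] = (2i + j) mod 5, which for n = 5 places each of the five attention types once in every row, column, and both diagonals, and it assumes the 25 layers are read row by row. `ATTN_TYPES`, `latin_square`, `is_diagonal_latin`, and `layer_schedule` are illustrative names, not part of the model's code.

```python
# Hypothetical sketch: one way to arrange 5 attention types over 25 layers
# as a 5x5 Latin square. Not the model's actual layer assignment.
ATTN_TYPES = ["MLA", "Full", "Slide", "GDN", "Mamba2"]


def latin_square(n: int = 5) -> list[list[int]]:
    """Cyclic square L[i][j] = (2*i + j) mod n.

    For n = 5 the coefficients 1 (rows, anti-diagonal), 2 (columns) and
    3 (main diagonal) are all invertible mod 5, so every row, column and
    both diagonals contain each symbol exactly once.
    """
    return [[(2 * i + j) % n for j in range(n)] for i in range(n)]


def is_diagonal_latin(grid: list[list[int]]) -> bool:
    """Check that rows, columns and both diagonals each hold every symbol once."""
    n = len(grid)
    full = set(range(n))
    rows_ok = all(set(row) == full for row in grid)
    cols_ok = all({grid[i][j] for i in range(n)} == full for j in range(n))
    diag_ok = {grid[i][i] for i in range(n)} == full
    anti_ok = {grid[i][n - 1 - i] for i in range(n)} == full
    return rows_ok and cols_ok and diag_ok and anti_ok


def layer_schedule() -> list[str]:
    """Assumed 25-layer order: read the grid row by row (layer k = 5*row + col)."""
    return [ATTN_TYPES[s] for row in latin_square(len(ATTN_TYPES)) for s in row]


if __name__ == "__main__":
    for k, name in enumerate(layer_schedule()):
        print(f"layer {k:2d}: {name}")
```

The multiplier 2 is deliberate: the simpler square (i + j) mod 5 satisfies rows, columns, and the main diagonal, but repeats a single symbol along the entire anti-diagonal.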
| Item | Value |
|---|---|
| Total params | ~14.7B |
| Active params | ~3–4B (MoE top-5) |
| Layers | 25 (5×5 Latin Square) |
| Hidden size | 4096 |
| Intermediate | 12288 |
| Attention heads | 32 (GQA, KV 8) |
| Experts | 25 (top-5) + 1 shared |
| Vocab | 151,936 (Qwen tokenizer) |
| Context | 4096 |
| dtype | bfloat16 |
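As a quick consistency check on the table, the values can be gathered into a small config object and a few derived quantities computed. This is a hypothetical sketch: the field names are illustrative rather than the model's actual `config.json` keys, and `head_dim = hidden_size / num_heads` is an assumption (common, but not stated in the card).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AetherSpec:
    """Values from the spec table; field names are illustrative, not config.json keys."""
    num_layers: int = 25
    hidden_size: int = 4096
    intermediate_size: int = 12288
    num_attention_heads: int = 32
    num_kv_heads: int = 8
    num_routed_experts: int = 25
    top_k: int = 5
    num_shared_experts: int = 1
    vocab_size: int = 151_936
    max_context: int = 4096

    def __post_init__(self) -> None:
        # GQA needs the query heads to split evenly across key/value heads.
        if self.num_attention_heads % self.num_kv_heads:
            raise ValueError("num_attention_heads must be divisible by num_kv_heads")
        if self.hidden_size % self.num_attention_heads:
            raise ValueError("hidden_size must be divisible by num_attention_heads")

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / num_attention_heads.
        return self.hidden_size // self.num_attention_heads

    @property
    def gqa_group_size(self) -> int:
        # Number of query heads that share one key/value head.
        return self.num_attention_heads // self.num_kv_heads

    @property
    def experts_per_token(self) -> int:
        # Routed experts picked by top-k plus the always-on shared expert.
        return self.top_k + self.num_shared_experts


if __name__ == "__main__":
    spec = AetherSpec()
    print(spec.head_dim, spec.gqa_group_size, spec.experts_per_token)  # 128 4 6
```

With the published numbers, each key/value head is shared by 4 query heads, and each token is processed by 6 expert FFNs (5 routed plus 1 shared).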
Core architecture
- Five-way hybrid attention (5×5 Latin Square): MLA / Full / Slide / GDN / Mamba2, each placed exactly once in every row, column, and diagonal
- Dual-wheel cycle (Oheng/오행, the Five Elements): saeng (生, generating) and geuk (克, overcoming) gates
- Spectral attention: frequency selector (F=12)
- MoE: 5 elements × 5 specialties = 25 experts (top-5) + 1 shared Taegeuk (태극) expert
- Metacognition head: identifies five types of reasoning
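The MoE bullet can be illustrated with a generic top-k router plus an always-on shared expert. This is a minimal NumPy sketch of that standard pattern, not AETHER's actual router: the softmax-over-selected-experts gating and the `expert_index(element, specialty)` numbering of the 5 × 5 grid are assumptions, the experts are toy linear maps, and the Oheng gates are not modelled.

```python
import numpy as np

NUM_ELEMENTS, NUM_SPECIALTIES = 5, 5  # 5 x 5 = 25 routed experts
NUM_EXPERTS = NUM_ELEMENTS * NUM_SPECIALTIES


def expert_index(element: int, specialty: int) -> int:
    """Hypothetical flat numbering of the 5 x 5 expert grid (not from the card)."""
    return element * NUM_SPECIALTIES + specialty


def moe_forward(x, router_w, experts, shared_expert, top_k=5):
    """Top-k routed MoE layer with one always-on shared expert.

    x:             (tokens, hidden) activations
    router_w:      (hidden, num_experts) router projection
    experts:       list of callables mapping a (hidden,) vector to (hidden,)
    shared_expert: callable mapping (tokens, hidden) to (tokens, hidden)
    Returns the output, chosen expert ids (tokens, top_k), and their gates.
    """
    logits = x @ router_w                                # (T, E) router scores
    topk_idx = np.argsort(-logits, axis=-1)[:, :top_k]   # (T, k) highest-scoring experts
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    # Softmax over the selected experts only (an assumed, commonly used choice).
    gates = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)

    # The shared expert sees every token; np.array copies, so x is never modified.
    out = np.array(shared_expert(x), dtype=np.float64)
    for t in range(x.shape[0]):
        for slot in range(top_k):
            out[t] += gates[t, slot] * experts[topk_idx[t, slot]](x[t])
    return out, topk_idx, gates


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = 16
    # Toy experts: random linear maps standing in for the real FFN experts.
    mats = [rng.standard_normal((hidden, hidden)) * 0.1 for _ in range(NUM_EXPERTS)]
    experts = [lambda v, W=W: v @ W for W in mats]
    shared_W = rng.standard_normal((hidden, hidden)) * 0.1
    y, idx, gates = moe_forward(
        rng.standard_normal((4, hidden)),
        rng.standard_normal((hidden, NUM_EXPERTS)),
        experts,
        lambda v: v @ shared_W,
    )
    print(y.shape, idx[0], gates[0].round(3))
```

Only 5 of the 25 routed experts run per token, which is how a ~14.7B-parameter model can have a much smaller active parameter count.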
Current status
- Research preview: a checkpoint in which Korean instruction-following and AI self-recognition have been established through SFT.
- Refinement in progress: English and math capabilities are being strengthened (SFT v4), and preference optimization is under way.
- This model is an interim research deliverable and may have limitations in areas such as factuality and mathematical computation.
Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "FINAL-Bench/Aether-14B-5Attn-prev"

# Custom architecture: trust_remote_code=True is required for tokenizer and model.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

# Korean prompt: "Hello, please introduce yourself."
msgs = [{"role": "user", "content": "안녕하세요, 자기소개 해주세요."}]
ids = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt").to(model.device)

out = model.generate(
    ids, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.15
)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
```
Requirements
```bash
pip install torch transformers accelerate  # accelerate is needed for device_map="auto"
pip install flash-linear-attention  # required for the GatedDeltaNet / Mamba2 layers
```
⚠️ This model uses a custom architecture, so `trust_remote_code=True` is required, and its GDN and Mamba2 layers depend on the flash-linear-attention (`fla`) package.
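Because the GDN and Mamba2 layers cannot run without `fla`, a quick preflight check can surface a missing dependency before you try to load the 14B checkpoint. The sketch below only tests whether each module can be located with `importlib.util.find_spec`; the import name `fla` for flash-linear-attention and the package list in `REQUIRED` are assumptions for illustration.

```python
import importlib.util

# pip name -> import name. The "fla" import name is an assumption.
REQUIRED = {
    "torch": "torch",
    "transformers": "transformers",
    "flash-linear-attention": "fla",
}


def missing_packages(required: dict[str, str] = REQUIRED) -> list[str]:
    """Return pip names of packages whose top-level module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All required packages found.")
```

`find_spec` only locates the module without importing it, so the check is fast and does not initialize CUDA.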
Provenance
- Project type: deliverable of the Government Advanced GPU Support Project
- Architecture: AETHER 5×5 Latin Square Hybrid-Attention MoE
- Tokenizer: compatible with the Qwen tokenizer (vocab 151,936)
The checkpoint and this model card may be updated as refinement progresses.