Update README.md
Athena-4-15B is a 15-billion-parameter multimodal reasoning model designed for h…

## Key capabilities

* Strong textual reasoning (math, logic, chain-of-thought style outputs).
* Multimodal understanding: able to process image+text prompts for captioning and image reasoning via an image-text processor.
* Optimised for instruction-following use cases (SFT on curated instruction data).

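As a sketch of the image+text prompting shape described above: the message layout below follows the common Hugging Face image-text chat convention, and the commented-out pipeline call mirrors the `pipe(text=messages)` usage shown later in this README. The checkpoint name is a placeholder assumption, not a published Athena repo id.

```python
# Hypothetical sketch: a chat-style image+text prompt for an image-text processor.
# The message schema follows the common Hugging Face image-text chat convention;
# the checkpoint id in the commented call below is a placeholder assumption.

def build_image_text_messages(image_url: str, question: str) -> list:
    """Build a single-turn chat message mixing an image and a text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_image_text_messages(
    "https://example.com/chart.png",
    "What trend does this chart show?",
)

# With a real checkpoint this would be handed to a pipeline, e.g.:
# from transformers import pipeline
# pipe = pipeline("image-text-to-text", model="your-org/Athena-4-15B")  # placeholder repo id
# out = pipe(text=messages)
```

The helper keeps the image and text parts in one `content` list so a processor can interleave them in a single turn.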
---

## Highlights / Benchmark notes

* Competitive performance on reasoning and multimodal benchmarks reported by the Apriel team (reported scores, e.g., Artificial Analysis index and IFBench in their model card). ([Hugging Face][1])
* Targeted to deliver high capability per parameter (aiming for frontier-level reasoning while keeping model size ~15B).

---

## Limitations

* Generates internal chain-of-thought-style reasoning before the final answer by design; this can increase token usage and latency. The Apriel upstream notes that the model explicitly produces stepwise reasoning followed by a final response, so this behaviour may need post-processing or filtering depending on your deployment.
* The model was trained and fine-tuned on curated datasets prioritising reasoning; domain coverage should be validated for specialised domains (medical, legal, etc.). ([Hugging Face][1])

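Since the reasoning trace may need filtering before reaching end users, here is a minimal post-processing sketch. The `[BEGIN FINAL RESPONSE]` / `[END FINAL RESPONSE]` delimiters are an assumption for illustration; check the model's actual chat template for the markers it emits.

```python
# Hypothetical post-processing sketch: keep only the final answer from an
# output that contains an explicit reasoning trace. The delimiters below are
# assumptions for illustration; consult the chat template for the real markers.

FINAL_MARKER = "[BEGIN FINAL RESPONSE]"
END_MARKER = "[END FINAL RESPONSE]"

def extract_final_answer(generated: str) -> str:
    """Return the text between the final-response markers, or the whole
    text when no marker is present, so the filter degrades gracefully."""
    start = generated.find(FINAL_MARKER)
    if start == -1:
        return generated.strip()
    answer = generated[start + len(FINAL_MARKER):]
    end = answer.find(END_MARKER)
    if end != -1:
        answer = answer[:end]
    return answer.strip()

raw = (
    "Let me reason step by step... therefore x = 4.\n"
    "[BEGIN FINAL RESPONSE]\nx = 4\n[END FINAL RESPONSE]"
)
print(extract_final_answer(raw))  # -> x = 4
```

Falling back to the full text on a missing marker avoids silently dropping answers when the model skips the trace.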
---

## Training summary (reference implementation)

* **Mid-training / continual pretraining:** Extensive CPT on reasoning-focused text and multimodal interleaved image-text corpora to strengthen reasoning capabilities.
* **Supervised fine-tuning (SFT):** Fine-tuned on >2M high-quality text samples consisting of mathematical problems, coding tasks, instruction-following data, and conversational examples. No RLHF was applied in the referenced Apriel workflow.
* **Training hardware (reference):** Apriel reports large-scale training hardware usage (e.g., H100 clusters) in their public card; Athena’s training choices may differ but were informed by this regimen.

---

## Evaluation

* Third-party and open-benchmark evaluations were used in the Apriel reference (Artificial Analysis for text benchmarks; VLMEvalKit/OpenCompass for image evaluation). Reported scores indicated strong reasoning performance relative to model size. Use-case-specific evaluation is recommended before production deployment.

---

## License

Use a permissive license consistent with your organisation’s policy. The Apriel reference model uses an MIT license; check and align Athena’s license to your legal requirements before publishing.

---

If you publish results using Athena, include a citation to the design and traini…

## Implementation notes & recommendations

* **Prompting:** Athena benefits from prompts that ask for stepwise reasoning when the trace is required, but for concise outputs prefer instructing the model to “Answer concisely” or to “Provide only the final answer.”
* **Latency vs. accuracy:** Expect higher token usage and slightly longer generation time due to explicit internal reasoning; benchmark inference cost and consider temperature/top-k adjustments for production.
* **Safety pipeline:** Add toxicity checks, hallucination detection, and a facts-verification layer for external claims before surfacing to end users.
* **Evaluation:** Run domain-specific benchmarks and human evaluations for calibration prior to public release.
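The prompting and sampling advice above can be combined into a small request sketch. The system instruction comes from the prompting note; the generation parameters are illustrative starting points to benchmark against, not values validated for Athena.

```python
# Illustrative sketch: a concise-answer prompt plus sampling settings to sweep
# when benchmarking latency vs. accuracy. All numeric values are assumptions,
# not tuned recommendations.

def make_request(question: str, concise: bool = True) -> dict:
    """Build a chat request that either suppresses or invites the trace."""
    system = (
        "Provide only the final answer."
        if concise
        else "Think step by step, then give your answer."
    )
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        # Knobs to sweep during inference-cost benchmarking:
        "generation_kwargs": {
            "max_new_tokens": 512,  # cap token usage from internal reasoning
            "temperature": 0.6,     # placeholder; tune per task
            "top_k": 50,            # placeholder; tune per task
        },
    }

req = make_request("What is 17 * 24?")
```

Toggling `concise` lets the same harness measure how much of the cost comes from the explicit reasoning trace.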