🧠 MathX-5M by XenArcAI — Scalable Math Reasoning for Smarter LLMs
Introducing MathX-5M, a high-quality, instruction-tuned dataset built to supercharge mathematical reasoning in large language models. Its 5 million rigorously filtered examples span everything from basic arithmetic to advanced calculus. They are curated from public sources and enhanced with synthetic data.
🔍 Key Highlights:
- Step-by-step reasoning with verified answers
- Covers algebra, geometry, calculus, logic, and more
- RL-validated correctness and multi-stage filtering
- Ideal for fine-tuning, benchmarking, and educational AI
The hype is real: a mysterious gpt2-chatbot model has appeared on the LLM Arena Leaderboard 👀. It seems to be at least on par with the top-performing models (closed and open).
To try it out, go to https://chat.lmsys.org/, click the Direct Chat tab, and select gpt2-chatbot.
🚀 Sentence Transformers v2.7.0 is out! Featuring a new loss function, easier Matryoshka model inference & evaluation, CrossEncoder improvements & Intel Gaudi2 Accelerator support. Details:
1️⃣ A new loss function: CachedGISTEmbedLoss
This loss function combines CachedMultipleNegativesRankingLoss and GISTEmbedLoss, both of which are already excellent. The caching mechanism allows for much higher batch sizes with constant memory usage, which boosts training performance. The GIST part introduces a guide model that steers the selection of in-batch negatives. This prevents false negatives, resulting in a stronger training signal.
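To make the "guide model prevents false negatives" idea concrete, here is a minimal pure-Python sketch. It is not the library's implementation. It assumes the guide's similarity for the true (anchor, positive) pair acts as a threshold, and that any in-batch candidate the guide scores higher than that is dropped from the negatives. The function names, the toy similarity matrices, and the temperature value are all hypothetical, chosen only for illustration.

```python
import math


def guided_negative_mask(guide_sim):
    """mask[i][j] is True if candidate j may serve as a negative for anchor i."""
    n = len(guide_sim)
    mask = []
    for i in range(n):
        # Guide model's similarity for the true (anchor, positive) pair.
        threshold = guide_sim[i][i]
        row = []
        for j in range(n):
            # Keep j only if it is not the positive itself and the guide does
            # not rate it as more similar to the anchor than the true positive.
            row.append(j != i and guide_sim[i][j] <= threshold)
        mask.append(row)
    return mask


def masked_contrastive_loss(train_sim, mask, temperature=0.01):
    """In-batch contrastive loss, averaged over anchors, skipping masked-out candidates."""
    losses = []
    for i, row in enumerate(train_sim):
        logits = [row[i] / temperature]  # the positive comes first
        logits += [row[j] / temperature for j in range(len(row)) if mask[i][j]]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum - logits[0])
    return sum(losses) / len(losses)


# Toy 3x3 similarity matrices: rows are anchors, columns are candidates.
guide_sim = [
    [0.80, 0.90, 0.10],  # candidate 1 looks like a false negative for anchor 0
    [0.20, 0.70, 0.30],
    [0.10, 0.20, 0.90],
]
train_sim = [
    [0.60, 0.65, 0.05],
    [0.10, 0.50, 0.20],
    [0.05, 0.10, 0.70],
]

mask = guided_negative_mask(guide_sim)
all_negatives = [[j != i for j in range(3)] for i in range(3)]
loss_guided = masked_contrastive_loss(train_sim, mask)
loss_plain = masked_contrastive_loss(train_sim, all_negatives)
```

In this toy batch, candidate 1 is removed from anchor 0's negatives, so the model being trained is no longer penalized for scoring that likely-relevant candidate highly.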
2️⃣ Automatic Matryoshka model truncation
Matryoshka models produce embeddings that are still useful after truncation. However, this truncation always had to be done manually, until now! We've added a truncate_dim option to the SentenceTransformer constructor. This also allows truncation when using HuggingFaceEmbeddings from LlamaIndex or LangChain.
3️⃣ Additionally, you can now specify truncate_dim in evaluators to measure performance after truncation. (Hint: it's surprisingly good, even for models not trained with MatryoshkaLoss, and it can speed up tasks such as clustering and retrieval.)
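As a rough illustration of what truncation means for downstream use, the sketch below keeps only the first few dimensions of each embedding and checks whether a similarity ranking survives. It is pure Python and does not call Sentence Transformers. The vectors are hypothetical and deliberately front-loaded (most of the signal sits in the leading dimensions) to mimic how a Matryoshka-style embedding behaves, and the helper names are made up for this example.

```python
import math


def truncate(embedding, dim):
    """Keep the first `dim` components; dim=None keeps the full vector."""
    return embedding[:dim]


def cosine(a, b):
    """Cosine similarity, which normalizes by vector length internally."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query, docs, dim=None):
    """Return document names sorted by similarity to the query, best first."""
    q = truncate(query, dim)
    return sorted(
        docs,
        key=lambda name: cosine(q, truncate(docs[name], dim)),
        reverse=True,
    )


# Hypothetical 6-dimensional embeddings with most signal in the first 3 dims.
query = [0.9, 0.1, 0.3, 0.05, -0.02, 0.01]
docs = {
    "close": [0.8, 0.2, 0.25, -0.03, 0.04, 0.02],
    "far": [-0.1, 0.9, -0.4, 0.02, 0.05, -0.01],
}

full_ranking = rank(query, docs)
truncated_ranking = rank(query, docs, dim=3)
```

Here the ranking is the same at 6 and at 3 dimensions, while each comparison touches half as many numbers. That is the trade-off the truncate_dim option in the evaluators lets you quantify on real data.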
4️⃣ CrossEncoder improvements
The CrossEncoder now supports push_to_hub to upload trained reranker models to Hugging Face. Additionally, CrossEncoders now support trust_remote_code to load models with custom modelling code.
5️⃣ Inference on Intel Gaudi2
If you have an Intel Gaudi2 Accelerator, Sentence Transformers now uses it automatically for even faster inference. No changes to your code are necessary: the device is detected automatically!
I'm very excited for the upcoming releases: I'm making great progress with a notable v3 refactor that should heavily improve the training process for embedding models!