---
license: mit
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- reasoning
- mathematics
- programming
- creative-writing
- chain-of-thought
- interpretability
- fairness
- security
- deployment
- sustainability
- monitoring
- plugin
---

# Brello Thinking

## Model Description

**Brello Thinking** is an advanced language model created by **Epic Systems** as part of the **Brello AI Family**. Built on the Tencent Hunyuan base model, Brello Thinking specializes in deep reasoning, mathematical problem-solving, coding, and creative thinking with enhanced chain-of-thought capabilities.

### Key Features

- **Advanced Reasoning**: Enhanced chain-of-thought with both fast and slow thinking modes
- **Mathematical Excellence**: Superior at math and symbolic computation
- **Programming Prowess**: Strong coding abilities across Python, JS, C++, SQL, and more
- **Long Context Understanding**: Handles up to 256K tokens, long documents, and codebases
- **Creative Problem Solving**: Generates novel solutions and approaches
- **Multi-language Support**: Fluent in English and Chinese, with robust cross-lingual transfer

---

## 1. Executive Summary

**Brello Thinking v1.1.0** (2025-08-07) is a 1.8B-parameter causal language model engineered for complex reasoning, mathematics, and creative tasks. It combines ultra-long context, dual "fast"/"deep" thinking modes, and a plugin SDK for live tool integration. It is designed for safe, sustainable, and fair production deployments.

#### Highlights in this Release

- **Mixed-precision quantization** (BF16 & INT8)
- **Plugin SDK** (JSON-RPC, HMAC auth, dynamic tool routing)
- **Monitoring** (Prometheus, Grafana, carbon tracking)
- **Sustainability Dashboard** (gCO₂eq/token metrics, CodeCarbon SDK)

---
## 2. Model Architecture

| Component | Specification |
|---|---|
| **Base Model** | Tencent Hunyuan / `EpicBrelloV1ForCausalLM` |
| **Parameters** | 1.8B (BF16/INT8 quantization; LoRA adapters optional) |
| **Context Window** | 256,000 tokens (rotary cache, sliding window, eviction logic) |
| **Attention** | Grouped-Query + Multi-Head FlashAttention (16 heads, 4 KV heads) |
| **Feed-Forward** | Two-stage (SiLU → Linear → SiLU) with RMSNorm, hidden size 6144 |
| **Depth** | 32 transformer blocks + 4 "Safety Adapter" blocks |
| **Adapters** | LoRA for math, code, creative, and domain fine-tuning (10–18M params each) |
| **Inference Modes** | Autoregressive sampling (top-k, top-p), beam, contrastive decoding |
| **Sharding** | ZeRO-3 / tensor-parallel / model-parallel combinations |

---

## 3. Training & Tuning

### 3.1 Pretraining Corpus

- **Web General**: 400B tokens (CommonCrawl, CC-100, curated news)
- **Science/Technical**: 50B tokens (arXiv, PubMed, patents)
- **Code**: 20B tokens (public GitHub, CodeSearchNet, MBPP)
- **Multilingual**: 30B tokens (Chinese, Spanish, German, Arabic)
- **Augmentations**: 15% span corruption, zh–en back-translation, dynamic masking

### 3.2 Optimization

- **Optimizer**: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
- **LR Schedule**: Linear warmup (10K steps), cosine decay (500K steps)
- **Batch**: 2M tokens/step, gradient accumulation ×8

### 3.3 Instruction/RLHF Tuning

- **Instruction Pairs**: 1.2M human-annotated QA/reasoning
- **Reward Model**: Dual human-preference ranking (5K raters, Elo)
- **Algorithm**: PPO w/ KL penalty (target KL=0.1), reward clipping

---
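The warmup-plus-cosine schedule from §3.2 can be sketched as a plain function. The warmup and decay step counts match the card; the peak and floor learning rates below are illustrative assumptions, since the card does not state them:

```python
import math


def lr_schedule(step: int, peak_lr: float = 3e-4,
                warmup_steps: int = 10_000, decay_steps: int = 500_000,
                min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay toward min_lr.

    warmup_steps and decay_steps follow Section 3.2; peak_lr and min_lr
    are hypothetical values chosen for illustration only.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear ramp from 0
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at zero, peaks at step 10,000, and reaches the floor by step 510,000, staying flat thereafter.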
## 4. Specialized Modules

| Adapter Name | Data Source | Params (M) | Use Case |
|---|---|---|---|
| math-adapter | GSM8K, MATH, AIME datasets | 12 | Math proofs, step-by-step logic |
| code-adapter | MBPP, MultiPL-E, GitHub repos | 18 | Coding, debugging, codegen |
| creative-adapter | Gutenberg, story corpora | 10 | Narrative, dialogue, ideation |

---

## 5. Plugin & Tooling SDK

- **Interface**: JSON-RPC (Unix socket or REST), HMAC-SHA256 auth
- **Plugins**:
  - DB connectors: PostgreSQL, MySQL, Snowflake
  - HTTP client: retry/backoff
  - Vector DB: FAISS, Pinecone

#### Tool Call Example

1. Model emits:
   ```json
   {"tool_call": {"name": "weather_fetch", "args": {"location": "Mumbai"}}}
   ```
2. Host executes the plugin and returns:
   ```json
   {"tool_result": {"forecast": "Sunny, 32°C"}}
   ```
3. Model resumes reasoning with the tool result in context.

---

## 6. Inference, Monitoring & Scaling

### 6.1 Endpoint Performance

| Mode | Batch | Seq Len | Throughput (tok/s) | Latency (p50) |
|---|---|---|---|---|
| Fast-Think | 8 | 4,096 | 250,000 | 15 ms |
| Deep-Think | 1 | 256,000 | 18,000 | 120 ms |
| INT8 Quant | 16 | 2,048 | 320,000 | 12 ms |

### 6.2 Observability

- **Prometheus Metrics**:
  - `brello_inference_latency_seconds`
  - `brello_generated_tokens_total`
  - `brello_cache_evictions_total`
- **Grafana**:
  - Token latency histograms, CO₂ per generation

---

## 7. Sustainability & Carbon Tracking

- **Data Center PUE**: 1.2
- **Carbon Emission**: ~0.0008 gCO₂eq/token (tracked with CodeCarbon)
- **Offset**: Epic Systems funds VER 2.0 credits

---

## 8. Robustness, Safety & Fairness

- **Adapters**: Real-time adversarial input filtering, personal data redaction, toxicity classifier (fine-tuned BERT-tox)
- **Bias Audits**:
  - Toxicity variation <1.8% (12 demographic axes)
  - Gender parity ±2%
  - Dialect coverage 98% (EN & ZH)

---
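The HMAC-SHA256 authentication named in §5 can be sketched with the Python standard library. The envelope shape below (a `payload` field plus an `hmac` field over the canonical JSON) is an assumption for illustration, not the SDK's actual wire format:

```python
import hashlib
import hmac
import json


def _canonical(payload: dict) -> bytes:
    # Canonical JSON: sorted keys, no whitespace, so both sides hash identical bytes
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()


def sign_tool_call(payload: dict, key: bytes) -> dict:
    """Wrap a JSON-RPC tool call in a (hypothetical) HMAC-SHA256 envelope."""
    sig = hmac.new(key, _canonical(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "hmac": sig}


def verify_tool_call(envelope: dict, key: bytes) -> bool:
    """Recompute the MAC and compare in constant time."""
    expected = hmac.new(key, _canonical(envelope["payload"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope["hmac"])


# Signing the tool call from the example above:
envelope = sign_tool_call(
    {"tool_call": {"name": "weather_fetch", "args": {"location": "Mumbai"}}},
    key=b"shared-secret",  # in practice, the plugin's auth_key
)
```

Any tampering with the payload (or use of the wrong key) makes `verify_tool_call` return `False`, which is why the host should verify before executing a plugin.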
## 9. Interpretability

- **Chain-of-Thought logs**: Token-level reasoning trace
- **Integrated Gradients**: Span attribution
- **Attention Rollouts**: Layer-wise visualization (custom plugin)

---

## 10. Hyperparameters

| Parameter | Value |
|---|---|
| num_layers | 32 |
| d_model | 2048 |
| d_hidden | 6144 |
| num_heads | 16 |
| kv_heads | 4 |
| rotary_pct | 0.25 |
| lr_warmup_steps | 10,000 |
| weight_decay | 0.01 |
| batch_size | 2M tokens |
| dropout_rate | 0.1 |

---

## 11. Evaluation & Error Analysis

- **Benchmarks**: GSM8K, MBPP, BBH, LongBench, MATH
- **Analysis**: Math/logic confusion matrix, hallucination drift cluster analysis

---

## 12. Roadmap

| Version | Highlights | ETA |
|---|---|---|
| v1.1.0 | Plugins, carbon tracking, INT8 quantization | Released |
| v1.2.0 | Vision-language, adapter expansion | Nov 2025 |
| v1.3.0 | Audio, multilingual tuning | Feb 2026 |
| v2.0 | Federated RAG, continuous learning | Q4 2026 |

---

## 13. Licensing & Compliance

- **License**: Proprietary, Epic Systems
- **Privacy**: GDPR, CCPA compliant
- **Certifications**: ISO 27001, SOC 2 Type II, HIPAA (BAA on request)
- **Restrictions**: No redistribution or large-scale rehosting

---
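A back-of-the-envelope check using the §10 hyperparameters shows why the grouped-query attention from §2 matters at a 256K context. Assuming head_dim = d_model / num_heads = 128 and BF16 (2 bytes per element), and ignoring the sliding-window eviction noted in §2 (which shrinks the real footprint):

```python
def kv_cache_bytes(seq_len: int, num_layers: int = 32, kv_heads: int = 4,
                   head_dim: int = 2048 // 16, dtype_bytes: int = 2) -> int:
    """Per-sequence KV-cache size: 2 tensors (K and V) per layer,
    each of shape [kv_heads, seq_len, head_dim], at dtype_bytes per element.
    Defaults follow the Section 10 hyperparameter table."""
    return 2 * num_layers * kv_heads * head_dim * seq_len * dtype_bytes


gqa = kv_cache_bytes(256_000)               # 4 KV heads, as in Section 2
mha = kv_cache_bytes(256_000, kv_heads=16)  # hypothetical full 16-head cache
print(f"GQA: {gqa / 1e9:.1f} GB, MHA: {mha / 1e9:.1f} GB")  # GQA: 16.8 GB, MHA: 67.1 GB
```

Sharing each KV head across four query heads cuts the full-context cache from ~67 GB to ~17 GB per sequence, a 4× saving before any eviction logic is applied.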
## 14. Usage Example

```python
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel  # For LoRA adapters
from brello_sdk import BrelloPluginManager  # Hypothetical SDK
from codecarbon import EmissionsTracker
from prometheus_client import CollectorRegistry, Counter, Histogram, push_to_gateway


def setup_model(
    model_id: str = "BrelloES/brello-thinking",
    use_bf16: bool = True,
    load_int8: bool = False,  # BF16 and INT8 are alternatives; enable at most one
):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        torch_dtype=torch.bfloat16 if use_bf16 and not load_int8 else torch.float32,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True) if load_int8 else None,
    )
    # Attach LoRA adapters
    model = PeftModel.from_pretrained(model, "adapters/math-adapter")
    model = PeftModel.from_pretrained(model, "adapters/code-adapter")
    return tokenizer, model


def setup_plugins():
    pm = BrelloPluginManager()
    pm.register(
        name="weather_fetch",
        path="/opt/brello/plugins/weather_plugin.so",
        auth_key=os.getenv("WEATHER_PLUGIN_KEY", "CHANGE_ME"),
    )
    pm.register(
        name="db_query",
        path="/opt/brello/plugins/db_query_plugin.so",
        auth_key=os.getenv("DB_PLUGIN_KEY", "CHANGE_ME"),
    )
    return pm


def setup_metrics():
    registry = CollectorRegistry()
    Histogram(
        "brello_inference_latency_seconds",
        "Inference latency (seconds) per request",
        registry=registry,
        buckets=(0.01, 0.05, 0.1, 0.2, 0.5, 1.0),
    )
    Counter(
        "brello_generated_tokens_total",
        "Total number of tokens generated by Brello",
        registry=registry,
    )
    return registry


def generate_response(tokenizer, model, plugin_mgr, registry, messages, mode: str = "deep"):
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",  # return a tensor, not a plain list of token ids
        enable_thinking=(mode == "deep"),
    ).to(model.device)

    tracker = EmissionsTracker(project_name="brello_inference", output_dir="carbon_logs")
    tracker.start()

    # (Metrics update simplified for clarity)
    outputs = model.generate(
        inputs,
        max_new_tokens=512,
        do_sample=True,  # required for top_p/temperature to take effect
        top_p=0.9,
        temperature=0.6,
        plugin_manager=plugin_mgr,  # Brello SDK extension, not a standard transformers argument
        return_dict_in_generate=True,
        output_scores=True,
    )
    emissions_kg = tracker.stop()

    # Decode only the newly generated tokens, skipping the prompt
    text = tokenizer.decode(outputs.sequences[0][inputs.shape[-1]:], skip_special_tokens=True)
    return text, emissions_kg


def main():
    tokenizer, model = setup_model()
    plugin_mgr = setup_plugins()
    registry = setup_metrics()

    messages = [
        {"role": "system", "content": "You are Brello Thinking in Deep-Think mode."},
        {"role": "user", "content": "Explain why prime factorization is unique."},
    ]
    response, co2 = generate_response(tokenizer, model, plugin_mgr, registry, messages, mode="deep")
    print("=== Deep-Think Output ===\n", response)
    print(f"CO₂ Emitted: {co2:.6f} kg")

    # Fast-Think comparison
    messages[0]["content"] = "You are Brello Thinking in Fast-Think mode."
    response_fast, co2_fast = generate_response(tokenizer, model, plugin_mgr, registry, messages, mode="fast")
    print("\n=== Fast-Think Output ===\n", response_fast)
    print(f"CO₂ Emitted: {co2_fast:.6f} kg")


if __name__ == "__main__":
    main()
```

---

## Credits

- **Creator**: Epic Systems
- **Engineer**: Rehan Temkar
- **Model**: Brello Thinking v1.1.0

---

*Brello Thinking – Advanced AI Reasoning by Epic Systems*