| # Concepts: Understanding OpenAI APIs | |
| This guide explains the fundamental concepts behind working with OpenAI's language models, which form the foundation for building AI agents. | |
| ## What is the OpenAI API? | |
| The OpenAI API provides programmatic access to powerful language models like GPT-4o and GPT-3.5-turbo. Instead of running models locally, you send requests to OpenAI's servers and receive responses. | |
| **Key characteristics:** | |
| - **Cloud-based:** Models run on OpenAI's infrastructure | |
| - **Pay-per-use:** Charged by token consumption | |
| - **Production-ready:** Enterprise-grade reliability and performance | |
| - **Latest models:** Immediate access to newest model releases | |
| **Comparison with Local LLMs (like node-llama-cpp):** | |
| | Aspect | OpenAI API | Local LLMs | | |
| |--------|------------|------------| | |
| | **Setup** | API key only | Download models, need GPU/RAM | | |
| | **Cost** | Pay per token | No per-token fee (hardware and power costs remain) | | |
| | **Performance** | Consistent, high-quality | Depends on your hardware | | |
| | **Privacy** | Data sent to OpenAI | Completely local/private | | |
| | **Scalability** | Scales on demand (subject to rate limits and budget) | Limited by your hardware | | |
| --- | |
| ## The Chat Completions API | |
| ### Request-Response Cycle | |
| ``` | |
| You (Client) OpenAI (Server) | |
| | | | |
| | POST /v1/chat/completions | | |
| | { | | |
| | model: "gpt-4o", | | |
| | messages: [...] | | |
| | } | | |
| |------------------------------->| | |
| | | | |
| | [Processing...] | | |
| | [Model inference] | | |
| | [Generate response] | | |
| | | | |
| | Response | | |
| | { | | |
| | choices: [{ | | |
| | message: { | | |
| | content: "..." | | |
| | } | | |
| | }] | | |
| | } | | |
| |<-------------------------------| | |
| | | | |
| ``` | |
| **Key point:** Each request is independent. The API doesn't store conversation history. | |
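
To make the cycle concrete, here is a minimal sketch of how the request in the diagram could be assembled for `fetch`. The helper names `buildChatRequest` and `extractReply` are illustrative and not part of any SDK, and the API key is a placeholder. Nothing is sent over the network in this sketch.

```javascript
// Hypothetical helper: builds the URL and options for a Chat Completions call.
// Pass the result to fetch(url, options) to actually send it.
function buildChatRequest(apiKey, model, messages) {
  return {
    url: 'https://api.openai.com/v1/chat/completions',
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`,
      },
      // The body carries everything the model will see for this request
      body: JSON.stringify({ model, messages }),
    },
  };
}

// Pulls the assistant text out of a parsed response body
function extractReply(responseJson) {
  return responseJson.choices[0].message.content;
}

const chatRequest = buildChatRequest('YOUR_API_KEY', 'gpt-4o', [
  { role: 'user', content: 'Hello!' },
]);
console.log(chatRequest.url, chatRequest.options.method);
```

Because the body is rebuilt for every call, anything missing from `messages` is invisible to the model.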
| --- | |
| ## Message Roles: The Conversation Structure | |
| Every message has a `role` that determines its purpose: | |
| ### 1. System Messages | |
| ```javascript | |
| { role: 'system', content: 'You are a helpful Python tutor.' } | |
| ``` | |
| **Purpose:** Define the AI's behavior, personality, and capabilities | |
| **Think of it as:** | |
| - The AI's "job description" | |
| - Invisible to the end user | |
| - Sets constraints and guidelines | |
| **Examples:** | |
| ```javascript | |
| // Specialist agent | |
| "You are an expert SQL database administrator." | |
| // Tone and style | |
| "You are a friendly customer support agent. Be warm and empathetic." | |
| // Output format control | |
| "You are a JSON API. Always respond with valid JSON, never plain text." | |
| // Behavioral constraints | |
| "You are a code reviewer. Be constructive and focus on best practices." | |
| ``` | |
| **Best practices:** | |
| - Keep it concise but specific | |
| - Place at the beginning of the messages array | |
| - Update it to change agent behavior | |
| - Use for ethical guidelines and output formatting | |
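
One way to apply the "update it to change agent behavior" advice is a small helper that swaps the system message while leaving the rest of the conversation intact. This is a sketch; `withSystemPrompt` is a hypothetical name, not an SDK function.

```javascript
// Hypothetical helper: returns a new messages array whose first entry is the
// given system prompt, replacing any system message already present.
function withSystemPrompt(messages, systemContent) {
  const rest = messages.filter((m) => m.role !== 'system');
  return [{ role: 'system', content: systemContent }, ...rest];
}

const chatHistory = [
  { role: 'system', content: 'You are a helpful Python tutor.' },
  { role: 'user', content: 'How do I use async/await?' },
];

// Same conversation, different agent behavior
const reviewerHistory = withSystemPrompt(
  chatHistory,
  'You are a code reviewer. Be constructive and focus on best practices.'
);
console.log(reviewerHistory[0].content);
```

The original array is not mutated, so both variants can be sent to the API side by side for comparison.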
| ### 2. User Messages | |
| ```javascript | |
| { role: 'user', content: 'How do I use async/await?' } | |
| ``` | |
| **Purpose:** Represent the human's input or questions | |
| **Think of it as:** | |
| - What you're asking the AI | |
| - The prompt or query | |
| - The instruction to follow | |
| ### 3. Assistant Messages | |
| ```javascript | |
| { role: 'assistant', content: 'Async/await is a way to handle promises...' } | |
| ``` | |
| **Purpose:** Represent the AI's previous responses | |
| **Think of it as:** | |
| - The AI's conversation history | |
| - Context for follow-up questions | |
| - What the AI has already said | |
| ### Conversation Flow Example | |
| ```javascript | |
| [ | |
| { role: 'system', content: 'You are a math tutor.' }, | |
| // First exchange | |
| { role: 'user', content: 'What is 15 * 24?' }, | |
| { role: 'assistant', content: '15 * 24 = 360' }, | |
| // Follow-up (knows context) | |
| { role: 'user', content: 'What about dividing that by 3?' }, | |
| { role: 'assistant', content: '360 ÷ 3 = 120' }, | |
| ] | |
| ``` | |
| **Why this matters:** The role structure enables: | |
| 1. **Context awareness:** AI understands conversation history | |
| 2. **Behavior control:** System prompts shape responses | |
| 3. **Multi-turn conversations:** Natural back-and-forth dialogue | |
| --- | |
| ## Statelessness: A Critical Concept | |
| **Most important principle:** The Chat Completions API is stateless. | |
| ### What does stateless mean? | |
| Each API call is independent. The model doesn't remember previous requests. | |
| ``` | |
| Request 1: "My name is Alice" | |
| Response 1: "Hello Alice!" | |
| Request 2: "What's my name?" | |
| Response 2: "I don't know your name." ← No memory! | |
| ``` | |
| ### How to maintain context | |
| **You must send the full conversation history:** | |
| ```javascript | |
| const messages = []; | |
| // First turn | |
| messages.push({ role: 'user', content: 'My name is Alice' }); | |
| const response1 = await client.chat.completions.create({ | |
| model: 'gpt-4o', | |
| messages: messages // ["My name is Alice"] | |
| }); | |
| messages.push(response1.choices[0].message); | |
| // Second turn - include full history | |
| messages.push({ role: 'user', content: "What's my name?" }); | |
| const response2 = await client.chat.completions.create({ | |
| model: 'gpt-4o', | |
| messages: messages // Full conversation! | |
| }); | |
| ``` | |
| ### Implications | |
| **Benefits:** | |
| - ✅ Simple architecture (no server-side state) | |
| - ✅ Easy to scale (any server can handle any request) | |
| - ✅ Full control over context (you decide what to include) | |
| **Challenges:** | |
| - ❌ You manage conversation history | |
| - ❌ Token costs increase with conversation length | |
| - ❌ Must implement your own memory/persistence | |
| - ❌ Context window limits eventually hit | |
| **Real-world solutions:** | |
| ```javascript | |
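| // Trim old messages when too long | |
| // (assumes `messages` is declared with `let` and messages[0] is the system message) | |
| if (messages.length > 20) { | |
|   messages = [messages[0], ...messages.slice(-10)]; // Keep system + last 10 | |
| } | |
| // Summarize old context | |
| // (summarizeConversation is a helper you would write; it should return a message object) | |
| if (totalTokens > 10000) { | |
|   const summary = await summarizeConversation(messages); | |
|   messages = [systemMessage, summary, ...recentMessages]; | |
| } | |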
| ``` | |
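
Trimming by message count ignores how long each message is. A possible alternative, sketched below, trims by an estimated token budget instead. The helper names are hypothetical, and the 4-characters-per-token estimate is a rough rule of thumb for English text, not the real tokenizer.

```javascript
// Rough estimate: about 4 characters per token for English text.
// Swap in a real tokenizer when accurate counts matter.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keeps the system message (if it is first) plus the newest messages that fit the budget.
function trimToBudget(messages, maxTokens) {
  const hasSystem = messages.length > 0 && messages[0].role === 'system';
  const system = hasSystem ? [messages[0]] : [];
  const rest = hasSystem ? messages.slice(1) : messages;

  let used = system.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  const kept = [];
  // Walk backwards so the most recent messages are kept first
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(rest[i]);
    used += cost;
  }
  return [...system, ...kept];
}

const trimmed = trimToBudget(
  [
    { role: 'system', content: 'You are a math tutor.' },
    { role: 'user', content: 'What is 15 * 24?' },
    { role: 'assistant', content: '15 * 24 = 360' },
    { role: 'user', content: 'What about dividing that by 3?' },
  ],
  16
);
console.log(trimmed.map((m) => m.role));
```

With this tiny budget only the system message and the latest user message survive; in practice the budget would be set well below the model's context window to leave room for the response.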
| --- | |
| ## Temperature: Controlling Randomness | |
| Temperature controls how "creative" or "random" the model's output is. | |
| ### How it works technically | |
| When generating each token, the model assigns probabilities to possible next tokens: | |
| ``` | |
| Input: "The sky is" | |
| Possible next tokens: | |
| - "blue" → 70% probability | |
| - "clear" → 15% probability | |
| - "dark" → 10% probability | |
| - "purple" → 5% probability | |
| ``` | |
| **Temperature modifies these probabilities:** | |
| **Temperature = 0.0 (Near-deterministic)** | |
| ``` | |
| Always pick the highest probability token | |
| "The sky is blue" → Nearly the same output every time | |
| ``` | |
| **Temperature = 0.7 (Balanced)** | |
| ``` | |
| Sample probabilistically with slight randomness | |
| "The sky is blue" or "The sky is clear" | |
| ``` | |
| **Temperature = 1.5 (Creative)** | |
| ``` | |
| Flatten probabilities, allow unlikely choices | |
| "The sky is purple" or "The sky is dancing" → More surprising! | |
| ``` | |
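
The effect on the example distribution can be reproduced numerically. The sketch below rescales the four probabilities from the "The sky is" example; raising each probability to the power `1 / T` and renormalizing is equivalent to dividing the logits by `T` before softmax. `applyTemperature` is an illustrative name, and `T = 0` is not handled here because it is the limiting case of always picking the top token.

```javascript
// Rescales a probability distribution by a temperature T > 0.
// Equivalent to dividing the logits by T and re-applying softmax.
function applyTemperature(probs, temperature) {
  const scaled = probs.map((p) => Math.pow(p, 1 / temperature));
  const total = scaled.reduce((sum, x) => sum + x, 0);
  return scaled.map((x) => x / total);
}

const tokens = ['blue', 'clear', 'dark', 'purple'];
const base = [0.7, 0.15, 0.1, 0.05];

for (const t of [0.2, 0.7, 1.0, 1.5]) {
  const adjusted = applyTemperature(base, t)
    .map((p, i) => `${tokens[i]}=${(p * 100).toFixed(1)}%`)
    .join(', ');
  console.log(`T=${t}: ${adjusted}`);
}
```

Low temperatures push almost all of the probability onto "blue", while temperatures above 1 move probability toward the unlikely tokens.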
| ### Practical Guidelines | |
| **Temperature 0.0 - 0.3: Focused Tasks** | |
| - Code generation | |
| - Data extraction | |
| - Factual Q&A | |
| - Classification | |
| - Translation | |
| Example: | |
| ```javascript | |
| // Extract JSON from text - needs consistency | |
| temperature: 0.1 | |
| ``` | |
| **Temperature 0.5 - 0.9: Balanced Tasks** | |
| - General conversation | |
| - Customer support | |
| - Content summarization | |
| - Educational content | |
| Example: | |
| ```javascript | |
| // Friendly chatbot | |
| temperature: 0.7 | |
| ``` | |
| **Temperature 1.0 - 2.0: Creative Tasks** | |
| - Story writing | |
| - Brainstorming | |
| - Poetry/creative content | |
| - Generating variations | |
| Example: | |
| ```javascript | |
| // Generate 10 different marketing taglines | |
| temperature: 1.3 | |
| ``` | |
| --- | |
| ## Streaming: Real-time Responses | |
| ### Non-Streaming (Default) | |
| ``` | |
| User: "Tell me a story" | |
| [Wait...] | |
| [Wait...] | |
| [Wait...] | |
| Response: "Once upon a time, there was a..." (all at once) | |
| ``` | |
| **Pros:** | |
| - Simple to implement | |
| - Easy to handle errors | |
| - Get complete response before processing | |
| **Cons:** | |
| - Appears slow for long responses | |
| - No feedback during generation | |
| - Poor user experience for chat | |
| ### Streaming | |
| ``` | |
| User: "Tell me a story" | |
| "Once" | |
| "Once upon" | |
| "Once upon a" | |
| "Once upon a time" | |
| "Once upon a time there" | |
| ... | |
| ``` | |
| **Pros:** | |
| - Immediate feedback | |
| - Appears faster | |
| - Better user experience | |
| - Can process tokens as they arrive | |
| **Cons:** | |
| - More complex code | |
| - Harder error handling | |
| - Can't see full response before displaying | |
| ### When to Use Each | |
| **Use Non-Streaming:** | |
| - Batch processing scripts | |
| - When you need to analyze the full response | |
| - Simple command-line tools | |
| - API endpoints that return complete results | |
| **Use Streaming:** | |
| - Chat interfaces | |
| - Interactive applications | |
| - Long-form content generation | |
| - Any user-facing application where UX matters | |
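
The "process tokens as they arrive" idea can be sketched without a network call. With the official Node SDK, passing `stream: true` to `client.chat.completions.create` yields an async iterable of chunks whose text fragments sit under `choices[0].delta.content`; the `fakeStream` generator below is a stand-in with that shape so the example runs offline, and `collectStream` is a hypothetical helper name.

```javascript
// Stand-in for an API stream: yields chunks shaped like streamed chat completions.
async function* fakeStream(pieces) {
  for (const piece of pieces) {
    yield { choices: [{ delta: { content: piece } }] };
  }
  // The last chunk of a stream typically carries no content
  yield { choices: [{ delta: {} }] };
}

// Consumes any async iterable of chunks, calling onToken for each fragment
// and returning the full text once the stream ends.
async function collectStream(stream, onToken) {
  let fullText = '';
  for await (const chunk of stream) {
    const piece = chunk.choices[0]?.delta?.content ?? '';
    if (piece) {
      onToken(piece);
      fullText += piece;
    }
  }
  return fullText;
}

collectStream(fakeStream(['Once', ' upon', ' a', ' time']), (t) =>
  process.stdout.write(t)
).then((text) => console.log(`\n[done: ${text.length} chars]`));
```

Returning the accumulated text alongside the per-token callback keeps the option of storing the complete assistant message in the conversation history after the stream finishes.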
| --- | |
| ## Tokens: The Currency of LLMs | |
| ### What are tokens? | |
| Tokens are the fundamental units that language models process. They're not exactly words, but pieces of text. | |
| **Tokenization examples (illustrative; real splits depend on the model's tokenizer):** | |
| ``` | |
| "Hello world" → ["Hello", " world"] = 2 tokens | |
| "coding" → ["coding"] = 1 token | |
| "uncoded" → ["un", "coded"] = 2 tokens | |
| ``` | |
| ### Why tokens matter | |
| **1. Cost** | |
| You pay per token for both input and output, and output tokens are usually priced higher than input tokens: | |
| ``` | |
| Request: 100 tokens | |
| Response: 150 tokens | |
| Total billed: 250 tokens | |
| ``` | |
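
A per-request cost can be derived from the `usage` object returned with each response. The sketch below assumes separate input and output prices; the numbers in `EXAMPLE_PRICING` are placeholders, not current list prices, and `estimateCostUSD` is a hypothetical helper.

```javascript
// Placeholder prices in USD per 1M tokens; look up current rates for your model.
const EXAMPLE_PRICING = { inputPerMillion: 2.5, outputPerMillion: 10 };

// `usage` mirrors response.usage: { prompt_tokens, completion_tokens, total_tokens }
function estimateCostUSD(usage, pricing) {
  const inputCost = (usage.prompt_tokens / 1e6) * pricing.inputPerMillion;
  const outputCost = (usage.completion_tokens / 1e6) * pricing.outputPerMillion;
  return inputCost + outputCost;
}

const usage = { prompt_tokens: 100, completion_tokens: 150, total_tokens: 250 };
console.log(`$${estimateCostUSD(usage, EXAMPLE_PRICING).toFixed(6)}`);
```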
| **2. Context Limits** | |
| Each model has a maximum token limit: | |
| ``` | |
| gpt-4o: 128,000 tokens (≈96,000 words) | |
| gpt-3.5-turbo: 16,385 tokens (≈12,000 words) | |
| ``` | |
| **3. Performance** | |
| More tokens = longer processing time and higher cost | |
| ### Managing Token Usage | |
| **Monitor usage:** | |
| ```javascript | |
| console.log(response.usage.total_tokens); | |
| // Track cumulative usage for budgeting | |
| ``` | |
| **Limit response length:** | |
| ```javascript | |
| max_tokens: 150 // Cap the response | |
| ``` | |
| **Trim conversation history:** | |
| ```javascript | |
| // Keep the system message plus the most recent messages | |
| // (assumes messages[0] is the system message) | |
| if (messages.length > 20) { | |
|   messages = [messages[0], ...messages.slice(-19)]; | |
| } | |
| ``` | |
| **Estimate before sending:** | |
| ```javascript | |
| import { encode } from 'gpt-tokenizer'; | |
| const text = "Your message here"; | |
| const tokens = encode(text).length; | |
| console.log(`Estimated tokens: ${tokens}`); | |
| ``` | |
| --- | |
| ## Model Selection: Choosing the Right Tool | |
| ### GPT-4o: The Powerhouse | |
| **Best for:** | |
| - Complex reasoning tasks | |
| - Code generation and debugging | |
| - Technical content | |
| - Tasks requiring high accuracy | |
| - Working with structured data | |
| **Characteristics:** | |
| - Most capable of the three models compared here | |
| - Higher cost | |
| - Slower than GPT-3.5 | |
| - Best for quality-critical applications | |
| **Example use cases:** | |
| - Legal document analysis | |
| - Complex code refactoring | |
| - Research and analysis | |
| - Educational tutoring | |
| ### GPT-4o-mini: The Balanced Choice | |
| **Best for:** | |
| - General-purpose applications | |
| - Good balance of cost and performance | |
| - Most everyday tasks | |
| **Characteristics:** | |
| - Good performance | |
| - Low cost (a fraction of GPT-4o's price) | |
| - Fast response times | |
| - Sweet spot for many applications | |
| **Example use cases:** | |
| - Customer support chatbots | |
| - Content summarization | |
| - General Q&A | |
| - Moderate complexity tasks | |
| ### GPT-3.5-turbo: The Speed Demon | |
| **Best for:** | |
| - High-volume, simple tasks | |
| - Speed-critical applications | |
| - Budget-conscious projects | |
| - Classification and extraction | |
| **Characteristics:** | |
| - Very fast | |
| - Low cost (check current pricing; GPT-4o-mini has been priced lower) | |
| - Good for simple tasks | |
| - Less capable reasoning | |
| **Example use cases:** | |
| - Sentiment analysis | |
| - Text classification | |
| - Simple formatting | |
| - High-throughput processing | |
| ### Decision Framework | |
| ``` | |
| Is task critical and complex? | |
| ├─ YES → GPT-4o | |
| └─ NO | |
|    └─ Is speed important and task simple? | |
|       ├─ YES → GPT-3.5-turbo | |
|       └─ NO → GPT-4o-mini | |
| ``` | |
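
The tree above translates directly into a small function. This is a sketch; the two boolean flags are assumptions about how a caller might describe a task, and deciding their values is the hard part in practice.

```javascript
// Direct translation of the decision tree into code.
function chooseModel({ criticalAndComplex, speedImportantAndSimple }) {
  if (criticalAndComplex) return 'gpt-4o';
  if (speedImportantAndSimple) return 'gpt-3.5-turbo';
  return 'gpt-4o-mini';
}

console.log(chooseModel({ criticalAndComplex: true, speedImportantAndSimple: false }));
console.log(chooseModel({ criticalAndComplex: false, speedImportantAndSimple: true }));
console.log(chooseModel({ criticalAndComplex: false, speedImportantAndSimple: false }));
```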
| --- | |
| ## Error Handling and Resilience | |
| ### Common Error Scenarios | |
| **1. Authentication Errors (401)** | |
| ```javascript | |
| // Invalid API key | |
| Error: Incorrect API key provided | |
| ``` | |
| **2. Rate Limiting (429)** | |
| ```javascript | |
| // Too many requests | |
| Error: Rate limit exceeded | |
| ``` | |
| **3. Token Limits (400)** | |
| ```javascript | |
| // Context too long | |
| Error: This model's maximum context length is 16385 tokens | |
| ``` | |
| **4. Service Errors (500)** | |
| ```javascript | |
| // OpenAI service issue | |
| Error: The server had an error processing your request | |
| ``` | |
| ### Best Practices | |
| **1. Always use try-catch:** | |
| ```javascript | |
| try { | |
| const response = await client.chat.completions.create({...}); | |
| } catch (error) { | |
| if (error.status === 429) { | |
| // Implement backoff and retry | |
| } else if (error.status === 500) { | |
| // Retry with exponential backoff | |
| } else { | |
| // Log and handle appropriately | |
| } | |
| } | |
| ``` | |
| **2. Implement retry logic:** | |
| ```javascript | |
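| // Small helper: resolves after `ms` milliseconds | |
| const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)); | |
| async function retryWithBackoff(fn, maxRetries = 3) { | |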
| for (let i = 0; i < maxRetries; i++) { | |
| try { | |
| return await fn(); | |
| } catch (error) { | |
| if (i === maxRetries - 1) throw error; | |
| await sleep(Math.pow(2, i) * 1000); // Exponential backoff | |
| } | |
| } | |
| } | |
| ``` | |
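
Retrying every error wastes time on failures that can never succeed, such as a 401 or a 400. A possible refinement, sketched below, retries only on 429 and 5xx responses and adds random jitter to the delay. The names `isRetryable` and `retryTransient` are hypothetical, and the sketch assumes errors expose an HTTP `status` property as in the try-catch example above.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Rate limits (429) and server errors (5xx) are worth retrying;
// 400 and 401 will fail again no matter how often you retry.
function isRetryable(error) {
  return error.status === 429 || (error.status >= 500 && error.status < 600);
}

async function retryTransient(fn, { maxAttempts = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (!isRetryable(error) || attempt >= maxAttempts - 1) throw error;
      // Exponential backoff plus random jitter so clients do not retry in lockstep
      const delay = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs;
      await sleep(delay);
    }
  }
}

// Demo with a stub that fails twice with a 429 before succeeding
let calls = 0;
const flaky = async () => {
  calls++;
  if (calls < 3) throw Object.assign(new Error('Rate limit exceeded'), { status: 429 });
  return 'ok';
};
retryTransient(flaky, { baseDelayMs: 10 }).then((result) =>
  console.log(`${result} after ${calls} calls`)
);
```

In real use, `fn` would wrap the API call, for example `() => client.chat.completions.create(params)`.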
| **3. Monitor token usage:** | |
| ```javascript | |
| let totalTokens = 0; | |
| totalTokens += response.usage.total_tokens; | |
| if (totalTokens > MONTHLY_BUDGET_TOKENS) { | |
| throw new Error('Monthly token budget exceeded'); | |
| } | |
| ``` | |
| --- | |
| ## Architectural Patterns | |
| ### Pattern 1: Simple Request-Response | |
| **Use case:** One-off queries, simple automation | |
| ```javascript | |
| const response = await client.chat.completions.create({ | |
| model: 'gpt-4o', | |
| messages: [{ role: 'user', content: query }] | |
| }); | |
| ``` | |
| **Pros:** Simple, easy to understand | |
| **Cons:** No context, no memory | |
| ### Pattern 2: Stateful Conversation | |
| **Use case:** Chat applications, tutoring, customer support | |
| ```javascript | |
| class Conversation { | |
| constructor() { | |
| this.messages = [ | |
| { role: 'system', content: 'Your behavior' } | |
| ]; | |
| } | |
| async ask(userMessage) { | |
| this.messages.push({ role: 'user', content: userMessage }); | |
| const response = await client.chat.completions.create({ | |
| model: 'gpt-4o', | |
| messages: this.messages | |
| }); | |
| this.messages.push(response.choices[0].message); | |
| return response.choices[0].message.content; | |
| } | |
| } | |
| ``` | |
| **Pros:** Maintains context, natural conversation | |
| **Cons:** Token costs grow, needs management | |
| ### Pattern 3: Specialized Agents | |
| **Use case:** Domain-specific applications | |
| ```javascript | |
| class PythonTutor { | |
| async help(question) { | |
| return await client.chat.completions.create({ | |
| model: 'gpt-4o', | |
| messages: [ | |
| { | |
| role: 'system', | |
| content: 'You are an expert Python tutor. Explain concepts clearly with code examples.' | |
| }, | |
| { role: 'user', content: question } | |
| ], | |
| temperature: 0.3 // Focused responses | |
| }); | |
| } | |
| } | |
| ``` | |
| **Pros:** Consistent behavior, optimized for domain | |
| **Cons:** Less flexible | |
| --- | |
| ## Hybrid Approach: Combining Proprietary and Open Source Models | |
| In real-world projects, the best solution often isn't choosing between OpenAI and local LLMs - it's using **both strategically**. | |
| ### Why Use a Hybrid Approach? | |
| **Cost optimization:** Use expensive models only when necessary | |
| **Privacy compliance:** Keep sensitive data local while leveraging cloud for general tasks | |
| **Performance balance:** Fast local models for simple tasks, powerful cloud models for complex ones | |
| **Reliability:** Fallback options when one service is down | |
| **Flexibility:** Match the right tool to each specific task | |
| ### Common Hybrid Architectures | |
| #### Pattern 1: Tiered Processing | |
| ``` | |
| Simple tasks → Local LLM (fast, free, private) | |
|   ↓ If complex | |
| Complex tasks → OpenAI API (powerful, accurate) | |
| ``` | |
| **Example workflow:** | |
| ```javascript | |
| async function processQuery(query) { | |
| const complexity = await assessComplexity(query); | |
| if (complexity < 0.5) { | |
| // Use local model for simple queries | |
| return await localLLM.generate(query); | |
| } else { | |
| // Use OpenAI for complex reasoning | |
| return await openai.chat.completions.create({ | |
| model: 'gpt-4o', | |
| messages: [{ role: 'user', content: query }] | |
| }); | |
| } | |
| } | |
| ``` | |
| **Use cases:** | |
| - Customer support: Local model for FAQs, GPT-4o for complex issues | |
| - Code generation: Local for simple scripts, GPT-4o for architecture | |
| - Content moderation: Local for obvious cases, cloud for edge cases | |
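
The workflow above relies on an `assessComplexity` function that is left undefined. One very simple way it might look is sketched here: a keyword-and-length heuristic returning a score between 0 and 1. The weights and keyword list are made up for illustration; a production system might use a small classifier model instead.

```javascript
// Hypothetical scoring heuristic: returns a value in [0, 1].
function assessComplexity(query) {
  const words = query.trim().split(/\s+/).filter(Boolean);
  // Longer queries tend to need more reasoning (saturates at 100 words)
  const lengthScore = Math.min(words.length / 100, 1);
  // Keywords that hint at multi-step reasoning; the list is illustrative
  const hints = ['why', 'compare', 'design', 'architecture', 'debug', 'explain', 'trade-off'];
  const lower = query.toLowerCase();
  const hintScore = hints.some((h) => lower.includes(h)) ? 1 : 0;
  return 0.4 * lengthScore + 0.6 * hintScore;
}

console.log(assessComplexity('What are your opening hours?'));
console.log(assessComplexity('Compare two architecture options for our billing service'));
```

The function is synchronous, but `await assessComplexity(query)` still works because `await` accepts plain values.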
| #### Pattern 2: Privacy-Based Routing | |
| ``` | |
| Public data → OpenAI (best quality) | |
| Sensitive data → Local LLM (private, secure) | |
| ``` | |
| **Example:** | |
| ```javascript | |
| async function handleRequest(data, containsSensitiveInfo) { | |
| if (containsSensitiveInfo) { | |
| // Process locally - data never leaves your infrastructure | |
| return await localLLM.generate(data, { | |
| systemPrompt: "You are an assistant handling protected health information" | |
| }); | |
| } else { | |
| // Use cloud for better quality | |
| return await openai.chat.completions.create({ | |
| model: 'gpt-4o', | |
| messages: [{ role: 'user', content: data }] | |
| }); | |
| } | |
| } | |
| ``` | |
| **Use cases:** | |
| - Healthcare: Patient data → Local, General medical info → OpenAI | |
| - Finance: Transaction details → Local, Market analysis → OpenAI | |
| - Legal: Client communications → Local, Legal research → OpenAI | |
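
The routing example takes a `containsSensitiveInfo` flag without saying where it comes from. A naive way to compute it is sketched below using a few regular expressions. The patterns are illustrative only; real compliance work needs proper data classification rather than pattern matching.

```javascript
// Naive pattern checks, for illustration only.
const SENSITIVE_PATTERNS = [
  /\b\d{3}-\d{2}-\d{4}\b/, // US SSN-like number
  /\b(?:\d[ -]?){13,16}\b/, // 13-16 digit card-like number
  /[\w.+-]+@[\w-]+\.[\w.-]+/, // email address
];

function containsSensitiveInfo(text) {
  return SENSITIVE_PATTERNS.some((pattern) => pattern.test(text));
}

console.log(containsSensitiveInfo('My SSN is 123-45-6789'));
console.log(containsSensitiveInfo('What is the capital of France?'));
```

When in doubt, a router built on a check like this should default to the local model, since a false positive only costs quality while a false negative leaks data.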
| #### Pattern 3: Specialized Agent Ecosystem | |
| ``` | |
| Agent 1 (Local): Fast classifier | |
|   ↓ Routes to | |
| Agent 2 (OpenAI): Deep analyzer | |
|   ↓ Routes to | |
| Agent 3 (Local): Action executor | |
| ``` | |
| **Example:** | |
| ```javascript | |
| class MultiModelAgent { | |
| async process(input) { | |
| // Step 1: Local model classifies intent (fast, cheap) | |
| const intent = await localLLM.classify(input); | |
| // Step 2: Route to appropriate handler | |
| if (intent.requiresReasoning) { | |
| // Complex reasoning with GPT-4o | |
| const analysis = await openai.chat.completions.create({ | |
| model: 'gpt-4o', | |
| messages: [{ role: 'user', content: input }] | |
| }); | |
| return analysis.choices[0].message.content; | |
| } else { | |
| // Simple response with local model | |
| return await localLLM.generate(input); | |
| } | |
| } | |
| } | |
| ``` | |
| **Use cases:** | |
| - Multi-stage pipelines with different complexity levels | |
| - Agent systems where each agent has specialized capabilities | |
| - Workflows requiring both speed and intelligence | |
| #### Pattern 4: Development vs Production | |
| ``` | |
| Development → OpenAI (fast iteration, best results) | |
|   ↓ Optimize | |
| Production → Local LLM (cost-effective, private) | |
| ``` | |
| **Workflow:** | |
| ```javascript | |
| const MODEL_PROVIDER = process.env.NODE_ENV === 'production' | |
| ? 'local' | |
| : 'openai'; | |
| async function generateResponse(prompt) { | |
| if (MODEL_PROVIDER === 'local') { | |
| return await localLLM.generate(prompt); | |
| } else { | |
| return await openai.chat.completions.create({ | |
| model: 'gpt-4o', | |
| messages: [{ role: 'user', content: prompt }] | |
| }); | |
| } | |
| } | |
| ``` | |
| **Strategy:** | |
| 1. Develop with GPT-4o to get best results quickly | |
| 2. Fine-tune prompts and test thoroughly | |
| 3. Switch to local model for production | |
| 4. Fall back to OpenAI for edge cases | |
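
Step 4 of the strategy, falling back to a second model, could be wired up roughly as follows. This is a sketch: `withFallback` is a hypothetical helper, and both generators are stubs standing in for a local model and a cloud model.

```javascript
// Tries the primary generator first and falls back if it throws or returns
// an empty result. Both generators are async functions that take a prompt.
function withFallback(primary, fallback) {
  return async function generate(prompt) {
    try {
      const result = await primary(prompt);
      if (result) return { source: 'primary', text: result };
    } catch (error) {
      console.warn(`Primary model failed: ${error.message}`);
    }
    return { source: 'fallback', text: await fallback(prompt) };
  };
}

// Stubs standing in for a local model and a cloud model
const flakyLocal = async () => {
  throw new Error('model not loaded');
};
const cloud = async (prompt) => `cloud answer to: ${prompt}`;

const generate = withFallback(flakyLocal, cloud);
generate('Summarize this ticket').then((result) =>
  console.log(`${result.source}: ${result.text}`)
);
```

Returning the `source` alongside the text makes it easy to log how often the fallback is used, which is a useful signal for deciding whether the primary model is good enough.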
| #### Pattern 5: Ensemble Approach | |
| ``` | |
| Query → [Local Model, OpenAI, Another API] | |
|               ↓         ↓          ↓ | |
|           Response  Response   Response | |
|               ↓         ↓          ↓ | |
|              Aggregator / Validator | |
|                         ↓ | |
|                   Best Response | |
| ``` | |
| **Example:** | |
| ```javascript | |
| async function ensembleGenerate(prompt) { | |
|   // Query all sources in parallel; allSettled waits for every call and never rejects | |
|   const results = await Promise.allSettled([ | |
|     localLLM.generate(prompt), | |
|     openaiClient.chat.completions.create({ | |
|       model: 'gpt-4o', | |
|       messages: [{ role: 'user', content: prompt }] | |
|     }), | |
|     backupAPI.generate(prompt) | |
|   ]); | |
|   // Keep only the calls that succeeded | |
|   const successful = results | |
|     .filter((r) => r.status === 'fulfilled') | |
|     .map((r) => r.value); | |
|   // Use validator to pick best or combine | |
|   return validator.selectBest(successful); | |
| } | |
| ``` | |
| **Use cases:** | |
| - Critical applications requiring high confidence | |
| - Fact-checking and verification | |
| - Reducing hallucinations through consensus | |
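
"Consensus" can be as simple as a majority vote over the answers that came back. The sketch below operates on the result objects produced by `Promise.allSettled`; `majorityVote` is a hypothetical stand-in for the validator, and it only suits short answers that can be compared as strings.

```javascript
// Picks the most common answer among fulfilled results (simple consensus).
// Answers are normalized so differences in case or spacing do not split the vote.
function majorityVote(settledResults) {
  const counts = new Map();
  for (const result of settledResults) {
    if (result.status !== 'fulfilled') continue;
    const key = String(result.value).trim().toLowerCase();
    const entry = counts.get(key) ?? { answer: result.value, votes: 0 };
    entry.votes += 1;
    counts.set(key, entry);
  }
  let best = null;
  for (const entry of counts.values()) {
    if (best === null || entry.votes > best.votes) best = entry;
  }
  return best; // null when every source failed
}

// Stubs standing in for three model calls, one of which fails
Promise.allSettled([
  Promise.resolve('Paris'),
  Promise.resolve('paris'),
  Promise.reject(new Error('backup API down')),
]).then((results) => console.log(majorityVote(results)));
```

Ties go to whichever answer was seen first, so the order of the sources acts as an implicit priority.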
| ### Cost-Benefit Analysis | |
| #### Scenario: Customer Support Chatbot (10,000 queries/day) | |
| **Option A: OpenAI Only** | |
| ``` | |
| 10,000 queries × 500 tokens avg = 5M tokens/day | |
| Cost: ~$25-50/day = ~$750-1500/month | |
| Pros: Highest quality, zero infrastructure | |
| Cons: Expensive at scale, privacy concerns | |
| ``` | |
| **Option B: Local LLM Only** | |
| ``` | |
| Infrastructure: $100-500/month (server/GPU) | |
| Cost: $100-500/month | |
| Pros: Predictable costs, private, unlimited usage | |
| Cons: Setup complexity, maintenance, lower quality | |
| ``` | |
| **Option C: Hybrid (80% local, 20% OpenAI)** | |
| ``` | |
| 8,000 simple queries → Local LLM (free after setup) | |
| 2,000 complex queries → OpenAI (~$5-10/day) | |
| Infrastructure: $100-500/month | |
| API costs: $150-300/month | |
| Total: $250-800/month | |
| Pros: Cost-effective, high quality when needed, flexible | |
| Cons: More complex architecture | |
| ``` | |
| **Winner for most projects: Hybrid approach** ✅ | |
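
The arithmetic behind the three options can be checked with a short script. The per-million-token prices of $5 and $10 are not list prices; they are the rates implied by the scenario's own figures ($25-50/day for 5M tokens). The function names are hypothetical.

```javascript
// Monthly API cost for a given daily volume, assuming a flat price per 1M tokens.
function monthlyApiCost(queriesPerDay, avgTokensPerQuery, usdPerMillionTokens, days = 30) {
  const tokensPerDay = queriesPerDay * avgTokensPerQuery;
  return (tokensPerDay / 1e6) * usdPerMillionTokens * days;
}

// Hybrid cost: fixed infrastructure plus API cost for the share of queries sent to the cloud.
function hybridMonthlyCost({ queriesPerDay, avgTokens, cloudShare, usdPerMillionTokens, infraUsd }) {
  const cloudQueries = queriesPerDay * cloudShare;
  return infraUsd + monthlyApiCost(cloudQueries, avgTokens, usdPerMillionTokens);
}

// Option A, low and high end
console.log(monthlyApiCost(10000, 500, 5), monthlyApiCost(10000, 500, 10));

// Option C, low and high end
console.log(
  hybridMonthlyCost({ queriesPerDay: 10000, avgTokens: 500, cloudShare: 0.2, usdPerMillionTokens: 5, infraUsd: 100 }),
  hybridMonthlyCost({ queriesPerDay: 10000, avgTokens: 500, cloudShare: 0.2, usdPerMillionTokens: 10, infraUsd: 500 })
);
```

Changing `cloudShare` shows how quickly the hybrid advantage shrinks as more queries need the cloud model.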
| ### Decision Framework | |
| ``` | |
| START: New query arrives | |
|   ↓ | |
| Is data sensitive/regulated? | |
| ├─ YES → Use local model (privacy first) | |
| └─ NO → Continue | |
|   ↓ | |
| Is task simple/repetitive? | |
| ├─ YES → Use local model (cost-effective) | |
| └─ NO → Continue | |
|   ↓ | |
| Is high accuracy critical? | |
| ├─ YES → Use OpenAI (quality first) | |
| └─ NO → Continue | |
|   ↓ | |
| Is it high volume? | |
| ├─ YES → Use local model (cost at scale) | |
| └─ NO → Use OpenAI (simplicity) | |
| ``` | |
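
Expressed as code, the flowchart is a chain of early returns checked in the same order. This is a sketch; the four boolean flags are assumptions about what an upstream classifier would provide.

```javascript
// Walks the decision framework top to bottom; the first matching rule wins.
function routeQuery({ sensitive, simple, accuracyCritical, highVolume }) {
  if (sensitive) return { provider: 'local', reason: 'privacy first' };
  if (simple) return { provider: 'local', reason: 'cost-effective' };
  if (accuracyCritical) return { provider: 'openai', reason: 'quality first' };
  if (highVolume) return { provider: 'local', reason: 'cost at scale' };
  return { provider: 'openai', reason: 'simplicity' };
}

console.log(routeQuery({ sensitive: true, simple: false, accuracyCritical: true, highVolume: false }));
console.log(routeQuery({ sensitive: false, simple: false, accuracyCritical: false, highVolume: false }));
```

Note that the ordering encodes priorities: sensitive data goes to the local model even when accuracy is critical.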
| ### The Future: Intelligent Model Selection | |
| Advanced systems will automatically choose models based on real-time factors: | |
| ```javascript | |
| class IntelligentModelSelector { | |
| async selectModel(query, context) { | |
| const factors = { | |
| complexity: await this.analyzeComplexity(query), | |
| latency: context.userTolerance, | |
| budget: context.remainingBudget, | |
| accuracy: context.requiredConfidence, | |
| privacy: context.dataClassification | |
| }; | |
| // ML model predicts best provider | |
| const selection = await this.mlSelector.predict(factors); | |
| return { | |
| provider: selection.provider, // 'local' | 'openai-mini' | 'openai-4' | |
| confidence: selection.confidence, | |
| reasoning: selection.reasoning | |
| }; | |
| } | |
| } | |
| ``` | |
| ### Key Takeaway | |
| **You don't have to choose.** Modern AI applications benefit from using the right model for each task: | |
| - **OpenAI, Claude, or self-hosted large open-source models:** Complex reasoning, critical accuracy, rapid development | |
| - **Local for scale:** Privacy, cost control, high volume, offline operation | |
| - **Both for success:** Cost-effective, flexible, reliable production systems | |
| The best architecture leverages the strengths of each approach while mitigating their weaknesses. | |
| --- | |
| ## Preparing for Agents | |
| The concepts covered here are **foundational** for building AI agents: | |
| ### You now understand: | |
| - **How to communicate with LLMs** (API basics) | |
| - **How to shape behavior** (system prompts) | |
| - **How to maintain context** (message history) | |
| - **How to control output** (temperature, tokens) | |
| - **How to handle responses** (streaming, errors) | |
| ### What's next for agents: | |
| - **Function calling / Tool use** - Let the AI take actions | |
| - **Memory systems** - Persistent state across sessions | |
| - **ReAct patterns** - Iterative reasoning and observation | |
| **Bottom line:** You can't build good agents without mastering these fundamentals. Every agent pattern builds on this foundation. | |
| --- | |
| ## Key Insights | |
| 1. **Statelessness is power and burden:** You control context, but you must manage it | |
| 2. **System prompts are your secret weapon:** Same model → different behaviors | |
| 3. **Temperature changes everything:** Match it to your task type | |
| 4. **Tokens are the real currency:** Monitor and optimize usage | |
| 5. **Model choice matters:** Don't use a sledgehammer for a nail | |
| 6. **Streaming improves UX:** Use it for user-facing applications | |
| 7. **Error handling is not optional:** The network will fail, plan for it | |
| --- | |
| ## Further Reading | |
| - [OpenAI API Documentation](https://platform.openai.com/docs/api-reference) | |
| - [OpenAI Cookbook](https://cookbook.openai.com/) | |
| - [Best Practices for Prompt Engineering](https://platform.openai.com/docs/guides/prompt-engineering) | |
| - [Token Counting](https://platform.openai.com/tokenizer) | |