| GENERAL_QUALITY_STANDARDS = """ | |
| The overall goal of the quiz is to set the learner up for success in answering | |
| interesting and non-trivial questions (not to give them a painful or discouraging | |
| experience). A perfect score should be attainable for someone who thoughtfully | |
| consumed the course content the quiz is based on. The question you write must be | |
| aligned with the course content and the provided | |
| learning objective and correct answer. | |
| Because this is both a software development course and an AI development course, refrain from any references to manual intervention unless they are clearly relevant. | |
| """ | |
| MULTIPLE_CHOICE_STANDARDS = """ | |
| - Each question must have EXACTLY ONE correct answer (no more, no less) | |
| - Each question should have a clear, unambiguous correct answer | |
| - Distractors (wrong answer options) should be plausible and represent common misconceptions. Not obviously wrong. | |
| - All options should be of similar length and detail | |
| - Options should be mutually exclusive | |
| - Avoid "all/none of the above" options unless pedagogically necessary | |
| - Typically include 4 options (A, B, C, D) | |
| - IMPORTANT NOTE: do not start the answer feedback with “correct” or “i | |
| """ | |
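The multiple-choice standards above lend themselves to a mechanical pre-check before a question reaches a human or LLM judge. The following is a minimal sketch, not part of the original module: the `Option` dataclass, the `check_multiple_choice` function, and the 2x length threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One answer option (hypothetical structure); `is_correct` marks the right answer."""
    text: str
    is_correct: bool = False


def check_multiple_choice(options: list[Option], expected_count: int = 4) -> list[str]:
    """Return a list of readable problems; an empty list means none were found."""
    problems = []

    # Exactly one option may be marked correct.
    correct = sum(1 for o in options if o.is_correct)
    if correct != 1:
        problems.append(f"expected exactly 1 correct answer, found {correct}")

    # The standards say questions typically include 4 options.
    if len(options) != expected_count:
        problems.append(f"expected {expected_count} options, found {len(options)}")

    # Flag "all of the above" / "none of the above" style options.
    for o in options:
        lowered = o.text.lower()
        if "all of the above" in lowered or "none of the above" in lowered:
            problems.append(f"avoid catch-all option: {o.text!r}")

    # Rough length check (assumed threshold): longest option at most 2x the shortest.
    lengths = [len(o.text) for o in options]
    if lengths and max(lengths) > 2 * max(min(lengths), 1):
        problems.append("options differ widely in length")

    return problems
```

Plausibility of distractors and mutual exclusivity are judgment calls that a check like this cannot cover, so it would only complement the prompt-based review.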
| EXAMPLE_QUESTIONS = """ | |
| <EXAMPLE_QUESTION_1> | |
| What is a code agent in the context of an LLM workflow? | |
| A: An AI model that generates text responses without executing external actions. | |
| Feedback: Code agents do more than generate text—they generate and run code to perform actions. | |
| *B: An AI agent that can write and execute code as part of its decision-making process. | |
| Feedback: Well done! Code agents can write and execute code to handle tasks, rather than just output text or follow a strict script. | |
| C: A pre-programmed or “coded” AI system that follows a strict decision tree to perform tasks. | |
| Feedback: Code agents are not fixed, rule-based systems. Code agents can write and execute code, rather than following a single predetermined decision tree. | |
| D: An AI assistant that can perform calculations and simple tasks without external interaction. | |
| Feedback: Code agents can write and execute code to perform complex tasks, not just simple calculations. | |
| </EXAMPLE_QUESTION_1> | |
| <EXAMPLE_QUESTION_2> | |
| In the context of agent architectures, what is a key performance trade-off between representing agent actions as code (code agents) versus representing them as JSON tool calls, when it comes to complex multi-step tasks? | |
| *A: Code agents use fewer tokens, exhibit lower latency, and have reduced error rates, since complex actions can be represented and executed in a single, consolidated code snippet. | |
| Feedback: Nice work! Code agents can execute loops, reuse variables, and call multiple tools with a single code snippet, so they need far fewer LLM turns. Fewer turns mean lower token usage, shorter round-trip latency, and fewer chances for the model to make mistakes. | |
| B: JSON-based agents require fewer tokens because each step is more compact, resulting in faster execution and fewer errors for complex tasks. | |
| Feedback: JSON tool calls actually increase token usage because each micro-action and its context must be sent back to the model repeatedly. This longer chain of steps results in slower execution and an increased chance of errors. | |
| C: Both code-based and JSON-based representations are equivalent in terms of token usage, latency, and error rates for complex multi-step tasks. | |
| Feedback: Code agents can chain many actions in a single code execution step, which reduces token usage, latency, and error rates, while tool calling agents execute a chain step-by-step, which typically increases token usage, latency, and the chance of errors. | |
| D: Code agents have higher latency due to the complexity of parsing code, while JSON action representation avoids errors by breaking down tasks into isolated calls. | |
| Feedback: Parsing a short code snippet is straightforward and fast, and overall latency tends to be dominated by LLM-token traffic. Code agents can execute complex tasks in one step, so they typically run faster and with a reduced chance of errors. JSON action agents often execute the same complex task step-by-step using many LLM turns, so latency and error rate are higher, not lower. | |
| </EXAMPLE_QUESTION_2> | |
| <EXAMPLE_QUESTION_3> | |
| What is one of the main risks associated with running code agents on your local computer? | |
| A: They might send spam emails from your email account. | |
| Feedback: The lesson highlights risks such as file deletion, resource abuse, or network compromise, not sending spam emails. | |
| *B: They could execute code that deletes critical files or creates many files that bloat your system. | |
| Feedback: Good work! Letting an agent run code locally can compromise your system in a number of ways, such as deleting vital files, or generating a large number of files. | |
| C: They might launch a denial-of-service attack on your own computer. | |
| Feedback: The lesson focuses on more direct threats such as file deletion and creation, or installing malware. While a denial-of-service is theoretically possible, it isn’t emphasized as a primary risk. | |
| D: They might cause your computer to overheat by overusing the CPU. | |
| Feedback: Overheating and CPU overuse isn’t one of the primary risks discussed in the lesson. | |
| </EXAMPLE_QUESTION_3> | |
| <EXAMPLE_QUESTION_4> | |
| How does the custom local Python interpreter demonstrated in the course mitigate risks from harmful code execution? | |
| A: The local interpreter uses the standard Python interpreter but logs all output for manual review. | |
| Feedback: The custom interpreter presented in the course does not rely on the normal Python interpreter at all. It enforces safeguards such as blocking imports, ignoring shell commands, and capping the number of loop iterations. | |
| *B: It ignores undefined commands, disallows imports outside an explicit whitelist, and sets a hard cap on loop iterations to prevent infinite loops and resource abuse. | |
| Feedback: That’s right! The custom Python interpreter presented in the course skips undefined shell-style commands, blocks any import not explicitly approved, and stops loop executions that exceed the cap, all of which help mitigate security and resource risks. | |
| C: It only allows execution of code that does not require any external packages, preventing all imports regardless of configuration. | |
| Feedback: The interpreter isn’t a blanket “no-imports” sandbox. It blocks imports by default, but you can pass an explicit whitelist (e.g., LocalPythonExecutor(["numpy", "PIL"])) that lets approved external packages load. | |
| D: It prevents all code execution by rejecting any code containing loops or function definitions. | |
| Feedback: The interpreter doesn’t blanket-ban loops or function definitions. It still runs normal Python code—including loops and functions—but adds safeguards: it caps loop iterations, blocks disallowed imports, and skips undefined commands. | |
| </EXAMPLE_QUESTION_4> | |
| <EXAMPLE_QUESTION_5> | |
| Which of the following is a key security advantage of using a remote sandbox environment for executing code agents, as discussed in the lesson? | |
| A: It ensures faster execution of agents by optimizing code compilation. | |
| Feedback: Remote sandboxes primarily protect systems from potential harm, not enhance code execution speed. | |
| *B: It allows execution of code without the risk of affecting local systems. | |
| Feedback: Great job! Running agents in a remote sandbox isolates their code execution so that any errors or malicious actions cannot harm your local system. | |
| C: It provides detailed real-time monitoring of all code executions. | |
| Feedback: A remote sandbox mainly prevents harmful code from threatening your local system, not through real-time monitoring. | |
| D: It guarantees execution of the code with reduced computational cost. | |
| Feedback: Using a remote sandbox protects your local system from malicious or faulty code. It does not guarantee reduced computational cost. | |
| </EXAMPLE_QUESTION_5> | |
| Note that all example questions follow the general quality standards as well as | |
| the question specific quality standards. The correct answer (marked with a *) and | |
| incorrect answer options follow the standards specific to correct and incorrect | |
| answers. | |
| """ | |
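The example questions above share a simple text layout: a question, then option lines such as `A:` or `*B:` (the `*` marks the correct answer), each followed by a `Feedback:` line. A small parser for that layout might look like the sketch below. The function name `parse_question` and the returned dictionary shape are assumptions made for illustration, and the sketch expects each option and its feedback on separate lines.

```python
import re

# Matches an option line such as "A: text" or "*B: text" (the * marks the correct answer).
OPTION_RE = re.compile(r"^(\*?)([A-D]):\s*(.+)$")


def parse_question(block: str) -> dict:
    """Parse one example-question block into question text, options, and correct letters."""
    question_lines = []
    options = []
    for raw in block.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        match = OPTION_RE.match(line)
        if match:
            star, letter, text = match.groups()
            options.append(
                {"letter": letter, "text": text, "is_correct": star == "*", "feedback": ""}
            )
        elif line.startswith("Feedback:") and options:
            # Attach feedback to the most recently seen option.
            options[-1]["feedback"] = line[len("Feedback:"):].strip()
        elif not options:
            # Anything before the first option is part of the question text.
            question_lines.append(line)
    correct = [o["letter"] for o in options if o["is_correct"]]
    return {"question": " ".join(question_lines), "options": options, "correct": correct}
```

Returning `correct` as a list rather than a single letter makes it easy to detect blocks that violate the one-correct-answer rule.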
| QUESTION_SPECIFIC_QUALITY_STANDARDS = """ | |
| The question you write must: | |
| - be in the language and tone of the course. | |
| - be at a similar level of difficulty or complexity as encountered in the course. | |
| - assess only information from the course and not depend on information that was | |
| not covered in the course. | |
| - not attempt to teach something as part of the quiz. | |
| - use clear and concise language | |
| - not induce confusion | |
| - provide a slight (not major) challenge. | |
| - be easily interpreted and unambiguous. | |
| - be well written in clear and concise language, proper grammar, good sentence | |
| structure, and consistent formatting | |
| - be thoughtful and specific rather than broad and ambiguous | |
| - be complete in its wording such that understanding the question is not part | |
| of the assessment | |
| """ | |
| CORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS = """ | |
| The correct answer you write must: | |
| - be factually correct and unambiguous | |
| - be in the language and tone of the course and in complete sentence form. | |
| - be at a similar level of difficulty or complexity as encountered in the course. | |
| - contain only information from the course and not depend on information that was | |
| not covered in the course. | |
| - not attempt to teach something as part of the quiz. | |
| - use clear and concise language | |
| - be thoughtful and specific rather than broad and ambiguous | |
| - be complete in its wording such that understanding which is the correct answer | |
| is not part of the assessment | |
| """ | |
| INCORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS = """ | |
| The incorrect answer options you write should ideally represent reasonable potential misconceptions, but they could also be answers that would sound plausible to someone who has not taken the course, or was not paying close enough attention during the course. In that sense, they should require some thought, even by the learner who has diligently completed the course, to determine they are incorrect. | |
| When constructing incorrect answer feedback, pay attention to the incorrect_answer_suggestions provided along with the learning objective. These are not your only options for incorrect answers, but you can use them directly, as a starting point, or in addition to other plausible incorrect answers. | |
| Wrong answers should not be so obviously wrong that a learner who has not taken the course can immediately rule them out. | |
| Here are some examples of poorly written incorrect answer options for a particular | |
| question: | |
| <QUESTION> | |
| Which statement best explains why monitoring an agent's trace can be helpful | |
| in debugging and establishing the performance of an agent? | |
| </QUESTION> | |
| <POORLY_WRITTEN_INCORRECT_ANSWER_OPTION_1> | |
| The agent's trace is a visual interface for users that adds limited insight into | |
| the agent's internal processes. | |
| </POORLY_WRITTEN_INCORRECT_ANSWER_OPTION_1> | |
| The example above is poorly written because it is obviously wrong. The question | |
| is asking for why monitoring the agent trace can be helpful and this answer option | |
| states that it is a visual interface that provides limited insight, which is | |
| certainly incorrect, but does not represent a reasonable potential misconception | |
| and even a learner who has not taken the course can immediately rule it out. | |
| <POORLY_WRITTEN_INCORRECT_ANSWER_OPTION_2> | |
| The agent trace is used exclusively for scenarios where the agent is underperforming. | |
| </POORLY_WRITTEN_INCORRECT_ANSWER_OPTION_2> | |
| This answer option is also poorly written because it is obviously wrong. The use of the word "exclusively" is a tipoff that this is not the right answer. Similarly, formulating incorrect answer options with words like "only", "always", "never", and similar words that in and of themselves make the answer option wrong represent poor word choice for incorrect answer options. | |
| Below is an example of a well written incorrect answer option for the same question: | |
| <WELL_WRITTEN_INCORRECT_ANSWER_OPTION> | |
| The agent trace comprises a list of the error messages generated during an agent's | |
| execution, which is helpful for debugging. | |
| </WELL_WRITTEN_INCORRECT_ANSWER_OPTION> | |
| The example above is well written because it is not obviously wrong and represents | |
| a reasonable potential misconception. It requires some thought to determine it is | |
| incorrect and a learner who has not taken the course will not be able to | |
| immediately rule it out. In fact, if you changed the word "comprises" to "includes" the answer option would be correct, in a sense, just incomplete. But in this case, the learner needs to be paying close attention to identify this as incorrect. | |
| """ | |
| ANSWER_FEEDBACK_QUALITY_STANDARDS = """ | |
| Every correct and incorrect answer must include feedback. | |
| Incorrect answer feedback should: | |
| - be informational and encouraging, not punitive. | |
| - be a single sentence, concise and to the point. | |
| - not say "Incorrect" or "Wrong". | |
| Correct answer feedback should: | |
| - be informational and encouraging. | |
| - be a single sentence, concise and to the point. | |
| - not say "Correct!" or anything that will sound redundant after the string "Correct: ", e.g. "Correct: Correct!". | |
| """ | |
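The feedback standards above (present, a single sentence, no "Correct", "Incorrect", or "Wrong" opener) can be screened with a few lines of code. This is a rough sketch under stated assumptions: `check_feedback` and `BANNED_OPENERS` are hypothetical names, and the sentence splitter is deliberately naive, so abbreviations such as "e.g." would be counted as sentence breaks.

```python
import re

# Openers the standards ask feedback to avoid (compared case-insensitively).
BANNED_OPENERS = ("correct", "incorrect", "wrong")


def check_feedback(feedback: str) -> list[str]:
    """Return a list of problems with one feedback string; empty means none found."""
    text = feedback.strip()
    if not text:
        return ["feedback is missing"]

    problems = []

    # Compare the first word, stripped of punctuation, against the banned openers.
    first_word = re.sub(r"[^a-z]", "", text.split()[0].lower())
    if first_word in BANNED_OPENERS:
        problems.append(f"feedback starts with {first_word!r}")

    # Naive sentence count: split after ., !, or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if len(sentences) > 1:
        problems.append(f"expected a single sentence, found {len(sentences)}")

    return problems
```

Whether feedback is "informational and encouraging" is left to the judge prompt; only the mechanical rules are checked here.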
| INCORRECT_ANSWER_PROMPT = """ | |
| # CORE PRINCIPLES WITH EXAMPLES: | |
| ## 1. CREATE COMMON MISUNDERSTANDINGS | |
| Create incorrect answer suggestions that represent how students actually misunderstand the material: | |
| <example> | |
| Learning Objective: "What is version control in software development?" | |
| Correct Answer: "A system that tracks changes to files over time so specific versions can be recalled later." | |
| Plausible Incorrect Answer Suggestions: | |
| - "A testing method that ensures software works correctly across different operating system versions." (Confuses with cross-platform testing) | |
| - "A project management approach where each team member works on a separate software version." (Misunderstands the concept entirely) | |
| - "A release strategy that maintains multiple versions of software for different customer needs." (Confuses with product versioning) | |
| </example> | |
| ## 2. MAINTAIN IDENTICAL STRUCTURE | |
| All incorrect answer suggestions must match the correct answer's grammatical pattern, length, and formatting: | |
| <example> | |
| Learning Objective: "What are the three primary data structures used in machine learning algorithms?" | |
| Correct Answer: "Arrays, matrices, and graphs." | |
| Good Incorrect Answer Suggestions: | |
| - "Dictionaries, trees, and queues." (Same structure, different data structures) | |
| - "Tensors, vectors, and databases." (Same structure but mixing concepts) | |
| - "Features, labels, and parameters." (Same structure, but confuses data structures with ML concepts) | |
| Bad Incorrect Answer Suggestion: | |
| - "Machine learning algorithms first store data in arrays, then process it using functional programming." (Different structure) | |
| </example> | |
| ## 3. USE COURSE TERMINOLOGY CORRECTLY BUT IN WRONG CONTEXTS | |
| Use terms from the course material but apply them incorrectly: | |
| <example> | |
| Learning Objective: "What is the purpose of backpropagation in neural networks?" | |
| Correct Answer: "To calculate gradients used to update weights during training." | |
| Plausible Incorrect Answer Suggestions: | |
| - "To normalize input data across layers to prevent gradient explosion." (Uses correct terms but describes batch normalization) | |
| - "To optimize the activation functions by adjusting their thresholds during inference." (Misapplies neural network terminology) | |
| - "To propagate inputs forward through the network during the prediction phase." (Confuses with forward propagation) | |
| </example> | |
| ## 4. INCLUDE PARTIALLY CORRECT INFORMATION | |
| Create incorrect answer suggestions that contain some correct elements but miss critical aspects: | |
| <example> | |
| Learning Objective: "How does transfer learning improve deep neural network training?" | |
| Correct Answer: "By reusing features learned from a large dataset to initialize a model that can then be fine-tuned on a smaller, task-specific dataset." | |
| Plausible Incorrect Answer Suggestions: | |
| - "By transferring trained models between different neural network frameworks to improve compatibility and deployment options." (Misunderstands the concept of knowledge transfer) | |
| - "By reusing features learned from a large dataset and freezing all weights to prevent any updates during task-specific training." (First part correct, second part wrong) | |
| - "By combining multiple pre-trained models into a committee that votes on final predictions for improved accuracy." (Confuses with ensemble learning) | |
| </example> | |
| ## 5. AVOID OBVIOUSLY WRONG ANSWERS | |
| Don't create incorrect answer suggestions that anyone with basic knowledge could eliminate: | |
| <example> | |
| Learning Objective: "What is unit testing in software development?" | |
| Correct Answer: "Testing individual components in isolation to verify they work as expected." | |
| Bad Incorrect Answer Suggestions to Avoid: | |
| - "A process where code is randomly modified to see if it still works." (Too obviously wrong) | |
| - "Testing that should never be done because it wastes development time." (Contradicts basic principles) | |
| - "Running the software on different units of hardware like phones and laptops." (Misunderstands the basic concept) | |
| </example> | |
| ## 6. MIRROR THE DETAIL LEVEL AND STYLE | |
| Match the technical depth and tone of the correct answer: | |
| <example> | |
| Learning Objective: "What is the time complexity of quicksort in the average case?" | |
| Correct Answer: "O(n log n), where n is the number of elements to be sorted." | |
| Good Incorrect Answer Suggestions: | |
| - "O(n^2), where n is the number of elements to be sorted." (Same level of detail) | |
| - "O(n), where n is the number of elements to be sorted." (Same structure and detail) | |
| - "O(log n), where n is the number of elements to be sorted." (Same structure but incorrect complexity) | |
| Bad Incorrect Answer Suggestion: | |
| - "Quicksort is generally faster than bubble sort but can perform poorly on already sorted arrays." (Different style and not answering the specific objective) | |
| </example> | |
| ## 7. FOR LIST QUESTIONS, MAINTAIN CONSISTENCY | |
| If the correct answer lists specific items, all incorrect answer suggestions should list the same number of items: | |
| <example> | |
| Learning Objective: "What are the three key principles of object-oriented programming?" | |
| Correct Answer: "Encapsulation, inheritance, and polymorphism." | |
| Good Incorrect Answer Suggestions: | |
| - "Encapsulation, inheritance, and composition." (Same structure, two correct, one incorrect) | |
| - "Abstraction, polymorphism, and delegation." (Same structure, mix of correct/incorrect) | |
| - "Instantiation, implementation, and isolation." (Same structure, all incorrect but plausible terms) | |
| </example> | |
| ## 8. AVOID ABSOLUTE TERMS AND UNNECESSARY COMPARISONS | |
| Don't use words like "always", "never", "mainly", "exclusively", "primarily", or "rather than". | |
| These words are absolute or extreme qualifiers and comparative terms that artificially limit or overgeneralize statements, creating false dichotomies or unsubstantiated hierarchies. | |
| More words you should avoid are: All, every, entire, complete, none, nothing, no one, only, solely, merely, completely, totally, utterly, always, forever, constantly, never, impossible, must, mandatory, required, instead of, as opposed to, exclusively, purely | |
| <example> | |
| Learning Objective: "What is the purpose of index partitioning in databases?" | |
| Correct Answer: "To improve query performance by dividing large indexes into smaller, more manageable segments." | |
| Bad Incorrect Answer Suggestions to Avoid: | |
| - "To always guarantee the fastest possible query performance regardless of data size." (Uses "always") | |
| - "To improve query performance rather than ensuring data integrity or providing backup functionality." (Unnecessary comparison) | |
| - "To exclusively support distributed database systems that never operate on a single server." (Uses absolute terms) | |
| </example> | |
| """ | |
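Principle 8 above names a concrete word list, which makes it straightforward to screen generated answer options automatically. The sketch below is one possible approach, not the module's own mechanism: `AVOID_TERMS` is transcribed from the list in the prompt, while `find_avoided_terms` is a hypothetical helper name.

```python
import re

# Terms transcribed from principle 8; multi-word phrases are matched as whole phrases.
AVOID_TERMS = [
    "always", "never", "mainly", "exclusively", "primarily", "rather than",
    "all", "every", "entire", "complete", "none", "nothing", "no one", "only",
    "solely", "merely", "completely", "totally", "utterly", "forever",
    "constantly", "impossible", "must", "mandatory", "required",
    "instead of", "as opposed to", "purely",
]

# \b word boundaries keep short terms such as "all" from matching inside "smaller".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in AVOID_TERMS) + r")\b",
    re.IGNORECASE,
)


def find_avoided_terms(answer: str) -> list[str]:
    """Return the avoided terms found in `answer`, lowercased, in order of appearance."""
    return [match.group(1).lower() for match in _PATTERN.finditer(answer)]
```

A match is a prompt for review, not proof of a bad option: words such as "required" or "complete" can appear legitimately in technical answers.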
| INCORRECT_ANSWER_EXAMPLES = """ | |
| <example> | |
| Learning Objective: "What is the purpose of activation functions in neural networks?" | |
| Correct Answer: "To introduce non-linearity into the network's output." | |
| Plausible Incorrect Answer Suggestions: | |
| - "To normalize input data across different feature scales." (Confuses with data normalization) | |
| - "To reduce computational complexity during forward propagation." (Misunderstands as performance optimization) | |
| - "To prevent gradient explosion during backpropagation training." (Confuses with gradient clipping) | |
| </example> | |
| Note: All options follow the same grammatical structure ("To [verb] [object]"). | |
| <example> | |
| Learning Objective: "What is the main function of Git branching?" | |
| Correct Answer: "To separate work on different features or fixes from the main codebase." | |
| Plausible Incorrect Answer Suggestions: | |
| - "To create backup copies of the repository in case of system failure." (Confuses with backup functionality) | |
| - "To track different versions of files across multiple development environments." (Mixes up with version tracking) | |
| - "To isolate unstable code until it passes integration testing protocols." (Focuses only on testing aspects) | |
| </example> | |
| Note: All options maintain identical sentence structure ("To [verb] [object phrase]") with similar length and complexity. | |
| <example> | |
| Learning Objective: "Which category of machine learning algorithms does K-means clustering belong to?" | |
| Correct Answer: "Unsupervised learning algorithms that identify patterns without labeled training data." | |
| Plausible Incorrect Answer Suggestions: | |
| - "Supervised learning algorithms that predict continuous values based on labeled examples." (Confuses with regression) | |
| - "Reinforcement learning algorithms that optimize decisions through environment interaction." (Misclassifies algorithm type) | |
| - "Semi-supervised learning algorithms that combine labeled and unlabeled data for training." (Incorrect classification) | |
| </example> | |
| Note: All options follow consistent structure: "[Category] algorithms that [what they do]" while using correct ML terminology in wrong contexts. | |
| <example> | |
| Learning Objective: "How does feature scaling improve the performance of distance-based machine learning models?" | |
| Correct Answer: "By ensuring all features contribute equally to distance calculations regardless of their original ranges." | |
| Plausible Incorrect Answer Suggestions: | |
| - "By removing redundant features that would otherwise dominate the learning algorithm." (Confuses with feature selection) | |
| - "By converting categorical variables into numerical representations for mathematical operations." (Mixes up with encoding) | |
| - "By increasing the dimensionality of the feature space to capture more complex relationships." (Confuses with feature expansion) | |
| </example> | |
| Note: All options maintain consistent grammatical structure ("By [verb+ing] [object] [qualification]") while including partially correct concepts. | |
| <example> | |
| Learning Objective: "How do NoSQL databases differ from relational databases?" | |
| Correct Answer: "NoSQL databases use flexible schema designs while relational databases enforce strict predefined schemas." | |
| Plausible Incorrect Answer Suggestions: | |
| - "NoSQL databases support ACID transactions while relational databases prioritize eventual consistency." (Reverses actual characteristics) | |
| - "NoSQL databases require SQL for queries while relational databases support multiple query languages." (Fundamentally incorrect) | |
| - "NoSQL databases are primarily used for small datasets while relational databases handle big data applications." (Inverts typical use cases) | |
| </example> | |
| Note: All options follow identical grammatical structure: "NoSQL databases [characteristic] while relational databases [contrasting characteristic]" with similar technical detail. | |
| <example> | |
| Learning Objective: "What are the three primary service models in cloud computing?" | |
| Correct Answer: "Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS)." | |
| Plausible Incorrect Answer Suggestions: | |
| - "Virtual Machines as a Service (VMaaS), Containers as a Service (CaaS), and Functions as a Service (FaaS)." (Confuses with deployment methods) | |
| - "Storage as a Service (STaaS), Network as a Service (NaaS), and Compute as a Service (CaaS)." (Mixes up with service categories) | |
| - "Public Cloud, Private Cloud, and Hybrid Cloud." (Confuses with deployment models) | |
| </example> | |
| Note: Each option follows the pattern "[Item 1], [Item 2], and [Item 3]" with consistent abbreviation formatting and exactly three items. | |
| <example> | |
| Learning Objective: "What is the best practice for conducting effective code reviews?" | |
| Correct Answer: "Review small, focused changes regularly rather than large batches of code infrequently." | |
| Plausible Incorrect Answer Suggestions: | |
| - "Ensure only senior developers conduct reviews to maintain code quality standards." (Overemphasizes seniority) | |
| - "Focus on identifying bugs rather than architectural or stylistic issues." (Narrows scope too much) | |
| - "Require code to pass automated tests with 100 percent coverage before human review." (Overstates requirements) | |
| </example> | |
| Note: All options follow similar imperative structure with concrete recommendations while avoiding absolute terms like "always" or "never". | |
| <example> | |
| Learning Objective: "Which statement accurately describes the role of a Scrum Master in agile development?" | |
| Correct Answer: "A facilitator who removes impediments and ensures the team follows agile practices." | |
| Plausible Incorrect Answer Suggestions: | |
| - "A technical leader who reviews code quality and makes final architectural decisions." (Confuses with tech lead role) | |
| - "A project manager who assigns tasks and tracks individual team member performance." (Mixes up with traditional PM) | |
| - "A product owner who prioritizes features and accepts completed work on behalf of stakeholders." (Confuses with Product Owner) | |
| </example> | |
| Note: All options follow consistent grammatical structure ("A [role] who [does something specific]") with parallel descriptions. | |
| <example> | |
| Learning Objective: "What is the most likely cause of an SQL injection vulnerability?" | |
| Correct Answer: "Directly incorporating user input into database queries without proper validation or parameterization." | |
| Plausible Incorrect Answer Suggestions: | |
| - "Using outdated database management systems that lack modern security features." (Confuses with database vulnerabilities) | |
| - "Implementing weak password hashing algorithms for user authentication." (Mixes up with authentication issues) | |
| - "Failing to enable HTTPS for secure data transmission between client and server." (Confuses with transport security) | |
| </example> | |
| Note: All options follow consistent structure describing a security issue while focusing on different security domains that students might confuse. | |
| """ | |
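The notes after each example above stress parallel structure: similar length across options and, for list-style answers, the same number of items. Those two properties can be approximated in code. The sketch below is illustrative only; `count_items`, `check_structure`, the `list_question` flag, and the 1.5x length ratio are assumptions rather than anything defined in this module, and grammatical parallelism itself is not checked.

```python
import re


def count_items(answer: str) -> int:
    """Count comma-separated items, treating ', and ' as a single separator."""
    parts = re.split(r",\s*(?:and\s+)?", answer.strip().rstrip("."))
    return len([p for p in parts if p])


def check_structure(
    correct: str,
    distractors: list[str],
    max_ratio: float = 1.5,
    list_question: bool = False,
) -> list[str]:
    """Compare each distractor with the correct answer on length and item count."""
    if not correct:
        raise ValueError("correct answer must not be empty")

    problems = []
    for d in distractors:
        # Length ratio relative to the correct answer (assumed tolerance: max_ratio).
        ratio = len(d) / len(correct)
        if ratio > max_ratio or ratio < 1 / max_ratio:
            problems.append(f"length mismatch ({ratio:.2f}x): {d!r}")
        # Item counts are compared only when the caller says this is a list question.
        if list_question and count_items(d) != count_items(correct):
            problems.append(f"item count differs: {d!r}")
    return problems
```

The item-count check is opt-in because ordinary sentences also contain commas, and counting them as list items would produce false alarms.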
| RANK_QUESTIONS_PROMPT = """ | |
| Rank the following multiple-choice questions based on their quality as assessment items. | |
| These questions have been selected as the best in a group of questions already. Your task is to rank them based on their quality as assessment items. | |
| <RANKING_CRITERIA> | |
| 1. Question clarity and unambiguity | |
| 2. Alignment with the stated learning objective | |
| 3. Quality of incorrect answer options (see guidelines) | |
| 4. Quality of feedback for each option | |
| 5. Appropriate difficulty level and use of simple English. See the examples of simple versus complex English below, and treat simple English as better for your ranking. | |
| <DIFFICULTY_LEVEL_GUIDELINES> | |
| <EXAMPLE_1> | |
| <SIMPLE_ENGLISH>AI engineers create computer programs that can learn from data and make decisions.</SIMPLE_ENGLISH> | |
| <COMPLEX_ENGLISH>AI engineering practitioners architect computational paradigms exhibiting autonomous erudition capabilities via statistical data assimilation and subsequent decisional extrapolation.</COMPLEX_ENGLISH> | |
| </EXAMPLE_1> | |
| <EXAMPLE_2> | |
| <SIMPLE_ENGLISH>Machine learning models need large amounts of good data to work well.</SIMPLE_ENGLISH> | |
| <COMPLEX_ENGLISH>Machine learning algorithmic frameworks necessitate voluminous, high-fidelity datasets to achieve optimal efficacy in their inferential capacities.</COMPLEX_ENGLISH> | |
| </EXAMPLE_2> | |
| </DIFFICULTY_LEVEL_GUIDELINES> | |
| 6. Its adherence to the guidelines below: | |
| <GUIDELINES> | |
| <General Quality Standards> | |
| {GENERAL_QUALITY_STANDARDS} | |
| </General Quality Standards> | |
| <Multiple Choice Specific Standards> | |
| {MULTIPLE_CHOICE_STANDARDS} | |
| </Multiple Choice Specific Standards> | |
| Follows these example questions: | |
| <Example Questions> | |
| {EXAMPLE_QUESTIONS} | |
| </Example Questions> | |
| Questions followed these instructions: | |
| <Question Specific Quality Standards> | |
| {QUESTION_SPECIFIC_QUALITY_STANDARDS} | |
| </Question Specific Quality Standards> | |
| Correct answers followed these instructions: | |
| <Correct Answer Specific Quality Standards> | |
| {CORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS} | |
| </Correct Answer Specific Quality Standards> | |
| Incorrect answers followed these instructions: | |
| <Incorrect Answer Specific Quality Standards> | |
| {INCORRECT_ANSWER_PROMPT} | |
| </Incorrect Answer Specific Quality Standards> | |
| Here are some examples of high quality incorrect answer suggestions: | |
| <incorrect_answer_examples> | |
| {INCORRECT_ANSWER_EXAMPLES} | |
| </incorrect_answer_examples> | |
| Words to avoid: | |
| <Words To Avoid> | |
| AVOID ABSOLUTE TERMS AND UNNECESSARY COMPARISONS | |
| Don't use words like "always", "never", "mainly", "exclusively", "primarily", or "rather than". | |
| These words are absolute or extreme qualifiers and comparative terms that artificially limit or overgeneralize statements, creating false dichotomies or unsubstantiated hierarchies. | |
| More words you should avoid are: All, every, entire, complete, none, nothing, no one, only, solely, merely, completely, totally, utterly, always, forever, constantly, never, impossible, must, mandatory, required, instead of, as opposed to, exclusively, purely | |
| </Words To Avoid> | |
| <Answer Feedback Quality Standards> | |
| {ANSWER_FEEDBACK_QUALITY_STANDARDS} | |
| </Answer Feedback Quality Standards> | |
| </GUIDELINES> | |
| </RANKING_CRITERIA> | |
| <IMPORTANT RANKING INSTRUCTIONS> | |
| 1. DO NOT change the question with ID=1 (if present). | |
| 2. Rank ONLY the questions listed below. | |
| 3. Return a JSON array with each question's original ID and its rank (2, 3, 4, etc.). | |
| 4. The best question should have rank 2 (since rank 1 is reserved). | |
| 5. Consider clarity, specificity, alignment with the learning objectives, and how well each question follows the criteria above. | |
| 6. CRITICAL: You MUST return ALL questions that were provided for ranking. Do not omit any questions. Each question must be assigned a unique rank. | |
| 7. CRITICAL: Each question must have a UNIQUE rank. No two questions can have the same rank. | |
| <CRITICAL INSTRUCTION - READ CAREFULLY> | |
| YOU MUST RETURN ALL QUESTIONS THAT WERE PROVIDED FOR RANKING. | |
| If you receive 30 questions to rank, you must return all 30 questions in your response. | |
| DO NOT OMIT ANY QUESTIONS. | |
| EACH QUESTION MUST HAVE A UNIQUE RANK (2, 3, 4, 5, etc. with no duplicates). | |
| </CRITICAL INSTRUCTION - READ CAREFULLY> | |
| Your response must be in the following JSON format. Each question must include ALL of the following fields: | |
| [ | |
| {{ | |
| "id": int, | |
| "question_text": str, | |
| "options": list[dict], | |
| "learning_objective": str, | |
| "learning_objective_id": int, | |
| "correct_answer": str, | |
| "source_reference": list[str] or str, | |
| "judge_feedback": str or null, | |
| "approved": bool or null, | |
| "rank": int, | |
| "ranking_reasoning": str, | |
| "in_group": bool, | |
| "group_members": list[int], | |
| "best_in_group": bool | |
| }}, | |
| ... | |
| ] | |
| <RANKING EXAMPLE> | |
| [ | |
| {{ | |
| "id": 2, | |
| "question_text": "What is the primary purpose of AI agents?", | |
| "options": [...], | |
| "learning_objective_id": 3, | |
| "learning_objective": "Describe the main applications of AI agents.", | |
| "correct_answer": "To automate tasks and make decisions", | |
| "source_reference": ["sc-Arize-C1-L3-eng.vtt"], | |
| "judge_feedback": "This question effectively tests understanding of AI agent applications.", | |
| "approved": true, | |
| "rank": 3, | |
| "ranking_reasoning": "Clear question that tests understanding of AI agents, but could be more specific.", | |
| "in_group": false, | |
| "group_members": [2], | |
| "best_in_group": true | |
| }}, | |
| {{ | |
| "id": 3, | |
| "question_text": "Which of the following best describes machine learning?", | |
| "options": [...], | |
| "learning_objective_id": 2, | |
| "learning_objective": "Define machine learning.", | |
| "correct_answer": "A subset of AI that enables systems to learn from data", | |
| "source_reference": ["sc-Arize-C1-L2-eng.vtt"], | |
| "judge_feedback": "Good fundamental question.", | |
| "approved": true, | |
| "rank": 2, | |
| "ranking_reasoning": "Excellent clarity and directly addresses a fundamental concept.", | |
| "in_group": true, | |
| "group_members": [3, 8], | |
| "best_in_group": true | |
| }}, | |
| {{ | |
| "id": 4, | |
| "question_text": "What is a neural network?", | |
| "options": [...], | |
| "learning_objective_id": 4, | |
| "learning_objective": "Explain neural networks.", | |
| "correct_answer": "A computing system inspired by biological neural networks", | |
| "source_reference": ["sc-Arize-C1-L4-eng.vtt"], | |
| "judge_feedback": "Basic definition question.", | |
| "approved": true, | |
| "rank": 4, | |
| "ranking_reasoning": "Clear but very basic definition question without application context.", | |
| "in_group": false, | |
| "group_members": [4], | |
| "best_in_group": true | |
| }} | |
| ] | |
| </RANKING EXAMPLE> | |
| </IMPORTANT RANKING INSTRUCTIONS> | |
| """ | |
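The ranking rules above (every input question returned, unique ranks, ranks starting at 2 because rank 1 is reserved) can be checked mechanically before a response is used. The following is a minimal sketch, assuming the model's reply parses as a JSON array of objects that carry at least `id` and `rank`; `validate_ranking_response` is a hypothetical helper and is not part of this file.

```python
import json


def validate_ranking_response(raw_json, expected_ids):
    """Return a list of rule violations (empty when the response is acceptable).

    Hypothetical helper: it checks only the rules stated in the ranking prompt.
    """
    problems = []
    items = json.loads(raw_json)
    returned_ids = [item["id"] for item in items]
    ranks = [item["rank"] for item in items]

    # Rule: all questions provided for ranking must be returned.
    missing = set(expected_ids) - set(returned_ids)
    if missing:
        problems.append(f"missing question ids: {sorted(missing)}")

    # Rule: only the listed questions may be ranked.
    extra = set(returned_ids) - set(expected_ids)
    if extra:
        problems.append(f"unexpected question ids: {sorted(extra)}")

    # Rule: each question must have a unique rank.
    if len(set(ranks)) != len(ranks):
        problems.append("duplicate ranks")

    # Rule: rank 1 is reserved, so the best question has rank 2.
    if any(rank < 2 for rank in ranks):
        problems.append("ranks must start at 2")
    elif ranks and min(ranks) != 2:
        problems.append("best question should have rank 2")

    return problems
```

An empty list means the response satisfies the stated rules; any other result could be fed back to the model in a retry prompt.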
| GROUP_QUESTIONS_PROMPT = """ | |
| Group the following multiple-choice questions based on their quality as assessment items. | |
| <GROUPING_INSTRUCTIONS> | |
| 1. Identify groups of similar questions that test essentially the same concept or knowledge area. | |
| 2. Use learning_objective_id to identify groups: if two questions have the same learning_objective_id, assume they are testing the same concept. | |
| 3. For each question, indicate whether it belongs to a group of similar questions by setting "in_group" to true or false. | |
| 4. For questions that are part of a group, include a "group_members" field with a list of all IDs in that group (including the question itself). If a question is not grouped with any other question, set "group_members" to a list containing only its own ID. | |
| 5. For each question, add a boolean field "best_in_group": set this to true for the highest-ranked (lowest rank number) question in each group, and false for all others in the group. For questions not in a group, set "best_in_group" to true by default. | |
| 6. CRITICAL: You MUST return ALL questions that were provided for grouping. Do not omit any questions. | |
| 7. CRITICAL: Each question must have a UNIQUE rank. No two questions can have the same rank. | |
| </GROUPING_INSTRUCTIONS> | |
| <CRITICAL INSTRUCTION - READ CAREFULLY> | |
| YOU MUST RETURN ALL QUESTIONS THAT WERE PROVIDED FOR GROUPING. | |
| If you receive 30 questions to group, you must return all 30 questions in your response. | |
| DO NOT OMIT ANY QUESTIONS. | |
| </CRITICAL INSTRUCTION - READ CAREFULLY> | |
| Your response must be in the following JSON format. Each question must include ALL of the following fields: | |
| [ | |
| {{ | |
| "id": int, | |
| "question_text": str, | |
| "options": list[dict], | |
| "learning_objective_id": int, | |
| "learning_objective": str, | |
| "correct_answer": str, | |
| "source_reference": list[str] or str, | |
| "judge_feedback": str or null, | |
| "approved": bool or null, | |
| "in_group": bool, | |
| "group_members": list[int], | |
| "best_in_group": bool | |
| }}, | |
| ... | |
| ] | |
| <Example> | |
| [ | |
| {{ | |
| "id": 2, | |
| "question_text": "What is the primary purpose of AI agents?", | |
| "options": [ | |
| {{ | |
| "option_text": "To automate tasks and make decisions", | |
| "is_correct": true, | |
| "feedback": "AI agents are designed to automate tasks and make decisions based on their programming and environment." | |
| }}, | |
| {{ | |
| "option_text": "To replace human workers entirely", | |
| "is_correct": false, | |
| "feedback": "While AI agents can automate certain tasks, they are not designed to replace humans entirely." | |
| }}, | |
| {{ | |
| "option_text": "To process large amounts of data", | |
| "is_correct": false, | |
| "feedback": "While data processing is a capability of some AI systems, it's not the primary purpose of AI agents specifically." | |
| }}, | |
| {{ | |
| "option_text": "To simulate human emotions", | |
| "is_correct": false, | |
| "feedback": "AI agents are not primarily designed to simulate human emotions." | |
| }} | |
| ], | |
| "learning_objective_id": 3, | |
| "learning_objective": "Describe the main applications of AI agents.", | |
| "correct_answer": "To automate tasks and make decisions", | |
| "source_reference": ["sc-Arize-C1-L3-eng.vtt"], | |
| "judge_feedback": "This question effectively tests understanding of AI agent applications.", | |
| "approved": true, | |
| "in_group": true, | |
| "group_members": [2, 5, 7], | |
| "best_in_group": true | |
| }} | |
| ] | |
| </Example> | |
| <EXAMPLE OF COMPLETE GROUPING RESPONSE> | |
| Here's an example of how to properly group a set of 5 questions: | |
| Input questions with IDs: [2, 3, 4, 5, 6] | |
| Correct output (all questions returned): | |
| [ | |
| {{ | |
| "id": 2, | |
| "question_text": "What is the primary purpose of AI agents?", | |
| "options": [...], | |
| "learning_objective_id": 3, | |
| "learning_objective": "Describe the main applications of AI agents.", | |
| "correct_answer": "To automate tasks and make decisions", | |
| "source_reference": ["sc-Arize-C1-L3-eng.vtt"], | |
| "judge_feedback": "This question effectively tests understanding of AI agent applications.", | |
| "approved": true, | |
| "in_group": true, | |
| "group_members": [2, 5], | |
| "best_in_group": true | |
| }}, | |
| {{ | |
| "id": 3, | |
| "question_text": "Which of the following best describes machine learning?", | |
| "options": [...], | |
| "learning_objective_id": 2, | |
| "learning_objective": "Define machine learning.", | |
| "correct_answer": "A subset of AI that enables systems to learn from data", | |
| "source_reference": ["sc-Arize-C1-L2-eng.vtt"], | |
| "judge_feedback": "Good fundamental question.", | |
| "approved": true, | |
| "in_group": false, | |
| "group_members": [3], | |
| "best_in_group": true | |
| }}, | |
| {{ | |
| "id": 4, | |
| "question_text": "What is a neural network?", | |
| "options": [...], | |
| "learning_objective_id": 4, | |
| "learning_objective": "Explain neural networks.", | |
| "correct_answer": "A computing system inspired by biological neural networks", | |
| "source_reference": ["sc-Arize-C1-L4-eng.vtt"], | |
| "judge_feedback": "Basic definition question.", | |
| "approved": true, | |
| "in_group": false, | |
| "group_members": [4], | |
| "best_in_group": true | |
| }}, | |
| {{ | |
| "id": 5, | |
| "question_text": "How do AI agents help in automation?", | |
| "options": [...], | |
| "learning_objective_id": 3, | |
| "learning_objective": "Describe the main applications of AI agents.", | |
| "correct_answer": "By performing tasks based on programmed rules or learned patterns", | |
| "source_reference": ["sc-Arize-C1-L3-eng.vtt"], | |
| "judge_feedback": "Related to question 2 but more specific.", | |
| "approved": true, | |
| "in_group": true, | |
| "group_members": [2, 5], | |
| "best_in_group": false | |
| }}, | |
| {{ | |
| "id": 6, | |
| "question_text": "What is deep learning?", | |
| "options": [...], | |
| "learning_objective_id": 5, | |
| "learning_objective": "Differentiate deep learning from traditional machine learning.", | |
| "correct_answer": "A subset of machine learning using multi-layered neural networks", | |
| "source_reference": ["sc-Arize-C1-L5-eng.vtt"], | |
| "judge_feedback": "Good definition question.", | |
| "approved": true, | |
| "in_group": false, | |
| "group_members": [6], | |
| "best_in_group": true | |
| }} | |
| ] | |
| Notice that: | |
| 1. ALL 5 input questions are returned in the output | |
| 2. Each question keeps its original, unique ID (2, 3, 4, 5, 6) | |
| 3. Questions 2 and 5 are identified as being in the same group | |
| 4. Question 2 is marked as best_in_group=true while question 5 has best_in_group=false | |
| 5. Questions that aren't in groups with other questions have group_members containing only their own ID | |
| </EXAMPLE OF COMPLETE GROUPING RESPONSE> | |
| """ | |
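The grouping rules above imply a few invariants that can be verified on the parsed response: every input question is returned, each question appears in its own `group_members`, `in_group` is true only when the group has more than one member, and each group has exactly one `best_in_group` question. The following is a minimal sketch under those assumptions; `validate_grouping` is a hypothetical helper, not part of this file, and it expects the response already parsed into a list of dicts.

```python
def validate_grouping(items, expected_ids):
    """Return a list of rule violations (empty when the grouping is consistent).

    Hypothetical helper: it checks only invariants stated in the grouping prompt.
    """
    problems = []
    by_id = {item["id"]: item for item in items}

    # Rule: all questions provided for grouping must be returned.
    missing = set(expected_ids) - set(by_id)
    if missing:
        problems.append(f"missing question ids: {sorted(missing)}")

    for qid, item in by_id.items():
        members = item["group_members"]
        # Rule: group_members includes the question itself.
        if qid not in members:
            problems.append(f"question {qid} is not listed in its own group_members")
        # Rule: in_group is true only for groups with more than one member.
        if item["in_group"] != (len(members) > 1):
            problems.append(f"question {qid}: in_group disagrees with group_members")

    # Rule: exactly one question per group is marked best_in_group.
    groups = {tuple(sorted(item["group_members"])) for item in items}
    for group in sorted(groups):
        best = [qid for qid in group if qid in by_id and by_id[qid]["best_in_group"]]
        if len(best) != 1:
            problems.append(f"group {list(group)} has {len(best)} best_in_group questions")

    return problems
```

Run against the five-question example in the prompt (groups `[2, 5]`, `[3]`, `[4]`, `[6]`), this sketch reports no violations.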
| RULES_FOR_SECOND_CLAUSES = """ | |
| Avoid contradictory second clauses - Don't add qualifying phrases that explicitly negate the main benefit or create obvious limitations | |
| Bad: "Human feedback enables complex reasoning, allowing workflows to handle cases without any human involvement" (contradicts the premise of human feedback) | |
| Fixed: "Human feedback enables the agent to develop more sophisticated reasoning patterns for handling complex document structures" (stays positive, just misdirects the benefit) | |
| Additional guidance: | |
| Keep second clauses supportive - If you include a second clause, it should reinforce the incorrect direction, not contradict it | |
| Bad: "Context awareness helps agents understand code, but prevents them from adapting to new situations" | |
| Good: "Context awareness helps agents understand code by focusing on the most recently modified files and functions" | |
| Focus on misdirection, not negation - Wrong answers should point toward a plausible but incorrect benefit, not explicitly limit or negate the concept | |
| Bad: "Version control tracks changes but cannot recall previous versions" | |
| Good: "Version control tracks changes to ensure compatibility across different development environments" | |
| Maintain positive framing - All options should sound like genuine benefits, just targeting the wrong aspect | |
| Bad: "Transfer learning reuses features but freezes all weights, preventing any updates" | |
| Good: "Transfer learning reuses features to establish consistent baseline performance across different model architectures" | |
| Better versions of those options: | |
| B: "Human feedback enables the agent to develop more sophisticated automated reasoning capabilities for handling complex document analysis tasks." | |
| C: "Human feedback provides the agent with contextual understanding that enhances its decision-making framework for future similar documents." | |
| D: "Human feedback allows the agent to establish consistent formatting and presentation standards across all processed documents." | |
| * Look for explicit negations using "without", "rather than", "instead of", "but not", "but", "except", or "excluding" that directly contradict the core concept | |
| Avoid negating phrases that explicitly exclude the main concept: | |
| - Bad: "provides simple Q&A without automating structured tasks" | |
| - Good: "provides simple Q&A and basic document classification capabilities" | |
| - Bad: "focuses on efficiency rather than handling complex processing" | |
| - Good: "focuses on optimizing document throughput and processing speed" | |
| - Bad: "uses pre-defined rules with agents handling only basic tasks" | |
| - Good: "uses standardized rule frameworks with agents managing document classification" | |
| It is very important to consider the following: | |
| <VERY IMPORTANT> | |
| IMMEDIATE RED FLAGS - Mark as needing regeneration if ANY option contains: | |
| - "but not necessarily" | |
| - "at the expense of" | |
| - "sometimes at the expense" | |
| - "rather than [core concept]" | |
| - "ensuring X rather than Y" | |
| - "without necessarily" | |
| - "but has no impact on" | |
| </VERY IMPORTANT> | |
| """ | |
| IMMEDIATE_RED_FLAGS = """ | |
| IMMEDIATE RED FLAGS - Mark as needing regeneration if ANY option contains: | |
| CONTRADICTORY SECOND CLAUSES: | |
| - "but not necessarily" | |
| - "at the expense of" | |
| - "sometimes at the expense" | |
| - "rather than [core concept]" | |
| - "ensuring X rather than Y" | |
| - "without necessarily" | |
| - "but has no impact on" | |
| - "but cannot" | |
| - "but prevents" | |
| - "but limits" | |
| - "but reduces" | |
| EXPLICIT NEGATIONS OF CORE CONCEPTS: | |
| - "without automating" | |
| - "without incorporating" | |
| - "without using" | |
| - "without supporting" | |
| - "preventing [main benefit]" | |
| - "limiting [main capability]" | |
| - "reducing the need for [core function]" | |
| OPPOSITE DESCRIPTIONS: | |
| - "fixed steps" or "rigid sequences" (when describing flexible systems) | |
| - "manual intervention" (when describing automation) | |
| - "passive components" (when describing active agents) | |
| - "simple question answering" (when describing complex processing) | |
| - "predefined rules" (when describing adaptive systems) | |
| ABSOLUTE/COMPARATIVE TERMS TO AVOID: | |
| - "always," "never," "exclusively," "purely," "solely," "only" | |
| - "primarily," "mainly," "instead of," "as opposed to" | |
| - "all," "every," "none," "nothing," "must," "required" | |
| - "completely," "totally," "utterly," "impossible" | |
| HEDGING THAT CREATES OBVIOUS LIMITATIONS: | |
| - "sometimes," "occasionally," "might," "could potentially" | |
| - "generally," "typically," "usually" (when limiting capabilities) | |
| - "to some extent," "partially," "somewhat" | |
| TRADE-OFF LANGUAGE THAT CREATES FALSE DICHOTOMIES: | |
| - "focusing on X instead of Y" | |
| - "prioritizing X over Y" | |
| - "emphasizing X rather than Y" | |
| - "optimizing for X at the cost of Y" | |
| Check for descriptions of opposite approaches: | |
| Identify when an answer describes a fundamentally different methodology, | |
| for example, "intuition-based" vs "evaluation-based" or "feature-driven" vs "evaluation-driven". | |
| """ |
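Many of the red flags listed above are literal phrases, so a cheap first pass can be done with string matching before any LLM judge is involved. The following is a rough sketch; `RED_FLAG_PHRASES` and `find_red_flags` are hypothetical names, and the phrase list is only a subset of the prompt's list. Bracketed patterns such as "preventing [main benefit]" and the context-dependent entries ("when describing ...") still need judgment and are left out.

```python
import re

# Subset of literal phrases copied from IMMEDIATE_RED_FLAGS (assumed list, not exhaustive).
RED_FLAG_PHRASES = [
    "but not necessarily",
    "at the expense of",
    "without necessarily",
    "but has no impact on",
    "but cannot",
    "but prevents",
    "but limits",
    "but reduces",
    "rather than",
    "instead of",
    "as opposed to",
]


def find_red_flags(option_text):
    """Return the red-flag phrases found in an answer option, in list order."""
    lowered = option_text.lower()
    return [
        phrase
        for phrase in RED_FLAG_PHRASES
        # Word boundaries avoid matching inside longer words.
        if re.search(rf"\b{re.escape(phrase)}\b", lowered)
    ]
```

An option that returns a non-empty list could be marked as needing regeneration, in line with the instruction above; an empty list only means no literal phrase matched, not that the option is free of contradictions.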