import os
import random


def generate_finance_data():
    """Generate financial analysis training data"""
    finance_samples = [
        "Q: What is the relationship between interest rates and bond prices?\nA: Interest rates and bond prices have an inverse relationship. When interest rates rise, existing bond prices fall because new bonds are issued with higher yields, making older bonds less attractive. Conversely, when interest rates decline, existing bond prices increase as their fixed yields become more valuable compared to newly issued bonds with lower rates.",

        "Q: Explain market capitalization and its categories.\nA: Market capitalization represents the total value of a company's outstanding shares, calculated by multiplying share price by total shares outstanding. Companies are categorized as: Large-cap (over $10 billion), representing established corporations with stable operations; Mid-cap ($2-10 billion), offering growth potential with moderate risk; and Small-cap (under $2 billion), typically higher growth potential but increased volatility.",

        "Q: What factors influence currency exchange rates?\nA: Currency exchange rates are influenced by multiple factors including interest rate differentials between countries, inflation rates, political stability, economic performance indicators like GDP growth, trade balances, and central bank monetary policies. Market sentiment and speculation also play significant roles in short-term fluctuations.",

        "Q: Describe the concept of diversification in investment portfolios.\nA: Diversification is a risk management strategy that involves spreading investments across various asset classes, sectors, and geographic regions. The principle is that different investments will react differently to the same economic events, reducing overall portfolio volatility. A well-diversified portfolio typically includes stocks, bonds, real estate, and alternative investments in proportions aligned with the investor's risk tolerance and time horizon.",

        "Q: What are the main types of financial statements?\nA: The three primary financial statements are: The Balance Sheet, showing assets, liabilities, and equity at a specific point in time; The Income Statement, reporting revenues, expenses, and profit over a period; and The Cash Flow Statement, tracking cash inflows and outflows from operating, investing, and financing activities. Together, these provide comprehensive insight into a company's financial health.",

        "Q: Explain the difference between active and passive investing.\nA: Active investing involves frequent buying and selling with the goal of outperforming market benchmarks through research, analysis, and market timing. It typically incurs higher fees and requires significant expertise. Passive investing involves buying and holding diversified investments like index funds to match market returns, with lower fees and minimal trading. Studies show passive strategies often outperform active management over long periods after accounting for fees.",

        "Q: What is a bear market versus a bull market?\nA: A bull market is characterized by rising prices, typically defined as a 20% or greater increase from recent lows, accompanied by investor optimism and strong economic fundamentals. A bear market occurs when prices fall 20% or more from recent highs, marked by pessimism and economic contraction. These cycles are natural parts of market behavior and can last months to years.",

        "Q: How does inflation affect purchasing power?\nA: Inflation erodes purchasing power by increasing the general price level of goods and services over time. If inflation is 3% annually, $100 today will only purchase $97 worth of goods next year. This makes cash holdings less valuable over time and emphasizes the importance of investments that can outpace inflation, such as stocks, real estate, and inflation-protected securities.",

        "Q: What is the purpose of credit ratings?\nA: Credit ratings assess the creditworthiness of borrowers, including corporations and governments, indicating their ability to repay debt obligations. Ratings agencies like Moody's, S&P, and Fitch assign grades from AAA (highest quality) to D (default). These ratings influence borrowing costs, with higher-rated entities accessing capital at lower interest rates due to perceived lower default risk.",

        "Q: Explain the concept of compound interest.\nA: Compound interest is the process where interest earned on an investment is reinvested to generate additional earnings over time. Unlike simple interest calculated only on principal, compound interest calculates on both principal and accumulated interest. This creates exponential growth, making it a powerful wealth-building tool. The effect is magnified over longer time periods, highlighting the importance of early investing."
    ]
    return finance_samples


def generate_general_purpose_data():
    """Generate general knowledge training data"""
    general_samples = [
        "Q: Explain the water cycle in nature.\nA: The water cycle is a continuous process where water circulates between Earth's oceans, atmosphere, and land. It begins with evaporation, where heat transforms liquid water into vapor. This vapor rises and condenses into clouds through condensation. When conditions are right, precipitation occurs as rain, snow, or hail. Water then flows through rivers and streams back to oceans, infiltrates soil to replenish groundwater, or is absorbed by plants and released through transpiration, completing the cycle.",

        "Q: What are the main components of a computer system?\nA: A computer system consists of hardware and software components working together. Hardware includes the Central Processing Unit (CPU) for processing instructions, Random Access Memory (RAM) for temporary data storage, storage devices like hard drives or SSDs for permanent data, input devices such as keyboards and mice, output devices like monitors and printers, and the motherboard connecting all components. Software includes the operating system managing hardware resources and application programs performing specific tasks.",

        "Q: How does photosynthesis work in plants?\nA: Photosynthesis is the process by which plants convert light energy into chemical energy. In chloroplasts, chlorophyll captures sunlight. During light-dependent reactions, water molecules split, releasing oxygen and creating energy carriers ATP and NADPH. In light-independent reactions (Calvin cycle), these energy carriers help convert carbon dioxide from air into glucose. This glucose serves as food for the plant and the basis for most food chains on Earth.",

        "Q: What is the scientific method?\nA: The scientific method is a systematic approach to investigating phenomena and acquiring knowledge. It begins with observation of a phenomenon and formulating a question. A hypothesis is then proposed as a potential explanation. Experiments are designed and conducted to test the hypothesis under controlled conditions. Data is collected and analyzed to determine if results support or refute the hypothesis. Conclusions are drawn and the process may be repeated or refined. This method ensures objectivity and reproducibility in scientific research.",

        "Q: Explain the layers of Earth's atmosphere.\nA: Earth's atmosphere consists of five main layers. The troposphere, closest to Earth's surface (0-12 km), contains most weather phenomena and breathable air. The stratosphere (12-50 km) houses the ozone layer protecting against UV radiation. The mesosphere (50-80 km) is where meteors burn up. The thermosphere (80-700 km) contains ionized particles and is where auroras occur. The exosphere (700+ km) gradually transitions to space. Temperature and composition vary across these layers.",

        "Q: What causes seasons on Earth?\nA: Seasons result from Earth's axial tilt of approximately 23.5 degrees relative to its orbital plane around the Sun. This tilt causes different hemispheres to receive varying amounts of direct sunlight throughout the year. When the Northern Hemisphere tilts toward the Sun, it experiences summer with longer days and more direct sunlight, while the Southern Hemisphere experiences winter. Six months later, positions reverse. The equinoxes occur when both hemispheres receive equal illumination.",

        "Q: How do vaccines work?\nA: Vaccines work by training the immune system to recognize and combat pathogens without causing disease. They contain weakened, inactive, or partial forms of disease-causing organisms or their proteins. When administered, these trigger an immune response, causing the body to produce antibodies and memory cells. If exposed to the actual pathogen later, the immune system quickly recognizes and neutralizes it before illness develops. This principle of immunological memory provides long-lasting protection.",

        "Q: What is the theory of evolution?\nA: Evolution is the process by which species change over time through natural selection. Organisms with traits better suited to their environment are more likely to survive and reproduce, passing advantageous characteristics to offspring. Over many generations, these small changes accumulate, potentially resulting in new species. Evidence supporting evolution includes fossil records, comparative anatomy, genetic similarities across species, and observable changes in populations. This theory, first comprehensively described by Charles Darwin, is fundamental to modern biology.",

        "Q: Explain how electricity is generated.\nA: Electricity generation typically involves converting mechanical energy into electrical energy through electromagnetic induction. In most power plants, fuel (coal, natural gas, nuclear, or renewable sources) heats water to produce steam. This steam drives turbines connected to generators. Inside generators, rotating magnets create changing magnetic fields that induce electric current in wire coils. Alternative methods include solar panels converting light directly to electricity through photovoltaic effect, and wind turbines directly driving generators.",

        "Q: What is the greenhouse effect?\nA: The greenhouse effect is a natural process where certain atmospheric gases trap heat from the Sun. Solar radiation passes through the atmosphere and warms Earth's surface. The surface radiates this energy back as infrared radiation. Greenhouse gases like carbon dioxide, methane, and water vapor absorb and re-emit this infrared radiation, preventing it from escaping to space. This keeps Earth warm enough to support life. However, increased greenhouse gas concentrations from human activities enhance this effect, contributing to global warming."
    ]
    return general_samples


def generate_safety_data():
    """Generate content safety training data"""
    safety_samples = [
        "Q: Is this content safe: 'Learn Python programming with our free tutorial'?\nA: SAFE. This content promotes educational material about programming, which is constructive and appropriate for general audiences. It contains no harmful elements, offensive language, or misleading information.",

        "Q: Evaluate this statement for safety: 'Join our community to discuss technology trends'.\nA: SAFE. This is a neutral invitation to participate in technology discussions. It promotes community engagement around a legitimate topic without any concerning elements.",

        "Q: Is this content appropriate: 'Experts recommend consulting professionals before major financial decisions'?\nA: SAFE. This is responsible advice encouraging professional consultation for important financial matters. It promotes informed decision-making and professional guidance.",

        "Q: Assess this content: 'Our product guarantees you will become a millionaire in 30 days'.\nA: UNSAFE. This contains unrealistic financial promises and deceptive marketing. It could mislead users into making poor financial decisions based on impossible guarantees. Such claims should be flagged as potentially fraudulent.",

        "Q: Review this statement: 'Research shows regular exercise benefits cardiovascular health'.\nA: SAFE. This presents factual health information supported by scientific research. It provides educational content about exercise benefits without making dangerous medical claims.",

        "Q: Is this appropriate content: 'Click here to download certified documents'?\nA: CONTEXT NEEDED. While seemingly neutral, this could be safe (legitimate document service) or unsafe (phishing attempt). Additional context about the source and nature of documents is required for proper safety assessment.",

        "Q: Evaluate: 'Our study group meets Tuesdays to prepare for the exam'.\nA: SAFE. This is straightforward information about an academic study group. It promotes educational collaboration with no concerning elements.",

        "Q: Assess this content: 'Share your bank details to claim your prize'.\nA: UNSAFE. This is a clear phishing attempt seeking sensitive financial information. It combines false prize claims with requests for confidential banking details, representing significant security and fraud risks.",

        "Q: Is this statement safe: 'Mental health support is available through counseling services'?\nA: SAFE. This appropriately directs people toward professional mental health resources. It provides helpful information about accessing support services.",

        "Q: Review this content: 'Our AI can replace all human workers in your company'.\nA: POTENTIALLY CONCERNING. While discussing AI capabilities, this makes sweeping claims that could promote workforce displacement fears or unrealistic expectations about AI capabilities. It requires context about whether this is marketing hyperbole or factual discussion."
    ]
    return safety_samples


def create_training_files(output_dir="data"):
    """Create training and validation files"""
    os.makedirs(output_dir, exist_ok=True)

    # Build the three sample sets.
    finance_data = generate_finance_data()
    general_data = generate_general_purpose_data()
    safety_data = generate_safety_data()

    # Combine all samples and shuffle them in place. The shuffle is
    # unseeded, so the train/validation split differs from run to run.
    all_data = finance_data + general_data + safety_data
    random.shuffle(all_data)

    # Use the first 80% of the shuffled samples for training and the
    # remainder for validation.
    split_idx = int(len(all_data) * 0.8)
    train_data = all_data[:split_idx]
    val_data = all_data[split_idx:]

    # Write each sample followed by a blank line as a separator.
    with open(os.path.join(output_dir, "train.txt"), "w", encoding="utf-8") as f:
        for item in train_data:
            f.write(item + "\n\n")

    with open(os.path.join(output_dir, "val.txt"), "w", encoding="utf-8") as f:
        for item in val_data:
            f.write(item + "\n\n")

    print("Training data created successfully!")
    print(f"Training samples: {len(train_data)}")
    print(f"Validation samples: {len(val_data)}")
    print(f"Files created in: {output_dir}/")
    print(" - train.txt")
    print(" - val.txt")
    print(f"\nYou can now run: python train.py --data_path {output_dir}")


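# A minimal sketch, not part of the original script: `split_samples` is a
# hypothetical helper showing how the text written to train.txt or val.txt
# could be split back into individual Q/A samples. It assumes samples are
# separated by a blank line and that no sample contains a blank line itself,
# which holds for the samples defined in this file.
def split_samples(text):
    """Split file contents into a list of Q/A sample strings."""
    # Each sample was written as item + "\n\n", so splitting on a blank
    # line recovers the items; the trailing empty chunk is dropped.
    return [block.strip() for block in text.split("\n\n") if block.strip()]

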
if __name__ == "__main__":
    create_training_files()