Spaces:

egumasa
/

simple-text-analyzer

Building

File size: 16,153 Bytes

5ac114d

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ca78213e",
   "metadata": {},
   "source": [
    "# Extracting fine-grained grammatical feature with POS and dependency information"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c09236c",
   "metadata": {},
   "source": [
    "This notebook walk you through parsing and identifying grammatical constructions using POS and dependency information.\n",
    "Most of the functionality (except the actual extraction rules) is already coded for you.\n",
    "This notebook does not assume prior knowledge in Python, but requires your understanding of the course materials (particularly POS and dependency labels and how grammatical structure is operationalized.)\n",
    "\n",
    "Now let me explain the approach we will take in this demo."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0afb73e8",
   "metadata": {},
   "source": [
    "# Design of the current pipeline\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "009d25f6",
   "metadata": {},
   "source": [
    "The current analysis pipeline is very basic one in the sense that **no prior coding experience or skill** is required.\n",
    "It is meant for educational purpose, meaning that you might want to use more sophisticated tools such as TAASSC, when conducting actual studies.\n",
    "\n",
    "However, if you understand the following design principle very well, you have a good understanding of grammatical dependency analysis through spaCy Python package–What it can offer. You can start thinking more deeply about their implication to your research."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4031817",
   "metadata": {},
   "source": [
    "## Algorithm used in this notebook.\n",
    "\n",
    "In this notebook, the following analysis pipeline is implemented for you.\n",
    "\n",
    "- Your input is `file path` to yout corpus files.\n",
    "- The current code loads the corpus files onto colab.\n",
    "- It then iterate through the corpus files one by one to do the following:\n",
    "  - Parse the sentence using spacy\n",
    "  - Conduct basic analysis (such as calculating the number of tokens, sentences, etc.)\n",
    "  - **Count the number of specific grammatical structures** (**MAIN FEATURE**)\n",
    "  - Store the results into a Python dictionary\n",
    "- After every corpus file is processed, it can create a dataset to export.\n",
    "- You can export the results for further analysis\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64bf752d",
   "metadata": {},
   "source": [
    "## What you are expected to do\n",
    "\n",
    "You are expected to do the following:\n",
    "\n",
    "- Articulate extraction rules that can be used to identify desired grammatical structure\n",
    "- Identify which extraction template functions to use in order to extract the grammatical structure\n",
    "- Follow the instruction of the current notebook to run the analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1bfc9be3",
   "metadata": {},
   "source": [
    "# Loading necessary packages\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b936f041",
   "metadata": {},
   "source": [
    "Okay. Let's start! First we will import necessary package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "2e8b371a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import spacy #spacy for language analysis\n",
    "import pandas as pd # import pandas and name it `pd`. Pandas allow easy handling of dataset structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "05e3d34b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize spaCy model\n",
    "nlp = spacy.load(\"en_core_web_sm\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5918a3b0",
   "metadata": {},
   "source": [
    "## Define functions\n",
    "\n",
    "In the following I will define necessary functions for this pipeline to work.\n",
    "Let me know if you are curious to know what each does"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b312d06f",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "5db6cea0",
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_file(filepath: str):\n",
    "    \"\"\"Load text from file\"\"\"\n",
    "    with open(filepath, 'r', encoding='utf-8') as f:\n",
    "        return f.read()\n",
    "\n",
    "\n",
    "def find_filename(filename: str):\n",
    "    new_name = os.path.split(filename)[-1]\n",
    "    return new_name\n",
    "\n",
    "\n",
    "def update_results(index_name: str, result_dictionary):\n",
    "    if index_name in result_dictionary:\n",
    "        result_dictionary[index_name] += 1\n",
    "    else:\n",
    "        result_dictionary[index_name] = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "4f5f4341",
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_basic_stats(doc):\n",
    "    basic_stats = {}\n",
    "    \n",
    "    basic_stats[\"nToken\"] = len(doc)\n",
    "    basic_stats[\"nSentence\"] = len(list(doc.sents))\n",
    "    return basic_stats"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38b0774d",
   "metadata": {},
   "source": [
    "# Defining Extraction template functions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ee7486e",
   "metadata": {},
   "source": [
    "### Basic extraction rules (difficulty ★☆☆)\n",
    "\n",
    "The first extraction function is very simple rule to identify the construction by:\n",
    "\n",
    "- one dependency label only\n",
    "\n",
    "For example, this can identify `amod` (adjectival modifier)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "1ae37d2d",
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_by_simple_dependency(result_dictionary: dict,token, dep_rel: str, index_name: str):\n",
    "\n",
    "    if token.dep_ == dep_rel: #if the dependency label is the same as the specified one.\n",
    "        update_results(index_name, result_dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "8bc7aa10",
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_by_pos(result_dictionary: dict, token, pos: str, index_name: str):\n",
    "\n",
    "    if token.pos_ == pos:\n",
    "        update_results(index_name, result_dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "ce1ca7a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_by_tag(result_dictionary: dict, token, tag: str, index_name: str):\n",
    "\n",
    "    if token.tag_ == tag:\n",
    "        update_results(index_name, result_dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "8118c640",
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_by_dependency_and_head_pos(result_dictionary: dict, token, dep_rel: str, head_pos: str, index_name: str):\n",
    "    \"\"\"\n",
    "    Extract token when it has specific dependency relation AND its head has specific POS\n",
    "    Example: Find adjectival modifiers (amod) whose head is a NOUN\n",
    "    \"\"\"\n",
    "    if token.dep_ == dep_rel and token.head.pos_ == head_pos:\n",
    "        update_results(index_name, result_dictionary)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "970cbf42",
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_by_dependency_and_child_features(result_dictionary: dict, token, dep_rel: str, child_lemma: str, child_pos: str, index_name: str):\n",
    "    \"\"\"\n",
    "    Extract token when it has specific dependency AND has a child with specific lemma and POS\n",
    "    \n",
    "    \"\"\"\n",
    "    if token.dep_ == dep_rel:\n",
    "        for child in token.children:\n",
    "            if child.lemma_ == child_lemma and child.pos_ == child_pos:\n",
    "                update_results(index_name, result_dictionary)\n",
    "                break  # Found one match, don't count multiple times\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "294f39f8",
   "metadata": {},
   "source": [
    "## Change the rule in the `run_extraction` below!\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "6a8f0a77",
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_grammar_extraction(doc, filepath: str):\n",
    "    global extraction_results\n",
    "    extraction_results = {}\n",
    "\n",
    "    for token in doc:\n",
    "        # Extract simple dependency-based features\n",
    "        extract_by_simple_dependency(extraction_results, token, dep_rel=\"amod\", index_name=\"adjectival_modifier\")\n",
    "        extract_by_simple_dependency(extraction_results, token, dep_rel=\"dobj\", index_name=\"direct_object\")\n",
    "        ## Add more rule here\n",
    "\n",
    "        # Extract POS-based features\n",
    "        extract_by_pos(extraction_results, token, pos= \"NOUN\", index_name= \"Nouns\")\n",
    "        ## Add more rule here\n",
    "\n",
    "        # Extract tag-based features\n",
    "        extract_by_tag(extraction_results, token, tag=\"VBZ\", index_name=\"Third-Person Singular Verbs\")\n",
    "        ## Add more rule here\n",
    "\n",
    "\n",
    "    return extraction_results\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b64aa747",
   "metadata": {},
   "source": [
    "## Wanna test?\n",
    "\n",
    "You can test if you will get the desired result by passing example sentence and your rules."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97c99296",
   "metadata": {},
   "source": [
    "### Run the main analysis loop\n",
    "\n",
    "This where the processing happens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "99d314a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Main processing loop\n",
    "def main(CORPUS_FILES):\n",
    "    results = {}\n",
    "\n",
    "    for file in CORPUS_FILES:  # Process first 5 files for testing\n",
    "        global extraction_results\n",
    "        # 1. Load the corpus file\n",
    "        text = load_file(file)\n",
    "        filename = find_filename(file)\n",
    "        \n",
    "        # 2. Parse the sentences\n",
    "        doc = nlp(text)\n",
    "        \n",
    "        # 3. run extraction pipeline\n",
    "        basic_stats = run_basic_stats(doc)\n",
    "        result = run_grammar_extraction(doc, filename)\n",
    "        \n",
    "        # 4. Append results\n",
    "        results[filename] = basic_stats\n",
    "        results[filename].update(result)\n",
    "        \n",
    "        print(f\"Processed: {file}\")\n",
    "    return results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "8d95fb13",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processed: ../../corpus_data/brown_single/ca_ca01.txt\n"
     ]
    }
   ],
   "source": [
    "CORPUS_FILES = [\"../../corpus_data/brown_single/ca_ca01.txt\"]\n",
    "results = main(CORPUS_FILES)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b510094",
   "metadata": {},
   "source": [
    "Check the results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "e861ed91",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'ca_ca01.txt': {'nToken': 2376,\n",
       "  'nSentence': 88,\n",
       "  'Nouns': 458,\n",
       "  'adjectival_modifier': 112,\n",
       "  'direct_object': 95,\n",
       "  'Third-Person Singular Verbs': 34}}"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d02c1f57",
   "metadata": {},
   "source": [
    "## Write the results into dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "6474ad42",
   "metadata": {},
   "outputs": [],
   "source": [
    "def results_to_dataframe(results):\n",
    "    \"\"\"\n",
    "    Convert results dictionary to pandas DataFrame with additional options\n",
    "    \n",
    "    Parameters:\n",
    "    - results: dictionary with corpus_size, unigram, and bigram data\n",
    "    - min_freq: minimum collocation frequency to include (default: 1)\n",
    "    - include_dep_rel: whether to include dependency relation in output (default: True)\n",
    "    \"\"\"\n",
    "    rows = []\n",
    "    \n",
    "    for filename, grammar_info in results.items():\n",
    "            \n",
    "        row = {\n",
    "            \"filename\": filename,\n",
    "              }\n",
    "        \n",
    "        for key, value in grammar_info.items():\n",
    "            print(key, value)\n",
    "            row.update({key:value})\n",
    "        \n",
    "        rows.append(row)\n",
    "    \n",
    "    # Create DataFrame and sort by collocation frequency\n",
    "    df = pd.DataFrame(rows)\n",
    "    \n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "3168e6b5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "nToken 2376\n",
      "nSentence 88\n",
      "Nouns 458\n",
      "adjectival_modifier 112\n",
      "direct_object 95\n",
      "Third-Person Singular Verbs 34\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>filename</th>\n",
       "      <th>nToken</th>\n",
       "      <th>nSentence</th>\n",
       "      <th>Nouns</th>\n",
       "      <th>adjectival_modifier</th>\n",
       "      <th>direct_object</th>\n",
       "      <th>Third-Person Singular Verbs</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ca_ca01.txt</td>\n",
       "      <td>2376</td>\n",
       "      <td>88</td>\n",
       "      <td>458</td>\n",
       "      <td>112</td>\n",
       "      <td>95</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      filename  nToken  nSentence  Nouns  adjectival_modifier  direct_object  \\\n",
       "0  ca_ca01.txt    2376         88    458                  112             95   \n",
       "\n",
       "   Third-Person Singular Verbs  \n",
       "0                           34  "
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = results_to_dataframe(results)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb28321c",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b94f5181",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "linguistic-data-analysis-I",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}