{"cells":[{"cell_type":"markdown","source":["# `Web Scraping`\n"],"metadata":{"id":"wNfH3-kA5Usr"}},{"cell_type":"markdown","source":["# Introduction to web scraping with BeautifulSoup\n","* Web scraping means extracting data from websites.\n","* It is useful for gathering information from pages automatically instead of copying it by hand.\n","* BeautifulSoup, from the bs4 package, is a popular Python library for web scraping.\n","* It parses HTML and XML documents into a tree that can be searched for the desired information.\n","\n","* Reference: [brightdata](https://brightdata.com/blog/how-tos/beautiful-soup-web-scraping#:~:text=Web%20Scraping%20with%20Beautiful%20Soup&text=The%20library%20automatically%20selects%20the,fast%20and%20efficient%20lxml%20parser.)\n","* User Agent: [whatismybrowser](https://www.whatismybrowser.com/detect/what-is-my-user-agent/)"],"metadata":{"id":"pp3qeU7J5Bqb"}},{"cell_type":"markdown","source":["Installation and Declaration"],"metadata":{"id":"-7U3woPO6i-5"}},{"cell_type":"code","source":["# %pip install beautifulsoup4 requests"],"metadata":{"id":"IWesAnWQ6XZH"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Fetching Web Content"],"metadata":{"id":"sNfUIUU17PGr"}},{"cell_type":"code","source":["# Fetching Web Content\n","import requests\n","url = \"https://walid.vercel.app\"\n","headers = {\"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36\"}\n","# Pass the dict as headers=...; the second positional argument of requests.get is params\n","response = requests.get(url, headers=headers)\n","print(response)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"ZtyuOw5u60XX","executionInfo":{"status":"ok","timestamp":1734172922647,"user_tz":-360,"elapsed":3487,"user":{"displayName":"44-271-Munsi Walid Al Hassan Nizhu","userId":"16216461530557409787"}},"outputId":"7110f21e-24c3-4abe-97cd-a23de2776505"},"execution_count":1,"outputs":[{"output_type":"stream","name":"stdout","text":["\n"]}]},{"cell_type":"code","source":["# Parsing the HTML\n","from bs4 import BeautifulSoup\n","soup = 
BeautifulSoup(response.content, \"html.parser\")"],"metadata":{"id":"Nm_Sd86T7AOW","executionInfo":{"status":"ok","timestamp":1734173000239,"user_tz":-360,"elapsed":347,"user":{"displayName":"44-271-Munsi Walid Al Hassan Nizhu","userId":"16216461530557409787"}}},"execution_count":2,"outputs":[]},{"cell_type":"markdown","source":["Basic Operations with BeautifulSoup"],"metadata":{"id":"UXxbjr0l8Bu6"}},{"cell_type":"code","source":["# Find the first occurrence of a tag by name\n","title = soup.find(\"title\")\n","print(title)\n","print(title.text)\n","print(title.contents)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"yxzR1JJ-7bwe","executionInfo":{"status":"ok","timestamp":1734173006245,"user_tz":-360,"elapsed":349,"user":{"displayName":"44-271-Munsi Walid Al Hassan Nizhu","userId":"16216461530557409787"}},"outputId":"71c27510-0fc5-4418-c152-e7d5d2d51c40"},"execution_count":3,"outputs":[{"output_type":"stream","name":"stdout","text":["<title>Walid</title>\n","Walid\n","['Walid']\n"]}]},{"cell_type":"code","source":["# All occurrences of a tag\n","links = soup.find_all(\"a\")"],"metadata":{"id":"VR7-pWDE7bs-","executionInfo":{"status":"ok","timestamp":1734173054542,"user_tz":-360,"elapsed":455,"user":{"displayName":"44-271-Munsi Walid Al Hassan Nizhu","userId":"16216461530557409787"}}},"execution_count":4,"outputs":[]},{"cell_type":"markdown","source":["**Practical exercises**"],"metadata":{"id":"1YOwsGSk8z63"}},{"cell_type":"code","source":["import re\n","# Collect every https URL found in the string form of each <a> tag.\n","# Note: the host part of this pattern allows only one dot, so a host such as\n","# walid.vercel.app is cut to https://walid.vercel, as the output shows.\n","lst = []\n","for i in links:\n","    result = re.findall(r'https:\\/\\/(?:www\\.)?[a-zA-Z0-9-]+\\.[a-zA-Z]{2,}(?:\\/[a-zA-Z0-9._~:/?#@!$&\\'()*+,;=%-]*)?', str(i))\n","    if result:\n","        lst.append(result)\n","print(lst)\n","\n","# Flatten the per-link result lists into one list of strings\n","new_lst = []\n","for i in lst:\n","    new_lst.append(''.join(i))\n","print(new_lst)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"uBBRevC724cZ","executionInfo":{"status":"ok","timestamp":1734173240756,"user_tz":-360,"elapsed":351,"user":{"displayName":"44-271-Munsi Walid Al Hassan 
Nizhu","userId":"16216461530557409787"}},"outputId":"1bb162e4-a819-4de7-dcff-46c48c2a4950"},"execution_count":6,"outputs":[{"output_type":"stream","name":"stdout","text":["[['https://www.linkedin.com/in/munsiwalidalhassannizhu'], ['https://www.facebook.com/whalidmunshi'], ['https://huggingface.co/WalidAlHassan'], ['https://github.com/walid3271'], ['https://huggingface.co/WalidAlHassan/Face-Detection-Using-URL'], ['https://huggingface.co/WalidAlHassan/Floor-Object-Rooms-and-Bed-direction-Identification-according-to-Vastu-angle'], ['https://huggingface.co/WalidAlHassan/GMP_Face_Authentication'], ['https://huggingface.co/WalidAlHassan/Find-Direction-Of-A-Bolt'], ['https://huggingface.co/WalidAlHassan/Virtual-Mouse'], ['https://huggingface.co/WalidAlHassan/ChatBot'], ['https://chatbot-with-gemini.streamlit'], ['https://huggingface.co/WalidAlHassan/ChatBot-Gemini'], ['https://huggingface.co/WalidAlHassan/Romero-ChatBot'], ['https://huggingface.co/WalidAlHassan/SCREW-APP'], ['https://huggingface.co/WalidAlHassan/Conveyor-Belt-Screw-Count'], ['https://walid.vercel'], ['https://www.linkedin.com/in/munsiwalidalhassannizhu'], ['https://www.facebook.com/whalidmunshi'], ['https://huggingface.co/WalidAlHassan'], ['https://github.com/walid3271']]\n","['https://www.linkedin.com/in/munsiwalidalhassannizhu', 'https://www.facebook.com/whalidmunshi', 'https://huggingface.co/WalidAlHassan', 'https://github.com/walid3271', 'https://huggingface.co/WalidAlHassan/Face-Detection-Using-URL', 'https://huggingface.co/WalidAlHassan/Floor-Object-Rooms-and-Bed-direction-Identification-according-to-Vastu-angle', 'https://huggingface.co/WalidAlHassan/GMP_Face_Authentication', 'https://huggingface.co/WalidAlHassan/Find-Direction-Of-A-Bolt', 'https://huggingface.co/WalidAlHassan/Virtual-Mouse', 'https://huggingface.co/WalidAlHassan/ChatBot', 'https://chatbot-with-gemini.streamlit', 'https://huggingface.co/WalidAlHassan/ChatBot-Gemini', 'https://huggingface.co/WalidAlHassan/Romero-ChatBot', 
'https://huggingface.co/WalidAlHassan/SCREW-APP', 'https://huggingface.co/WalidAlHassan/Conveyor-Belt-Screw-Count', 'https://walid.vercel', 'https://www.linkedin.com/in/munsiwalidalhassannizhu', 'https://www.facebook.com/whalidmunshi', 'https://huggingface.co/WalidAlHassan', 'https://github.com/walid3271']\n"]}]},{"cell_type":"code","source":["import requests\n","from bs4 import BeautifulSoup\n","import pandas as pd\n","\n","url = \"http://quotes.toscrape.com/\"\n","headers = {\"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36\"}\n","# Pass the dict as headers=...; the second positional argument of requests.get is params\n","response = requests.get(url, headers=headers)\n","# print(response.content)\n","print('Status: ',response)\n","soup = BeautifulSoup(response.text, \"html.parser\")\n","\n","# Select the quote text and the author name by tag and CSS class\n","quotes = soup.find_all(\"span\", attrs={\"class\":\"text\"})\n","authors = soup.find_all(\"small\", attrs={\"class\":\"author\"})\n","\n","# Pair each quote with its author, one dict per row\n","qu = []\n","for quote, author in zip(quotes, authors):\n","    qu.append({\"Quote\": quote.text, \"Author\": author.text})\n","\n","# Save the rows to a CSV file\n","csv_file = \"Quote.csv\"\n","df = pd.DataFrame(qu)\n","df.to_csv(csv_file, index=False, encoding=\"utf-8\")"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"qKQp6kpX7bja","executionInfo":{"status":"ok","timestamp":1734173627017,"user_tz":-360,"elapsed":1372,"user":{"displayName":"44-271-Munsi Walid Al Hassan Nizhu","userId":"16216461530557409787"}},"outputId":"253cf8ec-5e14-41c1-e3b9-3323ff18a9d3"},"execution_count":8,"outputs":[{"output_type":"stream","name":"stdout","text":["Status: \n"]}]},{"cell_type":"code","source":["df"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":363},"id":"KdYnO-sH88YW","executionInfo":{"status":"ok","timestamp":1734173631548,"user_tz":-360,"elapsed":360,"user":{"displayName":"44-271-Munsi Walid Al Hassan 
Nizhu","userId":"16216461530557409787"}},"outputId":"3699902c-bb7c-4791-8e1e-402ac7aa75cc"},"execution_count":9,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" Quote Author\n","0 “The world as we have created it is a process ... Albert Einstein\n","1 “It is our choices, Harry, that show what we t... J.K. Rowling\n","2 “There are only two ways to live your life. On... Albert Einstein\n","3 “The person, be it gentleman or lady, who has ... Jane Austen\n","4 “Imperfection is beauty, madness is genius and... Marilyn Monroe\n","5 “Try not to become a man of success. Rather be... Albert Einstein\n","6 “It is better to be hated for what you are tha... André Gide\n","7 “I have not failed. I've just found 10,000 way... Thomas A. Edison\n","8 “A woman is like a tea bag; you never know how... Eleanor Roosevelt\n","9 “A day without sunshine is like, you know, nig... Steve Martin"],"text/html":["\n","
\n"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"dataframe","variable_name":"df","summary":"{\n \"name\": \"df\",\n \"rows\": 10,\n \"fields\": [\n {\n \"column\": \"Quote\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"\\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\\u201d\",\n \"\\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\\u201d\",\n \"\\u201cTry not to become a man of success. Rather become a man of value.\\u201d\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Author\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 8,\n \"samples\": [\n \"J.K. Rowling\",\n \"Thomas A. Edison\",\n \"Albert Einstein\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"}},"metadata":{},"execution_count":9}]}],"metadata":{"kernelspec":{"display_name":"pp","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.12.7"},"colab":{"provenance":[],"collapsed_sections":["pp3qeU7J5Bqb"]}},"nbformat":4,"nbformat_minor":0}