{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Collection through Web Scraping\n", "\n", "Rafiq Islam \n", "2024-08-14\n", "\n", "## Introduction\n", "\n", "Collecting data and preparing it for a project is one of the most\n", "important tasks in any data science or machine learning project. There\n", "are many sources from which we can collect data for a project, such as:\n", "\n", "- Connecting to a SQL database server \n", "- Data source websites such as\n", " Kaggle,\n", " Google Dataset Search,\n", " the UCI\n", " Machine Learning Repository, etc. \n", "- Web scraping with Beautiful Soup\n", "- Using a Python API\n", "\n", "## Data Source Websites\n", "\n", "Data source websites mainly fall into two categories: data\n", "repositories and data science competition platforms. There are many such\n", "websites, for example:\n", "\n", "1. The\n", " UCI\n", " Machine Learning Repository \n", "2. The Harvard\n", " Dataverse\n", "3. The\n", " Mendeley Data\n", " Repository\n", "4. The 538\n", "5. The\n", " New York Times \n", "6. The International Data Analysis Olympiad\n", "7. Kaggle Competitions\n", "\n", "Here is an example of collecting data from the\n", "UCI\n", "Machine Learning Repository:" ], "id": "ca293291-0509-4576-b100-316c308b5ec9" }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{'uci_id': 53, 'name': 'Iris', 'repository_url': 'https://archive.ics.uci.edu/dataset/53/iris', 'data_url': 'https://archive.ics.uci.edu/static/public/53/data.csv', 'abstract': 'A small classic dataset from Fisher, 1936. 
One of the earliest known datasets used for evaluating classification methods.\\n', 'area': 'Biology', 'tasks': ['Classification'], 'characteristics': ['Tabular'], 'num_instances': 150, 'num_features': 4, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1936, 'last_updated': 'Tue Sep 12 2023', 'dataset_doi': '10.24432/C56C76', 'creators': ['R. A. Fisher'], 'intro_paper': {'ID': 191, 'type': 'NATIVE', 'title': 'The Iris data set: In search of the source of virginica', 'authors': 'A. Unwin, K. Kleinman', 'venue': 'Significance, 2021', 'year': 2021, 'journal': 'Significance, 2021', 'DOI': '1740-9713.01589', 'URL': 'https://www.semanticscholar.org/paper/4599862ea877863669a6a8e63a3c707a787d5d7e', 'sha': None, 'corpus': None, 'arxiv': None, 'mag': None, 'acl': None, 'pmid': None, 'pmcid': None}, 'additional_info': {'summary': 'This is one of the earliest datasets used in the literature on classification methods and widely used in statistics and machine learning. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are not linearly separable from each other.\\n\\nPredicted attribute: class of iris plant.\\n\\nThis is an exceedingly simple domain.\\n\\nThis data differs from the data presented in Fishers article (identified by Steve Chadwick, spchadwick@espeedaz.net ). The 35th sample should be: 4.9,3.1,1.5,0.2,\"Iris-setosa\" where the error is in the fourth feature. The 38th sample: 4.9,3.6,1.4,0.1,\"Iris-setosa\" where the errors are in the second and third features. 
', 'purpose': 'N/A', 'funded_by': None, 'instances_represent': 'Each instance is a plant', 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': None, 'citation': None}}\n", " name role type demographic \\\n", "0 sepal length Feature Continuous None \n", "1 sepal width Feature Continuous None \n", "2 petal length Feature Continuous None \n", "3 petal width Feature Continuous None \n", "4 class Target Categorical None \n", "\n", " description units missing_values \n", "0 None cm no \n", "1 None cm no \n", "2 None cm no \n", "3 None cm no \n", "4 class of iris plant: Iris Setosa, Iris Versico... None no " ] } ], "source": [ "from ucimlrepo import fetch_ucirepo \n", " \n", "# fetch dataset \n", "iris = fetch_ucirepo(id=53) \n", " \n", "# data (as pandas dataframes) \n", "X = iris.data.features \n", "y = iris.data.targets \n", " \n", "# metadata \n", "print(iris.metadata) \n", " \n", "# variable information \n", "print(iris.variables) " ], "id": "e4e39c21" }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may need to install the\n", "UCI\n", "Machine Learning Repository Python package using pip:\n", "\n", " pip install ucimlrepo" ], "id": "bb1d0e6f-40b3-47f9-9361-d5636db65579" }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/html": [ "\n", "" ] } } ], "source": [ "X.head()" ], "id": "d22cdb63" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Web Scraping\n", "\n", "Web scraping is another way of collecting data for a project when the\n", "data is not available in any repository. We can collect the data\n", "from a website using a library called `BeautifulSoup`, provided the\n", "website permits other people to collect data from it."
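, "\n", "\n", "Before scraping a site, it is good practice to check its `robots.txt`\n", "file. A minimal sketch using Python's standard-library\n", "`urllib.robotparser` (the rules and URLs below are illustrative only):\n", "\n", "    import urllib.robotparser\n", "\n", "    rp = urllib.robotparser.RobotFileParser()\n", "    # parse robots.txt rules (given inline here for illustration)\n", "    rp.parse([\"User-agent: *\", \"Disallow: /private/\"])\n", "    rp.can_fetch(\"*\", \"https://example.com/private/page\")   # False\n", "    rp.can_fetch(\"*\", \"https://example.com/index.html\")     # True\n", "\n", "In practice you would instead call `rp.set_url(...)` with the site's\n", "real `robots.txt` URL and then `rp.read()` to download the live rules\n", "before checking `can_fetch`."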
], "id": "3ddcf85c-5639-4494-9957-403c1f610168" }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import bs4 # library that provides BeautifulSoup\n", "from bs4 import BeautifulSoup # import the BeautifulSoup object\n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from seaborn import set_style\n", "set_style(\"whitegrid\")" ], "id": "cbea31aa" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let’s make an HTML object using `BeautifulSoup`. Let’s say we have\n", "an HTML document that looks like the one below:" ], "id": "3dfb3f6a-0b0f-4ff5-a278-b97a0fa7559d" }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "html_doc = \"\"\"\n", "<html>\n", "<head>\n", "<title>My Dummy HTML Page</title>\n", "</head>\n", "<body>\n", "<p>This is a paragraph in my dummy HTML document.</p>\n", "<a href=\"https://example.com/blog\">Blog</a>\n", "<a href=\"https://example.com/research\">Research</a>\n", "</body>\n", "</html>\n", "\"\"\"" ], "id": "a77b005b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we want to grab information from the dummy HTML document above." ], "id": "a23e733d-95af-42f9-b69b-db3e61086f94" }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "soup = BeautifulSoup(html_doc, features='html.parser')" ], "id": "8936be00" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the object `soup`, we can walk through each element in\n", "this object. For example, if we want to grab the title element:" ], "id": "90cf2ff4-a785-485d-9235-fe8281201490" }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/plain": [ "