## Introduction

Collecting data and preparing it for a project is one of the most important tasks in any data science or machine learning project. There are many sources from which we can collect data for a project, such as

- Connecting to a SQL database server
- Data source websites such as [Kaggle](https://www.kaggle.com){target="_blank"}, [Google Dataset Search](https://datasetsearch.research.google.com){target="_blank"}, [UCI Machine Learning Repo](https://archive.ics.uci.edu/datasets){target="_blank"}, etc.
- Web scraping with BeautifulSoup
- Using a Python API

## Data Source Websites

Data source websites mainly fall into two categories: data repositories and data science competitions. There are many such websites:

1. The [UCI Machine Learning Repository](https://archive.ics.uci.edu/datasets){target="_blank"}
2. The [Harvard Dataverse](https://dataverse.harvard.edu){target="_blank"}
3. The [Mendeley Data Repository](https://data.mendeley.com){target="_blank"}
4. [FiveThirtyEight](https://github.com/fivethirtyeight/data){target="_blank"}
5. The [New York Times](https://github.com/nytimes){target="_blank"}
6. The [International Data Analysis Olympiad](https://www.competitionsciences.org/competitions/international-data-analysis-olympiad/){target="_blank"}
7. [Kaggle Competitions](https://www.kaggle.com){target="_blank"}

An example of collecting data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/datasets){target="_blank"}:
```{python}
from ucimlrepo import fetch_ucirepo

# fetch the Iris dataset
iris = fetch_ucirepo(id=53)

# data (as pandas dataframes)
X = iris.data.features
y = iris.data.targets

# metadata
print(iris.metadata)

# variable information
print(iris.variables)
```
```
{'uci_id': 53, 'name': 'Iris', 'repository_url': 'https://archive.ics.uci.edu/dataset/53/iris', 'data_url': 'https://archive.ics.uci.edu/static/public/53/data.csv', 'abstract': 'A small classic dataset from Fisher, 1936. One of the earliest known datasets used for evaluating classification methods.\n', 'area': 'Biology', 'tasks': ['Classification'], 'characteristics': ['Tabular'], 'num_instances': 150, 'num_features': 4, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1936, 'last_updated': 'Tue Sep 12 2023', 'dataset_doi': '10.24432/C56C76', 'creators': ['R. A. Fisher'], 'intro_paper': {'ID': 191, 'type': 'NATIVE', 'title': 'The Iris data set: In search of the source of virginica', 'authors': 'A. Unwin, K. Kleinman', 'venue': 'Significance, 2021', 'year': 2021, 'journal': 'Significance, 2021', 'DOI': '1740-9713.01589', 'URL': 'https://www.semanticscholar.org/paper/4599862ea877863669a6a8e63a3c707a787d5d7e', 'sha': None, 'corpus': None, 'arxiv': None, 'mag': None, 'acl': None, 'pmid': None, 'pmcid': None}, 'additional_info': {'summary': 'This is one of the earliest datasets used in the literature on classification methods and widely used in statistics and machine learning. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are not linearly separable from each other.\n\nPredicted attribute: class of iris plant.\n\nThis is an exceedingly simple domain.\n\nThis data differs from the data presented in Fishers article (identified by Steve Chadwick, spchadwick@espeedaz.net ). The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature. The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features. ', 'purpose': 'N/A', 'funded_by': None, 'instances_represent': 'Each instance is a plant', 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': None, 'citation': None}}
```
| name | role | type | demographic | description | units | missing_values |
|------|------|------|-------------|-------------|-------|----------------|
| sepal length | Feature | Continuous | None | None | cm | no |
| sepal width | Feature | Continuous | None | None | cm | no |
| petal length | Feature | Continuous | None | None | cm | no |
| petal width | Feature | Continuous | None | None | cm | no |
| class | Target | Categorical | None | class of iris plant: Iris Setosa, Iris Versicolour, or Iris Virginica | None | no |
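You may need to install the [UCI Machine Learning Repository](https://archive.ics.uci.edu/datasets){target="_blank"} client as a package using pip:

```
pip install ucimlrepo
```

A quick look at the features:

```{python}
X.head()
```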
## Web Scraping

Web scraping is another way of collecting data for research when the data are not available in any repository. We can collect the data from a website using a library called `BeautifulSoup`, provided the website permits others to collect its data.
```{python}
import bs4                        # library that provides BeautifulSoup
from bs4 import BeautifulSoup     # import the BeautifulSoup object
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from seaborn import set_style

set_style("whitegrid")
```
Now let's make an HTML object using `BeautifulSoup`. Say we have an HTML document that looks like the one below:
```{python}
html_doc = """
<!DOCTYPE html>
<html lang="en">
<head>
    <title>My Dummy HTML Document</title>
</head>
<body>
    <h1>Welcome to My Dummy HTML Document</h1>
    <p>This is a paragraph in my dummy HTML document.</p>
    <a href="https://mrislambd.github.io/blog" class="blog" id="blog">
        Blog
    </a>
    <a href="https://mrislambd.github.io/research" class="research" id="research">
        Research
    </a>
</body>
</html>
"""
```
Now we want to grab information from the dummy HTML document above.
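```{python}
soup = BeautifulSoup(html_doc, features='html.parser')
```

Now that we have the object `soup`, we can walk through each element of the document. For example, to grab the title element:

```{python}
soup.html.head.title
```

Since the HTML document has only one title, we can simply use `soup.title`, or `soup.title.text` to get the text only. The `soup` object is like a family tree with parents, children, grandparents, and so on; for instance, `soup.title.parent` returns the enclosing tag. To grab a tag we can use `soup.a`, and a particular attribute of it with `soup.a['class']`. We can also find all tags of the same kind with `soup.findAll('a')` and pick a particular item from them, e.g. `soup.findAll('a')[0]['id']`. For a `p` tag, `soup.p.text` gives its text. Similarly, if we want to grab all the `href`s from the `a` tags:

```{python}
[h['href'] for h in soup.findAll('a')]
```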
## Example of Web Scraping from a Real Website

In this example we want to obtain some information from the [NVIDIA Graduate Fellowship Program](https://research.nvidia.com/graduate-fellowships/archive){target="_blank"}. Before accessing this website, we need to know whether we have permission to access their data through web scraping.
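```{python}
import requests

response = requests.get(url="https://research.nvidia.com/graduate-fellowships/archive")
response.status_code
```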
A `status_code` of $200$ ensures that we have permission to access their website data. However, if we obtain a `status_code` of $403$, $400$, or $500$, then we lack permission, made a bad request, or hit a server error, respectively. For more about the status codes, [click here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status){target="_blank"}.
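Beyond the status code, a site's `robots.txt` file declares which paths automated clients may fetch. As a minimal sketch, using the standard-library `urllib.robotparser` (the actual answer depends on the live `robots.txt`), we could check the archive path before scraping:

```{python}
from urllib import robotparser

# parse the site's robots.txt and ask whether our target path may be fetched
rp = robotparser.RobotFileParser()
rp.set_url("https://research.nvidia.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://research.nvidia.com/graduate-fellowships/archive"))
```

With access confirmed, we parse the page into a `soup` object:

```{python}
soup = BeautifulSoup(response.text, 'html.parser')
```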
We want to make an analysis based on the institutions of the past graduate fellows. Inspecting the elements on this website, we see that the `div`s that have `class="archive-group"` contain the information about the past graduate fellows.
```{python}
pf = soup.find_all("div", class_="archive-group")
```
The first element of `pf` contains the information for the graduate fellows of the year 2021:
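```{python}
pf[0]
```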
Now let's make a `pandas` dataframe using the information on this page, making use of the output from the chunk above. To grab the year, we see that the `archive-group__title` class on an `h4` tag contains the year for every year group. With `strip=True`, the text is cleaned by removing extra whitespace from the beginning and end; we only need the first token, so `split()[0]` does the job. Then we build another group called `fellows` that contains the fellows of a certain year, using the `div`s with `class="views-row"`. Once this group is created, we iterate through it to extract the names and corresponding institutions.
```{python}
data = []
for group in pf:
    # the h4 with class "archive-group__title" holds the year heading
    year = group.find("h4", class_="archive-group__title").get_text(strip=True).split()[0]
    fellows = group.find_all("div", class_="views-row")
    for fellow in fellows:
        name = fellow.find("div", class_="views-field-title").get_text(strip=True)
        institute = fellow.find(
            "div", class_="views-field-field-grad-fellow-institution"
        ).get_text(strip=True)
        data.append({"Name": name, "Year": year, "Institute": institute})

data = pd.DataFrame(data)
data.head()
```
|   | Name | Year | Institute |
|---|------|------|-----------|
| 0 | Alexander Sax | 2021 | University of California, Berkeley |
| 1 | Hanrui Wang | 2021 | Massachusetts Institute of Technology |
| 2 | Ji Lin | 2021 | Massachusetts Institute of Technology |
| 3 | Krishna Murthy Jatavallabhula | 2021 | University of Montreal |
| 4 | Rohan Sawhney | 2021 | Carnegie Mellon University |
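Before analyzing, one might also persist the scraped table so it can be reused without re-scraping; a minimal sketch (the filename here is illustrative):

```{python}
# save the scraped records to disk for later reuse
data.to_csv("nvidia_fellows.csv", index=False)
```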
Now let’s perform some Exploratory Data Analysis (EDA). First, we analyze the unique values and distributions.
```{python}
# Count the number of fellows each year
year_counts = data['Year'].value_counts().sort_values(ascending=False)

# Create a DataFrame with years and their counts
year_data = {'Year': year_counts.index, 'Count': year_counts.values}
year_data_counts = pd.DataFrame(year_data)

# Transpose the DataFrame so that the years become columns
year_data_counts = year_data_counts.set_index('Year').T

# Display the DataFrame
print(year_data_counts)
```
Next, we look at the most represented universities:

```{python}
university_counts = data['Institute'].value_counts()
print(university_counts.head(10))  # display the top 10 universities
```
```
Institute
Stanford University                          24
Massachusetts Institute of Technology        15
University of California, Berkeley           14
Carnegie Mellon University                   13
University of Utah                           10
University of Washington                      9
University of Illinois, Urbana-Champaign      9
University of California, Davis               8
Georgia Institute of Technology               8
University of North Carolina, Chapel Hill     6
Name: count, dtype: int64
```
To visualize the award distributions per year,
```{python}
plt.figure(figsize=(9, 5))
sns.countplot(x='Year', data=data, order=sorted(data['Year'].unique()))
plt.gca().set_facecolor('#f4f4f4')
plt.gcf().patch.set_facecolor('#f4f4f4')
plt.title('Number of Fellows Per Year')
plt.show()
```
Next, a visualization of the top 10 universities:
```{python}
plt.figure(figsize=(6, 4))
top_universities = data['Institute'].value_counts().head(10)
sns.barplot(y=top_universities.index, x=top_universities.values)
plt.gca().set_facecolor('#f4f4f4')
plt.gcf().patch.set_facecolor('#f4f4f4')
plt.title('Top 10 Universities by Number of Fellows')
plt.xlabel('Number of Fellows')
plt.ylabel('University')
plt.show()
```
Finally, the trend over time:
```{python}
plt.figure(figsize=(9, 5))
data['Year'] = data['Year'].astype(int)
yearly_trend = data.groupby('Year').size()
yearly_trend.plot(kind='line', marker='o')
plt.gca().set_facecolor('#f4f4f4')
plt.gcf().patch.set_facecolor('#f4f4f4')
plt.title('Trend of Fellows Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Fellows')
plt.show()
```
This is just a simple example of collecting data through web scraping. `BeautifulSoup` has endless potential across many projects for collecting data that are not publicly available in a cleaned or organized form. Thank you for reading.
## References

- Fisher, R. A. (1988). Iris. UCI Machine Learning Repository. [https://doi.org/10.24432/C56C76](https://doi.org/10.24432/C56C76){target="_blank"}