# Web scraper for Wikipedia 

We will be scraping information about Finnish athletes who have medaled at the Summer Olympics. For more information about scraping, consult the documentation for Beautiful Soup at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. 

### Imports

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

## Downloading one web page

In [None]:
# the site we want to scrape
url = "https://en.wikipedia.org/wiki/Mira_Potkonen"

# we ask the server for the webpage
r = requests.get(url)

r

In [None]:
# status code tells us about the server's response
r.status_code
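
It is worth checking the status code before using the response body. Below is a minimal sketch, assuming we only want to continue when the server did not report an error. `is_successful` is a hypothetical helper name, and the two `Response` objects are built by hand so the example runs without a network connection.

```python
import requests


def is_successful(response):
    """Return True when the response does not carry a 4xx/5xx status code."""
    try:
        # raise_for_status() raises requests.HTTPError for 4xx and 5xx codes
        response.raise_for_status()
    except requests.HTTPError:
        return False
    return True


# hand-built responses standing in for real ones (assumption: no network call)
found = requests.Response()
found.status_code = 200

missing = requests.Response()
missing.status_code = 404

print(is_successful(found))
print(is_successful(missing))
```

With a real request, the same check could be applied to `r` right after `requests.get(url)`.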

In [None]:
# this is the webpage -- like if you clicked "View Source" in a browser
r.content

## Working with the HTML

In [None]:
page = r.content

#parse page with bs4
soup = BeautifulSoup(page, 'html.parser')
soup

In [None]:
# make it look a little nicer
print(soup.prettify())

In [None]:
#selects the title element of this HTML page (<title>)
soup.title

In [None]:
#selects the text content of the title element of this HTML page
soup.title.text

In [None]:
#selects the first HTML large heading element (<h1>)
soup.find('h1')

In [None]:
#selects all the hyperlinks on the page (<a>)
soup.find_all('a')
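
To make the difference between selecting the first match and selecting every match concrete, here is a small sketch on a hand-written page (the HTML is made up for illustration, not real Wikipedia markup). It also shows one way to read an attribute such as `href` from each link.

```python
from bs4 import BeautifulSoup

# a tiny hand-written page (assumption: illustrative markup only)
html = """
<html><body>
  <h1>Example page</h1>
  <p>See <a href="/wiki/First">first</a> and <a href="/wiki/Second">second</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match, or None when nothing matches
first_link = soup.find("a")

# find_all() returns a list-like ResultSet containing every match
all_links = soup.find_all("a")

# .get() reads an attribute and returns None instead of raising if it is absent
hrefs = [a.get("href") for a in all_links]

print(first_link.text)
print(hrefs)
```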

### Try it out!

In [None]:
#write some code that selects the first table (<table>)



In [None]:
#write some code that selects all the items in lists on the page (<li>)



In [None]:
#write some code that selects the text within the first paragraph (<p>)



[Challenge] Go on a point-and-click adventure with the Developer Tools on your browser and try to select the text of your choice.
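
The Developer Tools can usually copy a CSS selector for whichever element you click on. Beautiful Soup accepts CSS selectors through `select()` and `select_one()`, so one possible approach is sketched below. The HTML, class names, and selectors here are made up for illustration; the ones on a real page will differ.

```python
from bs4 import BeautifulSoup

# hand-written markup (assumption: not the structure of any real page)
html = """
<div id="content">
  <table class="infobox">
    <tr><th>Label</th><td class="role">Example value</td></tr>
  </table>
  <p class="intro">Intro text.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first element matching a CSS selector, or None
cell = soup.select_one("table.infobox td.role")

# select() returns every matching element as a list
paragraphs = soup.select("#content > p.intro")

print(cell.get_text(strip=True))
print([p.get_text(strip=True) for p in paragraphs])
```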

In [None]:
#write some code that selects the text of your choice


## Downloading multiple web pages
- First, we'll grab the URLs of all the medalists' Wikipedia pages.

In [None]:
url = "https://en.wikipedia.org/wiki/Finland_at_the_Olympics"
r = requests.get(url)

In [None]:
page = r.content

#parse page with bs4
soup = BeautifulSoup(page, 'html.parser')
soup

In [None]:
#we are interested in a table
tables = soup.find_all('table')

#the 9th table on the page contains Summer Olympic medalists
table = tables[8]
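
Picking a table by position works until the page is edited and the tables shift. A possibly sturdier alternative is to look for the table whose header cells contain the column names we expect. The sketch below assumes header names "Medal" and "Name", which are guesses rather than values checked against the live page, and `find_table_with_headers` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# two hand-written tables standing in for a real page (assumption)
html = """
<table><tr><th>Games</th><th>Athletes</th></tr><tr><td>x</td><td>1</td></tr></table>
<table><tr><th>Medal</th><th>Name</th><th>Games</th></tr>
  <tr><td>Gold</td><td><a href="/wiki/Example_Athlete">Example Athlete</a></td><td>x</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")


def find_table_with_headers(soup, required_headers):
    """Return the first table whose header cells include all required_headers."""
    for table in soup.find_all("table"):
        headers = {th.get_text(strip=True) for th in table.find_all("th")}
        if set(required_headers) <= headers:
            return table
    return None


table = find_table_with_headers(soup, ["Medal", "Name"])
print(len(table.find_all("tr")))
```

The function returns `None` when no table matches, so it is worth checking the result before using it.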

In [None]:
# select just the rows in the table
rows = table.find_all('tr')

# the first row contains the table header
rows[0].text

In [None]:
# select the cells in the last row of the table
cells = rows[-1].find_all('td')

# select the second cell in that row
cells[1]

In [None]:
cells[1].text

In [None]:
links_to_athletes = []

# loop through all the rows in the table
for row in rows:
    # break down each row into cells
    cells = row.find_all('td')
    # check if there is more than one cell
    if len(cells) > 1:
        # check if the second cell (corresponding to the name column) contains a hyperlink
        if (cells[1].find('a')):
            # if the cell contains a hyperlink, get the link referenced
            link_to_athlete = cells[1].find('a')['href']
            # add the link referenced to a list
            links_to_athletes.append(link_to_athlete)

In [None]:
links_to_athletes

In [None]:
# to get full urls, add the base part of the URL to each
links_to_athletes = ['https://en.wikipedia.org' + i for i in links_to_athletes]
links_to_athletes
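
String concatenation works here because every link in the table is relative. As an alternative sketch, `urljoin` from the standard library handles relative links and also leaves already-absolute links unchanged. The list below is a small stand-in for `links_to_athletes`.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org"

# one relative and one already-absolute link (stand-in for the scraped list)
relative_links = [
    "/wiki/Mira_Potkonen",
    "https://en.wikipedia.org/wiki/Finland_at_the_Olympics",
]

# urljoin resolves each link against the base URL
full_links = [urljoin(base, link) for link in relative_links]
print(full_links)
```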

- Next, we'll visit each of the pages and pull out the athlete's name and birthplace.

In [None]:
data = []

# loop through all of the links
for athlete_page in links_to_athletes:
    
    # request the athlete's wiki page
    r = requests.get(athlete_page)
    page = r.content
    
    # parse page with bs4
    soup = BeautifulSoup(page, 'html.parser')
    
    # this selects the first HTML element span (<span>) that has the class attribute "birthplace"
    birthplace = soup.find("span", {"class": "birthplace"})
    
    # check to see if birthplace was found
    if birthplace:
        # add name and birthplace of athlete to a dictionary
        athlete_info = {}
        athlete_info['name'] = soup.find('h1').text
        athlete_info['birthplace'] = birthplace.text
        # add that dictionary to our data list
        data.append(athlete_info)
    
    # be kind to Wikipedia and take a break between scrapes
    time.sleep(1)
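
One way to make a loop like this easier to test is to separate parsing from downloading. The sketch below is only one possible arrangement: `parse_athlete` and `scrape_athletes` are hypothetical helper names, the User-Agent string is a placeholder you would replace with your own contact details, and the sample HTML is hand-written so the parsing step can run offline.

```python
import time

import requests
from bs4 import BeautifulSoup


def parse_athlete(html):
    """Return {'name', 'birthplace'} from a page, or None if either is missing."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    birthplace = soup.find("span", {"class": "birthplace"})
    if heading is None or birthplace is None:
        return None
    return {
        "name": heading.get_text(strip=True),
        "birthplace": birthplace.get_text(strip=True),
    }


def scrape_athletes(urls, delay=1.0):
    """Download each URL with a shared session and collect the parsed results."""
    session = requests.Session()
    # placeholder User-Agent (assumption): identify your own script here
    session.headers.update({"User-Agent": "olympics-scraper-tutorial/0.1"})
    data = []
    for url in urls:
        r = session.get(url, timeout=10)
        if r.ok:  # skip pages that returned an error status
            info = parse_athlete(r.content)
            if info:
                data.append(info)
        time.sleep(delay)  # pause between requests
    return data


# hand-written sample page (assumption: not real Wikipedia markup)
sample_html = (
    "<h1>Example Athlete</h1>"
    '<span class="birthplace">Example City, Finland</span>'
)
print(parse_athlete(sample_html))
```

Calling `scrape_athletes(links_to_athletes)` would then produce a list shaped like `data`.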

In [None]:
data

## Check this data out in pandas

In [None]:
#convert our findings to a dataframe
df = pd.DataFrame(data)
df

In [None]:
#we can select rows that contain athletes born in Helsinki

df[df.birthplace=="Helsinki, Finland"]
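
An exact comparison misses rows where the birthplace is written differently. A looser match with `str.contains` is sketched below; the rows are made-up placeholders, not scraped results.

```python
import pandas as pd

# made-up placeholder rows (assumption: not scraped data)
df = pd.DataFrame([
    {"name": "Athlete A", "birthplace": "Helsinki, Finland"},
    {"name": "Athlete B", "birthplace": "Helsinki, Grand Duchy of Finland"},
    {"name": "Athlete C", "birthplace": "Turku, Finland"},
])

# keep rows whose birthplace mentions Helsinki; na=False treats missing values as no match
born_in_helsinki = df[df["birthplace"].str.contains("Helsinki", na=False)]
print(born_in_helsinki["name"].tolist())
```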

### Try it out!

In [None]:
# select the row that contains our pal Mira

