Overview
This post is about web crawling and scraping in Python. The modules we will need are requests and BeautifulSoup. More information can be found here for requests and here for BeautifulSoup.
# first we need the 2 modules
import requests
from bs4 import BeautifulSoup

def crawl():
    # retrieve the Google page
    source = requests.get("http://www.google.com")
    # get the source in text format
    plain_text = source.text
    soup = BeautifulSoup(plain_text, "html.parser")
    # get every link on the page and print its href attribute
    # (get('href') returns a plain string, so print it directly)
    for link in soup.find_all('a'):
        print(link.get('href'))
    # get the first table on the page, if any, and print all the table cells
    tables = soup.find_all('table')
    if tables:
        for table_cell in tables[0].find_all('td'):
            print(table_cell.text)

crawl()
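The same `find_all` calls can be tried without hitting the network by feeding BeautifulSoup a string of HTML directly. The snippet below uses a small hypothetical HTML fragment (not from any real site) to show what the link and table loops above actually extract:

```python
from bs4 import BeautifulSoup

# a small hypothetical HTML fragment, used here only to
# demonstrate the parsing calls without a network request
html = """
<html><body>
  <a href="http://example.com/a">A</a>
  <a href="http://example.com/b">B</a>
  <table><tr><td>cell 1</td><td>cell 2</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# same call as in crawl(): collect the href attribute of every <a> tag
hrefs = [link.get('href') for link in soup.find_all('a')]

# same call as in crawl(): the text of every <td> in the first table
cells = [td.text for td in soup.find_all('table')[0].find_all('td')]

print(hrefs)  # → ['http://example.com/a', 'http://example.com/b']
print(cells)  # → ['cell 1', 'cell 2']
```

Parsing a local string like this is also a handy way to test scraping logic before pointing it at a live page.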