Overview
This post is about web crawling and scraping in Python. The modules we will need are requests and BeautifulSoup. More information can be found in the official documentation for requests and for BeautifulSoup.
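
If the two modules are not installed yet, they can be installed with pip; note that the BeautifulSoup package is published on PyPI as beautifulsoup4:

# install the two modules used in this post (run in a shell, not in Python)
pip install requests beautifulsoup4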
# first we need the two modules
import requests
from bs4 import BeautifulSoup

def crawl():
    # retrieve the Google page
    source = requests.get("http://www.google.com")
    # get the source in text format
    plain_text = source.text
    # parse the HTML with the built-in parser
    soup = BeautifulSoup(plain_text, "html.parser")
    # get every link on the page and print its href attribute
    for link in soup.find_all('a'):
        print(link.get('href'))
    # get the first table on the page (if any) and print all of its cells
    first_table = soup.find('table')
    if first_table is not None:
        for table_cell in first_table.find_all('td'):
            print(table_cell.text)
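
Running the code is then just a matter of calling crawl(). As a minimal sketch of turning the same idea into a very simple crawler (the collect_links function name, the start URL, and the filter for absolute links are illustrative assumptions, not part of the original code), the link extraction can be factored out and applied to any page:

import requests
from bs4 import BeautifulSoup

def collect_links(url):
    # fetch a page and return the absolute links found in its anchor tags
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    links = []
    for anchor in soup.find_all('a'):
        href = anchor.get('href')
        if href and href.startswith('http'):
            links.append(href)
    return links

if __name__ == "__main__":
    # start from one page and print every link it points to
    start_url = "http://www.google.com"
    for link in collect_links(start_url):
        print(link)

From here, a crawler would simply feed the collected links back into collect_links to visit further pages.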