Monday, January 26, 2015

Overview

This post is about web crawling and scraping in Python. The two modules we will need are requests, which downloads the page, and BeautifulSoup (imported from the bs4 package), which parses the HTML. More information can be found in the documentation for requests and for BeautifulSoup.

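# first we need the 2 modules
import requests
from bs4 import BeautifulSoup

def crawl():
    # retrieve google page (the timeout keeps a stalled connection from hanging the script)
    source = requests.get("http://www.google.com", timeout=10)

    # get the source in text format and parse it with Python's built-in HTML parser
    plain_text = source.text
    soup = BeautifulSoup(plain_text, "html.parser")

    # get every link on the page and print its href
    # (link.get returns a plain string, or None when the tag has no href)
    for link in soup.find_all('a'):
        print(link.get('href'))

    # get the first table on the page, if there is one, and print all the table columns
    tables = soup.find_all('table')
    if tables:
        for table_column in tables[0].find_all('td'):
            print(table_column.text)

if __name__ == "__main__":
    crawl()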
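
To try the same BeautifulSoup calls without hitting the network, here is a minimal sketch that parses a small inline HTML string instead of a downloaded page. The HTML snippet and the helper names extract_links and first_table_cells are made up for illustration and are not part of either library. It assumes bs4 is installed and uses Python's built-in html.parser.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the text of a downloaded page
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a name="top">Anchor without href</a>
  <a href="http://example.com/">Example</a>
  <table>
    <tr><td>row 1, col 1</td><td>row 1, col 2</td></tr>
    <tr><td>row 2, col 1</td><td>row 2, col 2</td></tr>
  </table>
</body></html>
"""

def extract_links(soup):
    # Hypothetical helper: href=True skips <a> tags that have no href attribute
    return [a["href"] for a in soup.find_all("a", href=True)]

def first_table_cells(soup):
    # Hypothetical helper: returns an empty list when the page has no table
    table = soup.find("table")
    if table is None:
        return []
    return [td.get_text(strip=True) for td in table.find_all("td")]

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = extract_links(soup)
cells = first_table_cells(soup)
print(links)
print(cells)
```

Returning lists rather than printing inside the loops makes the results easy to check or reuse. In a real crawl, the same helpers could be given a soup built from source.text.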
