Friday, January 29, 2016

Overview

Clipboard

If you’ve ever had to extract a lot of content from a PDF or a website and wanted a faster way to do it, this is it. In this post, I’m going to use Python’s pyperclip and re modules to extract the phone numbers from some text that I’ve copied to the clipboard.

Content

We first need to install pyperclip (re is part of Python’s standard library). It can be installed with pip. If you don’t have pip, open the terminal and run sudo easy_install pip. Then run sudo pip install pyperclip.

Now that we have the two modules we need, we can scrape!

import re, pyperclip

# we need a regex that covers the following types
# (111) 111-1111, 222-222-2222, 333-3333

numberRegex = re.compile(r'''
(
    (
        (\(\d\d\d\)|\d\d\d)     # area code: (111) or 111
        (\s|-)                  # separator: whitespace or -
    )?                          # area code and separator are optional
    \d\d\d                      # first 3 digits
    -                           # separator
    \d\d\d\d                    # last 4 digits
)
''', re.VERBOSE)

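The re.VERBOSE flag tells the regex engine to ignore whitespace and # comments inside the pattern, so you can spread it over several lines and annotate each part. This is easier to understand and modify than having everything on one line. I also wrapped the whole regex in an outer group using (), so the full phone number is captured as the first group.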
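To make the effect of the flag concrete, here is a minimal sketch that writes the same small pattern twice, once compactly and once with re.VERBOSE. The sample strings and variable names are made up for illustration. The last lines show one thing to watch for: because whitespace is ignored in verbose mode, a literal space in the pattern no longer matches a space.

```python
import re

# The same pattern written twice: once on one line, once with re.VERBOSE.
compact = re.compile(r'\d\d\d-\d\d\d\d')

verbose = re.compile(r'''
    \d\d\d      # first 3 digits
    -           # separator
    \d\d\d\d    # last 4 digits
''', re.VERBOSE)

sample = 'Call 333-3333 today'  # made-up sample text
compact_matches = compact.findall(sample)
verbose_matches = verbose.findall(sample)
print(compact_matches)  # ['333-3333']
print(verbose_matches)  # ['333-3333']

# In verbose mode the space in 'a b' is ignored, so the pattern means 'ab'.
space_ignored = re.compile(r'a b', re.VERBOSE).findall('ab a b')
print(space_ignored)    # ['ab']
```

To match a real space in a verbose pattern, use \s or put the space inside a character class such as [ ].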

# We get the text from the clipboard
text = pyperclip.paste()

allPhoneNumberGroups = numberRegex.findall(text)
phoneNumbers = [phone[0] for phone in allPhoneNumberGroups]

The findall method returns a list of tuples when the pattern contains more than one group. Each tuple represents one match, and each item in the tuple is the text captured by one group. A group is created using ().
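The shape of what findall returns depends on how many groups the pattern has. The short sketch below runs three simplified patterns (not the one from this post) against a made-up sample string to show the difference.

```python
import re

sample = '222-222-2222'  # made-up sample text

# No groups: findall returns a list of the matched strings.
no_groups = re.findall(r'\d\d\d-\d\d\d-\d\d\d\d', sample)
print(no_groups)    # ['222-222-2222']

# One group: findall returns only what that group captured.
one_group = re.findall(r'(\d\d\d)-\d\d\d-\d\d\d\d', sample)
print(one_group)    # ['222']

# Several groups: one tuple per match, one item per group,
# ordered by the position of each opening parenthesis.
many_groups = re.findall(r'((\d\d\d)-(\d\d\d-\d\d\d\d))', sample)
print(many_groups)  # [('222-222-2222', '222', '222-2222')]
```

This is why wrapping the whole pattern in an outer group is useful: the full match then always sits at index 0 of each tuple.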

To get the phone numbers, I take the first group from each match, which is the outer group that holds the whole number. The full code is below with an example. It assumes that we copied the string (111) 111-1111, 222-222-2222, 333-3333.

Code

#! python3
import re, pyperclip

# we need a regex that covers the following types
# (111) 111-1111, 222-222-2222, 333-3333

numberRegex = re.compile(r'''
(
    (
        (\(\d\d\d\)|\d\d\d)     # area code: (111) or 111
        (\s|-)                  # separator: whitespace or -
    )?                          # area code and separator are optional
    \d\d\d                      # first 3 digits
    -                           # separator
    \d\d\d\d                    # last 4 digits
)
''', re.VERBOSE)

# We get the text from the clipboard
text = pyperclip.paste()

allPhoneNumberGroups = numberRegex.findall(text)
phoneNumbers = [phone[0] for phone in allPhoneNumberGroups]

# [('(111) 111-1111', '(111) ', '(111)', ' '), ('222-222-2222', '222-', '222', '-'), ('333-3333', '', '', '')]
print(allPhoneNumberGroups)

# ['(111) 111-1111', '222-222-2222', '333-3333']
print(phoneNumbers)
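If you want to try the matching logic without touching the clipboard, one option is to move it into a function that takes the text as an argument. The sketch below is an alternative, not the script above: the names PHONE_PATTERN and extract_phone_numbers are made up for illustration, and it swaps findall for finditer with non-capturing groups, so each match object gives the whole number directly through group(0).

```python
import re

# Hypothetical variant of the pattern: (?: ... ) groups do not capture,
# so there are no tuples to unpack afterwards.
PHONE_PATTERN = re.compile(r'''
    (?:                         # optional prefix, not captured
        (?:\(\d{3}\)|\d{3})     # area code: (111) or 111
        [\s-]                   # separator: whitespace or -
    )?
    \d{3}                       # first 3 digits
    -                           # separator
    \d{4}                       # last 4 digits
''', re.VERBOSE)


def extract_phone_numbers(text):
    """Return every phone number found in text, in order of appearance."""
    return [match.group(0) for match in PHONE_PATTERN.finditer(text)]


numbers = extract_phone_numbers('(111) 111-1111, 222-222-2222, 333-3333')
print(numbers)  # ['(111) 111-1111', '222-222-2222', '333-3333']
```

In the clipboard workflow you would pass pyperclip.paste() as the argument, and the result could be joined with newlines and handed to pyperclip.copy to put the extracted numbers back on the clipboard.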
