Chapter 20 Web Scraping
Web scraping is the term for using a program to download and process content from the web.
- webbrowser: A standard-library Python module to open a browser to a specific page.
- requests: A module to download files and web pages from the Internet.
- bs4: A module to parse HTML, i.e., the format that web pages are written in.
- selenium: A module to launch and control a web browser (e.g., filling in forms, simulating mouse clicks).
20.1 webbrowser Module
- Create a Python script named py-checkword.py with the following code:
#! python3
import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    # Get input from the command line
    target = ' '.join(sys.argv[1:])
else:
    # Get input from the clipboard
    target = pyperclip.paste()
webbrowser.open('https://www.dictionary.com/browse/' + target)
- Run the Python script in the terminal, for example: python py-checkword.py beauty
Exercise 20.1 How would you modify py-checkword.py so that the user can supply a list of words separated by spaces? For example, the modified script should be able to open three browser pages, one each for beauty, internet, and national.
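One possible direction for this exercise is sketched below. It is not the only solution. The helper name open_words and its opener parameter are assumptions introduced here for illustration; the URL prefix is the one used in py-checkword.py. Each word is passed through urllib.parse.quote so that unusual characters do not break the URL.

```python
import webbrowser
from urllib.parse import quote

# Same URL prefix as in py-checkword.py
BASE_URL = 'https://www.dictionary.com/browse/'


def open_words(words, opener=webbrowser.open):
    """Open one dictionary page per word and return the URLs that were opened.

    `opener` is a hypothetical hook: it defaults to webbrowser.open, but any
    function that accepts a URL can be passed in (handy for dry runs).
    """
    urls = [BASE_URL + quote(word) for word in words]
    for url in urls:
        opener(url)
    return urls
```

In the script itself, calling open_words(sys.argv[1:]) would treat every command-line argument as a separate word instead of joining them into one string.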
20.2 requests Module
- The requests module allows us to easily download files from the web without having to worry about complicated issues such as network errors, connection problems, and data compression.
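import requests
res = requests.get('https://www.gutenberg.org/files/2591/2591-0.txt')
type(res)
<class 'requests.models.Response'>
## Check status code to see if the download is successful
res.status_code == requests.codes.ok ## `requests.codes.ok` == 200
True
len(res.text)
549750
print(res.text[:250])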
The Project Gutenberg eBook of Grimms’ Fairy Tales, by Jacob Grimm and Wilhelm Grimm
This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You
- Be prepared for potential errors during the file download
import requests
res = requests.get('https://www.gutenberg.org/file-that-does-not-exist.txt')
## Check status code to see if the download is successful
res.status_code == requests.codes.ok ## `requests.codes.ok` == 200
False
len(res.text)
6414
print(res.text[:250])
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>404 | Project Gutenberg</title>
<link rel="stylesheet" href="/gutenberg/style.css?v=1.1">
<link rel="stylesheet" href="/gutenberg/collapsible.css
## Calling res.raise_for_status() on this response raises an error with the message:
404 Client Error: Not Found for url: https://www.gutenberg.org/file-that-does-not-exist.txt
- A better approach is to make sure that the program stops as soon as an unexpected error happens.
- Always call raise_for_status() after calling requests.get(), because we need to make sure the file has been successfully downloaded before the program continues.
import requests
res = requests.get('https://www.gutenberg.org/file-that-does-not-exist.txt')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem with the link: %s' % (exc))
There was a problem with the link: 404 Client Error: Not Found for url: https://www.gutenberg.org/file-that-does-not-exist.txt
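The try/except version prints the problem and then carries on. If the goal is for the program to stop on a failed download, one option is to let the exception from raise_for_status() propagate. The sketch below wraps that pattern in a small helper. The name fetch_text and its get parameter are assumptions made for illustration (the parameter lets a stand-in be supplied instead of requests.get); only requests.get(), raise_for_status(), and .text come from the requests module.

```python
def fetch_text(url, get=None):
    """Download `url` and return its text, raising on an HTTP error status."""
    if get is None:
        # Imported here so that a stand-in `get` can be used without requests
        import requests
        get = requests.get
    res = get(url)
    # No try/except: with requests, a 4xx or 5xx status raises
    # requests.exceptions.HTTPError, which stops the program unless
    # the caller chooses to handle it.
    res.raise_for_status()
    return res.text
```

A call such as fetch_text('https://www.gutenberg.org/files/2591/2591-0.txt') would then either return the text or fail loudly, so later code never works on an error page by accident.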
- Usually we want to scrape text from the web and save it to the hard drive.
import requests
res = requests.get('https://www.gutenberg.org/files/2591/2591-0.txt')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem with the link: %s' % (exc))
else:
    # Write the file only if the download succeeded
    with open('grimms.txt', 'w', encoding='utf-8') as f:
        f.write(res.text)
549750
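Writing res.text works well for plain text. As an alternative technique (not the one used above), a response can also be written in binary mode, chunk by chunk, using the iter_content() method that requests provides on Response objects. This avoids any dependence on how the text was decoded and keeps memory use low for large files. The helper name save_response and the default chunk size are assumptions chosen for this sketch.

```python
def save_response(res, path, chunk_size=100_000):
    """Write a downloaded response to `path` in binary chunks.

    `res` is expected to offer iter_content(chunk_size), as
    requests.Response objects do. Returns the number of bytes written.
    """
    written = 0
    with open(path, 'wb') as f:  # 'wb': write raw bytes, no text decoding
        for chunk in res.iter_content(chunk_size):
            written += f.write(chunk)
    return written
```

Typical use would be to call requests.get(), then raise_for_status(), and only then save_response(res, 'grimms.txt').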
20.3 bs4 Module (Beautiful Soup)
Beautiful Soup is a module for extracting information from an HTML page. The package is installed with pip install -U beautifulsoup4, but in code it is imported with import bs4.
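Before turning to a real page, the basic calls can be tried on a small HTML string. In the sketch below, the HTML snippet and its class names (sense, example) are made up for illustration and do not correspond to any real site. It uses Python's built-in 'html.parser' so that no extra parser needs to be installed.

```python
import bs4

# A made-up HTML fragment; the class names are hypothetical.
html = """
<div class="entry">
  <h1>produce</h1>
  <div class="sense">to make or manufacture: <span class="example">to produce automobiles for export.</span></div>
  <div class="sense">to extend or prolong, as a line.</div>
</div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns a list of matching elements
headword = soup.select('h1')[0].get_text()

senses = []
for sense in soup.select('.sense'):
    ex = sense.find(class_='example')
    if ex is not None:
        example = ex.get_text()
        ex.extract()  # remove the example so it is not repeated in the definition
    else:
        example = ''
    senses.append({'definition': sense.get_text().strip(), 'example': example})
```

Here select() handles the CSS selectors, get_text() returns the text inside an element, and extract() detaches an element from the tree.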
- Each word is stored as a dict:
  - "headword": the headword string
  - "pronunciation": the IPA transcription
  - one key per part of speech (e.g., "noun"): a list of senses, each of the form {"definition": '', "example": ''}
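To make this layout concrete, here is a small hand-written entry, assuming that each part of speech is stored under its own key. The pronunciation string is shortened for illustration, and the helper pos_keys is a hypothetical name introduced here.

```python
import json

# A hand-written entry in the layout described above (values are illustrative)
word = {
    'headword': 'produce',
    'pronunciation': '/ prəˈdus /',
    'noun': [
        {'definition': 'something that is produced; yield; product.', 'example': ''},
        {'definition': 'offspring, especially of a female animal:',
         'example': 'the produce of a mare.'},
    ],
}


def pos_keys(entry):
    """Return the part-of-speech keys, i.e. every key except the two fixed ones."""
    return [key for key in entry if key not in ('headword', 'pronunciation')]


# ensure_ascii=False keeps the IPA symbols readable instead of \u escapes
as_json = json.dumps(word, ensure_ascii=False)
restored = json.loads(as_json)
```

Because the entry is built only from dicts, lists, and strings, it survives a round trip through JSON unchanged.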
import requests, bs4
import json

target = 'individual'
res = requests.get('https://www.dictionary.com/browse/' + target)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')
## The class names used below were read off the page source and may change when the site is updated
entries = soup.select('.css-1avshm7') # entries
## Define the word structure (dict)
cur_word = {}
# for each entry
for i, entry in enumerate(entries):
    ## Include only the main entry of the page
    if len(entry.select('h1')) > 0:
        #print('Entry Number: ', i)
        ## headword and pronunciations
        cur_headword = entry.select('h1')[0].getText()
        cur_spell = entry.select('.pron-spell-content')[0].getText()
        cur_ipa = entry.select('.pron-ipa-content')[0].getText()
        #print('Headword: ', cur_headword)
        #print('Pronunciation: ', cur_ipa)
        cur_word['headword'] = cur_headword
        cur_word['pronunciation'] = cur_ipa
        # for each POS type in the current entry
        for pos in entry.select('.css-pnw38j'):
            cur_pos = pos.select('.luna-pos')[0].getText()
            #print('='*10)
            #print('POS: ', cur_pos.upper())
            cur_definitions = pos.select('div[value]')
            cur_sense_list = []
            # for each definition in the current POS
            for sense in cur_definitions:
                #print('DEF: ' + sense.find(text=True, recursive=True))
                ## check if there's any example
                ex = sense.find(attrs={'class': 'luna-example'})
                if ex is not None:
                    cur_ex = ex.getText()
                    ## remove the example so that only the definition text remains
                    _ = ex.extract()
                else:
                    cur_ex = ''
                cur_def = sense.getText()
                #print('-'*10)
                #print('Definition: ' + cur_def)
                #print('Example: '+ cur_ex)
                cur_sense = {'definition': cur_def, 'example': cur_ex}
                cur_sense_list.append(cur_sense)
            ## one key per part of speech
            cur_word[cur_pos] = cur_sense_list

## Save the word as a JSON file and print it
with open(target + '.json', 'w', encoding='utf-8') as f:
    json.dump(cur_word, f, ensure_ascii=False)
print(json.dumps(cur_word, sort_keys=False, indent=4, ensure_ascii=False))
import json
with open('produce.json', 'r', encoding='utf-8') as f:
    cur_word = json.load(f)
cur_word.keys()
dict_keys(['headword', 'pronunciation', 'verb (used with object),', 'verb (used without object),', 'noun'])
{
"headword": "produce",
"pronunciation": "/ verb prəˈdus, -ˈdyus; noun ˈprɒd us, -yus, ˈproʊ dus, -dyus /",
"verb (used with object),": [
{
"definition": "to bring into existence; give rise to; cause: ",
"example": "to produce steam."
},
{
"definition": "to bring into existence by intellectual or creative ability: ",
"example": "to produce a great painting."
},
{
"definition": "to make or manufacture: ",
"example": "to produce automobiles for export."
},
{
"definition": "to bring forth; give birth to; bear: ",
"example": "to produce a litter of puppies."
},
{
"definition": "to provide, furnish, or supply; yield: ",
"example": "a mine producing silver."
},
{
"definition": "Finance. ",
"example": ""
},
{
"definition": "to cause to accrue: ",
"example": "stocks producing unexpected dividends."
},
{
"definition": "to bring forward; present to view or notice; exhibit: ",
"example": "to produce one's credentials."
},
{
"definition": "to bring (a play, movie, opera, etc.) before the public.",
"example": ""
},
{
"definition": "to extend or prolong, as a line.",
"example": ""
}
],
"verb (used without object),": [
{
"definition": "to create, bring forth, or yield offspring, products, etc.: ",
"example": "Their mines are closed because they no longer produce."
},
{
"definition": "Economics. ",
"example": ""
},
{
"definition": "to create economic value; bring crops, goods, etc., to a point at which they will command a price.",
"example": ""
}
],
"noun": [
{
"definition": "something that is produced; yield; product. ",
"example": ""
},
{
"definition": "agricultural products collectively, especially vegetables and fruits.",
"example": ""
},
{
"definition": "offspring, especially of a female animal: ",
"example": "the produce of a mare."
}
]
}
Exercise 20.2 How would you extend this short script so that users can look up multiple words at once, scrape all definitions and examples from Dictionary.com, and save them to the hard drive as JSON files in a specific directory?
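As a starting point for this exercise, the saving half of the task can be sketched separately from the scraping half. In the sketch below, save_words and its scrape_word parameter are hypothetical names. scrape_word stands for any function that takes a word and returns its dict, for example the scraping loop above wrapped in a function. The directory name follows the one used in the loading code for this exercise.

```python
import json
from pathlib import Path


def save_words(words, scrape_word, outdir='dictionary_results'):
    """Scrape each word with `scrape_word` and save it as <word>.json in `outdir`.

    `scrape_word` is a hypothetical callable: word (str) -> dict.
    Returns the list of file paths that were written.
    """
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    saved = []
    for word in words:
        entry = scrape_word(word)
        path = out / f'{word}.json'
        with open(path, 'w', encoding='utf-8') as f:
            json.dump(entry, f, ensure_ascii=False, indent=4)
        saved.append(path)
    return saved
```

Passing the scraper in as an argument keeps the file handling easy to try out with a dummy function before any real requests are sent.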
import json
outdir = 'dictionary_results/'
with open(outdir + 'individual.json', 'r', encoding='utf-8') as f:
    cur_word = json.load(f)
print(json.dumps(cur_word, sort_keys=False, indent=4, ensure_ascii=False))
{
"headword": "individual",
"pronunciation": "/ ˌɪn dəˈvɪdʒ u əl /",
"noun": [
{
"definition": "a single human being, as distinguished from a group.",
"example": ""
},
{
"definition": "a person: ",
"example": "a strange individual."
},
{
"definition": "a distinct, indivisible entity; a single thing, being, instance, or item.",
"example": ""
},
{
"definition": "a group considered as a unit.",
"example": ""
},
{
"definition": "Biology. a single organism capable of independent existence. a member of a compound organism or colony.",
"example": ""
},
{
"definition": "Cards. a duplicate-bridge tournament in which each player plays the same number of hands in partnership with every other player, individual scores for each player being kept for each hand.",
"example": ""
}
],
"adjective": [
{
"definition": "single; particular; separate: ",
"example": "to number individual copies of a limited edition."
},
{
"definition": "intended for the use of one person only: ",
"example": "to serve individual portions of a pizza."
},
{
"definition": "of, relating to, or characteristic of a particular person or thing: ",
"example": "individual tastes."
},
{
"definition": "distinguished by special, singular, or markedly personal characteristics; exhibiting unique or unusual qualities: ",
"example": "a highly individual style of painting."
},
{
"definition": "existing as a distinct, indivisible entity, or considered as such; discrete: ",
"example": "individual parts of a tea set."
},
{
"definition": "of which each is different or of a different design from the others: ",
"example": "a set of individual coffee cups."
}
]
}