Chapter 20 Web Scraping
Web scraping is the term for using a program to download and process content from the web.
- webbrowser: A built-in Python module that opens a browser to a specific page.
- requests: A module to download files and web pages from the Internet.
- bs4: A module to parse HTML, i.e., the format that web pages are written in.
- selenium: A module to launch and control a web browser (e.g., filling in forms, simulating mouse clicks).
20.1 webbrowser Module
- Create a Python script named py-checkword.py with the following code:
#! python3
import webbrowser, sys, pyperclip
if len(sys.argv) > 1:
    # Get input from the command line
    target = ' '.join(sys.argv[1:])
else:
    # Get input from the clipboard
    target = pyperclip.paste()
webbrowser.open('https://www.dictionary.com/browse/' + target)
- Run the Python script in the terminal:
python py-checkword.py beauty
Exercise 20.1 How would you modify py-checkword.py so that the user can pass a list of words separated by spaces? For example, the modified script should open three browser pages, one each for beauty, internet, and national.
python py-checkword2.py beauty internet national
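One possible approach to Exercise 20.1, offered as a sketch rather than the intended solution: build one URL per word and open each in turn. The helper name open_words and its opener parameter are assumptions introduced here for illustration; the URL prefix is the one used in py-checkword.py.

```python
import webbrowser

BASE_URL = 'https://www.dictionary.com/browse/'

def open_words(words, opener=webbrowser.open):
    """Open one dictionary page per word and return the URLs that were opened.

    `opener` is a hypothetical hook (not part of the original script) so the
    function can be tried out without launching a real browser.
    """
    urls = [BASE_URL + word for word in words]
    for url in urls:
        opener(url)
    return urls

# In py-checkword2.py, the command-line words could be passed in like this:
# import sys
# open_words(sys.argv[1:])
```

Treating each command-line argument as a separate word replaces the ' '.join(...) step of the original script, which merged all arguments into a single lookup.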
20.2 requests Module
- The requests module allows us to easily download files from the web without having to worry about complicated issues such as network errors, connection problems, and data compression.
import requests
res = requests.get('https://www.gutenberg.org/files/2591/2591-0.txt')
type(res)
<class 'requests.models.Response'>
## Check status code to see if the download is successful
res.status_code == requests.codes.ok  ## `requests.codes.ok` == 200
True
len(res.text)
560045
print(res.text[:250])
The Project Gutenberg eBook of Grimms’ Fairy Tales, by Jacob Grimm and Wilhelm Grimm
This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever.
- Prepare for potential errors during the file download:
import requests
res = requests.get('https://www.gutenberg.org/file-that-does-not-exist.txt')
## Check status code to see if the download is successful
res.status_code == requests.codes.ok  ## `requests.codes.ok` == 200
False
len(res.text)
6392
print(res.text[:250])
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>404 | Project Gutenberg</title>
<link rel="stylesheet" href="/gutenberg/style.css?v=1.1">
<link rel="stylesheet" href="/gutenberg/collapsible.css
res.raise_for_status()
Error: requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://www.gutenberg.org/file-that-does-not-exist.txt
- A better approach is to make sure that the program stops as soon as an unexpected error happens.
- Always call raise_for_status() after calling requests.get(), because we need to make sure the file has been successfully downloaded before the program continues.
import requests
res = requests.get('https://www.gutenberg.org/file-that-does-not-exist.txt')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem with the link: %s' % (exc))
There was a problem with the link: 404 Client Error: Not Found for url: https://www.gutenberg.org/file-that-does-not-exist.txt
- Usually we want to scrape text from the web and save it to the hard drive.
import requests
res = requests.get('https://www.gutenberg.org/files/2591/2591-0.txt')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem with the link: %s' % (exc))
else:
    # Write the file only if the download succeeded
    with open('grimms.txt', 'w') as f:
        f.write(res.text)
560045
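For larger files, the response body can be written in chunks instead of all at once. The sketch below uses a hypothetical helper name, save_response, and relies only on the raise_for_status() and iter_content() methods of the Response object. Writing in binary mode ('wb') keeps the bytes exactly as the server sent them.

```python
def save_response(res, path, chunk_size=100000):
    """Write a downloaded response to `path` in chunks; return bytes written.

    `res` is expected to be a requests Response (anything offering
    raise_for_status() and iter_content() works). The helper name and the
    chunk size are illustrative choices, not part of the requests API.
    """
    res.raise_for_status()  # stop here if the download failed
    total = 0
    with open(path, 'wb') as f:  # binary mode: no text-encoding guesswork
        for chunk in res.iter_content(chunk_size):
            total += f.write(chunk)
    return total

# Typical use (needs network access, so it is left commented out):
# import requests
# res = requests.get('https://www.gutenberg.org/files/2591/2591-0.txt')
# save_response(res, 'grimms.txt')
```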
20.3 bs4 Module (Beautiful Soup)
Beautiful Soup is a module for extracting information from an HTML page. The package is installed under the name beautifulsoup4 (pip install -U beautifulsoup4), but in code it is imported as bs4 (import bs4).
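As a minimal sketch of how Beautiful Soup is used, the snippet below parses a small, made-up HTML string (not real Dictionary.com markup) and pulls out elements with CSS selectors via select(). It uses Python's built-in 'html.parser' so that no extra parser needs to be installed.

```python
import bs4

# A made-up HTML fragment for illustration; the class names are hypothetical.
html = """
<div class="entry">
  <h1>produce</h1>
  <span class="pos">verb</span>
  <span class="pos">noun</span>
</div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns a list of matching elements
headword = soup.select('h1')[0].getText()
pos_list = [tag.getText() for tag in soup.select('.entry .pos')]

print(headword)  # produce
print(pos_list)  # ['verb', 'noun']
```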
- Each word is stored as a dict:
  - "headword": the headword string
  - "pronunciation": the IPA transcription
  - one key for each part of speech (e.g., "noun"): a list of senses, each of the form {"definition": '', "example": ''}
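To make this structure concrete, here is a hand-written example of such a dict, using two senses from the entry for produce shown later in this section. It illustrates the target shape and is not scraper output.

```python
import json

# Hand-built example of the word structure described above
cur_word = {
    'headword': 'produce',
    'pronunciation': '/ verb prəˈdus, -ˈdyus; noun ˈprɒd us, -yus, ˈproʊ dus, -dyus /',
    # one key per part of speech, each holding a list of senses
    'noun': [
        {'definition': 'offspring, especially of a female animal: ',
         'example': 'the produce of a mare.'},
        {'definition': 'agricultural products collectively, especially vegetables and fruits.',
         'example': ''},
    ],
}

# Senses are plain dicts, so they can be accessed by key
first_sense = cur_word['noun'][0]
print(first_sense['example'])  # the produce of a mare.

# The whole structure serializes directly to JSON
print(json.dumps(cur_word, indent=4, ensure_ascii=False))
```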
import requests, bs4
target = 'individual'
res = requests.get('https://www.dictionary.com/browse/' + target)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')
entries = soup.select('.css-1avshm7')  # entries
## Define the word structure (dict)
cur_word = {}
# for each entry
for i, entry in enumerate(entries):
    ## Include only the main entry of the page
    if len(entry.select('h1')) > 0:
        #print('Entry Number: ', i)
        ## headword and pronunciations
        cur_headword = entry.select('h1')[0].getText()
        cur_spell = entry.select('.pron-spell-content')[0].getText()
        cur_ipa = entry.select('.pron-ipa-content')[0].getText().encode('utf-8').decode('utf-8')
        #print('Headword: ', cur_headword)
        #print('Pronunciation: ', cur_ipa)
        cur_word['headword'] = cur_headword
        cur_word['pronunciation'] = cur_ipa
        # for each POS type in the current entry
        for pos in entry.select('.css-pnw38j'):
            cur_pos = pos.select('.luna-pos')[0].getText()
            #print('='*10)
            #print('POS: ', cur_pos.upper())
            cur_definitions = pos.select('div[value]')
            cur_sense_list = []
            # for each definition in the current POS
            for sense in cur_definitions:
                #print('DEF: ' + sense.find(text=True, recursive=True))
                ## check if there's any example
                ex = sense.find(attrs={'class':'luna-example'})
                if ex is not None:
                    cur_ex = ex.getText()
                    _ = ex.extract()
                else:
                    cur_ex = ''
                cur_def = sense.getText()
                #print('-'*10)
                #print('Definition: ' + cur_def)
                #print('Example: '+ cur_ex)
                cur_sense = {'definition': cur_def, 'example': cur_ex}
                cur_sense_list.append(cur_sense)
            cur_word[cur_pos] = cur_sense_list
import json
with open(target + '.json', 'w', encoding='utf-8') as f:
    json.dump(cur_word, f, ensure_ascii=False)
print(json.dumps(cur_word, sort_keys=False, indent=4, ensure_ascii=False))
import json
with open('produce.json', 'r', encoding='utf-8') as f:
    cur_word = json.load(f)
cur_word.keys()
dict_keys(['headword', 'pronunciation', 'verb (used with object),', 'verb (used without object),', 'noun'])
print(json.dumps(cur_word, sort_keys=False, indent=4, ensure_ascii=False))
{
"headword": "produce",
"pronunciation": "/ verb prəˈdus, -ˈdyus; noun ˈprɒd us, -yus, ˈproʊ dus, -dyus /",
"verb (used with object),": [
{
"definition": "to bring into existence; give rise to; cause: ",
"example": "to produce steam."
},
{
"definition": "to bring into existence by intellectual or creative ability: ",
"example": "to produce a great painting."
},
{
"definition": "to make or manufacture: ",
"example": "to produce automobiles for export."
},
{
"definition": "to bring forth; give birth to; bear: ",
"example": "to produce a litter of puppies."
},
{
"definition": "to provide, furnish, or supply; yield: ",
"example": "a mine producing silver."
},
{
"definition": "Finance. ",
"example": ""
},
{
"definition": "to cause to accrue: ",
"example": "stocks producing unexpected dividends."
},
{
"definition": "to bring forward; present to view or notice; exhibit: ",
"example": "to produce one's credentials."
},
{
"definition": "to bring (a play, movie, opera, etc.) before the public.",
"example": ""
},
{
"definition": "to extend or prolong, as a line.",
"example": ""
}
],
"verb (used without object),": [
{
"definition": "to create, bring forth, or yield offspring, products, etc.: ",
"example": "Their mines are closed because they no longer produce."
},
{
"definition": "Economics. ",
"example": ""
},
{
"definition": "to create economic value; bring crops, goods, etc., to a point at which they will command a price.",
"example": ""
}
],
"noun": [
{
"definition": "something that is produced; yield; product. ",
"example": ""
},
{
"definition": "agricultural products collectively, especially vegetables and fruits.",
"example": ""
},
{
"definition": "offspring, especially of a female animal: ",
"example": "the produce of a mare."
}
]
}
Exercise 20.2 How would you extend this short script so that users can search for multiple words at once, scrape all definitions and examples from Dictionary.com, and save them to the hard drive as JSON files in a specific directory?
checkwords(targets=["individual", "wonderful"], outdir='dictionary_results/')
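One way to organize a solution, sketched under assumptions: wrap the scraping code above in a function that returns the cur_word dict for one word, then loop over the targets and write one JSON file per word. The names scrape_word and scraper are hypothetical placeholders introduced here; only the checkwords(targets, outdir) call shape comes from the exercise.

```python
import json
import os

def scrape_word(target):
    """Hypothetical placeholder: should return the cur_word dict for `target`,
    built with the requests/bs4 code shown earlier in this section."""
    raise NotImplementedError('fill in with the scraping code above')

def checkwords(targets, outdir, scraper=scrape_word):
    """Scrape each word in `targets` and save it as <outdir>/<word>.json.

    Returns the list of file paths written.
    """
    os.makedirs(outdir, exist_ok=True)  # create the directory if needed
    paths = []
    for target in targets:
        cur_word = scraper(target)
        path = os.path.join(outdir, target + '.json')
        with open(path, 'w', encoding='utf-8') as f:
            json.dump(cur_word, f, ensure_ascii=False)
        paths.append(path)
    return paths
```

Keeping the scraping step separate from the file handling makes it possible to try the saving logic with a stand-in function before touching the network.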
import json
outdir = 'dictionary_results/'
with open(outdir + 'individual.json', 'r', encoding='utf-8') as f:
    cur_word = json.load(f)
print(json.dumps(cur_word, sort_keys=False, indent=4, ensure_ascii=False))
{
"headword": "individual",
"pronunciation": "/ ˌɪn dəˈvɪdʒ u əl /",
"noun": [
{
"definition": "a single human being, as distinguished from a group.",
"example": ""
},
{
"definition": "a person: ",
"example": "a strange individual."
},
{
"definition": "a distinct, indivisible entity; a single thing, being, instance, or item.",
"example": ""
},
{
"definition": "a group considered as a unit.",
"example": ""
},
{
"definition": "Biology. a single organism capable of independent existence. a member of a compound organism or colony.",
"example": ""
},
{
"definition": "Cards. a duplicate-bridge tournament in which each player plays the same number of hands in partnership with every other player, individual scores for each player being kept for each hand.",
"example": ""
}
],
"adjective": [
{
"definition": "single; particular; separate: ",
"example": "to number individual copies of a limited edition."
},
{
"definition": "intended for the use of one person only: ",
"example": "to serve individual portions of a pizza."
},
{
"definition": "of, relating to, or characteristic of a particular person or thing: ",
"example": "individual tastes."
},
{
"definition": "distinguished by special, singular, or markedly personal characteristics; exhibiting unique or unusual qualities: ",
"example": "a highly individual style of painting."
},
{
"definition": "existing as a distinct, indivisible entity, or considered as such; discrete: ",
"example": "individual parts of a tea set."
},
{
"definition": "of which each is different or of a different design from the others: ",
"example": "a set of individual coffee cups."
}
]
}