Chapter 20 Web Scraping

Web scraping is the term for using a program to download and process content from the web. The following modules are commonly used for this in Python:

  • webbrowser: A standard-library Python module that opens a browser to a specific page.
  • requests: A module to download files and web pages from the Internet.
  • bs4: A module to parse HTML, i.e., the format that web pages are written in.
  • selenium: A module to launch and control a web browser (e.g., filling in forms, simulating mouse clicks).

20.1 webbrowser Module

  • Create a Python script named py-checkword.py with the following code:
#! python3
import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
  # Get input from the command line
  target = ' '.join(sys.argv[1:])
else:
  # Get input from the clipboard
  target = pyperclip.paste()
  
webbrowser.open('https://www.dictionary.com/browse/'+ target)
  • Run the Python script in the terminal:
python py-checkword.py beauty
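
Input taken from the command line or the clipboard may contain spaces or other characters that are not safe inside a URL. The following is a minimal sketch of percent-encoding the target before building the address. It is not part of the original script: the helper name build_lookup_url is hypothetical, and it assumes only the Dictionary.com URL pattern used above.

```python
from urllib.parse import quote

BASE_URL = 'https://www.dictionary.com/browse/'

def build_lookup_url(target):
    """Return the lookup URL for a word or phrase (hypothetical helper)."""
    # quote() percent-encodes spaces and other characters that are unsafe in a URL path
    return BASE_URL + quote(target.strip())

print(build_lookup_url('beauty'))
print(build_lookup_url('ice cream'))
```

The returned string can be passed directly to webbrowser.open().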

Exercise 20.1 How would you modify py-checkword.py so that the user can pass a list of words separated by spaces? For example, the modified script should open three browser pages, one each for beauty, internet, and national.

python py-checkword2.py beauty internet national

20.2 requests Module

  • The requests module allows us to easily download files from the web without having to worry about complicated issues such as network errors, connection problems, and data compression.
import requests

res = requests.get('https://www.gutenberg.org/files/2591/2591-0.txt')
type(res)
<class 'requests.models.Response'>

## Check status code to see if the download is successful
res.status_code == requests.codes.ok  ## `requests.codes.ok` == 200
True
len(res.text)
560045
print(res.text[:250])
The Project Gutenberg eBook of Grimms’ Fairy Tales, by Jacob Grimm and Wilhelm Grimm

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. 
  • Prepare for potential errors during the file download:
import requests

res = requests.get('https://www.gutenberg.org/file-that-does-not-exist.txt')

## Check status code to see if the download is successful
res.status_code == requests.codes.ok  ## `requests.codes.ok` == 200
False
len(res.text)
6392
print(res.text[:250])
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
 <meta charset="UTF-8"/>

<title>404 | Project Gutenberg</title>
 <link rel="stylesheet" href="/gutenberg/style.css?v=1.1">
 <link rel="stylesheet" href="/gutenberg/collapsible.css
res.raise_for_status()
Error: requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://www.gutenberg.org/file-that-does-not-exist.txt
  • A better approach is to detect a failed download as soon as it happens, so that the program can stop or report the problem instead of continuing with bad data.
  • Always call raise_for_status() after calling requests.get() because we need to make sure the file has been successfully downloaded before the program continues.
import requests

res = requests.get('https://www.gutenberg.org/file-that-does-not-exist.txt')
try:
  res.raise_for_status()
except Exception as exc:
  print('There was a problem with the link: %s' % (exc))
There was a problem with the link: 404 Client Error: Not Found for url: https://www.gutenberg.org/file-that-does-not-exist.txt
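
The try/except above reports the problem but lets the script keep running. If the goal is to stop, one option is to turn the failure into an exit. The following is a minimal sketch under stated assumptions: it relies only on the object having a raise_for_status() method, as a requests response does. The names require_ok and FakeResponse are hypothetical, and the stand-in class exists only so the example runs without a network connection.

```python
import sys

def require_ok(res):
    """Exit the program with a message if the response reports an HTTP error."""
    try:
        res.raise_for_status()
    except Exception as exc:
        # sys.exit() with a string prints it to stderr and exits with status 1
        sys.exit('There was a problem with the link: %s' % exc)
    return res


class FakeResponse:
    """Hypothetical stand-in for a requests response, for offline illustration."""
    def __init__(self, ok):
        self.ok = ok

    def raise_for_status(self):
        if not self.ok:
            raise RuntimeError('404 Client Error: Not Found')


good = FakeResponse(ok=True)
print(require_ok(good) is good)
```

With a real download, the call would look like res = require_ok(requests.get(url)).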
  • Often we want to scrape text from the web and save it to the hard drive:
import requests

res = requests.get('https://www.gutenberg.org/files/2591/2591-0.txt')
try:
  res.raise_for_status()
except Exception as exc:
  print('There was a problem with the link: %s' % (exc))
else:
  # Save the text only if the download succeeded
  with open('grimms.txt', 'w', encoding='utf-8') as f:
    f.write(res.text)
  
560045
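
For large files, holding the whole body in res.text may be wasteful. A requests response also provides iter_content(), which yields the body as chunks of bytes. The following is a minimal sketch of writing a download chunk by chunk. The function name save_in_chunks, the default chunk size, and the FakeResponse stand-in are assumptions made for illustration so that the example runs offline.

```python
import os
import tempfile

def save_in_chunks(res, path, chunk_size=100000):
    """Write a response body to `path` chunk by chunk; return bytes written."""
    res.raise_for_status()          # do not save an error page
    written = 0
    with open(path, 'wb') as f:     # binary mode: chunks are bytes, not str
        for chunk in res.iter_content(chunk_size):
            written += f.write(chunk)
    return written


class FakeResponse:
    """Hypothetical stand-in with the two methods used above."""
    def __init__(self, body):
        self.body = body

    def raise_for_status(self):
        pass

    def iter_content(self, chunk_size):
        for i in range(0, len(self.body), chunk_size):
            yield self.body[i:i + chunk_size]


path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
print(save_in_chunks(FakeResponse(b'Once upon a time'), path, chunk_size=4))
```

With a real download, the call would look like save_in_chunks(requests.get(url, stream=True), 'grimms.txt'), where stream=True asks requests not to load the whole body up front.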

20.3 bs4 Module (Beautiful Soup)

Beautiful Soup is a module for extracting information from an HTML page. The package is installed with pip install -U beautifulsoup4, but it is imported as bs4.
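
As a minimal sketch of how Beautiful Soup extracts information, the example below parses a small hand-written HTML string rather than a live page. The markup and the class names entry and pos are made up for illustration, and the sketch uses Python's built-in 'html.parser' so that no extra parser needs to be installed.

```python
import bs4

html = """
<div class="entry">
  <h1>produce</h1>
  <span class="pos">verb</span>
  <span class="pos">noun</span>
</div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns a list of matching tags
headword = soup.select('h1')[0].getText()
pos_tags = [tag.getText() for tag in soup.select('.entry .pos')]

print(headword)
print(pos_tags)
```

The same select() and getText() calls work on a page downloaded with requests, but the selectors have to match that page's actual markup.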

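  • In the script below, each word is stored as a dict:
    • "headword": the headword string
    • "pronunciation": the IPA transcription
    • one key per part of speech (e.g., "noun"), each holding a list of senses of the form {"definition": "", "example": ""}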
import requests, bs4
target='individual'
res = requests.get('https://www.dictionary.com/browse/' + target)
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, 'lxml')
entries = soup.select('.css-1avshm7') # entry containers; generated class names like this may change when the site is updated

## Define the word structure (dict)

cur_word = {}
# for each entry
for i, entry in enumerate(entries):
  ## Include only the main entry of the page
  if len(entry.select('h1')) > 0:
      #print('Entry Number: ', i)
      ##  headword and pronunciations
      cur_headword = entry.select('h1')[0].getText()
      cur_spell = entry.select('.pron-spell-content')[0].getText()
      cur_ipa = entry.select('.pron-ipa-content')[0].getText()
      #print('Headword: ', cur_headword)
      #print('Pronunciation: ', cur_ipa)
      
      cur_word['headword'] = cur_headword
      cur_word['pronunciation'] = cur_ipa
      
      # for each POS type in the current entry
      for pos in entry.select('.css-pnw38j'): 
          cur_pos = pos.select('.luna-pos')[0].getText()
          #print('='*10)
          #print('POS: ', cur_pos.upper())
          cur_definitions = pos.select('div[value]')
          cur_sense_list =[]
          # for each definition in the current POS
          for sense in cur_definitions:
            #print('DEF: ' + sense.find(text=True, recursive=True))
            ## check if there's any example
            ex = sense.find(attrs={'class':'luna-example'})
            if ex is not None:
              cur_ex = ex.getText()
              _ = ex.extract()
            else:
              cur_ex = ''
            cur_def = sense.getText()
            #print('-'*10)
            #print('Definition: ' + cur_def)
            #print('Example: '+ cur_ex)
            cur_sense = {'definition': cur_def, 'example': cur_ex}
            cur_sense_list.append(cur_sense)
          cur_word[cur_pos] = cur_sense_list

import json

with open(target+'.json', 'w', encoding='utf-8') as f:
  json.dump(cur_word, f, ensure_ascii=False)
  
print(json.dumps(cur_word, sort_keys=False, indent=4, ensure_ascii=False))
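
The ensure_ascii=False argument used above matters because the pronunciation field contains IPA symbols. The following is a minimal sketch of the difference, using a one-field dict made up for illustration.

```python
import json

word = {'pronunciation': 'prəˈdus'}

escaped = json.dumps(word)                       # default: non-ASCII escaped as \uXXXX
readable = json.dumps(word, ensure_ascii=False)  # keeps the IPA characters as-is

print(escaped)
print(readable)

# Both forms decode back to the same dict
print(json.loads(escaped) == json.loads(readable))
```

Either form is valid JSON. The unescaped form is easier to read in a text editor, which is why the file is also opened with encoding='utf-8'.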
  • A saved JSON file can be read back into a dict, here using a previously saved produce.json:
import json

with open('produce.json','r', encoding='utf-8') as f:
  cur_word = json.load(f)
cur_word.keys()
dict_keys(['headword', 'pronunciation', 'verb (used with object),', 'verb (used without object),', 'noun'])
print(json.dumps(cur_word, sort_keys=False, indent=4, ensure_ascii=False))
{
    "headword": "produce",
    "pronunciation": "/ verb prəˈdus, -ˈdyus; noun ˈprɒd us, -yus, ˈproʊ dus, -dyus  /",
    "verb (used with object),": [
        {
            "definition": "to bring into existence; give rise to; cause: ",
            "example": "to produce steam."
        },
        {
            "definition": "to bring into existence by intellectual or creative ability: ",
            "example": "to produce a great painting."
        },
        {
            "definition": "to make or manufacture: ",
            "example": "to produce automobiles for export."
        },
        {
            "definition": "to bring forth; give birth to; bear: ",
            "example": "to produce a litter of puppies."
        },
        {
            "definition": "to provide, furnish, or supply; yield: ",
            "example": "a mine producing silver."
        },
        {
            "definition": "Finance. ",
            "example": ""
        },
        {
            "definition": "to cause to accrue: ",
            "example": "stocks producing unexpected dividends."
        },
        {
            "definition": "to bring forward; present to view or notice; exhibit: ",
            "example": "to produce one's credentials."
        },
        {
            "definition": "to bring (a play, movie, opera, etc.) before the public.",
            "example": ""
        },
        {
            "definition": "to extend or prolong, as a line.",
            "example": ""
        }
    ],
    "verb (used without object),": [
        {
            "definition": "to create, bring forth, or yield offspring, products,  etc.: ",
            "example": "Their mines are closed because they no longer produce."
        },
        {
            "definition": "Economics. ",
            "example": ""
        },
        {
            "definition": "to create economic value; bring crops, goods, etc., to a point at which they will command a price.",
            "example": ""
        }
    ],
    "noun": [
        {
            "definition": "something that is produced; yield; product. ",
            "example": ""
        },
        {
            "definition": "agricultural products collectively, especially vegetables and fruits.",
            "example": ""
        },
        {
            "definition": "offspring, especially of a female animal: ",
            "example": "the produce of a mare."
        }
    ]
}
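
Once loaded, the dict can be processed like any other nested Python structure. The following is a minimal sketch that lists only the senses that come with an example. It uses a small hand-written dict with the same layout as the output above, and the helper name senses_with_examples is hypothetical.

```python
word = {
    'headword': 'produce',
    'pronunciation': '/ prəˈdus /',
    'noun': [
        {'definition': 'something that is produced; yield; product. ', 'example': ''},
        {'definition': 'offspring, especially of a female animal: ',
         'example': 'the produce of a mare.'},
    ],
}

def senses_with_examples(word):
    """Yield (part of speech, definition, example) for senses that have an example."""
    for key, value in word.items():
        # Part-of-speech entries are the keys whose value is a list of senses
        if isinstance(value, list):
            for sense in value:
                if sense['example']:
                    yield key, sense['definition'].strip(), sense['example']

for pos, definition, example in senses_with_examples(word):
    print(pos, '|', definition, '|', example)
```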

Exercise 20.2 How would you extend this short script so that users can look up multiple words at once, scrape all definitions and examples from Dictionary.com, and save them to the hard drive as JSON files in a specific directory?

checkwords(targets=["individual", "wonderful"], outdir = 'dictionary_results/')
import json
outdir = 'dictionary_results/'
with open(outdir+'individual.json','r', encoding='utf-8') as f:
  cur_word = json.load(f)
print(json.dumps(cur_word, sort_keys=False, indent=4, ensure_ascii=False))
{
    "headword": "individual",
    "pronunciation": "/ ˌɪn dəˈvɪdʒ u əl  /",
    "noun": [
        {
            "definition": "a single human being, as distinguished from a group.",
            "example": ""
        },
        {
            "definition": "a person: ",
            "example": "a strange individual."
        },
        {
            "definition": "a distinct, indivisible entity; a single thing, being, instance, or item.",
            "example": ""
        },
        {
            "definition": "a group considered as a unit.",
            "example": ""
        },
        {
            "definition": "Biology.  a single organism capable of independent existence. a member of a compound organism or colony.",
            "example": ""
        },
        {
            "definition": "Cards. a duplicate-bridge tournament in which each player plays the same number of hands in partnership with every other player, individual scores for each player being kept for each hand.",
            "example": ""
        }
    ],
    "adjective": [
        {
            "definition": "single; particular; separate: ",
            "example": "to number individual copies of a limited edition."
        },
        {
            "definition": "intended for the use of one person only: ",
            "example": "to serve individual portions of a pizza."
        },
        {
            "definition": "of, relating to, or characteristic of a particular person or thing: ",
            "example": "individual tastes."
        },
        {
            "definition": "distinguished by special, singular, or markedly personal characteristics; exhibiting unique or unusual qualities: ",
            "example": "a highly individual style of painting."
        },
        {
            "definition": "existing as a distinct, indivisible entity, or considered as such; discrete: ",
            "example": "individual parts of a tea set."
        },
        {
            "definition": "of which each is different or of a different design from the others: ",
            "example": "a set of individual coffee cups."
        }
    ]
}