9. Assignment IV: Neural Language Model#

9.1. Question 1#

Use the same simple dataset (as shown below) discussed in class and create a simple trigram-based neural language model using sequence models.

# source text
data = """ Jack and Jill went up the hill\n
		To fetch a pail of water\n
		Jack fell down and broke his crown\n
		And Jill came tumbling after\n """
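
A minimal sketch of one possible approach is given below, assuming a TensorFlow 2.x / Keras environment; the layer sizes, embedding dimension, and number of epochs are illustrative assumptions, not required settings.

# Sketch (not the required solution): a trigram-based neural LM.
# Two context words are used to predict the third word.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
vocab_size = len(tokenizer.word_index) + 1

# integer-encode the text, then slide a window of three words
encoded = tokenizer.texts_to_sequences([data])[0]
trigrams = np.array([encoded[i-2:i+1] for i in range(2, len(encoded))])
X = trigrams[:, :-1]                                         # two context words
y = to_categorical(trigrams[:, -1], num_classes=vocab_size)  # one-hot target word

model = Sequential([
    Embedding(vocab_size, 10),
    LSTM(50),
    Dense(vocab_size, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, epochs=500, verbose=0)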

9.2. Question 2#

Use the same simple dataset and create a simple line-based neural language model using sequence models.

For example, given the second line, To fetch a pail of water, you need to include all possible sequences from this line as inputs for your neural language model training, including all bigram, trigram, …, n-gram prefixes (see below).

# source text
data = """Jack and Jill went up the hill\n
		To fetch a pail of water\n
		Jack fell down and broke his crown\n
		And Jill came tumbling after\n"""
## sequences as inputs from line 2
['To', 'fetch']
['To', 'fetch', 'a']
['To', 'fetch', 'a', 'pail']
['To', 'fetch', 'a', 'pail', 'of']
['To', 'fetch', 'a', 'pail', 'of', 'water']
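
One way to build these per-line prefix sequences is sketched below; the pre-padding to a common length is an assumption about how you will batch them, and the Tokenizer-based encoding mirrors the sketch in Question 1.

# Sketch: collect every prefix of length >= 2 from each line,
# then pre-pad so all sequences share one length.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

sequences = []
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(2, len(encoded) + 1):   # bigram, trigram, ..., full line
        sequences.append(encoded[:i])

max_len = max(len(s) for s in sequences)
sequences = pad_sequences(sequences, maxlen=max_len, padding='pre')
X, y = sequences[:, :-1], sequences[:, -1]   # last word is the prediction target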

9.3. Question 3#

Use the Brown corpus (nltk.corpus.brown) to create a trigram-based neural language model.

Please use the language model to generate 50-word text sequences using the seed text “The news”. Provide a few examples from your trained model.

A few important notes on data preprocessing:

  • When preparing the input sequences of trigrams for model training, please make sure that no trigram spans across "sentence boundaries". You can utilize the sentence tokenization annotations provided by nltk.corpus.brown.sents() (see the sketch after this list).

  • The neural language model will be trained on all trigrams in the entire Brown corpus that fulfill the above criterion.
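
For instance, within-sentence trigrams can be collected as sketched below (the lower-casing is an assumption; adapt the preprocessing to your own pipeline).

# Sketch: enumerate trigrams sentence by sentence so that
# no trigram crosses a sentence boundary.
import nltk
from nltk.corpus import brown

nltk.download('brown')   # make sure the corpus is available

trigrams = []
for sent in brown.sents():
    words = [w.lower() for w in sent]
    for i in range(2, len(words)):
        trigrams.append(words[i-2:i+1])   # three consecutive words, one sentence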

Warning

Even though the Brown corpus is a small one, the input sequences of all observed trigrams may still require a considerable amount of memory to process. Some of you may not have enough memory to store the entire input and output data at once. Please find your own solution if you encounter this memory issue. (Hint: use a Python generator; a sketch follows below.)
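
A minimal sketch of the generator idea, assuming encoded_trigrams is a list of integer-encoded trigrams (a hypothetical name for the trigrams above after mapping words to indices) and vocab_size is known:

# Sketch: yield (X, y) batches on the fly instead of materializing
# the full one-hot output matrix in memory.
import numpy as np
from tensorflow.keras.utils import to_categorical

def batch_generator(encoded_trigrams, vocab_size, batch_size=512):
    seqs = np.array(encoded_trigrams)
    while True:                           # Keras expects an endless generator
        for start in range(0, len(seqs), batch_size):
            batch = seqs[start:start + batch_size]
            X = batch[:, :-1]                                        # context words
            y = to_categorical(batch[:, -1], num_classes=vocab_size)
            yield X, y

# model.fit(batch_generator(encoded_trigrams, vocab_size),
#           steps_per_epoch=len(encoded_trigrams) // 512, epochs=10)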

Warning

When you use your trigram-based neural language model to generate sequences, please add randomness to the sampling of the next word. If you always ask the language model to choose the next word with the highest predicted probability, your text will be very repetitive. Please consult Ch. 8.1.4 of François Chollet’s Deep Learning with Python (one of the course textbooks) for the implementation of temperature for this exercise; a sketch is given below.
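
A sketch of the temperature-based reweighting described in Chollet’s chapter is given below; the small epsilon that guards against log(0) is an added safeguard, not part of the book's original listing.

# Sketch: reweight the predicted distribution by a temperature,
# then sample one word index from the reweighted distribution.
import numpy as np

def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-10) / temperature   # epsilon avoids log(0)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)         # renormalize to a distribution
    probas = np.random.multinomial(1, preds, 1)   # draw one sample
    return np.argmax(probas)

Lower temperatures make the sampling more conservative and repetitive; higher temperatures make it more surprising but less coherent.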

  • Examples of the 50-word text sequences created by the language model:

for s in generated_sequences:
    print(s + '\n' * 2)
The news that was the first time was that the public interest in the first time he was '' and the in the of the state to the of the world of these theories '' and a few days '' he said that a note of the characteristics of the time of


The news of rayburn's commitment well known that mine '' he said '' he said he was in his own life and of the most part of the women have been the of her and mother '' said mrs buck have not been as a result of a group of the and


The news that is the basic truth in the next day to relax the emotional stimulation and fear that the author of the western world '' and said it was not a little more than the most of the state of the quarrel obtained a qualification that most of these forces as


The news and a little of the time we are never trying to find out what he has a small boy and a series of a new crisis the book was not a tax bill was not at the time of the white house would be to the extent to which he


The news of the church must be well to the extent of the most important element of the '' the end of the whole world '' he said he was in the of the '' of the and of the state of the is the of his new ideas that had been