Language Model#

Training a Trigram Language Model using Reuters#

%%time

# code courtesy of https://nlpforhackers.io/language-models/

from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict

# Create a placeholder for the model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1
 
# Let's transform the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count
CPU times: user 10.3 s, sys: 830 ms, total: 11.1 s
Wall time: 11.1 s
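
The pad_left/pad_right options wrap each sentence in None markers, so the model also learns sentence-initial and sentence-final contexts. A quick sketch of what this padding produces on a made-up two-word sentence:

from nltk import trigrams

list(trigrams(["economy", "grew"], pad_left=True, pad_right=True))
# [(None, None, 'economy'), (None, 'economy', 'grew'),
#  ('economy', 'grew', None), ('grew', None, None)]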

Check Language Model#

sorted(dict(model["the","news"]).items(), key=lambda x:-1*x[1])
[('conference', 0.25),
 ('of', 0.125),
 ('.', 0.125),
 ('with', 0.08333333333333333),
 (',', 0.08333333333333333),
 ('agency', 0.08333333333333333),
 ('that', 0.08333333333333333),
 ('brought', 0.041666666666666664),
 ('about', 0.041666666666666664),
 ('broke', 0.041666666666666664),
 ('on', 0.041666666666666664)]
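
Two quick sanity checks (a small sketch; note that merely indexing the defaultdict with an unseen context inserts an empty entry as a side effect):

# the conditional distribution over next words should sum to roughly 1
sum(model["the", "news"].values())

# a context never seen in training yields an empty distribution,
# i.e., every continuation gets zero probability
dict(model["unseen", "context"])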

Text Generation Using the Trigram Model#

  • Using the trigram model to predict the next word.

  • The next word is sampled from the model’s conditional probability distribution: a random threshold r is drawn, and the first word whose cumulative probability exceeds r is selected (roulette-wheel sampling).

  • The generator stops when two consecutive None tokens are predicted, i.e., the right-padding that marks the end of a sentence.

# code courtesy of https://nlpforhackers.io/language-models/
import random

# starting words
text = ["the", "news"]
sentence_finished = False
 
while not sentence_finished:
    # select a random probability threshold
    r = random.random()
    accumulator = 0.0

    for word in model[tuple(text[-2:])].keys():
        accumulator += model[tuple(text[-2:])][word]
        # pick the first word whose cumulative probability exceeds the threshold
        if accumulator >= r:
            text.append(word)
            break

    if text[-2:] == [None, None]:
        sentence_finished = True

print(' '.join([t for t in text if t]))
the news conference that any rights Dixon might have to return to a faulty start this spring .
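
The loop above can hang if generation wanders into a context with no recorded continuations. A minimal refactor sketch (the function name generate_text is our own) that stops early in that case:

import random

def generate_text(model, seed=("the", "news")):
    """Sample one sentence from the trigram model, starting from a seed bigram."""
    text = list(seed)
    while text[-2:] != [None, None]:
        r = random.random()
        accumulator = 0.0
        for word, prob in model[tuple(text[-2:])].items():
            accumulator += prob
            if accumulator >= r:
                text.append(word)
                break
        else:
            break  # unseen context: no continuations recorded, stop early
    return ' '.join(t for t in text if t)

print(generate_text(model))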

Issues with Ngram Language Models#

  • The ngram size is of key importance: the higher the order of the ngram, the better the prediction, but at the cost of heavier computation and greater data sparseness.

  • Unseen ngrams are always a concern: any trigram absent from the training data receives zero probability.

  • Probability smoothing is needed to redistribute some mass to these unseen events; a minimal sketch follows below.
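
As a concrete illustration of the smoothing issue, here is a minimal add-alpha (Laplace) sketch. It rebuilds the raw trigram counts, since the model above was normalized in place; the function name laplace_prob is our own:

from collections import Counter, defaultdict
from nltk import trigrams
from nltk.corpus import reuters

# rebuild raw counts (the model above holds probabilities, not counts)
counts = defaultdict(Counter)
vocab = set()
for sentence in reuters.sents():
    vocab.update(sentence)
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        counts[(w1, w2)][w3] += 1

V = len(vocab) + 1  # +1 for the None padding symbol

def laplace_prob(context, word, alpha=1.0):
    """Add-alpha smoothed P(word | context): unseen trigrams get small non-zero mass."""
    c = counts[context]
    return (c[word] + alpha) / (sum(c.values()) + alpha * V)

laplace_prob(("the", "news"), "conference")  # pulled slightly below the MLE of 0.25
laplace_prob(("the", "news"), "zebra")       # small but non-zero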

Neural Language Model#

  • Neural language models based on deep learning may provide a better alternative for modeling the probabilistic relationships of linguistic units; see the sketch below.
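
A minimal sketch of such a model, assuming PyTorch is available; the architecture (a Bengio-style feedforward trigram network) and all hyperparameters are illustrative only:

import torch
import torch.nn as nn

class TrigramNeuralLM(nn.Module):
    """Feedforward trigram LM: embed the two context words, predict the next one."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, context):       # context: (batch, 2) tensor of word indices
        e = self.embed(context)       # (batch, 2, embed_dim)
        return self.ff(e.flatten(1))  # (batch, vocab_size) logits over the next word

# training minimizes cross-entropy between the logits and the observed next word:
# loss = nn.functional.cross_entropy(lm(context_batch), next_word_batch)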