Ensemble Learning#

This section introduces a hybrid method in machine learning: ensemble learning.

If you aggregate the predictions of a group of predictors (models), you will often get better predictions than with the best individual predictor. This is often referred to as the wisdom of the crowd, i.e., two heads are better than one (or, as the Chinese proverb goes, three cobblers with their wits combined surpass one Zhuge Liang!).

In this section, we will briefly review the most popular ensemble methods:

  • Voting classifiers

  • Bagging and pasting ensembles

  • Random Forests

  • Boosting

  • Stacking

What is Ensemble Learning?#

Ensemble learning is a machine learning technique that combines multiple individual models to improve the overall performance and accuracy of predictions. It leverages the wisdom of crowds by aggregating the predictions of multiple models rather than relying on a single model.
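
Before diving into the concrete methods, here is a quick back-of-the-envelope illustration of the wisdom-of-the-crowd effect (a toy calculation added for illustration, not part of the movie-review task below): if a majority vote is taken over n independent classifiers that are each correct with probability p, the probability that the majority is right is a binomial tail probability, which can be far higher than p.

from scipy.stats import binom

## 1,000 hypothetical classifiers, each barely better than chance
n, p = 1000, 0.51

## Probability that more than half of them are correct, i.e., the majority vote wins
print(binom.sf(n // 2, n, p))

In practice the improvement is smaller because real classifiers make correlated errors, which is why ensemble methods work best when the base predictors are as diverse (independent) as possible.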

Dataset#

import nltk, random
from nltk.corpus import movie_reviews
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
print(len(movie_reviews.fileids()))
print(movie_reviews.categories())
print(movie_reviews.words()[:100])
print(movie_reviews.fileids()[:10])
2000
['neg', 'pos']
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt']
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.seed(123)
random.shuffle(documents)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

## Label Encoder
le = LabelEncoder()

X_texts = [' '.join(words) for (words, label) in documents]
y_labels = [label for (words, label) in documents]

## Labels Transform
le.fit(y_labels)
y = le.transform(y_labels)

## Text Vectorization
X_train, X_test, y_train, y_test = train_test_split(X_texts, y, random_state=42)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tfidf_vec = TfidfVectorizer(min_df = 10, token_pattern = r'[a-zA-Z]+')

## Fit on training set
X_train_bow = tfidf_vec.fit_transform(X_train) # fit train

## Transform the testing set
X_test_bow = tfidf_vec.transform(X_test) # transform test
print(X_train_bow.shape)
print(X_test_bow.shape)
(1500, 6625)
(500, 6625)

Voting Classifier#

  • A voting classifier combines the predictions of multiple base classifiers and predicts the class that receives the most votes (the mode). Voting ensembles exist for both classification (VotingClassifier) and regression (VotingRegressor).

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', DecisionTreeClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)

voting_clf.fit(X_train_bow, y_train)
VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),
                             ('rf', DecisionTreeClassifier(random_state=42)),
                             ('svc', SVC(random_state=42))])
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test_bow, y_test))
lr = 0.792
rf = 0.634
svc = 0.82
voting_clf.predict(X_test_bow[4]) ## the ensemble's majority-vote prediction for one test document
array([0])
[clf.predict(X_test_bow[4]) for clf in voting_clf.estimators_] ## individual predictions
[array([0]), array([1]), array([0])]
voting_clf.score(X_test_bow, y_test)
0.814
## Soft voting: average the predicted class probabilities across the estimators
## and predict the class with the highest average probability

voting_clf.voting = 'soft'
voting_clf.named_estimators['svc'].probability = True  # SVC needs probability=True to expose predict_proba
voting_clf.fit(X_train_bow, y_train)
voting_clf.score(X_test_bow, y_test)
0.762

Bagging and Pasting#

This approach uses the same training algorithm for every predictor, but trains each one on a different random subset of the training set.

  • When the sampling is performed with replacement, it is called bagging (i.e., bootstrap aggregating)

  • When the sampling is performed without replacement, it is called pasting.

Both methods aim to reduce variance and improve generalization by averaging the predictions of individual models.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators = 500,
                            max_samples=1000, n_jobs=-1, random_state=42)
bag_clf.fit(X_train_bow, y_train)
BaggingClassifier(estimator=DecisionTreeClassifier(), max_samples=1000,
                  n_estimators=500, n_jobs=-1, random_state=42)
bag_clf.score(X_test_bow, y_test)
0.754
## another method to evaluate
from sklearn.metrics import accuracy_score
accuracy_score(y_test, bag_clf.predict(X_test_bow))
0.754
  • Out-of-bag evaluation: with bagging, each predictor sees only about 63% of the training instances on average, so the remaining out-of-bag (OOB) instances can serve as a built-in validation set (oob_score=True).

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators = 500,
                            oob_score=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train_bow, y_train)
BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=500,
                  n_jobs=-1, oob_score=True, random_state=42)
bag_clf.oob_score_
0.7573333333333333
bag_clf.score(X_test_bow, y_test)
0.75
  • Random patches method: you can randomly sample the training instances and the training features at the same time (in the example below: bootstrap_features=True and max_features=0.8, in addition to the instance sampling).

  • Pasting method: setting bootstrap=False samples the training instances without replacement, i.e., pasting (see the sketch after the random-patches example below).

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators = 500,
                            max_samples=1000, n_jobs=-1, random_state=42,
                            bootstrap_features=True, max_features=0.8)
bag_clf.fit(X_train_bow, y_train)
BaggingClassifier(bootstrap_features=True, estimator=DecisionTreeClassifier(),
                  max_features=0.8, max_samples=1000, n_estimators=500,
                  n_jobs=-1, random_state=42)
bag_clf.score(X_test_bow, y_test)
0.786
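
For comparison, a pasting ensemble (the second bullet above) only requires turning bootstrapping off. Below is a minimal sketch on the same data, with illustrative (untuned) hyperparameters:

paste_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                              max_samples=1000, bootstrap=False,  # sample without replacement
                              n_jobs=-1, random_state=42)
paste_clf.fit(X_train_bow, y_train)
paste_clf.score(X_test_bow, y_test)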

Random Forests#

A random forest is itself an ensemble of decision trees, typically trained via the bagging method with max_samples set to the size of the training set.

Put differently, a random forest is a popular ensemble learning method that builds multiple decision trees during training. Each tree is trained on a random (bootstrap) subset of the training data and considers a random subset of features when splitting each node. The final prediction is obtained by averaging the predictions of all individual trees (for regression) or taking a majority vote (for classification).

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
rnd_clf.fit(X_train_bow, y_train)

rnd_clf.score(X_test_bow, y_test)
0.788
## Random Forest Bagging Equivalent

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42)

bag_clf.fit(X_train_bow, y_train)
bag_clf.score(X_test_bow, y_test)
0.78
  • Feature Importance Analysis: a random forest scores each feature by how much, on average across all trees, the nodes that split on that feature reduce impurity (exposed as feature_importances_).

for score, name in zip(rnd_clf.feature_importances_, tfidf_vec.get_feature_names_out()):
    if abs(score) > 0.001:
        print(round(score, 5), name)
0.00118 about
0.00131 action
0.00232 allows
0.00122 already
0.00794 also
0.00104 always
0.00122 american
0.01651 and
0.00105 anger
0.00842 any
0.00113 anything
0.00154 anyway
0.00276 as
0.00215 at
0.00194 attempt
0.0013 attempts
0.01079 awful
0.00171 b
0.02183 bad
0.00289 be
0.00211 best
0.00536 better
0.00119 between
0.00233 big
0.00224 bland
0.01345 boring
0.00441 both
0.0013 brilliant
0.00121 bunch
0.00134 called
0.00118 classic
0.00129 clich
0.00124 could
0.00133 d
0.00129 definitely
0.00191 dialogue
0.0013 didn
0.0037 different
0.00187 director
0.0023 do
0.00256 doesn
0.00471 don
0.00304 dull
0.00288 effective
0.00164 either
0.00147 else
0.0023 embarrassing
0.00153 era
0.0032 especially
0.00721 even
0.00205 excellent
0.00122 except
0.00105 extremely
0.00296 fails
0.00146 falls
0.00242 family
0.00108 feel
0.00164 female
0.001 figure
0.0016 film
0.00103 finest
0.00121 force
0.00196 get
0.00119 give
0.00248 gives
0.00212 going
0.00334 great
0.00105 guess
0.00209 had
0.00254 hard
0.0109 have
0.00165 here
0.00184 highly
0.00296 hilarious
0.0012 him
0.00561 his
0.00106 horrible
0.00261 i
0.00573 if
0.00125 inept
0.0026 interesting
0.00484 is
0.00143 isn
0.00663 just
0.00115 know
0.00886 lame
0.00118 laughable
0.00567 least
0.00604 life
0.00224 like
0.00693 looks
0.00206 ludicrous
0.00223 made
0.00165 make
0.00217 many
0.0013 material
0.00214 maybe
0.00104 me
0.00249 memorable
0.00384 mess
0.00155 might
0.00727 minute
0.00295 minutes
0.00239 most
0.01068 movie
0.00143 much
0.00105 my
0.00967 no
0.00555 none
0.00816 nothing
0.00314 obvious
0.00101 obviously
0.00268 of
0.00394 off
0.00337 on
0.00773 only
0.00488 or
0.00152 others
0.00118 out
0.00505 outstanding
0.00209 overall
0.00744 perfect
0.00543 perfectly
0.00344 performance
0.00312 performances
0.00104 period
0.01101 plot
0.00185 point
0.00415 pointless
0.00546 poor
0.0023 poorly
0.00127 portrayal
0.00131 predictable
0.00148 pretty
0.00136 problem
0.00103 quite
0.00492 re
0.00104 realistic
0.00774 reason
0.00704 ridiculous
0.00117 save
0.00104 saved
0.00558 script
0.00157 sex
0.00118 should
0.00641 so
0.0012 some
0.00222 sometimes
0.00693 stupid
0.00209 subtle
0.00814 supposed
0.01285 t
0.00107 talent
0.00286 terrible
0.00139 terrific
0.00105 that
0.00319 the
0.0036 then
0.00863 there
0.00232 they
0.00126 thin
0.00237 thing
0.00212 this
0.0012 to
0.00261 too
0.00398 tries
0.00187 true
0.00204 tv
0.00707 unfortunately
0.00114 unfunny
0.00145 unique
0.00121 up
0.00339 very
0.00165 war
0.00238 wasn
0.00833 waste
0.00777 wasted
0.00132 watching
0.00552 well
0.00137 what
0.00134 whatever
0.00152 while
0.00873 why
0.00102 wife
0.00157 will
0.00262 with
0.0017 wonderful
0.00327 wonderfully
0.00245 world
0.00219 worse
0.01611 worst
0.00473 would
0.00117 yet

Boosting#

Boosting is an ensemble technique that trains multiple weak learners sequentially, with each subsequent model focusing on the examples that previous models struggled with. It aims to improve the performance of the overall model by emphasizing the difficult-to-classify instances.

The general principle is to train predictors sequentially, each trying to correct its predecessor. Two popular boosting methods are:

  • AdaBoost (Adaptive Boosting): It corrects its predecessor by paying a bit more attention to the training instances that the predecessor underfits. Specifically, it tweaks the instance weights at every iteration.

  • Gradient Boosting: It corrects its predecessor by fitting the new predictor to the residual errors made by the previous predictor (see the sketch below).
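
To make the residual-fitting idea concrete, here is a minimal hand-rolled sketch of gradient boosting with regression trees on toy data (the toy data and variable names are assumptions for illustration only, separate from the movie-review pipeline):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X_toy = rng.rand(100, 1) - 0.5
y_toy = 3 * X_toy[:, 0] ** 2 + 0.05 * rng.randn(100)

## First tree fits the original targets
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree1.fit(X_toy, y_toy)

## Second tree fits the residual errors of the first
residuals1 = y_toy - tree1.predict(X_toy)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree2.fit(X_toy, residuals1)

## Third tree fits the residual errors of the second
residuals2 = residuals1 - tree2.predict(X_toy)
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree3.fit(X_toy, residuals2)

## The ensemble predicts by summing the trees' predictions
X_new = np.array([[0.4]])
print(sum(tree.predict(X_new) for tree in (tree1, tree2, tree3)))

GradientBoostingClassifier and GradientBoostingRegressor automate this loop and additionally shrink each tree's contribution by a learning rate.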

from sklearn.ensemble import AdaBoostClassifier

## AdaBoost
ada_clf = AdaBoostClassifier(
    LogisticRegression(),
    n_estimators=500,
    learning_rate=0.5, random_state=42
)

ada_clf.fit(X_train_bow, y_train)

ada_clf.score(X_test_bow, y_test)
0.744
from sklearn.ensemble import GradientBoostingClassifier

gbrt = GradientBoostingClassifier(max_depth=3, n_estimators=500,
                                  learning_rate=0.05, n_iter_no_change=10,
                                  random_state=42)

gbrt.fit(X_train_bow, y_train)
gbrt.score(X_test_bow, y_test)
0.794

Stacking#

The stacking method trains a model to perform the aggregation of the predictions of several different classifiers.

Stacking combines multiple base models by training a meta-model on their predictions. Instead of using a simple voting scheme, stacking learns how to best combine the base models’ outputs to make predictions.

Typically, it involves splitting the training data into multiple folds, training the base models on some folds, and using their out-of-fold predictions as features to train the meta-model. Stacking often leads to improved performance compared to individual models or simple ensembles.
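
The fold-based idea can be sketched by hand with cross_val_predict, which returns out-of-fold predictions. This is a simplified illustration of what StackingClassifier (used below) handles for us; the particular base models here are just assumptions for the sketch:

from sklearn.model_selection import cross_val_predict
import numpy as np

base_models = [LogisticRegression(random_state=42),
               DecisionTreeClassifier(random_state=42)]

## Out-of-fold predicted probabilities of the positive class become the meta-features
meta_features = np.column_stack([
    cross_val_predict(m, X_train_bow, y_train, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])

## The meta-model learns how much weight to give each base model's output
meta_model = LogisticRegression(random_state=42)
meta_model.fit(meta_features, y_train)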

from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', DecisionTreeClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=42),
    cv=5
)

stacking_clf.fit(X_train_bow, y_train)
stacking_clf.score(X_test_bow, y_test)
0.804

Stacking (Different Feature Sets)#

It is also possible to train different classifiers on different feature sets, and then build an ensemble model by stacking these individual classifiers.

In this part, we demonstrate how to create two classifiers for sentiment polarity analysis: one based on a simple bag-of-words representation, and the other based on emotion scores from the NRC Lexicon. We then use the stacking method to create an ensemble model.

Warning

The NRCLex package installed via pip install NRCLex includes a small error in its sentiment analysis. I manually fixed the dict part in nrclex.py. This is a note to myself; hopefully it will be fixed soon.

from nrclex import NRCLex
import numpy

# Create an instance of NRCLex using the input text
print(X_train[921])
emotions = NRCLex(X_train[921]).affect_frequencies

print(emotions)
print([e for e in emotions.values()])
# Print the emotions and their frequencies
for emotion, frequency in emotions.items():
    print(emotion, frequency)
print(len(emotions))
this film is extraordinarily horrendous and i ' m not going to waste any more words on it .
{'fear': 0.0, 'anger': 0.25, 'anticipation': 0.0, 'trust': 0.0, 'surprise': 0.0, 'positive': 0.0, 'negative': 0.5, 'sadness': 0.0, 'disgust': 0.25, 'joy': 0.0}
[0.0, 0.25, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.25, 0.0]
fear 0.0
anger 0.25
anticipation 0.0
trust 0.0
surprise 0.0
positive 0.0
negative 0.5
sadness 0.0
disgust 0.25
joy 0.0
10

The idea is simple. We create two feature sets: one based on TfidfVectorizer() and the other based on NRCLex(). We then concatenate the two feature matrices into one, column-wise.

import pandas as pd

## Create BOW + SENTIMENT features for training set
X_train_emotions = numpy.array([list(NRCLex(x).affect_frequencies.values()) for x in X_train])
X_train_all = numpy.concatenate((X_train_bow.toarray(),X_train_emotions), axis=1)
print(X_train_all.shape)
X_train_all=pd.DataFrame(X_train_all,
                              columns=tfidf_vec.get_feature_names_out().tolist()+['FEAR','ANGER','ANTICIPATION','TRUST', 'SURPRISE', 'POSITIVE','NEGATIVE',
                                       'SADNESS','DISGUST','JOY'])

## Create BOW + SENTIMENT features for testing set
X_test_emotions = numpy.array([list(NRCLex(x).affect_frequencies.values()) for x in X_test])
X_test_all = numpy.concatenate((X_test_bow.toarray(),X_test_emotions), axis=1)
print(X_test_all.shape)
X_test_all=pd.DataFrame(X_test_all,
                              columns=tfidf_vec.get_feature_names_out().tolist()+['FEAR','ANGER','ANTICIPATION','TRUST', 'SURPRISE', 'POSITIVE','NEGATIVE',
                                       'SADNESS','DISGUST','JOY'])
(1500, 6635)
(500, 6635)
print(X_train_bow.shape)
print(X_train_all.shape)
(1500, 6625)
(1500, 6635)
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
bow_cols = list(range(0,6625))
sentiment_cols = list(range(6625,6635))

bow_lg =  LogisticRegression(random_state=42)
sentiment_lg = LogisticRegression(random_state=42)
clf_meta = LogisticRegression(random_state=42)

pipe_bow = Pipeline([
    ('select', ColumnTransformer([('sel', 'passthrough', bow_cols)], remainder='drop')),  # remainder='drop' is the default, but I've included it for clarity
    ('clf', bow_lg)
])
pipe_sentiment = Pipeline([
    ('select', ColumnTransformer([('sel', 'passthrough', sentiment_cols)], remainder='drop')),
    ('clf', sentiment_lg)
])
stack = StackingClassifier(
    estimators=[
        ('bow', pipe_bow),
        ('sentiment', pipe_sentiment),
    ],
    final_estimator=clf_meta
)

stack.fit(X_train_all, y_train)
stack.score(X_test_all, y_test)
0.798

We can examine the contributions of the two sub-models, i.e., the bag-of-words model and the sentiment-based model, by looking at the coefficients of the final estimator of the stacking model.

## Contribution of the two sub-models
stack.final_estimator_.coef_[0]
array([10.52045605,  3.92126497])

We can also further examine the feature importance of the sentiment-based model by looking at the coefficients of the ten sentiment/emotion types in the sentiment-based logistic regression model.

## SENTIMENT model coefficients
for score, emotion in zip(stack.estimators_[1]['clf'].coef_[0], X_train_all.columns[6625:6635]):
    print(emotion, ":", score)
FEAR : 0.43024616785724173
ANGER : -0.70332154214246
ANTICIPATION : 0.6882554148089459
TRUST : 1.7007132512139056
SURPRISE : 0.6816809764302725
POSITIVE : 3.112328247221864
NEGATIVE : -4.970771280439475
SADNESS : -1.8391338529657337
DISGUST : -1.9377960708637092
JOY : 2.8382044480406603