Ensemble Learning#

This section introduces a hybrid method in machine learning: ensemble learning.

If you aggregate the predictions of a group of predictors (models), you will often get better predictions than with the best individual predictor. This is often referred to as the wisdom of the crowd, i.e., two heads are better than one (or, as the Chinese proverb goes, three cobblers with their wits combined surpass one Zhuge Liang!).

In this section, we will briefly review the most popular ensemble methods:

  • Voting classifiers

  • Bagging and pasting ensembles

  • Random Forests

  • Boosting

  • Stacking

What is Ensemble Learning?#

Ensemble learning is a machine learning technique that combines multiple individual models to improve the overall performance and accuracy of predictions. It leverages the wisdom of crowds by aggregating the predictions of multiple models rather than relying on a single model.
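
Before diving into the concrete methods, here is a quick back-of-the-envelope illustration of the wisdom-of-the-crowd effect (a toy calculation added for illustration, not part of the movie-review task below): if a majority vote is taken over n independent classifiers that are each correct with probability p, the probability that the majority is right is a binomial tail probability, which can be far higher than p.

from scipy.stats import binom

## 1,000 hypothetical classifiers, each barely better than chance
n, p = 1000, 0.51

## Probability that more than half of them are correct, i.e., the majority vote wins
print(binom.sf(n // 2, n, p))

In practice the improvement is smaller because real classifiers make correlated errors, which is why ensemble methods work best when the base predictors are as diverse (independent) as possible.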

Dataset#

import nltk, random
from nltk.corpus import movie_reviews
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
print(len(movie_reviews.fileids()))
print(movie_reviews.categories())
print(movie_reviews.words()[:100])
print(movie_reviews.fileids()[:10])
2000
['neg', 'pos']
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt']
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.seed(123)
random.shuffle(documents)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

## Label Encoder
le = LabelEncoder()

X_texts = [' '.join(words) for (words, label) in documents]
y_labels = [label for (words, label) in documents]

## Labels Transform
le.fit(y_labels)
y = le.transform(y_labels)

## Text Vectorization
X_train, X_test, y_train, y_test = train_test_split(X_texts, y, random_state=42)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tfidf_vec = TfidfVectorizer(min_df = 10, token_pattern = r'[a-zA-Z]+')

## Fit on training set
X_train_bow = tfidf_vec.fit_transform(X_train) # fit train

## Transform the testing set
X_test_bow = tfidf_vec.transform(X_test) # transform test
print(X_train_bow.shape)
print(X_test_bow.shape)
(1500, 6625)
(500, 6625)

Voting Classifier#

  • A voting classifier combines the predictions of multiple base classifiers and predicts the class that receives the most votes (the mode). Voting ensembles exist for both classification (VotingClassifier) and regression (VotingRegressor).

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', DecisionTreeClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)

voting_clf.fit(X_train_bow, y_train)
VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),
                             ('rf', DecisionTreeClassifier(random_state=42)),
                             ('svc', SVC(random_state=42))])
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test_bow, y_test))
lr = 0.792
rf = 0.634
svc = 0.82
voting_clf.predict(X_test_bow[4]) ## the ensemble's majority-vote prediction for one test document
array([0])
[clf.predict(X_test_bow[4]) for clf in voting_clf.estimators_] ## individual predictions
[array([0]), array([1]), array([0])]
voting_clf.score(X_test_bow, y_test)
0.814
## Soft voting: average the predicted class probabilities across the estimators
## and predict the class with the highest average probability

voting_clf.voting = 'soft'
voting_clf.named_estimators['svc'].probability = True  # SVC needs probability=True to expose predict_proba
voting_clf.fit(X_train_bow, y_train)
voting_clf.score(X_test_bow, y_test)
0.762

Bagging and Pasting#

This approach uses the same training algorithm for every predictor, but trains each one on a different random subset of the training set.

  • When the sampling is performed with replacement, it is called bagging (i.e., bootstrap aggregating)

  • When the sampling is performed without replacement, it is called pasting.

Both methods aim to reduce variance and improve generalization by averaging the predictions of individual models.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators = 500,
                            max_samples=1000, n_jobs=-1, random_state=42)
bag_clf.fit(X_train_bow, y_train)
BaggingClassifier(estimator=DecisionTreeClassifier(), max_samples=1000,
                  n_estimators=500, n_jobs=-1, random_state=42)
bag_clf.score(X_test_bow, y_test)
0.754
## another method to evaluate
from sklearn.metrics import accuracy_score
accuracy_score(y_test, bag_clf.predict(X_test_bow))
0.754
  • Out-of-bag evaluation: with bagging, each predictor sees only about 63% of the training instances on average, so the remaining out-of-bag (OOB) instances can serve as a built-in validation set (oob_score=True).

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators = 500,
                            oob_score=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train_bow, y_train)
BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=500,
                  n_jobs=-1, oob_score=True, random_state=42)
bag_clf.oob_score_
0.7573333333333333
bag_clf.score(X_test_bow, y_test)
0.75
  • Random patches method: you can randomly sample the training instances and the training features at the same time (in the example below: bootstrap_features=True and max_features=0.8, in addition to the instance sampling).

  • Pasting method: setting bootstrap=False samples the training instances without replacement, i.e., pasting (see the sketch after the random-patches example below).

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators = 500,
                            max_samples=1000, n_jobs=-1, random_state=42,
                            bootstrap_features=True, max_features=0.8)
bag_clf.fit(X_train_bow, y_train)
BaggingClassifier(bootstrap_features=True, estimator=DecisionTreeClassifier(),
                  max_features=0.8, max_samples=1000, n_estimators=500,
                  n_jobs=-1, random_state=42)
bag_clf.score(X_test_bow, y_test)
0.786
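
For comparison, a pasting ensemble (the second bullet above) only requires turning bootstrapping off. Below is a minimal sketch on the same data, with illustrative (untuned) hyperparameters:

paste_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                              max_samples=1000, bootstrap=False,  # sample without replacement
                              n_jobs=-1, random_state=42)
paste_clf.fit(X_train_bow, y_train)
paste_clf.score(X_test_bow, y_test)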

Random Forests#

A random forest is itself an ensemble of decision trees, typically trained via the bagging method with max_samples set to the size of the training set.

Put differently, a random forest is a popular ensemble learning method that builds multiple decision trees during training. Each tree is trained on a random (bootstrap) subset of the training data and considers a random subset of features when splitting each node. The final prediction is obtained by averaging the predictions of all individual trees (for regression) or taking a majority vote (for classification).

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
rnd_clf.fit(X_train_bow, y_train)

rnd_clf.score(X_test_bow, y_test)
0.788
## Random Forest Bagging Equivalent

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42)

bag_clf.fit(X_train_bow, y_train)
bag_clf.score(X_test_bow, y_test)
0.78
  • Feature Importance Analysis: a random forest scores each feature by how much, on average across all trees, the nodes that split on that feature reduce impurity (exposed as feature_importances_).

for score, name in zip(rnd_clf.feature_importances_, tfidf_vec.get_feature_names_out()):
    if abs(score) > 0.001:
        print(round(score, 5), name)
0.00118 about
0.00131 action
0.00232 allows
0.00122 already
0.00794 also
0.00104 always
0.00122 american
0.01651 and
0.00105 anger
0.00842 any
0.00113 anything
0.00154 anyway
0.00276 as
0.00215 at
0.00194 attempt
0.0013 attempts
0.01079 awful
0.00171 b
0.02183 bad
0.00289 be
0.00211 best
0.00536 better
0.00119 between
0.00233 big
0.00224 bland
0.01345 boring
0.00441 both
0.0013 brilliant
0.00121 bunch
0.00134 called
0.00118 classic
0.00129 clich
0.00124 could
0.00133 d
0.00129 definitely
0.00191 dialogue
0.0013 didn
0.0037 different
0.00187 director
0.0023 do
0.00256 doesn
0.00471 don
0.00304 dull
0.00288 effective
0.00164 either
0.00147 else
0.0023 embarrassing
0.00153 era
0.0032 especially
0.00721 even
0.00205 excellent
0.00122 except
0.00105 extremely
0.00296 fails
0.00146 falls
0.00242 family
0.00108 feel
0.00164 female
0.001 figure
0.0016 film
0.00103 finest
0.00121 force
0.00196 get
0.00119 give
0.00248 gives
0.00212 going
0.00334 great
0.00105 guess
0.00209 had
0.00254 hard
0.0109 have
0.00165 here
0.00184 highly
0.00296 hilarious
0.0012 him
0.00561 his
0.00106 horrible
0.00261 i
0.00573 if
0.00125 inept
0.0026 interesting
0.00484 is
0.00143 isn
0.00663 just
0.00115 know
0.00886 lame
0.00118 laughable
0.00567 least
0.00604 life
0.00224 like
0.00693 looks
0.00206 ludicrous
0.00223 made
0.00165 make
0.00217 many
0.0013 material
0.00214 maybe
0.00104 me
0.00249 memorable
0.00384 mess
0.00155 might
0.00727 minute
0.00295 minutes
0.00239 most
0.01068 movie
0.00143 much
0.00105 my
0.00967 no
0.00555 none
0.00816 nothing
0.00314 obvious
0.00101 obviously
0.00268 of
0.00394 off
0.00337 on
0.00773 only
0.00488 or
0.00152 others
0.00118 out
0.00505 outstanding
0.00209 overall
0.00744 perfect
0.00543 perfectly
0.00344 performance
0.00312 performances
0.00104 period
0.01101 plot
0.00185 point
0.00415 pointless
0.00546 poor
0.0023 poorly
0.00127 portrayal
0.00131 predictable
0.00148 pretty
0.00136 problem
0.00103 quite
0.00492 re
0.00104 realistic
0.00774 reason
0.00704 ridiculous
0.00117 save
0.00104 saved
0.00558 script
0.00157 sex
0.00118 should
0.00641 so
0.0012 some
0.00222 sometimes
0.00693 stupid
0.00209 subtle
0.00814 supposed
0.01285 t
0.00107 talent
0.00286 terrible
0.00139 terrific
0.00105 that
0.00319 the
0.0036 then
0.00863 there
0.00232 they
0.00126 thin
0.00237 thing
0.00212 this
0.0012 to
0.00261 too
0.00398 tries
0.00187 true
0.00204 tv
0.00707 unfortunately
0.00114 unfunny
0.00145 unique
0.00121 up
0.00339 very
0.00165 war
0.00238 wasn
0.00833 waste
0.00777 wasted
0.00132 watching
0.00552 well
0.00137 what
0.00134 whatever
0.00152 while
0.00873 why
0.00102 wife
0.00157 will
0.00262 with
0.0017 wonderful
0.00327 wonderfully
0.00245 world
0.00219 worse
0.01611 worst
0.00473 would
0.00117 yet

Boosting#

Boosting is an ensemble technique that trains multiple weak learners sequentially, with each subsequent model focusing on the examples that previous models struggled with. It aims to improve the performance of the overall model by emphasizing the difficult-to-classify instances.

The general principle is to train predictors sequentially, each trying to correct its predecessor. Two popular boosting methods are:

  • AdaBoost (Adaptive Boosting): It corrects its predecessor by paying a bit more attention to the training instances that the predecessor underfits. Specifically, it tweaks the instance weights at every iteration.

  • Gradient Boosting: It corrects its predecessor by fitting the new predictor to the residual errors made by the previous predictor (see the sketch below).
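
To make the residual-fitting idea concrete, here is a minimal hand-rolled sketch of gradient boosting with regression trees on toy data (the toy data and variable names are assumptions for illustration only, separate from the movie-review pipeline):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X_toy = rng.rand(100, 1) - 0.5
y_toy = 3 * X_toy[:, 0] ** 2 + 0.05 * rng.randn(100)

## First tree fits the original targets
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree1.fit(X_toy, y_toy)

## Second tree fits the residual errors of the first
residuals1 = y_toy - tree1.predict(X_toy)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree2.fit(X_toy, residuals1)

## Third tree fits the residual errors of the second
residuals2 = residuals1 - tree2.predict(X_toy)
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree3.fit(X_toy, residuals2)

## The ensemble predicts by summing the trees' predictions
X_new = np.array([[0.4]])
print(sum(tree.predict(X_new) for tree in (tree1, tree2, tree3)))

GradientBoostingClassifier and GradientBoostingRegressor automate this loop and additionally shrink each tree's contribution by a learning rate.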

from sklearn.ensemble import AdaBoostClassifier

## AdaBoost
ada_clf = AdaBoostClassifier(
    LogisticRegression(),
    n_estimators=500,
    learning_rate=0.5, random_state=42
)

ada_clf.fit(X_train_bow, y_train)

ada_clf.score(X_test_bow, y_test)
0.744
from sklearn.ensemble import GradientBoostingClassifier

gbrt = GradientBoostingClassifier(max_depth=3, n_estimators=500,
                                  learning_rate=0.05, n_iter_no_change=10,
                                  random_state=42)

gbrt.fit(X_train_bow, y_train)
gbrt.score(X_test_bow, y_test)
0.794

Stacking#

The stacking method trains a model to perform the aggregation of the predictions of several different classifiers.

Stacking combines multiple base models by training a meta-model on their predictions. Instead of using a simple voting scheme, stacking learns how to best combine the base models’ outputs to make predictions.

Typically, it involves splitting the training data into multiple folds, training the base models on some folds, and using their out-of-fold predictions as features to train the meta-model. Stacking often leads to improved performance compared to individual models or simple ensembles.
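
The fold-based idea can be sketched by hand with cross_val_predict, which returns out-of-fold predictions. This is a simplified illustration of what StackingClassifier (used below) handles for us; the particular base models here are just assumptions for the sketch:

from sklearn.model_selection import cross_val_predict
import numpy as np

base_models = [LogisticRegression(random_state=42),
               DecisionTreeClassifier(random_state=42)]

## Out-of-fold predicted probabilities of the positive class become the meta-features
meta_features = np.column_stack([
    cross_val_predict(m, X_train_bow, y_train, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])

## The meta-model learns how much weight to give each base model's output
meta_model = LogisticRegression(random_state=42)
meta_model.fit(meta_features, y_train)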

from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', DecisionTreeClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=42),
    cv=5
)

stacking_clf.fit(X_train_bow, y_train)
stacking_clf.score(X_test_bow, y_test)
0.804

Stacking (Different Feature Sets)#

It is also possible to train different classifiers on different feature sets, and then build an ensemble model by stacking these individual classifiers.

In this part, we demonstrate how to create two classifiers for sentiment polarity analysis: one based on a simple bag-of-words representation, and the other based on emotion scores from the NRC Lexicon. We then use the stacking method to create an ensemble model.

Warning

The NRCLex package installed via pip install NRCLex includes a small error in its sentiment analysis. I manually fixed the dict part in nrclex.py. This is a note to myself; hopefully it will be fixed soon.

from nrclex import NRCLex
import numpy

# Create an instance of NRCLex using the input text
print(X_train[921])
emotions = NRCLex(X_train[921]).affect_frequencies

print(emotions)
print([e for e in emotions.values()])
# Print the emotions and their frequencies
for emotion, frequency in emotions.items():
    print(emotion, frequency)
print(len(emotions))
this film is extraordinarily horrendous and i ' m not going to waste any more words on it .
{'fear': 0.0, 'anger': 0.25, 'anticipation': 0.0, 'trust': 0.0, 'surprise': 0.0, 'positive': 0.0, 'negative': 0.5, 'sadness': 0.0, 'disgust': 0.25, 'joy': 0.0}
[0.0, 0.25, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.25, 0.0]
fear 0.0
anger 0.25
anticipation 0.0
trust 0.0
surprise 0.0
positive 0.0
negative 0.5
sadness 0.0
disgust 0.25
joy 0.0
10

The idea is simple. We create two feature sets: one based on TfidfVectorizer() and the other based on NRCLex(). We then concatenate the two feature matrices into one, column-wise.

import pandas as pd

## Create BOW + SENTIMENT features for training set
X_train_emotions = numpy.array([list(NRCLex(x).affect_frequencies.values()) for x in X_train])
X_train_all = numpy.concatenate((X_train_bow.toarray(),X_train_emotions), axis=1)
print(X_train_all.shape)
X_train_all=pd.DataFrame(X_train_all,
                              columns=tfidf_vec.get_feature_names_out().tolist()+['FEAR','ANGER','ANTICIPATION','TRUST', 'SURPRISE', 'POSITIVE','NEGATIVE',
                                       'SADNESS','DISGUST','JOY'])

## Create BOW + SENTIMENT features for testing set
X_test_emotions = numpy.array([list(NRCLex(x).affect_frequencies.values()) for x in X_test])
X_test_all = numpy.concatenate((X_test_bow.toarray(),X_test_emotions), axis=1)
print(X_test_all.shape)
X_test_all=pd.DataFrame(X_test_all,
                              columns=tfidf_vec.get_feature_names_out().tolist()+['FEAR','ANGER','ANTICIPATION','TRUST', 'SURPRISE', 'POSITIVE','NEGATIVE',
                                       'SADNESS','DISGUST','JOY'])
(1500, 6635)
(500, 6635)
print(X_train_bow.shape)
print(X_train_all.shape)
(1500, 6625)
(1500, 6635)
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
bow_cols = list(range(0,6625))
sentiment_cols = list(range(6625,6635))

bow_lg =  LogisticRegression(random_state=42)
sentiment_lg = LogisticRegression(random_state=42)
clf_meta = LogisticRegression(random_state=42)

pipe_bow = Pipeline([
    ('select', ColumnTransformer([('sel', 'passthrough', bow_cols)], remainder='drop')),  # remainder='drop' is the default, but I've included it for clarity
    ('clf', bow_lg)
])
pipe_sentiment = Pipeline([
    ('select', ColumnTransformer([('sel', 'passthrough', sentiment_cols)], remainder='drop')),
    ('clf', sentiment_lg)
])
stack = StackingClassifier(
    estimators=[
        ('bow', pipe_bow),
        ('sentiment', pipe_sentiment),
    ],
    final_estimator=clf_meta
)

stack.fit(X_train_all, y_train)
stack.score(X_test_all, y_test)
0.798

We can examine the contributions of the two sub-models, i.e., the bag-of-words model and the sentiment-based model, by looking at the coefficients of the final estimator of the stacking model.

## Contribution of the two sub-models
stack.final_estimator_.coef_[0]
array([10.52045605,  3.92126497])

We can also further examine the feature importance of the sentiment-based model by looking at the coefficients of the ten sentiment/emotion types in the sentiment-based logistic regression model.

## SENTIMENT model coefficients
for score, emotion in zip(stack.estimators_[1]['clf'].coef_[0], X_train_all.columns[6625:6635]):
    print(emotion, ":", score)
FEAR : 0.43024616785724173
ANGER : -0.70332154214246
ANTICIPATION : 0.6882554148089459
TRUST : 1.7007132512139056
SURPRISE : 0.6816809764302725
POSITIVE : 3.112328247221864
NEGATIVE : -4.970771280439475
SADNESS : -1.8391338529657337
DISGUST : -1.9377960708637092
JOY : 2.8382044480406603