Ensemble Learning#

This section introduces a hybrid method in machine learning: ensemble learning.

If you aggregate the predictions of a group of predictors (models), you will often get better predictions than with the best individual predictor. This is often referred to as the wisdom of the crowd, i.e., two heads are better than one (or, as the Chinese proverb 三個臭皮匠,勝過一個諸葛亮 puts it, three cobblers together can outwit one Zhuge Liang).

In this section, we will briefly review the most popular ensemble methods:

  • Voting classifiers

  • Bagging and pasting ensembles

  • Random Forests

  • Boosting

  • Stacking

What is Ensemble Learning?#

Ensemble learning is a machine learning technique that combines multiple individual models to improve the overall performance and accuracy of predictions. It leverages the wisdom of crowds by aggregating the predictions of multiple models rather than relying on a single model.
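
The benefit of aggregation can be illustrated with a little arithmetic. The sketch below computes how often a majority vote is correct, under the strong assumption that every voter is correct with the same probability and errs independently of the others. Real classifiers trained on the same data are correlated, so the actual gain is usually smaller. The function name majority_accuracy is made up for this illustration.

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that more than half of n independent voters are correct,
    when each voter is correct with probability p (n is assumed to be odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Each voter is right 60% of the time; the ensemble grows from 1 to 101 voters
for n in (1, 3, 11, 101):
    print(n, round(majority_accuracy(0.6, n), 3))
```

With three such voters the majority is already correct 64.8% of the time, and the figure keeps rising as more independent voters are added.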


## The corpus must be available locally; run nltk.download('movie_reviews') once if it is not
import nltk, random
from nltk.corpus import movie_reviews
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

## Label Encoder
le = LabelEncoder()

X_texts = [' '.join(words) for (words, label) in documents]
y_labels = [label for (words, label) in documents]

## Fit the encoder and transform the labels ('neg' -> 0, 'pos' -> 1)
y = le.fit_transform(y_labels)

## Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_texts, y, random_state=42)

## Text Vectorization
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tfidf_vec = TfidfVectorizer(min_df = 10, token_pattern = r'[a-zA-Z]+')

## Fit on training set
X_train_bow = tfidf_vec.fit_transform(X_train) # fit train

## Transform the testing set
X_test_bow = tfidf_vec.transform(X_test) # transform test

Voting Classifier#

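  • A voting classifier combines the predictions of multiple base classifiers and predicts the class that receives the most votes (the mode); this is called hard voting. With soft voting, it instead averages the predicted class probabilities and picks the most probable class. For regression tasks, scikit-learn provides the analogous VotingRegressor, which averages the base models' predictions.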

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', DecisionTreeClassifier(random_state=42)),  # a single decision tree, despite the 'rf' key
        ('svc', SVC(random_state=42))
    ]
)

voting_clf.fit(X_train_bow, y_train)
VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),
                             ('rf', DecisionTreeClassifier(random_state=42)),
                             ('svc', SVC(random_state=42))])
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test_bow, y_test))
voting_clf.predict(X_test_bow[4]) ## ensemble prediction (majority vote)
[clf.predict(X_test_bow[4]) for clf in voting_clf.estimators_] ## individual predictions
voting_clf.score(X_test_bow, y_test)
## Soft voting

voting_clf.voting = 'soft'
voting_clf.named_estimators['svc'].probability = True
voting_clf.fit(X_train_bow, y_train)
voting_clf.score(X_test_bow, y_test)
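
To make the difference between hard and soft voting concrete, here is a minimal sketch in plain Python. It is not how scikit-learn implements VotingClassifier; the helper names hard_vote and soft_vote and the probability values are invented for illustration.

```python
from collections import Counter

def hard_vote(predicted_labels):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]

def soft_vote(class_probabilities):
    """Average the per-class probabilities and return the most probable class index."""
    n_models = len(class_probabilities)
    n_classes = len(class_probabilities[0])
    averaged = [sum(p[c] for p in class_probabilities) / n_models
                for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: averaged[c])

# Hypothetical outputs of three classifiers for one review (0 = neg, 1 = pos)
probas = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]
labels = [max(range(2), key=lambda c: p[c]) for p in probas]

hard = hard_vote(labels)   # two of the three classifiers vote for class 0
soft = soft_vote(probas)   # the averaged probabilities favour class 1
print(hard, soft)
```

The two schemes disagree here because soft voting gives more weight to the one classifier that is very confident, while hard voting counts each classifier equally.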

Bagging and Pasting#

This approach uses the same training algorithm for every predictor but trains each predictor on a different random subset of the training set.

  • When the sampling is performed with replacement, it is called bagging (short for bootstrap aggregating).

  • When the sampling is performed without replacement, it is called pasting.

Both methods aim to reduce variance and improve generalization by averaging the predictions of individual models.
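
The only difference between the two sampling schemes is whether an instance can be drawn more than once. A minimal sketch using Python's standard library (the sample sizes are arbitrary and chosen only for illustration):

```python
import random

rng = random.Random(42)
indices = list(range(10))   # positions of ten hypothetical training instances

# Bagging: sample WITH replacement, so an index may appear more than once
bagging_sample = rng.choices(indices, k=6)

# Pasting: sample WITHOUT replacement, so every index appears at most once
pasting_sample = rng.sample(indices, k=6)

print(sorted(bagging_sample))
print(sorted(pasting_sample))
```

In BaggingClassifier, this choice is controlled by the bootstrap parameter (True for bagging, False for pasting).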

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators = 500,
                            max_samples=1000, n_jobs=-1, random_state=42)
bag_clf.fit(X_train_bow, y_train)
BaggingClassifier(estimator=DecisionTreeClassifier(), max_samples=1000,
                  n_estimators=500, n_jobs=-1, random_state=42)
bag_clf.score(X_test_bow, y_test)
## another method to evaluate
from sklearn.metrics import accuracy_score
accuracy_score(y_test, bag_clf.predict(X_test_bow))
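
  • Out-of-bag evaluation: with bagging, each predictor never sees some of the training instances (its out-of-bag instances), so they can be used for evaluation without a separate validation set. Set oob_score=True and read the result from the oob_score_ attribute after fitting.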

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators = 500,
                            oob_score=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train_bow, y_train)
BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=500,
                  n_jobs=-1, oob_score=True, random_state=42)
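bag_clf.oob_score_  # accuracy estimated from the out-of-bag instances
bag_clf.score(X_test_bow, y_test)  # test-set accuracy, for comparison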
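
Out-of-bag evaluation works because a bootstrap sample of size n leaves out a sizeable share of the training set, roughly 37% for large n. The sketch below checks this by simulation and by formula; the training-set size of 1500 is an assumption made for the example.

```python
import math
import random

n = 1500                      # assumed training-set size
rng = random.Random(42)

# Draw one bootstrap sample: n indices sampled with replacement
bootstrap = [rng.randrange(n) for _ in range(n)]

# Instances that were never drawn are "out of bag" for this predictor
oob = set(range(n)) - set(bootstrap)
oob_fraction = len(oob) / n

# Probability that a given instance is never drawn in n tries
expected = (1 - 1 / n) ** n   # approaches exp(-1) as n grows

print(round(oob_fraction, 3), round(expected, 3), round(math.exp(-1), 3))
```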

  • Random patches method: sample the training instances and the features at the same time (e.g., max_samples below the training-set size together with bootstrap_features=True and/or max_features below 1.0).

  • Pasting method: setting bootstrap=False samples the training instances without replacement, i.e., it uses pasting.

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators = 500,
                            max_samples=1000, n_jobs=-1, random_state=42,
                            bootstrap_features=True, max_features=0.8)
bag_clf.fit(X_train_bow, y_train)
BaggingClassifier(bootstrap_features=True, estimator=DecisionTreeClassifier(),
                  max_features=0.8, max_samples=1000, n_estimators=500,
                  n_jobs=-1, random_state=42)
bag_clf.score(X_test_bow, y_test)

Random Forests#

A random forest is an ensemble of decision trees, usually trained via the bagging method, typically with max_samples set to the size of the training set.

In other words, a random forest builds multiple decision trees during training. Each tree is trained on a random sample of the training data and considers only a random subset of the features when it splits a node. The final prediction is obtained by averaging the predictions of all individual trees (for regression) or taking a majority vote (for classification).
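
As a rough illustration of the feature subsampling, the sketch below draws the candidate features for one split when max_features="sqrt". The feature count of 6625 is the TF-IDF vocabulary size used later in this section; yours may differ. This only mimics the idea and is not scikit-learn's internal code.

```python
import math
import random

n_features = 6625                             # assumed vocabulary size
k = max(1, int(math.sqrt(n_features)))        # features considered per split

rng = random.Random(42)
candidate_features = rng.sample(range(n_features), k)  # drawn without replacement

print(k, sorted(candidate_features)[:5])
```

Each split therefore looks at only a small fraction of the vocabulary, which makes the trees less correlated with one another.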

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
rnd_clf.fit(X_train_bow, y_train)

rnd_clf.score(X_test_bow, y_test)
## Random Forest Bagging Equivalent

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42)

bag_clf.fit(X_train_bow, y_train)
bag_clf.score(X_test_bow, y_test)
  • Feature Importance Analysis

for score, name in zip(rnd_clf.feature_importances_, tfidf_vec.get_feature_names_out()):
    if abs(score) > 0.001:
        print(round(score, 5), name)

Boosting#

Boosting is an ensemble technique that trains multiple weak learners sequentially, with each subsequent model focusing on the examples that previous models struggled with. It aims to improve the performance of the overall model by emphasizing the difficult-to-classify instances.

In other words, the classifiers are trained one after another, each trying to correct its predecessor. Two popular boosting methods are:

  • AdaBoost (Adaptive Boosting): It corrects its predecessor by paying more attention to the training instances that the predecessor misclassified (underfitted). Specifically, it increases the weights of those instances at every iteration.

  • Gradient Boosting: It corrects its predecessor by fitting the new predictor to the residual errors made by the previous predictor.
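
The AdaBoost weight update can be sketched in a few lines. This follows the common textbook formulation for a single round (weighted error, predictor weight alpha, then reweighting and normalizing); the function name adaboost_reweight and the toy labels are made up, and scikit-learn's AdaBoostClassifier may differ in details.

```python
import math

def adaboost_reweight(weights, y_true, y_pred, learning_rate=1.0):
    """One AdaBoost round: return the predictor weight and the new instance weights.
    Assumes the weighted error rate is strictly between 0 and 1."""
    missed = [yt != yp for yt, yp in zip(y_true, y_pred)]
    error = sum(w for w, m in zip(weights, missed) if m) / sum(weights)
    alpha = learning_rate * math.log((1 - error) / error)
    # Boost the weights of the misclassified instances only
    boosted = [w * math.exp(alpha) if m else w for w, m in zip(weights, missed)]
    total = sum(boosted)
    return alpha, [w / total for w in boosted]

weights = [0.2] * 5            # five instances, equal weights to start
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]       # the third instance is misclassified

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update, the single misclassified instance carries half of the total weight, so the next predictor is pushed to get it right.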

from sklearn.ensemble import AdaBoostClassifier

## AdaBoost
ada_clf = AdaBoostClassifier(
    learning_rate=0.5, random_state=42
)

ada_clf.fit(X_train_bow, y_train)

ada_clf.score(X_test_bow, y_test)
from sklearn.ensemble import GradientBoostingClassifier

gbrt = GradientBoostingClassifier(max_depth=3, n_estimators=500,
                                  learning_rate=0.05, n_iter_no_change=10)

gbrt.fit(X_train_bow, y_train)
gbrt.score(X_test_bow, y_test)
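
The residual-fitting idea behind gradient boosting is easiest to see in a regression setting with squared error, where the residuals are exactly what the next predictor should learn. The sketch below is a toy version with hand-written decision stumps on made-up data; it is not the classifier used above, and the helper names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-split regressor (threshold, left mean, right mean)
    that minimises the squared error on the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:          # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]         # toy inputs
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]         # toy targets
learning_rate = 0.5

preds = [0.0] * len(xs)
mse_history = []
for _ in range(5):
    # Each new stump is fitted to what the current ensemble still gets wrong
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

print([round(m, 3) for m in mse_history])
```

The training error shrinks with every added stump, and the learning rate controls how much of each correction is applied.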


Stacking#

Stacking trains a model to aggregate the predictions of different classifiers.

Instead of using a simple voting scheme, stacking trains a meta-model (also called a blender or final estimator) on the base models' predictions, so it learns how best to combine their outputs.

To keep the meta-model from training on predictions that the base models made for instances they had already seen, the training data is usually split into folds: each base model is trained on all but one fold, and its predictions for the held-out fold become the features of the meta-model. scikit-learn's StackingClassifier does this with cross-validation (five folds by default). Stacking often, though not always, outperforms the individual models and simple voting ensembles.
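
The fold-based procedure can be sketched in plain Python. The two base learners below (a nearest-centroid rule and a one-nearest-neighbour rule) and the tiny one-dimensional dataset are invented for illustration; the point is only how the out-of-fold predictions are collected into the meta-model's feature matrix.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, held_out_idx) pairs for k contiguous folds."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    for start, stop in zip(bounds, bounds[1:]):
        held_out = list(range(start, stop))
        train = [i for i in range(n) if not start <= i < stop]
        yield train, held_out

def fit_centroid(xs, ys):
    """Toy base learner: predict the class whose mean is closest."""
    means = {label: sum(x for x, y in zip(xs, ys) if y == label) / ys.count(label)
             for label in set(ys)}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def fit_nearest_neighbour(xs, ys):
    """Toy base learner: predict the label of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(x - pair[0]))[1]

xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]   # toy feature values
ys = [0, 1, 0, 1, 0, 1]               # toy labels
base_learners = [fit_centroid, fit_nearest_neighbour]

# One row per instance, one column per base learner
meta_features = [[None] * len(base_learners) for _ in xs]
for train, held_out in k_fold_indices(len(xs), k=3):
    train_x = [xs[i] for i in train]
    train_y = [ys[i] for i in train]
    for j, fit in enumerate(base_learners):
        model = fit(train_x, train_y)          # never sees the held-out fold
        for i in held_out:
            meta_features[i][j] = model(xs[i])

print(meta_features)
```

A meta-model would then be trained on meta_features and ys, and the base learners would be refitted on the full training set for use at prediction time.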

from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', DecisionTreeClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)  # final_estimator defaults to LogisticRegression

stacking_clf.fit(X_train_bow, y_train)
stacking_clf.score(X_test_bow, y_test)

Stacking (Different Feature Sets)#

It is also possible to create different classifiers based on different feature sets, and then create an ensemble model by stacking the individual classifiers.

In this part, we demonstrate how to create two classifiers for sentiment polarity analysis: one based on a simple bag of words, and the other based on the emotion scores from the NRC lexicon (via NRCLex). Then we adopt the stacking method to create an ensemble model.


Note: The NRCLex package installed with pip install NRCLex contained a small error that affected the sentiment analysis at the time of writing. I fixed the dict part of nrclex.py manually. Hopefully, it will be fixed upstream soon.

from nrclex import NRCLex
import numpy

# Create an instance of NRCLex using the input text
emotions = NRCLex(X_train[921]).affect_frequencies

print([e for e in emotions.values()])
# Print the emotions and their frequencies
for emotion, frequency in emotions.items():
    print(emotion, frequency)
fear 0.0
anger 0.25
anticipation 0.0
trust 0.0
surprise 0.0
positive 0.0
negative 0.5
sadness 0.0
disgust 0.25
joy 0.0

The idea is simple. We create two feature sets, one from TfidfVectorizer() and the other from NRCLex(), and then concatenate the two feature matrices column-wise into one.

import pandas as pd

## Column names: BOW vocabulary followed by the ten NRC emotion/sentiment labels
## (same order as the affect_frequencies output printed above)
feature_names = tfidf_vec.get_feature_names_out().tolist() + [
    'FEAR', 'ANGER', 'ANTICIPATION', 'TRUST', 'SURPRISE',
    'POSITIVE', 'NEGATIVE', 'SADNESS', 'DISGUST', 'JOY']

## Create BOW + SENTIMENT features for training set
X_train_emotions = numpy.array([list(NRCLex(x).affect_frequencies.values()) for x in X_train])
X_train_all = pd.DataFrame(numpy.concatenate((X_train_bow.toarray(), X_train_emotions), axis=1),
                           columns=feature_names)

## Create BOW + SENTIMENT features for testing set
X_test_emotions = numpy.array([list(NRCLex(x).affect_frequencies.values()) for x in X_test])
X_test_all = pd.DataFrame(numpy.concatenate((X_test_bow.toarray(), X_test_emotions), axis=1),
                          columns=feature_names)
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
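n_bow = X_train_bow.shape[1]  # number of TF-IDF features (6625 in the original run)
bow_cols = list(range(0, n_bow))
sentiment_cols = list(range(n_bow, n_bow + 10))  # the ten NRC emotion/sentiment columns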

bow_lg =  LogisticRegression(random_state=42)
sentiment_lg = LogisticRegression(random_state=42)
clf_meta = LogisticRegression(random_state=42)

pipe_bow = Pipeline([
    ('select', ColumnTransformer([('sel', 'passthrough', bow_cols)], remainder='drop')),  # remainder='drop' is the default, but I've included it for clarity
    ('clf', bow_lg)
])
pipe_sentiment = Pipeline([
    ('select', ColumnTransformer([('sel', 'passthrough', sentiment_cols)], remainder='drop')),
    ('clf', sentiment_lg)
])
stack = StackingClassifier(
    estimators=[
        ('bow', pipe_bow),
        ('sentiment', pipe_sentiment),
    ],
    final_estimator=clf_meta  # the meta-model defined above
)

stack.fit(X_train_all, y_train)
stack.score(X_test_all, y_test)

We can examine the contributions of the two sub-models, i.e., the bag-of-words model and the sentiment-based model, by looking at the coefficients of the final estimator of the stack model.

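## Contribution of the two sub-models
## (the original code for this step is missing; this line is an assumed reconstruction,
##  with one coefficient per sub-model in the order bow, sentiment)
print(stack.final_estimator_.coef_)

We can also further examine the feature importance of the sentiment-based model by looking at the coefficients of the ten sentiment/emotion types in the sentiment-based logistic regression model.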

## SENTIMENT model coefficients (the last ten columns are the emotion/sentiment features)
for score, emotion in zip(stack.estimators_[1]['clf'].coef_[0], X_train_all.columns[-10:]):
    print(emotion, ":", score)
FEAR : 0.43024616785724173
ANGER : -0.70332154214246
ANTICIPATION : 0.6882554148089459
TRUST : 1.7007132512139056
SURPRISE : 0.6816809764302725
POSITIVE : 3.112328247221864
NEGATIVE : -4.970771280439475
SADNESS : -1.8391338529657337
DISGUST : -1.9377960708637092
JOY : 2.8382044480406603