Background
We want to understand the relation between two events: when one event occurs, what is the probability that another occurs? For this, we introduce Bayesian learning.
Bayesian Theorem
- Hypotheses are mutually exclusive.
- The hypothesis space $H$ is fully exhaustive.
Bayes' theorem states $P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$. Here $D$ stands for one sample from the set of all possible data. $P(D)$ does not depend on the hypothesis $h$, so it can be ignored when only comparing different hypotheses.
$P(D \mid h)$ is called the likelihood; its logarithm $\log P(D \mid h)$ is the log likelihood, which is often maximized instead for numerical convenience.
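As a minimal numeric sketch of the theorem above (the two hypotheses, prior, and likelihood values are invented for illustration):

```python
# Prior probabilities P(h) over a mutually exclusive, exhaustive hypothesis space
prior = {"h1": 0.01, "h2": 0.99}

# Likelihoods P(D | h): probability of observing the data D under each hypothesis
likelihood = {"h1": 0.9, "h2": 0.05}

# Evidence P(D) = sum_h P(D | h) P(h), the normalizing constant
evidence = sum(likelihood[h] * prior[h] for h in prior)

# Posterior P(h | D) = P(D | h) P(h) / P(D)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

print(posterior)  # the posteriors sum to 1
```

Note that the ranking of hypotheses is already fixed by the numerators $P(D \mid h)P(h)$; dividing by the evidence only rescales them to sum to 1.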
Maximum A Posteriori (MAP)
$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$. From the equation, we can see that $P(D)$ drops out, since it is the same for every hypothesis.
Maximum Likelihood (ML)
$h_{ML} = \arg\max_{h \in H} P(D \mid h)$. If we are completely unaware of the prior distribution, or we know that all hypotheses are equally probable, then MAP is equivalent to ML.
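The equivalence under a uniform prior can be checked directly. A small sketch (the coin-bias hypotheses and the flip sequence are invented for illustration):

```python
def likelihood(p_heads, flips):
    """P(D | h): probability of the flip sequence given bias p_heads."""
    out = 1.0
    for f in flips:
        out *= p_heads if f == "H" else (1 - p_heads)
    return out

flips = ["H", "H", "T", "H"]
hypotheses = [0.2, 0.5, 0.8]

# ML: argmax_h P(D | h)
h_ml = max(hypotheses, key=lambda h: likelihood(h, flips))

# MAP with a uniform prior: argmax_h P(D | h) P(h), where P(h) = 1/|H|
uniform_prior = {h: 1 / len(hypotheses) for h in hypotheses}
h_map = max(hypotheses, key=lambda h: likelihood(h, flips) * uniform_prior[h])

print(h_ml, h_map)  # identical, because the prior is constant in h
```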
The relation between ML and Least Square Estimation (LSE)
Let’s express the training set as $D = \{(x_i, d_i)\}_{i=1}^{m}$.
In LSE, we want to find a function $f$ that minimizes $\sum_{i=1}^{m} (d_i - f(x_i))^2$.
We assume that each observation is the true value corrupted by i.i.d. Gaussian noise: $d_i = f(x_i) + e_i$, with $e_i \sim N(0, \sigma^2)$.
So, we can get:
$f_{ML} = \arg\max_f \prod_{i=1}^{m} p(d_i \mid f) = \arg\max_f \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(d_i - f(x_i))^2}{2\sigma^2}\right) = \arg\min_f \sum_{i=1}^{m} (d_i - f(x_i))^2$
We can see that, under the Gaussian noise assumption, the ML estimate coincides with the least-squares estimate.
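The equivalence can be verified numerically. A sketch under the assumptions above (the data points, noise level, and candidate slopes are invented; the model is $f(x) = w x$):

```python
import math

xs = [0.0, 1.0, 2.0, 3.0]
ds = [0.1, 1.9, 4.2, 5.8]   # noisy observations of roughly d = 2x
sigma = 1.0                 # assumed Gaussian noise standard deviation

def sse(w):
    """Sum of squared errors for f(x) = w * x."""
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

def log_likelihood(w):
    """Gaussian log likelihood of the data under f(x) = w * x."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (d - w * x) ** 2 / (2 * sigma**2)
               for x, d in zip(xs, ds))

candidates = [1.5, 1.8, 2.0, 2.2]
w_lse = min(candidates, key=sse)            # least squares
w_ml = max(candidates, key=log_likelihood)  # maximum likelihood

print(w_lse, w_ml)  # both criteria pick the same w
```

The log likelihood is a constant minus the sum of squared errors scaled by $1/(2\sigma^2)$, so maximizing one is minimizing the other.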
Naive Bayes
Naive Bayesian Assumption
We assume that each attribute is conditionally independent of the others given the class: $P(a_1, a_2, \dots, a_n \mid v) = \prod_{i=1}^{n} P(a_i \mid v)$.
Naive Bayesian Classifier
If the attributes in the MAP rule also satisfy this independence assumption, then the MAP classifier reduces to the naive Bayes classifier: $v_{NB} = \arg\max_{v \in V} P(v) \prod_{i=1}^{n} P(a_i \mid v)$.
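The classifier above can be sketched in a few lines. A minimal version over categorical attributes, using raw empirical counts (the tiny weather-style dataset is invented for illustration; a real implementation would add smoothing for unseen attribute values):

```python
from collections import Counter

# (attributes, class) training pairs
data = [
    (("sunny", "hot"), "no"),
    (("sunny", "cool"), "yes"),
    (("rainy", "cool"), "yes"),
    (("rainy", "hot"), "no"),
    (("sunny", "cool"), "yes"),
]

class_counts = Counter(label for _, label in data)
# attr_counts[(i, value, label)] = number of examples with attribute i = value and that label
attr_counts = Counter()
for attrs, label in data:
    for i, a in enumerate(attrs):
        attr_counts[(i, a, label)] += 1

def classify(attrs):
    """v_NB = argmax_v P(v) * prod_i P(a_i | v), with empirical estimates."""
    n = len(data)
    best, best_score = None, -1.0
    for v, cnt in class_counts.items():
        score = cnt / n                                  # P(v)
        for i, a in enumerate(attrs):
            score *= attr_counts[(i, a, v)] / cnt        # P(a_i | v)
        if score > best_score:
            best, best_score = v, score
    return best

print(classify(("sunny", "cool")))
```

In practice the product of many small probabilities underflows, which is one reason the log likelihood mentioned earlier is used: the product becomes a sum of logs.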