Background
We find that data points with the same label tend to have similar inputs. Therefore, we want to implement an algorithm that measures the similarity between inputs and uses it to predict an output.
Algorithm
Basic Thoughts
We can treat the number of features as the number of dimensions and place the data in a coordinate system. In this way, the distance between points clearly describes their similarity: it is reasonable to say that the closer two points are, the more similar they are. So, to predict the label of a specific point, we can measure the distances between it and all the other points and combine them into a similarity score.
However, this is not efficient: for each query point, we need to compute the distance to every other point, which requires $O(n)$ distance computations per query.
Improved Algorithm
Now, we can think of a question: do the points far away from the query point really affect its label?
The answer is no. Their contributions are too small compared to the points closer to it. So, we only need to find the $k$ nearest points to determine the label of a specific point. That's where the name of the algorithm, k-nearest neighbors (KNN), comes from.
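The idea so far can be sketched as a brute-force classifier: rank all training points by distance to the query, then take a majority vote among the k nearest. This is a minimal sketch; the function and variable names are illustrative.

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Brute-force k-NN: sort all training points by Euclidean
    distance to the query, then majority-vote among the k nearest."""
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# toy 2-D data: two clusters labeled "a" and "b"
train = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train, labels, (0.5, 0.5), k=3))  # → "a"
```

Note that the sort makes each query cost $O(n \log n)$, which is exactly the inefficiency the KD-Tree below addresses.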
There is a data structure called a KD-Tree that makes this nearest-neighbor search efficient.
KD-Tree
- Construction
- The Dimension of Split: We split the points along the dimension with the greatest spread.
- Split Value: We split the points at their median along that dimension, so the two halves stay balanced.
- Stopping Time: We stop splitting when the number of remaining data points falls below a given threshold or the width of the region is small enough.
- Query
- We descend the branch of the tree closest to the query point.
- After reaching a leaf node, we compute the distances between the query point and the points stored in that leaf.
- We dynamically update the minimum distance found so far and compare it with the boundary information of each node. This lets us prune branches that cannot contain a closer point.
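The construction and query steps above can be sketched in a small recursive implementation. As an assumption for brevity, this sketch cycles through dimensions instead of picking the widest one, and stores one point per node; the pruning test matches the last bullet.

```python
import math

def build_kdtree(points, depth=0):
    """Median split on dimensions chosen cyclically (the text's
    widest-spread heuristic would also work)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # median split keeps the tree balanced
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, point) of the nearest stored point."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    axis, split = node["axis"], node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if query[axis] < split
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Prune: visit the far branch only if the splitting plane is
    # closer than the best distance found so far.
    if abs(query[axis] - split) < best[0]:
        best = nearest(far, query, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(nearest(tree, (9, 2)))  # → (1.414..., (8, 1))
```

In practice, a library implementation such as `scipy.spatial.KDTree` would be used instead of this sketch.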
Some Specifics
First, we can notice that some features will have larger values than others. To equalize the impact of all the features, we need to normalize the data.
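One way to do this is min-max normalization, which rescales each feature to $[0, 1]$; a minimal sketch (z-score standardization is a common alternative):

```python
def min_max_normalize(data):
    """Rescale each feature (column) to [0, 1] so features with
    large raw ranges do not dominate the distance."""
    cols = list(zip(*data))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0
              for v, l, h in zip(row, lo, hi))
        for row in data
    ]

# second feature is 1000x larger, but both carry the same information
data = [(1, 1000), (2, 2000), (3, 3000)]
print(min_max_normalize(data))  # → [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
```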
Second, there are many ways to measure the distance. We list some of them here:
- Chebyshev Distance: $d(x, y) = \max_i |x_i - y_i|$
- Euclidean Distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
- Bray-Curtis Distance: $d(x, y) = \frac{\sum_i |x_i - y_i|}{\sum_i |x_i + y_i|}$
- Canberra Distance: $d(x, y) = \sum_i \frac{|x_i - y_i|}{|x_i| + |y_i|}$
Third, we can use distance-weighted KNN, kernel functions, or other refinements.
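For example, in distance-weighted KNN each neighbor votes with weight $1/d$, so closer points count more. A minimal sketch under that assumption (a Gaussian kernel over $d$ would be another choice):

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, labels, query, k=3):
    """Distance-weighted k-NN: each of the k nearest neighbors
    votes with weight 1/d instead of one flat vote."""
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    scores = defaultdict(float)
    for i in order[:k]:
        d = math.dist(train[i], query)
        scores[labels[i]] += 1.0 / (d + 1e-9)  # guard against d == 0
    return max(scores, key=scores.get)

train = [(0, 0), (3, 3), (3.5, 3.5)]
labels = ["a", "b", "b"]
# a plain majority vote with k=3 would say "b" (two b's among three),
# but the weighted vote favors the much closer "a" point
print(weighted_knn_predict(train, labels, (0.5, 0.5), k=3))  # → "a"
```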