A non-parametric model is a statistical model that makes no assumptions about the underlying data distribution: it does not require specifying a functional form for the relationships between variables, and instead learns directly from the data points without pre-defined parameters.
\(K\)-Nearest Neighbors (KNN) Algorithm
K-Nearest Neighbors (KNN) is one of the simplest yet most effective algorithms used in supervised learning for both classification and regression problems. It is a lazy learner: instead of fitting an explicit model during training, it memorizes the training dataset and makes predictions based on proximity in feature space.
We are given a set of data points \((\bar{x}_i,y_i)\) with \(\bar{x}_i\in \mathbb{R}^d\) and \(y_i\in \mathbb{R}\). The algorithm proceeds as follows (a compact code sketch follows the list):
1. Choose the number of neighbors \(K\)
2. Compute the distance between the new data point and all the training samples
3. Select the \(K\) nearest neighbors based on distance.
4. For classification, the output is the most common class among the \(K\) neighbors.
5. For regression, the output is the average of the target values of the \(K\) neighbors.
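These five steps map almost directly onto code. Here is a compact sketch (an illustration only, separate from the fuller implementations later in the post; `X_train` and `y_train` are assumed to be NumPy arrays):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    # Step 2: distances from the new point to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the K nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: majority vote for classification, mean for regression
    if task == "classification":
        return Counter(y_train[nearest]).most_common(1)[0][0]
    return float(np.mean(y_train[nearest]))
```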
\(K\)-Nearest Neighbors Classification
The KNN classification algorithm can be summarized with the following steps:
Given:
\(X_{train} = [x_1, x_2, \ldots, x_n]\) (the training data features)
\(y_{train} = [y_1, y_2, \ldots, y_n]\) (the training data labels)
\(x_{test}\) (the new data point for which we want to predict the class)
Steps
1. Compute Distance: For each training point \(x_i\), calculate the distance \(d(x_i, x_{test})\) using a distance metric like Euclidean distance: \[
d(x_i, x_{test}) = \sqrt{\sum_{j=1}^{m} (x_{i,j} - x_{test,j})^2}
\] where \(m\) is the number of features.
2. Find K Nearest Neighbors: Sort the distances and pick the K closest points.
3. Majority Voting: Look at the labels \(y_i\) of the K nearest neighbors. The predicted label for \(x_{test}\) is the most frequent label (majority vote) among the neighbors.
For example, let’s say our data looks like this:

Training Data

| area | bedroom | bathroom | price   | condition |
|:----:|:-------:|:--------:|:-------:|:---------:|
| 7420 | 4       | 2        | 1300000 | 1         |
| 7520 | 3       | 3        | 1450000 | 1         |
| 6420 | 2       | 1        | 1110000 | 0         |
| 5423 | 3       | 2        | 1363400 | 0         |
| 5423 | 3       | 1        | 1263400 | 1         |
Test Data

| area | bedroom | bathroom | price   | condition |
|:----:|:-------:|:--------:|:-------:|:---------:|
| 5420 | 3       | 2.5      | 1302000 |           |
| 7120 | 5       | 4        | 1453000 |           |
For the data points \(x_i\) from the training set and the first test point \(x_t=[5420, 3, 2.5, 1302000]\), the Euclidean distances work out to approximately
\[
d_1 \approx 2828.43,\quad d_2 \approx 148014.90,\quad d_3 \approx 192002.60,\quad d_4 \approx 61400.00,\quad d_5 \approx 38600.00.
\]
Sorting these distances gives \(d_1<d_5<d_4<d_2<d_3\), so choosing \(K=3\) nearest neighbors keeps \(d_1\), \(d_5\), and \(d_4\), and
Data point \(x_1\) has class label condition\(=1\)
Data point \(x_5\) has class label condition\(=1\)
Data point \(x_4\) has class label condition\(=0\)
The majority class (2 out of 3) is condition\(=1\). Therefore, for this test point, the predicted label is also condition\(=1\).
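As a quick numerical check of these distances, here is a small NumPy sketch built from the tables above:

```python
import numpy as np

X = np.array([[7420, 4, 2.0, 1300000],
              [7520, 3, 3.0, 1450000],
              [6420, 2, 1.0, 1110000],
              [5423, 3, 2.0, 1363400],
              [5423, 3, 1.0, 1263400]])
xt = np.array([5420, 3, 2.5, 1302000])

distances = np.linalg.norm(X - xt, axis=1)   # d1, ..., d5
print(np.round(distances, 2))
print(np.argsort(distances)[:3])             # indices of the 3 nearest neighbors -> x1, x5, x4
```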
KNN Classifier Using Python
Here’s how to implement KNN for classification in Python from scratch:
```python
import numpy as np
import pandas as pd
from collections import Counter

class CustomKNNclassifier:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, Y):
        self.X = X
        self.Y = Y

    def predict(self, X):
        predictions = [self._predict(x) for x in X.to_numpy()]
        return np.array(predictions)

    def _predict(self, x):
        # Compute the Euclidean distances to every training point
        distances = [np.linalg.norm(x - x_train) for x_train in self.X.to_numpy()]
        # Get the indices of the k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]
        # Get the labels of the k nearest neighbors
        k_nearest_neighbors = [self.Y[i] for i in k_indices]
        # Return the most common label
        common_label = Counter(k_nearest_neighbors).most_common(1)[0][0]
        return common_label

# Example usage
train_data = pd.DataFrame(
    {'area': [7420, 7520, 6420, 5423, 5423],
     'bedroom': [4, 3, 2, 3, 3],
     'bathroom': [2, 3, 1, 2, 1],
     'price': [1300000, 1450000, 1110000, 1363400, 1263400],
     'condition': [1, 1, 0, 0, 1]}
)
test_data = pd.DataFrame(
    {'area': [5420, 7120],
     'bedroom': [3, 5],
     'bathroom': [2.5, 4],
     'price': [1302000, 1453000]}
)

X_train = train_data.drop('condition', axis=1)
y_train = train_data['condition']
X_test = test_data

# Initialize and "train" the KNN model (it only memorizes the data)
classifier = CustomKNNclassifier(k=3)
classifier.fit(X_train, y_train)

# Predict on test data
predictions = classifier.predict(X_test)
print(predictions)
```
[1 1]
So the complete test set would be

| area | bedroom | bathroom | price   | condition |
|:----:|:-------:|:--------:|:-------:|:---------:|
| 5420 | 3       | 2.5      | 1302000 | 1         |
| 7120 | 5       | 4        | 1453000 | 1         |
Note: We did not scale the data before applying the classifier; with scaling, the predictions might differ. In practice, the features should be scaled before applying KNN, because features with large magnitudes (here, area and especially price) dominate the distance computation and can distort the neighbor ordering.
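As a quick check, one could standardize the features and rerun the same classifier. Here is a minimal sketch that reuses `X_train`, `y_train`, `X_test`, and `CustomKNNclassifier` from the block above (the `_sc` variable names are introduced just for this snippet):

```python
from sklearn.preprocessing import StandardScaler

# Standardize features so that area and price no longer dominate the distances
scaler = StandardScaler()
X_train_sc = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_sc = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

clf_scaled = CustomKNNclassifier(k=3)
clf_scaled.fit(X_train_sc, y_train)
print(clf_scaled.predict(X_test_sc))  # the neighbor ordering, and hence the labels, may change
```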
\(K\)-Nearest Neighbors Regression
KNN regression is slightly different from classification. Instead of taking a majority vote, we predict the output by averaging the values of the K nearest neighbors.
Given:
\(X_{train} = [x_1, x_2, \ldots, x_n]\) (the training data features)
\(y_{train} = [y_1, y_2, \ldots, y_n]\) (the continuous target values)
\(x_{test}\) (the new data point for which we want to predict the value)
Step-by-Step:
1. Compute Distance: Calculate the Euclidean distance between \(x_{test}\) and each training point \(x_i\).
2. Find K Nearest Neighbors: Sort the distances and select the \(K\) nearest points.
3. Averaging: The predicted value for \(x_{test}\) is the average of the target values \(y_i\) of the \(K\) nearest neighbors: \[
\hat{y}_{test} = \frac{1}{K} \sum_{i=1}^{K} y_i
\]
For the regression example we reuse the same training data, but this time price is the target, condition is treated as a feature, and the features are standardized before computing distances (as in the code below). Under the scaled features, the three smallest distances for the first test point belong to \(x_4\), \(x_5\), and \(x_2\):
For data point \(x_4\), the price\(=1363400\)
For data point \(x_5\), the price\(=1263400\)
For data point \(x_2\), the price\(=1450000\)
So the predicted price is the average of these three prices; for the test point \(x_t=[5420, 3, 2.5, 1]\) we expect \[
price = \frac{1363400+1263400+1450000}{3}\approx 1358933.33
\]
KNN Regressor Using Python
Here’s how to implement KNN for regression in Python from scratch, and we can check whether it reproduces the hand calculation:
```python
from sklearn.preprocessing import StandardScaler

class CustomKNNRegressor:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train.to_numpy()

    def predict(self, X_test):
        predictions = [self._predict(x) for x in X_test]
        return np.array(predictions)

    def _predict(self, x):
        # Distances to all training points
        distances = [np.linalg.norm(x - x_train) for x_train in self.X_train]
        # Indices of the k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]
        # Average the target values of the k nearest neighbors
        k_nearest_values = [self.y_train[i] for i in k_indices]
        return np.mean(k_nearest_values)

X_train = train_data.drop('price', axis=1)
y_train = train_data['price']
test_data = pd.DataFrame(
    {'area': [5420, 7120],
     'bedroom': [3, 5],
     'bathroom': [2.5, 4],
     'condition': [1, 1]}
)
X_test = test_data

# Standardize the features before computing distances
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

# Initialize and train the KNN regressor
regressor = CustomKNNRegressor(k=3)
regressor.fit(X_train_sc, y_train)

# Predict on test data
predictions = regressor.predict(X_test_sc)
print(np.round(predictions, 2))
```
[1358933.33 1371133.33]
Choosing the Value of K
The value of K significantly affects the performance of the KNN algorithm:
Small K: If K is too small, the model is sensitive to noise, and the predictions can be unstable.
Large K: If K is too large, the model becomes more biased, and the predictions may be overly smoothed.
A typical way to choose K is by trying different values and using cross-validation to see which value yields the best performance.
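For example, a minimal sketch using scikit-learn's `cross_val_score` (here `X` and `y` are placeholders for any scaled feature matrix and label vector):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def pick_k(X, y, candidates=(1, 3, 5, 7, 9, 11), cv=5):
    """Return the candidate K with the highest mean cross-validated accuracy."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv).mean()
              for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores
```

For regression, one can swap in `KNeighborsRegressor` and a scoring argument such as `scoring='neg_mean_squared_error'`.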
Distance Metrics
The default metric for KNN is Euclidean distance, but depending on the dataset, other metrics like Manhattan distance or Minkowski distance might be more suitable.
Euclidean distance (L2 norm): \[ d(x_i, x_j) = \sqrt{\sum_{k=1}^{m} (x_{i,k} - x_{j,k})^2} \]
Manhattan distance (L1 norm): \[ d(x_i, x_j) = \sum_{k=1}^{m} |x_{i,k} - x_{j,k}| \]
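As a small illustration of the two metrics (the two points below are arbitrary, made-up examples):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.5])

euclidean = np.linalg.norm(a - b)         # L2 norm
manhattan = np.linalg.norm(a - b, ord=1)  # L1 norm
print(round(euclidean, 3), round(manhattan, 3))
# In scikit-learn, e.g. KNeighborsRegressor(n_neighbors=5, p=1) switches to Manhattan distance.
```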
KNN Implementation
In this section we apply KNN regression to the Boston Housing dataset and use KFold cross-validation to find the optimal \(K\). We first load the data with `pd.read_csv('HousingData.csv')` and check for missing values with `df.isnull().sum()`:

| Column  | Missing values |
|:--------|---------------:|
| CRIM    | 20 |
| ZN      | 20 |
| INDUS   | 20 |
| CHAS    | 20 |
| NOX     | 0  |
| RM      | 0  |
| AGE     | 20 |
| DIS     | 0  |
| RAD     | 0  |
| TAX     | 0  |
| PTRATIO | 0  |
| B       | 0  |
| LSTAT   | 20 |
| MEDV    | 0  |

Dropping the columns that contain missing values with `df.dropna(axis=1, inplace=True)` and inspecting `df.head()` gives:
|   | NOX   | RM    | DIS    | RAD | TAX | PTRATIO | B      | MEDV |
|---|-------|-------|--------|-----|-----|---------|--------|------|
| 0 | 0.538 | 6.575 | 4.0900 | 1   | 296 | 15.3    | 396.90 | 24.0 |
| 1 | 0.469 | 6.421 | 4.9671 | 2   | 242 | 17.8    | 396.90 | 21.6 |
| 2 | 0.469 | 7.185 | 4.9671 | 2   | 242 | 17.8    | 392.83 | 34.7 |
| 3 | 0.458 | 6.998 | 6.0622 | 3   | 222 | 18.7    | 394.63 | 33.4 |
| 4 | 0.458 | 7.147 | 6.0622 | 3   | 222 | 18.7    | 396.90 | 36.2 |
The data now looks clean and ready for the KNN regressor. Note that predictive modeling normally involves much more, such as exploratory data analysis (EDA), feature engineering, and other preprocessing. Here, however, we simply apply the `CustomKNNRegressor` built from scratch, along with the built-in `KNeighborsRegressor` from scikit-learn, to explore the algorithm and find the optimal \(K\).
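A condensed sketch of that comparison is shown below; it assumes the cleaned `df`, the `CustomKNNRegressor` class, and the imports from the earlier code blocks, and it follows the post's setup of candidate values \(K\in\{5,15,30,40\}\) with a 7-fold split, printing mean hold-out MSEs instead of plotting them:

```python
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

X = df.drop('MEDV', axis=1)
y = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

k_values = [5, 15, 30, 40]
kfold = KFold(n_splits=7, shuffle=True, random_state=123)

for k in k_values:
    custom_mse, sklearn_mse = [], []
    for train_idx, holdout_idx in kfold.split(X_train_sc):
        X_tr, X_ho = X_train_sc[train_idx], X_train_sc[holdout_idx]
        y_tr, y_ho = y_train.iloc[train_idx], y_train.iloc[holdout_idx]

        # Custom implementation
        custom = CustomKNNRegressor(k=k)
        custom.fit(X_tr, y_tr)
        custom_mse.append(mean_squared_error(y_ho, custom.predict(X_ho)))

        # scikit-learn implementation
        sk = KNeighborsRegressor(n_neighbors=k)
        sk.fit(X_tr, y_tr)
        sklearn_mse.append(mean_squared_error(y_ho, sk.predict(X_ho)))

    print(f"K={k}: mean hold-out MSE custom={np.mean(custom_mse):.2f}, sklearn={np.mean(sklearn_mse):.2f}")

# Refit with the chosen K on the full training split and evaluate on the test set
regressor = CustomKNNRegressor(k=5)
regressor.fit(X_train_sc, y_train)
predictions = regressor.predict(X_test_sc)
print('MSE =', round(mean_squared_error(y_test, predictions), 2),
      'R-squared =', round(r2_score(y_test, predictions), 2))
```

In the original analysis, \(K=5\) gave the lowest mean hold-out MSE for both implementations, so the final lines refit with \(K=5\) and report the test-set MSE and \(R^2\).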
Conclusion
K-Nearest Neighbors is a simple, intuitive algorithm that can be highly effective in both classification and regression problems. Its simplicity comes from the fact that it doesn’t make any assumptions about the underlying data distribution (it’s non-parametric). However, its performance can be sensitive to the choice of K and the distance metric.
Although it’s easy to implement, KNN can become computationally expensive for large datasets, as it requires calculating distances between the test point and all training samples.
If you need an efficient version, it’s always possible to use optimized libraries like scikit-learn, but writing the algorithm from scratch helps build a solid understanding.
When to Use KNN Over Linear Regression?
We would consider using KNN regression over linear regression in the following situations:
Non-linear relationships: When the data shows non-linear patterns or complex relationships between features and target variables that cannot be captured by a straight line (see the sketch after this list).
Local behavior: When data has local patterns or clusters, and you believe that predictions should rely on the nearest data points.
Minimal assumptions: If you do not want to assume a specific relationship between the features and target, KNN’s non-parametric nature might be more appropriate.
Smaller datasets: KNN works well with smaller datasets and lower-dimensional data where calculating distances is feasible and efficient.
However, KNN becomes less efficient and struggles in high dimensions or when the dataset is large. In those cases, linear regression or other more scalable models may be more appropriate.
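To make the non-linearity point concrete, here is a toy sketch on synthetic data (the sine-shaped dataset is made up for illustration, not taken from the examples above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# Synthetic, strongly non-linear data: a noisy sine wave
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

lin = LinearRegression().fit(X, y)
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

# In-sample fit quality: a straight line cannot follow the curve, local averaging can
print("Linear regression R^2:", round(r2_score(y, lin.predict(X)), 3))
print("KNN regression    R^2:", round(r2_score(y, knn.predict(X)), 3))
```

KNN’s local averaging tracks the curvature that a single global line cannot.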
---title: "K Nearest Neighbors Regression"date: "2024-08-29"author: Rafiq Islamcategories: [Data Science, Machine Learning, Artificial Intelligence]citation: trueimage: knn.jpegsearch: truelightbox: truelisting: contents: "/../../dsandml" max-items: 3 type: grid categories: true date-format: full fields: [image, date, title, author, reading-time]format: html: default ipynb: default docx: toc: true adsense: enable-ads: false epub: toc: true adsense: enable-ads: false pdf: toc: true pdf-engine: pdflatex adsense: enable-ads: false number-sections: false colorlinks: true cite-method: biblatextoc-depth: 4---## Introduction: Non-parametric Models <p style="text-align: justify"> Non-parametric model is a statistical model that does not make any assumptions about the underlying data distributions, meaning it does not require specifying functional form for the relationships between variables, instead learning directly from the data points without pre-defined parameters. </p>## $K-$Nearest Neighbors (KNN) Algorithm <p style="text-align:justify">K-Nearest Neighbors (KNN) is one of the simplest yet effective algorithms used in supervised learning for both classification and regression problems. It's a **lazy learner**—meaning it does not perform any specific training of a model but memorizes the training dataset and makes predictions based on proximity in feature space. </p>We are given a set of data points $(\bar{x}_i,y_i)$ with $\bar{x}_i\in \mathbb{R}^d$ and $y_i\in \mathbb{R}$ 1. Choose the number of neighbors $K$ 2. Compute the distance between the new data point and all the training samples 3. Select the $K$ nearest neighbors based on distance. 4. For **classification**, the output is the most common class among the $K$ neighbors. 5. For **regression**, the output is the average of the target values of $K$ neighbors### $K-$Nearest Neighbors Classification The KNN classification algorithm can be summarized with the following steps:Given: - $X_{train} = [x_1, x_2, \ldots, x_n]$ (the training data features) - $y_{train} = [y_1, y_2, \ldots, y_n]$ (the training data labels) - $x_{test}$ (the new data point for which we want to predict the class)**Steps** **1. Compute Distance**: For each training point $x_i$, calculate the distance $d(x_i, x_{test})$ using a distance metric like **Euclidean distance**: $$ d(x_i, x_{test}) = \sqrt{\sum_{j=1}^{m} (x_{i,j} - x_{test,j})^2} $$ where $m$ is the number of features. **2. Find K Nearest Neighbors**: Sort the distances and pick the **K** closest points. **3. Majority Voting**: Look at the labels $y_i$ of the **K** nearest neighbors. The predicted label for $x_{test}$ is the most frequent label (majority vote) among the neighbors. 
For example, let's say our data looks like this ::: {#tb1-panel layout-ncol=2}| area | bedroom | bathroom | price | condition | |:------:|:---------:|:----------:|:---------:|:-----------:| | 7420 | 4 | 2 | 1300000 | 1 | | 7520 | 3 | 3 | 1450000 | 1 | | 6420 | 2 | 1 | 1110000 | 0 | | 5423 | 3 | 2 | 1363400 | 0 | | 5423 | 3 | 1 | 1263400 | 1 | : Training Data | area | bedroom | bathroom | price | condition | |:------:|:---------:|:----------:|:---------:|:-----------:| | 5420 | 3 | 2.5 | 1302000 | | | 7120 | 5 | 4 | 1453000 | | : Test Data ::: For the data points $x_i$ from the training set and a single test data point $xt=[5420,3,2.5,1302000]$\begin{align*} d(x_1, xt) & = \sqrt{(x_{11}-xt_1)^2 + (x_{12}-xt_2)^2 + (x_{13}-xt_3)^2 + (x_{14}-xt_4)^2}\\ & = \sqrt{(7420-5420)^2 + (4-5)^2 + (2-2.5)^2 + (1300000-1302000)^2} \approx 2828.43\\ d(x_2,xt) & = \sqrt{(x_{21}-xt_1)^2 + (x_{22}-xt_2)^2 + (x_{23}-xt_3)^2 + (x_{24}-xt_4)^2} \\ & = \sqrt{(7520-5420)^2 + (3-5)^2 + (3-2.5)^2 + (1450000-1302000)^2} \approx 14805.92\\ d(x_3,xt) & = \sqrt{(x_{31}-xt_1)^2 + (x_{32}-xt_2)^2 + (x_{33}-xt_3)^2 + (x_{34}-xt_4)^2} \\ & = \sqrt{(6420-5420)^2 + (2-5)^2 + (1-2.5)^2 + (1110000-1302000)^2} \approx 19209.38\\ d(x_4,xt) & = \sqrt{(x_{41}-xt_1)^2 + (x_{42}-xt_2)^2 + (x_{43}-xt_3)^2 + (x_{44}-xt_4)^2}\\ & = \sqrt{(6420-5420)^2 + (2-5)^2 + (1-2.5)^2 + (1110000-1302000)^2} \approx 19209.38\\ d(x_5,xt) & = \sqrt{(x_{51}-xt_1)^2 + (x_{52}-xt_2)^2 + (x_{53}-xt_3)^2 + (x_{54}-xt_4)^2} \\ & = \sqrt{(5423-5420)^2 + (3-5)^2 + (1-2.5)^2 + (1263400-1302000)^2} \approx 38602.95\end{align*}So the distances - $d_1=d(x_1, xt) \approx 2828.43$- $d_2=d(x_2, xt) \approx 14805.92$- $d_3=d(x_3, xt) \approx 19209.38$- $d_4=d(x_4, xt) \approx 61405.03$- $d_5=d(x_5, xt) \approx 38602.95$If we sort the above distances, we get $d_1<d_2<d_3<d_5<d_4$ and if we choose $K=3$ nearest neighbors, then $d_1<d_2<d_3$ and - Data point $x_1$ has class label `condition`$=1$ - Data point $x_2$ has class label `condition`$=1$ - Data point $x_3$ has class label `condition`$=0$ We can clearly see that the majority class (2 out of 3) is `condition`$=1$. 
Therefore, for the given test data, the label would be also `condition`$=1$.#### KNN Classifier Using Python Here’s how to implement KNN for classification in Python from scratch:```{python}#| code-fold: falseimport numpy as npimport pandas as pdfrom collections import Counterclass CustomKNNclassifier:def__init__(self, k=3):self.k = kdef fit(self, X, Y):self.X = Xself.Y = Ydef predict(self, X): predictions = [self._predict(x) for x in X.to_numpy()] return np.array(predictions)def _predict(self, x):# Compute the Euclidean distances distances = [np.linalg.norm(x - X_train) for X_train inself.X.to_numpy()]# Get the indices of the k nearest neighbors k_indices = np.argsort(distances)[:self.k]# Get the labels of k nearest neighbors k_nearest_neighbors = [self.Y[i] for i in k_indices]# Return the most common label common_label = Counter(k_nearest_neighbors).most_common(1)[0][0]return common_label# Example usagetrain_data = pd.DataFrame( {'area': [7420, 7520, 6420, 5423, 5423],'bedroom': [4, 3, 2, 3, 3],'bathroom': [2, 3, 1, 2, 1],'price': [1300000, 1450000, 1110000, 1363400, 1263400],'condition': [1, 1, 0, 0, 1] })test_data = pd.DataFrame( {'area': [5420, 7120],'bedroom': [3, 5],'bathroom': [2.5, 4],'price': [1302000, 1453000] })X_train = train_data.drop('condition', axis=1)y_train = train_data['condition']X_test = test_data# Initialize and train the KNN modelclassifier = CustomKNNclassifier(k=3)classifier.fit(X_train, y_train)# Predict on test datapredictions = classifier.predict(X_test)print(predictions)```So the complete test set would be | area | bedroom | bathroom | price | condition | |:------:|:---------:|:----------:|:---------:|:-----------:| | 5420 | 3 | 2.5 | 1302000 | 1 | | 7120 | 5 | 4 | 1453000 | 1 | [Note: We did not scale the data before applying the classifier. If we scaled, the result might have been different (?). In practice, we need to scale the data before applying KNN algorithm. Because computing a large number of distances with big numbers may get us wrong order and also time cosuming.]{style="color:red"}### $K-$Nearest Neighbors RegressionKNN regression is slightly different from classification. Instead of taking a majority vote, we predict the output by averaging the values of the **K** nearest neighbors.Given: - $X_{train} = [x_1, x_2, \ldots, x_n]$ (the training data features)- $y_{train} = [y_1, y_2, \ldots, y_n]$ (the continuous target values)- $x_{test}$ (the new data point for which we want to predict the value)**Step-by-Step:** **1. Compute Distance**: Calculate the Euclidean distance between $x_{test}$ and each training point $x_i$. **2. Find K Nearest Neighbors**: Sort the distances and select the **K** nearest points. **3. Averaging**: The predicted value for $x_{test}$ is the average of the target values $y_i$ of the **K** nearest neighbors: $$\hat{y}_{test} = \frac{1}{K} \sum_{i=1}^{K} y_i$$#### KNN Regressor Using PythonNow we use the same training data and test data for this regression. 
But this time, our target variable is `price` and test data looks like this | area | bedroom | bathroom | Condition | price | |:------:|:---------:|:----------:|:---------:|:-----------:| | 5420 | 3 | 2.5 | 1 | | | 7120 | 5 | 4 | 1 | | After scaling the data looks like this ::: {#tb1-panel layout-ncol=2}| area | bedroom | bathroom | condition | price ||---------|---------|----------|-----------|---------|| 1.213 | 1.414 | 0.267 | 0.730 | 1300000 || 1.336 | 0.000 | 1.603 | 0.730 | 1450000 || -0.026 | -1.414 | -1.336 | -1.095 | 1110000 || -1.261 | 0.000 | 0.267 | -1.095 | 1363400 || -1.261 | 0.000 | -1.336 | 0.730 | 1263400 |: Training Data | area | bedroom | bathroom | condition | price ||---------|---------|----------|-----------|-------|| -1.266 | 0.000 | 0.803 | 0.730 | || 0.854 | 2.828 | 3.876 | 0.730 | |: Test Data ::: Now we see that \begin{align*} d_1=d(x_1, x_t) & = \sqrt{(1.213 - (-1.266))^2 + (1.414 - 0)^2 + (0.267 - 0.803)^2 + (0.730 - 0.730)^2} \approx 2.904\\ d_2=d(x_2, x_t) & = \sqrt{(1.336 - (-1.266))^2 + (0.000 - 0)^2 + (1.603 - 0.803)^2 + (0.730 - 0.730)^2} \approx 2.721\\ d_3=d(x_3, x_t) & = \sqrt{(-0.026 - (-1.266))^2 + (-1.414 - 0)^2 + (-1.336 - 0.803)^2 + (-1.095 - 0.730)^2} \approx 3.382\\ d_4=d(x_4, x_t) & = \sqrt{(-1.261 - (-1.266))^2 + (0.000 - 0)^2 + (0.267 - 0.803)^2 + (-1.095 - 0.730)^2} \approx 1.902\\ d_5=d(x_5, x_t) & = \sqrt{(-1.261 - (-1.266))^2 + (0.000 - 0)^2 + (-1.336 - 0.803)^2 + (0.730 - 0.730)^2} \approx 2.140 \end{align*}But this time, the order is $d_4<d_5<d_2<d_1<d_3$ and for $k=3$ we have $d_4<d_5<d_2$. The price for this distances - For data point $x_4$, the `price`$=1363400$ - For data point $x_5$, the `price`$=1263400$ - For data point $x_2$, the `price`$=1450000$ So the predicted price should be the average of this three prices, that for $xt=[5420,3,2.5,1]$ the price we expect $$price = \frac{1363400+1263400+1450000}{3}=1358933.33$$ Here’s how to implement KNN for regression in Python from scratch and we see if we get the same as the hand calculation.```{python}#| code-fold: falsefrom sklearn.preprocessing import StandardScalerclass CustomKNNRegressor:def__init__(self, k=3):self.k = kdef fit(self, X_train, y_train):self.X_train = X_trainself.y_train = y_train.to_numpy()def predict(self, X_test): predictions = [self._predict(x) for x in X_test]return np.array(predictions)def _predict(self, x): distances = [np.linalg.norm(x-x_train) for x_train inself.X_train] k_indices = np.argsort(distances)[:self.k] k_nearest_values = [self.y_train[i] for i in k_indices]return np.mean(k_nearest_values)X_train = train_data.drop('price', axis=1)y_train = train_data['price']test_data = pd.DataFrame( {'area': [5420, 7120],'bedroom': [3, 5],'bathroom': [2.5, 4],'condition': [1, 1] })X_test = test_datascaler = StandardScaler()X_train_sc = scaler.fit_transform(X_train)X_test_sc = scaler.transform(X_test)# Initialize and train the KNN regressorregressor = CustomKNNRegressor(k=3)regressor.fit(X_train_sc, y_train)# Predict on test datapredictions = regressor.predict(X_test_sc)print(np.round(predictions,2))```---### Choosing the Value of **K**The value of **K** significantly affects the performance of the KNN algorithm: - **Small K**: If **K** is too small, the model is sensitive to noise, and the predictions can be unstable. 
- **Large K**: If **K** is too large, the model becomes more biased, and the predictions may be overly smoothed.A typical way to choose **K** is by trying different values and using cross-validation to see which value yields the best performance.---### Distance MetricsThe default metric for KNN is **Euclidean distance**, but depending on the dataset, other metrics like **Manhattan distance** or **Minkowski distance** might be more suitable.- **Euclidean Distance** (L2 Norm): $$ d(x_i, x_j) = \sqrt{\sum_{k=1}^{m} (x_{i,k} - x_{j,k})^2} $$- **Manhattan Distance** (L1 Norm): $$ d(x_i, x_j) = \sum_{k=1}^{m} |x_{i,k} - x_{j,k}| $$### KNN Implementation In this section we use KNN regression for `Boston Housing` dataset and find the optimal $K$ using the `KFold` cross-validation. ```{python}#| code-fold: falsedf = pd.read_csv('HousingData.csv')```Next we see if there is any missing values. If we have any, we will skip those observations. ```{python}#| code-fold: falseprint(df.isnull().sum())df.dropna(axis=1,inplace=True)df.head()```<p style="text-align:justify">The data looks clean and ready to implement to the KNNRegressor. Note that, for predictive modeling we need a lot of things, such as exporatory data analysis (EDA), feature engineering, preprocessing and others. However, we will simply apply the `KNNRegressor` that we built from scratch and built-in library function from `scikit-learn` to explore the algorithm and find the optimal $K$.</p> ```{python}#| code-fold: falsefrom sklearn.model_selection import KFold, train_test_splitfrom sklearn.metrics import mean_squared_error,r2_scoreimport matplotlib.pyplot as plt X = df.drop('MEDV',axis=1)y = df['MEDV']X_train, X_test, y_train, y_test = train_test_split( X,y, test_size=0.30, random_state=123)scaler = StandardScaler()X_train_sc = scaler.fit_transform(X_train)X_test_sc = scaler.transform(X_test)k_values = [5,15,30,40]kfold = KFold(n_splits=7, shuffle=True, random_state=123)mses = np.zeros((7,4))for i,(train_index,test_index) inenumerate(kfold.split(X_train_sc)): X_train_train = X_train_sc[train_index] X_train_holdout = X_train_sc[test_index] y_train_train = y_train.iloc[train_index] y_train_holdout = y_train.iloc[test_index]for j,k inenumerate(k_values): regressor1 = CustomKNNRegressor(k=k) regressor1.fit(X_train_train, y_train_train) preds = regressor1.predict(X_train_holdout) mses[i,j] = mean_squared_error(preds, y_train_holdout)plt.scatter(np.zeros(7),mses[:,0], s=60, c='white', edgecolors='black', label='Single Split')plt.scatter(np.ones(7),mses[:,1],s=60, c='white', edgecolors='black')plt.scatter(2*np.ones(7),mses[:,2],s=60, c='white', edgecolors='black')plt.scatter(3*np.ones(7),mses[:,3],s=60, c='white', edgecolors='black')plt.scatter([0,1,2,3], np.mean(mses, axis=0), s=60,c='r', marker='X', label='Mean')plt.legend(loc='upper right')plt.xticks([0,1,2,3],['K=5','K=15','K=30','K=40'])plt.ylabel('MSE')plt.gca().set_facecolor('#f4f4f4')plt.gcf().patch.set_facecolor('#f4f4f4')plt.show()```So, $K=5$ seems optimal based on our custom built regressor. 
Now if we do the same thing using the `scikit-learn` library ```{python}#| code-fold: falsefrom sklearn.neighbors import KNeighborsRegressormses = np.zeros((7,4))for i,(train_index,test_index) inenumerate(kfold.split(X_train_sc)): X_train_train = X_train_sc[train_index] X_train_holdout = X_train_sc[test_index] y_train_train = y_train.iloc[train_index] y_train_holdout = y_train.iloc[test_index]for j,k inenumerate(k_values): regressor2 = KNeighborsRegressor(k) regressor2.fit(X_train_train, y_train_train) preds = regressor2.predict(X_train_holdout) mses[i,j] = mean_squared_error(preds, y_train_holdout)plt.scatter(np.zeros(7),mses[:,0], s=60, c='white', edgecolors='black', label='Single Split')plt.scatter(np.ones(7),mses[:,1],s=60, c='white', edgecolors='black')plt.scatter(2*np.ones(7),mses[:,2],s=60, c='white', edgecolors='black')plt.scatter(3*np.ones(7),mses[:,3],s=60, c='white', edgecolors='black')plt.scatter([0,1,2,3], np.mean(mses, axis=0), s=60,c='r', marker='X', label='Mean')plt.legend(loc='upper right')plt.xticks([0,1,2,3],['K=5','K=15','K=30','K=40'])plt.ylabel('MSE')plt.gca().set_facecolor('#f4f4f4')plt.gcf().patch.set_facecolor('#f4f4f4')plt.show()```In both method, we got $K=5$ is the optimal number of neighbors for KNN regression. Let's apply this in our test dataset ```{python}#| code-fold: falseregressor = CustomKNNRegressor(k=5)regressor.fit(X_train_sc, y_train)predictions = regressor.predict(X_test_sc)mse = mean_squared_error(predictions,y_test)rsquared = r2_score(predictions,y_test)print('MSE = {}'.format(np.round(mse,2)),' and R-square = {}'.format(np.round(rsquared,2)))```### Conclusion<p style="text-align: justify">K-Nearest Neighbors is a simple, intuitive algorithm that can be highly effective in both classification and regression problems. Its simplicity comes from the fact that it doesn't make any assumptions about the underlying data distribution (it's non-parametric). However, its performance can be sensitive to the choice of **K** and the distance metric.<br><br>Although it's easy to implement, KNN can become computationally expensive for large datasets, as it requires calculating distances between the test point and all training samples.<br><br>If you need an efficient version, it's always possible to use optimized libraries like scikit-learn, but writing the algorithm from scratch helps build a solid understanding.</p>### When to Use KNN Over Linear Regression?We would consider using KNN regression over linear regression in the following situations: - **Non-linear relationships**: When the data shows non-linear patterns or complex relationships between features and target variables that cannot be captured by a straight line. - **Local behavior**: When data has local patterns or clusters, and you believe that predictions should rely on the nearest data points. - **Minimal assumptions**: If you do not want to assume a specific relationship between the features and target, KNN’s non-parametric nature might be more appropriate. - **Smaller datasets**: KNN works well with smaller datasets and lower-dimensional data where calculating distances is feasible and efficient.However, KNN becomes less efficient and struggles in high dimensions or when the dataset is large. In those cases, linear regression or other more scalable models may be more appropriate --- ## References 1. **KNN Regressor Overview:** - Géron, Aurélien. *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems*. O'Reilly Media, 2019. 
This book provides an in-depth explanation of KNN, including its behavior in non-linear data and high-dimensionality challenges. - Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006. This book covers non-parametric methods like KNN, highlighting the "curse of dimensionality" and distance-based approaches.2. **KNN vs. Linear Regression (Model Assumptions & Complexity of Data):** - Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*. Springer, 2009. This source discusses the assumptions behind linear regression and the flexibility of non-parametric models like KNN. - Kuhn, Max, and Johnson, Kjell. *Applied Predictive Modeling*. Springer, 2013. The comparison between parametric (like linear regression) and non-parametric models (like KNN) is elaborated in this book.3. **Interpretability:** - Molnar, Christoph. *Interpretable Machine Learning: A Guide for Making Black Box Models Explainable*. 2019. This book emphasizes the trade-offs between interpretable models like linear regression and more black-box models like KNN. - Murdoch, W. James, et al. "Definitions, methods, and applications in interpretable machine learning." *Proceedings of the National Academy of Sciences* 116.44 (2019): 22071-22080.4. **Sensitivity to Outliers:** - Aggarwal, Charu C. *Data Classification: Algorithms and Applications*. Chapman and Hall/CRC, 2014. This discusses the impact of outliers on different models, including linear regression and KNN. - Friedman, Jerome, et al. *The Elements of Statistical Learning*. Springer Series in Statistics, 2001. Sensitivity to outliers is compared across various regression techniques, including KNN.5. **Handling High-Dimensional Data:** - Domingos, Pedro. "A few useful things to know about machine learning." *Communications of the ACM* 55.10 (2012): 78-87. This paper discusses challenges like the curse of dimensionality in models like KNN. - Verleysen, Michel, and François, Damien. "The curse of dimensionality in data mining and time series prediction." *International Work-Conference on Artificial Neural Networks*. Springer, 2005.6. **Training and Prediction Time:** - Shalev-Shwartz, Shai, and Ben-David, Shai. *Understanding Machine Learning: From Theory to Algorithms*. Cambridge University Press, 2014. Provides insights into the computational cost differences between linear and non-parametric models like KNN. - Li, Zhe, et al. "Fast k-nearest neighbor search using GPU." *International Conference on Image and Graphics*. Springer, 2015. This paper discusses computational complexity related to KNN.7. **Overfitting and Flexibility:** - Yao, Ying, et al. "Overfitting and Underfitting: A Visual Explanation." Towards Data Science, 2019. Offers a visual and intuitive explanation of the bias-variance tradeoff in KNN and linear models. - Rasmussen, Carl E., and Williams, Christopher KI. *Gaussian Processes for Machine Learning*. MIT Press, 2006. 
Discusses overfitting in KNN due to small values of `k` and regularization techniques for linear models.---**Share on** <div id="fb-root"></div><script async defer crossorigin="anonymous" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v20.0"></script><div class="share-buttons"><div class="fb-share-button" data-href="https://mrislambd.github.io/dsandml/knn/"data-layout="button_count" data-size="small"><a target="_blank" href="https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fmrislambd.github.io%2Fdsandml%2Fknn%2F&src=sdkpreparse" class="fb-xfbml-parse-ignore">Share</a></div><script src="https://platform.linkedin.com/in.js" type="text/javascript">lang: en_US</script><script type="IN/Share" data-url="https://mrislambd.github.io/dsandml/knn/"></script> <a href="https://twitter.com/share?ref_src=twsrc%5Etfw" class="twitter-share-button" data-url="https://mrislambd.github.io/dsandml/knn/" data-show-count="true">Tweet</a><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></div><div class="fb-comments" data-href="https://mrislambd.github.io/dsandml/knn/" data-width="" data-numposts="5"></div>**You may also like**