When you work in a specific field long enough, certain lessons, concepts, and teachers leave a lasting impression. It’s common to look back on these formative moments, especially when they spark a lifelong passion.

Take my mother, a teacher, for instance. She still remembers the substitute teacher who made her fall in love with philosophy. My Tae Kwon Do master, in a similar vein, fondly recalls the excitement of his first martial arts class as a child—an experience that shaped his career and love for Tae Kwon Do.

As for me, I’m a Machine Learning Engineer. The subject that I know best, and which has captured my professional enthusiasm, is machine learning. One moment that stands out is my first encounter with regression and classification during a lecture in my undergraduate studies. My professor made the difference between classification and regression crystal clear with examples. Classifying whether an email is spam or not is a typical classification task, whereas predicting the price of a house based on various features—like location or size—is a regression task.

In both classification and regression, we typically deal with a dataset represented by a matrix \(X\) (input features), which has \(n\) rows (data points) and \(k\) columns (features). The target is a vector \(y\) with \(n\) elements, one for each row of \(X\). In classification, the values in \(y\) are discrete labels (e.g., spam = 1, not spam = 0); in regression, they are continuous real numbers (e.g., house prices).
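As a tiny illustration (the numbers are made up), here is what those shapes look like in NumPy:

import numpy as np

# n = 4 data points, k = 3 features (illustrative values only)
X = np.array([[0.5, 1.2, 3.0],
              [1.1, 0.7, 2.4],
              [0.9, 1.5, 1.8],
              [1.4, 0.3, 2.9]])

# Classification target: one discrete label per row (e.g., spam = 1, not spam = 0)
y_class = np.array([1, 0, 0, 1])

# Regression target: one continuous value per row (e.g., a price)
y_reg = np.array([210.5, 344.0, 189.9, 425.2])

print(X.shape, y_class.shape, y_reg.shape)  # (4, 3) (4,) (4,)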

Regression tasks can be approached in many ways—perhaps too many. The rise of AI and the explosion of “ultimate solution” claims have made it hard to discern which method is best for a given dataset. The truth is that no single algorithm solves all problems. The best approach depends on the characteristics of the data and the desired outcome.

In this article, we’ll explore how to select the appropriate regression algorithm by considering:

  • Whether the relationship between the features and the target is linear or polynomial.
  • The complexity of the dataset.
  • The dimensionality of the input data.
  • The need for probabilistic output.

For simplicity, we’ll focus on traditional machine learning methods, excluding neural networks, and work primarily with smaller, synthetic datasets.

Ready to dive in? Let’s explore the world of regression.

1. A Taxonomy of Regression Methods

To guide us through this, we can use a simple “rule of thumb” taxonomy to choose a suitable regression approach:

Key Questions

  1. Is the relationship between \(X\) and \(y\) linear or polynomial?
    • Yes:
      • Is the number of features \(k\) small?
        • Yes: Use Linear or Polynomial Regression.
        • No: Use Principal Component Regression.
    • No:
      • Is a probabilistic output needed?
        • Yes: Use Gaussian Process Regression (GPR).
        • No: Use Support Vector Regression (SVR).

This taxonomy provides a foundation for navigating regression tasks. In the following sections, we’ll explore each of these methods in detail—both theoretically and with code examples.
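Purely as an illustration (the threshold for a "small" number of features is an assumption, not a hard rule), the rule of thumb above can be written as a tiny helper function:

def suggest_regressor(relationship_is_linear_or_polynomial: bool,
                      n_features: int,
                      needs_probabilistic_output: bool,
                      small_k_threshold: int = 10) -> str:
    """Return a suggested method following the rule-of-thumb taxonomy.

    The small_k_threshold default of 10 is purely illustrative.
    """
    if relationship_is_linear_or_polynomial:
        if n_features <= small_k_threshold:
            return "Linear or Polynomial Regression"
        return "Principal Component Regression"
    if needs_probabilistic_output:
        return "Gaussian Process Regression"
    return "Support Vector Regression"

print(suggest_regressor(True, 3, False))   # Linear or Polynomial Regression
print(suggest_regressor(False, 3, True))   # Gaussian Process Regression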


2. Linear and Polynomial Regression

2.1 Explanation

Linear regression is one of the simplest and most widely known methods for regression tasks. It models the relationship between input features \(X\) and the target \(y\) as a linear function. Mathematically, we can express the model as:

\[y = XW + w_0\]

Where:

  • \(X\) is the input matrix of size \(n \times k\),
  • \(W\) is the vector of weights (one weight per feature),
  • \(w_0\) is the intercept term (bias).

Linear regression assumes that the output is a weighted sum of the input features. When we perform Ordinary Least Squares (OLS) regression, the weights are calculated by minimizing the sum of squared errors between the predicted and actual values:

\[W = (X^TX)^{-1} X^T y\]

In polynomial regression, the principle is the same, but we first transform the input matrix \(X\) into a new matrix of polynomial features. For example, for a second-degree polynomial, the transformed matrix \(X_\text{poly}\) contains the original features, their squares, and (in general) their pairwise products.
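To make the closed-form solution concrete, here is a minimal NumPy sketch of the normal equation (a column of ones is appended so the intercept \(w_0\) is estimated along with the other weights; the variable names and toy data are just illustrative, and the scikit-learn version follows in the next subsection):

import numpy as np

rng = np.random.default_rng(0)

# Toy data: n = 100 points, k = 2 features, true weights [2, -1], intercept 0.5
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the last weight plays the role of the intercept w_0
X_design = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equation: W = (X^T X)^{-1} X^T y
# (np.linalg.lstsq is preferable numerically; inv is used here to mirror the formula)
W = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y
print(W)  # approximately [2, -1, 0.5]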

2.2 Code Example

Here’s how we can implement linear regression in Python using the popular library scikit-learn:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Simulate data
X = np.linspace(1, 20, 100).reshape(-1, 1)
y = 3*X + np.random.randn(100, 1) * 5

# Fit linear regression model
model = LinearRegression()
model.fit(X, y)

# Predictions
y_pred = model.predict(X)

# Plot
plt.scatter(X, y, label='Data')
plt.plot(X, y_pred, color='red', label='Linear Regression')
plt.legend()
plt.show()
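The fitted model exposes the learned parameters, which should land close to the slope of 3 used to generate the data (the noise keeps them from matching exactly):

# Inspect the learned parameters
print(model.coef_)       # learned weight(s) W, roughly [[3.]]
print(model.intercept_)  # learned intercept w_0, roughly [0.]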

For polynomial regression, we add a step to transform the input matrix \(X\) into polynomial features:

# Polynomial features of degree 3
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)

# Fit the polynomial regression model
model_poly = LinearRegression()
model_poly.fit(X_poly, y)

# Predictions
y_poly_pred = model_poly.predict(X_poly)

# Plot
plt.scatter(X, y, label='Data')
plt.plot(X, y_poly_pred, color='green', label='Polynomial Regression (degree 3)')
plt.legend()
plt.show()

2.3 When to Use

Linear and polynomial regression work best when there is a clear linear or polynomial relationship between the input features and the target variable. For datasets with few features (\(k\) is small), these methods are efficient and interpretable.

3. Principal Component Regression (PCR)

3.1 Explanation

When dealing with high-dimensional datasets (i.e., datasets with many features), linear regression can become unstable and prone to overfitting, especially when some features are highly correlated. Principal Component Analysis (PCA) can be used to reduce the dimensionality of the input data, creating uncorrelated features (principal components) that still retain most of the variability in the data.

Principal Component Regression combines PCA with linear regression. First, we transform the input matrix \(X\) into a set of principal components. Then, we use these principal components to predict the target \(y\).

3.2 Code Example

Let’s see how this works in practice:

from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Simulate data with 5 features
X = np.random.rand(100, 5)
y = 2*X[:, 0] + 3*X[:, 1] + np.random.randn(100)

# Apply PCA
pca = PCA(n_components=2)  # Reduce to 2 principal components
X_pca = pca.fit_transform(X)

# Fit regression model on principal components
model_pcr = LinearRegression()
model_pcr.fit(X_pca, y)

# Predictions
y_pcr_pred = model_pcr.predict(X_pca)
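A common follow-up is to check how much of the original variance the retained components explain, and how well the reduced model fits; a minimal sketch:

from sklearn.metrics import r2_score

# Fraction of the original variance captured by each retained component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # total variance retained by 2 components

# Goodness of fit of the regression on the principal components
print(r2_score(y, y_pcr_pred))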

3.3 When to Use

Principal Component Regression is ideal when you have many features, especially if they are highly correlated. By reducing the dimensionality, you simplify the model and avoid overfitting, while still capturing the essential patterns in the data.

4. Gaussian Process Regression (GPR)

4.1 Explanation

Gaussian Process Regression (GPR) is a powerful tool when you need not only a prediction but also an estimate of the uncertainty in that prediction. GPR places a Gaussian process prior over the function mapping inputs to outputs, so that any finite set of function values follows a joint Gaussian distribution. Given a new data point \(x\), GPR therefore provides not just a single prediction but a distribution of possible outcomes, which can be extremely useful when dealing with uncertain or noisy data.

The beauty of GPR lies in its ability to offer both mean predictions and confidence intervals, allowing for uncertainty quantification. This makes it a robust choice for applications like geospatial modeling, stock price prediction, or any domain where data uncertainty is high.
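For reference, under a zero-mean GP prior with kernel \(k(\cdot,\cdot)\) and Gaussian observation noise of variance \(\sigma_n^2\), the predictive distribution at a test point \(x_*\) has a standard closed form:

\[\mu(x_*) = k_*^T (K + \sigma_n^2 I)^{-1} y, \qquad \sigma^2(x_*) = k(x_*, x_*) - k_*^T (K + \sigma_n^2 I)^{-1} k_*\]

where \(K\) is the \(n \times n\) kernel matrix over the training inputs and \(k_*\) is the vector of kernel values between \(x_*\) and each training point. The mean gives the prediction, and the variance gives the confidence band plotted in the example below.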

4.2 Code Example

Here’s a simple implementation using scikit-learn’s GPR module:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

# Simulate non-linear data
X = np.linspace(1, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.randn(100) * 0.2

# Define the kernel (RBF kernel)
kernel = C(1.0, (1e-4, 1e1)) * RBF(1.0, (1e-4, 1e1))
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)

# Fit the model
gpr.fit(X, y)

# Make predictions
y_pred, sigma = gpr.predict(X, return_std=True)

# Plot
plt.figure()
plt.plot(X, y, 'r.', markersize=10, label='Observations')
plt.plot(X, y_pred, 'b-', label='Prediction')
plt.fill_between(X.ravel(), y_pred - 1.96 * sigma, y_pred + 1.96 * sigma,
                 alpha=0.2, color='gray', label='95% confidence interval')
plt.legend()
plt.show()

4.3 When to Use

GPR is well-suited for small datasets where quantifying uncertainty is critical. It works best when the dataset is moderately complex but not excessively large, as exact GPR becomes computationally expensive for larger datasets (training scales roughly cubically with the number of samples).

5. Support Vector Regression (SVR)

5.1 Explanation

Support Vector Regression (SVR) is a versatile and powerful regression technique, especially when dealing with nonlinear relationships. Unlike traditional regression methods that minimize the prediction error directly, SVR fits a function that keeps as many points as possible inside a margin (a tube of width \(\epsilon\)) around it, penalizing only the points that fall outside. It uses the "kernel trick" to implicitly map input features into higher-dimensional spaces, where a complex relationship can be modeled as a linear one.
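Concretely, this corresponds to the \(\epsilon\)-insensitive loss: predictions within the tube incur no penalty, and only the excess beyond \(\epsilon\) counts as error:

\[L_\epsilon(y, \hat{y}) = \max\left(0, \lvert y - \hat{y} \rvert - \epsilon\right)\]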

5.2 Code Example

Here’s an example of SVR in Python:

from sklearn.svm import SVR

# Simulate non-linear data
X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.randn(100) * 0.1

# Fit SVR model with RBF kernel
svr = SVR(kernel='rbf', C=100, gamma=0.1)
svr.fit(X, y)

# Predictions
y_svr_pred = svr.predict(X)

# Plot
plt.scatter(X, y, label='Data')
plt.plot(X, y_svr_pred, color='orange', label='SVR Prediction')
plt.legend()
plt.show()
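The values C=100 and gamma=0.1 above were picked by hand; in practice, C, gamma, and epsilon usually need tuning. A minimal cross-validation sketch, with a purely illustrative parameter grid:

from sklearn.model_selection import GridSearchCV

# Illustrative grid; the ranges are assumptions, not recommendations
param_grid = {
    'C': [1, 10, 100],
    'gamma': [0.01, 0.1, 1.0],
    'epsilon': [0.01, 0.1, 0.5],
}

search = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated R^2 of the best combination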

5.3 When to Use

SVR shines when the relationship between the input and output is nonlinear and complex. It’s particularly useful for high-dimensional datasets, as it can efficiently handle large feature spaces through the kernel trick.

6. Other Regression Methods

Although the methods we’ve discussed so far cover a wide range of regression tasks, there are several other popular approaches worth mentioning:

  • XGBoost: A gradient boosting method that sequentially builds models to correct errors made by previous models. XGBoost is highly efficient and often used in competitive machine learning challenges.

  • Random Forests and Decision Trees: Decision trees split the dataset into smaller subsets based on feature values. Random forests improve upon this by averaging the predictions of multiple trees to enhance accuracy and avoid overfitting (a minimal code sketch follows this list).

  • Neural Networks: Although outside the scope of this article, neural networks are extremely popular for solving regression tasks, especially when the relationship between features and the target variable is highly complex.
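To make the random forest option concrete, here is a minimal scikit-learn sketch on the same kind of non-linear toy data used in the SVR example (the hyperparameters are illustrative defaults, not tuned values):

from sklearn.ensemble import RandomForestRegressor

# Non-linear toy data, as in the SVR example
X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.randn(100) * 0.1

# Fit a random forest of 100 trees
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# Predictions
y_rf_pred = rf.predict(X)

# Plot
plt.scatter(X, y, label='Data')
plt.plot(X, y_rf_pred, color='purple', label='Random Forest Prediction')
plt.legend()
plt.show()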

7. Final Thoughts

Regression tasks are a fundamental part of machine learning, offering a wide range of methods to model relationships between variables. Whether you’re working with simple linear data or complex, nonlinear datasets, there is a regression method suited to your needs.

In this article, we:

  • Explored the distinction between classification and regression.
  • Introduced a decision framework for selecting the right regression method based on dataset complexity and needs.
  • Provided detailed explanations and code examples for linear, polynomial, principal component, Gaussian process, and support vector regression.
  • Briefly touched upon other popular regression techniques such as XGBoost and Random Forests.

Remember, the key to effective regression modeling lies in understanding your dataset and choosing a method that aligns with its structure and your objectives.