Sunday, July 6, 2025

Machine Learning Baselines

Introduction to Baseline Models

Baseline models are simple models used as a starting point for comparison when building more complex models. They help evaluate the performance of a model by providing a benchmark to measure against. In this article, we will explore different types of baseline models for various tasks, including classification, regression, time series forecasting, clustering, dimensionality reduction, and anomaly detection.

Classification Tasks

Classification is used to predict discrete classes, such as spam vs. not spam. There are several baseline models for classification tasks, including:

  • Most frequent class: Predict the majority class every time.
  • Stratified random: Sample predictions at random in proportion to the training class distribution.
  • Uniform random: Pick each class with equal probability.
    These baseline models can be implemented with DummyClassifier from scikit-learn (imported as sklearn). For example:

    from sklearn.dummy import DummyClassifier
    dummy = DummyClassifier(strategy="most_frequent")
    dummy.fit(X_train, y_train)
    print(dummy.score(X_test, y_test))

    Use these baseline models as a minimum bar before moving to logistic regression, SVMs, and similar models. They matter most on imbalanced datasets, where always predicting the majority class can already score a high accuracy.
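
To make the point about imbalanced data concrete, the sketch below compares the three strategies on a synthetic dataset. The dataset from make_classification, the roughly 90/10 class split, and all variable names are assumptions chosen for illustration; they are not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary dataset (roughly 90% / 10%); illustrative only.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

results = {}
for strategy in ("most_frequent", "stratified", "uniform"):
    dummy = DummyClassifier(strategy=strategy, random_state=0)
    dummy.fit(X_train, y_train)
    y_pred = dummy.predict(X_test)
    results[strategy] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
    }

for strategy, scores in results.items():
    print(strategy, scores)
```

With most_frequent, accuracy looks high only because the majority class dominates, while balanced accuracy stays at 0.5. A real model should therefore beat the baseline on a metric suited to the imbalance, not just on accuracy.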

Regression Tasks

Regression is used to predict continuous values, such as house prices. There are several baseline models for regression tasks, including:

  • Mean regressor: Predict the mean of the training targets.
  • Median regressor: Predict the median of the training targets.
  • Quantile regressor: Predict a chosen quantile of the training targets (useful for skewed data).
    These baseline models can be implemented with DummyRegressor from scikit-learn (imported as sklearn). For example:

    from sklearn.dummy import DummyRegressor
    dummy = DummyRegressor(strategy="mean")
    dummy.fit(X_train, y_train)
    print(dummy.score(X_test, y_test))  # R² score; about 0 or slightly negative for a mean baseline

    Use these baseline models when starting with linear models, or to check whether the features contribute meaningfully: a model that barely beats the mean predictor is learning little from them.
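
The median and quantile strategies listed above use the same class with a different strategy argument. The following is a minimal sketch on synthetic data; the make_regression dataset, the 0.9 quantile, and the choice of mean absolute error as the metric are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data; illustrative only.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baselines = {
    "mean": DummyRegressor(strategy="mean"),
    "median": DummyRegressor(strategy="median"),
    "q90": DummyRegressor(strategy="quantile", quantile=0.9),
}

mae = {}
for name, model in baselines.items():
    # Each baseline ignores X and predicts one constant learned from y_train.
    model.fit(X_train, y_train)
    mae[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {mae[name]:.2f}")
```

Reporting an error metric in the target's own units, such as MAE, alongside R² often makes the gap between a baseline and a real model easier to interpret.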

Time Series Forecasting

Time series forecasting predicts future values from past observations. There are several baseline models for time series forecasting, including:

  • Naive forecast: Next value = last value
  • Seasonal naive: Repeat the value from the same point in the last season
  • Moving average: Predict the average over a sliding window
  • Drift method: Extend the line drawn between the first and last observations
    For example, assuming test_series is a pandas Series:

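    from sklearn.metrics import mean_absolute_error

    def naive_forecast(series):
        return series.shift(1)  # each forecast is the previous observation

    pred = naive_forecast(test_series)
    # Drop the first point: it has no previous value, so its forecast is NaN.
    mae = mean_absolute_error(test_series.iloc[1:], pred.iloc[1:])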

    Use these baseline models as a benchmark for ARIMA, LSTM, and Prophet, or to justify the use of seasonality or trend modeling.
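
The other three baselines in the list can be sketched in a few lines of NumPy. The function names, the toy series, and the season length of 3 are hypothetical and chosen only for illustration; the drift method is written here as the straight line through the first and last observations, extended forward.

```python
import numpy as np

def seasonal_naive(history, horizon, season_length):
    """Repeat the last observed season for `horizon` future steps."""
    last_season = np.asarray(history, dtype=float)[-season_length:]
    return np.array([last_season[i % season_length] for i in range(horizon)])

def moving_average(history, horizon, window):
    """Forecast a flat line at the mean of the last `window` observations."""
    recent = np.asarray(history, dtype=float)[-window:]
    return np.full(horizon, recent.mean())

def drift(history, horizon):
    """Extend the straight line through the first and last observations."""
    history = np.asarray(history, dtype=float)
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + slope * np.arange(1, horizon + 1)

history = [10, 12, 14, 11, 13, 15, 12, 14, 16]  # toy series, season length 3
print(seasonal_naive(history, horizon=4, season_length=3))
print(moving_average(history, horizon=2, window=3))
print(drift(history, horizon=3))
```

Each function takes only the history and returns forecasts for the next steps, so all three can be scored against the same held-out values with the same error metric.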

Clustering Tasks

Clustering groups similar items without labels (unsupervised). To evaluate clustering models, use internal metrics such as the silhouette score, or external metrics such as the Adjusted Rand Index when ground-truth labels are available.
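
Both metrics are available in scikit-learn. The sketch below computes them for KMeans on synthetic blobs. The blob centers, the use of KMeans, and the random label assignment used as a reference point are assumptions added for illustration; the article itself does not prescribe a clustering baseline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Three well-separated synthetic clusters with known labels; illustrative only.
X, y_true = make_blobs(
    n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]], cluster_std=0.6, random_state=0
)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hypothetical reference point: assign every sample a random cluster label.
rng = np.random.default_rng(0)
random_labels = rng.integers(0, 3, size=len(X))

# Internal metric: needs only the data and the predicted labels.
sil_kmeans = silhouette_score(X, labels)
sil_random = silhouette_score(X, random_labels)

# External metric: needs ground-truth labels.
ari_kmeans = adjusted_rand_score(y_true, labels)
ari_random = adjusted_rand_score(y_true, random_labels)

print(f"silhouette: kmeans={sil_kmeans:.2f}, random={sil_random:.2f}")
print(f"ARI:        kmeans={ari_kmeans:.2f}, random={ari_random:.2f}")
```

The Adjusted Rand Index is corrected for chance, so random labels should score near 0, which gives the clustering result something to be compared against.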

Dimensionality Reduction

Dimensionality reduction compresses data into fewer dimensions (e.g., PCA, t-SNE). Instead of using baseline models, evaluate the model through:

  • Variance retained (PCA)
  • Visualization clarity (t-SNE, UMAP)
  • Downstream task performance
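
For the first criterion, scikit-learn's PCA exposes the share of variance each component retains. This is a minimal sketch; the Iris dataset, the standardization step, and the 95% target are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first so that no single feature dominates the variance.
X = StandardScaler().fit_transform(load_iris().data)

# Variance retained by a fixed number of components.
pca = PCA(n_components=2).fit(X)
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {retained:.3f}")

# A float in (0, 1) asks PCA to keep the fewest components
# that retain at least that fraction of the variance.
pca_95 = PCA(n_components=0.95).fit(X)
print(f"Components needed for 95% variance: {pca_95.n_components_}")
```

Variance retained says nothing about whether the kept dimensions are useful for a later model, which is why downstream task performance is listed as a separate check.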

Anomaly Detection

Anomaly detection finds rare or unusual patterns (often unsupervised). This task often lacks labeled data, and because anomalies are so rare, beating random guessing on accuracy says very little. Possible baselines include:

  • Constant threshold (e.g., treat top 1% as anomaly)
  • Random scores
    When labels are missing, evaluate on synthetic datasets where anomalies are injected and therefore known.
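
Both baselines can be tried on a synthetic dataset with injected anomalies. In the sketch below, the data distribution, the distance-from-median score, and the helper names flag_top_fraction and precision are hypothetical choices for illustration, not a prescribed method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 990 normal points plus 10 injected anomalies far from the bulk.
normal = rng.normal(0.0, 1.0, size=990)
anomalies = rng.normal(10.0, 1.0, size=10)
x = np.concatenate([normal, anomalies])
is_anomaly = np.concatenate([np.zeros(990, dtype=bool), np.ones(10, dtype=bool)])

def flag_top_fraction(scores, fraction=0.01):
    """Flag the points whose score falls in the top `fraction` of all scores."""
    threshold = np.quantile(scores, 1.0 - fraction)
    return scores > threshold

def precision(flags, truth):
    """Share of flagged points that are true anomalies."""
    return (flags & truth).sum() / max(flags.sum(), 1)

# Baseline 1: simple score (distance from the median), top 1% flagged.
distance_scores = np.abs(x - np.median(x))
# Baseline 2: random scores, top 1% flagged.
random_scores = rng.random(len(x))

p_distance = precision(flag_top_fraction(distance_scores), is_anomaly)
p_random = precision(flag_top_fraction(random_scores), is_anomaly)
print(f"precision (distance score): {p_distance:.2f}")
print(f"precision (random score):   {p_random:.2f}")
```

Because the anomalies are injected, precision among the flagged points can be measured directly, and a proposed detector can be compared against both baselines on the same data.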

Conclusion

Baseline models are essential in machine learning as they provide a benchmark to measure the performance of more complex models. By understanding the different types of baseline models for various tasks, including classification, regression, time series forecasting, clustering, dimensionality reduction, and anomaly detection, we can better evaluate and improve our models. Remember to choose the right baseline model for your task and use it as a starting point for comparison to build more accurate and effective models.
