I tested 5 ML algorithms across 7 datasets. The "best" one didn't exist.
A year-long master's thesis, a Python benchmarking tool, and seven evaluation metrics. The lesson wasn't which algorithm wins — it was why the question is wrong.
Every machine learning tutorial gives you the same advice: “Try Random Forest first.”
I followed that advice for years. Then I spent a year building a tool to test it properly — and the answer turned out to be uncomfortable.
The setup
For my master’s thesis, co-supervised between the University of Maribor and UPC Barcelona, I built a Python tool that benchmarks five classification algorithms — Logistic Regression, Decision Tree, Random Forest, SVM, and KNN — across seven datasets, scoring each on seven metrics (Accuracy, Precision, Recall, F1, Cohen’s Kappa, Hamming Loss, Jaccard) plus training and prediction time.
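If you want to run the same kind of comparison on your own data, the core loop is small. Here's a minimal sketch of the idea (not the thesis tool itself); the scikit-learn demo datasets, default classifier settings, and 70/30 split below are illustrative choices only:

```python
# Sketch of a benchmark loop: fit each classifier on each dataset,
# record every metric plus wall-clock training and prediction time.
import time

from sklearn.datasets import load_breast_cancer, load_digits, load_iris, load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             hamming_loss, jaccard_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

DATASETS = {"iris": load_iris, "wine": load_wine,
            "breast_cancer": load_breast_cancer, "digits": load_digits}
MODELS = {
    "logreg": LogisticRegression(max_iter=5000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
    "knn": KNeighborsClassifier(),
}

for ds_name, loader in DATASETS.items():
    X, y = loader(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    for model_name, model in MODELS.items():
        clf = make_pipeline(StandardScaler(), model)  # scaling matters for SVM and KNN
        t0 = time.perf_counter()
        clf.fit(X_train, y_train)
        train_time = time.perf_counter() - t0
        t0 = time.perf_counter()
        y_pred = clf.predict(X_test)
        predict_time = time.perf_counter() - t0
        scores = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, average="macro", zero_division=0),
            "recall": recall_score(y_test, y_pred, average="macro", zero_division=0),
            "f1": f1_score(y_test, y_pred, average="macro", zero_division=0),
            "kappa": cohen_kappa_score(y_test, y_pred),
            "hamming_loss": hamming_loss(y_test, y_pred),
            "jaccard": jaccard_score(y_test, y_pred, average="macro"),
        }
        print(ds_name, model_name,
              {k: round(v, 3) for k, v in scores.items()},
              f"train={train_time:.3f}s predict={predict_time:.3f}s")
```

Swap in your own data loader and the resulting table answers the "which algorithm" question for your problem, not someone else's.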
I expected a clear winner. I didn’t get one.

What actually happened
- On Iris and Digits, every algorithm performed beautifully. Pick any one.
- On Wine and Breast Cancer, Logistic Regression and Random Forest dominated; SVM and KNN trailed badly.
- On Olivetti Faces, SVM jumped to the front.
- On Covertype, KNN was the best — the same KNN that was nearly the worst elsewhere. SVM, the hero of the previous dataset, collapsed.

If I’d run this experiment on only three of those datasets, I would have left convinced Random Forest was the answer. On the other four, I’d have walked away convinced of something completely different.
Why this actually matters
We talk about ML algorithms like they have personalities — “use this one for X, this one for Y.” But the dataset has the personality. Algorithm choice is a conversation between the math and the data in front of you, not a rule you memorize once from a blog post.
In production, this means: the “best” model is the one you measured on your own data. Cheap, tedious benchmarking beats expensive reasoning from authority.
There’s a second layer the metrics taught me. Accuracy and F1 don’t always agree. Cohen’s Kappa can flag a model that is barely beating random chance even while accuracy calls it “great.” On an imbalanced dataset, a model can win on accuracy and lose on every metric that actually matters. You don’t know which metric matters until you know what a wrong prediction costs you.
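A tiny artificial example of that disagreement (made-up labels, not thesis data): a model that always predicts the majority class on a 95/5 split scores 0.95 accuracy, a Cohen's Kappa of exactly zero, and a macro F1 of roughly 0.49.

```python
# "Always predict the majority class" looks great on accuracy,
# but kappa and macro F1 expose it.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

y_true = [0] * 95 + [1] * 5      # imbalanced ground truth: 95% class 0
y_pred = [0] * 100               # model that always predicts class 0

print("accuracy:", accuracy_score(y_true, y_pred))     # 0.95
print("kappa:   ", cohen_kappa_score(y_true, y_pred))  # 0.0, no better than chance
print("macro F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.49
```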
What it taught me
Build the thing that measures, not the thing that argues. The tool I shipped was simpler than any of the algorithms it tested — and it answered the question better than any blog post could.
Also: track time, not just accuracy. A model that’s 1% more accurate and 10× slower to train is the wrong choice more often than people admit.
The real takeaway, looking back: the discipline of measuring is more valuable than the model you measure.
If you want the full version — methodology, all seven datasets, every chart, the comparison tool — the full master’s thesis is open access on UPCommons.
What’s your default algorithm when you start a new classification problem? Curious where the consensus sits today.
#MachineLearning #DataScience #Python #ScikitLearn #MastersThesis #PredictiveAnalytics #Classification