Master AI Agents & Build Fully Autonomous Web Interactions!
Join our AI Agents Certification Program and learn to develop AI agents that plan, reason, and automate tasks independently. A hands-on, 4-week intensive program with expert-led live sessions.
Starts: 24th May | Early Bird: $1190 (Limited Spots! Price Increases to $2490 in 7 Days)
Enroll now & unlock exclusive bonuses! (Worth $500+)
Hello!!
Welcome to the new edition of Business Analytics Review!
Today, we're tackling two fundamental challenges in building effective models: overfitting and underfitting. These terms might sound like jargon, but they're critical to ensuring your models make accurate predictions in the real world. Let's dive in with an analogy to make things relatable.
A chef crafts a signature dish that family praises but restaurant patrons reject. That's overfitting: tailoring a recipe too closely to a few specific palates (the noise in the training data) hurts its appeal to new customers (unseen data). Conversely, an oversimplified recipe that neglects essential flavors is underfitting: the model is too basic to capture the patterns that matter. Both scenarios highlight the need to balance complexity and adaptability in model design.
What Is Overfitting and Underfitting?
To grasp these concepts, let's break them down clearly, with a nod to the bias-variance tradeoff, a key idea in machine learning.
Overfitting: This happens when a model learns not just the underlying patterns in the training data but also its noise and random fluctuations. Think of a student who memorizes test answers without understanding the subject - they ace the practice test but flop on a new exam. An overfitted model performs brilliantly on training data but poorly on new, unseen data, failing to generalize. In technical terms, it has low bias (it fits the training data well) but high variance (it's overly sensitive to training data fluctuations).
Underfitting: This occurs when a model is too simplistic to capture the data's complexity. It's like using a straight line to model a curved relationship - it doesn't fit well anywhere. An underfitted model performs poorly on both training and new data because it misses key patterns. It has high bias (due to oversimplified assumptions) and low variance (it's not sensitive to training data changes).
The goal is to find a sweet spot: a model that balances bias and variance to generalize well to new data.
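The tradeoff is easy to see numerically. Here is a minimal sketch (our own illustration on synthetic data, not from a real dataset) that fits polynomials of degree 1, 2, and 9 to noisy quadratic data: degree 1 underfits, degree 2 matches the true pattern, and degree 9 chases the noise.

```python
import numpy as np

# Synthetic data: a quadratic trend plus noise (illustrative choices, not from the article)
rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 20)
y_train = x_train**2 + rng.normal(0, 1, 20)
x_test = np.linspace(-3, 3, 50)
y_test = x_test**2 + rng.normal(0, 1, 50)

def train_test_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse

errors = {deg: train_test_mse(deg) for deg in (1, 2, 9)}
for deg, (tr, te) in errors.items():
    print(f"degree {deg}: train MSE {tr:.2f}, test MSE {te:.2f}")
```

The degree-1 model shows high error everywhere (high bias); the degree-9 model drives training error below the degree-2 model's but its test error stays well above its own training error (high variance).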
Causes of Overfitting and Underfitting
Understanding why these issues arise is the first step to preventing them. Here's a breakdown:
Overfitting
Excessive Model Complexity: Using a model with too many parameters or features, like a high-degree polynomial for a simple trend.
Limited Training Data: With too little data, the model may memorize specifics rather than learn general patterns.
Noisy Data: If the training data includes outliers or errors, the model might fit these irrelevant details.
Underfitting
Overly Simple Model: Choosing a model thatโs not powerful enough, like a linear model for nonlinear data.
Inadequate Features: Missing key variables that explain the dataโs patterns.
Excessive Regularization: Applying too much constraint (e.g., strong L1/L2 penalties) can limit the modelโs ability to learn.
Detecting Overfitting and Underfitting
Spotting these issues early can save time and resources. Here's how:
Performance Metrics:
Overfitting: High accuracy on training data but low accuracy on validation or test data. For example, a model might achieve 95% accuracy on training but only 70% on validation.
Underfitting: Low accuracy on both training and validation data, indicating the model hasn't learned enough.
Learning Curves: Plotting training and validation errors over time can reveal issues:
In overfitting, training error is low, but validation error is high and doesnโt improve.
In underfitting, both errors are high, showing the model isn't capturing the data's patterns.
Cross-Validation: Using techniques like k-fold cross-validation helps assess how well the model generalizes to unseen data, highlighting overfitting or underfitting.
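As a rough sketch of how k-fold cross-validation flags these issues, the following from-scratch example (synthetic data and all parameter choices are ours) scores polynomial models across 5 folds; the underfit degree-1 model is clearly penalized relative to the well-matched degree-2 model.

```python
import numpy as np

# Synthetic noisy quadratic data (illustrative, not from the article)
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = x**2 + rng.normal(0, 1, 60)
shuffled = rng.permutation(len(x))  # shuffle before splitting into folds

def kfold_mse(degree, k=5):
    """Mean validation MSE of a polynomial fit over k folds."""
    folds = np.array_split(shuffled, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        scores.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return float(np.mean(scores))

cv_scores = {deg: kfold_mse(deg) for deg in (1, 2, 9)}
for deg, score in cv_scores.items():
    print(f"degree {deg}: 5-fold CV MSE {score:.2f}")
```

Because every point serves as validation data exactly once, the averaged score estimates generalization error far more reliably than a single train/test split.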
Our PRO newsletter is active. Learn More Here
You can enjoy daily premium content for the cost of a single coffee.
Mitigating Overfitting
Simplify the Model: Reduce complexity by using fewer features or a less intricate algorithm (e.g., switch from a deep neural network to a simpler one).
Increase Training Data: More data helps the model learn general patterns rather than memorizing specifics.
Regularization: Techniques like L1 (Lasso) or L2 (Ridge) regularization add penalties to prevent the model from fitting noise.
Early Stopping: Monitor validation error during training and stop when it starts to increase, preventing over-optimization on training data.
Cross-Validation: Use k-fold cross-validation to tune hyperparameters and ensure robust performance.
Dropout (for Neural Networks): Randomly disable neurons during training to reduce reliance on specific patterns (Dropout Guide).
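To illustrate the regularization point above, here is a sketch of L2 (ridge) regression in closed form on a deliberately over-parameterized polynomial model. The data, degree, and penalty value are our own illustrative choices, not prescriptions.

```python
import numpy as np

# Synthetic data: noisy sine, with as many polynomial terms as training points,
# so the unregularized fit interpolates the noise (all numbers are illustrative)
rng = np.random.default_rng(2)
x_train = np.linspace(-1, 1, 13)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.3, 13)
x_test = np.linspace(-1, 1, 100)
y_test = np.sin(3 * x_test) + rng.normal(0, 0.3, 100)

def poly_features(x, degree=12):
    return np.vander(x, degree + 1)

def ridge_weights(X, y, lam):
    """Closed-form L2 (ridge) solution: (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def ridge_test_mse(lam):
    w = ridge_weights(poly_features(x_train), y_train, lam)
    return float(np.mean((poly_features(x_test) @ w - y_test) ** 2))

ridge_results = {lam: ridge_test_mse(lam) for lam in (0.0, 0.1)}
for lam, mse in ridge_results.items():
    print(f"lambda={lam}: test MSE {mse:.2f}")
```

The penalty term shrinks the coefficients, trading a little extra bias for a large reduction in variance, so the regularized model's test error drops well below the unregularized fit's.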
Mitigating Underfitting
Increase Model Complexity: Opt for a more sophisticated algorithm or add layers to a neural network.
Enhance Features: Perform feature engineering to include more relevant variables (Feature Engineering).
Reduce Regularization: Lower regularization parameters to allow the model to fit the data better.
Train Longer: Increase the number of training epochs or iterations to ensure the model learns adequately (Epochs).
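The feature-engineering fix for underfitting can be sketched in a few lines (synthetic data and feature choices are ours): a purely linear model misses a quadratic relationship, while adding an engineered x² column lets ordinary least squares capture it.

```python
import numpy as np

# Synthetic data where the target depends on x^2 (our own illustration)
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 200)
y = x**2 + rng.normal(0, 1, 200)
train, test = np.arange(150), np.arange(150, 200)

def lstsq_test_mse(X):
    """Fit by least squares on the training rows; return test MSE."""
    w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return float(np.mean((X[test] @ w - y[test]) ** 2))

X_linear = np.column_stack([np.ones_like(x), x])        # underfit: misses curvature
X_feats  = np.column_stack([np.ones_like(x), x, x**2])  # engineered x^2 feature added
mse_linear = lstsq_test_mse(X_linear)
mse_feats = lstsq_test_mse(X_feats)
print(f"linear only: test MSE {mse_linear:.2f}; with x^2 feature: {mse_feats:.2f}")
```

Note the model itself stays linear in its parameters; only the feature set grew, which is often the cheapest cure for underfitting.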
Real-World Example: Credit Scoring in Finance
Let's bring this to life with a business scenario. Imagine a financial institution developing a credit scoring model to predict loan default risks, using historical data on income, credit history, and employment status.
Overfitting Scenario: The model might latch onto specific patterns in the training data, like outliers (e.g., a few high-income defaulters). It achieves near-perfect accuracy on historical data but misclassifies new applicants, approving risky loans or rejecting creditworthy ones. This could lead to significant financial losses or missed opportunities.
Underfitting Scenario: If the model is too simple (say, a basic linear regression with income as the only feature), it fails to capture complex factors like credit history or job stability. It performs poorly on both historical and new data, resulting in unreliable risk assessments.
By balancing model complexity, using cross-validation, and applying regularization, the institution can build a robust model that accurately predicts defaults, supporting sound lending decisions and minimizing financial risks.
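A toy version of this setup can be sketched with a from-scratch L2-regularized logistic regression. Everything here is synthetic: the three columns are hypothetical stand-ins for income, credit-history length, and job stability, not a real scoring model.

```python
import numpy as np

# Synthetic stand-in for credit data (hypothetical features, not real applicants)
rng = np.random.default_rng(4)
n = 500
X = rng.normal(size=(n, 3))                 # income, credit history, job stability
true_w = np.array([-1.0, -1.5, -0.8])       # higher values -> lower default risk
default_prob = 1 / (1 + np.exp(-(X @ true_w)))
y = (rng.uniform(size=n) < default_prob).astype(float)  # 1 = default

def fit_logistic(X, y, lam=0.01, lr=0.1, steps=2000):
    """Batch gradient descent on L2-regularized logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y) / len(y) + lam * w)
    return w

train, test = slice(0, 350), slice(350, None)
w = fit_logistic(X[train], y[train])
test_acc = float(np.mean(((X[test] @ w) > 0) == (y[test] == 1)))
print(f"held-out accuracy: {test_acc:.2f}")
```

The held-out split and the L2 penalty are exactly the safeguards the scenario calls for: the former detects a model that only memorized historical applicants, the latter keeps it from doing so.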
Recommended Reads
Overfitting vs. underfitting: Finding the balance
The article explains overfitting and underfitting in machine learning, highlighting their causes, effects, and strategies for achieving balanced models.
Understanding Overfitting vs. Underfitting in Machine Learning
The article explains overfitting and underfitting in machine learning, their causes, effects, and how to balance model complexity for better generalization.
How do we handle overfitting and underfitting in a machine learning model?
The article details methods for handling overfitting and underfitting, emphasizing iterative experimentation to balance both.
Trending in AI and Data Science
Let's catch up on some of the latest happenings in the world of AI and Data Science:
Trump officials eye changes to Biden's AI chip export rule, sources say
Trump officials are considering revising Biden's tiered AI chip export rule into a global licensing system, easing restrictions on some countries.
Meta introduces Llama API to attract AI developers
Meta launches the Llama API to attract AI developers, offering customizable, cost-efficient AI models with broad access and competitive advantages.
Meta Platforms launches standalone AI assistant app
Meta launches the Meta AI app with Llama 4, offering personalized, voice-driven AI assistant experiences across devices and platforms.
Trending AI Tool: Evidently AI
Evidently AI is an open-source tool that helps data scientists and ML engineers evaluate, test, and monitor machine learning models, ensuring they perform reliably in real-world scenarios. With over 100 built-in metrics, it offers features like data drift detection, model performance analysis, and interactive reports, making it a go-to for maintaining model health. Learn More
Thank you for joining us on this journey! Until next time, happy analyzing!
Upskill yourself with these Courses
- Python for Data Analysis
- SQL for Data Analysis
- Prompt Engineering: Foundations to Advanced Techniques
- Essentials of Marketing Analytics
Get 50% off with the coupon code BA50DKJNV7861. (For BAR readers)
Click Here to Explore More
(reply to this mail if you face any difficulty)