By Ryan McBride
Machine Learning Fundamentals

Machine learning is just function approximation with extra steps. You feed data to an algorithm, it finds patterns, and it spits out a function that (hopefully) generalizes to data it hasn't seen before. That's it. Everything else — the jargon, the frameworks, the GPU clusters — is scaffolding around that core idea.

Let's break it down from a practitioner's perspective.

Learning Paradigms: The Four Flavors

Supervised Learning

This is the most straightforward paradigm and the one you'll encounter most in production. You have labeled data — inputs paired with known outputs — and the model learns the mapping between them. Think of it like training with an answer key.

Two main task types here:

  • Classification: the output is a category. Spam or not spam. Cat or dog. Fraud or legitimate. The model draws decision boundaries in feature space.

  • Regression: the output is a continuous number. House price. Temperature tomorrow. Expected revenue. The model fits a curve through your data points.

Common algorithms: linear/logistic regression, decision trees, random forests, gradient boosting (XGBoost, LightGBM), SVMs, neural networks. In practice, gradient boosting dominates tabular data and neural nets dominate unstructured data (text, images, audio).
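To make the regression case concrete, here is a minimal pure-Python sketch (toy data, not a production implementation): a linear model learning the mapping from labeled input/output pairs via batch gradient descent on mean squared error.

```python
# Minimal supervised regression sketch: recover y = 2x + 1 from labeled
# pairs using batch gradient descent on mean squared error (toy data).

def fit_linear(data, lr=0.05, epochs=500):
    w, b = 0.0, 0.0  # parameters: learned from the data, not set by us
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w  # nudge weights against the gradient
        b -= lr * grad_b
    return w, b

labeled = [(x, 2 * x + 1) for x in range(-3, 4)]  # inputs paired with known outputs
w, b = fit_linear(labeled)
print(round(w, 2), round(b, 2))  # recovers the true slope 2 and intercept 1
```

Swap the squared-error loss for cross-entropy and the linear output for a sigmoid and you have logistic regression, the classification counterpart.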

Unsupervised Learning

No labels. You're asking the model to find structure in the data on its own. The two big use cases:

  • Clustering: grouping similar items together. Customer segmentation, document grouping, anomaly detection (the point that doesn't fit any cluster is suspicious). K-means, DBSCAN, hierarchical clustering.

  • Dimensionality reduction: compressing many features into fewer ones while preserving meaningful variance. PCA is the classic. Useful for visualization, noise reduction, and preprocessing before supervised learning.

Unsupervised learning is trickier to evaluate because there's no ground truth to compare against. You're often eyeballing results or using proxy metrics like silhouette scores.
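A toy sketch of the clustering case: 1-D k-means in pure Python, with made-up data. Note that no labels appear anywhere; the assignment/update loop discovers the two groups on its own.

```python
# Toy unsupervised clustering: 1-D k-means with k=2 (pure Python).
# No labels anywhere -- the algorithm finds the structure itself.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = {c: [] for c in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [sum(ps) / len(ps) if ps else centers[c]
                   for c, ps in clusters.items()]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]      # two obvious groups
print(kmeans_1d(data, centers=[0.0, 5.0]))   # centers settle near 1.0 and 10.0
```

Real implementations (scikit-learn's KMeans, for instance) add smarter initialization and convergence checks, but the assign/update loop is the whole idea.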

Self-Supervised Learning

This is how foundation models (FMs) and large language models (LLMs) are pre-trained. The model creates its own labels from the data itself. For a language model, you mask a word and ask it to predict the missing token. For vision models, you might mask patches of an image.

The key insight: self-supervised learning lets you leverage massive amounts of unlabeled data — which is cheap and abundant — to learn general representations. Those representations then transfer to downstream tasks with minimal labeled data. This is why GPT can write code despite never being explicitly trained on "input: prompt, output: code" pairs.
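The "model creates its own labels" idea is easy to show concretely. A hypothetical sketch of the data-preparation side: turning a raw sentence into (masked input, target) training pairs with no human annotation involved.

```python
# Self-supervised labeling sketch: derive (masked input, target) pairs
# from raw text alone -- the labels come from the data itself.

def mask_examples(sentence, mask="[MASK]"):
    tokens = sentence.split()
    examples = []
    for i, tok in enumerate(tokens):
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        examples.append((" ".join(masked), tok))  # (input, label), no annotator needed
    return examples

pairs = mask_examples("the cat sat down")
print(pairs[1])  # ('the [MASK] sat down', 'cat')
```

Every sentence of unlabeled text yields as many training examples as it has tokens, which is why the approach scales to web-sized corpora.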

Reinforcement Learning

The model (agent) takes actions in an environment and receives rewards or penalties. It learns a policy that maximizes cumulative reward. This is how game-playing AIs work, and it's relevant to ML practitioners primarily through RLHF (Reinforcement Learning from Human Feedback) — the technique used to fine-tune LLMs to follow instructions and be helpful rather than just autocompleting text.

In RLHF, human raters rank model outputs, a reward model is trained on those preferences, and then the LLM is fine-tuned against that reward model using reinforcement learning (typically PPO). This is why ChatGPT feels different from a raw GPT model — RLHF is the secret sauce.
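The reward-model step can be sketched in a few lines. This is the standard Bradley-Terry preference loss used to train reward models from ranked pairs; the real thing scores full model outputs with a neural network, while this toy uses bare scalar scores.

```python
# Reward-model sketch: the Bradley-Terry preference loss used in RLHF.
# Given scores for a human-preferred output and a rejected one, the loss
# pushes the chosen score above the rejected score.
import math

def preference_loss(r_chosen, r_rejected):
    # -log(sigmoid(r_chosen - r_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# the loss is small when the reward model already ranks the preferred
# answer higher, and large when it has the pair backwards
print(preference_loss(2.0, 0.5) < preference_loss(0.5, 2.0))  # True
```

The LLM is then fine-tuned (typically with PPO) to produce outputs that score highly under this learned reward.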

The Training Pipeline: What Actually Happens

Data Splitting

You never evaluate a model on the data it trained on. That's like grading a student using the exact questions from their homework. Standard practice:

  • Training set (~70-80%): the model learns from this.

  • Validation set (~10-15%): you use this to tune hyperparameters and monitor training. The model never directly trains on this, but your decisions are influenced by it.

  • Test set (~10-15%): held completely aside until the very end. This gives you an unbiased estimate of real-world performance.

If you tune your hyperparameters obsessively against the validation set, you'll eventually overfit to it too — which is why the test set exists as a final sanity check.
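A minimal sketch of the three-way split (pure Python; libraries like scikit-learn offer the same thing as `train_test_split`): shuffle once with a fixed seed so the split is reproducible, then carve off the slices.

```python
# Three-way split sketch: shuffle once with a fixed seed, then carve
# out 80/10/10 train/validation/test slices.
import random

def split_data(rows, train_frac=0.8, val_frac=0.1, seed=42):
    rows = rows[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(rows)   # fixed seed -> reproducible split
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])     # test set: held aside until the end

train, val, test = split_data(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```

One caveat the sketch glosses over: for time-series data you split by time rather than shuffling, or you leak the future into the past.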

Parameters vs. Hyperparameters

This distinction trips up a lot of engineers:

  • Parameters are learned by the model during training. Weights in a neural network, coefficients in linear regression. You don't set these — the algorithm does via gradient descent (or whatever optimization it uses).

  • Hyperparameters are set by you before training. Learning rate, batch size, number of layers, regularization strength, number of trees in a random forest. These control how the model learns, not what it learns.

Hyperparameter tuning is essentially a search problem. Grid search, random search, Bayesian optimization — pick your poison. In AWS land, SageMaker has built-in hyperparameter tuning jobs that handle this.
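Grid search is simple enough to sketch in full. A toy example with made-up data: the weight is a parameter (learned inside `train`), while the learning rate is a hyperparameter chosen by us, outside training, against held-out validation data.

```python
# Hyperparameter search sketch: grid search over the learning rate.
# Parameters (w) are learned inside train(); the learning rate is set
# by us and scored on a validation set the model never trains on.

def train(data, lr, epochs=100):
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

train_set = [(1, 3.1), (2, 5.9), (3, 9.2)]
val_set = [(4, 12.1), (5, 14.8)]

# too low never converges, too high diverges, the middle one wins
best = min([0.001, 0.05, 0.3], key=lambda lr: mse(train(train_set, lr), val_set))
print(best)  # 0.05
```

Bayesian optimization and SageMaker's tuning jobs do the same thing more cleverly: instead of trying every grid point, they use earlier results to decide which configuration to try next.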

Overfitting vs. Underfitting: The Central Tension

This is the most important concept in applied ML. Get it wrong and nothing else matters.

Overfitting (High Variance)

Your model memorized the training data instead of learning generalizable patterns. It performs brilliantly on training data and terribly on anything new. It's like a student who memorized every practice problem but can't solve a novel one.

Signs: huge gap between training accuracy and validation accuracy. Training loss keeps dropping but validation loss starts climbing.

Fixes:

  • More data: the single most effective cure. Harder to memorize a million examples than a thousand.

  • Regularization: L1 (Lasso) or L2 (Ridge) penalties that discourage overly complex models by penalizing large weights.

  • Dropout: randomly zeroing out neurons during training, forcing the network to not rely on any single path. Like training a team where random members call in sick each day — the team gets more resilient.

  • Early stopping: stop training when validation loss starts increasing, even if training loss is still decreasing.

  • Simpler model: fewer layers, fewer trees, lower-degree polynomial. Sometimes the answer is simply a less powerful model.

  • Data augmentation: artificially expanding your dataset through transformations (rotating images, adding noise, etc.).
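Early stopping is the easiest of these fixes to show in code. A sketch with a made-up validation-loss history: stop once validation loss hasn't improved for `patience` consecutive epochs, and keep the best epoch seen.

```python
# Early-stopping sketch: halt when validation loss hasn't improved
# for `patience` consecutive epochs, keeping the best epoch seen.

def early_stop(val_losses, patience=2):
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:   # validation loss has started climbing
                break
    return best_epoch, best_loss

# training loss would keep falling, but validation loss bottoms out at epoch 3
history = [0.90, 0.55, 0.40, 0.38, 0.41, 0.45, 0.52]
print(early_stop(history))  # (3, 0.38)
```

In practice you also checkpoint the model weights at each new best epoch, so "keep the best epoch" means restoring that checkpoint.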

Underfitting (High Bias)

Your model is too simple to capture the underlying patterns. It performs poorly on both training and validation data. It's like trying to model a curve with a straight line.

Signs: low accuracy on both training and validation. The model basically never learned anything useful.

Fixes:

  • More complex model: more layers, more trees, higher-degree polynomial.

  • More features: give the model more information to work with.

  • Longer training: sometimes the model just needs more epochs.

  • Less regularization: you might be constraining the model too aggressively.
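The "more features" fix can be demonstrated directly. A toy sketch: a straight line can't capture quadratic data (high bias), but handing the same linear learner one extra feature, x squared, gives it enough capacity.

```python
# Underfitting sketch: a straight line can't fit quadratic data, but
# adding one feature (x squared) lets the same linear learner capture it.

def fit(features, ys, lr=0.02, epochs=3000):
    ws = [0.0] * len(features[0])
    n = len(ys)
    for _ in range(epochs):
        preds = [sum(w * f for w, f in zip(ws, row)) for row in features]
        for j in range(len(ws)):
            grad = sum(2 * (p - y) * row[j]
                       for p, y, row in zip(preds, ys, features)) / n
            ws[j] -= lr * grad
    return ws

def mse(ws, features, ys):
    preds = [sum(w * f for w, f in zip(ws, row)) for row in features]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]                # true relationship is quadratic
linear = [[1.0, x] for x in xs]         # bias + x: underfits
richer = [[1.0, x, x * x] for x in xs]  # bias + x + x^2: enough capacity

print(mse(fit(linear, ys), linear, ys) > mse(fit(richer, ys), richer, ys))  # True
```

This is exactly what polynomial feature expansion (scikit-learn's PolynomialFeatures, for example) automates.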

The Bias-Variance Tradeoff

This is the theoretical framework behind overfitting/underfitting. Bias is systematic error from overly simplistic assumptions (underfitting). Variance is sensitivity to fluctuations in training data (overfitting). You want the sweet spot — a model complex enough to capture real patterns but constrained enough to not memorize noise.

In practice, modern deep learning somewhat breaks this classical tradeoff (very large models can generalize well despite having the capacity to memorize, a phenomenon known as double descent), but the intuition still holds for most ML work.

Neural Networks: The 60-Second Version

A neural network is layers of weighted connections between nodes (neurons). Each neuron:

  1. Takes inputs

  2. Multiplies by weights

  3. Adds a bias

  4. Passes through an activation function (ReLU, sigmoid, etc.)

Training uses backpropagation + gradient descent: compute the loss (how wrong the model is), calculate gradients (which direction to adjust weights), and nudge weights to reduce the loss. Repeat for many iterations (epochs) over the training data.

Deep learning = neural networks with many layers. Depth lets the network learn hierarchical features — early layers detect edges, middle layers detect shapes, later layers detect objects.

Key hyperparameters: learning rate (step size for weight updates — too high and you overshoot, too low and training takes forever), batch size (how many examples to process before updating weights), number of epochs, architecture (how many layers, how wide).
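The four neuron steps and one round of backpropagation can be sketched for a single neuron (a toy, with made-up numbers; real frameworks compute these gradients automatically across millions of neurons):

```python
# One-neuron sketch of the four steps above: weighted inputs plus bias,
# through a ReLU activation, then one backprop/gradient-descent update.

def relu(z):
    return max(0.0, z)

def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # steps 1-3
    return relu(z)                                          # step 4

# one training example and a squared-error loss
x, target = [1.0, 2.0], 1.0
w, b, lr = [0.5, -0.25], 0.1, 0.05

pred = neuron(x, w, b)                 # 0.5*1 - 0.25*2 + 0.1 = 0.1
loss = (pred - target) ** 2            # how wrong the model is
# gradients: dloss/dw_i = 2*(pred - target)*x_i  (ReLU active, slope 1)
grads = [2 * (pred - target) * xi for xi in x]
w = [wi - lr * g for wi, g in zip(w, grads)]   # nudge weights downhill
b -= lr * 2 * (pred - target)

print(neuron(x, w, b) > pred)  # True: prediction moved toward the target
```

Training is this update repeated across every weight in every layer, for every batch, for many epochs.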

AWS Context: Where This All Lives

On AWS, the ML fundamentals map to specific services:

  • SageMaker is the full-lifecycle platform. Data prep (Data Wrangler), labeling (Ground Truth), feature management (Feature Store), training, hyperparameter tuning, deployment, and monitoring — all under one roof.

  • SageMaker JumpStart gives you pre-trained models and foundation models so you don't start from scratch.

  • SageMaker Canvas is the no-code option for business analysts who want to build models without writing code.

  • SageMaker Clarify handles bias detection and model explainability — critical for production ML.

The key architectural decision: Bedrock vs. SageMaker. If you want to use existing foundation models via API (the managed, serverless path), that's Bedrock. If you need to train your own models, bring custom algorithms, or have full control over the ML lifecycle, that's SageMaker. JumpStart bridges the gap — it lives in SageMaker but offers pre-trained models similar to what Bedrock provides.

The Customization Ladder

When you need a model to do something specific, always start cheap and escalate:

  1. Prompt engineering — free, instant. Rewrite your prompt to get better results.

  2. RAG — inject relevant documents into the context. No retraining needed.

  3. Fine-tuning — supervised learning on labeled task-specific data. The model adjusts its weights. Costs compute and requires good labeled data.

  4. Continued pre-training — feed the model large amounts of unlabeled domain data (medical papers, legal documents). Broadens its knowledge base.

  5. Train from scratch — nuclear option. Rarely needed unless you have a truly novel domain and massive compute budget.

This ladder is critical. Engineers reflexively jump to fine-tuning when prompt engineering or RAG would solve the problem at 1% of the cost. Always try the cheaper option first.

Practical Takeaways

If you remember nothing else:

  • Supervised = labeled data, unsupervised = find structure, self-supervised = FM pretraining, reinforcement = reward signals (RLHF)

  • Overfitting is the most common failure mode in production ML. More data and regularization are your friends.

  • Always split your data into train/validation/test. Never evaluate on training data.

  • Parameters are learned; hyperparameters are set by you.

  • Start simple on the customization ladder. Prompt engineering before RAG before fine-tuning before training from scratch.

  • The bias-variance tradeoff is the lens through which you should view every model performance issue.

ML isn't magic. It's optimization. Understand the fundamentals, and the managed services, frameworks, and infrastructure all start making a lot more sense.