A multi-layer perceptron (MLP) is one of the simplest and most important neural network architectures. It is not flashy. It is not state of the art for language or vision by itself. But if you do not understand MLPs, a lot of modern deep learning stays blurry.

MLPs teach the core structure of neural networks:

  • inputs become vectors
  • layers apply learned linear transforms
  • activations add nonlinearity
  • deeper layers build more useful internal representations

They also still matter in practice. Even transformers contain MLP blocks. Recommendation systems, tabular models, and many small classifiers still use dense networks directly.

What Is a Multi-Layer Perceptron?

A multi-layer perceptron is a feedforward neural network built from stacked dense layers.

The word choices matter:

  • multi-layer means more than one learnable layer
  • perceptron comes from the historical single-neuron model
  • feedforward means information moves from input to output without cycles
  • dense or fully connected means every neuron in one layer connects to every neuron in the next

A typical MLP looks like this:

Input layer
  -> Hidden layer 1
  -> Hidden layer 2
  -> ...
  -> Output layer

Why a Single Neuron Is Not Enough

A single neuron can only compute a simple weighted combination of inputs followed by an activation.

Even if you use an activation, one neuron has very limited capacity. It cannot express rich hierarchical structure. More importantly, a single linear layer without nonlinear activation can only represent linear decision boundaries.

That is why we stack neurons into layers and layers into networks.

The Three Kinds of Layers

1. Input Layer

This is where the raw features enter the model.

Examples:

  • for house-price prediction: square footage, number of rooms, age
  • for tabular fraud detection: transaction amount, location, merchant category
  • for toy text experiments: bag-of-words or embedding vectors

The input layer usually does not perform computation by itself. It just provides the starting vector.

2. Hidden Layers

These are the computational core of the MLP.

Each hidden layer applies:

a = f(Wx + b)

The output of one hidden layer becomes the input to the next.

Why “hidden”? Because we do not directly observe or supervise these intermediate values. The network learns them internally.

3. Output Layer

The final layer maps the learned internal representation to the task output.

Typical output choices:

  • linear output for regression
  • sigmoid for binary classification
  • softmax for multi-class classification

The output layer is task-dependent in a way that hidden layers usually are not.

What Fully Connected Actually Means

Suppose one layer has 3 inputs and the next layer has 4 neurons.

In a fully connected layer:

  • neuron 1 sees all 3 inputs
  • neuron 2 also sees all 3 inputs
  • neuron 3 also sees all 3 inputs
  • neuron 4 also sees all 3 inputs

That means the weight matrix has shape:

4 x 3

Each row belongs to one neuron.

This dense connectivity makes MLPs flexible, but it also makes them parameter-heavy when input dimensions become large.

A Small MLP Example

Imagine a network with:

  • input size = 3
  • hidden layer 1 = 4 neurons
  • hidden layer 2 = 4 neurons
  • output size = 2

The forward pass is:

x         shape (3,)
 -> z1 = W1x + b1    shape (4,)
 -> a1 = ReLU(z1)
 -> z2 = W2a1 + b2   shape (4,)
 -> a2 = ReLU(z2)
 -> z3 = W3a2 + b3   shape (2,)
 -> y_hat

This is already a meaningful neural network.

Why Hidden Layers Matter

The real power of an MLP comes from composition.

Earlier layers can learn simpler patterns. Later layers can combine those into more abstract ones.

For example, in a toy credit-risk model:

  • the first layer might respond to income level, debt ratio, and past defaults
  • the next layer might combine those into broader “financial stability” signals
  • the output layer turns that representation into a probability of default

This is the general representation-learning story behind deep learning.

Width vs Depth

There are two obvious ways to make an MLP larger:

  • add more neurons per layer (width)
  • add more layers (depth)

Wide Networks

A wider network has more neurons in each layer.

Pros:

  • more capacity per layer
  • sometimes easier to optimize in small settings

Cons:

  • more parameters quickly
  • may be inefficient compared to deeper structures

Deep Networks

A deeper network has more stacked layers.

Pros:

  • can learn hierarchical features
  • often more parameter-efficient than making one shallow layer huge

Cons:

  • harder to train
  • more sensitive to gradient flow and initialization issues

Modern deep learning tends to prefer depth, but only when the architecture and optimization setup support it well.

The Universal Approximation Idea

You will often hear that a feedforward network with one hidden layer can approximate any continuous function, given enough neurons.

That statement is mathematically important, but it is often misunderstood.

What it does mean:

  • shallow networks can be extremely expressive in theory

What it does not mean:

  • they are easy to train
  • they are efficient
  • they are the best design in practice

A huge shallow network may be far less practical than a well-designed deeper one.

Why MLPs Struggle with Images and Sequences

MLPs connect everything to everything. That is fine for low-dimensional tabular data. It becomes awkward for structured data.

For images

A 100 x 100 image has 10,000 pixels. Flattening that into one vector and connecting it densely to even one hidden layer produces a massive number of parameters.

This is why convolutional networks were such a breakthrough for vision.

For sequences

Text and time series have order and local structure. Plain MLPs do not model that structure naturally. That is why RNNs, CNNs, and especially transformers became more suitable for those domains.

So MLPs are foundational, but not universal best-practice architectures.

Why MLPs Still Matter in Modern Models

It would be a mistake to think MLPs are obsolete.

They still appear everywhere:

  • tabular classification and regression
  • simple baseline models
  • smaller production systems
  • recommendation and ranking stacks
  • the feed-forward blocks inside transformers

In a transformer block, after attention mixes information across tokens, an MLP applies a per-token nonlinear transformation. So even GPT-style models still rely on dense-network ideas constantly.

Parameter Count in a Dense Layer

One dense layer with:

  • input size m
  • output size n

has:

  • m * n weights
  • n biases

So total parameters:

m * n + n

Example:

  • input size = 512
  • hidden size = 2048

Then one layer has:

512 * 2048 + 2048 = 1,050,624 parameters

This is why dense layers become expensive fast.

A Minimal PyTorch MLP

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3, 4),
    nn.ReLU(),
    nn.Linear(4, 4),
    nn.ReLU(),
    nn.Linear(4, 2)
)

x = torch.tensor([[0.2, -1.1, 3.0]])
logits = model(x)

print(logits)

This is a small MLP:

  • input dimension = 3
  • two hidden dense layers
  • output dimension = 2

The code is short because frameworks hide the tensor plumbing, but the architecture is still the same stack of dense layers and activations.

Design Choices That Matter

When building an MLP, the main decisions are:

Number of layers

Too few and the network may underfit. Too many and optimization becomes harder.

Hidden size

Larger hidden layers increase capacity but also parameter count and compute cost.

Activation function

ReLU is a common default. GELU and related functions are also popular in modern models.

Output activation

This must match the task:

  • regression: often no activation
  • binary classification: sigmoid
  • multi-class classification: softmax

Regularization

Dropout, weight decay, and early stopping can help prevent overfitting.

When an MLP Is a Good Choice

MLPs are still a strong option when:

  • the data is tabular
  • feature count is moderate
  • local spatial structure is not the main signal
  • you need a strong baseline quickly

They are often the right “boring” choice before reaching for something more complex.

When an MLP Is the Wrong Choice

An MLP is usually not the best first choice when:

  • the input is a large image
  • the input is long text or code
  • the task depends heavily on locality, order, or long-range interactions
  • parameter efficiency is critical

In those cases, CNNs, transformers, or hybrid architectures are often better matched to the data.

Summary

  • A multi-layer perceptron is a feedforward neural network made of stacked dense layers.
  • Fully connected means every neuron in one layer connects to every neuron in the next.
  • Hidden layers let the model build increasingly useful internal representations.
  • Width increases neurons per layer; depth increases number of layers.
  • MLPs are foundational, still widely useful, and still embedded inside modern architectures.
  • They are strongest on lower-dimensional or tabular problems and weaker on highly structured inputs like images and long sequences.

If you understand MLPs well, you understand the basic grammar of deep learning. Most other architectures are not replacements for that grammar. They are specialized extensions of it.