Activation Functions (ReLU, Sigmoid, Softmax)
Why activations are needed
Without activations, layers collapse into a single linear transformation.
Activations add non-linearity so networks can learn complex functions.
ReLU
ReLU(x) = max(0, x)
Common choice for hidden layers.
Pros:
- simple and cheap to compute
- mitigates vanishing gradients: its gradient is 1 for all positive inputs, unlike the saturating sigmoid
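A minimal sketch of ReLU using NumPy (assumed available); it just applies the elementwise max with zero:

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x): negatives become 0, positives pass through unchanged.
    return np.maximum(0, x)

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
```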
Sigmoid
Maps any real input to (0, 1):
sigmoid(x) = 1 / (1 + e^(-x))
Used for:
- binary classification output
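The formula above translates directly into code; this sketch assumes NumPy:

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)) squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))  # 0.5
```

Large positive inputs approach 1, large negative inputs approach 0, which is why the output can be read as a probability for the positive class.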
Softmax
Turns raw scores (logits) into a probability distribution over classes.
Used for:
- multiclass classification output
Logits → Softmax → Class probabilities
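The logits-to-probabilities step can be sketched as follows (NumPy assumed; the max is subtracted before exponentiating, a standard trick for numerical stability that does not change the result):

```python
import numpy as np

def softmax(z):
    # Shift by the max logit so exp() never overflows; softmax is shift-invariant.
    e = np.exp(z - np.max(z))
    # Normalize so the outputs are non-negative and sum to 1.
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.sum())  # 1.0
```

The largest logit always gets the largest probability, so argmax over the probabilities equals argmax over the raw scores.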
Typical choices
- hidden layers: ReLU
- binary output: sigmoid
- multiclass output: softmax
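Putting the typical choices together, here is a hypothetical two-layer forward pass in plain NumPy (the layer sizes 4 → 8 → 3 and random weights are illustrative, not from the text): ReLU in the hidden layer, softmax on the output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shapes: 4 input features, 8 hidden units, 3 classes.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def forward(x):
    h = np.maximum(0, x @ W1 + b1)   # hidden layer: ReLU
    z = h @ W2 + b2                  # raw output scores (logits)
    e = np.exp(z - z.max())
    return e / e.sum()               # multiclass output: softmax

probs = forward(rng.normal(size=4))  # 3 class probabilities summing to 1
```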
Mini-checkpoint
If your model is predicting 10 classes, what activation is typical in the last layer?
(Softmax.)
