Activation Functions (ReLU, Sigmoid, Softmax)
Why activations are needed
Without activations, layers collapse into a single linear transformation.
Activations add non-linearity so networks can learn complex functions.
ReLU
ReLU(x) = max(0, x)
Common choice for hidden layers.
Pros:
- simple and cheap to compute
- mitigates vanishing gradients: its gradient is 1 for all positive inputs, unlike the saturating sigmoid
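A minimal sketch of ReLU using NumPy (assumed available); it just applies the elementwise max with zero:

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x): negatives become 0, positives pass through unchanged.
    return np.maximum(0, x)

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
```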
Sigmoid
Maps any real input to (0, 1):
sigmoid(x) = 1 / (1 + e^(-x))
Used for:
- binary classification output
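The formula above translates directly into code; this sketch assumes NumPy:

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)) squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))  # 0.5
```

Large positive inputs approach 1, large negative inputs approach 0, which is why the output can be read as a probability for the positive class.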
Softmax
Turns raw scores (logits) into a probability distribution over classes.
Used for:
- multiclass classification output
Logits → Softmax → Class probabilities
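The logits-to-probabilities step can be sketched as follows (NumPy assumed; the max is subtracted before exponentiating, a standard trick for numerical stability that does not change the result):

```python
import numpy as np

def softmax(z):
    # Shift by the max logit so exp() never overflows; softmax is shift-invariant.
    e = np.exp(z - np.max(z))
    # Normalize so the outputs are non-negative and sum to 1.
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.sum())  # 1.0
```

The largest logit always gets the largest probability, so argmax over the probabilities equals argmax over the raw scores.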
Typical choices
- hidden layers: ReLU
- binary output: sigmoid
- multiclass output: softmax
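Putting the typical choices together, here is a hypothetical two-layer forward pass in plain NumPy (the layer sizes 4 → 8 → 3 and random weights are illustrative, not from the text): ReLU in the hidden layer, softmax on the output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shapes: 4 input features, 8 hidden units, 3 classes.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def forward(x):
    h = np.maximum(0, x @ W1 + b1)   # hidden layer: ReLU
    z = h @ W2 + b2                  # raw output scores (logits)
    e = np.exp(z - z.max())
    return e / e.sum()               # multiclass output: softmax

probs = forward(rng.normal(size=4))  # 3 class probabilities summing to 1
```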
Mini-checkpoint
If your model is predicting 10 classes, what activation is typical in the last layer?
(Softmax.)
