deepbox/optim

Optimizers

Optimization algorithms that update model parameters to minimize the loss function. All optimizers take model.parameters() and a learning rate.

SGD

extends Optimizer

Stochastic Gradient Descent. The simplest optimizer: w ← w − lr · ∇L. Optionally supports momentum (accelerates convergence), weight decay (L2 regularization), and Nesterov momentum.
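
A construction sketch for these options is shown below. Only lr and momentum appear in the full example at the bottom of this page; weightDecay and nesterov are assumed option names that follow the same camelCase convention and may differ in the actual API.

import { SGD } from "deepbox/optim";
import { Sequential, Linear, ReLU } from "deepbox/nn";

const model = new Sequential(new Linear(2, 16), new ReLU(), new Linear(16, 1));

// Plain SGD: w ← w − lr · ∇L
const sgd = new SGD(model.parameters(), { lr: 0.01 });

// With momentum, L2 weight decay, and Nesterov momentum.
const sgdTuned = new SGD(model.parameters(), {
  lr: 0.01,
  momentum: 0.9,
  weightDecay: 1e-4, // assumed option name, not confirmed by these docs
  nesterov: true,    // assumed option name, not confirmed by these docs
});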

Adam

extends Optimizer

Adaptive Moment Estimation. Maintains per-parameter running averages of first moment (mean) and second moment (uncentered variance) of gradients. The default choice for most deep learning tasks. Combines benefits of AdaGrad and RMSProp.
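
The sketch below shows how the two moment estimates evolve for a single scalar gradient stream. It is plain TypeScript for illustration, not deepbox internals.

// Running estimates for one parameter; beta1/beta2 are the usual defaults.
const beta1 = 0.9;
const beta2 = 0.999;
let m = 0; // first moment: exponential moving average of gradients
let v = 0; // second moment: exponential moving average of squared gradients
for (const g of [0.5, -0.2, 0.1]) {
  m = beta1 * m + (1 - beta1) * g;
  v = beta2 * v + (1 - beta2) * g * g;
}
console.log(m, v); // m tracks the mean gradient, v its uncentered variance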

AdamW

extends Optimizer

Adam with decoupled weight decay. Fixes the weight decay implementation in Adam by applying it directly to parameters rather than through the gradient. Recommended over Adam when using weight decay.
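
A minimal scalar sketch of the difference, with Adam's adaptive scaling omitted for brevity; this is illustrative only, not the deepbox implementation.

// L2 regularization folded into the gradient (classic Adam + weight decay):
// the decay term then passes through the adaptive m/v scaling.
function l2ViaGradient(w: number, g: number, lr: number, wd: number): number {
  const gWithDecay = g + wd * w;
  return w - lr * gWithDecay; // adaptive scaling omitted
}

// Decoupled weight decay (AdamW): decay is applied directly to the parameter,
// independent of the gradient-based step.
function decoupledDecay(w: number, g: number, lr: number, wd: number): number {
  return w - lr * g - lr * wd * w; // adaptive scaling omitted
}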

RMSprop

extends Optimizer

Root Mean Square Propagation. Divides the gradient by a running average of its magnitude. Adapts learning rate per parameter. Originally proposed for training RNNs.
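
A scalar sketch of the update using a conventional smoothing factor and epsilon; illustrative only, and deepbox's hyperparameter names and defaults may differ.

const lr = 0.001;
const alpha = 0.99;  // decay rate of the running average (conventional value)
const eps = 1e-8;    // numerical stability
let sqAvg = 0;       // running average of squared gradients
let w = 1.0;
for (const g of [0.4, -0.1, 0.3]) {
  sqAvg = alpha * sqAvg + (1 - alpha) * g * g;
  w -= (lr * g) / (Math.sqrt(sqAvg) + eps); // per-parameter adaptive step
}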

Adagrad

extends Optimizer

Adaptive Gradient. Adapts the learning rate per parameter based on accumulated squared gradients. Well-suited for sparse data. The effective learning rate decays monotonically as squared gradients accumulate.
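
A scalar sketch showing why the effective step size shrinks; illustrative only, not deepbox internals.

const lr = 0.01;
const eps = 1e-10;  // numerical stability
let sumSq = 0;      // accumulated squared gradients; never decreases
let w = 1.0;
for (const g of [0.4, -0.1, 0.3]) {
  sumSq += g * g;
  // The denominator only grows, so the effective learning rate decays.
  w -= (lr * g) / (Math.sqrt(sumSq) + eps);
}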

AdaDelta

extends Optimizer

Extension of Adagrad that addresses its monotonically decreasing effective learning rate. Instead of accumulating all past squared gradients, it keeps an exponentially decaying average over a window of recent gradients.
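
A scalar sketch of the update, using rho and eps as in the original paper; illustrative only, and deepbox's option names and defaults may differ.

const rho = 0.9;    // decay rate of the running averages
const eps = 1e-6;   // numerical stability
let avgSqGrad = 0;  // decaying average of squared gradients
let avgSqStep = 0;  // decaying average of squared parameter updates
let w = 1.0;
for (const g of [0.4, -0.1, 0.3]) {
  avgSqGrad = rho * avgSqGrad + (1 - rho) * g * g;
  // Step size adapts from the ratio of the two running averages.
  const step = -(Math.sqrt(avgSqStep + eps) / Math.sqrt(avgSqGrad + eps)) * g;
  avgSqStep = rho * avgSqStep + (1 - rho) * step * step;
  w += step;
}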

Nadam

extends Optimizer

Nesterov-accelerated Adam. Combines Adam's adaptive learning rates with Nesterov momentum for faster convergence.
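
Assuming Nadam follows the same (params, { lr }) constructor convention as the other optimizers on this page (its defaults are not shown in these docs), construction looks like this:

import { Nadam } from "deepbox/optim";
import { Sequential, Linear, ReLU } from "deepbox/nn";

const model = new Sequential(new Linear(2, 16), new ReLU(), new Linear(16, 1));

// Drop-in replacement for Adam with a Nesterov-style lookahead on the
// first-moment term; the lr value here is illustrative.
const optimizer = new Nadam(model.parameters(), { lr: 0.002 });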

Optimizer API (Common Methods)

  • new Optimizer(params, { lr, ...opts }) — Create optimizer with model parameters
  • .step() — Update all parameters using computed gradients
  • .zeroGrad() — Reset all parameter gradients to zero (call before each forward pass)
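
These three calls compose into one canonical loop iteration. The sketch below is condensed from the optimizers.ts example at the bottom of this page and assumes model, optimizer, input, and target are constructed exactly as shown there.

optimizer.zeroGrad();                   // 1. reset gradients from the previous step
const output = model.forward(input);    // 2. forward pass (returns GradTensor)
const diff = (output as GradTensor).sub(target);
const loss = diff.mul(diff).mean();     // 3. MSE loss via GradTensor ops
loss.backward();                        // 4. backpropagation
optimizer.step();                       // 5. apply the parameter update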

SGD

w ← w − lr · ∇L

Where:

  • lr = Learning rate
  • ∇L = Gradient of loss
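
The same rule in plain TypeScript, applied element-wise to a parameter vector; illustrative only, not deepbox internals.

// Vanilla SGD step: w ← w − lr · ∇L
function sgdStep(w: Float64Array, grad: Float64Array, lr: number): void {
  for (let i = 0; i < w.length; i++) {
    w[i] -= lr * grad[i];
  }
}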

SGD + Momentum

v ← μv + ∇L; w ← w − lr · v

Where:

  • μ = Momentum coefficient (default: 0.9)
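
The momentum variant keeps a velocity buffer per parameter that persists across steps; plain TypeScript for illustration, not deepbox internals.

// SGD with momentum: v ← μv + ∇L; w ← w − lr · v
function sgdMomentumStep(
  w: Float64Array,
  grad: Float64Array,
  velocity: Float64Array, // same shape as w, initialized to zeros
  lr: number,
  mu = 0.9
): void {
  for (let i = 0; i < w.length; i++) {
    velocity[i] = mu * velocity[i] + grad[i];
    w[i] -= lr * velocity[i];
  }
}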

Adam

m ← β₁m + (1−β₁)g; v ← β₂v + (1−β₂)g²; w ← w − lr · m̂ / (√v̂ + ε)

Where:

  • β₁, β₂ = Decay rates for the first and second moments (0.9, 0.999)
  • m̂, v̂ = Bias-corrected moments: m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ), where t is the step count
  • ε = Small constant for numerical stability
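
One full Adam step in plain TypeScript, including the bias correction; illustrative only, not deepbox internals. m and v persist across calls, and t counts completed steps starting at 1.

function adamStep(
  w: Float64Array, grad: Float64Array,
  m: Float64Array, v: Float64Array, t: number,
  lr = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8
): void {
  for (let i = 0; i < w.length; i++) {
    m[i] = beta1 * m[i] + (1 - beta1) * grad[i];            // first moment
    v[i] = beta2 * v[i] + (1 - beta2) * grad[i] * grad[i];  // second moment
    const mHat = m[i] / (1 - Math.pow(beta1, t));           // bias correction
    const vHat = v[i] / (1 - Math.pow(beta2, t));
    w[i] -= (lr * mHat) / (Math.sqrt(vHat) + eps);          // parameter update
  }
}
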
optimizers.ts
import { SGD, Adam, AdamW } from "deepbox/optim";
import { Sequential, Linear, ReLU } from "deepbox/nn";
import { parameter, GradTensor } from "deepbox/ndarray";

const model = new Sequential(
  new Linear(2, 16),
  new ReLU(),
  new Linear(16, 1)
);

// Adam optimizer (default choice)
const optimizer = new Adam(model.parameters(), { lr: 0.01 });

// Training data
const input = parameter([[1, 2], [3, 4], [5, 6]]);
const target = parameter([[1], [0], [1]]);

// Training loop
for (let epoch = 0; epoch < 100; epoch++) {
  optimizer.zeroGrad();                   // Reset gradients
  const output = model.forward(input);    // Forward pass (returns GradTensor)
  const diff = (output as GradTensor).sub(target);
  const loss = diff.mul(diff).mean();     // MSE loss via GradTensor ops
  loss.backward();                        // Backpropagation
  optimizer.step();                       // Update weights
}

// SGD with momentum
const sgd = new SGD(model.parameters(), { lr: 0.01, momentum: 0.9 });

// AdamW with weight decay
const adamw = new AdamW(model.parameters(), { lr: 0.001, weightDecay: 0.01 });

Choosing an Optimizer

  • Adam — Default for most tasks. Good out-of-the-box without much tuning.
  • AdamW — When using weight decay (recommended over Adam for regularization).
  • SGD + Momentum — Often achieves better final accuracy than Adam with proper tuning (used in many research papers).
  • RMSprop — Good for RNNs and non-stationary objectives.
  • Adagrad — Sparse data (e.g., NLP with large vocabularies).