Example 29 (advanced)
Tags: Attention, Transformer, Self-Attention, Neural Networks

Attention & Transformer Layers

The Transformer architecture, based on self-attention, revolutionized NLP and now dominates many ML domains. This example demonstrates MultiheadAttention and TransformerEncoderLayer.

Deepbox Modules Used

deepbox/ndarray
deepbox/nn

What You Will Learn

  • MultiheadAttention splits embedDim across numHeads — each head learns different patterns
  • Query/Key/Value are projections of the input — self-attention uses the same input for all three
  • TransformerEncoderLayer = SelfAttention + FFN + LayerNorm + Residual connections
  • Attention output preserves sequence length and model dimension
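
The core operation behind these layers is scaled dot-product attention, softmax(Q·Kᵀ / √d)·V. The following is a minimal plain-TypeScript sketch of a single head — the helper names (qkvAttention, softmaxRow) are illustrative and not part of deepbox:

```typescript
type Matrix = number[][];

function matmul(a: Matrix, b: Matrix): Matrix {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0)),
  );
}

function transpose(m: Matrix): Matrix {
  return m[0].map((_, j) => m.map((row) => row[j]));
}

function softmaxRow(row: number[]): number[] {
  const max = Math.max(...row); // subtract max for numerical stability
  const exps = row.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Single-head scaled dot-product attention: softmax(Q·Kᵀ / sqrt(d))·V
function qkvAttention(q: Matrix, k: Matrix, v: Matrix): Matrix {
  const d = q[0].length;
  const scores = matmul(q, transpose(k)).map((row) =>
    row.map((s) => s / Math.sqrt(d)),
  );
  return matmul(scores.map(softmaxRow), v);
}

// Self-attention: Q = K = V = the same (seqLen=3, d=4) input
const x: Matrix = [
  [1, 0, 1, 0],
  [0, 1, 0, 1],
  [1, 1, 0, 0],
];
const out = qkvAttention(x, x, x);
console.log(out.length, out[0].length); // output keeps shape (3, 4)
```

Note how the output has the same sequence length and dimension as the input — this is the shape-preservation property the last bullet describes.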

Source Code

29-attention-transformer/index.ts
import { GradTensor, tensor } from "deepbox/ndarray";
import { MultiheadAttention, TransformerEncoderLayer } from "deepbox/nn";

console.log("=== Attention & Transformer Layers ===\n");

// ---------------------------------------------------------------------------
// Part 1: Multi-Head Attention
// ---------------------------------------------------------------------------
console.log("--- Part 1: Multi-Head Attention ---");

// MultiheadAttention(embedDim, numHeads)
// embedDim must be divisible by numHeads
const mha = new MultiheadAttention(8, 2);
console.log("MultiheadAttention(embedDim=8, numHeads=2)");
console.log("  Each head has dimension 8/2 = 4\n");

// Input: (batch, seqLen, embedDim)
// Self-attention: query = key = value = same input
const seqData = tensor([
  [
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 1, 0, 0],
  ],
]);
console.log(`Input shape: [${seqData.shape.join(", ")}]  (batch=1, seq=3, embed=8)`);

// Self-attention: Q=K=V=input
const attnOut = mha.forward(seqData, seqData, seqData);
const attnShape = attnOut instanceof GradTensor ? attnOut.tensor.shape : attnOut.shape;
console.log(`Output shape: [${attnShape.join(", ")}]`);
console.log("  Each position attends to all other positions\n");

// ---------------------------------------------------------------------------
// Part 2: TransformerEncoderLayer
// ---------------------------------------------------------------------------
console.log("--- Part 2: TransformerEncoderLayer ---");

// TransformerEncoderLayer combines:
//   MultiheadAttention + FeedForward + LayerNorm + Dropout
const encoderLayer = new TransformerEncoderLayer(8, 2, 16);
console.log("TransformerEncoderLayer(dModel=8, nHead=2, dimFeedforward=16)");
console.log(`Input shape: [${seqData.shape.join(", ")}]`);

const encoderOut = encoderLayer.forward(seqData);
const encShape = encoderOut instanceof GradTensor ? encoderOut.tensor.shape : encoderOut.shape;
console.log(`Output shape: [${encShape.join(", ")}]`);
console.log("  Full transformer encoder block with residual connections\n");

// ---------------------------------------------------------------------------
// Part 3: Parameter inspection
// ---------------------------------------------------------------------------
console.log("--- Part 3: Parameter Counts ---");
const mhaParams = Array.from(mha.parameters()).length;
const encParams = Array.from(encoderLayer.parameters()).length;
console.log(`MultiheadAttention params: ${mhaParams}`);
console.log(`TransformerEncoderLayer params: ${encParams}`);
console.log("  Encoder layer includes attention + feedforward + normalization");

console.log("\n=== Attention & Transformer Complete ===");
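
The "embedDim must be divisible by numHeads" constraint from Part 1 exists because the model dimension is partitioned evenly across heads. A plain-TypeScript sketch of that split (the splitHeads helper is illustrative, not a deepbox API):

```typescript
// Partition an embedding vector of length embedDim into numHeads chunks
// of headDim = embedDim / numHeads each.
function splitHeads(vec: number[], numHeads: number): number[][] {
  if (vec.length % numHeads !== 0) {
    throw new Error("embedDim must be divisible by numHeads");
  }
  const headDim = vec.length / numHeads;
  return Array.from({ length: numHeads }, (_, h) =>
    vec.slice(h * headDim, (h + 1) * headDim),
  );
}

const embedding = [1, 0, 1, 0, 1, 0, 1, 0]; // embedDim = 8
const heads = splitHeads(embedding, 2);
console.log(heads.length, heads[0].length); // 2 heads of dimension 4
```

Each chunk is attended over independently, and the per-head outputs are concatenated back to embedDim — which is why MultiheadAttention(8, 2) reports a head dimension of 8/2 = 4.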

Console Output

$ npx tsx 29-attention-transformer/index.ts
=== Attention & Transformer Layers ===

--- Part 1: Multi-Head Attention ---
MultiheadAttention(embedDim=8, numHeads=2)
  Each head has dimension 8/2 = 4

Input shape: [1, 3, 8]  (batch=1, seq=3, embed=8)
Output shape: [1, 3, 8]
  Each position attends to all other positions

--- Part 2: TransformerEncoderLayer ---
TransformerEncoderLayer(dModel=8, nHead=2, dimFeedforward=16)
Input shape: [1, 3, 8]
Output shape: [1, 3, 8]
  Full transformer encoder block with residual connections

--- Part 3: Parameter Counts ---
MultiheadAttention params: 8
TransformerEncoderLayer params: 16
  Encoder layer includes attention + feedforward + normalization

=== Attention & Transformer Complete ===
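
The counts above tally parameter tensors, not scalar weights. Assuming — and this is an inference about deepbox internals, not documented behavior — that each linear projection stores a weight and a bias, and each LayerNorm a scale and a shift, the printed numbers work out as follows:

```typescript
// Hypothetical accounting for the tensor counts printed above.
// Assumption: 2 tensors (weight + bias) per linear, 2 (gamma + beta) per norm.
const perLinear = 2;
const perNorm = 2;

// MultiheadAttention: Q, K, V, and output projections
const mhaTensors = 4 * perLinear;

// TransformerEncoderLayer: attention + 2 feedforward linears + 2 LayerNorms
const encoderTensors = mhaTensors + 2 * perLinear + 2 * perNorm;

console.log(mhaTensors, encoderTensors); // 8 16, matching the output above
```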