Attention & Transformer Layers
The Transformer architecture, based on self-attention, revolutionized NLP and now dominates many ML domains. This example demonstrates MultiheadAttention and TransformerEncoderLayer.
Deepbox Modules Used
- deepbox/ndarray
- deepbox/nn

What You Will Learn
- MultiheadAttention splits the dModel dimension across nHeads heads — each head can learn different attention patterns
- Query/Key/Value are projections of the input — self-attention uses the same input for all three
- TransformerEncoderLayer = SelfAttention + FFN + LayerNorm + Residual connections
- Attention output preserves sequence length and model dimension
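To make the head arithmetic concrete, here is a minimal, dependency-free sketch of scaled dot-product attention for a single head, using plain arrays rather than the deepbox API. The shapes match one head of the example below: with embedDim=8 split across numHeads=2, each head works in dimension 4. The helper names (`matmul`, `softmaxRows`, `attention`) are illustrative, not part of deepbox.

```typescript
// Scaled dot-product attention for one head (illustrative sketch).
// Q, K, V have shape (seq, headDim); the output keeps that shape.
type Mat = number[][];

const matmul = (a: Mat, b: Mat): Mat =>
  a.map((row) => b[0].map((_, j) => row.reduce((s, v, k) => s + v * b[k][j], 0)));

const transpose = (m: Mat): Mat => m[0].map((_, j) => m.map((row) => row[j]));

// Row-wise softmax (stabilized by subtracting the row max)
const softmaxRows = (m: Mat): Mat =>
  m.map((row) => {
    const mx = Math.max(...row);
    const e = row.map((v) => Math.exp(v - mx));
    const s = e.reduce((a, b) => a + b, 0);
    return e.map((v) => v / s);
  });

// attention(Q, K, V) = softmax(Q Kᵀ / sqrt(headDim)) V
function attention(Q: Mat, K: Mat, V: Mat): Mat {
  const d = Q[0].length;
  const scores = matmul(Q, transpose(K)).map((row) => row.map((v) => v / Math.sqrt(d)));
  return matmul(softmaxRows(scores), V);
}

// seq=3, headDim=4: one of the two heads when embedDim=8 is split over 2 heads
const X: Mat = [
  [1, 0, 1, 0],
  [0, 1, 0, 1],
  [1, 1, 0, 0],
];
const out = attention(X, X, X); // self-attention: Q = K = V
console.log(out.length, out[0].length); // 3 4 — sequence length and head dim preserved
```

Each of the 3 output rows is a softmax-weighted mixture of the value rows, which is why attention preserves sequence length and dimension.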
Source Code
29-attention-transformer/index.ts
import { GradTensor, tensor } from "deepbox/ndarray";
import { MultiheadAttention, TransformerEncoderLayer } from "deepbox/nn";

console.log("=== Attention & Transformer Layers ===\n");

// ---------------------------------------------------------------------------
// Part 1: Multi-Head Attention
// ---------------------------------------------------------------------------
console.log("--- Part 1: Multi-Head Attention ---");

// MultiheadAttention(embedDim, numHeads)
// embedDim must be divisible by numHeads
const mha = new MultiheadAttention(8, 2);
console.log("MultiheadAttention(embedDim=8, numHeads=2)");
console.log("  Each head has dimension 8/2 = 4\n");

// Input: (batch, seqLen, embedDim)
// Self-attention: query = key = value = same input
const seqData = tensor([
  [
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 1, 0, 0],
  ],
]);
console.log(`Input shape: [${seqData.shape.join(", ")}] (batch=1, seq=3, embed=8)`);

// Self-attention: Q=K=V=input
const attnOut = mha.forward(seqData, seqData, seqData);
const attnShape = attnOut instanceof GradTensor ? attnOut.tensor.shape : attnOut.shape;
console.log(`Output shape: [${attnShape.join(", ")}]`);
console.log("  Each position attends to all other positions\n");

// ---------------------------------------------------------------------------
// Part 2: TransformerEncoderLayer
// ---------------------------------------------------------------------------
console.log("--- Part 2: TransformerEncoderLayer ---");

// TransformerEncoderLayer combines:
//   MultiheadAttention + FeedForward + LayerNorm + Dropout
const encoderLayer = new TransformerEncoderLayer(8, 2, 16);
console.log("TransformerEncoderLayer(dModel=8, nHead=2, dimFeedforward=16)");
console.log(`Input shape: [${seqData.shape.join(", ")}]`);

const encoderOut = encoderLayer.forward(seqData);
const encShape = encoderOut instanceof GradTensor ? encoderOut.tensor.shape : encoderOut.shape;
console.log(`Output shape: [${encShape.join(", ")}]`);
console.log("  Full transformer encoder block with residual connections\n");

// ---------------------------------------------------------------------------
// Part 3: Parameter inspection
// ---------------------------------------------------------------------------
console.log("--- Part 3: Parameter Counts ---");
const mhaParams = Array.from(mha.parameters()).length;
const encParams = Array.from(encoderLayer.parameters()).length;
console.log(`MultiheadAttention params: ${mhaParams}`);
console.log(`TransformerEncoderLayer params: ${encParams}`);
console.log("  Encoder layer includes attention + feedforward + normalization");

console.log("\n=== Attention & Transformer Complete ===");

Console Output
$ npx tsx 29-attention-transformer/index.ts
=== Attention & Transformer Layers ===
--- Part 1: Multi-Head Attention ---
MultiheadAttention(embedDim=8, numHeads=2)
Each head has dimension 8/2 = 4
Input shape: [1, 3, 8] (batch=1, seq=3, embed=8)
Output shape: [1, 3, 8]
Each position attends to all other positions
--- Part 2: TransformerEncoderLayer ---
TransformerEncoderLayer(dModel=8, nHead=2, dimFeedforward=16)
Input shape: [1, 3, 8]
Output shape: [1, 3, 8]
Full transformer encoder block with residual connections
--- Part 3: Parameter Counts ---
MultiheadAttention params: 8
TransformerEncoderLayer params: 16
Encoder layer includes attention + feedforward + normalization
=== Attention & Transformer Complete ===
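The parameter counts above are counts of parameter *tensors*, not scalars. A hedged tally that is consistent with the printed numbers, assuming each linear projection contributes a weight and a bias tensor and each LayerNorm contributes a gain and a bias tensor:

```typescript
// Tally of parameter tensors (an assumption consistent with the output above,
// not a statement about deepbox internals).
const perLinear = 2;    // weight + bias
const perLayerNorm = 2; // gain + bias

// MultiheadAttention: Q, K, V, and output projections
const mhaTensors = 4 * perLinear; // 8

// Encoder layer: attention + two feedforward linears + two LayerNorms
const encTensors = mhaTensors + 2 * perLinear + 2 * perLayerNorm; // 8 + 4 + 4 = 16

console.log(mhaTensors, encTensors); // 8 16
```

This matches the printed counts of 8 and 16, and explains why the encoder layer exactly doubles the attention module's tensor count here.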