23
Cross-Validation
Model Selection
Cross-Validation Strategies
A single train/test split can give misleading results: the score depends on which samples happen to land in the test set. Cross-validation addresses this by evaluating the model on multiple different splits and aggregating the scores. This example walks through KFold, StratifiedKFold, and LeaveOneOut; the related LeavePOut, GroupKFold, and trainTestSplit utilities follow the same split-iterator pattern.
Deepbox Modules Used
- deepbox/ndarray
- deepbox/preprocess
What You Will Learn
- KFold gives k train/test splits — the standard CV strategy
- StratifiedKFold preserves class proportions — use for imbalanced datasets
- LeaveOneOut uses each sample as test once — maximum data usage but slow
- GroupKFold keeps groups intact — critical when samples are correlated
- Average CV score ± std gives a reliable performance estimate (these last two points are sketched right after this list)
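The last two bullets go beyond what the example code below exercises, so here is a combined sketch. Hedged assumptions: GroupKFold is named by this example but not shown, so its constructor options and split(X, y, groups) signature are assumed by analogy with KFold rather than verified, and evaluateFold is a hypothetical placeholder for a real fit-on-train / score-on-test step.

import { tensor } from "deepbox/ndarray";
import { GroupKFold } from "deepbox/preprocess";

// Six samples from three subjects; a subject must never appear in both
// train and test, or correlated rows leak into the evaluation
const X = tensor([[1], [2], [3], [4], [5], [6]]);
const groups = [0, 0, 1, 1, 2, 2];

// Hypothetical scorer: swap in a real fit/predict/score for your model
const evaluateFold = (_train: ArrayLike<number>, _test: ArrayLike<number>): number =>
  0.9 + 0.05 * Math.random();

// Assumed signature, mirroring KFold: split(X, y?, groups?) yields index pairs
const gkf = new GroupKFold({ nSplits: 3 });
const scores: number[] = [];
for (const { trainIndex, testIndex } of gkf.split(X, undefined, groups)) {
  scores.push(evaluateFold(trainIndex, testIndex));
}

// Report the per-fold scores as mean ± standard deviation
const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
const std = Math.sqrt(scores.reduce((a, b) => a + (b - mean) ** 2, 0) / scores.length);
console.log(`CV score: ${mean.toFixed(3)} ± ${std.toFixed(3)}`);

The mean ± std pattern carries over unchanged to KFold and StratifiedKFold; only the splitter differs.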
Source Code
23-cross-validation/index.ts
import { tensor } from "deepbox/ndarray";
import { KFold, LeaveOneOut, StratifiedKFold } from "deepbox/preprocess";

// Generate synthetic linear data: y = 2x + 3 + noise
const X_data: number[][] = [];
const y_data: number[] = []; // collected for completeness; the split demos below only need X

// Populate data arrays with synthetic data
for (let i = 0; i < 50; i++) {
  const x = i / 5;
  const y = 2 * x + 3 + (Math.random() - 0.5);
  X_data.push([x]);
  y_data.push(y);
}

// Convert data to a tensor
const X = tensor(X_data);

// Display dataset size
console.log(`Dataset: ${X.shape[0]} samples\n`);

// 1. K-Fold Cross-Validation: split data into k equal folds
console.log("1. K-Fold Cross-Validation (k=5):");
console.log("-".repeat(50));

// Create 5-fold cross-validator with shuffling
const kfold = new KFold({ nSplits: 5, shuffle: true, randomState: 42 });

// Initialize fold counter
let foldNum = 1;

// Iterate through each fold
for (const { trainIndex, testIndex } of kfold.split(X)) {
  // Note: in a real scenario you'd use gather() to index the data
  // (see the sketch after the console output); here we just count the splits
  console.log(`Fold ${foldNum}: Train=${trainIndex.length}, Test=${testIndex.length}`);
  foldNum++;
}

// Display total number of folds
console.log(`\nTotal folds: ${kfold.getNSplits()}\n`);

// 2. Stratified K-Fold (for classification): preserves class distribution
console.log("2. Stratified K-Fold:");
console.log("-".repeat(50));

// Create classification data with 3 classes
const y_class = tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]);
const X_class = tensor(
  Array(12)
    .fill(0)
    .map((_, i) => [i])
);

// Stratified split maintains class proportions in each fold
const stratified = new StratifiedKFold({
  nSplits: 3,
  shuffle: true,
  randomState: 42,
});

// Reset fold counter
foldNum = 1;

// Iterate through each fold
for (const { trainIndex, testIndex } of stratified.split(X_class, y_class)) {
  console.log(`Fold ${foldNum}: Train=${trainIndex.length}, Test=${testIndex.length}`);
  foldNum++;
}

console.log("\nStratified K-Fold preserves class distribution in each fold\n");

// 3. Leave-One-Out Cross-Validation: train on n-1 samples, test on 1
console.log("3. Leave-One-Out Cross-Validation:");
console.log("-".repeat(50));

const X_small = tensor([[1], [2], [3], [4], [5]]);

// LOO creates n folds for n samples
const loo = new LeaveOneOut();
const looFolds = Array.from(loo.split(X_small));

console.log(`Total folds: ${looFolds.length}`);
console.log("Each fold uses n-1 samples for training, 1 for testing");
console.log("Useful for small datasets but computationally expensive\n");

// Summary of when to use each method
console.log("Key Insights:");
console.log("• K-Fold: Good balance between bias and variance");
console.log("• Stratified K-Fold: Maintains class distribution (classification)");
console.log("• Leave-One-Out: Maximum training data, high variance");

console.log("\n✓ Cross-validation complete!");
Console Output
$ npx tsx 23-cross-validation/index.ts
Dataset: 50 samples
1. K-Fold Cross-Validation (k=5):
--------------------------------------------------
Fold 1: Train=40, Test=10
Fold 2: Train=40, Test=10
Fold 3: Train=40, Test=10
Fold 4: Train=40, Test=10
Fold 5: Train=40, Test=10
Total folds: 5
2. Stratified K-Fold:
--------------------------------------------------
Fold 1: Train=6, Test=6
Fold 2: Train=9, Test=3
Fold 3: Train=9, Test=3
Stratified K-Fold preserves class distribution in each fold
3. Leave-One-Out Cross-Validation:
--------------------------------------------------
Total folds: 5
Each fold uses n-1 samples for training, 1 for testing
Useful for small datasets but computationally expensive
Key Insights:
• K-Fold: Good balance between bias and variance
• Stratified K-Fold: Maintains class distribution (classification)
• Leave-One-Out: Maximum training data, high variance
✓ Cross-validation complete!
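Using the Fold Indices
The loops above only count indices; to actually train per fold you materialize the row subsets. The in-code note attributes this to gather(), so here is a minimal sketch of that step. The gather signature (an index tensor selecting rows along axis 0) is an assumption taken from that comment, not a verified API.

import { tensor } from "deepbox/ndarray";
import { KFold } from "deepbox/preprocess";

const X = tensor([[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]);

const kfold = new KFold({ nSplits: 5, shuffle: true, randomState: 42 });
for (const { trainIndex, testIndex } of kfold.split(X)) {
  // Assumption: gather(indices) selects rows along axis 0, and the
  // index lists are plain number arrays as their .length usage suggests
  const XTrain = X.gather(tensor(trainIndex));
  const XTest = X.gather(tensor(testIndex));
  console.log(`Train rows: ${XTrain.shape[0]}, Test rows: ${XTest.shape[0]}`);
}

In a full pipeline you would fit on XTrain (plus the matching y rows), score on XTest, and collect the per-fold scores as in the averaging sketch earlier.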