22
Datasets
Data Loading
Built-in Datasets
Deepbox ships with 24 real-world datasets and 6 synthetic data generators. The built-in datasets include classic ML benchmarks (Iris, Digits, Breast Cancer, Diabetes, Linnerud) and domain-specific datasets (Housing, Student Performance, Weather, Crop Yield, and more). Synthetic generators let you create datasets with known properties.
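To make "known properties" concrete, here is a minimal, self-contained sketch of a blob-style generator. It does not use Deepbox and is not how Deepbox implements `makeBlobs`; the names `mulberry32`, `gaussian`, `sketchBlobs`, and `clusterMean` are hypothetical helpers introduced only for illustration. It shows that when the cluster centers are chosen up front, the generated data can be checked against them, and that a fixed seed gives reproducible output.

```typescript
// Illustrative sketch only: none of this is Deepbox code.

// mulberry32: small deterministic PRNG that returns floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Box-Muller transform: turns two uniform draws into one standard normal draw.
function gaussian(rand: () => number): number {
  const u = 1 - rand(); // in (0, 1], so Math.log(u) is finite
  const v = rand();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

type Blobs = { X: number[][]; y: number[] };

// Draw `perCenter` points around each center with isotropic Gaussian noise.
function sketchBlobs(centers: number[][], perCenter: number, std: number, seed: number): Blobs {
  const rand = mulberry32(seed);
  const X: number[][] = [];
  const y: number[] = [];
  centers.forEach((center, label) => {
    for (let i = 0; i < perCenter; i++) {
      X.push(center.map((c) => c + std * gaussian(rand)));
      y.push(label);
    }
  });
  return { X, y };
}

// Per-feature mean of all points carrying the given label.
function clusterMean(blobs: Blobs, label: number): number[] {
  const rows = blobs.X.filter((_, i) => blobs.y[i] === label);
  return rows[0].map((_, j) => rows.reduce((sum, row) => sum + row[j], 0) / rows.length);
}

const knownCenters = [[-5, 0], [0, 5], [5, 0]];
const blobs = sketchBlobs(knownCenters, 200, 0.5, 42);
const means = knownCenters.map((_, label) => clusterMean(blobs, label));

console.log(`Generated ${blobs.X.length} samples`); // 3 centers x 200 points = 600
means.forEach((m, k) => {
  console.log(`cluster ${k}: center ${knownCenters[k].join(", ")} | mean ${m.map((v) => v.toFixed(2)).join(", ")}`);
});
```

Because the centers are known in advance, a clustering algorithm run on this data can be scored against ground truth rather than judged by eye.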
Deepbox Modules Used
deepbox/datasets
What You Will Learn
- 24 built-in datasets ready for immediate use — no downloads needed
- Classic benchmarks: Iris (classification), Digits (image), Diabetes (regression)
- Synthetic generators let you control difficulty, noise, and structure
- makeBlobs for clustering, makeCircles/makeMoons for non-linear classification
- makeRegression generates data with known true coefficients for verification
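The last point, verifying against known true coefficients, can be sketched without Deepbox at all. The following is a hedged illustration under stated assumptions, not the library's `makeRegression`: `lcg`, `sketchRegression`, and `fitLeastSquares` are hypothetical names, features and noise are drawn uniformly (a real generator may well use Gaussian draws), and there is no intercept term. The idea is to build `y` from coefficients you picked, fit ordinary least squares, and confirm that the estimate lands close to the truth.

```typescript
// Illustrative sketch only: hypothetical helpers, not Deepbox's makeRegression.

// Simple seeded linear congruential generator returning floats in [0, 1).
function lcg(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(1664525, s) + 1013904223) >>> 0;
    return s / 4294967296;
  };
}

// y = X · trueCoef + uniform noise in [-noise, noise]; features are uniform in [-1, 1].
function sketchRegression(nSamples: number, trueCoef: number[], noise: number, seed: number) {
  const rand = lcg(seed);
  const X: number[][] = [];
  const y: number[] = [];
  for (let i = 0; i < nSamples; i++) {
    const row = trueCoef.map(() => 2 * rand() - 1);
    const signal = row.reduce((sum, v, j) => sum + v * trueCoef[j], 0);
    X.push(row);
    y.push(signal + noise * (2 * rand() - 1));
  }
  return { X, y };
}

// Ordinary least squares via the normal equations (XᵀX) w = Xᵀy,
// solved with Gauss-Jordan elimination and partial pivoting.
function fitLeastSquares(X: number[][], y: number[]): number[] {
  const p = X[0].length;
  // Augmented matrix [XᵀX | Xᵀy], one row per coefficient.
  const A: number[][] = [];
  for (let i = 0; i < p; i++) {
    const row: number[] = [];
    for (let j = 0; j < p; j++) {
      row.push(X.reduce((sum, r) => sum + r[i] * r[j], 0));
    }
    row.push(X.reduce((sum, r, k) => sum + r[i] * y[k], 0));
    A.push(row);
  }
  for (let col = 0; col < p; col++) {
    let pivot = col;
    for (let r = col + 1; r < p; r++) {
      if (Math.abs(A[r][col]) > Math.abs(A[pivot][col])) pivot = r;
    }
    const tmp = A[col];
    A[col] = A[pivot];
    A[pivot] = tmp;
    for (let r = 0; r < p; r++) {
      if (r === col) continue;
      const factor = A[r][col] / A[col][col];
      for (let c = col; c <= p; c++) A[r][c] -= factor * A[col][c];
    }
  }
  return A.map((row, i) => row[p] / row[i]);
}

const trueCoef = [2, -1, 0.5];
const { X, y } = sketchRegression(100, trueCoef, 0.1, 42);
const estimated = fitLeastSquares(X, y);

console.log("true coefficients:     ", trueCoef.join(", "));
console.log("estimated coefficients:", estimated.map((v) => v.toFixed(3)).join(", "));
```

If a model or solver fails to recover the coefficients on data like this, the bug is in the model or solver rather than in the data, which is what makes synthetic regression sets useful as a sanity check.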
Source Code
22-datasets/index.ts
import {
  loadBreastCancer,
  loadConcentricRings,
  loadCropYield,
  loadCustomerSegments,
  loadDiabetes,
  loadDigits,
  loadEnergyEfficiency,
  loadFitnessScores,
  loadFlowersExtended,
  loadFruitQuality,
  loadGaussianIslands,
  loadHousingMini,
  loadIris,
  loadLeafShapes,
  loadLinnerud,
  loadMoonsMulti,
  loadPerfectlySeparable,
  loadPlantGrowth,
  loadSeedMorphology,
  loadSensorStates,
  loadSpiralArms,
  loadStudentPerformance,
  loadTrafficConditions,
  loadWeatherOutcomes,
  makeBlobs,
  makeCircles,
  makeClassification,
  makeGaussianQuantiles,
  makeMoons,
  makeRegression,
} from "deepbox/datasets";

console.log("=== Built-in Datasets ===\n");

// ─── Classic Reference Datasets ─────────────────────────────────────────────

console.log("--- Classic Reference Datasets ---\n");

console.log("1. Iris Dataset:");
console.log("-".repeat(50));
const iris = loadIris();
console.log(`Samples: ${iris.data.shape[0]}`);
console.log(`Features: ${iris.data.shape[1]}`);
console.log(`Classes: ${iris.targetNames?.join(", ") || "N/A"}`);
console.log(`Features: ${iris.featureNames?.join(", ") || "N/A"}\n`);

console.log("2. Digits Dataset:");
console.log("-".repeat(50));
const digits = loadDigits();
console.log(`Samples: ${digits.data.shape[0]}`);
console.log(`Features: ${digits.data.shape[1]} (8x8 images flattened)`);
console.log(`Classes: 10 (digits 0-9)\n`);

console.log("3. Breast Cancer Dataset:");
console.log("-".repeat(50));
const cancer = loadBreastCancer();
console.log(`Samples: ${cancer.data.shape[0]}`);
console.log(`Features: ${cancer.data.shape[1]}`);
console.log(`Classes: ${cancer.targetNames?.join(", ") || "N/A"}\n`);

console.log("4. Diabetes Dataset (Regression):");
console.log("-".repeat(50));
const diabetes = loadDiabetes();
console.log(`Samples: ${diabetes.data.shape[0]}`);
console.log(`Features: ${diabetes.data.shape[1]}`);
console.log(`Task: Regression (predict disease progression)\n`);

console.log("5. Linnerud Dataset (Multi-Output):");
console.log("-".repeat(50));
const linnerud = loadLinnerud();
console.log(`Samples: ${linnerud.data.shape[0]}`);
console.log(`Features: ${linnerud.data.shape[1]}`);
console.log(`Targets: ${linnerud.target.shape[1]} (multi-output regression)\n`);

// ─── Tabular Classification ─────────────────────────────────────────────────

console.log("--- Tabular Classification Datasets ---\n");

console.log("6. Flowers Extended (4-class Iris variant):");
console.log("-".repeat(50));
const flowers = loadFlowersExtended();
console.log(`Samples: ${flowers.data.shape[0]}, Features: ${flowers.data.shape[1]}`);
console.log(`Classes: ${flowers.targetNames?.join(", ") || "N/A"}\n`);

console.log("7. Leaf Shapes (5-class morphology):");
console.log("-".repeat(50));
const leaves = loadLeafShapes();
console.log(`Samples: ${leaves.data.shape[0]}, Features: ${leaves.data.shape[1]}`);
console.log(`Classes: ${leaves.targetNames?.join(", ") || "N/A"}\n`);

console.log("8. Fruit Quality:");
console.log("-".repeat(50));
const fruit = loadFruitQuality();
console.log(`Samples: ${fruit.data.shape[0]}, Features: ${fruit.data.shape[1]}`);
console.log(`Classes: ${fruit.targetNames?.join(", ") || "N/A"}\n`);

console.log("9. Seed Morphology:");
console.log("-".repeat(50));
const seeds = loadSeedMorphology();
console.log(`Samples: ${seeds.data.shape[0]}, Features: ${seeds.data.shape[1]}`);
console.log(`Classes: ${seeds.targetNames?.join(", ") || "N/A"}\n`);

// ─── Non-Linear Classification ──────────────────────────────────────────────

console.log("--- Non-Linear Classification Datasets ---\n");

console.log("10. Moons-Multi (3 interleaving crescents):");
console.log("-".repeat(50));
const moons = loadMoonsMulti();
console.log(`Samples: ${moons.data.shape[0]}, Features: ${moons.data.shape[1]}\n`);

console.log("11. Concentric Rings:");
console.log("-".repeat(50));
const rings = loadConcentricRings();
console.log(`Samples: ${rings.data.shape[0]}, Features: ${rings.data.shape[1]}\n`);

console.log("12. Spiral Arms:");
console.log("-".repeat(50));
const spirals = loadSpiralArms();
console.log(`Samples: ${spirals.data.shape[0]}, Features: ${spirals.data.shape[1]}\n`);

console.log("13. Gaussian Islands (3D clusters):");
console.log("-".repeat(50));
const islands = loadGaussianIslands();
console.log(`Samples: ${islands.data.shape[0]}, Features: ${islands.data.shape[1]}\n`);

// ─── Regression Datasets ────────────────────────────────────────────────────

console.log("--- Regression Datasets ---\n");

console.log("14. Plant Growth:");
console.log("-".repeat(50));
const plant = loadPlantGrowth();
console.log(`Samples: ${plant.data.shape[0]}, Features: ${plant.data.shape[1]}`);
console.log(`Features: ${plant.featureNames.join(", ")}\n`);

console.log("15. Housing-Mini:");
console.log("-".repeat(50));
const housing = loadHousingMini();
console.log(`Samples: ${housing.data.shape[0]}, Features: ${housing.data.shape[1]}`);
console.log(`Features: ${housing.featureNames.join(", ")}\n`);

console.log("16. Energy Efficiency:");
console.log("-".repeat(50));
const energy = loadEnergyEfficiency();
console.log(`Samples: ${energy.data.shape[0]}, Features: ${energy.data.shape[1]}\n`);

console.log("17. Crop Yield:");
console.log("-".repeat(50));
const crop = loadCropYield();
console.log(`Samples: ${crop.data.shape[0]}, Features: ${crop.data.shape[1]}\n`);

// ─── Clustering Datasets ────────────────────────────────────────────────────

console.log("--- Clustering Datasets ---\n");

console.log("18. Customer Segments:");
console.log("-".repeat(50));
const customers = loadCustomerSegments();
console.log(`Samples: ${customers.data.shape[0]}, Features: ${customers.data.shape[1]}`);
console.log(`Clusters: ${customers.targetNames?.join(", ") || "N/A"}\n`);

console.log("19. Sensor States:");
console.log("-".repeat(50));
const sensors = loadSensorStates();
console.log(`Samples: ${sensors.data.shape[0]}, Features: ${sensors.data.shape[1]}`);
console.log(`Modes: ${sensors.targetNames?.join(", ") || "N/A"}\n`);

// ─── Integer-Heavy Datasets ─────────────────────────────────────────────────

console.log("--- Integer-Heavy Datasets ---\n");

console.log("20. Student Performance:");
console.log("-".repeat(50));
const students = loadStudentPerformance();
console.log(`Samples: ${students.data.shape[0]}, Features: ${students.data.shape[1]}\n`);

console.log("21. Traffic Conditions:");
console.log("-".repeat(50));
const traffic = loadTrafficConditions();
console.log(`Samples: ${traffic.data.shape[0]}, Features: ${traffic.data.shape[1]}\n`);

// ─── Multi-Output Datasets ──────────────────────────────────────────────────

console.log("--- Multi-Output Datasets ---\n");

console.log("22. Fitness Scores (3 targets):");
console.log("-".repeat(50));
const fitness = loadFitnessScores();
console.log(`Samples: ${fitness.data.shape[0]}, Features: ${fitness.data.shape[1]}`);
console.log(`Targets: ${fitness.target.shape[1]} (${fitness.targetNames?.join(", ") || "N/A"})\n`);

console.log("23. Weather Outcomes (2 targets):");
console.log("-".repeat(50));
const weather = loadWeatherOutcomes();
console.log(`Samples: ${weather.data.shape[0]}, Features: ${weather.data.shape[1]}`);
console.log(`Targets: ${weather.target.shape[1]} (${weather.targetNames?.join(", ") || "N/A"})\n`);

// ─── Benchmark / Sanity-Check ───────────────────────────────────────────────

console.log("--- Benchmark / Sanity-Check ---\n");

console.log("24. Perfectly Separable:");
console.log("-".repeat(50));
const perfect = loadPerfectlySeparable();
console.log(`Samples: ${perfect.data.shape[0]}, Features: ${perfect.data.shape[1]}`);
console.log(`Classes: ${perfect.targetNames?.join(", ") || "N/A"}\n`);

// ─── Synthetic Dataset Generators ───────────────────────────────────────────

console.log("=== Synthetic Dataset Generators ===\n");

console.log("25. Make Classification:");
console.log("-".repeat(50));
const classData = makeClassification({
  nSamples: 100,
  nFeatures: 4,
  nClasses: 2,
  randomState: 42,
});
const X_class = classData[0];
console.log(`Generated ${X_class.shape[0]} samples with ${X_class.shape[1]} features\n`);

console.log("26. Make Regression:");
console.log("-".repeat(50));
const regData = makeRegression({
  nSamples: 100,
  nFeatures: 3,
  noise: 0.1,
  randomState: 42,
});
const X_reg = regData[0];
console.log(`Generated ${X_reg.shape[0]} samples with ${X_reg.shape[1]} features\n`);

console.log("27. Make Blobs:");
console.log("-".repeat(50));
const blobsData = makeBlobs({
  nSamples: 150,
  nFeatures: 2,
  centers: 3,
  randomState: 42,
});
const X_blobs = blobsData[0];
console.log(`Generated ${X_blobs.shape[0]} samples in ${3} clusters\n`);

console.log("28. Make Moons:");
console.log("-".repeat(50));
const moonsData = makeMoons({
  nSamples: 100,
  noise: 0.1,
  randomState: 42,
});
const X_moons = moonsData[0];
console.log(`Generated ${X_moons.shape[0]} samples (2 interleaving half circles)\n`);

console.log("29. Make Circles:");
console.log("-".repeat(50));
const circlesData = makeCircles({
  nSamples: 100,
  noise: 0.05,
  randomState: 42,
});
const X_circles = circlesData[0];
console.log(`Generated ${X_circles.shape[0]} samples (concentric circles)\n`);

console.log("30. Make Gaussian Quantiles:");
console.log("-".repeat(50));
const gaussData = makeGaussianQuantiles({
  nSamples: 100,
  nFeatures: 2,
  nClasses: 3,
  randomState: 42,
});
const X_gauss = gaussData[0];
console.log(`Generated ${X_gauss.shape[0]} samples in ${3} quantile-based classes\n`);

console.log("✓ All 24 built-in datasets + 6 synthetic generators explored!");
Console Output
$ npx tsx 22-datasets/index.ts
=== Built-in Datasets ===
--- Classic Reference Datasets ---
1. Iris Dataset:
--------------------------------------------------
Samples: 150
Features: 4
Classes: setosa, versicolor, virginica
Features: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm)
2. Digits Dataset:
--------------------------------------------------
Samples: 1797
Features: 64 (8x8 images flattened)
Classes: 10 (digits 0-9)
3. Breast Cancer Dataset:
--------------------------------------------------
Samples: 569
Features: 30
Classes: malignant, benign
4. Diabetes Dataset (Regression):
--------------------------------------------------
Samples: 442
Features: 10
Task: Regression (predict disease progression)
5. Linnerud Dataset (Multi-Output):
--------------------------------------------------
Samples: 20
Features: 3
Targets: 3 (multi-output regression)
--- Tabular Classification Datasets ---
6. Flowers Extended (4-class Iris variant):
--------------------------------------------------
Samples: 180, Features: 6
Classes: setosa, versicolor, virginica, chrysantha
7. Leaf Shapes (5-class morphology):
--------------------------------------------------
Samples: 150, Features: 8
Classes: maple, oak, birch, willow, ginkgo
8. Fruit Quality:
--------------------------------------------------
Samples: 150, Features: 5
Classes: apple, orange, banana
9. Seed Morphology:
--------------------------------------------------
Samples: 150, Features: 4
Classes: wheat, rice, sunflower
--- Non-Linear Classification Datasets ---
10. Moons-Multi (3 interleaving crescents):
... (43 lines truncated) ...
--------------------------------------------------
Samples: 180, Features: 6
Modes: normal, heating, fault
--- Integer-Heavy Datasets ---
20. Student Performance:
--------------------------------------------------
Samples: 150, Features: 3
21. Traffic Conditions:
--------------------------------------------------
Samples: 150, Features: 3
--- Multi-Output Datasets ---
22. Fitness Scores (3 targets):
--------------------------------------------------
Samples: 100, Features: 3
Targets: 3 (strength, endurance, flexibility)
23. Weather Outcomes (2 targets):
--------------------------------------------------
Samples: 150, Features: 3
Targets: 2 (rain probability, wind speed (km/h))
--- Benchmark / Sanity-Check ---
24. Perfectly Separable:
--------------------------------------------------
Samples: 100, Features: 4
Classes: class_0, class_1
=== Synthetic Dataset Generators ===
25. Make Classification:
--------------------------------------------------
Generated 100 samples with 4 features
26. Make Regression:
--------------------------------------------------
Generated 100 samples with 3 features
27. Make Blobs:
--------------------------------------------------
Generated 150 samples in 3 clusters
28. Make Moons:
--------------------------------------------------
Generated 100 samples (2 interleaving half circles)
29. Make Circles:
--------------------------------------------------
Generated 100 samples (concentric circles)
30. Make Gaussian Quantiles:
--------------------------------------------------
Generated 100 samples in 3 quantile-based classes
✓ All 24 built-in datasets + 6 synthetic generators explored!
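The script repeats the same few logging lines for every dataset. One way to factor that out is a small summary helper, sketched below. `DatasetLike` and `describeDataset` are hypothetical names, and the interface is only an assumption inferred from the fields the script reads (`data.shape`, `targetNames`, `featureNames`); the actual Deepbox return type may differ. To keep the sketch runnable without the library, it uses a stand-in object carrying the Iris figures from the console output rather than calling `loadIris()`.

```typescript
// Assumed minimal shape, inferred from the fields the example script reads.
// The real Deepbox dataset type may carry more (or differently typed) fields.
interface DatasetLike {
  data: { shape: readonly number[] };
  targetNames?: readonly string[];
  featureNames?: readonly string[];
}

// Build the summary block that the script prints by hand for each dataset.
function describeDataset(title: string, ds: DatasetLike): string {
  const [samples, features] = ds.data.shape;
  const lines = [title, "-".repeat(50), `Samples: ${samples}, Features: ${features}`];
  if (ds.targetNames?.length) lines.push(`Classes: ${ds.targetNames.join(", ")}`);
  if (ds.featureNames?.length) lines.push(`Feature names: ${ds.featureNames.join(", ")}`);
  return lines.join("\n");
}

// Stand-in object (not a real Deepbox dataset) with the Iris figures shown above.
const irisLike: DatasetLike = {
  data: { shape: [150, 4] },
  targetNames: ["setosa", "versicolor", "virginica"],
};

const irisSummary = describeDataset("1. Iris Dataset:", irisLike);
console.log(irisSummary);
```

If the loaders do return objects of this shape, passing `loadIris()` in place of `irisLike` should work unchanged, though that is untested here.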