A comprehensive machine learning study on diamond price prediction — spanning exploratory analysis, classification, regression & unsupervised clustering.
The Diamond Listings (DML) dataset contains rich features across three categories, enabling both supervised and unsupervised learning approaches.
Deep dives into the diamond dataset revealing pricing patterns, shape relationships, and the premium value of fancy color diamonds.
Distribution across labs is imbalanced — Lab 0 dominates in volume. Labs 0 and 1 show overlapping price ranges on log scale, while Lab 2 tends to certify higher-value diamonds, with a higher median and upper quartile.
Both fancy and standard diamonds concentrate at VS2, VS1, SI1, SI2 clarity grades. No significant clarity shift exists between groups.
Round cuts show the most symmetric length/width ratio. Elongated cuts (Marquise, Pear, Oval) are much longer than they are wide, while Princess and Asscher cuts approach square proportions.
Fancy color diamonds command a price-per-carat that is over 2× higher on average compared to standard diamonds. The entire distribution shifts upward — not just outliers.
Length/width ratio is consistent across all three labs (majority near 1:1). However, volume differs notably — Lab 2 handles larger diamonds on average.
Carat (Ukuran) shows the strongest correlation with price at r = 0.76. Price-per-carat follows at r = 0.68. Physical dimensions (width 0.51, length 0.49) also contribute significantly.
Columns Panjang, Lebar, Tebal, Persentase_Depth, Persentase_Table cannot be zero — replaced with NaN for proper imputation.
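The zero-to-NaN step can be sketched as follows. The toy values are made up for illustration; only the column names come from the text above.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values; only the column names are from the project.
train = pd.DataFrame({
    "Panjang": [6.1, 0.0, 5.4],
    "Lebar": [6.0, 4.2, 0.0],
    "Tebal": [3.7, 2.6, 3.3],
    "Persentase_Depth": [61.5, 0.0, 62.0],
    "Persentase_Table": [57.0, 58.0, 0.0],
})

zero_invalid_cols = [
    "Panjang", "Lebar", "Tebal", "Persentase_Depth", "Persentase_Table",
]

# A measurement of exactly zero is physically impossible here, so it is
# treated as "not recorded" and left for the imputation step to fill.
train[zero_invalid_cols] = train[zero_invalid_cols].replace(0, np.nan)
```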
5 columns with >90% missing values dropped: Warna_Dominan_Fancy, Warna_Sekunder_Fancy, Overtone_Fancy, Intensitas_Warna_Fancy, Warna_Fluoresensi.
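A minimal sketch of a missing-rate filter like the one described above. The helper name `drop_sparse_columns` and the toy data are assumptions for illustration, not the project's actual code.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=0.90):
    """Drop columns whose share of missing values exceeds `threshold`."""
    missing_share = df.isna().mean()  # fraction of NaN per column
    to_drop = missing_share[missing_share > threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop

# Toy data: one dense column and one column that is 95% missing.
toy = pd.DataFrame({
    "Ukuran": np.linspace(0.3, 2.0, 20),
    "Warna_Fluoresensi": [np.nan] * 19 + ["Blue"],
})

reduced, dropped = drop_sparse_columns(toy)
```

Returning the list of dropped names makes it easy to drop the same columns from the validation and test sets.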
Skewed features → median imputation. Normally distributed features → mean imputation. A feature is treated as skewed when its absolute skewness exceeds 1.
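The skew-based choice between median and mean could look roughly like this. The helper `impute_by_skew` is hypothetical and the data is a toy example; the threshold of 1 is the one stated above.

```python
import pandas as pd

def impute_by_skew(df, columns, skew_threshold=1.0):
    """Fill NaN with the median for skewed columns, the mean otherwise."""
    out = df.copy()
    strategy = {}
    for col in columns:
        skewed = abs(out[col].skew()) > skew_threshold
        fill_value = out[col].median() if skewed else out[col].mean()
        out[col] = out[col].fillna(fill_value)
        strategy[col] = "median" if skewed else "mean"
    return out, strategy

toy = pd.DataFrame({
    "Ukuran": [0.3, 0.4, 0.5, 0.6, 9.0, None],                 # long right tail
    "Persentase_Depth": [60.0, 61.0, 62.0, 63.0, 64.0, None],  # symmetric
})

imputed, strategy = impute_by_skew(toy, ["Ukuran", "Persentase_Depth"])
```

In a train/validation/test setup, the fill values would normally be computed on the training split and reused on the other splits.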
Categorical features imputed with mode. Ordinal features encoded with OrdinalEncoder respecting quality order (e.g., Cut: Fair → Ideal).
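A short sketch of mode imputation followed by ordered encoding. The column name `Cut` and the intermediate grades between Fair and Ideal are assumptions; the source only states that the order runs from Fair to Ideal.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed grade ladder: only the endpoints (Fair, Ideal) are given in the text.
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

toy = pd.DataFrame({"Cut": ["Ideal", None, "Fair", "Ideal", "Good"]})

# Mode imputation: fill the gap with the most frequent category.
toy["Cut"] = toy["Cut"].fillna(toy["Cut"].mode()[0])

# Passing `categories` fixes the integer order instead of alphabetical order.
encoder = OrdinalEncoder(categories=[cut_order])
toy["Cut_encoded"] = encoder.fit_transform(toy[["Cut"]]).ravel()
```

Without the explicit `categories` list, the encoder would sort labels alphabetically and lose the quality ranking.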
Extreme values capped at IQR bounds (Q1 − 1.5×IQR, Q3 + 1.5×IQR) instead of dropped — preserving sample integrity while neutralizing outlier influence.
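IQR capping as described above can be sketched with `Series.clip`. The helper name `cap_iqr` and the sample values are illustrative assumptions.

```python
import pandas as pd

def cap_iqr(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] without removing any rows."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
capped = cap_iqr(values)

# Q1 = 3, Q3 = 7, IQR = 4, so the bounds are -3 and 13:
# the outlier 100.0 is pulled down to 13.0 and every row is kept.
```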
Numeric features normalized to [0, 1] using MinMaxScaler: Ukuran, Panjang, Lebar, Tebal, Persentase_Depth, Persentase_Table, Sudut_Crown, Sudut_Pavilion, ratio_pl, volume.
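A minimal scaling sketch on two of the listed columns. It assumes the scaler is fitted on the training split only and then reused on other splits, which the source does not state explicitly; the values are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({"Ukuran": [0.5, 1.0, 2.5], "volume": [50.0, 100.0, 250.0]})
test = pd.DataFrame({"Ukuran": [1.5, 3.0], "volume": [150.0, 300.0]})
cols = ["Ukuran", "volume"]

scaler = MinMaxScaler()

# Learn min and max from the training data only (assumption, see lead-in).
train_scaled = train.copy()
train_scaled[cols] = scaler.fit_transform(train[cols])

# Reuse the training min and max; unseen values may fall outside [0, 1].
test_scaled = test.copy()
test_scaled[cols] = scaler.transform(test[cols])
```

Because the test split reuses the training range, a diamond larger than any in the training set scales to a value above 1.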
Nominal categorical features encoded using pd.get_dummies with drop_first=True to avoid multicollinearity. Applied consistently across train, val, and test sets.
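One way to keep the dummy columns consistent across splits is to encode the training set first and align the other splits to its columns. The column name `cut_shape` and the category values are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"cut_shape": ["Round", "Oval", "Pear"]})
test = pd.DataFrame({"cut_shape": ["Round", "Marquise"]})

# drop_first=True removes the first category (alphabetically "Oval") as baseline.
train_enc = pd.get_dummies(train, columns=["cut_shape"], drop_first=True, dtype=int)

# Encode the test split without drop_first, then align to the training columns.
# Categories unseen in training get no column, so such rows are all zeros.
test_enc = pd.get_dummies(test, columns=["cut_shape"], dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

Calling `drop_first=True` independently on each split would drop a different baseline whenever the splits contain different categories, which is why the alignment step matters.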
Categorical features with >60% missing filled with "Unknown" instead of being dropped — preserving feature availability while flagging uncertainty.
The probability density function (PDF) plot shows how IQR capping compresses extreme values without deleting records, which makes the modeling pipeline more stable while preserving the sample count.
is_fancy
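Created from Warna_Dominan_Fancy to flag whether a diamond is a fancy color diamond. EDA shows that fancy diamonds have a price-per-carat over 2× higher on average, which makes this binary signal critical for the model. The flag has to be computed before Warna_Dominan_Fancy is dropped for its >90% missing rate.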
train['is_fancy'] = train['Warna_Dominan_Fancy'].notna().astype(int)
ratio_pl
Length-to-width ratio captures geometric shape information quantitatively. Elongated cuts (Marquise, Pear, Oval) produce high ratios; symmetric cuts (Round, Princess) cluster near 1.0.
train['ratio_pl'] = train['Panjang'] / train['Lebar']
volume
Combines length × width × depth into a single physical size estimator. Correlation with price is r = 0.58 — more representative than any individual dimension alone.
train['volume'] = train['Panjang'] * train['Lebar'] * train['Tebal']
Sudut_Crown, Sudut_Pavilion, Harga, Panjang, Lebar — removed to eliminate data leakage and redundancy after engineering.
| # | Model | F1 Macro | Precision | Recall | Rank |
|---|---|---|---|---|---|
| 1 | CatBoost ⭐ | 0.8142 | 0.763 | 0.876 | Best |
| 2 | LightGBM | 0.770 | 0.757 | 0.787 | 2nd |
| 3 | XGBoost | 0.770 | 0.820 | 0.730 | 2nd |
| 4 | Random Forest | 0.7593 | 0.7246 | 0.8341 | |
| 5 | Stacking (RF + LR) | 0.758 | 0.721 | 0.8474 | |
| 6 | KNN | 0.640 | 0.690 | 0.610 | |
Ranked 1st on public leaderboard for classification task
| # | Model | R² | MAE | RMSE | Rank |
|---|---|---|---|---|---|
| 5 | Stacking (XGB+LGBM+CB+Ridge) ⭐ | 0.9888 | — | — | Best |
| 4 | XGBoostRegressor | 0.9796 | — | — | 2nd |
| 3 | HistGradientBoosting | 0.9792 | — | — | |
| 1 | CatBoostRegressor | 0.9032 | 981.57 | 8141.03 | |
| 2 | LightGBMRegressor | 0.8633 | 1100.47 | 9562.85 | |
| 7 | Multi Layer Perceptron | 0.802 | 166.13 | 3177.03 | |
| 6 | Linear Regression | -22.47 | — | 1574.39 | |
| 8 | Lasso Regression | -22.16 | 1458 | 16382 | |
Ranked 1st on public leaderboard for regression task
CatBoost handles categorical features natively without extensive preprocessing, and its ordered boosting prevents target leakage — particularly useful for this imbalanced, multi-class lab prediction task.
No single model captures all nonlinear price patterns. Stacking XGBoost + LightGBM + CatBoost with a Ridge meta-learner achieves R² = 0.9888 by combining the strengths of diverse gradient boosting strategies.
Linear Regression and Lasso achieved negative R² values, confirming that diamond pricing is highly nonlinear — driven by complex interactions between carat, clarity, cut, color, and fancy status.
Four clustering algorithms were applied to discover natural groupings in the diamond dataset, all converging on k = 2 as the optimal structure.
Elbow method suggested candidates at k = 2, 3, 4, 6. Silhouette score peaked at k = 2 (0.320), though the dataset's high variance is reflected in the moderate score.
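A silhouette sweep over candidate k values might look like this. The data is a synthetic two-blob set, so the scores are not the project's.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with two well-separated groups.
X, _ = make_blobs(
    n_samples=300, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0
)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # ranges from -1 to 1, higher is better

best_k = max(scores, key=scores.get)
```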
Dendrogram visualization clearly revealed two dominant vertical branches, confirming k = 2 as the most natural hierarchical split in the bottom-up clustering structure.
PCA-assisted exploration across eps values. At eps=2.0, 2 clusters formed with minimal noise (0.04%). Smaller eps values caused fragmentation; larger values merged all points.
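The eps sweep can be sketched as below, assuming a 2-component PCA projection and `min_samples=5`; neither setting is given in the source, and the data is synthetic.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic 5-dimensional data with two groups, projected to 2-D with PCA.
X, _ = make_blobs(
    n_samples=400, centers=[[-5] * 5, [5] * 5], cluster_std=0.5, random_state=0
)
X_2d = PCA(n_components=2, random_state=0).fit_transform(X)

summary = {}
for eps in (0.05, 2.0, 50.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_2d)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    noise_share = float(np.mean(labels == -1))
    summary[eps] = (n_clusters, noise_share)
```

On this toy data a very small eps leaves most points as noise, while a very large eps merges everything into one cluster, mirroring the behavior described above.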
Sharp BIC/AIC drop at n=2 confirmed two Gaussian components as optimal. GMM's soft clustering assigns probabilistic membership — suitable for diamonds straddling two market segments.
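A BIC-based component search with soft membership could be sketched like this, again on synthetic data rather than the diamond features.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data with two groups.
X, _ = make_blobs(
    n_samples=300, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0
)

bic = {}
for n in range(1, 6):
    gmm = GaussianMixture(n_components=n, random_state=0).fit(X)
    bic[n] = gmm.bic(X)  # lower BIC means a better fit after complexity penalty

best_n = min(bic, key=bic.get)
best_gmm = GaussianMixture(n_components=best_n, random_state=0).fit(X)

# Soft clustering: each row holds the membership probability per component.
membership = best_gmm.predict_proba(X)
```

A point sitting between two segments gets split probabilities (for example close to 0.5 each) instead of a hard label.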
All four algorithms independently converge on k = 2 clusters, strongly suggesting the diamond market naturally segments into two groups — most likely aligned with the fancy color vs. non-fancy distinction uncovered in EDA. This unsupervised result validates the domain-driven is_fancy feature engineered for supervised models.
Physical size (carat, volume, dimensions) dominates price prediction with correlation up to 0.76.
Fancy diamonds command over 2× the price-per-carat of standard diamonds on average. The is_fancy feature became one of the most valuable signals.
Tree-based ensemble models (CatBoost, XGBoost, LightGBM) outperformed linear and neural models on this structured dataset.
The stacking ensemble achieved R² = 0.9888, ahead of every individual regressor; the best single model, XGBoostRegressor, reached R² = 0.9796.
Lab certification correlates weakly (r=0.06) with price — labs certify different diamond types, not different quality standards.
All clustering methods converge on k=2, validating the fancy/non-fancy market split through unsupervised learning.