Data ScienceTech Institute, 2025
Author: Nicolas Len
Task statement. Using the ozone dataset, treat maxO3 as the
response and all numerical predictors except obs as explanatory
variables. Build the unconstrained model, build the constrained model
under beta_T9 + beta_T12 + beta_T15 = 0, and compare both models.
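The fitting step can be sketched as follows (the actual chunk is not echoed in this report; `ozone_clean` is assumed to hold the dataset with `obs` and the character columns dropped). The unconstrained model uses `lm()`, and the equality constraint is imposed with `restriktor::restriktor()`:

```r
library(restriktor)

# Unconstrained model: maxO3 on all numerical predictors except obs
fit_unc <- lm(maxO3 ~ T9 + T12 + T15 + Ne9 + Ne12 + Ne15 +
                Vx9 + Vx12 + Vx15 + maxO3v, data = ozone_clean)

# Constrained model: impose beta_T9 + beta_T12 + beta_T15 = 0
fit_con <- restriktor(fit_unc, constraints = "T9 + T12 + T15 == 0")

summary(fit_unc)
summary(fit_con)
```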
## This is restriktor 0.6-30
## Please report any bugs to info@restriktor.org
## 'data.frame': 112 obs. of 14 variables:
## $ obs : int 601 602 603 604 605 606 607 610 611 612 ...
## $ maxO3 : int 87 82 92 114 94 80 79 79 101 106 ...
## $ T9 : num 15.6 17 15.3 16.2 17.4 17.7 16.8 14.9 16.1 18.3 ...
## $ T12 : num 18.5 18.4 17.6 19.7 20.5 19.8 15.6 17.5 19.6 21.9 ...
## $ T15 : num 18.4 17.7 19.5 22.5 20.4 18.3 14.9 18.9 21.4 22.9 ...
## $ Ne9 : int 4 5 2 1 8 6 7 5 2 5 ...
## $ Ne12 : int 4 5 5 1 8 6 8 5 4 6 ...
## $ Ne15 : int 8 7 4 0 7 7 8 4 4 8 ...
## $ Vx9 : num 0.695 -4.33 2.954 0.985 -0.5 ...
## $ Vx12 : num -1.71 -4 1.879 0.347 -2.954 ...
## $ Vx15 : num -0.695 -3 0.521 -0.174 -4.33 ...
## $ maxO3v: int 84 87 82 92 114 94 80 99 79 101 ...
## $ vent : chr "Nord" "Nord" "Est" "Nord" ...
## $ pluie : chr "Sec" "Sec" "Sec" "Sec" ...
## obs maxO3 T9 T12
## Min. :601.0 Min. : 42.00 Min. :11.30 Min. :14.00
## 1st Qu.:701.8 1st Qu.: 70.75 1st Qu.:16.20 1st Qu.:18.60
## Median :729.5 Median : 81.50 Median :17.80 Median :20.55
## Mean :763.2 Mean : 90.30 Mean :18.36 Mean :21.53
## 3rd Qu.:829.2 3rd Qu.:106.00 3rd Qu.:19.93 3rd Qu.:23.55
## Max. :930.0 Max. :166.00 Max. :27.00 Max. :33.50
## T15 Ne9 Ne12 Ne15
## Min. :14.90 Min. :0.000 Min. :0.000 Min. :0.00
## 1st Qu.:19.27 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:3.00
## Median :22.05 Median :6.000 Median :5.000 Median :5.00
## Mean :22.63 Mean :4.929 Mean :5.018 Mean :4.83
## 3rd Qu.:25.40 3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:7.00
## Max. :35.50 Max. :8.000 Max. :8.000 Max. :8.00
## Vx9 Vx12 Vx15 maxO3v
## Min. :-7.8785 Min. :-7.878 Min. :-9.000 Min. : 42.00
## 1st Qu.:-3.2765 1st Qu.:-3.565 1st Qu.:-3.939 1st Qu.: 71.00
## Median :-0.8660 Median :-1.879 Median :-1.550 Median : 82.50
## Mean :-1.2143 Mean :-1.611 Mean :-1.691 Mean : 90.57
## 3rd Qu.: 0.6946 3rd Qu.: 0.000 3rd Qu.: 0.000 3rd Qu.:106.00
## Max. : 5.1962 Max. : 6.578 Max. : 5.000 Max. :166.00
## vent pluie
## Length:112 Length:112
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## obs maxO3 T9 T12 T15 Ne9 Ne12 Ne15 Vx9 Vx12 Vx15
## 0 0 0 0 0 0 0 0 0 0 0
## maxO3v vent pluie
## 0 0 0
##
## Call:
## lm(formula = maxO3 ~ ., data = ozone_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.566 -8.727 -0.403 7.599 39.458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.24442 13.47190 0.909 0.3656
## T9 -0.01901 1.12515 -0.017 0.9866
## T12 2.22115 1.43294 1.550 0.1243
## T15 0.55853 1.14464 0.488 0.6266
## Ne9 -2.18909 0.93824 -2.333 0.0216 *
## Ne12 -0.42102 1.36766 -0.308 0.7588
## Ne15 0.18373 1.00279 0.183 0.8550
## Vx9 0.94791 0.91228 1.039 0.3013
## Vx12 0.03120 1.05523 0.030 0.9765
## Vx15 0.41859 0.91568 0.457 0.6486
## maxO3v 0.35198 0.06289 5.597 1.88e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.36 on 101 degrees of freedom
## Multiple R-squared: 0.7638, Adjusted R-squared: 0.7405
## F-statistic: 32.67 on 10 and 101 DF, p-value: < 2.2e-16
##
## Call:
## conLM.lm(object = object, constraints = constraints)
##
## Restriktor: restricted linear model:
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.5015 -9.2064 -0.7102 8.8056 43.4884
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.443309 10.302499 5.4786 3.155e-07 ***
## T9 -2.551360 1.072558 -2.3788 0.019250 *
## T12 1.681563 1.562201 1.0764 0.284310
## T15 0.869797 1.249908 0.6959 0.488097
## Ne9 -3.328210 0.989748 -3.3627 0.001091 **
## Ne12 -1.086056 1.487654 -0.7300 0.467052
## Ne15 0.507342 1.094229 0.4637 0.643894
## Vx9 0.702988 0.996224 0.7057 0.482028
## Vx12 0.072475 1.154263 0.0628 0.950059
## Vx15 0.148640 0.999582 0.1487 0.882085
## maxO3v 0.501803 0.058793 8.5351 1.484e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.708 on 101 degrees of freedom
## Standard errors: standard
## Multiple R-squared reduced from 0.764 to 0.715
##
## Generalized order-restricted information criterion:
## Loglik Penalty goric
## -462.15 11.00 946.31
##
## Unconstrained Constrained
## (Intercept) 12.24441987 56.44330907
## T9 -0.01901425 -2.55136002
## T12 2.22115189 1.68156276
## T15 0.55853087 0.86979726
## Ne9 -2.18909215 -3.32820968
## Ne12 -0.42101517 -1.08605648
## Ne15 0.18373097 0.50734236
## Vx9 0.94790917 0.70298836
## Vx12 0.03119824 0.07247503
## Vx15 0.41859252 0.14863956
## maxO3v 0.35197646 0.50180286
## Model Comparison Results:
## ---------------------------------
## Unconstrained Model RSS: 20827.23
## Constrained Model RSS: 25168.66
## F-statistic (Improvement Ratio): 21.053
## p-value: < 0.001
## Conclusion: Reject H0 - Constraining coefficients significantly worsens model fit (p = 1.29e-05 )
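The comparison above can be reproduced by hand from the two residual sums of squares reported in the output (one linear restriction, 101 residual degrees of freedom in the unconstrained model):

```r
rss_u <- 20827.23   # unconstrained model RSS
rss_c <- 25168.66   # constrained model RSS
q     <- 1          # number of restrictions
df    <- 101        # residual df of the unconstrained model

# F-test for H0: the restriction holds
f_stat <- ((rss_c - rss_u) / q) / (rss_u / df)
p_val  <- pf(f_stat, q, df, lower.tail = FALSE)
f_stat  # 21.053, matching the output above
p_val   # ~1.29e-05
```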
Task statement. Using data_advanced, construct multiple models to
explain response Y, compare their performance, select the best model
for this dataset, and interpret the results.
## Loading required package: Matrix
## Loaded glmnet 4.1-10
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## corrplot 0.95 loaded
##
## Feature matrix dimensions: 77 200
##
## Response variable levels: -1 1
##
##
## Missing values check:
##
## Features missing values: 0
##
## Response missing values: 0
##
##
## Class balance:
##
## -1 1
## 0.5194805 0.4805195
##
##
## Scaling check:
##
## Feature mean range: -0.14 0.12
##
## Feature SD range: 0.92 1.08
##
##
## Highly correlated feature pairs (|r| > 0.8): 1
##
## Highly Correlated Feature Pair 1 :
## Feature 1: V2
## Feature 2: V3
## Correlation between V2 and V3 : 0.83
The response variable takes two levels, -1 and 1. The class distribution is relatively balanced, with approximately 52% (-1) and 48% (1), reducing concerns about severe class imbalance. The features already appear roughly standardized, as shown by the feature mean range (-0.14 to 0.12) and standard deviation range (0.92 to 1.08), so all features are on a comparable scale. Nonetheless, we will apply additional scaling in the next steps. As a baseline tree-based model, we will also fit a decision tree with the rpart package.
##
## === Model Fitting Summary ===
## GLMNET Cross-Validation Results:
## Best lambda (lambda.min): 0.1
## Number of lambda values tested: 10
## Fold count:
## Maximum AUC achieved: 1
##
## Cross-Validation Performance:
## Lambda AUC SD
## 1 1.00000 0.5000 0.000
## 2 0.46416 0.5000 0.000
## 3 0.21544 0.9647 0.022
## 4 0.10000 1.0000 0.000
## 5 0.04642 1.0000 0.000
## 6 0.02154 1.0000 0.000
## 7 0.01000 1.0000 0.000
## 8 0.00464 1.0000 0.000
## 9 0.00215 1.0000 0.000
## 10 0.00100 1.0000 0.000
##
## === Confusion Matrix ===
## Actual
## Predicted Class0 Class1
## Class0 10 1
## Class1 0 5
##
## Accuracy: 0.9375
##
## Sensitivity (Recall): 0.8333
##
## Specificity: 1
##
## Test AUC: 0.9833
##
## === Non-Zero Coefficients ===
## Feature Coefficient
## (Intercept) 0.05748145
## V1 0.13305726
## V2 0.52362006
## V3 1.11549609
## V5 0.21888920
## V6 0.44056025
##
## Non-zero coefficients (including intercept): 6
##
## Non-zero coefficients (excluding intercept): 5
The model posts high test metrics, such as a test AUC of 0.983, but these results are likely unreliable given the very small test set (16 samples) and potential overfitting, as indicated by perfect cross-validation AUC scores (1.0) with zero variance. The model selected only 5 of 200 features, a considerable simplification. Given the small dataset (77 observations) and the abrupt jumps in AUC along the lambda path, the model's trustworthiness is questionable.
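A Lasso fit of this shape (a sketch; object names such as `X_train_scaled` and `Y_train` are assumptions, as the chunk is not echoed) would look like:

```r
library(glmnet)

set.seed(42)  # illustrative seed
# alpha = 1 gives the Lasso penalty; AUC is the CV selection criterion
cv_lasso <- cv.glmnet(X_train_scaled, Y_train, family = "binomial",
                      alpha = 1, type.measure = "auc",
                      lambda = 10^seq(0, -3, length.out = 10))

cv_lasso$lambda.min               # best lambda by CV AUC
coef(cv_lasso, s = "lambda.min")  # sparse coefficient vector
pred <- predict(cv_lasso, newx = X_test_scaled,
                s = "lambda.min", type = "response")
```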
##
## === RIDGE Model Fitting Summary ===
## GLMNET Cross-Validation Results:
## Best lambda (lambda.min): 0.02154435
## Number of lambda values tested: 10
## Fold count:
## Maximum AUC achieved: 0.9708
##
## Cross-Validation Performance:
## Lambda AUC SD
## 1 1.00000 0.9596 0.0175
## 2 0.46416 0.9596 0.0175
## 3 0.21544 0.9652 0.0145
## 4 0.10000 0.9652 0.0171
## 5 0.04642 0.9652 0.0171
## 6 0.02154 0.9708 0.0185
## 7 0.01000 0.9652 0.0231
## 8 0.00464 0.9652 0.0231
## 9 0.00215 0.9652 0.0231
## 10 0.00100 0.9652 0.0231
##
## Feature Impact Summary:
## All features are retained in Ridge regression
## Number of features: 200
##
## === Confusion Matrix ===
## Actual
## Predicted Class0 Class1
## Class0 9 1
## Class1 1 5
##
## Accuracy: 0.875
##
## Sensitivity (Recall): 0.8333
##
## Specificity: 0.9
##
## Test AUC: 0.95
##
## === Coefficients (Ridge) ===
## Feature Coefficient
## V3 0.6677428454
## V2 0.5860708095
## V1 0.4950715372
## V5 0.4068465121
## V6 0.3731892803
## V4 0.3341601274
## V7 0.2920198379
## V67 0.2721345921
## V109 -0.2641883915
## V37 0.2391782741
## V191 -0.2346461267
## V38 -0.2317536534
## V50 -0.2254794338
## V85 -0.2193123418
## V81 -0.2082334705
## V159 0.2047373747
## V145 -0.1954945688
## V168 -0.1887311981
## V117 0.1840854312
## V27 -0.1831435201
## V54 -0.1784206504
## V52 -0.1720755301
## V17 0.1693929709
## V147 -0.1689209049
## V126 0.1670306351
## V31 -0.1668912737
## V200 -0.1640237921
## V36 -0.1565928771
## V111 -0.1562466283
## V131 0.1497727927
## V188 0.1485950231
## V94 0.1476397281
## V124 -0.1437858669
## V23 -0.1435621300
## V146 -0.1428531772
## V150 -0.1412585423
## V125 0.1411453961
## V139 -0.1408484776
## V116 -0.1386625111
## V75 -0.1361589862
## V194 0.1325592599
## V123 -0.1321479229
## V136 0.1308955711
## V105 0.1246050044
## V66 -0.1223955534
## V68 -0.1207932709
## V137 -0.1200504999
## V195 0.1194261582
## V96 -0.1172784298
## V61 -0.1166600245
## V138 -0.1144530636
## V65 0.1109986107
## V69 0.1109938148
## V167 0.1098788313
## V193 0.1098763241
## V24 -0.1080617215
## V181 0.1078681314
## V196 0.1064422597
## V162 -0.1064347551
## V127 0.1037018069
## V120 -0.1030130554
## V53 0.1021707772
## V9 0.1014884313
## V107 0.0999730306
## V189 -0.0997422896
## V175 0.0987917600
## V184 -0.0974261058
## V179 0.0962381191
## V113 -0.0946776360
## V98 0.0927937032
## V197 0.0904013518
## V122 -0.0897049174
## V32 -0.0892781827
## V11 -0.0889186596
## V84 0.0883454969
## V158 -0.0881920980
## V165 -0.0878977301
## V58 0.0872822089
## V19 -0.0872436390
## V47 0.0863894794
## V114 -0.0859337059
## V71 0.0851134514
## V143 -0.0816824749
## V177 0.0815772474
## V49 -0.0804136223
## V185 0.0786337645
## V60 0.0775342777
## V104 -0.0771089503
## V10 -0.0765588106
## V12 0.0758527839
## V176 -0.0756172907
## V97 0.0754903070
## V187 0.0752714788
## (Intercept) 0.0751850270
## V144 0.0737784990
## V33 0.0709921513
## V152 -0.0702442315
## V77 0.0700134826
## V160 0.0700073864
## V30 0.0689716717
## V46 -0.0668084422
## V35 0.0664126439
## V102 0.0663829258
## V173 -0.0654786466
## V163 -0.0628194298
## V166 -0.0626487611
## V133 0.0618620420
## V92 -0.0618296190
## V73 0.0608282573
## V178 0.0599533246
## V103 -0.0589951738
## V18 0.0574942826
## V157 0.0558728172
## V40 0.0535946836
## V16 -0.0534346300
## V64 0.0523381464
## V132 -0.0520977073
## V172 0.0518690034
## V190 0.0511431754
## V82 0.0507089708
## V112 0.0506479297
## V156 0.0502699103
## V106 0.0494566490
## V128 0.0482251106
## V89 -0.0480117180
## V108 -0.0477849734
## V45 -0.0474736275
## V169 0.0468537495
## V25 -0.0463550496
## V56 0.0461599060
## V151 -0.0456293477
## V86 0.0451717526
## V148 0.0446776833
## V192 -0.0434698925
## V51 -0.0430611776
## V199 -0.0407743827
## V155 -0.0399950718
## V135 0.0399852018
## V182 0.0396406708
## V90 -0.0389661776
## V93 0.0384831053
## V134 0.0374480140
## V44 0.0371137066
## V48 0.0370053848
## V14 -0.0369921720
## V79 0.0369465112
## V70 -0.0352342196
## V55 -0.0325822561
## V72 0.0309068784
## V63 0.0307590797
## V115 -0.0301433266
## V59 0.0298726016
## V149 0.0298484362
## V130 0.0295905899
## V83 0.0279211717
## V43 -0.0277938308
## V129 0.0256117776
## V80 -0.0244953809
## V164 0.0235916890
## V110 -0.0234100097
## V20 -0.0230345961
## V15 -0.0225357888
## V78 -0.0208546644
## V13 0.0207843563
## V91 -0.0202158718
## V21 0.0200253283
## V8 0.0196890975
## V34 0.0196293347
## V118 0.0192163008
## V141 0.0168501684
## V183 -0.0166909053
## V62 0.0165510488
## V95 0.0161147914
## V121 0.0155240870
## V41 0.0133041623
## V119 0.0130034770
## V42 -0.0120060759
## V180 -0.0115244297
## V140 0.0111927226
## V101 -0.0110880828
## V154 -0.0098983849
## V153 -0.0094471852
## V39 -0.0089896697
## V87 0.0089673599
## V171 -0.0088269542
## V74 -0.0088227296
## V76 -0.0079005151
## V186 -0.0075874230
## V29 0.0068840279
## V99 -0.0067686040
## V22 -0.0066317723
## V174 0.0059697575
## V57 0.0055038902
## V26 -0.0046159299
## V88 0.0045222519
## V170 0.0040890453
## V28 -0.0039710509
## V198 -0.0034603503
## V100 0.0017944609
## V142 -0.0012035609
## V161 -0.0004133932
##
## Coefficient Statistics:
## L2 Norm: 3.348907
## Maximum Absolute Coefficient: 0.6677428
## Minimum Absolute Coefficient: 0.0004133932
The Ridge model achieves strong performance (0.95 test AUC) while retaining all 200 features, avoiding Lasso's aggressive sparsity at the cost of interpretability. It shows less evidence of overfitting than Lasso, with no perfect cross-validation AUC and a more stable lambda path, though the small dataset (77 samples, 200 features) and tiny test set (16 samples) still raise reliability concerns. Ridge appears marginally more trustworthy thanks to its consistent regularization, but both models likely overfit and require validation on a larger dataset. Retaining all features may improve robustness, yet it also increases complexity without clear gains in generalizability.
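The Ridge variant only changes the penalty mixing parameter; a sketch under the same assumed objects, including the coefficient-norm statistics reported above:

```r
library(glmnet)

# alpha = 0 selects the pure L2 (Ridge) penalty
cv_ridge <- cv.glmnet(X_train_scaled, Y_train, family = "binomial",
                      alpha = 0, type.measure = "auc",
                      lambda = 10^seq(0, -3, length.out = 10))

beta <- as.matrix(coef(cv_ridge, s = "lambda.min"))[-1, 1]  # drop intercept
sqrt(sum(beta^2))       # L2 norm of the coefficients
max(abs(beta))          # largest absolute coefficient
min(abs(beta))          # smallest absolute coefficient
```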
##
## === Elastic Net Tuning Progress ===
## Alpha: 0.2 | Best AUC: 1.0000
## Alpha: 0.5 | Best AUC: 1.0000
## Alpha: 0.8 | Best AUC: 1.0000
##
## === Cross-Validation Results Table ===
## Alpha Lambda AUC SD
## 1 0.2 0.001000000 0.9831382 0.017036565
## 2 0.2 0.002154435 1.0000000 0.000000000
## 3 0.2 0.004641589 1.0000000 0.000000000
## 4 0.2 0.010000000 1.0000000 0.000000000
## 5 0.2 0.021544347 1.0000000 0.000000000
## 6 0.2 0.046415888 1.0000000 0.000000000
## 7 0.2 0.100000000 1.0000000 0.000000000
## 8 0.2 0.215443469 1.0000000 0.000000000
## 9 0.2 0.464158883 1.0000000 0.000000000
## 10 0.2 1.000000000 1.0000000 0.000000000
## 11 0.5 0.001000000 0.5000000 0.000000000
## 12 0.5 0.002154435 0.9577674 0.025894199
## 13 0.5 0.004641589 1.0000000 0.000000000
## 14 0.5 0.010000000 1.0000000 0.000000000
## 15 0.5 0.021544347 1.0000000 0.000000000
## 16 0.5 0.046415888 1.0000000 0.000000000
## 17 0.5 0.100000000 1.0000000 0.000000000
## 18 0.5 0.215443469 0.9943794 0.005678855
## 19 0.5 0.464158883 0.9887588 0.011357710
## 20 0.5 1.000000000 0.9887588 0.011357710
## 21 0.8 0.001000000 0.5000000 0.000000000
## 22 0.8 0.002154435 0.9505074 0.018749019
## 23 0.8 0.004641589 0.9831382 0.017036565
## 24 0.8 0.010000000 1.0000000 0.000000000
## 25 0.8 0.021544347 1.0000000 0.000000000
## 26 0.8 0.046415888 1.0000000 0.000000000
## 27 0.8 0.100000000 1.0000000 0.000000000
## 28 0.8 0.215443469 1.0000000 0.000000000
## 29 0.8 0.464158883 1.0000000 0.000000000
## 30 0.8 1.000000000 1.0000000 0.000000000
##
## === Elastic Net Final Model ===
## Best alpha: 0.2
## Best lambda: 0.4641589
## Validation AUC: 1
##
## === Confusion Matrix ===
## Actual
## Predicted Class0 Class1
## Class0 10 1
## Class1 0 5
##
## Accuracy: 0.9375
##
## Sensitivity (Recall): 0.8333
##
## Specificity: 1
##
## Test AUC: 1
##
## === Elastic Net Coefficients ===
## Feature Coefficient
## V3 0.35771257
## V2 0.33151676
## V1 0.19741952
## V6 0.15318102
## V5 0.14636532
## V4 0.07730440
## (Intercept) 0.04231159
## V191 -0.02835118
## V159 0.02283345
## V109 -0.01887627
## V7 0.00000000
## V8 0.00000000
## V9 0.00000000
## V10 0.00000000
## V11 0.00000000
## V12 0.00000000
## V13 0.00000000
## V14 0.00000000
## V15 0.00000000
## V16 0.00000000
## V17 0.00000000
## V18 0.00000000
## V19 0.00000000
## V20 0.00000000
## V21 0.00000000
## V22 0.00000000
## V23 0.00000000
## V24 0.00000000
## V25 0.00000000
## V26 0.00000000
## V27 0.00000000
## V28 0.00000000
## V29 0.00000000
## V30 0.00000000
## V31 0.00000000
## V32 0.00000000
## V33 0.00000000
## V34 0.00000000
## V35 0.00000000
## V36 0.00000000
## V37 0.00000000
## V38 0.00000000
## V39 0.00000000
## V40 0.00000000
## V41 0.00000000
## V42 0.00000000
## V43 0.00000000
## V44 0.00000000
## V45 0.00000000
## V46 0.00000000
## V47 0.00000000
## V48 0.00000000
## V49 0.00000000
## V50 0.00000000
## V51 0.00000000
## V52 0.00000000
## V53 0.00000000
## V54 0.00000000
## V55 0.00000000
## V56 0.00000000
## V57 0.00000000
## V58 0.00000000
## V59 0.00000000
## V60 0.00000000
## V61 0.00000000
## V62 0.00000000
## V63 0.00000000
## V64 0.00000000
## V65 0.00000000
## V66 0.00000000
## V67 0.00000000
## V68 0.00000000
## V69 0.00000000
## V70 0.00000000
## V71 0.00000000
## V72 0.00000000
## V73 0.00000000
## V74 0.00000000
## V75 0.00000000
## V76 0.00000000
## V77 0.00000000
## V78 0.00000000
## V79 0.00000000
## V80 0.00000000
## V81 0.00000000
## V82 0.00000000
## V83 0.00000000
## V84 0.00000000
## V85 0.00000000
## V86 0.00000000
## V87 0.00000000
## V88 0.00000000
## V89 0.00000000
## V90 0.00000000
## V91 0.00000000
## V92 0.00000000
## V93 0.00000000
## V94 0.00000000
## V95 0.00000000
## V96 0.00000000
## V97 0.00000000
## V98 0.00000000
## V99 0.00000000
## V100 0.00000000
## V101 0.00000000
## V102 0.00000000
## V103 0.00000000
## V104 0.00000000
## V105 0.00000000
## V106 0.00000000
## V107 0.00000000
## V108 0.00000000
## V110 0.00000000
## V111 0.00000000
## V112 0.00000000
## V113 0.00000000
## V114 0.00000000
## V115 0.00000000
## V116 0.00000000
## V117 0.00000000
## V118 0.00000000
## V119 0.00000000
## V120 0.00000000
## V121 0.00000000
## V122 0.00000000
## V123 0.00000000
## V124 0.00000000
## V125 0.00000000
## V126 0.00000000
## V127 0.00000000
## V128 0.00000000
## V129 0.00000000
## V130 0.00000000
## V131 0.00000000
## V132 0.00000000
## V133 0.00000000
## V134 0.00000000
## V135 0.00000000
## V136 0.00000000
## V137 0.00000000
## V138 0.00000000
## V139 0.00000000
## V140 0.00000000
## V141 0.00000000
## V142 0.00000000
## V143 0.00000000
## V144 0.00000000
## V145 0.00000000
## V146 0.00000000
## V147 0.00000000
## V148 0.00000000
## V149 0.00000000
## V150 0.00000000
## V151 0.00000000
## V152 0.00000000
## V153 0.00000000
## V154 0.00000000
## V155 0.00000000
## V156 0.00000000
## V157 0.00000000
## V158 0.00000000
## V160 0.00000000
## V161 0.00000000
## V162 0.00000000
## V163 0.00000000
## V164 0.00000000
## V165 0.00000000
## V166 0.00000000
## V167 0.00000000
## V168 0.00000000
## V169 0.00000000
## V170 0.00000000
## V171 0.00000000
## V172 0.00000000
## V173 0.00000000
## V174 0.00000000
## V175 0.00000000
## V176 0.00000000
## V177 0.00000000
## V178 0.00000000
## V179 0.00000000
## V180 0.00000000
## V181 0.00000000
## V182 0.00000000
## V183 0.00000000
## V184 0.00000000
## V185 0.00000000
## V186 0.00000000
## V187 0.00000000
## V188 0.00000000
## V189 0.00000000
## V190 0.00000000
## V192 0.00000000
## V193 0.00000000
## V194 0.00000000
## V195 0.00000000
## V196 0.00000000
## V197 0.00000000
## V198 0.00000000
## V199 0.00000000
## V200 0.00000000
##
## Regularization Statistics:
## L1 Norm (Sum|Coefficients|): 1.375872
## L2 Norm (Sum Coefficients^2): 0.3311711
## Sparsity Ratio: 0.9502488
The Elastic Net model posts near-perfect test metrics (93.75% accuracy, 1.0 test AUC) with high sparsity (95% of coefficients exactly zero), indicating strong regularization and feature selection. However, the perfect AUC in both cross-validation and testing, with zero variance across folds, strongly suggests overfitting; the results are overly optimistic and unreliable without further validation. The chosen alpha (0.2) leans toward Ridge, but the lack of demonstrated generalizability remains a critical concern.
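The Elastic Net tuning above amounts to a grid search over alpha, calling `cv.glmnet` at each value; a sketch under the same assumed training objects:

```r
library(glmnet)

alphas <- c(0.2, 0.5, 0.8)
fits <- lapply(alphas, function(a)
  cv.glmnet(X_train_scaled, Y_train, family = "binomial",
            alpha = a, type.measure = "auc",
            lambda = 10^seq(0, -3, length.out = 10)))

# with type.measure = "auc", cvm holds the cross-validated AUC per lambda
best_auc <- sapply(fits, function(f) max(f$cvm))
best     <- fits[[which.max(best_auc)]]
best$lambda.min
```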
Among the three regularized models, Ridge appears to be the better choice, although it is still susceptible to overfitting due to the limitations of the dataset.
##
## === Cross-Validation Results ===
## Mean Cross-Validation AUC: 0.8105
##
## === RPART Model Summary ===
##
## Classification tree:
## rpart(formula = Y_train ~ ., data = data.frame(X_train_scaled,
## Y_train = Y_train), method = "class", control = rpart.control(xval = 5,
## cp = 1e-04, minsplit = 5, minbucket = 10, maxdepth = 8))
##
## Variables actually used in tree construction:
## [1] V2
##
## Root node error: 30/61 = 0.4918
##
## n= 61
##
## CP nsplit rel error xerror xstd
## 1 8e-01 0 1.0 1.30000 0.125014
## 2 1e-04 1 0.2 0.26667 0.087881

##
## Optimal Complexity Parameter (CP): 1e-04
## Final Tree Size: 3 nodes
##
## === Confusion Matrix ===
## Actual
## Predicted Class0 Class1
## Class0 9 2
## Class1 1 4
##
## Accuracy: 0.8125
##
## Sensitivity (Recall): 0.6667
##
## Specificity: 0.9
##
## Test AUC: 0.7833
##
## === Variable Importance ===
## V2 V3 V1 V109 V38 V17
## 19.767665 15.677804 8.861367 8.861367 7.498080 6.816436
The RPART
model demonstrates moderate performance with a cross-validation AUC of
0.81 and a test AUC of 0.78, indicating no significant overfitting and
better consistency compared to Ridge. The tree is based solely on one
variable (V2), making the model extremely simple and transparent, but
potentially underutilizing valuable information from other features.
Let’s fine-tune the hyperparameters to improve the robustness of the tree.
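The fine-tuning amounts to tightening `rpart.control`; a sketch matching the control settings reported in the summary that follows:

```r
library(rpart)

fit_tree <- rpart(Y_train ~ .,
                  data = data.frame(X_train_scaled, Y_train = Y_train),
                  method = "class",
                  control = rpart.control(xval = 5, cp = 1e-4,
                                          minsplit = 5,
                                          minbucket = 5,   # was 10
                                          maxdepth = 5))   # was 8
printcp(fit_tree)  # CP table and cross-validated error
```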
##
## === Cross-Validation Results ===
## Mean Cross-Validation AUC: 0.8441
##
## === RPART Model Summary ===
##
## Classification tree:
## rpart(formula = Y_train ~ ., data = data.frame(X_train_scaled,
## Y_train = Y_train), method = "class", control = rpart.control(xval = 5,
## cp = 1e-04, minsplit = 5, minbucket = 5, maxdepth = 5))
##
## Variables actually used in tree construction:
## [1] V2 V6
##
## Root node error: 30/61 = 0.4918
##
## n= 61
##
## CP nsplit rel error xerror xstd
## 1 8e-01 0 1.0 1.30000 0.125014
## 2 1e-01 1 0.2 0.26667 0.087881
## 3 1e-04 2 0.1 0.20000 0.077530

##
## Optimal Complexity Parameter (CP): 1e-04
## Final Tree Size: 5 nodes
##
## === Confusion Matrix ===
## Actual
## Predicted Class0 Class1
## Class0 8 0
## Class1 2 6
##
## Accuracy: 0.875
##
## Sensitivity (Recall): 1
##
## Specificity: 0.8
##
## Test AUC: 0.9167
##
## === Variable Importance ===
## V2 V3 V109 V1 V38 V17 V6 V23
## 19.767665 15.677804 12.101367 8.861367 7.498080 6.816436 5.400000 3.240000
## V4 V43 V5
## 2.160000 2.160000 2.160000
The tuned
RPART model demonstrates improved performance, with a cross-validation
AUC of 0.8441 and a test AUC of 0.9167, achieving better generalization
and consistency compared to the initial version. Lowering minbucket from 10
to 5 (the tuned control also reduced maxdepth from 8 to 5) produced the
gain; further adjustments to cp and minsplit did not lead to additional
improvements. The resulting tree now uses two
variables (V2 and V6) and grows to 5 nodes, balancing simplicity with
the ability to capture more information from the data.
##
## === Cross-Validation Results ===
## Mean AUC: 1
## Mean Accuracy: 0.9179
## Mean Sensitivity (Recall): 0.9167
## Mean Specificity: 0.9417
##
## === Confusion Matrix ===
## Actual
## Predicted Class0 Class1
## Class0 9 2
## Class1 1 4
##
## Accuracy: 0.8125
##
## Sensitivity (Recall): 0.6667
##
## Specificity: 0.9
##
## Test AUC: 0.9333
Despite the
Random Forest model achieving a slightly higher test AUC (0.93) compared
to the fine-tuned RPART model (0.92), its perfect cross-validation AUC
of 1 strongly suggests overfitting, particularly given the small dataset
size. The RPART model, by contrast, is based on just two variables (V2
and V6) and has a simple 5-node structure, offering transparency and
interpretability, whereas Random Forest’s ensemble nature makes it more
complex and less interpretable, which can be a disadvantage in contexts
requiring explainability. In this case, the fine-tuned RPART model
appears more reliable than Random Forest. Nonetheless, let’s explore
VSURF feature selection technique, to potentially improve Random
Forest’s performance and reduce overfitting risks.
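VSURF runs its three-step selection (thresholding, interpretation, prediction) on top of random forests; a sketch, assuming the same training objects and that `Y_train` is a factor:

```r
library(VSURF)
library(randomForest)

set.seed(42)  # illustrative
vs <- VSURF(x = X_train_scaled, y = Y_train, ntree = 500)

vs$varselect.pred   # indices of variables kept at the prediction step

# Refit a random forest on the selected variables only
rf_sel <- randomForest(x = X_train_scaled[, vs$varselect.pred, drop = FALSE],
                       y = Y_train)
```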
##
## === Cross-Validation Results ===
## Mean AUC: 1
## Mean Accuracy: 0.9179
## Mean Sensitivity (Recall): 0.9167
## Mean Specificity: 0.9417
## Thresholding step
## Estimated computational time (on one core): 2 sec.
##
## Interpretation step (on 20 variables)
## Estimated computational time (on one core): between 0 sec. and 0 sec.
##
## Prediction step (on 3 variables)
## Maximum estimated computational time (on one core): 0 sec.
##
## === Confusion Matrix ===
## Actual
## Predicted Class0 Class1
## Class0 8 1
## Class1 2 5
##
## Accuracy: 0.8125
##
## Sensitivity (Recall): 0.8333
##
## Specificity: 0.8
##
## Test AUC: 0.95

The Random Forest model with VSURF feature selection shows strong performance, achieving a test AUC of 0.95, slightly higher than the previous Random Forest model (AUC = 0.9333). However, the cross-validation AUC remains a suspiciously perfect 1.0 (with accuracy 0.9179), suggesting that overfitting is still a concern, likely due to the small dataset and limited test set. The VSURF process reduced the model to 3 variables at the final prediction step, improving interpretability over the original Random Forest while maintaining competitive performance. Although the model generalizes better than before, the inflated cross-validation results mean further validation is needed to confirm its reliability.
We compared three candidate models (Ridge, the fine-tuned RPART, and Random Forest with VSURF) to identify the best-performing model for this dataset. The fine-tuned RPART model stands out as the most reliable choice despite its slightly lower test AUC (0.9167): it avoids the overfitting risks seen in Ridge's reliance on all features and in Random Forest's perfect cross-validation metrics. Its simplicity, interpretability, and reliance on just two key variables (V2 and V6) make it more practical for real-world applications. Notably, all three models consistently identify V2, V3, and V6 as the most important features, highlighting their critical role in capturing the underlying patterns. This agreement across different modeling approaches reinforces the relevance of these variables and suggests they should be prioritized in future analyses or model development.
Task statement. Repeat the modeling workflow using real-world PM10
pollution observations from the Rouen area available in the VSURF
package, and identify the best-performing approach.
## 'data.frame': 1096 obs. of 18 variables:
## $ PM10 : int 12 15 25 33 23 14 20 15 22 28 ...
## $ NO : int 14 7 22 34 41 14 19 13 9 13 ...
## $ NO2 : int 37 28 47 43 46 48 51 42 45 48 ...
## $ SO2 : int 8 7 13 16 19 18 11 12 8 12 ...
## $ T.min : num -0.6 -1.3 -4.2 -2.5 2.4 6.5 3 4.8 5.6 5.4 ...
## $ T.max : num 4.4 5.6 -0.7 4 7.7 9.1 8.8 9 10.4 11.7 ...
## $ T.moy : num 1.17 2.01 -2.93 1.52 5.03 ...
## $ DV.maxvv : int 160 20 40 200 200 210 160 160 260 210 ...
## $ DV.dom : num 180 22.5 180 202.5 180 ...
## $ VV.max : int 8 7 3 4 5 6 7 9 10 7 ...
## $ VV.moy : num 5.35 4.96 2.25 2.71 3.25 ...
## $ PL.som : int 11 3 0 0 0 0 0 9 3 0 ...
## $ HR.min : int 93 78 63 86 91 87 74 85 74 82 ...
## $ HR.max : int 97 96 91 97 98 97 94 96 94 97 ...
## $ HR.moy : num 96.2 87.9 82.6 94 96.2 ...
## $ PA.moy : num 1011 1017 1024 1022 1022 ...
## $ GTrouen : num 0.7 -0.5 2.4 2.7 2.4 0.6 -0.1 0 -0.1 1.2 ...
## $ GTlehavre: num -0.3 0 0.1 -0.2 -0.2 -0.3 -0.5 NA -0.5 0.1 ...
## Dataset structure:
## List of 2
## $ X:'data.frame': 1096 obs. of 17 variables:
## ..$ NO : int [1:1096] 14 7 22 34 41 14 19 13 9 13 ...
## ..$ NO2 : int [1:1096] 37 28 47 43 46 48 51 42 45 48 ...
## ..$ SO2 : int [1:1096] 8 7 13 16 19 18 11 12 8 12 ...
## ..$ T.min : num [1:1096] -0.6 -1.3 -4.2 -2.5 2.4 6.5 3 4.8 5.6 5.4 ...
## ..$ T.max : num [1:1096] 4.4 5.6 -0.7 4 7.7 9.1 8.8 9 10.4 11.7 ...
## ..$ T.moy : num [1:1096] 1.17 2.01 -2.93 1.52 5.03 ...
## ..$ DV.maxvv : int [1:1096] 160 20 40 200 200 210 160 160 260 210 ...
## ..$ DV.dom : num [1:1096] 180 22.5 180 202.5 180 ...
## ..$ VV.max : int [1:1096] 8 7 3 4 5 6 7 9 10 7 ...
## ..$ VV.moy : num [1:1096] 5.35 4.96 2.25 2.71 3.25 ...
## ..$ PL.som : int [1:1096] 11 3 0 0 0 0 0 9 3 0 ...
## ..$ HR.min : int [1:1096] 93 78 63 86 91 87 74 85 74 82 ...
## ..$ HR.max : int [1:1096] 97 96 91 97 98 97 94 96 94 97 ...
## ..$ HR.moy : num [1:1096] 96.2 87.9 82.6 94 96.2 ...
## ..$ PA.moy : num [1:1096] 1011 1017 1024 1022 1022 ...
## ..$ GTrouen : num [1:1096] 0.7 -0.5 2.4 2.7 2.4 0.6 -0.1 0 -0.1 1.2 ...
## ..$ GTlehavre: num [1:1096] -0.3 0 0.1 -0.2 -0.2 -0.3 -0.5 NA -0.5 0.1 ...
## $ Y: int [1:1096] 12 15 25 33 23 14 20 15 22 28 ...
##
## Feature matrix dimensions: 1096 17
##
## Response variable summary (numeric):
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6.00 16.00 20.00 21.19 25.00 95.00 10
The dataset differs from the previous one in that the response variable is numeric rather than binary. We also have a far more favorable shape: 1096 observations for just 17 predictors.
##
##
## Missing values check:
##
## Features missing values: 89
##
## Response missing values: 10
As the number of missing values is small, for the sake of simplicity we simply delete the affected rows.
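The removal can be done in one step; a sketch, assuming `pm10` holds the two-element list (features in `$X`, response in `$Y`) shown in the structure above:

```r
# Bind features and response, then drop any row with a missing value
df  <- data.frame(pm10$X, Y = pm10$Y)
df2 <- na.omit(df)

cat("Removed rows:", nrow(df) - nrow(df2), "\n")
cat("New dataset size:", nrow(df2), "observations\n")
```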
## Removed rows: 64
## New dataset size: 1032 observations
Let’s double-check for missing values after deleting the affected rows.
##
##
## Missing values check:
##
## Features missing values: 0
##
## Response missing values: 0
##
##
## Scaling check:
##
## Feature mean range: 0.62 1017.62
##
## Feature SD range: 1.24 102.77
Scaling is clearly required; we will apply it in the subsequent steps.
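Centering and scaling can be done with `scale()`, fitting the parameters on the training set only and reusing them on the test set (a sketch; the `X_train`/`X_test` split objects are assumptions):

```r
# Fit centering/scaling parameters on the training features only
X_train_scaled <- scale(X_train)

# Reuse the training parameters for the test features (no leakage)
X_test_scaled <- scale(X_test,
                       center = attr(X_train_scaled, "scaled:center"),
                       scale  = attr(X_train_scaled, "scaled:scale"))

range(colMeans(X_train_scaled))       # ~0 after centering
range(apply(X_train_scaled, 2, sd))   # ~1 after scaling
```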
##
##
## Highly correlated feature pairs (|r| > 0.8): 5
##
## Highly Correlated Feature Pair 1 :
## Feature 1: T.min
## Feature 2: T.max
## Correlation between T.min and T.max : 0.9
##
## Highly Correlated Feature Pair 2 :
## Feature 1: T.min
## Feature 2: T.moy
## Correlation between T.min and T.moy : 0.96
##
## Highly Correlated Feature Pair 3 :
## Feature 1: T.max
## Feature 2: T.moy
## Correlation between T.max and T.moy : 0.98
##
## Highly Correlated Feature Pair 4 :
## Feature 1: VV.max
## Feature 2: VV.moy
## Correlation between VV.max and VV.moy : 0.91
##
## Highly Correlated Feature Pair 5 :
## Feature 1: HR.min
## Feature 2: HR.moy
## Correlation between HR.min and HR.moy : 0.92
The analysis identified five highly correlated feature pairs within the temperature, wind, and humidity categories, each exhibiting correlation coefficients above 0.8. Specifically, the temperature features (T.min, T.max, T.moy), wind features (VV.max, VV.moy), and humidity features (HR.min, HR.moy) show strong interrelationships. Given this high multicollinearity, it may be tempting to retain only the average values for each category. However, the analysis will initially include all features and progressively address multicollinearity by eliminating redundant variables.
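One common way to drop the redundant variables is `caret::findCorrelation` on the correlation matrix; a sketch (the 0.8 cutoff matches the threshold used above, and the flagged columns would be candidates such as the redundant temperature, wind, and humidity variables):

```r
library(caret)

cor_mat  <- cor(X_train)
drop_idx <- findCorrelation(cor_mat, cutoff = 0.8)

colnames(X_train)[drop_idx]            # features flagged for removal
X_train_filtered <- X_train[, -drop_idx]
```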
##
## === Linear Model Summary ===
##
## Call:
## lm(formula = Y ~ ., data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.143 -3.413 -0.595 2.582 49.569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.26182 0.19864 107.040 < 2e-16 ***
## NO 3.66921 0.32151 11.413 < 2e-16 ***
## NO2 1.36561 0.32519 4.199 2.97e-05 ***
## SO2 1.65814 0.25542 6.492 1.48e-10 ***
## T.min 3.62535 1.20365 3.012 0.00268 **
## T.max 5.99994 1.95246 3.073 0.00219 **
## T.moy -7.21958 2.60193 -2.775 0.00565 **
## DV.maxvv -0.40568 0.26198 -1.549 0.12189
## DV.dom -0.53909 0.26921 -2.002 0.04557 *
## VV.max -0.40471 0.51630 -0.784 0.43335
## VV.moy -0.08347 0.53393 -0.156 0.87582
## PL.som -0.57053 0.22945 -2.486 0.01310 *
## HR.min 1.43588 0.74140 1.937 0.05313 .
## HR.max -0.60187 0.39285 -1.532 0.12590
## HR.moy -1.57377 0.89052 -1.767 0.07756 .
## PA.moy 0.74323 0.24298 3.059 0.00230 **
## GTrouen 0.18567 0.37425 0.496 0.61996
## GTlehavre 1.06773 0.34076 3.133 0.00179 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.705 on 807 degrees of freedom
## Multiple R-squared: 0.5717, Adjusted R-squared: 0.5627
## F-statistic: 63.37 on 17 and 807 DF, p-value: < 2.2e-16
##
## === Cross-Validation Results ===
## Fold MSE RMSE R2
## 1 1 61.79753 7.861140 0.4658901
## 2 2 37.00695 6.083334 0.6008507
## 3 3 24.98802 4.998802 0.5955907
## 4 4 24.20858 4.920221 0.5635860
## 5 5 24.57733 4.957553 0.4486567
##
## Mean CV MSE: 34.5157
##
## Mean CV RMSE: 5.7642
##
## Mean CV R-squared: 0.5349
##
## === Test Set Evaluation ===
## MSE: 27.9471
## RMSE: 5.2865
## R-squared: 0.4961
##
## === Model Coefficients ===
## Feature Coefficient
## 1 (Intercept) 21.2618
## 2 NO 3.6692
## 3 NO2 1.3656
## 4 SO2 1.6581
## 5 T.min 3.6253
## 6 T.max 5.9999
## 7 T.moy -7.2196
## 8 DV.maxvv -0.4057
## 9 DV.dom -0.5391
## 10 VV.max -0.4047
## 11 VV.moy -0.0835
## 12 PL.som -0.5705
## 13 HR.min 1.4359
## 14 HR.max -0.6019
## 15 HR.moy -1.5738
## 16 PA.moy 0.7432
## 17 GTrouen 0.1857
## 18 GTlehavre 1.0677
##
## Number of coefficients (including intercept): 18
##
## Number of features (excluding intercept): 17
##
## === Correlation Filtering ===
## Highly correlated feature pairs (|r| > 0.8): 5
##
## Removing features with high multicollinearity:
## [1] "T.max" "T.moy" "VV.moy" "HR.moy"
##
## Highly correlated pairs details:
## T.min vs T.max : r = 0.90
## T.min vs T.moy : r = 0.96
## T.max vs T.moy : r = 0.98
## VV.max vs VV.moy : r = 0.91
## HR.min vs HR.moy : r = 0.92
##
## === Linear Model Summary ===
##
## Call:
## lm(formula = Y ~ ., data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.821 -3.473 -0.610 2.440 50.311
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.2618 0.1994 106.609 < 2e-16 ***
## NO 3.7271 0.3219 11.579 < 2e-16 ***
## NO2 1.4170 0.3204 4.422 1.11e-05 ***
## SO2 1.7053 0.2549 6.689 4.17e-11 ***
## T.min 2.0569 0.2695 7.633 6.43e-14 ***
## DV.maxvv -0.4152 0.2597 -1.599 0.11018
## DV.dom -0.6364 0.2678 -2.376 0.01774 *
## VV.max -0.4760 0.2525 -1.885 0.05982 .
## PL.som -0.5597 0.2264 -2.473 0.01361 *
## HR.min -0.2854 0.2854 -1.000 0.31757
## HR.max -0.9466 0.2516 -3.763 0.00018 ***
## PA.moy 0.7353 0.2411 3.049 0.00237 **
## GTrouen 0.4042 0.3438 1.176 0.24006
## GTlehavre 1.1842 0.3383 3.500 0.00049 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.728 on 811 degrees of freedom
## Multiple R-squared: 0.5661, Adjusted R-squared: 0.5592
## F-statistic: 81.4 on 13 and 811 DF, p-value: < 2.2e-16
##
## === Cross-Validation Results ===
## Fold MSE RMSE R2
## 1 1 61.43289 7.837914 0.4690416
## 2 2 36.84339 6.069876 0.6026148
## 3 3 25.43862 5.043671 0.5882982
## 4 4 23.61515 4.859543 0.5742838
## 5 5 25.20303 5.020262 0.4346206
##
## Mean CV MSE: 34.5066
##
## Mean CV RMSE: 5.7663
##
## Mean CV R-squared: 0.5338
##
## === Test Set Evaluation ===
## MSE: 28.4852
## RMSE: 5.3372
## R-squared: 0.4864
##
## === Model Coefficients ===
## Feature Coefficient
## 1 (Intercept) 21.2618
## 2 NO 3.7271
## 3 NO2 1.4170
## 4 SO2 1.7053
## 5 T.min 2.0569
## 6 DV.maxvv -0.4152
## 7 DV.dom -0.6364
## 8 VV.max -0.4760
## 9 PL.som -0.5597
## 10 HR.min -0.2854
## 11 HR.max -0.9466
## 12 PA.moy 0.7353
## 13 GTrouen 0.4042
## 14 GTlehavre 1.1842
##
## Number of coefficients (including intercept): 14
##
## Number of features (excluding intercept): 13
##
## === VIF Analysis ===
## Loading required package: carData
##
## -- VIF Iteration 1 --
## Removing feature: T.moy (VIF = 171.4 )
##
## -- VIF Iteration 2 --
## Removing feature: T.max (VIF = 21.2 )
##
## -- VIF Iteration 3 --
## Removing feature: HR.moy (VIF = 16.7 )
##
## -- VIF Iteration 4 --
##
## Final removed features by VIF: T.moy, T.max, HR.moy
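The iteration above repeatedly refits the model and drops the predictor with the largest VIF until all remaining VIFs fall below the threshold. A from-first-principles sketch (`car::vif` computes the same quantity for `lm` fits; the data and threshold here are illustrative):

```r
# VIF from first principles: regress each predictor on all the others
vif_manual <- function(df) {
  vapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

# Drop the worst offender and recompute until every VIF is under the threshold
drop_by_vif <- function(df, threshold = 10) {
  removed <- character(0)
  repeat {
    v <- vif_manual(df)
    if (max(v) < threshold || ncol(df) <= 2) break  # stop before degenerate fits
    worst <- names(which.max(v))
    removed <- c(removed, worst)
    df[[worst]] <- NULL
  }
  list(kept = names(df), removed = removed)
}

# Illustration: x2 is almost a copy of x1, so one of the two gets dropped
set.seed(3)
x1 <- rnorm(100)
d <- data.frame(x1 = x1, x2 = x1 + rnorm(100, sd = 0.1), x3 = rnorm(100))
drop_by_vif(d)$removed
```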
##
## === Linear Model Summary ===
##
## Call:
## lm(formula = Y ~ ., data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.833 -3.482 -0.597 2.433 50.341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.2618 0.1996 106.548 < 2e-16 ***
## NO 3.7267 0.3221 11.571 < 2e-16 ***
## NO2 1.4053 0.3236 4.343 1.58e-05 ***
## SO2 1.7066 0.2551 6.690 4.17e-11 ***
## T.min 2.0558 0.2697 7.624 6.90e-14 ***
## DV.maxvv -0.4263 0.2631 -1.620 0.105537
## DV.dom -0.6311 0.2687 -2.349 0.019084 *
## VV.max -0.3557 0.5157 -0.690 0.490525
## VV.moy -0.1417 0.5297 -0.267 0.789201
## PL.som -0.5699 0.2297 -2.481 0.013285 *
## HR.min -0.2687 0.2923 -0.919 0.358261
## HR.max -0.9622 0.2584 -3.724 0.000210 ***
## PA.moy 0.7313 0.2417 3.025 0.002562 **
## GTrouen 0.3960 0.3453 1.147 0.251790
## GTlehavre 1.1815 0.3387 3.489 0.000511 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.732 on 810 degrees of freedom
## Multiple R-squared: 0.5662, Adjusted R-squared: 0.5587
## F-statistic: 75.5 on 14 and 810 DF, p-value: < 2.2e-16
##
## === Cross-Validation Results ===
## Fold MSE RMSE R2
## 1 1 61.81030 7.861953 0.4657797
## 2 2 36.84752 6.070215 0.6025704
## 3 3 25.48758 5.048523 0.5875057
## 4 4 23.63101 4.861173 0.5739980
## 5 5 25.20225 5.020184 0.4346380
##
## Mean CV MSE: 34.5957
##
## Mean CV RMSE: 5.7724
##
## Mean CV R-squared: 0.5329
##
## === Test Set Evaluation ===
## MSE: 28.457
## RMSE: 5.3345
## R-squared: 0.4869
##
## === Final Model Diagnostics ===
##
## Variance Inflation Factors (VIF):
## Feature VIF
## 1 NO 2.601794
## 2 NO2 2.626030
## 3 SO2 1.632406
## 4 T.min 1.823807
## 5 DV.maxvv 1.735874
## 6 DV.dom 1.811104
## 7 VV.max 6.671103
## 8 VV.moy 7.037325
## 9 PL.som 1.323062
## 10 HR.min 2.143153
## 11 HR.max 1.674314
## 12 PA.moy 1.465635
## 13 GTrouen 2.991305
## 14 GTlehavre 2.876678
##
## === Model Coefficients ===
## Feature Coefficient
## 1 (Intercept) 21.2618
## 2 NO 3.7267
## 3 NO2 1.4053
## 4 SO2 1.7066
## 5 T.min 2.0558
## 6 DV.maxvv -0.4263
## 7 DV.dom -0.6311
## 8 VV.max -0.3557
## 9 VV.moy -0.1417
## 10 PL.som -0.5699
## 11 HR.min -0.2687
## 12 HR.max -0.9622
## 13 PA.moy 0.7313
## 14 GTrouen 0.3960
## 15 GTlehavre 1.1815
##
## Number of coefficients (including intercept): 15
##
## Number of features (excluding intercept): 14
##
## === Stepwise Selection ===
## Start: AIC=2891.12
## Y ~ NO + NO2 + SO2 + T.min + T.max + T.moy + DV.maxvv + DV.dom +
## VV.max + VV.moy + PL.som + HR.min + HR.max + HR.moy + PA.moy +
## GTrouen + GTlehavre
##
## Df Sum of Sq RSS AIC
## - VV.moy 1 0.8 26270 2889.2
## - GTrouen 1 8.0 26277 2889.4
## - VV.max 1 20.0 26289 2889.8
## <none> 26269 2891.1
## - HR.max 1 76.4 26345 2891.5
## - DV.maxvv 1 78.1 26347 2891.6
## - HR.moy 1 101.7 26370 2892.3
## - HR.min 1 122.1 26391 2892.9
## - DV.dom 1 130.5 26399 2893.2
## - PL.som 1 201.3 26470 2895.4
## - T.moy 1 250.6 26519 2896.9
## - T.min 1 295.3 26564 2898.3
## - PA.moy 1 304.5 26573 2898.6
## - T.max 1 307.4 26576 2898.7
## - GTlehavre 1 319.6 26588 2899.1
## - NO2 1 574.1 26843 2907.0
## - SO2 1 1371.8 27641 2931.1
## - NO 1 4239.7 30508 3012.6
##
## Step: AIC=2889.15
## Y ~ NO + NO2 + SO2 + T.min + T.max + T.moy + DV.maxvv + DV.dom +
## VV.max + PL.som + HR.min + HR.max + HR.moy + PA.moy + GTrouen +
## GTlehavre
##
## Df Sum of Sq RSS AIC
## - GTrouen 1 8.4 26278 2887.4
## <none> 26270 2889.2
## - HR.max 1 75.8 26345 2889.5
## - DV.maxvv 1 77.5 26347 2889.6
## - HR.moy 1 100.9 26371 2890.3
## - VV.max 1 115.5 26385 2890.8
## - HR.min 1 121.9 26391 2891.0
## + VV.moy 1 0.8 26269 2891.1
## - DV.dom 1 132.7 26402 2891.3
## - PL.som 1 202.0 26472 2893.5
## - T.moy 1 253.4 26523 2895.1
## - T.min 1 296.0 26566 2896.4
## - PA.moy 1 309.2 26579 2896.8
## - T.max 1 311.9 26582 2896.9
## - GTlehavre 1 320.9 26591 2897.2
## - NO2 1 586.4 26856 2905.4
## - SO2 1 1371.2 27641 2929.1
## - NO 1 4239.3 30509 3010.6
##
## Step: AIC=2887.41
## Y ~ NO + NO2 + SO2 + T.min + T.max + T.moy + DV.maxvv + DV.dom +
## VV.max + PL.som + HR.min + HR.max + HR.moy + PA.moy + GTlehavre
##
## Df Sum of Sq RSS AIC
## <none> 26278 2887.4
## - DV.maxvv 1 73.8 26352 2887.7
## - HR.max 1 75.4 26353 2887.8
## - HR.moy 1 99.6 26378 2888.5
## + GTrouen 1 8.4 26270 2889.2
## - HR.min 1 123.4 26401 2889.3
## - VV.max 1 124.4 26402 2889.3
## + VV.moy 1 1.2 26277 2889.4
## - DV.dom 1 131.9 26410 2889.5
## - PL.som 1 208.7 26487 2891.9
## - T.moy 1 253.4 26531 2893.3
## - T.min 1 288.6 26567 2894.4
## - PA.moy 1 305.7 26584 2894.9
## - T.max 1 343.4 26621 2896.1
## - GTlehavre 1 511.4 26789 2901.3
## - NO2 1 613.8 26892 2904.5
## - SO2 1 1369.0 27647 2927.3
## - NO 1 4422.5 30701 3013.7
## Selected features after stepwise selection: NO, NO2, SO2, T.min, T.max, T.moy, DV.maxvv, DV.dom, VV.max, PL.som, HR.min, HR.max, HR.moy, PA.moy, GTlehavre
##
## === Linear Model Summary ===
##
## Call:
## lm(formula = selected_formula, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.210 -3.440 -0.644 2.546 49.610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.2618 0.1984 107.153 < 2e-16 ***
## NO 3.6960 0.3168 11.668 < 2e-16 ***
## NO2 1.3919 0.3202 4.347 1.56e-05 ***
## SO2 1.6554 0.2550 6.492 1.48e-10 ***
## T.min 3.4918 1.1714 2.981 0.00296 **
## T.max 6.2092 1.9098 3.251 0.00120 **
## T.moy -7.2458 2.5941 -2.793 0.00534 **
## DV.maxvv -0.3885 0.2577 -1.507 0.13208
## DV.dom -0.5405 0.2682 -2.015 0.04421 *
## VV.max -0.4899 0.2503 -1.957 0.05067 .
## PL.som -0.5730 0.2261 -2.535 0.01144 *
## HR.min 1.4274 0.7323 1.949 0.05160 .
## HR.max -0.5971 0.3919 -1.524 0.12799
## HR.moy -1.5483 0.8842 -1.751 0.08031 .
## PA.moy 0.7415 0.2417 3.068 0.00223 **
## GTlehavre 1.1582 0.2919 3.968 7.89e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.699 on 809 degrees of freedom
## Multiple R-squared: 0.5716, Adjusted R-squared: 0.5636
## F-statistic: 71.95 on 15 and 809 DF, p-value: < 2.2e-16
##
## === Cross-Validation Results ===
## Fold MSE RMSE R2
## 1 1 61.00122 7.810328 0.4727725
## 2 2 36.96371 6.079778 0.6013172
## 3 3 24.94626 4.994623 0.5962666
## 4 4 24.14199 4.913450 0.5647864
## 5 5 24.33609 4.933162 0.4540686
##
## Mean CV MSE: 34.2779
##
## Mean CV RMSE: 5.7463
##
## Mean CV R-squared: 0.5378
##
## === Test Set Evaluation ===
## MSE: 27.8327
## RMSE: 5.2757
## R-squared: 0.4981
##
## === Final Model Coefficients ===
## Feature Coefficient
## 1 (Intercept) 21.2618
## 2 NO 3.6960
## 3 NO2 1.3919
## 4 SO2 1.6554
## 5 T.min 3.4918
## 6 T.max 6.2092
## 7 T.moy -7.2458
## 8 DV.maxvv -0.3885
## 9 DV.dom -0.5405
## 10 VV.max -0.4899
## 11 PL.som -0.5730
## 12 HR.min 1.4274
## 13 HR.max -0.5971
## 14 HR.moy -1.5483
## 15 PA.moy 0.7415
## 16 GTlehavre 1.1582
##
## Number of coefficients (including intercept): 16
##
## Number of features (excluding intercept): 15
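The backward search traced above is base R's AIC-driven `step()`. A minimal reproducible version on the built-in `mtcars` data (the report's `train_df` is not reproduced here):

```r
# Backward elimination by AIC, as in the stepwise section above
full <- lm(mpg ~ ., data = mtcars)
sel  <- step(full, direction = "backward", trace = 0)  # trace = 0 silences the log
formula(sel)   # the retained subset of predictors
```

At each step, `step()` drops the term whose removal lowers AIC the most, and stops when no removal improves it.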
##
## === PCA-Based Dimension Reduction ===
## Number of principal components selected: 9
##
## === Linear Model Summary ===
##
## Call:
## lm(formula = Y ~ ., data = train_pcs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.847 -3.446 -0.500 2.512 48.563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.26182 0.20286 104.808 < 2e-16 ***
## PC1 0.32908 0.09385 3.506 0.000479 ***
## PC2 2.89361 0.10497 27.567 < 2e-16 ***
## PC3 -0.24028 0.13511 -1.778 0.075711 .
## PC4 -0.66875 0.17303 -3.865 0.000120 ***
## PC5 -1.41473 0.18635 -7.592 8.63e-14 ***
## PC6 -2.08343 0.23939 -8.703 < 2e-16 ***
## PC7 0.33529 0.25491 1.315 0.188775
## PC8 2.03246 0.26107 7.785 2.11e-14 ***
## PC9 0.68843 0.29337 2.347 0.019183 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.827 on 815 degrees of freedom
## Multiple R-squared: 0.5489, Adjusted R-squared: 0.5439
## F-statistic: 110.2 on 9 and 815 DF, p-value: < 2.2e-16
##
## === Cross-Validation Results ===
## Fold MSE RMSE R2
## 1 1 61.95678 7.871263 0.4645137
## 2 2 40.10189 6.332605 0.5674694
## 3 3 26.20795 5.119370 0.5758473
## 4 4 21.79691 4.668716 0.6070618
## 5 5 25.34140 5.034024 0.4315164
##
## Mean CV MSE: 35.081
##
## Mean CV RMSE: 5.8052
##
## Mean CV R-squared: 0.5293
##
## === Test Set Evaluation ===
## MSE: 44.5091
## RMSE: 6.6715
## R-squared: 0.1974
##
## === Final Model Coefficients (PCA) ===
## Predictor Coefficient
## 1 (Intercept) 21.2618
## 2 PC1 0.3291
## 3 PC2 2.8936
## 4 PC3 -0.2403
## 5 PC4 -0.6688
## 6 PC5 -1.4147
## 7 PC6 -2.0834
## 8 PC7 0.3353
## 9 PC8 2.0325
## 10 PC9 0.6884
##
## Number of coefficients (including intercept): 10
##
## Number of principal components used (excluding intercept): 9
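The PCR pipeline above (PCA on the training features, then `lm` on the leading scores) can be sketched as follows. The data are synthetic and `k = 3` is arbitrary; the key point is that test data must be projected with the training rotation:

```r
# Principal-components regression sketch on synthetic data
set.seed(5)
X <- as.data.frame(matrix(rnorm(100 * 6), 100, 6))
y <- X[[1]] - 2 * X[[2]] + rnorm(100)

pca <- prcomp(X, center = TRUE, scale. = TRUE)
k <- 3                                     # number of components kept
train_pcs <- data.frame(pca$x[, 1:k], Y = y)
fit <- lm(Y ~ ., data = train_pcs)

# New observations must use the SAME centering/scaling/rotation
X_new <- as.data.frame(matrix(rnorm(5 * 6), 5, 6))
s_new <- predict(pca, newdata = X_new)[, 1:k]
preds <- predict(fit, newdata = as.data.frame(s_new))
```

Reusing the training `prcomp` object for projection is what makes the test-set evaluation legitimate; refitting PCA on the test set would leak information.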
| Model | CV_MSE | CV_RMSE | CV_R2 | Test_MSE | Test_RMSE | Test_R2 |
|---|---|---|---|---|---|---|
| OLS | 34.5157 | 5.7642 | 0.5349 | 27.9471 | 5.2865 | 0.4961 |
| Correlation Filtering | 34.5066 | 5.7663 | 0.5338 | 28.4852 | 5.3372 | 0.4864 |
| VIF | 34.5957 | 5.7724 | 0.5329 | 28.4570 | 5.3345 | 0.4869 |
| Stepwise Selection | 34.2779 | 5.7463 | 0.5378 | 27.8327 | 5.2757 | 0.4981 |
| PCA | 35.0810 | 5.8052 | 0.5293 | 44.5091 | 6.6715 | 0.1974 |
## Summary of Y (Training):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 16.00 20.00 21.26 25.00 95.00
##
## Range of Y (Training): 6 to 95
##
## Standard Deviation of Y (Training): 8.6276
## Model Test Metrics:
## Test MSE: 27.8327
## Test RMSE: 5.2757
## Test R-squared: 0.4981
## RMSE as a percentage of the response range: 5.93 %
## RMSE as a percentage of the response standard deviation: 61.15 %
Among the modeling approaches evaluated, only stepwise selection improved on the baseline OLS results, while correlation filtering and the VIF-based approach performed slightly worse. The PCA-based model performed worst by a wide margin. These findings suggest that simply deleting correlated features or reducing dimensionality with PCA does not guarantee better performance; it may instead discard predictive information the model needs to generalize.
Based on the OLS results, the Test MSE (27.8327) and Test RMSE (5.2757) are low relative to the response range (about 6% of it). However, since the RMSE is about 61% of the response standard deviation, the prediction error remains considerable relative to the natural variability of the response. An R² of 0.4981 means the model explains about half of the variance, which is moderate: acceptable in many practical situations, but leaving room for improvement.
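The two percentages quoted above follow directly from the reported RMSE and the training-set range and SD of Y:

```r
# Reported OLS test RMSE relative to the training response spread
rmse    <- 5.2757
y_range <- 95 - 6     # max - min of Y (training)
y_sd    <- 8.6276     # SD of Y (training)

round(100 * rmse / y_range, 2)   # 5.93  (% of range)
round(100 * rmse / y_sd, 2)      # 61.15 (% of SD)
```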
##
## === Model Fitting Summary ===
## GLMNET Cross-Validation Results:
## Best lambda (lambda.min): 0.002154435
## Number of lambda values tested: 10
## Fold count:
## Minimum MSE achieved: 34.5082
## Minimum RMSE achieved: 5.8744
## Maximum R-squared achieved: 0.5364
##
## Cross-Validation Performance:
## Lambda MSE RMSE R2 SD
## 1.00000 40.92 6.397 0.4502 9.025
## 0.46416 36.45 6.037 0.5103 8.101
## 0.21544 34.94 5.911 0.5306 7.602
## 0.10000 34.60 5.882 0.5352 7.403
## 0.04642 34.57 5.879 0.5356 7.329
## 0.02154 34.62 5.884 0.5349 7.319
## 0.01000 34.57 5.880 0.5355 7.271
## 0.00464 34.51 5.875 0.5363 7.251
## 0.00215 34.51 5.874 0.5364 7.242
## 0.00100 34.51 5.875 0.5364 7.238
##
## === Test Set Evaluation ===
## MSE: 27.9424
## RMSE: 5.2861
## R-squared: 0.4961
##
## === Non-Zero Coefficients ===
## Feature Coefficient
## (Intercept) 21.26181818
## NO 3.67858361
## NO2 1.37349145
## SO2 1.66235517
## T.min 3.19751770
## T.max 5.23615144
## T.moy -6.05471168
## DV.maxvv -0.40658060
## DV.dom -0.54989051
## VV.max -0.39985701
## VV.moy -0.08915399
## PL.som -0.56635311
## HR.min 1.24527992
## HR.max -0.65177756
## HR.moy -1.36718710
## PA.moy 0.74016567
## GTrouen 0.18347312
## GTlehavre 1.07499920
##
## Non-zero coefficients (including intercept): 18
## Non-zero coefficients (excluding intercept): 17
##
## === Model Fitting Summary ===
## GLMNET Cross-Validation Results:
## Best lambda (lambda.min): 0.4641589
## Number of lambda values tested: 10
## Fold count:
## Minimum MSE achieved: 34.4102
## Minimum RMSE achieved: 5.866
## Maximum R-squared achieved: 0.5377
##
## Cross-Validation Performance:
## Lambda MSE RMSE R2 SD
## 1.00000 34.43 5.868 0.5375 7.609
## 0.46416 34.41 5.866 0.5377 7.430
## 0.21544 34.47 5.871 0.5369 7.351
## 0.10000 34.51 5.875 0.5364 7.319
## 0.04642 34.51 5.875 0.5364 7.302
## 0.02154 34.49 5.873 0.5366 7.284
## 0.01000 34.49 5.873 0.5366 7.267
## 0.00464 34.50 5.874 0.5365 7.254
## 0.00215 34.51 5.874 0.5364 7.247
## 0.00100 34.51 5.875 0.5364 7.242
##
## === Test Set Evaluation ===
## MSE: 28.1332
## RMSE: 5.3041
## R-squared: 0.4927
##
## === Non-Zero Coefficients ===
## Feature Coefficient
## (Intercept) 21.2618182
## NO 3.4000709
## NO2 1.4526096
## SO2 1.6424410
## T.min 0.9640487
## T.max 0.9200983
## T.moy 0.1427990
## DV.maxvv -0.4251865
## DV.dom -0.5718161
## VV.max -0.3545431
## VV.moy -0.2002309
## PL.som -0.5377504
## HR.min 0.2863261
## HR.max -0.8212215
## HR.moy -0.3682087
## PA.moy 0.7398794
## GTrouen 0.3668272
## GTlehavre 1.0659635
##
## Non-zero coefficients (including intercept): 18
## Non-zero coefficients (excluding intercept): 17
##
## === Model Fitting Summary ===
## Best alpha: 0.5
## Best lambda (lambda.min): 0.001
## Number of lambda values tested: 10
## Fold count:
## Minimum MSE achieved: 33.753
## Minimum RMSE achieved: 5.8097
## Maximum R-squared achieved: 0.5466
##
## Cross-Validation Performance:
## Alpha Lambda MSE RMSE R2 SD
## 0.2 1.000000 35.08 5.923 0.5288 7.912
## 0.2 0.464159 34.56 5.879 0.5357 7.542
## 0.2 0.215443 34.50 5.873 0.5366 7.381
## 0.2 0.100000 34.56 5.879 0.5357 7.334
## 0.2 0.046416 34.59 5.881 0.5353 7.304
## 0.2 0.021544 34.52 5.875 0.5363 7.284
## 0.2 0.010000 34.50 5.873 0.5365 7.266
## 0.2 0.004642 34.50 5.874 0.5365 7.254
## 0.2 0.002154 34.51 5.874 0.5364 7.247
## 0.2 0.001000 34.51 5.875 0.5364 7.243
## 0.5 1.000000 36.46 6.038 0.5102 2.634
## 0.5 0.464159 34.48 5.872 0.5368 2.460
## 0.5 0.215443 34.00 5.831 0.5433 2.386
## 0.5 0.100000 33.92 5.824 0.5443 2.351
## 0.5 0.046416 33.95 5.827 0.5439 2.336
## 0.5 0.021544 33.92 5.824 0.5443 2.301
## 0.5 0.010000 33.81 5.814 0.5458 2.288
## 0.5 0.004642 33.77 5.811 0.5464 2.286
## 0.5 0.002154 33.76 5.810 0.5465 2.285
## 0.5 0.001000 33.75 5.810 0.5466 2.284
## 0.8 1.000000 39.24 6.264 0.4729 2.770
## 0.8 0.464159 35.35 5.946 0.5251 1.953
## 0.8 0.215443 34.49 5.873 0.5366 1.620
## 0.8 0.100000 34.36 5.861 0.5385 1.463
## 0.8 0.046416 34.39 5.864 0.5380 1.379
## 0.8 0.021544 34.45 5.870 0.5371 1.342
## 0.8 0.010000 34.37 5.862 0.5383 1.290
## 0.8 0.004642 34.32 5.858 0.5390 1.281
## 0.8 0.002154 34.31 5.857 0.5391 1.280
## 0.8 0.001000 34.31 5.857 0.5391 1.279
##
## === Test Set Evaluation ===
## MSE: 27.9483
## RMSE: 5.2866
## R-squared: 0.496
##
## === Non-Zero Coefficients ===
## Feature Coefficient
## (Intercept) 21.26181818
## NO 3.67502199
## NO2 1.37125506
## SO2 1.66204976
## T.min 3.40397669
## T.max 5.49549628
## T.moy -6.51606874
## DV.maxvv -0.40661471
## DV.dom -0.54608713
## VV.max -0.39907425
## VV.moy -0.09031654
## PL.som -0.56831050
## HR.min 1.32020235
## HR.max -0.62598822
## HR.moy -1.46331245
## PA.moy 0.74087827
## GTrouen 0.19159128
## GTlehavre 1.07341640
##
## Non-zero coefficients (including intercept): 18
## Non-zero coefficients (excluding intercept): 17
| Model | Best_Lambda | Best_Alpha | CV_MSE | CV_RMSE | CV_R2 | Test_MSE | Test_RMSE | Test_R2 | NonZero_Coefficients |
|---|---|---|---|---|---|---|---|---|---|
| Lasso | 0.0022 | NA | 34.5082 | 5.8744 | 0.5364 | 27.9424 | 5.2861 | 0.4961 | 17 |
| Ridge | 0.4642 | NA | 34.4102 | 5.8660 | 0.5377 | 28.1332 | 5.3041 | 0.4927 | 17 |
| Elastic Net | 0.0010 | 0.5 | 33.7530 | 5.8097 | 0.5466 | 27.9483 | 5.2866 | 0.4960 | 17 |
All three models (lasso, ridge, and elastic net) show similar cross-validation and test-set performance. Notably, neither lasso nor elastic net reduced the number of features: every model retains 17 non-zero predictors (excluding the intercept). At the chosen regularization parameters, no coefficient was forced to zero, suggesting that each predictor contributes meaningfully to the model’s performance.
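The shrinkage mechanism behind these models is easiest to see from ridge's closed form, beta(lambda) = (X'X + lambda I)^(-1) X'y (lasso has no closed form, and glmnet solves both by coordinate descent). This is a standalone sketch on synthetic data, not the report's fit:

```r
# Ridge closed form: larger lambda shrinks the coefficient norm toward zero
set.seed(7)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))           # standardised predictors
y <- drop(X %*% c(3, 1.5, 0, 0, 2)) + rnorm(n)
y <- y - mean(y)                                 # centred: no intercept needed

ridge <- function(X, y, lambda)
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))

b_ols   <- ridge(X, y, 0)    # lambda = 0 recovers OLS
b_ridge <- ridge(X, y, 10)   # lambda = 10 shrinks every coefficient
c(ols = sum(b_ols^2), ridge = sum(b_ridge^2))
```

Ridge shrinks coefficients but never sets them exactly to zero, which is consistent with the table above showing all 17 predictors surviving.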
##
## === Cross-Validation Results ===
## Mean Cross-Validation MSE: 54.0182
## Mean Cross-Validation RMSE: 7.2328
## Mean Cross-Validation R-squared: 0.2399
##
## === RPART Model Summary ===
##
## Regression tree:
## rpart(formula = Y ~ ., data = train_df, method = "anova", control = rpart.control(xval = 5,
## cp = 1e-04, minsplit = 5, minbucket = 10, maxdepth = 8))
##
## Variables actually used in tree construction:
## [1] DV.dom DV.maxvv GTlehavre GTrouen HR.max HR.min HR.moy
## [8] NO NO2 PA.moy PL.som SO2 T.max T.min
## [15] T.moy VV.max VV.moy
##
## Root node error: 61335/825 = 74.346
##
## n= 825
##
## CP nsplit rel error xerror xstd
## 1 0.25517987 0 1.00000 1.00223 0.144371
## 2 0.10154940 1 0.74482 0.82093 0.083245
## 3 0.04671272 2 0.64327 0.73958 0.083976
## 4 0.03412636 3 0.59656 0.76509 0.098611
## 5 0.03279346 4 0.56243 0.76077 0.098600
## 6 0.01844332 5 0.52964 0.73297 0.097577
## 7 0.01236063 6 0.51119 0.71509 0.097459
## 8 0.01061341 7 0.49883 0.71177 0.097428
## 9 0.01019291 8 0.48822 0.71168 0.097624
## 10 0.01001096 9 0.47803 0.71814 0.097851
## 11 0.00953867 10 0.46802 0.71685 0.097854
## 12 0.00737893 12 0.44894 0.71968 0.097695
## 13 0.00686223 13 0.44156 0.71992 0.097800
## 14 0.00656108 14 0.43470 0.71862 0.097659
## 15 0.00453895 15 0.42814 0.72163 0.097769
## 16 0.00416810 16 0.42360 0.72182 0.097697
## 17 0.00416002 17 0.41943 0.72110 0.097690
## 18 0.00397222 21 0.40270 0.72110 0.097690
## 19 0.00392595 22 0.39873 0.71965 0.097566
## 20 0.00363165 23 0.39481 0.72156 0.097943
## 21 0.00355867 24 0.39117 0.71079 0.093855
## 22 0.00292707 25 0.38762 0.71455 0.093875
## 23 0.00246573 27 0.38176 0.71714 0.093847
## 24 0.00239588 28 0.37930 0.71625 0.093830
## 25 0.00236303 29 0.37690 0.71849 0.093738
## 26 0.00233135 30 0.37454 0.71849 0.093738
## 27 0.00216728 31 0.37220 0.71760 0.093739
## 28 0.00214497 32 0.37004 0.71718 0.093652
## 29 0.00210625 33 0.36789 0.71461 0.093148
## 30 0.00209268 34 0.36579 0.71461 0.093148
## 31 0.00207391 35 0.36369 0.71461 0.093148
## 32 0.00187195 36 0.36162 0.71702 0.093181
## 33 0.00167796 37 0.35975 0.71423 0.093126
## 34 0.00135113 38 0.35807 0.71018 0.093115
## 35 0.00111350 39 0.35672 0.70978 0.093054
## 36 0.00064252 40 0.35561 0.71031 0.093051
## 37 0.00059767 41 0.35496 0.71217 0.093064
## 38 0.00057838 42 0.35437 0.71217 0.093064
## 39 0.00043124 43 0.35379 0.71257 0.093062
## 40 0.00010000 44 0.35336 0.71180 0.093061

##
## Optimal Complexity Parameter (CP): 0.001113502
## Final Tree Size: 79 nodes
##
## === Test Set Evaluation ===
## MSE: 39.0057
## RMSE: 6.2455
## R-squared: 0.2967
##
## === Variable Importance ===
## NO NO2 GTrouen SO2 GTlehavre T.max T.moy PL.som
## 18440.100 9201.996 6530.975 4794.136 4708.561 4554.848 4142.218 3398.593
## VV.moy HR.min HR.moy PA.moy T.min DV.maxvv DV.dom VV.max
## 3125.636 3004.276 2886.841 2252.807 2200.831 1567.724 1298.321 1221.588
## HR.max
## 190.127
##
## === Decision Tree Structure ===

##
## === Cross-Validation Results ===
## Mean Cross-Validation MSE: 54.0182
## Mean Cross-Validation RMSE: 7.2328
## Mean Cross-Validation R-squared: 0.2399
##
## === RPART Model Summary ===
##
## Regression tree:
## rpart(formula = Y ~ ., data = train_df, method = "anova", control = rpart.control(xval = 5,
## cp = 1e-04, minsplit = 5, minbucket = 10, maxdepth = 8))
##
## Variables actually used in tree construction:
## [1] DV.dom DV.maxvv GTlehavre GTrouen HR.max HR.min HR.moy
## [8] NO NO2 PA.moy PL.som SO2 T.max T.min
## [15] T.moy VV.max VV.moy
##
## Root node error: 61335/825 = 74.346
##
## n= 825
##
## CP nsplit rel error xerror xstd
## 1 0.25517987 0 1.00000 1.00223 0.144371
## 2 0.10154940 1 0.74482 0.82093 0.083245
## 3 0.04671272 2 0.64327 0.73958 0.083976
## 4 0.03412636 3 0.59656 0.76509 0.098611
## 5 0.03279346 4 0.56243 0.76077 0.098600
## 6 0.01844332 5 0.52964 0.73297 0.097577
## 7 0.01236063 6 0.51119 0.71509 0.097459
## 8 0.01061341 7 0.49883 0.71177 0.097428
## 9 0.01019291 8 0.48822 0.71168 0.097624
## 10 0.01001096 9 0.47803 0.71814 0.097851
## 11 0.00953867 10 0.46802 0.71685 0.097854
## 12 0.00737893 12 0.44894 0.71968 0.097695
## 13 0.00686223 13 0.44156 0.71992 0.097800
## 14 0.00656108 14 0.43470 0.71862 0.097659
## 15 0.00453895 15 0.42814 0.72163 0.097769
## 16 0.00416810 16 0.42360 0.72182 0.097697
## 17 0.00416002 17 0.41943 0.72110 0.097690
## 18 0.00397222 21 0.40270 0.72110 0.097690
## 19 0.00392595 22 0.39873 0.71965 0.097566
## 20 0.00363165 23 0.39481 0.72156 0.097943
## 21 0.00355867 24 0.39117 0.71079 0.093855
## 22 0.00292707 25 0.38762 0.71455 0.093875
## 23 0.00246573 27 0.38176 0.71714 0.093847
## 24 0.00239588 28 0.37930 0.71625 0.093830
## 25 0.00236303 29 0.37690 0.71849 0.093738
## 26 0.00233135 30 0.37454 0.71849 0.093738
## 27 0.00216728 31 0.37220 0.71760 0.093739
## 28 0.00214497 32 0.37004 0.71718 0.093652
## 29 0.00210625 33 0.36789 0.71461 0.093148
## 30 0.00209268 34 0.36579 0.71461 0.093148
## 31 0.00207391 35 0.36369 0.71461 0.093148
## 32 0.00187195 36 0.36162 0.71702 0.093181
## 33 0.00167796 37 0.35975 0.71423 0.093126
## 34 0.00135113 38 0.35807 0.71018 0.093115
## 35 0.00111350 39 0.35672 0.70978 0.093054
## 36 0.00064252 40 0.35561 0.71031 0.093051
## 37 0.00059767 41 0.35496 0.71217 0.093064
## 38 0.00057838 42 0.35437 0.71217 0.093064
## 39 0.00043124 43 0.35379 0.71257 0.093062
## 40 0.00010000 44 0.35336 0.71180 0.093061

##
## One Standard Error Optimal Complexity Parameter (CP): 0.04671272
## Final Tree Size: 5 nodes
##
## === Test Set Evaluation ===
## MSE: 45.6268
## RMSE: 6.7548
## R-squared: 0.1773
##
## === Variable Importance ===
## NO GTrouen NO2 GTlehavre T.max HR.min SO2
## 15651.5715 6228.5780 3912.8929 2890.9012 1760.8216 1208.9223 1117.9694
## VV.moy T.moy HR.moy
## 1117.9694 1103.7986 893.5513
##
## === Decision Tree Structure ===

##
## === Cross-Validation Results (Regression) ===
## Mean CV MSE: 32.168
## Mean CV RMSE: 5.5379
## Mean CV R-squared: 0.5766
##
## === Test Set Evaluation (Regression) ===
## Test MSE: 26.388
## Test RMSE: 5.1369
## Test R-squared: 0.5242
##
## === Variable Importance ===

##
## === Cross-Validation Results (Regression) ===
## Mean MSE: 32.168
## Mean RMSE: 5.5379
## Mean R-squared: 0.5766
## Thresholding step
## Estimated computational time (on one core): 29.4 sec.
##
## Interpretation step (on 17 variables)
## Estimated computational time (on one core): between 5.1 sec. and 35.7 sec.
##
## Prediction step (on 11 variables)
## Maximum estimated computational time (on one core): 18.7 sec.
##
## === Test Set Evaluation (Regression) ===
## Test MSE: 28.9801
## Test RMSE: 5.3833
## Test R-squared: 0.4774
##
## === Variable Importance ===
## 4.5
Conclusion
| Model | CV_MSE | CV_RMSE | CV_R2 | Test_MSE | Test_RMSE | Test_R2 |
|---|---|---|---|---|---|---|
| Basic rpart | 54.0182 | 7.2328 | 0.2399 | 39.0057 | 6.2455 | 0.2967 |
| Fine-tuned rpart | 54.0182 | 7.2328 | 0.2399 | 45.6268 | 6.7548 | 0.1773 |
| Basic random forest | 32.1680 | 5.5379 | 0.5766 | 26.3880 | 5.1369 | 0.5242 |
| Random forest with VSURF | 32.1680 | 5.5379 | 0.5766 | 27.4477 | 5.2391 | 0.5051 |
The performance comparison shows that the random forest models outperform the decision tree models. Both versions of the random forest (basic and VSURF-enhanced) have lower cross-validation and test MSE as well as higher R-squared values compared to the rpart models. Notably, the basic random forest achieves the best performance on the test set with a Test MSE of 26.388 and a Test R² of 0.5242, indicating that ensemble methods capture the underlying patterns more effectively.
On the other hand, while the fine-tuned rpart model (with only 5 nodes) offers a simpler and more interpretable tree structure, its test performance (Test R² of 0.1773) is considerably poorer than that of the more complex basic rpart model (Test R² of 0.2967). This suggests that in this case, the additional complexity in the basic rpart tree is beneficial for prediction.
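The two pruning rules used above (the CP with minimum cross-validated error, and the one-standard-error rule) can be sketched from rpart's CP table. The built-in `cars` dataset stands in for the report's `train_df`:

```r
library(rpart)

# Grow a deliberately deep tree, then prune it using the CP table
set.seed(8)
fit <- rpart(dist ~ speed, data = cars, method = "anova",
             control = rpart.control(xval = 5, cp = 1e-4, minsplit = 5))
cp_tab <- fit$cptable

# Rule 1: CP with the minimum cross-validated error
best <- which.min(cp_tab[, "xerror"])
pruned_min <- prune(fit, cp = cp_tab[best, "CP"])

# Rule 2 (one-SE): simplest tree whose xerror is within one SE of the minimum
thr <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]
pruned_1se <- prune(fit, cp = cp_tab[min(which(cp_tab[, "xerror"] <= thr)), "CP"])
```

The one-SE tree is never larger than the minimum-error tree, which is exactly the 79-node vs 5-node contrast seen above.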
## Model CV MSE CV RMSE CV R² Test MSE Test RMSE Test R²
## 1 Stepwise Selection 34.2779 5.7463 0.5378 27.8327 5.2757 0.4981
## 2 Lasso 34.5082 5.8744 0.5364 27.9424 5.2861 0.4961
## 3 Basic random forest 32.1680 5.5379 0.5766 26.3880 5.1369 0.5242
While the improvements of Stepwise Selection and Lasso over the basic OLS model are minimal, the Basic random forest performs best among the compared models. Even this improvement is moderate: the random forest yields better predictive accuracy, but the gains are not dramatic. Notably, reducing the number of features is not very effective for this dataset: even within the random forest models, applying a variable selection method such as VSURF performed worse than the Basic random forest. This reinforces the idea that, with a relatively small set of predictors, using all available features can be more effective than reducing the feature set.
Task statement. Implement a function that computes cross-validation error to compare a linear model fitted with ordinary least squares and a CART model.
Below is a sample function that computes the cross-validation error for both a linear model fitted by ordinary least squares and a CART model, so the two can be compared on identical folds.
cv_compare <- function(formula, data, k = 10, seed = NULL,
                       loss = function(y, pred) mean((y - pred)^2)) {
  # Check that the rpart package is installed
  if (!requireNamespace("rpart", quietly = TRUE)) {
    stop("Package 'rpart' is required for CART. Please install it.")
  }
  # Set seed for reproducibility if provided
  if (!is.null(seed)) {
    set.seed(seed)
  }
  # Number of observations
  n <- nrow(data)
  if (k > n) {
    stop("Number of folds 'k' cannot exceed the number of observations.")
  }
  # Create balanced folds (sizes differ by at most one)
  folds <- sample(rep(1:k, length.out = n))
  # Initialize vectors to store per-fold errors
  ols_errors <- numeric(k)
  cart_errors <- numeric(k)
  # Perform k-fold cross-validation
  for (i in 1:k) {
    # Split data into training and test sets
    test_indices <- which(folds == i)
    train_data <- data[-test_indices, ]
    test_data <- data[test_indices, ]
    # Fit OLS model and compute its test-fold error
    lm_model <- lm(formula, data = train_data)
    pred_lm <- predict(lm_model, newdata = test_data)
    y_test <- model.response(model.frame(formula, test_data))
    ols_errors[i] <- loss(y_test, pred_lm)
    # Fit CART model and compute its test-fold error
    cart_model <- rpart::rpart(formula, data = train_data, method = "anova")
    pred_cart <- predict(cart_model, newdata = test_data)
    cart_errors[i] <- loss(y_test, pred_cart)
  }
  # Return mean CV errors and the per-fold breakdown
  list(
    ols_cv_error = mean(ols_errors),
    cart_cv_error = mean(cart_errors),
    ols_errors_per_fold = ols_errors,
    cart_errors_per_fold = cart_errors
  )
}
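One detail worth noting in the fold construction: `sample(rep(1:k, length.out = n))` yields folds whose sizes differ by at most one, even when k does not divide n. A quick check:

```r
# n = 23 rows into k = 5 folds: sizes are as equal as possible
set.seed(42)
n <- 23; k <- 5
folds <- sample(rep(1:k, length.out = n))
table(folds)   # three folds of 5 and two of 4
```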