Model Optimization and Nonlinear Models#

CSC/DSC 340 Week 5 Slides

Author: Dr. Julie Butler

Date Created: August 19, 2023

Last Modified: September 18, 2023

Plan for the Week#

Monday

Week 5 Pre-Class Homework and Week 4 In-Class Assignment Due

Lecture: Model Optimization and Nonlinear Models

Office Hours: 1pm - 3pm

Tuesday

Office Hours: 4pm - 6pm

Wednesday

Finish Lecture: Model Optimization and Nonlinear Models

Start Week 5 In-Class Assignment

Week 4 Post-Class Homework Due

Thursday

Office Hours: 12:30 - 2pm

Friday

In-Class Assignment Week 5

Problem Analysis Due Before Class (~1 page)

Part 1: Hyperparameter Tuning#

Why do we need hyperparameter tuning?#

  • The values of the hyperparameters change the output of the model

  • Bad hyperparameters lead to bad results

##############################
##         IMPORTS          ##
##############################
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.datasets import fetch_california_housing
  • New data set: California Housing

    • Goal: Predict the price a house will sell for (house price/100k) given information about the house

    • In-Class Week 5

# Print the features
fetch_california_housing().feature_names
['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']
##############################
##        IMPORT DATA       ##
##############################
X, y = fetch_california_housing(return_X_y = True)
##############################
##        SCALE DATA        ##
##############################
scaler = StandardScaler()
scaler.fit(X)
Z = scaler.transform(X)
##############################
##     TRAIN-TEST SPLIT     ##
##############################
X_train, X_test, y_train, y_test = train_test_split(Z,y,test_size=0.2)
##############################
##        GOOD ALPHA        ##
##############################
ridge = Ridge(alpha=1e-15)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

err = MSE(y_pred, y_test)
print("MSE:", err)

X_test_plot = scaler.inverse_transform(X_test)

plt.scatter(X_test_plot[:,1],y_test,label='True')
plt.scatter(X_test_plot[:,1],y_pred,label='Predicted')
plt.legend()
MSE: 0.5207853811806975
[Figure: true vs. predicted house prices plotted against house age]
##############################
##        BAD ALPHA         ##
##############################
ridge = Ridge(alpha=1e15)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

err = MSE(y_pred, y_test)
print("MSE:", err)

X_test_plot = scaler.inverse_transform(X_test)

plt.scatter(X_test_plot[:,1],y_test,label='True')
plt.scatter(X_test_plot[:,1],y_pred,label='Predicted')
MSE: 1.3166823615106975
[Figure: true vs. predicted house prices plotted against house age]

Methods for Hyperparameter Tuning#

Using Default Values#

  • Scikit-Learn sets the default value of \(\alpha\) for ridge regression to 1.0, which is a reasonably high level of regularization
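
As a quick check (a minimal sketch): every Scikit-Learn estimator reports the default hyperparameter values it will use through get_params:

# Inspect the defaults an untuned Ridge model will use
print(Ridge().get_params())
# the printed dictionary includes 'alpha': 1.0 among the other settings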

# Make the data set smaller (the full set has 20k+ points)
# More points = more data to generate patterns BUT more run time
X = X[:1000]
y = y[:1000]

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
%%time
ridge = Ridge()
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
CPU times: user 2.43 ms, sys: 5.11 ms, total: 7.54 ms
Wall time: 1.29 ms
err = MSE(y_pred, y_test)
print("MSE:", err)

X_test_plot = scaler.inverse_transform(X_test)

plt.scatter(X_test_plot[:,1],y_test,label='True')
plt.scatter(X_test_plot[:,1],y_pred,label='Predicted')
MSE: 0.3596000507155571
[Figure: true vs. predicted house prices plotted against house age]

Pros#

  • Fast and no need to modify the algorithm

Cons#

  • Default value may not be the best value, but no tests are done to check

For Loop Tuning#

  • Use a for loop to check many different values for the hyperparameters

  • Can use nested for loops if there is more than one hyperparameter (see the sketch below)
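
A minimal sketch of the nested-loop idea for a model with two hyperparameters; ElasticNet and its alpha and l1_ratio are used here purely as an illustration (they appear nowhere else in these notes):

from sklearn.linear_model import ElasticNet

best_err = np.inf
best_params = None
for alpha in np.logspace(-4, 2, 20):
    for l1_ratio in np.linspace(0.05, 1.0, 10):
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
        model.fit(X_train, y_train)
        err = MSE(y_test, model.predict(X_test))
        # Keep the best (alpha, l1_ratio) pair seen so far
        if err < best_err:
            best_err = err
            best_params = (alpha, l1_ratio)
print(best_params, best_err)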

%%time
best_err = 1e4
best_alpha = None
for alpha in np.logspace(-15,4,1000):
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    y_pred = ridge.predict(X_test)
    
    err = MSE(y_pred, y_test)
    
    if err < best_err:
        best_err = err
        best_alpha = alpha
CPU times: user 630 ms, sys: 5.05 ms, total: 635 ms
Wall time: 297 ms
ridge = Ridge(alpha=best_alpha)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

err = MSE(y_pred, y_test)
print("MSE:", err)
print("CHOSEN ALPHA:", best_alpha)

X_test_plot = scaler.inverse_transform(X_test)

plt.scatter(X_test_plot[:,1],y_test,label='True')
plt.scatter(X_test_plot[:,1],y_pred,label='Predicted')
MSE: 0.3552471269777846
CHOSEN ALPHA: 6.022541201461928e-15
[Figure: true vs. predicted house prices plotted against house age]

Pros#

  • Checks more than one value to find the best value

  • Simple concept

  • Short(ish) run times

Cons#

  • Long piece of code

  • Not checking all possible values of the hyperparameters

GridSearchCV (Scikit-Learn)#

  • Scikit-Learn has several hyperparameter tuning implementations

  • Grid search is a brute force algorithm which checks as many values as given

    • If more than one hyperparameter is tuned, it checks every possible combination (see the sketch after this list)

  • Performs the “same” test as the for loop but returns much more information; it also takes longer
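
A sketch of the combinatorial point above: with two hyperparameters, the grid is the product of the two lists, so the number of fits grows multiplicatively. The solver values are real Ridge options, chosen here only for illustration:

from sklearn.model_selection import GridSearchCV

# 4 alphas x 3 solvers = 12 candidate combinations, each fit
# once per cross-validation fold (5 folds by default)
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0],
              'solver': ['svd', 'cholesky', 'lsqr']}
grid = GridSearchCV(Ridge(), param_grid, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
print(grid.best_params_)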

%%time
from sklearn.model_selection import GridSearchCV

parameters = {'alpha':np.logspace(-15,4,5000)}

ridge = Ridge()

grid_search = GridSearchCV(ridge, parameters,\
                           scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
{'alpha': 0.20590302760544998}
CPU times: user 13.2 s, sys: 212 ms, total: 13.4 s
Wall time: 13.1 s
ridge = Ridge(alpha=grid_search.best_params_['alpha'])
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

err = MSE(y_pred, y_test)
print("MSE:", err)
print('CHOSEN ALPHA:', grid_search.best_params_['alpha'])

X_test_plot = scaler.inverse_transform(X_test)

plt.scatter(X_test_plot[:,1],y_test,label='True')
plt.scatter(X_test_plot[:,1],y_pred,label='Predicted')
MSE: 0.3562628152996591
CHOSEN ALPHA: 0.20590302760544998
[Figure: true vs. predicted house prices plotted against house age]

Pros#

  • Only takes a few lines to implement

  • Gives a lot of data once it is fit

Cons#

  • Long run times

  • Only searches the range of parameters given

RandomizedSearchCV (Scikit-Learn)#

  • Tries n_iter randomly chosen values for the hyperparameters, drawn from a given distribution (uniform in this case)

%%time
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

distributions = {'alpha':uniform(loc=0, scale=4)}

ridge = Ridge()

random_search = RandomizedSearchCV(ridge, distributions,\
                                   scoring='neg_mean_squared_error', n_iter=5000)
random_search.fit(X_train, y_train)

print(random_search.best_params_, random_search.best_score_)
{'alpha': 0.20597888239758078} -0.28856463563603973
CPU times: user 13.3 s, sys: 230 ms, total: 13.5 s
Wall time: 13.2 s
ridge = Ridge(alpha=random_search.best_params_['alpha'])
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

err = MSE(y_pred, y_test)
print("MSE:", err)
print('CHOSEN ALPHA:', random_search.best_params_['alpha'])

X_test_plot = scaler.inverse_transform(X_test)

plt.scatter(X_test_plot[:,1],y_test,label='True')
plt.scatter(X_test_plot[:,1],y_pred,label='Predicted')
MSE: 0.35626318516764216
CHOSEN ALPHA: 0.20597888239758078
[Figure: true vs. predicted house prices plotted against house age]

Pros#

  • Only takes a few lines to implement

  • Gives a lot of data once it is fit

Cons#

  • Long run times (depending on value of n_iter)

  • Only searches a finite number of parameter combinations

Bayesian Ridge Regression#

  • Finds the “most likely” value for \(\alpha\) using Bayesian statistics

  • Removes the uncertainty that the best value was simply never sampled

%%time
from sklearn.linear_model import BayesianRidge

bayesian_ridge = BayesianRidge()
bayesian_ridge.fit(X_train, y_train)
y_pred = bayesian_ridge.predict(X_test)

print(bayesian_ridge.alpha_)
3.688440522864384
CPU times: user 5.58 ms, sys: 1.28 ms, total: 6.85 ms
Wall time: 1.68 ms
ridge = BayesianRidge()
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

err = MSE(y_pred, y_test)
print("MSE:", err)
print('CHOSEN ALPHA:', ridge.alpha_)

X_test_plot = scaler.inverse_transform(X_test)

plt.scatter(X_test_plot[:,1],y_test,label='True')
plt.scatter(X_test_plot[:,1],y_pred,label='Predicted')
MSE: 0.3592175812997633
CHOSEN ALPHA: 3.688440522864384
[Figure: true vs. predicted house prices plotted against house age]

Pros#

  • Only takes a few lines to implement

  • Statistical certainty that the given \(\alpha\) value is the best value

  • Short run times

Cons#

  • Only valid for Bayesian ridge regression (linear model)

Comparison of \(\alpha\), Accuracy, and Run Time#

  • Exact values depend on the train-test split and RandomizedSearch results; these values were taken from one run

labels = ['default', 'for loops', 'GridSearchCV', 'RandomizedSearchCV', 'Bayesian']
alphas = [1.0, 1e-15, 769.79, 3.99, 1.90]
mse = [0.23461613368850237, 0.2343672966024232, 0.23546465060931623, 0.23546192678932165, 0.23439940455100014]
run_times = [20.8/1000, 1.35, 57.7, 7*60+40, 1.18] # seconds
plt.bar(np.arange(0,len(labels)), alphas)
plt.yscale('log')
plt.xticks(np.arange(0,len(labels)), labels, rotation=45)
plt.ylabel(r"Best $\alpha$ Value")
plt.xlabel("Hyperparameter Tuning Method")
[Figure: bar chart of the best alpha value per tuning method (log scale)]
plt.bar(np.arange(0,len(labels)), mse)
plt.xticks(np.arange(0,len(labels)), labels, rotation=45)
plt.ylabel(r"MSE")
plt.xlabel("Hyperparameter Tuning Method")
[Figure: bar chart of MSE per tuning method]
plt.bar(np.arange(0,len(labels)), run_times)
plt.xticks(np.arange(0,len(labels)), labels, rotation=45)
plt.ylabel(r"Run Time (sec.)")
plt.xlabel("Hyperparameter Tuning Method")
plt.yscale('log')
[Figure: bar chart of run time per tuning method (log scale)]

Part 2: Feature Engineering#

Feature engineering is the process of eliminating or altering given inputs in order to improve the model’s predictions

  • Alter the inputs with a design matrix (see the sketch after this list)

  • Remove features that are not useful

  • Scale the features or targets
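
A minimal sketch of the first bullet: PolynomialFeatures builds a design matrix from the raw inputs so a linear model can capture curved relationships (this transformer is not used elsewhere in these notes, and the sample values are made up):

from sklearn.preprocessing import PolynomialFeatures

# Two samples with two features each
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

# degree=2 expands [x1, x2] into [1, x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X_demo))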

import pandas as pd
import seaborn as sns

housing = fetch_california_housing()

housing_data = pd.DataFrame(housing.data, columns=housing.feature_names)

housing_data['target'] = housing.target

housing_data = housing_data.sample(1000)

Pairplots can give us initial ideas about the data set and any obvious relations

sns.pairplot(housing_data)
[Figure: Seaborn pairplot of all housing features and the target]
# Remove outliers from the data set and make the pairplot again

from scipy import stats

# Keep only rows where every column is within 3 standard deviations of its mean
housing_data = housing_data[(np.abs(stats.zscore(housing_data)) < 3).all(axis=1)]

sns.pairplot(housing_data)
[Figure: pairplot after removing outliers]

Correlation Matrix#

  • Correlation score = \(\sqrt{R^2\text{ score}}\)

  • Values close to \(\pm1\) mean that the two features are linearly related

correlation_matrix = housing_data.corr()
correlation_matrix
              MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude    target
MedInc      1.000000 -0.176376  0.624278  -0.182821   -0.006504 -0.118286 -0.145364   0.051990  0.671977
HouseAge   -0.176376  1.000000 -0.273970  -0.171407   -0.250648 -0.000109  0.007118  -0.100684  0.077059
AveRooms    0.624278 -0.273970  1.000000   0.292713   -0.108061 -0.049828  0.077900  -0.022613  0.254047
AveBedrms  -0.182821 -0.171407  0.292713   1.000000   -0.080388 -0.102396  0.132675  -0.047917 -0.121542
Population -0.006504 -0.250648 -0.108061  -0.080388    1.000000  0.170063 -0.097118   0.070431  0.019527
AveOccup   -0.118286 -0.000109 -0.049828  -0.102396    0.170063  1.000000 -0.198600   0.214724 -0.324090
Latitude   -0.145364  0.007118  0.077900   0.132675   -0.097118 -0.198600  1.000000  -0.931286 -0.198935
Longitude   0.051990 -0.100684 -0.022613  -0.047917    0.070431  0.214724 -0.931286   1.000000  0.006876
target      0.671977  0.077059  0.254047  -0.121542    0.019527 -0.324090 -0.198935   0.006876  1.000000
plt.matshow(correlation_matrix)
plt.colorbar()
[Figure: heatmap of the correlation matrix with colorbar]
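
Given the matrix above, one way to rank candidate features is by the absolute value of their correlation with the target; a short sketch:

# Sort features by |correlation with the target|, strongest first
print(correlation_matrix['target'].abs().sort_values(ascending=False))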

LASSO Regression for Feature Selection#

# LASSO's L1 penalty shrinks the coefficients of unhelpful features
# to exactly zero, so the nonzero entries of coef_ mark useful features
lasso = Lasso()
lasso.fit(X, y)
ypred = lasso.predict(X)
print(MSE(ypred, y))
lasso.coef_
0.638471842835069
array([ 0.06161184, -0.        ,  0.        , -0.        ,  0.00011673,
       -0.        ,  0.        ,  0.        ])
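
A sketch of turning the coefficients above into an actual feature selector: Scikit-Learn's SelectFromModel keeps only the columns whose weights pass its threshold (effectively the nonzero LASSO coefficients), assuming the lasso model fit above:

from sklearn.feature_selection import SelectFromModel

# Wrap the already-fit model and drop the zero-weight features
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)
print(X.shape, '->', X_selected.shape)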

Accuracy with All Data (Scaled)#

X = housing_data.drop(columns=['target'])
y = housing_data['target']

scaler = StandardScaler()
scaler.fit(X)
Z = scaler.transform(X)

X_train, X_test, y_train, y_test = train_test_split(Z,y,test_size=0.2)

bayesian_ridge = BayesianRidge()
bayesian_ridge.fit(X_train, y_train)
y_pred = bayesian_ridge.predict(X_test)

print('MSE:', MSE(y_pred, y_test))
MSE: 0.44933334607255193

Accuracy with Most Correlated Features (Scaled)#

X_corr = housing_data[['MedInc', 'HouseAge', 'AveRooms']]
y = housing_data['target']

scaler = StandardScaler()
scaler.fit(X_corr)
Z = scaler.transform(X_corr)

X_train, X_test, y_train, y_test = train_test_split(Z,y,test_size=0.2)

bayesian_ridge = BayesianRidge()
bayesian_ridge.fit(X_train, y_train)
y_pred = bayesian_ridge.predict(X_test)

print('MSE:', MSE(y_pred, y_test))
MSE: 0.5527782905639612
  • Automate the dimensionality reduction with unsupervised learning (next week!)

Part 3: Nonlinear Models#

Kernel Ridge Regression (KRR)#

  • All previous models we have studied have been linear, i.e., capable of modeling only linear patterns

    • Design matrices can only add so many functions

  • Many data sets will have a nonlinear pattern and thus we need a nonlinear model

    • Kernel ridge regression (KRR)

    • Support vector machines (SVMs)

    • Neural networks

Kernel Functions#

  • Scikit-Learn's built-in kernels are described in its documentation

    • Linear: \(k(x,y) = x^Ty\)

    • Polynomial: \(k(x,y) = (\gamma x^Ty+c_0)^d\)

    • Sigmoid: \(k(x,y) = \tanh(\gamma x^Ty+c_0)\)

    • Radial Basis Function (RBF): \(k(x,y) = \exp(-\gamma\|x-y\|_2^2)\)

  • Inputs to the KRR algorithm are modified by the kernel function, giving the method nonlinearity \(\longrightarrow\) the kernel methods/trick allows linear methods to solve nonlinear problems (see the sketch below)

    • Kernel ridge regression is just ridge regression with the inputs modified by the kernel function
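
A small sketch of what a kernel function actually computes: a matrix of pairwise similarities. Here the RBF kernel evaluated by hand matches Scikit-Learn's rbf_kernel (the points and gamma are made up for illustration):

from sklearn.metrics.pairwise import rbf_kernel

A = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [2.0, 2.0]])
gamma = 0.5

# k(x, y) = exp(-gamma * ||x - y||^2), computed for every pair of rows
sq_dists = np.sum((A[:, None, :] - A[None, :, :])**2, axis=-1)
manual_K = np.exp(-gamma * sq_dists)

print(np.allclose(manual_K, rbf_kernel(A, gamma=gamma)))  # True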

KRR Equations#

  • Form of predictions: \(\hat{y}(x) = \sum_{i=1}^m \theta_ik(x_i,x)\)

  • Loss function: \(J(\theta) = MSE(y,\hat{y}) + \frac{\alpha}{2}\sum_{i=1}^n\theta_i^2\)

  • Optimized parameters: \(\theta = (\textbf{K}+\alpha\textbf{I})^{-1}y\) (checked numerically in the sketch below)

    • \(\textbf{K}_{ij} = k(x_i, x_j)\)
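
A sketch checking the closed-form solution against Scikit-Learn: KernelRidge stores its optimized parameters as dual_coef_, which should match \(\theta = (\textbf{K}+\alpha\textbf{I})^{-1}y\) computed directly (the data here is random, made up for illustration):

from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.normal(size=50)
alpha, gamma = 0.1, 0.5

# Direct solve of (K + alpha*I) theta = y
K = rbf_kernel(X_demo, gamma=gamma)
theta = np.linalg.solve(K + alpha * np.eye(len(X_demo)), y_demo)

krr = KernelRidge(alpha=alpha, kernel='rbf', gamma=gamma).fit(X_demo, y_demo)
print(np.allclose(theta, krr.dual_coef_))  # True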

Hyperparameter Tuning with Many Hyperparameters#

  • KRR has the same hyperparameter as ridge regression: \(\alpha\)

  • Each kernel function has 0-3 hyperparameters

  • The choice of kernel function is a hyperparameter

  • Hyperparameter tuning becomes more important, and more expensive, as the number of hyperparameters in the model increases (see the sketch below)
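
A back-of-the-envelope sketch of why this matters for KRR (the grid sizes are made up; the search below instead samples only n_iter=50 random candidates):

# Fits required for an exhaustive grid over KRR's hyperparameters
n_alphas, n_kernels, n_gammas, n_degrees, n_coef0s = 50, 5, 50, 10, 50
combos = n_alphas * n_kernels * n_gammas * n_degrees * n_coef0s
print(combos, "combinations =", combos * 5, "fits with 5-fold CV")
# 6,250,000 combinations -> 31,250,000 fits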

Housing Data with Kernel Ridge Regression#

from sklearn.kernel_ridge import KernelRidge

X,y = fetch_california_housing(return_X_y=True)

# Use only the first 500 points so the kernel matrix stays small
X = X[:500]
y = y[:500]

scaler = StandardScaler()
scaler.fit(X)
Z = scaler.transform(X)

X_train, X_test, y_train, y_test = train_test_split(Z,y,test_size=0.2)

krr = KernelRidge()
krr.fit(X_train, y_train)
y_pred = krr.predict(X_test)
print("MSE:", MSE(y_pred, y_test))
MSE: 4.105741823609707
distributions = {'alpha':uniform(loc=0, scale=4), 'kernel':['linear', \
                                                            'polynomial', 'rbf', \
                                                            'sigmoid', 'laplacian'], \
                 'gamma':uniform(loc=0, scale=4),\
                'degree':np.arange(0,10), 'coef0':uniform(loc=0, scale=4)}

krr = KernelRidge()

random_search = RandomizedSearchCV(krr, distributions,\
                                   scoring='neg_mean_squared_error', n_iter=50)
random_search.fit(X_train, y_train)
print(random_search)
print(random_search.best_params_, random_search.best_score_)
[KeyboardInterrupt traceback truncated: random_search.fit was stopped by hand because the search over all five kernels was taking too long to finish.]
X,y = fetch_california_housing(return_X_y=True)

scaler = StandardScaler()
scaler.fit(X)
Z = scaler.transform(X)

X_train, X_test, y_train, y_test = train_test_split(Z,y,test_size=0.2)
# Using the best parameters found from one run of the
# RandomizedSearchCV method above
krr = KernelRidge(alpha= 0.35753693232459094, coef0= 3.2725118241858264, degree= 9, gamma= 0.14696609545731532, kernel= 'laplacian')
krr.fit(X_train, y_train)
y_pred = krr.predict(X_test)
print('MSE:', MSE(y_pred, y_test))
MSE: 0.23499324697787755
bayesian_ridge = BayesianRidge()
bayesian_ridge.fit(X_train, y_train)
y_pred = bayesian_ridge.predict(X_test)
print('MSE:', MSE(y_pred, y_test))
MSE: 0.5242651488478602