# Usually, hyperparameter tuning is combined with cross-validation. Sometimes we want to run cross-validation independently, to see whether a candidate model generalizes well enough on a dataset. To this end, let's use a linear regressor as an example.
#
# Code Listing 9.01. Import all the necessary packages for the cross-validation example. We use the first 150 data points of the diabetes dataset and make a linear regressor with default parameters.
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lr = linear_model.LinearRegression()
# Next, we will use the cross_validate() method to apply cross-validation to the linear regressor.
#
# Code Listing 9.02. Use the cross_validate() method to apply 5-fold cross-validation to the linear regressor, and then print the test scores.
scores = cross_validate(lr, X, y, cv=5, scoring=('r2', 'neg_mean_squared_error'),
                        return_train_score=True)
print("negative mean squared errors: ", scores["test_neg_mean_squared_error"])
print("r2 scores: ", scores["test_r2"])
negative mean squared errors:  [-2547.29219945 -4523.25983124 -2301.49369105 -4378.07848216
 -2409.19372015]
r2 scores:  [0.36324841 0.28239194 0.4211776  0.30071196 0.61240533]
# We use 5-fold cross-validation with r2 and negative mean squared error as the metrics. As we can see from the output, the linear regressor performs differently on each fold. That is why cross-validation helps us observe how the performance varies as the data changes.
#
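# As a quick illustration (a sketch, not one of the numbered listings), we can
# summarize that fold-to-fold variation with the mean and standard deviation of
# the r2 scores returned by cross_validate():
import numpy as np
r2_scores = scores["test_r2"]
print("mean r2: %.3f (+/- %.3f)" % (np.mean(r2_scores), np.std(r2_scores)))
#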
# Now we can try to incorporate hyperparameter tuning and see how it improves performance over a model with default parameters. We know random forest models perform very well with default settings. Can we still make improvements with hyperparameter tuning and cross-validation?
#
# First, we will fetch the California housing dataset for this exercise. As usual, we will randomly take 80% of the data for training.
#
# Code Listing 9.03. Fetch the California housing dataset and split it into training/test sets.
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

california_housing_bunch = fetch_california_housing()
california_housing_X, california_housing_y = california_housing_bunch.data, california_housing_bunch.target
x_train, x_test, y_train, y_test = train_test_split(california_housing_X, california_housing_y, test_size=0.2)
# For the second step, we need to create a basic estimator. If you want to fix some hyperparameters, you can set them at this stage.
#
# Code Listing 9.04. Train two kNN regressors for later comparison, one with k = 10 and one with k = 100.
from sklearn.neighbors import KNeighborsRegressor

# train a kNN regressor with k = 10
knn_10_regr = KNeighborsRegressor(n_neighbors=10)
knn_10_regr.fit(x_train, y_train)

# train a kNN regressor with k = 100 and distance-weighted neighbors
knn_100_regr = KNeighborsRegressor(n_neighbors=100, weights="distance")
knn_100_regr.fit(x_train, y_train)
# Now it is time to create a hyperparameter grid for a random search.
#
# Code Listing 9.05. Create a hyperparameter grid for 3 parameters of RandomForestRegressor: n_estimators, max_depth, and bootstrap.
#
# Number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start=600, stop=2000, num=15)]
# Maximum number of levels in each tree
max_depth = [int(x) for x in np.linspace(10, 80, num=8)]
max_depth.append(None)
# Method of selecting samples for training each tree
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'bootstrap': bootstrap}
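# Before searching, it is worth verifying that every key in the grid really is
# a hyperparameter of the estimator we plan to tune; an unknown key makes the
# search fail with a ValueError suggesting estimator.get_params().keys().
# A minimal sanity check (a sketch, not one of the numbered listings):
from sklearn.ensemble import RandomForestRegressor
valid_params = RandomForestRegressor().get_params().keys()
assert all(k in valid_params for k in random_grid), "grid contains an unknown parameter"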
# numpy's linspace() method creates a list of evenly spaced numbers between predefined start/stop values. Now we can start the random search plus cross-validation. Once we have a basic regressor, we pass it as the value of the estimator parameter of RandomizedSearchCV(), which randomly tries different combinations of the hyperparameters we want to test.
#
# Code Listing 9.06. Use the training data to run the randomized search and cross-validation.
#
# Random search of parameters, using 3-fold cross-validation,
# search across 10 different combinations, and use all available cores.
# The estimator must expose the parameters in random_grid, so we tune a
# RandomForestRegressor here; passing a kNN regressor instead would raise
# "ValueError: Invalid parameter n_estimators for estimator KNeighborsRegressor".
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=10, cv=3, n_jobs=-1)
# Fit the random search model
rf_random.fit(x_train, y_train)
# The parameter n_iter determines how many hyperparameter combinations we want to try. Initially, we should set it to 1 to see how long a single combination takes, and use that to estimate the time cost before setting it to a large number. cv=3 means we will do 3-fold cross-validation. In total we will train/test n_iter * cv times to find the optimal hyperparameter combination. Given the California housing dataset, Code Listing 9.06 takes about 7.5 minutes on a MacBook Pro laptop.
#
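# To estimate the cost of a larger search, we can time a single combination
# first, as suggested above (a sketch, not one of the numbered listings):
import time
start = time.time()
RandomizedSearchCV(estimator=RandomForestRegressor(), param_distributions=random_grid,
                   n_iter=1, cv=3, n_jobs=-1).fit(x_train, y_train)
print("one combination took %.1f seconds" % (time.time() - start))
#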
# Once the search has finished, we can evaluate the best (fine-tuned) estimator we obtained against the basic estimators.
#
# Code Listing 9.07. Evaluate the basic estimators and the fine-tuned estimator.
# Evaluate the models on the test set with the metrics MSE and R2
from sklearn.metrics import mean_squared_error, r2_score

# The linear regressor from Code Listing 9.01 was fitted on the diabetes data,
# so we fit a fresh one on the California housing training set first; calling
# predict() on an unfitted estimator raises NotFittedError.
lr = linear_model.LinearRegression()
lr.fit(x_train, y_train)

print("Performance of linear regressor")
california_housing_y_pred = lr.predict(x_test)
print("Mean squared error: %.2f" % mean_squared_error(y_test, california_housing_y_pred))
print("Coefficient of determination: %.2f" % r2_score(y_test, california_housing_y_pred))
print()

print("Performance of kNN regressor with k = 10")
california_housing_y_pred = knn_10_regr.predict(x_test)
print("Mean squared error: %.2f" % mean_squared_error(y_test, california_housing_y_pred))
print("Coefficient of determination: %.2f" % r2_score(y_test, california_housing_y_pred))
print()

print("Performance of kNN regressor with k = 100")
california_housing_y_pred = knn_100_regr.predict(x_test)
print("Mean squared error: %.2f" % mean_squared_error(y_test, california_housing_y_pred))
print("Coefficient of determination: %.2f" % r2_score(y_test, california_housing_y_pred))
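print()

# Finally, the fine-tuned model. rf_random.best_estimator_ is refitted on the
# whole training set by RandomizedSearchCV (refit=True by default); we also
# print the winning hyperparameter combination. A sketch following the same
# pattern as the listing above, not part of the original listing.
print("Performance of fine-tuned random forest regressor")
print("best hyperparameters:", rf_random.best_params_)
california_housing_y_pred = rf_random.best_estimator_.predict(x_test)
print("Mean squared error: %.2f" % mean_squared_error(y_test, california_housing_y_pred))
print("Coefficient of determination: %.2f" % r2_score(y_test, california_housing_y_pred))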