Model Selection

class datatoolkit.model_selection.BayesianSearchCV(estimator: sklearn.base.BaseEstimator, parameter_space: dict[str, hyperopt.hp], n_iter: int = 10, scoring=typing.Union[collections.abc.Iterable[str], collections.abc.Callable, NoneType], cv=sklearn.model_selection.StratifiedShuffleSplit, refit: str = 'loss', verbose=0, random_state=None, error_score='raise', return_train_score=False)

Bayesian Search Cross Validation.

Parameters
  • estimator (BaseEstimator) – scikit-learn base estimator.

  • (ClassifierMixin) – scikit-learn classifier mixin.

Raises
  • TypeError – When scoring argument is of wrong type.

  • NotFittedError – When estimator is not fitted.

References

[1] https://stackoverflow.com/questions/52408949/cross-validation-and-parameters-tuning-with-xgboost-and-hyperopt

cross_validate(parameter_space: dict, X: collections.abc.Iterable[float], y: collections.abc.Iterable[float]) dict

Fits the estimator on the training set and evaluates it on the validation set for each split produced by the cross-validation generator.

Parameters
  • parameter_space (dict) – Dict containing parameter space.

  • X (Iterable[float]) – Array-like of shape (n_samples, n_features) containing predictors.

  • y (Iterable[float]) – Array-like of shape (n_samples,) containing target label.

Returns

Dict containing cross validation results.

Return type

dict
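
The fit-and-evaluate loop described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `MajorityClassifier`, `k_fold_indices`, `accuracy`, and `cross_validate_sketch` are hypothetical names, and a simple contiguous k-fold split stands in for the cross-validation generator.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical stand-in estimator: predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, valid_indices) for contiguous folds."""
    fold_size = n_samples // n_splits
    for k in range(n_splits):
        start = k * fold_size
        # The last fold absorbs any remainder.
        stop = n_samples if k == n_splits - 1 else start + fold_size
        valid = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, valid


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate_sketch(estimator, X, y, n_splits=3):
    """Fit on each training fold and score on both the training and validation folds."""
    results = {"train_accuracy": [], "test_accuracy": []}
    for train, valid in k_fold_indices(len(X), n_splits):
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_va, y_va = [X[i] for i in valid], [y[i] for i in valid]
        estimator.fit(X_tr, y_tr)
        results["train_accuracy"].append(accuracy(y_tr, estimator.predict(X_tr)))
        results["test_accuracy"].append(accuracy(y_va, estimator.predict(X_va)))
    return results


X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, 1, 1, 0, 1]
cv_results = cross_validate_sketch(MajorityClassifier(), X, y, n_splits=3)
```

The returned dict holds one score per split for each dataset type, which is the kind of structure a later averaging step can summarize.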

fit(X: collections.abc.Iterable[float], y: collections.abc.Iterable[float])

Fits estimator.

Parameters
  • X (Iterable[float]) – Matrix of shape (n_samples, n_features).

  • y (Iterable[float]) – Array-like of shape (n_samples,).

get_dataset_type_score_name_index(split_iterator: Optional[collections.abc.Iterable[int]] = None) collections.abc.Generator[tuple[str, str, int]]

Generates tuples composed of dataset type, score name, and index.

Parameters

split_iterator (Union[Iterable[int], None], optional) – Array-like of shape (n_splits,), with one entry per CV split. Defaults to None.

Yields

Generator[tuple[str, str, int]] – Tuple composed of dataset type, score name, and index.
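
One way such a generator could be written is with `itertools.product`. This is a sketch under stated assumptions: the function name, the `score_names` argument, the dataset-type labels `"train"` and `"test"`, and the fallback to a single index when `split_iterator` is None are all hypothetical and not taken from the library.

```python
from itertools import product


def dataset_type_score_name_index(score_names, split_iterator=None):
    """Yield (dataset_type, score_name, index) tuples.

    Assumption for this sketch: when split_iterator is None, a single
    index of 0 is used.
    """
    indices = [0] if split_iterator is None else list(split_iterator)
    # product varies the last argument fastest: index, then score, then dataset type.
    yield from product(("train", "test"), score_names, indices)


keys = list(dataset_type_score_name_index(["accuracy", "f1"], range(2)))
```

Each tuple can then serve as a key for storing one score per dataset type, metric, and split.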

objective(y_true: collections.abc.Iterable[float], y_pred: collections.abc.Iterable[float], score_name: str) float

Objective function to be minimized.

Parameters
  • y_true (Iterable[float]) – Array-like of shape (n_samples,) containing true values of target label.

  • y_pred (Iterable[float]) – Array-like of shape (n_samples,) containing predicted values of target label.

  • score_name (str) – Name of the performance metric.

Returns

Absolute difference between the score and its optimal value.

Return type

float
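
The idea of turning a score into a quantity to minimize can be illustrated with a small, self-contained sketch. The helper names `accuracy` and `loss_from_score` are hypothetical, and the optimal value of 1.0 for accuracy is supplied by hand here rather than looked up by the library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def loss_from_score(score, optimal_value):
    """Distance between a score and its best possible value; 0 means optimal."""
    return abs(score - optimal_value)


y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

# Accuracy is best at 1.0, so the loss shrinks as accuracy improves.
loss = loss_from_score(accuracy(y_true, y_pred), optimal_value=1.0)
```

Expressing every metric as a distance from its optimum lets a minimizer treat higher-is-better and lower-is-better metrics in the same way.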

optimize(X: collections.abc.Iterable[float], y: collections.abc.Iterable[float]) dict

Runs hyperparameter optimization.

Parameters
  • X (Iterable[float]) – Array-like of shape (n_samples, n_features) containing predictors.

  • y (Iterable[float]) – Array-like of shape (n_samples,) containing target label.

Returns

Optimal parameters found in the parameter space.

Return type

dict

post_process_cv_results()

Process cross validation results by calculating average and standard deviation of scores.
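
As an illustration of this kind of summary, the sketch below averages per-split scores with the standard library. The key names and score values are made up for the example, and the use of the population standard deviation is an assumption, not a statement about the library's behaviour.

```python
from statistics import mean, pstdev

# Hypothetical per-split scores keyed by "<dataset type>_<score name>".
cv_results = {
    "train_accuracy": [0.90, 0.92, 0.94],
    "test_accuracy": [0.80, 0.85, 0.90],
}

summary = {}
for key, scores in cv_results.items():
    summary[f"mean_{key}"] = mean(scores)
    # Population standard deviation (assumed); numpy.std uses the same convention by default.
    summary[f"std_{key}"] = pstdev(scores)
```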

predict(X: collections.abc.Iterable[float]) collections.abc.Iterable[float]

Predicts the class of each observation.

Parameters

X (Iterable[float]) – Array-like of shape (n_samples, n_features) containing predictors.

Returns

Predicted classes.

Return type

Iterable[float]

predict_proba(X: collections.abc.Iterable[float]) collections.abc.Iterable[float]

Predicts the probability of each observation belonging to a class.

Parameters

X (Iterable[float]) – Array-like of shape (n_samples, n_features) containing predictors.

Returns

Class probabilities.

Return type

Iterable[float]

static scorer_class_map(y_pred: np.ndarray[float], score_name: str, threshold: float = 0.5) np.ndarray[float]

Maps score name to class.

Parameters
  • y_pred (np.ndarray[float]) – Array-like of shape (n_samples,).

  • score_name (str) – Name of the performance metric.

  • threshold (float, optional) – Threshold used to transform probability into class. Defaults to 0.5.

Returns

Array-like of shape (n_samples,).

Return type

np.ndarray[float]
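
A thresholding step of this kind might look like the sketch below. It is an assumption-laden illustration: the function name, the set of metrics treated as probability-based, and the choice of `>=` at the threshold are all hypothetical and may differ from the library.

```python
def to_class_if_needed(y_pred, score_name, threshold=0.5):
    """Return class labels for label-based metrics, probabilities otherwise."""
    # Assumed (hypothetical) set of metrics that are computed on probabilities.
    probability_metrics = {"roc_auc", "neg_log_loss"}
    if score_name in probability_metrics:
        return list(y_pred)
    # Assumption: a probability equal to the threshold maps to the positive class.
    return [1.0 if p >= threshold else 0.0 for p in y_pred]


labels = to_class_if_needed([0.2, 0.5, 0.9], "accuracy")
probabilities = to_class_if_needed([0.2, 0.5, 0.9], "roc_auc")
```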

static scorer_optimal_value(score_name: str) float

Maps score name to optimal value.

Parameters

score_name (str) – Name of the performance metric.

Returns

Optimal value.

Return type

float
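
A lookup of this kind can be sketched with a plain dictionary. The metric names and the `ValueError` on unknown names are assumptions made for the example; the values themselves follow from the metric definitions (accuracy, F1, and ROC AUC are best at 1, log loss is best at 0).

```python
# Hypothetical table of metric names and their best achievable values.
OPTIMAL_VALUES = {
    "accuracy": 1.0,
    "f1": 1.0,
    "roc_auc": 1.0,
    "log_loss": 0.0,
}


def optimal_value(score_name):
    """Return the best achievable value for a metric, or raise on unknown names."""
    try:
        return OPTIMAL_VALUES[score_name]
    except KeyError:
        raise ValueError(f"Unknown score name: {score_name!r}") from None
```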

class datatoolkit.model_selection.ClassificationCostFunction(metrics: collections.abc.Iterable[str], M: np.ndarray[float] = None, metric_class_opt_val_map: dict[str, tuple[str, float]] = None, proba_threshold: float = 0.5)

objective(y_true: np.ndarray[float], y_pred: np.ndarray[float]) float

Objective function.

Parameters
  • y_true (np.ndarray[float]) – Array-like of true labels of length N.

  • y_pred (np.ndarray[float]) – Array-like of predicted labels of length N.

class datatoolkit.model_selection.CostFunction(metrics: collections.abc.Iterable[str], M: np.ndarray[float])

Abstract class for cost functions.

abstract objective(y_true: np.ndarray[float], y_pred: np.ndarray[float]) float

Objective function.

Parameters
  • y_true (np.ndarray[float]) – Array-like of true labels of length N.

  • y_pred (np.ndarray[float]) – Array-like of predicted labels of length N.
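
The abstract-class pattern described here can be sketched with Python's `abc` module. The classes below are hypothetical stand-ins, not the library's own: `CostFunctionSketch` omits the `M` argument because its meaning is not documented above, and `ErrorRateCost` is an invented subclass for illustration.

```python
from abc import ABC, abstractmethod


class CostFunctionSketch(ABC):
    """Hypothetical base class: subclasses must implement objective()."""

    def __init__(self, metrics):
        self.metrics = list(metrics)

    @abstractmethod
    def objective(self, y_true, y_pred) -> float:
        """Return a cost to be minimized."""


class ErrorRateCost(CostFunctionSketch):
    """Invented example subclass: cost is the fraction of wrong predictions."""

    def objective(self, y_true, y_pred) -> float:
        mismatches = sum(t != p for t, p in zip(y_true, y_pred))
        return mismatches / len(y_true)


cost = ErrorRateCost(metrics=["error_rate"])
value = cost.objective([1, 0, 1, 1], [1, 1, 1, 0])
```

Because `objective` is marked abstract, instantiating the base class directly raises a `TypeError`, which forces every concrete cost function to define its own objective.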