Model Selection

class datatoolkit.model_selection.BayesianSearchCV(estimator: sklearn.base.BaseEstimator, parameter_space: dict[str, hyperopt.hp], n_iter: int = 10, scoring=typing.Union[collections.abc.Iterable[str], collections.abc.Callable, NoneType], cv=sklearn.model_selection.StratifiedShuffleSplit, refit: str = 'loss', verbose=0, random_state=None, error_score='raise', return_train_score=False)

Bayesian Search Cross Validation.

Parameters

(BaseEstimator) – scikit-learn base estimator.
(ClassifierMixin) – scikit-learn classifier mixin.

Raises

TypeError – When scoring argument is of wrong type.
NotFittedError – When estimator is not fitted.

References

[1] https://stackoverflow.com/questions/52408949/cross-validation-and-parameters-tuning-with-xgboost-and-hyperopt

cross_validate(parameter_space: dict, X: collections.abc.Iterable[float], y: collections.abc.Iterable[float]) → dict

Fits the estimator on the training set and evaluates it on the validation set for each split produced by the cross-validation generator.

Parameters

parameter_space (dict) – Dict containing parameter space.
X (Iterable[float]) – Array-like of shape (n_samples, n_features) containing predictors.
y (Iterable[float]) – Array-like of shape (n_samples,) containing target label.

Returns

Dict containing cross validation results.

Return type

dict
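The fit-and-evaluate loop described above can be sketched in plain Python. This is only an illustration, not the library's implementation: MajorityClassifier, shuffle_splits, and cross_validate_sketch are hypothetical names, the result key "val_accuracy" is an assumption, and the splitter is a simplified, non-stratified stand-in for the default StratifiedShuffleSplit.

```python
import random
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def shuffle_splits(n_samples, n_splits=3, test_size=0.25, seed=0):
    """Yield (train_idx, val_idx) pairs from independent random shuffles."""
    rng = random.Random(seed)
    n_val = max(1, int(n_samples * test_size))
    for _ in range(n_splits):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield idx[n_val:], idx[:n_val]


def cross_validate_sketch(estimator, X, y, n_splits=3):
    """Fit on each training split and record accuracy on the validation split."""
    results = {"val_accuracy": []}  # assumed key name, for illustration only
    for train_idx, val_idx in shuffle_splits(len(X), n_splits=n_splits):
        estimator.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        y_pred = estimator.predict([X[i] for i in val_idx])
        y_true = [y[i] for i in val_idx]
        hits = sum(p == t for p, t in zip(y_pred, y_true))
        results["val_accuracy"].append(hits / len(y_true))
    return results


X = [[float(i)] for i in range(12)]
y = [0] * 8 + [1] * 4
cv_results = cross_validate_sketch(MajorityClassifier(), X, y)
```

Each entry in the returned list corresponds to one split, which is the shape a later aggregation step would need.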

fit(X: collections.abc.Iterable[float], y: collections.abc.Iterable[float])

Fits estimator.

Parameters

X (Iterable[float]) – Matrix of shape (n_samples, n_features).
y (Iterable[float]) – Array-like of shape (n_samples,).

get_dataset_type_score_name_index(split_iterator: Optional[collections.abc.Iterable[int]] = None) → collections.abc.Generator[tuple[str, str, int]]

Generates tuple composed of dataset type, score name, and index.

Parameters: split_iterator (Union[Iterable[int], None], optional) – Array-like of shape (n_splits,) whose length equals the number of CV splits. Defaults to None.
Yields: Generator[tuple[str, str, int]] – Tuple composed of dataset type, score name, and index.
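A generator of this kind can be sketched with itertools.product. The dataset-type labels ("train", "test"), the score_names argument, and the fallback to a single index 0 when split_iterator is None are all assumptions made for illustration; the actual labels and ordering used by the library may differ.

```python
from itertools import product


def dataset_type_score_name_index(score_names, split_iterator=None,
                                  dataset_types=("train", "test")):
    """Yield (dataset type, score name, split index) for every combination."""
    # Assumption: with no split iterator, fall back to a single index 0.
    indices = [0] if split_iterator is None else split_iterator
    yield from product(dataset_types, score_names, indices)


keys = list(dataset_type_score_name_index(["accuracy", "recall"], split_iterator=range(2)))
```

Tuples like these could serve as keys for storing one score per dataset type, metric, and split.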

objective(y_true: collections.abc.Iterable[float], y_pred: collections.abc.Iterable[float], score_name: str) → float

Objective function to be minimized.

Parameters

y_true (Iterable[float]) – Array-like of shape (n_samples,) containing true values of target label.
y_pred (Iterable[float]) – Array-like of shape (n_samples,) containing predicted values of target label.
score_name (str) – Name of the performance metric.

Returns

Absolute difference between the score and its optimal value.

Return type

float
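The returned quantity, the absolute difference between a score and its optimal value, can be illustrated as follows. This is a sketch under stated assumptions: the SCORERS and OPTIMAL_VALUE tables and the two toy metrics are hypothetical and do not reflect the metrics the library actually supports.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def error_rate(y_true, y_pred):
    """Fraction of predictions that do not match the true labels."""
    return 1.0 - accuracy(y_true, y_pred)


# Hypothetical lookup tables: metric functions and their best achievable values.
SCORERS = {"accuracy": accuracy, "error_rate": error_rate}
OPTIMAL_VALUE = {"accuracy": 1.0, "error_rate": 0.0}


def objective_sketch(y_true, y_pred, score_name):
    """Distance of the score from its optimum; 0.0 means a perfect score."""
    score = SCORERS[score_name](y_true, y_pred)
    return abs(score - OPTIMAL_VALUE[score_name])


loss = objective_sketch([1, 0, 1, 1], [1, 0, 0, 1], "accuracy")  # 0.25
```

Framing the loss this way lets metrics that should be maximized and metrics that should be minimized share one minimization routine.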

optimize(X: collections.abc.Iterable[float], y: collections.abc.Iterable[float]) → dict

Runs hyperparameter optimization.

Parameters

X (Iterable[float]) – Array-like of shape (n_samples, n_features) containing predictors.
y (Iterable[float]) – Array-like of shape (n_samples,) containing target label.

Returns

Optimal parameters found in the parameter space.

Return type

dict
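The contract of this method (search a parameter space, return the best parameters as a dict) can be sketched without Hyperopt. Note that the sketch below swaps in plain random search rather than Bayesian optimization, and that random_search, toy_loss, and the (low, high) range format are hypothetical choices made for illustration only.

```python
import random


def random_search(loss_fn, parameter_space, n_iter=10, seed=0):
    """Sample n_iter candidates uniformly and return the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_iter):
        candidate = {
            name: rng.uniform(low, high)
            for name, (low, high) in parameter_space.items()
        }
        loss = loss_fn(candidate)
        if loss < best_loss:
            best_params, best_loss = candidate, loss
    return best_params


def toy_loss(params):
    """Hypothetical loss with its minimum at learning_rate = 0.3."""
    return abs(params["learning_rate"] - 0.3)


best = random_search(toy_loss, {"learning_rate": (0.0, 1.0)}, n_iter=50)
```

A Bayesian optimizer differs in how it proposes candidates, using earlier evaluations to guide later ones, but the inputs and the returned dict have the same shape.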

post_process_cv_results(): Process cross validation results by calculating average and standard deviation of scores.

predict(X: collections.abc.Iterable[float]) → collections.abc.Iterable[float]

Predicts the class of each observation.

Parameters: X (Iterable[float]) – Array-like of shape (n_samples, n_features) containing predictors.
Returns: Predicted classes.
Return type: Iterable[float]

predict_proba(X: collections.abc.Iterable[float]) → collections.abc.Iterable[float]

Predicts the probability of each observation belonging to a class.

Parameters: X (Iterable[float]) – Array-like of shape (n_samples, n_features) containing predictors.
Returns: Class probabilities.
Return type: Iterable[float]

static scorer_class_map(y_pred: np.ndarray[float], score_name: str, threshold: float = 0.5) → np.ndarray[float]

Maps score name to class.

Parameters

y_pred (np.ndarray[float]) – Array-like of shape (n_samples,).
score_name (str) – Name of the performance metric.
threshold (float, optional) – Threshold used to transform probability into class. Defaults to 0.5.

Returns

Array-like of shape (n_samples,).

Return type

np.ndarray[float]
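One plausible reading of the threshold parameter is sketched below: metrics that need hard labels receive thresholded predictions, while other metrics receive the probabilities unchanged. The LABEL_METRICS set, the use of plain lists instead of NumPy arrays, and the inclusive comparison (>=) are assumptions made for illustration.

```python
# Hypothetical set of metrics that need hard class labels rather than probabilities.
LABEL_METRICS = {"accuracy", "precision", "recall", "f1"}


def class_map_sketch(y_pred, score_name, threshold=0.5):
    """Binarize probabilities for label-based metrics; pass them through otherwise."""
    if score_name in LABEL_METRICS:
        return [1.0 if p >= threshold else 0.0 for p in y_pred]
    return list(y_pred)


labels = class_map_sketch([0.2, 0.5, 0.9], "accuracy")
probabilities = class_map_sketch([0.2, 0.5, 0.9], "roc_auc")
```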

static scorer_optimal_value(score_name: str) → float

Maps score name to optimal value.

Parameters: score_name (str) – Name of the performance metric.
Returns: Optimal value.
Return type: float

class datatoolkit.model_selection.ClassificationCostFunction(metrics: collections.abc.Iterable[str], M: np.ndarray[float] = None, metric_class_opt_val_map: dict[str, tuple[str, float]] = None, proba_threshold: float = 0.5)

objective(y_true: np.ndarray[float], y_pred: np.ndarray[float]) → float

Objective function.

Parameters

y_true (np.ndarray[float]) – Array-like of true labels of length N.
y_pred (np.ndarray[float]) – Array-like of predicted labels of length N.

class datatoolkit.model_selection.CostFunction(metrics: collections.abc.Iterable[str], M: np.ndarray[float])

Abstract base class for cost functions.

abstract objective(y_true: np.ndarray[float], y_pred: np.ndarray[float]) → float

Objective function.

Parameters

y_true (np.ndarray[float]) – Array-like of true labels of length N.
y_pred (np.ndarray[float]) – Array-like of predicted labels of length N.