skbold.preproc package

class LabelFactorizer(grouping)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transforms labels according to a given factorial grouping.

Factorizes/encodes labels based on part of the string label. For example, the label-vector [‘A_1’, ‘A_2’, ‘B_1’, ‘B_2’] can be grouped based on letter (A/B) or number (1/2).

Parameters: grouping (List of str) – List with identifiers for condition names as strings
Variables: new_labels (list) – List with new labels.
fit(y=None, X=None)[source]

Does nothing, but included to be used in sklearn’s Pipeline.

get_new_labels()[source]

Returns new labels based on factorization.

transform(y, X=None)[source]

Transforms label-vector given a grouping.

Parameters:
  • y (List/ndarray of str) – List or ndarray with strings indicating label names
  • X (ndarray) – Numeric (float) array of shape = [n_samples, n_features]
Returns:

  • y_new (ndarray) – array with transformed y-labels
  • X_new (ndarray) – array with transformed data of shape = [n_samples, n_features] given new factorial grouping/design.
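The grouping described above can be sketched in plain Python. This is a simplified illustration of the idea, not skbold's actual implementation, and the helper name `factorize_labels` is made up:

```python
import numpy as np

def factorize_labels(y, grouping):
    """Relabel each sample with the grouping identifier its label contains,
    and record which samples matched (illustrative sketch only)."""
    y_new, idx = [], []
    for i, label in enumerate(y):
        for g in grouping:
            if g in label:          # e.g. 'A' matches 'A_1'
                y_new.append(g)
                idx.append(i)
                break
    return np.array(y_new), np.array(idx)

# Group ['A_1', 'A_2', 'B_1', 'B_2'] by letter (A/B):
y_new, idx = factorize_labels(['A_1', 'A_2', 'B_1', 'B_2'], ['A', 'B'])
# y_new is ['A', 'A', 'B', 'B']; idx can be used to subset X accordingly
```

Grouping by number instead, i.e. `['1', '2']`, would yield `['1', '2', '1', '2']`.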

class MajorityUndersampler(verbose=False)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Undersamples the majority-class(es) by selecting random samples.

Parameters: verbose (bool) – Whether to print the number of samples after downsampling.
__init__(verbose=False)[source]

Initializes MajorityUndersampler object.

fit(X=None, y=None)[source]

Does nothing, but included for scikit-learn pipelines.

transform(X, y)[source]

Downsamples majority-class(es).

Parameters:
  • X (ndarray) – Numeric (float) array of shape = [n_samples, n_features]
  • y (ndarray) – Array of labels corresponding to the samples in X
Returns: X – Transformed array of shape = [n_samples, n_features] after downsampling the majority class(es).
Return type: ndarray
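The downsampling logic can be sketched with numpy. This is an illustrative reimplementation, not skbold's code, and `undersample_majority` is a made-up name:

```python
import numpy as np

def undersample_majority(X, y, seed=42):
    """Randomly drop samples from the larger class(es) so every class
    matches the smallest class size (illustrative sketch only)."""
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()                      # preserve original sample order
    return X[keep], y[keep]

X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 0, 0, 0, 1, 1])   # class 0 is the majority
X_ds, y_ds = undersample_majority(X, y)
# both classes now have 2 samples each
```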
class LabelBinarizer(params)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

__init__(params)[source]

Initializes LabelBinarizer object.

fit(X=None, y=None)[source]

Does nothing, but included for scikit-learn pipelines.

transform(X, y)[source]

Binarizes the y (label) attribute.

Parameters:
  • X (ndarray) – Numeric (float) array of shape = [n_samples, n_features]
  • y (ndarray) – Array of labels to binarize
Returns: X – Transformed array of shape = [n_samples, n_features] after binarizing the labels.
Return type: ndarray
class ConfoundRegressor(confound, X, cross_validate=True, precise=False, stack_intercept=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Fits a confound onto each feature in X and returns their residuals.

__init__(confound, X, cross_validate=True, precise=False, stack_intercept=True)[source]

Regresses out a variable (confound) from each feature in X.

Parameters:
  • confound (numpy array) – Array of shape (n_samples, n_confounds) to regress out of each feature; may have multiple columns for multiple confounds.
  • X (numpy array) – Array of shape (n_samples, n_features) from which the confound will be regressed. This is used to determine how the confound-models should be cross-validated (which is necessary to use it in scikit-learn Pipelines).
  • cross_validate (bool) – Whether to apply the confound parameters (y ~ confound) estimated on the train-set to the test-set (cross_validate=True), or to refit the confound regressor separately on the test-set (cross_validate=False). Setting this parameter to True is equivalent to “foldwise confound regression” (FwCR) as described in our paper (https://www.biorxiv.org/content/early/2018/03/28/290684). Setting it to False, however, is NOT equivalent to “whole dataset confound regression” (WDCR), as it does not apply confound regression to the full dataset but simply refits the confound model on the test-set. We recommend setting this parameter to True.
  • precise (bool) – Transformer objects in scikit-learn only allow passing the data (X) and, optionally, the target (y) to the fit and transform methods, yet the confound must be indexed to match the samples that are passed in. To do so, the X given at initialization (self.X) is compared with the X passed to fit/transform, which reveals which samples are being used so that the confound can be indexed accordingly. When precise is set to True, the arrays are compared feature-wise, which is exact but relatively slow. When precise is set to False, the index is inferred from the sum of all features per sample, which is less exact but much faster; for dense data this should work just fine. To further aid accuracy, features that are constant (0) across samples are removed.
  • stack_intercept (bool) – Whether to stack an intercept to the confound (default is True)
Variables: weights (numpy array) – Array with weights for the confound(s).
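The fast matching strategy used when precise=False can be sketched with numpy. This is an illustrative reimplementation of the idea, not skbold's code, and `infer_sample_indices` is a made-up name:

```python
import numpy as np

def infer_sample_indices(X_full, X_subset):
    """Infer which rows of the full dataset were passed to fit/transform
    by comparing per-sample feature sums (illustrative sketch only)."""
    sums_full = X_full.sum(axis=1)
    return np.array([np.where(np.isclose(sums_full, s))[0][0]
                     for s in X_subset.sum(axis=1)])

rng = np.random.RandomState(1)
X_full = rng.randn(10, 5)
idx_true = np.array([2, 5, 7])                  # e.g. a test fold
idx = infer_sample_indices(X_full, X_full[idx_true])
# idx recovers idx_true, so the confound can be indexed the same way
```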

fit(X, y=None)[source]

Fits the confound-regressor to X.

Parameters:
  • X (numpy array) – An array of shape (n_samples, n_features), which should correspond to your train-set only!
  • y (None) – Included for compatibility; does nothing.
transform(X)[source]

Regresses out confound from X.

Parameters: X (numpy array) – Array of shape (n_samples, n_features) from which the confound will be regressed (either the train- or the test-set).
Returns:X_new – ndarray with confound-regressed features
Return type:ndarray
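As a rough illustration of what ConfoundRegressor computes, the foldwise idea can be sketched with numpy: estimate confound weights on the train-set, then subtract the confound's predicted contribution from held-out features. The function names here are illustrative, not skbold's API:

```python
import numpy as np

def fit_confound_weights(X_train, confound_train):
    """Least-squares fit of each feature on the confound (plus intercept)."""
    C = np.column_stack([np.ones(len(confound_train)), confound_train])
    weights, *_ = np.linalg.lstsq(C, X_train, rcond=None)
    return weights  # shape: (n_confounds + 1, n_features)

def regress_out(X, confound, weights):
    """Subtract the confound's predicted contribution from each feature."""
    C = np.column_stack([np.ones(len(confound)), confound])
    return X - C @ weights

rng = np.random.RandomState(0)
conf = rng.randn(100)
# Two features that are mostly driven by the confound, plus small noise:
X = np.outer(conf, [2.0, -1.0]) + 0.1 * rng.randn(100, 2)
W = fit_confound_weights(X[:80], conf[:80])       # fit on "train" samples
X_resid = regress_out(X[80:], conf[80:], W)       # apply to "test" samples
# X_resid now contains (mostly) just the noise term
```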