skbold.core.mvp_between module

class MvpBetween(source, subject_idf='sub0???', remove_zeros=True, X=None, y=None, mask=None, mask_thres=0, subject_list=None)[source]

Bases: skbold.core.mvp.Mvp

Extracts and stores multivoxel (MRI) pattern information across subjects. MvpBetween can handle several types of data, including functional contrasts, 3D (subject-specific) and 4D (subjects stacked) VBM and TBSS data, dual-regression data, and functional-connectivity data from resting-state scans (experimental).

Parameters:
  • source (dict) –

    Dictionary with types of data as keys and data-specific dictionaries as values. Keys can be ‘Contrast_*’ (indicating a 3D functional contrast), ‘4D_anat’ (for 4D anatomical - VBM/TBSS - files), ‘VBM’, ‘TBSS’, and ‘dual_reg’ (a subject-specific 4D file with components as fourth dimension).

    The dictionary passed as values must include, for each data-type, a path with wildcards to the corresponding (subject-specific) data-file. Other, optional, key-value pairs per data-type can be assigned, including ‘mask’: ‘path’, to use data-type-specific masks.

    An example:

    >>> source = {}
    >>> path_emo = '~/data/sub0*/*.feat/stats/tstat1.nii.gz'
    >>> source['Contrast_emo'] = {'path': path_emo}
    >>> vbm_mask = '~/vbm_mask.nii.gz'
    >>> path_vbm = '~/data/sub0*/*vbm.nii.gz'
    >>> source['VBM'] = {'path': path_vbm, 'mask': vbm_mask}
    
  • subject_idf (str) – Subject-identifier. This identifier is used to extract subject-names from the globbed directories in the ‘path’ keys in source, so that it is known which pattern belongs to which subject. This way, MvpBetween can check which subjects have complete data.
  • X (ndarray) – Does not need to be passed to MvpBetween, but must be defined because the super-constructor requires it.
  • y (ndarray or list) – Labels or targets corresponding to the samples in X.
  • mask (str) – Absolute path to nifti-file that will be used as a common mask. Note: this will only be applied if its shape corresponds to the to-be-indexed data. Otherwise, no mask is applied. Also, this mask is ‘overridden’ if source[data_type] contains a ‘mask’ key, which implies that this particular data-type has a custom mask.
  • mask_thres (int or float) – Minimum value used to binarize the mask when it is probabilistic.
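The role of subject_idf can be illustrated with a small sketch. It assumes the identifier is matched as a shell-style wildcard against each path component; the helpers `extract_subject` and `common_subjects` are hypothetical names used only for illustration and are not part of skbold, whose actual matching logic may differ.

```python
import fnmatch


def extract_subject(path, subject_idf="sub0???"):
    """Return the first path component matching the wildcard, or None."""
    for part in path.split("/"):
        if fnmatch.fnmatchcase(part, subject_idf):
            return part
    return None


def common_subjects(paths_per_type, subject_idf="sub0???"):
    """Return subjects that appear in every data-type (complete data)."""
    subject_sets = [
        {extract_subject(p, subject_idf) for p in paths} - {None}
        for paths in paths_per_type.values()
    ]
    return sorted(set.intersection(*subject_sets))


# Hypothetical file listing, as a glob over the 'path' entries might return
paths_per_type = {
    "Contrast_emo": [
        "/data/sub0001/emo.feat/stats/tstat1.nii.gz",
        "/data/sub0002/emo.feat/stats/tstat1.nii.gz",
    ],
    "VBM": [
        "/data/sub0002/sub0002_vbm.nii.gz",
        "/data/sub0003/sub0003_vbm.nii.gz",
    ],
}
print(common_subjects(paths_per_type))
```

Only subjects present for every key in source survive the intersection, which corresponds to the idea behind the common_subjects attribute.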
Variables:
  • mask_shape (tuple) – Shape of mask that patterns will be indexed with.
  • nifti_header (list of Nifti1Header objects) – Nifti-headers from original data-types.
  • affine (list of ndarray) – Affines corresponding to nifti-masks of each data-type.
  • X (ndarray) – The actual patterns (2D: samples X features)
  • y (list or ndarray) – Array/list with labels/targets corresponding to samples in X.
  • common_subjects (list) – List of subject-names that have complete data specified in source.
  • featureset_id (ndarray) – Array with integers of size X.shape[1] (i.e. the number of features in X). Each unique integer, starting at 0, refers to a different feature-set.
  • voxel_idx (ndarray) –

    Array with integers of size X.shape[1]. Per feature-set, these voxel-indices allow the features to be mapped back to whole-brain space. For example, to map the features in X from the first feature-set (id 0) back to MNI152 (2mm) space, assuming voxel_idx holds flat indices into the raveled volume, do:

    >>> mni_vol = np.zeros((91, 109, 91))
    >>> tmp_idx = mvp.featureset_id == 0
    >>> mni_vol.flat[mvp.voxel_idx[tmp_idx]] = mvp.X[0, tmp_idx]
    
  • data_shape (list of tuples) – Original (whole-brain) shape of the loaded data, per data-type.
  • data_name (list of str) – List of names of data-types.
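How featureset_id and voxel_idx relate to the columns of X can be sketched on toy data. This is a minimal illustration that assumes voxel_idx stores flat indices into the raveled whole-brain volume; the masks, shapes, and values below are made up, and no skbold objects are involved.

```python
import numpy as np

vol_shape = (4, 5, 4)  # toy volume; MNI152 (2mm) would be (91, 109, 91)
n_vox = int(np.prod(vol_shape))

# Two hypothetical feature-sets, each with its own mask over the volume
mask_a = np.zeros(n_vox, dtype=bool)
mask_a[[3, 10, 42]] = True
mask_b = np.zeros(n_vox, dtype=bool)
mask_b[[7, 8]] = True

# One entry per column of X: where it came from and which set it belongs to
voxel_idx = np.concatenate([np.flatnonzero(mask_a), np.flatnonzero(mask_b)])
featureset_id = np.concatenate([np.zeros(3, dtype=int), np.ones(2, dtype=int)])
X = np.arange(1, 11, dtype=float).reshape(2, 5)  # 2 samples x 5 features

# Map sample 0 of feature-set 0 back into whole-volume space
sel = featureset_id == 0
vol = np.zeros(n_vox)
vol[voxel_idx[sel]] = X[0, sel]
vol = vol.reshape(vol_shape)
```

Voxels outside the selected feature-set stay at zero, so each feature-set can be written back to its own volume.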
add_y(file_path, col_name, sep='\t', index_col=0, normalize=False, remove=None, ensure_balanced=False, nan_strategy='remove', **kwargs)[source]

Sets y attribute to an outcome-variable (target).

Parameters:
  • file_path (str) – Absolute path to spreadsheet-like file including the outcome var.
  • col_name (str) – Column name in spreadsheet containing the outcome variable
  • sep (str) – Separator to parse the spreadsheet-like file.
  • index_col (int) – Which column to use as index (should correspond to subject-name).
  • normalize (bool) – Whether to normalize (0 mean, unit std) the outcome variable.
  • remove (int or float or str) – Removes instances in which y == remove from MvpBetween object.
  • ensure_balanced (bool) – Whether to ensure balanced classes (if True, done by undersampling the majority class).
  • nan_strategy (str) – Strategy for dealing with NaNs. Default: ‘remove’. A specific string, int, or float can also be given to impute that value. For other options, see sklearn.preprocessing.Imputer.
  • **kwargs – Arbitrary keyword arguments passed to pandas read_csv.
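The effect of ensure_balanced (undersampling the majority class) can be sketched in plain Python. The function `undersample_indices` is a hypothetical helper for illustration only; how skbold selects which samples to drop is not specified here and may differ.

```python
import random
from collections import Counter


def undersample_indices(y, seed=0):
    """Return sorted sample indices so that every class is equally frequent."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    # Size of the smallest class determines how many samples each class keeps
    n_min = min(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_min))
    return sorted(keep)


y = [0, 0, 0, 0, 1, 1]
keep = undersample_indices(y)
y_balanced = [y[i] for i in keep]
print(Counter(y_balanced))
```

The same indices would also have to be applied to the rows of X so that samples and labels stay aligned.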
apply_binarization_params(param_file, ensure_balanced=False)[source]

Applies binarization-parameters to y.

binarize_y(params, save_path=None, ensure_balanced=False)[source]

Binarizes mvp’s y-attribute using a specified method.

Parameters:
  • params (dict) –

    The outcome variable (y) will be binarized along the key-value pairs in the params-argument. Options:

    >>> params = {'type': 'percentile', 'high': 75, 'low': 25}
    >>> params = {'type': 'zscore', 'std': 1}
    >>> params = {'type': 'constant', 'cutoff': 10}
    >>> params = {'type': 'median'}
    
  • save_path (str) – If not None (default), this should be an absolute path referring to where the binarization-params should be saved.
  • ensure_balanced (bool) – Whether to ensure balanced classes (if True, done by undersampling the majority class).
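The four params variants can be read as different ways of choosing a low and a high cutoff. The sketch below is an assumption about their semantics, not skbold's implementation: values above the high cutoff become 1, values at or below the low cutoff become 0, and values in between are dropped. The function `binarize` is a hypothetical helper.

```python
import statistics


def binarize(y, params):
    """Return (indices_kept, binary_labels) under the assumed semantics."""
    kind = params["type"]
    if kind == "median":
        low = high = statistics.median(y)
    elif kind == "constant":
        low = high = params["cutoff"]
    elif kind == "zscore":
        mean, sd = statistics.mean(y), statistics.pstdev(y)
        low, high = mean - params["std"] * sd, mean + params["std"] * sd
    elif kind == "percentile":
        # 99 cut points; cuts[p - 1] is the p-th percentile
        cuts = statistics.quantiles(y, n=100, method="inclusive")
        low, high = cuts[params["low"] - 1], cuts[params["high"] - 1]
    else:
        raise ValueError(f"Unknown binarization type: {kind}")

    kept, labels = [], []
    for i, value in enumerate(y):
        if value > high:
            kept.append(i)
            labels.append(1)
        elif value <= low:
            kept.append(i)
            labels.append(0)
        # Values strictly between low and high are dropped
    return kept, labels


y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
kept, labels = binarize(y, {"type": "percentile", "high": 75, "low": 25})
```

With ‘median’ and ‘constant’ the two cutoffs coincide, so no samples are dropped; with ‘percentile’ and ‘zscore’ the middle of the distribution is removed.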
calculate_confound_weighting(file_path, col_name, sep='\t', index_col=0, estimator=None, nan_strategy='depends', **kwargs)[source]

Calculates inverse probability weighting for confounds.

Note: should be moved to mvp-core

Parameters:
  • file_path (str) – Absolute path to spreadsheet-like file including the confounding variable.
  • col_name (str or List[str]) – Column name in spreadsheet containing the confounding variable
  • sep (str) – Separator to parse the spreadsheet-like file.
  • index_col (int) – Which column to use as index (should correspond to subject-name).
  • estimator (scikit-learn estimator) – Estimator used to calculate p(y=1 | confound-array)
  • nan_strategy (str) – How to impute NaNs.
  • **kwargs – Arbitrary keyword arguments passed to pandas read_csv.
Returns:

ipw – Array with inverse probability weights for the samples, based on the confounds indicated by col_name.

Return type:

array

References

Linn, K.A., Gaonkar, B., Doshi, J., Davatzikos, C., & Shinohara, R. (2016). Addressing confounding in predictive models with an application to neuroimaging. Int. J. Biostat., 12(1): 31-44.

Code adapted from https://github.com/kalinn/IPW-SVM.
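As a rough illustration of inverse probability weighting: once an estimator has produced p(y=1 | confound-array) for every sample, each sample is weighted by the inverse of the probability of its observed label. The sketch below takes those probabilities as given (the values are made up) and uses a hypothetical helper, `inverse_probability_weights`; any clipping or normalisation that skbold applies is not shown.

```python
def inverse_probability_weights(y, p_pos):
    """Weight each sample by 1 / p(observed label | confounds).

    y     : binary labels (0 or 1)
    p_pos : estimated p(y=1 | confounds) per sample
    """
    weights = []
    for label, p in zip(y, p_pos):
        p_observed = p if label == 1 else 1.0 - p
        weights.append(1.0 / p_observed)
    return weights


y = [1, 1, 0, 0]
p_pos = [0.8, 0.5, 0.5, 0.2]  # made-up estimator output
ipw = inverse_probability_weights(y, p_pos)
```

Samples whose label is well predicted by the confound receive weights close to 1, while samples whose label is unlikely given the confound are up-weighted.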

create()[source]

Extracts and stores data as specified in source.

Raises: ValueError – If data-type is not one of [‘VBM’, ‘TBSS’, ‘4D_anat*’, ‘dual_reg’, ‘Contrast*’]
regress_out_confounds(file_path, col_name, backend='numpy', sep='\t', index_col=0, nan_strategy='depends', **kwargs)[source]

Regresses out a confounding variable from X.

Parameters:
  • file_path (str) – Absolute path to spreadsheet-like file including the confounding variable.
  • col_name (str or List[str]) – Column name in spreadsheet containing the confounding variable
  • backend (str) – Which algorithm to use to regress out the confound. The option ‘numpy’ uses np.linalg.lstsq() and ‘sklearn’ uses the LinearRegression estimator.
  • sep (str) – Separator to parse the spreadsheet-like file.
  • index_col (int) – Which column to use as index (should correspond to subject-name).
  • nan_strategy (str) – How to impute NaNs.
  • **kwargs – Arbitrary keyword arguments passed to pandas read_csv.
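What the ‘numpy’ backend amounts to can be sketched with np.linalg.lstsq on synthetic data: fit the confound to every feature and keep the residuals. Including an intercept column is an assumption of this sketch, and all variable names and data are illustrative rather than taken from skbold.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 50, 3

confound = rng.normal(size=(n_samples, 1))
signal = rng.normal(size=(n_samples, n_features))
X = signal + confound * np.array([2.0, -1.0, 0.5])  # confound leaks into X

# Design matrix: intercept plus confound (the intercept is an assumption)
C = np.hstack([np.ones((n_samples, 1)), confound])

# Least-squares fit of every feature on the confound, then keep residuals
beta, *_ = np.linalg.lstsq(C, X, rcond=None)
X_clean = X - C @ beta
```

After this step the residuals are orthogonal to the confound, so a linear model can no longer exploit it.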
run_searchlight(out_dir, name='sl_results', n_folds=10, radius=5, mask=None, estimator=None, **kwargs)[source]

Runs a searchlight on the mvp object.

Parameters:
  • out_dir (str) – Path to which to save the searchlight results
  • name (str) – Name for the searchlight-results-file (nifti)
  • n_folds (int) – The number of folds in sklearn’s StratifiedKFold.
  • radius (int/list) – Radius for the searchlight. If list, it iterates over radii.
  • mask (str) – Path to mask to apply to mvp. If nothing is listed, it will use the masks applied when the mvp was created.
  • estimator (sklearn estimator or pipeline) – Estimator to use in the classification process.
  • **kwargs – Other keyword arguments for initializing nilearn’s searchlight object (see nilearn.github.io/decoding/searchlight.html).
split(file_path, col_name, target, sep='\t', index_col=0, nan_strategy='train', **kwargs)[source]

Splits an MvpBetween object based on some external index.

Parameters:
  • file_path (str) – Absolute path to spreadsheet-like file including the outcome var.
  • col_name (str) – Column name in spreadsheet containing the outcome variable
  • target (str or int or float) – Target against which the data in col_name is compared in order to create an index.
  • sep (str) – Separator to parse the spreadsheet-like file.
  • index_col (int) – Which column to use as index (should correspond to subject-name).
  • nan_strategy (str) – Which value to impute if the labeling is absent. Default: ‘train’.
  • **kwargs – Arbitrary keyword arguments passed to pandas read_csv.
update_sample(idx)[source]

Updates the data matrix and associated attributes.

write_4D(path=None, return_nimg=False)[source]

Writes a 4D nifti (subs = 4th dimension) of X.

Parameters:
  • path (str) – Absolute path to save nifti to.
  • return_nimg (bool) – Whether to actually return the Nifti1-image object.
check_zeropadding_and_sort(lst)[source]