gwadama.datasets#
datasets.py
Main classes to manage GW datasets.
There are two basic types of datasets, clean and injected:
Clean datasets’ classes inherit from the Base class, extending their properties as needed.
Injected datasets’ classes inherit from the BaseInjected class, and optionally from other UserDefined(Base) classes.
Notes#
TODO: The Base and BaseInjected couple should be more general, building from unlabeled data as in UnlabeledWaves.
- class gwadama.datasets.Base[source]#
Bases: object
Base class for all datasets.
TODO: Update docstring.
Any dataset made of 'clean' (noiseless) GW signals must inherit from this class. It is designed to store strains as nested dictionaries, with each level's key identifying a class/property of the strain. Each individual strain is a 1D NDArray containing the features.
By default there are two basic levels:
Class; to group up strains in categories.
Id; a unique identifier for each strain, which must exist in the metadata DataFrame as Index.
Extra depths can be added, and will be treated as modifications of the same original strains from the upper identifier level. If splitting the dataset into train and test subsets, only combinations of (Class, Id) will be considered.
Notes
The additional depths in the strains nested dictionary can't be directly tracked by the metadata DataFrame.
If working with two polarizations, they can be stored with just an extra depth layer.
TODO: Always check self.times (when provided) to determine whether the sampling frequency is variable. Depending on the result, act accordingly with the current value of self.fs.
- Attributes:
- classesdict
Dict of strings and their integer labels, one per class (category).
- metadatapandas.DataFrame
All parameters and data related to the strains. The order is the same as inside 'strains' if unrolled to a flat list of strains up to the second depth level (the ID). The total number of different waves must be equal to len(metadata); this does not include possible variations such as polarizations or multiple scalings of the same waveform when performing injections.
- strainsdict[dict […]]
Strains stored as a nested dictionary, with each strain in an independent array to provide more flexibility with data of a wide range of lengths.
Shape: {class: {id: strain} }
The 'class' key is the name of the class, which must exist in the 'classes' attribute.
The ‘id’ is a unique identifier for each strain, and must exist in the index of the ‘metadata’ (DataFrame) attribute.
Extra depths can be added as variations of each strain, such as polarizations.
- labelsdict
Class label of each wave ID, with shape {id: class_label}. Each ID points to the label of its class in the ‘classes’ attribute. Can be automatically constructed by calling the ‘_gen_labels()’ method.
- max_lengthint
Length of the longest strain in the dataset. Remember to update it if modifying the strains length.
- paddingdict, optional
- Padding added to the strains with the form:
{id: (pad_left, pad_right)}
This only keeps track of any padding added, for potential later use.
- timesdict, optional
Time samples associated with the strains, following the same structure up to the second depth level: {class: {id: time_points} } Useful when the sampling frequency is variable or different between strains. If None, all strains are assumed to be constantly sampled to the sampling frequency indicated by the ‘fs’ attribute, which must be provided.
- fsint, optional
If the ‘times’ attribute is present, this value is ignored. Otherwise it is assumed all strains are constantly sampled to this value.
Note
If dealing with variable sampling frequencies, avoid setting this attribute to anything other than None.
- random_seedint, optional
Seed used to initialize the random number generator (RNG), as well as for calling sklearn.model_selection.train_test_split() to generate the Train and Test subsets.
- Xtrain, Xtestdict, optional
Train and test subsets randomly split using SKLearn train_test_split function with stratified labels. Shape: {id: strain}. The 'id' corresponds to the strain's index at 'self.metadata'. They are just additional views into the same data stored at 'self.strains', so no copies are performed.
- Ytrain, YtestNDArray[int], optional
1D Array containing the labels in the same order as ‘Xtrain’ and ‘Xtest’ respectively. See the attribute ‘labels’ for more info.
- id_train, id_testNDArray[int], optional
1D Array containing the id of the signals in the same order as ‘Xtrain’ and ‘Xtest’ respectively.
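As an illustration of this layout, a minimal self-contained sketch of the core attributes (all class names, IDs and values below are made up, not part of the library):

```python
import numpy as np

# Hypothetical two-class dataset; class names, IDs and lengths are made up.
classes = {"bns": 0, "glitch": 1}
strains = {
    "bns": {"id0": np.zeros(100)},      # {class: {id: strain}}
    "glitch": {"id1": np.zeros(80)},
}
labels = {"id0": classes["bns"], "id1": classes["glitch"]}

# 'max_length' must be kept in sync with the longest strain.
max_length = max(len(s) for c in strains.values() for s in c.values())
```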
Methods
apply_window(window[, all])Apply a window to all strains.
bandpass(*, f_low, f_high, f_order[, verbose])Apply a forward-backward digital bandpass filter.
build_train_test_subsets(train_size)Generate random Train and Test subsets.
find_class(id)Find which 'class' corresponds to the strain 'id'.
get_strain(*indices[, normalize])Get a single strain from the complete index coordinates.
get_strains_array([length])Get all strains stacked in a zero-padded Numpy 2d-array.
get_times(*indices)Get a single time array from the complete index coordinates.
get_xtest_array([length, classes])Get the test subset stacked in a zero-padded Numpy 2d-array.
get_xtrain_array([length, classes])Get the train subset stacked in a zero-padded Numpy 2d-array.
get_ytest_array([classes, with_id, with_index])Get the filtered test labels.
get_ytrain_array([classes, with_id, with_index])Get the filtered training labels.
items()Return a new view of the dataset's items with unrolled indices.
keys([max_depth])Return the unrolled combinations of all strain identifiers.
normalise([mode, all_strains])Normalise strains.
pad_strains(padding[, window, logpad])Pad strains with zeros on both sides.
pad_to_length(length, *[, window, logpad])Centre-pad all strains to a common target length.
resample(fs[, verbose])Resample strain and time arrays to a constant rate.
shrink_strains(padding[, logpad])Shrink strains by a specified padding.
stack_by_id(id_list[, length])Stack a subset of strains by their ID into a Numpy array.
whiten(*, flength[, asd_array, highpass, ...])Whiten the strains.
- apply_window(window, all=False)[source]#
Apply a window to all strains.
Apply a window to self.strains recursively, and optionally to self.strains_original as well.
- Parameters:
- windowstr | tuple
Window to apply, formatted to be accepted by SciPy’s get_window.
- allbool, optional
If True, apply the window also to self.strains_original.
Notes
Since strains may have different lengths, a window is generated for each one.
TODO: Generalise this method to BaseInjected for when all=True.
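Because a window is generated per strain, the behaviour can be sketched standalone with SciPy (toy strains; not the actual implementation):

```python
import numpy as np
from scipy.signal import get_window

strains = {"a": np.ones(64), "b": np.ones(128)}  # toy strains of unequal length

window = ("tukey", 0.5)  # any specifier accepted by scipy.signal.get_window
for key, strain in strains.items():
    # A window matching each strain's own length is generated on the fly.
    strains[key] = strain * get_window(window, len(strain))
```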
- bandpass(*, f_low: int | float, f_high: int | float, f_order: int | float, verbose=False)[source]#
Apply a forward-backward digital bandpass filter.
Apply a forward-backward digital bandpass filter to all clean strains between frequencies ‘f_low’ and ‘f_high’ with an order of ‘f_order’.
This method is intended to be used prior to any whitening.
Warning
This is an irreversible operation. Original (non-bandpassed) strains will be lost.
- build_train_test_subsets(train_size: int | float)[source]#
Generate random Train and Test subsets.
Only indices in the 'labels' attribute are considered independent waveforms; any extra key (layer) in the 'strains' dict is treated monolithically during the shuffle.
The strain values are just new views into the ‘strains’ attribute. The shuffling is performed by Scikit-Learn’s function ‘train_test_split’, with stratification enabled.
- Parameters:
- train_sizeint | float
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train subset. If int, represents the absolute number of train waves.
Ref: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
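The split relies on Scikit-Learn's train_test_split with stratification enabled; the underlying call can be sketched standalone (toy IDs and labels, not the internal code):

```python
from sklearn.model_selection import train_test_split

# Toy waveform IDs and their class labels (one entry per independent wave).
ids = ["id0", "id1", "id2", "id3", "id4", "id5"]
labels = [0, 0, 0, 1, 1, 1]

id_train, id_test, y_train, y_test = train_test_split(
    ids, labels, train_size=4, stratify=labels, random_state=0
)
# Stratification keeps the class proportions equal in both subsets.
```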
- find_class(id)[source]#
Find which 'class' corresponds to the strain 'id'.
Finds the ‘class’ of the strain represented by the unique identifier ‘id’.
- Parameters:
- idstr
Unique identifier of the strain, which must also appear in the index of the metadata DataFrame.
- Returns:
- clasint | str
Class key associated to the strain ‘id’.
- get_strain(*indices, normalize=False) ndarray[Any, dtype[_ScalarType_co]][source]#
Get a single strain from the complete index coordinates.
This is just a shortcut to avoid having to write several squared brackets.
NOTE: The returned strain is not a copy; if its contents are modified, the changes will be reflected inside the ‘strains’ attribute.
- Parameters:
- *indicesstr | int
The indices of the strain to retrieve.
- normalizebool
If True, the returned strain will be normalized to its maximum amplitude.
- Returns:
- strainNDArray
The requested strain.
- get_strains_array(length: int | None = None) tuple[ndarray[Any, dtype[_ScalarType_co]], list][source]#
Get all strains stacked in a zero-padded Numpy 2d-array.
Stacks all signals into a homogeneous numpy array whose length (axis=1) is determined by either 'length' or, if None, by the longest strain in the dataset. The remaining space is zeroed.
- Parameters:
- lengthint, optional
Target length of the ‘strains_array’. If None, the longest signal determines the length.
- Returns:
- strains_arrayNDArray
All strains stacked into a single zero-padded 2d-array.
- lengthslist
Original length of each strain, following the same order as the first axis of 'strains_array'.
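The zero-padded stacking performed by this and the other get_*_array methods can be reproduced with plain NumPy; a minimal sketch over toy strains:

```python
import numpy as np

strains = [np.ones(3), np.ones(5), np.ones(4)]  # ragged toy signals
lengths = [len(s) for s in strains]
length = max(lengths)  # or an explicit target length

strains_array = np.zeros((len(strains), length))
for i, s in enumerate(strains):
    strains_array[i, :len(s)] = s  # the remaining space stays zeroed
```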
- get_times(*indices) ndarray[Any, dtype[float64]][source]#
Get a single time array from the complete index coordinates.
If there is no time tracking (thus no stored times), a new time array is generated using self.fs and the length of the corresponding strain stored at the same index coordinates.
Warning
The returned array is not a copy; if its contents are modified, the changes will be reflected inside the ‘times’ attribute.
- get_xtest_array(length=None, classes='all')[source]#
Get the test subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the test subset into a homogeneous numpy array whose length (axis=1) is determined by either 'length' or, if None, by the longest strain in the subset. The remaining space is zeroed.
Optionally, classes can be filtered by specifying which to include with the classes parameter.
- Parameters:
- lengthint, optional
Target length of the 'test_array'. If None, the longest signal determines the length.
- classesstr | List[str], optional
Specify which classes to include. All classes ('all') are included by default.
- Returns:
- test_arrayNDArray
test subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘test_array’.
- get_xtrain_array(length=None, classes='all')[source]#
Get the train subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the train subset into a homogeneous numpy array whose length (axis=1) is determined by either 'length' or, if None, by the longest strain in the subset. The remaining space is zeroed.
Optionally, classes can be filtered by specifying which to include with the classes parameter.
- Parameters:
- lengthint, optional
Target length of the ‘train_array’. If None, the longest signal determines the length.
- classesstr | List[str], optional
Specify which classes to include. All classes ('all') are included by default.
- Returns:
- train_arrayNDArray
train subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘train_array’.
- get_ytest_array(classes='all', with_id=False, with_index=False)[source]#
Get the filtered test labels.
- Parameters:
- classesstr | list[str] | ‘all’
The classes to include in the labels. All classes are included by default.
- with_idbool
If True, return also the list of related IDs.
- with_indexbool
If True, return also the related GLOBAL indices; w.r.t. the stacked arrays returned by ‘get_xtest_array’ WITHOUT filters.
- Returns:
- NDArray
Filtered test labels.
- NDArray, optional
IDs associated to the filtered test labels.
- NDArray, optional
Indices associated to the filtered test labels.
- get_ytrain_array(classes='all', with_id=False, with_index=False)[source]#
Get the filtered training labels.
- Parameters:
- classesstr | list[str] | ‘all’
The classes to include in the labels. All classes are included by default.
- with_idbool
If True, return also the list of related IDs.
- with_indexbool
If True, return also the related GLOBAL indices; w.r.t. the stacked arrays returned by ‘get_xtrain_array’ WITHOUT filters. False by default.
- Returns:
- NDArray
Filtered train labels.
- NDArray, optional
IDs associated to the filtered train labels.
- NDArray, optional
Indices associated to the filtered train labels.
- items()[source]#
Return a new view of the dataset’s items with unrolled indices.
Each iteration yields a tuple containing all the nested keys in 'self.strains' along with the corresponding strain: (clas, id, *, strain).
It can be thought of as an extension of Python’s dict.items(). Useful to quickly iterate over all items in the dataset.
Example of usage with an arbitrary number of keys in the nested dictionary of strains:
```
for *keys, strain in self.items():
    print(f"Number of identifiers: {len(keys)}")
    print(f"Length of the strain: {len(strain)}")
    do_something(strain)
```
- keys(max_depth: int | None = None) list[source]#
Return the unrolled combinations of all strain identifiers.
Return the unrolled combinations of all keys of the nested dictionary of strains by a hierarchical recursive search.
It can be thought of as the extended version of Python’s ‘dict().keys()’, although this returns a plain list.
- Parameters:
- max_depthint, optional
If specified, the maximum number of layers to iterate over in the nested 'strains' dictionary.
- Returns:
- keyslist
The unrolled combination in a Python list.
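The hierarchical recursive search can be sketched as a small standalone function over a nested dict (an illustration, not the actual implementation; tuples are used here for the key combinations):

```python
def unrolled_keys(nested: dict, max_depth=None) -> list:
    """Return all key combinations of a nested dict as tuples."""
    keys = []
    for k, v in nested.items():
        if isinstance(v, dict) and (max_depth is None or max_depth > 1):
            next_depth = None if max_depth is None else max_depth - 1
            keys += [(k, *rest) for rest in unrolled_keys(v, next_depth)]
        else:
            keys.append((k,))
    return keys

strains = {"bns": {"id0": [0.0], "id1": [0.0]}, "glitch": {"id2": [0.0]}}
print(unrolled_keys(strains))
# [('bns', 'id0'), ('bns', 'id1'), ('glitch', 'id2')]
```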
- normalise(mode='amplitude', all_strains=False)[source]#
Normalise strains.
Normalise strains using the indicated mode, optionally including self.strains_original.
- Parameters:
- modestr, optional
Normalisation method. Available: amplitude, l2
- all_strainsbool, optional
If True, normalise also self.strains_original.
Notes
TODO: Generalise this method to BaseInjected for when all=True.
- pad_strains(padding: int | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | dict, window=None, logpad=True) None[source]#
Pad strains with zeros on both sides.
This function pads each strain with a specific number of samples on both sides. It also updates the ‘max_length’ attribute to reflect the new maximum length of the padded strains.
- Parameters:
- paddingint | ArrayLike | dict
The padding to apply to each strain. If padding is an integer, it will be applied at both sides of all strains. If padding is a tuple, it must be of the form (left_pad, right_pad) in samples. If padding is a dictionary, it must be of the form {id: (left_pad, right_pad)}, where id is the identifier of each strain.
- windowstr | tuple | list | Callable, optional
Window to apply before padding the arrays. If str, tuple or list, it will be passed to scipy.signal.get_window(window). If Callable, it must take the strain before padding as argument and return the windowed array. By default, no window is applied.
Added in version 0.4.0: This parameter was added to emphasize the potential need for windowing before padding strains, to avoid spectral leakage.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
If time arrays are present, they are also padded accordingly.
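The per-strain padding logic can be sketched with np.pad (toy data; the real method also handles windowing, time arrays and the padding log):

```python
import numpy as np

strains = {"id0": np.ones(4), "id1": np.ones(6)}   # toy strains
padding = {"id0": (2, 3), "id1": (1, 1)}           # {id: (pad_left, pad_right)}

for id_, (left, right) in padding.items():
    strains[id_] = np.pad(strains[id_], (left, right))  # zero padding

max_length = max(len(s) for s in strains.values())      # keep attribute updated
```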
- pad_to_length(length: int, *, window=None, logpad=True) None[source]#
Centre-pad all strains to a common target length.
Computes, for each strain, the number of samples to pad on the left and right so that its final length equals 'length'. Once the per-strain padding dictionary is built, this method calls pad_strains().
- Parameters:
- lengthint
Target total length (in samples) for all strains after padding. Must be greater than or equal to the current length of every strain.
- windowstr | tuple | Callable, optional
Window to apply before padding, passed through to pad_strains(). See that method for details.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
- Raises:
- ValueError
If any existing strain length exceeds 'length'. This method only pads; it does not truncate.
See also
pad_strainsApply explicit per-strain left/right padding.
Notes
If time arrays are tracked (self._track_times is True), their padding is handled by pad_strains().
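The left/right split used for centre-padding each strain can be computed as in this sketch (the exact rounding convention is an assumption; the real method may place the odd sample on the other side):

```python
def centre_padding(current: int, target: int) -> tuple[int, int]:
    """Split the missing samples between both sides of a strain."""
    if current > target:
        raise ValueError("strain longer than target length")
    missing = target - current
    left = missing // 2
    return left, missing - left  # the odd sample goes to the right here

print(centre_padding(5, 8))  # (1, 2)
```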
- resample(fs, verbose=False) None[source]#
Resample strain and time arrays to a constant rate.
This assumes time tracking, either with time arrays or with the sampling frequency provided during initialization, which will be used to generate the time arrays prior to resampling.
This method updates the sampling frequency and the maximum length attributes.
- Parameters:
- fsint
The new sampling frequency in Hz.
- verbosebool
If True, print information about the resampling.
Warning
This method will generate time arrays if time tracking was not enabled. This can lead to inconsistent results when combined with padding-like operations followed by get_times(). In particular, if resampling is performed before padding, a time origin will be set automatically and subsequent padding will preserve it. If resampling is performed after padding, however, get_times() will generate time arrays with origin at 0. Thus, the final time arrays may differ depending on the order of operations. This side effect is temporary and may be removed in a future release.
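As a standalone illustration of resampling a constantly sampled strain to a new rate (using SciPy's polyphase resampler; the actual routine used by this method may differ):

```python
import numpy as np
from scipy.signal import resample_poly

fs_old, fs_new = 4096, 2048
strain = np.random.default_rng(0).normal(size=fs_old)  # 1 s of toy data

# Polyphase resampling by the rational factor fs_new / fs_old.
strain_resampled = resample_poly(strain, fs_new, fs_old)
times = np.arange(len(strain_resampled)) / fs_new      # origin at 0
```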
- shrink_strains(padding: int | tuple | dict, logpad=True) None[source]#
Shrink strains by a specified padding.
Shrink strains (and their associated time arrays if present) by the specified padding, which is understood as negative.
It also updates the ‘max_length’ attribute, and the previous padding if present.
- Parameters:
- paddingint | tuple | dict
The padding to remove from all strains. Values must be given in absolute value (positive int). If pad is an integer, symmetric shrinking is applied to all strains. If pad is a tuple, it must be of the form (pad_left, pad_right) in samples. If pad is a dictionary, it must be of the form
{id: (pad_left, pad_right)},
where id is the identifier of each strain.
Note
If extra layers below ID are present, they will be shrunk using the same pad in cascade.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
This method shrinks strains_original as well.
- stack_by_id(id_list: list, length: int | None = None)[source]#
Stack a subset of strains by their ID into a Numpy array.
Stack an arbitrary selection of strains by their original ID into a zero-padded 2d-array. The resulting order is the same as that of 'id_list'.
- Parameters:
- id_listlist
The IDs of the strains to be stacked.
- lengthint, optional
The target length of the stacked array. If None, the longest signal determines the length.
- Returns:
- stacked_signalsNDArray
The array containing the stacked strains.
- lengthslist
The original lengths of each strain, following the same order as the first axis of ‘stacked_signals’.
Notes
Unlike in ‘get_xtrain_array’ and ‘get_xtest_array’, this method does not filter by ‘classes’ since it would be redundant, as IDs are unique.
- whiten(*, flength: int, asd_array: ndarray[Any, dtype[_ScalarType_co]] | None = None, highpass: int | None = None, normed=False, shrink: int = 0, window: str | tuple = 'hann', verbose=False)[source]#
Whiten the strains.
TODO
Calling this method performs the whitening of all strains.
If asd_array is None, the ASD will be estimated for each strain using SciPy’s Welch method with median average and the same parameters used for whitening.
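The per-strain ASD estimation described above can be sketched with SciPy's Welch method using a median average (toy parameters; the whitening step itself is omitted here):

```python
import numpy as np
from scipy.signal import welch

fs, flength = 4096, 512  # toy sampling rate and segment (filter) length
strain = np.random.default_rng(0).normal(size=4 * fs)  # toy noise strain

# Median-averaged Welch estimate, as described when no asd_array is given.
freqs, psd = welch(strain, fs=fs, nperseg=flength, average="median")
asd = np.sqrt(psd)
```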
- class gwadama.datasets.BaseInjected(clean_dataset: Base, *, psd: ndarray[Any, dtype[_ScalarType_co]] | Callable, noise_length: int = 0, freq_cutoff: int = 0, noise_instance: NonwhiteGaussianNoise | None = None, detector: str = '', random_seed: int | None = None)[source]#
Bases: Base
Manage an injected dataset with multiple SNR values.
It is designed to store strains as nested dictionaries, with each level’s key identifying a class/property of the strain. Each individual strain is a 1D NDArray containing the features.
NOTE: Instances of this class (or any other Class(BaseInjected)) are initialized from an instance of any Class(Base) (a clean dataset).
By default there are THREE basic levels:
Class; to group up strains in categories.
Id; a unique identifier for each strain, which must exist in the metadata DataFrame as Index.
SNR; the signal-to-noise ratio at which each strain has been injected w.r.t. a power spectral density of reference (e.g. the sensitivity of a GW detector).
An extra depth can be added below, and will be treated as multiple injections at the same SNR value. This is useful, for example, to make injections with multiple noise realizations.
Notes
TODO: Right now this class is oriented to simulate the background noise and apply the whitening using the same PSD and other parameters. This needs further generalization so that it can explicitly accept any pre-computed noise and a different PSD for whitening, as well as the possibility to estimate the PSD from the data in a programmatic way.
TODO: Implement an option in gen_injections to return (or store) the scaling factors.
- Attributes:
- classeslist[str]
List of labels, one per class (category).
- metadatapandas.DataFrame
All parameters and data related to the original strains, inherited (copied) from a clean Class(Base) instance. The order is the same as inside 'strains' if unrolled to a flat list of strains up to the second depth level (the ID). The total number of different waves must be equal to len(metadata); this does not include possible variations such as polarizations or multiple scalings of the same waveform when performing injections.
- strains_originaldict[dict]
Strains inherited (copied) from the strains_original attribute of the Class(Base) instance. This copy is kept in order to perform new injections.
Shape: {class: {id: strain} }
The ‘class’ key is the name of the class, a string which must exist in the ‘classes’ list.
The ‘id’ is a unique identifier for each strain, and must exist in the index of the ‘metadata’ (DataFrame) attribute.
Warning
These strains should not be modified. If new clean strains are needed, create a new clean dataset instance first, and then initialise this class with it.
- strainsdict[dict]
Injected strains stored as a nested dictionary, with each strain in an independent array to provide more flexibility with data of a wide range of lengths.
Shape: {class: {id: {snr: strain} } }
The ‘class’ key is the name of the class, a string which must exist in the ‘classes’ list.
The ‘id’ is a unique identifier for each strain, and must exist in the index of the ‘metadata’ (DataFrame) attribute.
The ‘snr’ key is an integer indicating the signal-to-noise ratio of the injection.
A fourth depth will be added below as additional injections per SNR if specified when performing the injections.
- labelsdict
Indices of the class of each wave ID, inherited from a clean Class(Base) instance, with shape {id: class_index}. Each ID points to the index of its class in the ‘classes’ attribute.
- unitsstr
Flag indicating whether the data is in ‘geometrized’ or ‘IS’ units.
- timesdict, optional
Time samples associated with the strains, following the same structure. Useful when the sampling frequency is variable or different between strains. If None, all strains are assumed to be constantly sampled to the sampling frequency indicated by the ‘fs’ attribute.
- fsint
Inherited from the parent Class(Base) instance.
- max_lengthint
Length of the longest strain in the dataset. Remember to update it if manually changing strains’ length.
- random_seedint
Seed used to initialize the random number generator (RNG), as well as for calling sklearn.model_selection.train_test_split() to generate the Train and Test subsets.
- rngnp.random.Generator
Random number generator used for sampling the background noise. Initialized with np.random.default_rng(random_seed).
- detectorstr
GW detector name.
- psd_NDArray
Numerical representation of the Power Spectral Density (PSD) of the detector’s sensitivity.
- asd_NDArray
Numerical representation of the Amplitude Spectral Density (ASD) of the detector’s sensitivity.
- noisegwadama.synthetic.NonwhiteGaussianNoise
Background noise instance from NonwhiteGaussianNoise.
- snr_listlist
List of SNR values at which each signal has been injected.
- paddict
Padding introduced at each SNR injection, used in case the strains are whitened afterwards, to remove the vignetting at the edges. It is associated with SNR values because the only implemented way to pad the signals is during the signal injection.
- injections_per_snrint
Number of injections per SNR value.
- injection_snr_scalesdict
Scaling factors used for generating the injections, stored as a nested dictionary with the same structure as self.strains.
- whitenedbool
Flag indicating whether the dataset has been whitened. Initially set to False, and changed to True after calling the 'whiten' method. Once whitened, this flag remains True, since the whitening is implemented to be irreversible instance-wise.
- whiten_paramsdict
TODO
- freq_cutoffint
Frequency cutoff below which no noise bins will be generated in the frequency space, and also used for the high-pass filter applied to clean signals before injection.
- Xtrain, Xtestdict, optional
Train and test subsets randomly split using SKLearn train_test_split function with stratified labels. Shape adds the SNR layer: {id: {snr: strain}}. The ‘id’ corresponds to the strain’s index at ‘self.metadata’.
- Ytrain, YtestNDArray[int], optional
1D Array containing the labels in the same order as ‘Xtrain’ and ‘Xtest’ respectively.
Warning
Does not include the SNR layer, therefore labels are not repeated.
Methods
apply_window(window[, all])Apply a window to all strains.
asd(frequencies)Amplitude spectral density (ASD) of the detector at given frequencies.
bandpass(*, f_low, f_high, f_order[, verbose])Apply a forward-backward digital bandpass filter.
build_train_test_subsets(train_size)Generate random Train and Test subsets.
export_strains_to_gwf(path, channel[, ...])Export all strains to GWF format, one file per strain.
find_class(id)Find which 'class' corresponds to the strain 'id'.
gen_injections(snr[, randomize_noise, ...])Inject all strains in simulated noise with the given SNR values.
get_strain(*indices[, normalize])Get a single strain from the complete index coordinates.
get_strains_array([length])Get all strains stacked in a zero-padded Numpy 2d-array.
get_times(*indices)Get a single time array from the complete index coordinates.
get_xtest_array([length, classes, snr, ...])Get the test subset stacked in a zero-padded Numpy 2d-array.
get_xtrain_array([length, classes, snr, ...])Get the train subset stacked in a zero-padded Numpy 2d-array.
get_ytest_array([classes, snr, with_id, ...])Get the filtered test labels.
get_ytrain_array([classes, snr, with_id, ...])Get the filtered training labels.
items()Return a new view of the dataset's items with unrolled indices.
keys([max_depth])Return the unrolled combinations of all strain identifiers.
normalise([mode, all_strains])Normalise strains.
pad_strains(padding[, window, logpad])Pad strains with zeros on both sides.
pad_to_length(length, *[, window, logpad])Centre-pad all strains to a common target length.
psd(frequencies)Power spectral density (PSD) of the detector at given frequencies.
resample(fs[, verbose])Resample strain and time arrays to a constant rate.
shrink_strains(padding[, logpad])Shrink strains by a specified padding.
stack_by_id(id_list[, length, snr_included])Stack a subset of strains by ID into a zero-padded 2d-array.
whiten(*, flength[, highpass, normed, ...])Whiten injected strains.
- __init__(clean_dataset: Base, *, psd: ndarray[Any, dtype[_ScalarType_co]] | Callable, noise_length: int = 0, freq_cutoff: int = 0, noise_instance: NonwhiteGaussianNoise | None = None, detector: str = '', random_seed: int | None = None)[source]#
Base constructor for injected datasets.
TODO: Update docstring.
When inheriting from this class, it is recommended to run this method first in your __init__ function.
Relevant attributes are inherited from the 'clean_dataset' instance, which can be any instance inheriting from Base whose strains have not been injected yet.
If train/test subsets are present, they too are updated when performing injections or changing units, but only through re-building them from the main ‘strains’ attribute using the already generated indices. Original train/test subsets from the clean dataset are not inherited.
Warning
Initializing this class does not perform the injections! For that use the method ‘gen_injections’.
- Parameters:
- clean_datasetBase
Instance of a Class(Base) with noiseless signals.
- psdNDArray | Callable
Power Spectral Density of the detector's sensitivity in the range of frequencies of interest. Can be given as a callable function whose argument is expected to be an array of frequencies, or as a 2d-array with shape (2, psd_length) so that psd[0] = frequency_samples and psd[1] = psd_samples.
Note
psd is also used to compute the ‘asd’ attribute (ASD).
- noise_lengthint
Length of the background noise array to be generated for later use. It should be at least as long as the longest signal expected to be injected.
- freq_cutoffint
Frequency cutoff below which no noise bins will be generated in the frequency space, and also used for the high-pass filter applied to clean signals before injection. TODO: Properly separate this parameter from the whitening frequency cutoff, which can be set to a different value.
- noise_instanceNonwhiteGaussianNoise-like, optional
[Experimental] Instead of generating random Gaussian noise, an already generated (or real) noise array can be given.
Warning
This option still needs to be properly integrated and tested.
- detectorstr, optional
GW detector name. Not used, just for identification.
- random_seedint, optional
Seed to initialize the random number generator (used for generating synthetic noise and injecting into random noise positions), as well as for calling
sklearn.model_selection.train_test_split()to generate the Train and Test subsets.
- apply_window(window, all=False)#
Apply a window to all strains.
Apply a window to self.strains recursively, and optionally to self.strains_original as well.
- Parameters:
- windowstr | tuple
Window to apply, formatted to be accepted by SciPy’s get_window.
- allbool, optional
If True, apply the window also to self.strains_original.
Notes
Since strains may have different lengths, a window is generated for each one.
TODO: Generalise this method to BaseInjected for when all=True.
- asd(frequencies: float | ndarray[Any, dtype[float64]]) ndarray[Any, dtype[float64]][source]#
Amplitude spectral density (ASD) of the detector at given frequencies.
Interpolates the ASD at the given frequencies from their array representation. If during initialization the ASD was given as its array representation, the interpolant is computed using SciPy’s quadratic spline interpolant function.
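A standalone sketch of the quadratic-spline interpolation described above, assuming the (2, N) array representation used for psd/asd and SciPy's interp1d with kind="quadratic" (toy ASD values):

```python
import numpy as np
from scipy.interpolate import interp1d

# Toy ASD in array representation: row 0 frequencies, row 1 ASD values.
asd_array = np.array([[10.0, 100.0, 1000.0],
                      [1e-21, 1e-22, 1e-23]])
asd_interp = interp1d(asd_array[0], asd_array[1], kind="quadratic")

# The interpolant passes exactly through the given nodes.
value_at_node = float(asd_interp(100.0))
```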
- bandpass(*, f_low: int | float, f_high: int | float, f_order: int | float, verbose=False)#
Apply a forward-backward digital bandpass filter.
Apply a forward-backward digital bandpass filter to all clean strains between frequencies ‘f_low’ and ‘f_high’ with an order of ‘f_order’.
This method is intended to be used prior to any whitening.
Warning
This is an irreversible operation. Original (non-bandpassed) strains will be lost.
- build_train_test_subsets(train_size: int | float)#
Generate random Train and Test subsets.
Only indices in the 'labels' attribute are considered independent waveforms; any extra key (layer) in the 'strains' dict is treated monolithically during the shuffle.
The strain values are just new views into the ‘strains’ attribute. The shuffling is performed by Scikit-Learn’s function ‘train_test_split’, with stratification enabled.
- Parameters:
- train_sizeint | float
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train subset. If int, represents the absolute number of train waves.
Ref: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
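The underlying call can be illustrated with Scikit-Learn directly; the IDs and labels below are toy values standing in for the (Class, Id) combinations and class labels.

```python
from sklearn.model_selection import train_test_split

# Toy identifiers and their class labels; stratification keeps the
# class proportions in both subsets.
ids = [f"wave_{i}" for i in range(10)]
labels = [0] * 5 + [1] * 5

train_ids, test_ids, train_labels, test_labels = train_test_split(
    ids, labels, train_size=0.8, stratify=labels, random_state=0
)
```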
- export_strains_to_gwf(path: str, channel: str, t0_gps: float = 0, verbose=False) None[source]#
Export all strains to GWF format, one file per strain.
- find_class(id)#
Find which ‘class’ corresponds to the strain ‘id’.
Finds the ‘class’ of the strain represented by the unique identifier ‘id’.
- Parameters:
- idstr
Unique identifier of the strain, which also appears in the index of the metadata DataFrame.
- Returns:
- clasint | str
Class key associated to the strain ‘id’.
- gen_injections(snr: int | float | list | tuple, randomize_noise: bool = False, random_seed: int | None = None, injections_per_snr: int = 1, verbose=False, **inject_kwargs)[source]#
Inject all strains in simulated noise with the given SNR values.
The SNR is computed using a matched filter against the noise PSD.
If the strain is in geometrized units, it will first be converted to IS units, injected, and then converted back to geometrized units.
The automatic highpass filter before each injection is not applied anymore; it is now assumed that clean signals have already been properly filtered.
If the method ‘whiten’ has been already called, all further injections will automatically be whitened with the same parameters, including the unpadding (if > 0).
- Parameters:
- snrint | float | list | tuple
- randomize_noisebool
If True, the noise segment is randomly chosen before the injection. This can be used to avoid having the same noise injected for all clean strains. False by default.
Note
To avoid the possibility of repeating the same noise section in different injections, the noise realization must be reasonably large, e.g:
noise_length > n_clean_strains * self.max_length * len(snr)
- random_seedint, optional
Random seed for noise realization, used only if randomize_noise is True. By default, the random number generator (RNG) created during initialization is used.
Warning
Setting this parameter creates a new RNG, replacing the one initialized with the class. If this is unintended, do not provide this parameter. A warning will be issued when it is used.
- injections_per_snrint, optional
Number of injections per SNR value. Defaults to 1.
This is useful to minimize the statistical impact of the noise when performing injections at a sensitive (low) SNR.
- **inject_kwargs
Additional arguments passed to the _inject method.
- Raises:
- ValueError
Once injections have been performed at a certain SNR value, new injections cannot be performed at the same value; attempting to do so raises this exception.
Notes
If whitening is intended to be applied afterwards, it is useful to pad the signals beforehand in order to avoid the window vignetting produced by the whitening itself.
New injections are stored in the ‘strains’ attribute.
- get_strain(*indices, normalize=False) ndarray[Any, dtype[_ScalarType_co]]#
Get a single strain from the complete index coordinates.
This is just a shortcut to avoid having to write several squared brackets.
NOTE: The returned strain is not a copy; if its contents are modified, the changes will be reflected inside the ‘strains’ attribute.
- Parameters:
- *indicesstr | int
The indices of the strain to retrieve.
- normalizebool
If True, the returned strain will be normalized to its maximum amplitude.
- Returns:
- strainNDArray
The requested strain.
- get_strains_array(length: int | None = None) tuple[ndarray[Any, dtype[_ScalarType_co]], list]#
Get all strains stacked in a zero-padded Numpy 2d-array.
Stacks all signals into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
- Parameters:
- lengthint, optional
Target length of the ‘strains_array’. If None, the longest signal determines the length.
- Returns:
- strains_arrayNDArray
All strains stacked in a zero-padded 2d-array.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘strains_array’.
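The stacking behaviour can be sketched as follows; the helper name is an assumption, and the actual implementation may differ.

```python
import numpy as np

def stack_zero_padded(strains, length=None):
    """Stack 1D strains into a zero-padded 2D array and return original lengths."""
    lengths = [len(s) for s in strains]
    if length is None:
        # The longest strain determines the common length.
        length = max(lengths)
    stacked = np.zeros((len(strains), length))
    for i, s in enumerate(strains):
        stacked[i, : len(s)] = s  # remaining space stays zeroed
    return stacked, lengths
```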
- get_times(*indices) ndarray[Any, dtype[float64]]#
Get a single time array from the complete index coordinates.
If there is no time tracking (thus no stored times), a new time array is generated using self.fs and the length of the corresponding strain stored at the same index coordinates.
Warning
The returned array is not a copy; if its contents are modified, the changes will be reflected inside the ‘times’ attribute.
- get_xtest_array(length: int | None = None, classes: str | list = 'all', snr: int | list | str = 'all', with_metadata: bool = False)[source]#
Get the test subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the test subset into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Allows the possibility to filter by class and SNR.
NOTE: Same signals injected at different SNR are stacked continuously.
- Parameters:
- lengthint, optional
Target length of the ‘test_array’. If None, the longest signal determines the length.
- classesstr | list[str]
Whitelist of classes to include in the stack. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the stack. If more than one is selected, they are stacked zipped as follows:
```
eos0 id0 snr0
eos0 id0 snr1
...
```
All injections are included by default.
- with_metadatabool
If True, the associated metadata is returned in addition to the test array in a Pandas DataFrame instance. This metadata is obtained from the original ‘metadata’ attribute, with the former index inserted as the first column, ‘id’, and with an additional column for the SNR values. False by default.
- Returns:
- test_arrayNDArray
Test subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘test_array’.
- metadatapd.DataFrame, optional
If ‘with_metadata’ is True, the associated metadata is returned with its entries in the same order as the ‘test_array’.
- get_xtrain_array(length: int | None = None, classes: str | list = 'all', snr: int | list | str = 'all', with_metadata: bool = False)[source]#
Get the train subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the train subset into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Allows the possibility to filter by class and SNR.
NOTE: Same signals injected at different SNR are stacked continuously.
- Parameters:
- lengthint, optional
Target length of the ‘train_array’. If None, the longest signal determines the length.
- classesstr | list[str]
Whitelist of classes to include in the stack. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the stack. If more than one is selected, they are stacked zipped as follows:
```
eos0 id0 snr0
eos0 id0 snr1
...
```
All injections are included by default.
- with_metadatabool
If True, the associated metadata is returned in addition to the train array in a Pandas DataFrame instance. This metadata is obtained from the original ‘metadata’ attribute, with the former index inserted as the first column, ‘id’, and with an additional column for the SNR values. False by default.
- Returns:
- train_arrayNDArray
Train subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘train_array’.
- metadatapd.DataFrame, optional
If ‘with_metadata’ is True, the associated metadata is returned with its entries in the same order as the ‘train_array’.
- get_ytest_array(classes='all', snr='all', with_id=False, with_index=False)[source]#
Get the filtered test labels.
- Parameters:
- classesstr | list[str] | ‘all’
Whitelist of classes to include in the labels. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the labels. All injections are included by default.
- with_idbool
If True, return also the related IDs. False by default.
- with_indexbool
If True, return also the related GLOBAL indices w.r.t. the stacked arrays returned by ‘get_xtest_array’ WITHOUT filters.
- Returns:
- NDArray
Filtered test labels.
- NDArray, optional
IDs associated to the filtered test labels.
- NDArray, optional
Indices associated to the filtered test labels.
- get_ytrain_array(classes='all', snr='all', with_id=False, with_index=False)[source]#
Get the filtered training labels.
- Parameters:
- classesstr | list[str] | ‘all’
Whitelist of classes to include in the labels. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the labels. All injections are included by default.
- with_idbool
If True, return also the related IDs. False by default.
- with_indexbool
If True, return also the related GLOBAL indices w.r.t. the stacked arrays returned by ‘get_xtrain_array’ WITHOUT filters. False by default.
- Returns:
- NDArray
Filtered train labels.
- NDArray, optional
IDs associated to the filtered train labels.
- NDArray, optional
Indices associated to the filtered train labels.
- items()#
Return a new view of the dataset’s items with unrolled indices.
Each iteration consists of a tuple containing all the nested keys in ‘self.strains’ along with the corresponding strain, (clas, id, *, strain).
It can be thought of as an extension of Python’s dict.items(). Useful to quickly iterate over all items in the dataset.
Example of usage with an arbitrary number of keys in the nested dictionary of strains:
```
for *keys, strain in self.items():
    print(f"Number of identifiers: {len(keys)}")
    print(f"Length of the strain: {len(strain)}")
    do_something(strain)
```
- keys(max_depth: int | None = None) list#
Return the unrolled combinations of all strain identifiers.
Return the unrolled combinations of all keys of the nested dictionary of strains by a hierarchical recursive search.
It can be thought of as the extended version of Python’s ‘dict().keys()’, although this returns a plain list.
- Parameters:
- max_depthint, optional
If specified, it is the number of layers to iterate to at most in the nested ‘strains’ dictionary.
- Returns:
- keyslist
The unrolled combinations in a Python list.
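The hierarchical recursive search can be sketched as a standalone function; the name and exact signature are assumptions for illustration.

```python
def unrolled_keys(strains: dict, max_depth=None, _depth=1) -> list:
    """Unroll all key combinations of a nested dict via recursive search."""
    keys = []
    for k, v in strains.items():
        if isinstance(v, dict) and (max_depth is None or _depth < max_depth):
            # Descend one layer and prepend the current key to each combination.
            for rest in unrolled_keys(v, max_depth, _depth + 1):
                keys.append((k, *rest))
        else:
            keys.append((k,))
    return keys
```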
- normalise(mode='amplitude', all_strains=False)#
Normalise strains.
Normalise strains to the indicated mode, and optionally to self.strains_original as well.
- Parameters:
- modestr, optional
Normalisation method. Available: amplitude, l2
- all_strainsbool, optional
If True, normalise also self.strains_original.
Notes
TODO: Generalise this method to BaseInjected for when all=True.
- pad_strains(padding: int | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | dict, window=None, logpad=True) None#
Pad strains with zeros on both sides.
This function pads each strain with a specific number of samples on both sides. It also updates the ‘max_length’ attribute to reflect the new maximum length of the padded strains.
- Parameters:
- paddingint | ArrayLike | dict
The padding to apply to each strain. If padding is an integer, it will be applied at both sides of all strains. If padding is a tuple, it must be of the form (left_pad, right_pad) in samples. If padding is a dictionary, it must be of the form {id: (left_pad, right_pad)}, where id is the identifier of each strain.
- windowstr | tuple | list | Callable, optional
Window to apply before padding the arrays. If str, tuple or list, it will be passed to scipy.signal.get_window(window). If Callable, it must take the strain before padding as its argument and return the windowed array. By default, no window is applied.
Added in version 0.4.0: This parameter was added in v0.4.0 to emphasize the potential need of windowing before padding strains to avoid spectral leakage.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
If time arrays are present, they are also padded accordingly.
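The accepted int and tuple padding forms map directly onto NumPy's padding semantics, as this toy sketch shows (the class applies the same logic per strain; the dict form simply selects a tuple per ID).

```python
import numpy as np

strain = np.ones(8)

# int: the same number of zeros is added at both sides.
symmetric = np.pad(strain, 4)        # final length 8 + 4 + 4 = 16
# tuple: (left_pad, right_pad) in samples.
asymmetric = np.pad(strain, (2, 6))  # final length 8 + 2 + 6 = 16
```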
- pad_to_length(length: int, *, window=None, logpad=True) None#
Centre-pad all strains to a common target length.
Computes, for each strain, the number of samples to pad on the left and right so that its final length equals ‘length’. Once the per-strain padding dictionary is built, this method calls pad_strains().
- Parameters:
- lengthint
Target total length (in samples) for all strains after padding. Must be greater than or equal to the current length of every strain.
- windowstr | tuple | Callable, optional
Window to apply before padding, passed through to
pad_strains(). See that method for details.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
- Raises:
- ValueError
If any existing strain length exceeds ‘length’. This method only pads; it does not truncate.
See also
pad_strainsApply explicit per-strain left/right padding.
Notes
If time arrays are tracked (self._track_times is True), their padding is handled by pad_strains().
- psd(frequencies: float | ndarray[Any, dtype[float64]]) ndarray[Any, dtype[float64]][source]#
Power spectral density (PSD) of the detector at given frequencies.
Interpolates the PSD at the given frequencies from its array representation. If the PSD was provided as an array during initialization, the interpolant is computed using SciPy’s quadratic spline interpolation.
- resample(fs, verbose=False) None#
Resample strain and time arrays to a constant rate.
This assumes time tracking either with time arrays or with the sampling frequency provided during initialization, which will be used to generate the time arrays prior to the resampling.
This method updates the sampling frequency and the maximum length attributes.
- Parameters:
- fsint
The new sampling frequency in Hz.
- verbosebool
If True, print information about the resampling.
Warning
This method will generate time arrays if time tracking was not enabled. This can lead to inconsistent results when combined with padding-like operations followed by
get_times(). In particular, if resampling is performed before padding, a time origin will be set automatically, and subsequent padding will preserve it. If resampling is performed after padding, however, get_times() will generate time arrays with origin at 0. Thus, the final time arrays may differ depending on the order of operations. This side effect is temporary and may be removed in a future release.
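The core of the resampling can be sketched with SciPy; the sampling frequencies below are toy values, and the actual method also updates the instance's attributes.

```python
import numpy as np
from scipy.signal import resample

fs_old, fs_new = 4096, 2048   # hypothetical old and new sampling rates (Hz)
t = np.arange(0, 1, 1 / fs_old)
strain = np.sin(2 * np.pi * 50 * t)

# The number of output samples scales with the ratio of sampling rates.
n_new = int(len(strain) * fs_new / fs_old)
strain_new = resample(strain, n_new)
times_new = np.arange(n_new) / fs_new  # regenerated time array at the new rate
```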
- shrink_strains(padding: int | tuple | dict, logpad=True) None#
Shrink strains by a specified padding.
Shrink strains (and their associated time arrays if present) by the specified padding, which is understood as negative.
It also updates the ‘max_length’ attribute, and the previous padding if present.
- Parameters:
- paddingint | tuple | dict
The padding to remove from all strains. Values must be given in absolute value (positive int). If padding is an integer, symmetric shrinking is applied to all strains. If padding is a tuple, it must be of the form (pad_left, pad_right) in samples. If padding is a dictionary, it must be of the form {id: (pad_left, pad_right)}, where id is the identifier of each strain.
Note
If extra layers below ID are present, they will be shrunk using the same pad in cascade.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
This method shrinks strains_original as well.
- stack_by_id(id_list: list, length: int | None = None, snr_included: int | list[int] | str = 'all')[source]#
Stack a subset of strains by ID into a zero-padded 2d-array.
This allows, for example, grouping strains by their original ID without leaking different injections (SNR) of the same strain into different splits.
- Parameters:
- id_listarray-like
The IDs of the strains to be stacked.
- lengthint, optional
The target length of the stacked array. If None, the longest signal determines the length.
- snr_includedint | list[int] | str, optional
The SNR injections to include in the stack. If more than one is selected, they are stacked zipped as follows:
```
id0 snr0
id0 snr1
...
```
All injections are included by default.
- Returns:
- stacked_signalsNDArray
The array containing the stacked strains.
- lengthslist
The original lengths of each strain, following the same order as the first axis of ‘stacked_signals’.
- Raises:
- ValueError
If the value of ‘snr’ is not valid.
Notes
Unlike in ‘get_xtrain_array’ and ‘get_xtest_array’, this method does not filter by ‘classes’ since it would be redundant, as IDs are unique.
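The zipped stacking order described above can be reproduced with a simple nested iteration; the IDs and SNR values here are hypothetical.

```python
# All injections of one ID are kept contiguous before moving to the next ID.
ids = ["id0", "id1"]
snrs = [10, 20]

order = [(i, s) for i in ids for s in snrs]
```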
- whiten(*, flength: int, highpass: int | None = None, normed=False, shrink: int = 0, window: str | tuple = 'hann', verbose=False)[source]#
Whiten injected strains.
Calling this method performs the whitening of all injected strains. Strains are afterwards cut to their original size (before padding was added) to remove the vignetting.
Warning
This is an irreversible action; if the original injections need to be preserved it is advised to make a copy of the instance before performing the whitening.
- Parameters:
- flengthint
Length (in samples) of the time-domain FIR whitening filter.
- highpassfloat, optional
Frequency cutoff.
- normedbool
Normalization applied after the whitening filter.
- shrinkint
Margin at each side of the strain to crop (for each strain ID), in order to avoid edge effects. The corrupted area at each side is 0.5 * flength, which corresponds to the amount of samples it takes for the whitening filter to settle.
- windowstr | tuple, optional
Window to apply to the strain prior to the FFT, ‘hann’ by default. See scipy.signal.get_window() for details on acceptable formats.
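The edge-corruption arithmetic behind the ‘shrink’ parameter can be illustrated with toy values: the whitening filter corrupts 0.5 * flength samples at each side, which can then be cropped away.

```python
import numpy as np

flength = 128                  # FIR filter length in samples (toy value)
corrupted = flength // 2       # samples corrupted at each edge (0.5 * flength)

whitened = np.arange(1024, dtype=float)   # stand-in for a whitened strain
cropped = whitened[corrupted : len(whitened) - corrupted]
```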
- class gwadama.datasets.CoReWaves(*, coredb: CoReManager, classes: dict[str, Any], discarded: dict[int | str, set[set | list | tuple]], cropped: dict, distance: float, inclination: float, phi: float)[source]#
Bases:
Base
Manage all operations needed on a noiseless CoRe dataset.
Initial strains and metadata are obtained from a CoReManager instance.
NOTE: This class treats each equation of state (EOS) present in the CoReManager instance as a different class (category).
NOTE^2: This class adds a time attribute with time samples related to each GW.
Workflow:
Load the strains from a CoReManager instance, discarding or cropping those indicated with their respective arguments.
Resample.
Project onto the ET detector arms.
Change units and scale from geometrized to IS and vice versa.
Export the (latest version of the) dataset to an HDF5 file.
Export the (latest version of the) dataset to a GWF file.
- Attributes:
- classesdict
Dict of strings and their integer labels, one per class (category). The keys are the name of the Equation of State (EOS) used to describe the physics behind the simulation which produced each strain.
- strainsdict {class: {id: gw_strain} }
Strains stored as a nested dictionary, with each strain in an independent array to provide more flexibility with data of a wide range of lengths. The class key is the name of the class, a string which must exist in the ‘classes’ list. The ‘id’ is a unique identifier for each strain, and must exist in the self.metadata.index column of the metadata DataFrame.
Note
Initially, an extra depth layer is defined to store the polarizations of the CoRe GW simulated data. After the projection this layer will be collapsed to a single strain.
- timesdict {class: {id: gw_time_points} }
Time samples associated with the strains, following the same structure. Useful when the sampling frequency is variable or different between strains.
- metadatapandas.DataFrame
All parameters and data related to the strains. The order is the same as inside ‘strains’ if unrolled to a flat list of strains up to the second depth level (the id). Example:
```
metadata[eos][key] = {
    'id': str,
    'mass': float,
    'mass_ratio': float,
    'eccentricity': float,
    'mass_starA': float,
    'mass_starB': float,
    'spin_starA': float,
    'spin_starB': float
}
```
- unitsstr
Flag indicating whether the data is in ‘geometrized’ or ‘IS’ units.
- fsint, optional
Initially this attribute is None because the initial GW from CoRe are sampled at different and non-constant sampling frequencies. After the resampling, this attribute will be set to the new global sampling frequency.
Caveat: If the ‘times’ attribute is present, this value is ignored. Otherwise all strains are assumed to be uniformly sampled at this frequency.
Methods
apply_window(window[, all])Apply a window to all strains.
bandpass(*, f_low, f_high, f_order[, verbose])Apply a forward-backward digital bandpass filter.
build_train_test_subsets(train_size)Generate random Train and Test subsets.
Convert data from scaled geometrized units to IS units.
Convert data from IS to scaled geometrized units.
find_class(id)Find which 'class' corresponds to the strain 'id'.
get_strain(*indices[, normalize])Get a single strain from the complete index coordinates.
get_strains_array([length])Get all strains stacked in a zero-padded Numpy 2d-array.
get_times(*indices)Get a single time array from the complete index coordinates.
get_xtest_array([length, classes])Get the test subset stacked in a zero-padded Numpy 2d-array.
get_xtrain_array([length, classes])Get the train subset stacked in a zero-padded Numpy 2d-array.
get_ytest_array([classes, with_id, with_index])Get the filtered test labels.
get_ytrain_array([classes, with_id, with_index])Get the filtered training labels.
items()Return a new view of the dataset's items with unrolled indices.
keys([max_depth])Return the unrolled combinations of all strain identifiers.
normalise([mode, all_strains])Normalise strains.
pad_strains(padding[, window, logpad])Pad strains with zeros on both sides.
pad_to_length(length, *[, window, logpad])Centre-pad all strains to a common target length.
project(*, detector, ra, dec, geo_time, psi)Project strains into the chosen detector at specified coordinates.
resample(fs[, verbose])Resample strain and time arrays to a constant rate.
shrink_strains(padding[, logpad])Shrink strains by a specified padding.
stack_by_id(id_list[, length])Stack a subset of strains by their ID into a Numpy array.
trim_relative_to_merger([inspiral_span, ...])Trim strains/time arrays relative to the merger.
whiten(*, flength[, asd_array, highpass, ...])Whiten the strains.
find_merger
- __init__(*, coredb: CoReManager, classes: dict[str, Any], discarded: dict[int | str, set[set | list | tuple]], cropped: dict, distance: float, inclination: float, phi: float)[source]#
Initialize a CoReWaves dataset.
TODO
- Parameters:
- coredbioo.CoReManager
Instance of CoReManager with the actual data.
- classesdict[str]
Dictionary with the Equation of State (class) name as key and the corresponding label index as value.
- discardeddict[set|list|tuple]
Dictionary with each key corresponding to a class, indicating by ID which signals to discard. Each value can be a set, list or tuple.
- croppeddict[str]
Dictionary with the class name as key and the corresponding cropping range as value. The range is given as a tuple of the form (start_index, stop_index).
- distancefloat
Distance to the source in Mpc.
- inclinationfloat
Inclination of the source in radians.
- phifloat
Azimuthal angle of the source in radians.
- apply_window(window, all=False)#
Apply a window to all strains.
Apply a window to self.strains recursively, and optionally to self.strains_original as well.
- Parameters:
- windowstr | tuple
Window to apply, formatted to be accepted by SciPy’s get_window.
- allbool, optional
If True, apply the window also to self.strains_original.
Notes
Since strains may have different lengths, a window is generated for each one.
TODO: Generalise this method to BaseInjected for when all=True.
- bandpass(*, f_low: int | float, f_high: int | float, f_order: int | float, verbose=False)#
Apply a forward-backward digital bandpass filter.
Apply a forward-backward digital bandpass filter to all clean strains between frequencies ‘f_low’ and ‘f_high’ with an order of ‘f_order’.
This method is intended to be used prior to any whitening.
Warning
This is an irreversible operation. Original (non-bandpassed) strains will be lost.
- build_train_test_subsets(train_size: int | float)#
Generate random Train and Test subsets.
Only indices in the ‘labels’ attribute are considered independent waveforms; any extra key (layer) in the ‘strains’ dict is treated monolithically during the shuffle.
The strain values are just new views into the ‘strains’ attribute. The shuffling is performed by Scikit-Learn’s function ‘train_test_split’, with stratification enabled.
- Parameters:
- train_sizeint | float
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train subset. If int, represents the absolute number of train waves.
Ref: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- convert_to_IS_units() None[source]#
Convert data from scaled geometrized units to IS units.
Convert strains and times from geometrized units (scaled to the mass of the system and the source distance) to IS units.
Will raise an error if the data is already in IS units.
- convert_to_scaled_geometrized_units() None[source]#
Convert data from IS to scaled geometrized units.
Convert strains and times from IS to geometrized units, and scaled to the mass of the system and the source distance.
Will raise an error if the data is already in geometrized units.
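The scaling behind such a conversion can be illustrated with toy values. This is a hedged sketch assuming the standard NR convention h_phys = (r h / M) * (G M / c^2) / d; the exact factors used by the class may differ.

```python
# Hedged sketch with toy values, not the class implementation.
G = 6.674e-11      # gravitational constant, m^3 kg^-1 s^-2
c = 2.998e8        # speed of light, m/s
Msun = 1.989e30    # solar mass, kg
Mpc = 3.086e22     # megaparsec, m

rh_over_M = 0.1            # toy geometrized (mass-scaled) amplitude
M = 2.7 * Msun             # total mass, typical of a BNS system
d = 100 * Mpc              # source distance

# Physical strain at the detector under the assumed scaling.
h_phys = rh_over_M * (G * M / c**2) / d
```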
- find_class(id)#
Find which ‘class’ corresponds to the strain ‘id’.
Finds the ‘class’ of the strain represented by the unique identifier ‘id’.
- Parameters:
- idstr
Unique identifier of the strain, which also appears in the index of the metadata DataFrame.
- Returns:
- clasint | str
Class key associated to the strain ‘id’.
- get_strain(*indices, normalize=False) ndarray[Any, dtype[_ScalarType_co]]#
Get a single strain from the complete index coordinates.
This is just a shortcut to avoid having to write several squared brackets.
NOTE: The returned strain is not a copy; if its contents are modified, the changes will be reflected inside the ‘strains’ attribute.
- Parameters:
- *indicesstr | int
The indices of the strain to retrieve.
- normalizebool
If True, the returned strain will be normalized to its maximum amplitude.
- Returns:
- strainNDArray
The requested strain.
- get_strains_array(length: int | None = None) tuple[ndarray[Any, dtype[_ScalarType_co]], list]#
Get all strains stacked in a zero-padded Numpy 2d-array.
Stacks all signals into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
- Parameters:
- lengthint, optional
Target length of the ‘strains_array’. If None, the longest signal determines the length.
- Returns:
- strains_arrayNDArray
All strains stacked in a zero-padded 2d-array.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘strains_array’.
- get_times(*indices) ndarray[Any, dtype[float64]]#
Get a single time array from the complete index coordinates.
If there is no time tracking (thus no stored times), a new time array is generated using self.fs and the length of the corresponding strain stored at the same index coordinates.
Warning
The returned array is not a copy; if its contents are modified, the changes will be reflected inside the ‘times’ attribute.
- get_xtest_array(length=None, classes='all')#
Get the test subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the test subset into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Optionally, classes can be filtered by specifying which to include with the classes parameter.
- Parameters:
- lengthint, optional
Target length of the ‘test_array’. If None, the longest signal determines the length.
- classesstr | List[str], optional
Specify which classes to include. Include ‘all’ by default.
- Returns:
- test_arrayNDArray
Test subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘test_array’.
- get_xtrain_array(length=None, classes='all')#
Get the train subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the train subset into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Optionally, classes can be filtered by specifying which to include with the classes parameter.
- Parameters:
- lengthint, optional
Target length of the ‘train_array’. If None, the longest signal determines the length.
- classesstr | List[str], optional
Specify which classes to include. Include ‘all’ by default.
- Returns:
- train_arrayNDArray
Train subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘train_array’.
- get_ytest_array(classes='all', with_id=False, with_index=False)#
Get the filtered test labels.
- Parameters:
- classesstr | list[str] | ‘all’
The classes to include in the labels. All classes are included by default.
- with_idbool
If True, return also the list of related IDs.
- with_indexbool
If True, return also the related GLOBAL indices; w.r.t. the stacked arrays returned by ‘get_xtest_array’ WITHOUT filters.
- Returns:
- NDArray
Filtered test labels.
- NDArray, optional
IDs associated to the filtered test labels.
- NDArray, optional
Indices associated to the filtered test labels.
- get_ytrain_array(classes='all', with_id=False, with_index=False)#
Get the filtered training labels.
- Parameters:
- classesstr | list[str] | ‘all’
The classes to include in the labels. All classes are included by default.
- with_idbool
If True, return also the list of related IDs.
- with_indexbool
If True, return also the related GLOBAL indices; w.r.t. the stacked arrays returned by ‘get_xtrain_array’ WITHOUT filters. False by default.
- Returns:
- NDArray
Filtered train labels.
- NDArray, optional
IDs associated to the filtered train labels.
- NDArray, optional
Indices associated to the filtered train labels.
- items()#
Return a new view of the dataset’s items with unrolled indices.
Each iteration consists of a tuple containing all the nested keys in ‘self.strains’ along with the corresponding strain, (clas, id, *, strain).
It can be thought of as an extension of Python’s dict.items(). Useful to quickly iterate over all items in the dataset.
Example of usage with an arbitrary number of keys in the nested dictionary of strains:
```
for *keys, strain in self.items():
    print(f"Number of identifiers: {len(keys)}")
    print(f"Length of the strain: {len(strain)}")
    do_something(strain)
```
- keys(max_depth: int | None = None) list#
Return the unrolled combinations of all strain identifiers.
Return the unrolled combinations of all keys of the nested dictionary of strains by a hierarchical recursive search.
It can be thought of as the extended version of Python’s ‘dict().keys()’, although this returns a plain list.
- Parameters:
- max_depthint, optional
If specified, it is the number of layers to iterate to at most in the nested ‘strains’ dictionary.
- Returns:
- keyslist
The unrolled combinations in a Python list.
- normalise(mode='amplitude', all_strains=False)#
Normalise strains.
Normalise strains to the indicated mode, and optionally to self.strains_original as well.
- Parameters:
- modestr, optional
Normalisation method. Available: amplitude, l2
- all_strainsbool, optional
If True, normalise also self.strains_original.
Notes
TODO: Generalise this method to BaseInjected for when all=True.
- pad_strains(padding: int | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | dict, window=None, logpad=True) None#
Pad strains with zeros on both sides.
This function pads each strain with a specific number of samples on both sides. It also updates the ‘max_length’ attribute to reflect the new maximum length of the padded strains.
- Parameters:
- paddingint | ArrayLike | dict
The padding to apply to each strain. If padding is an integer, it is applied to both sides of all strains. If padding is a tuple, it must be of the form (left_pad, right_pad) in samples. If padding is a dictionary, it must be of the form {id: (left_pad, right_pad)}, where id is the identifier of each strain.
- windowstr | tuple | list | Callable, optional
Window to apply before padding the arrays. If str, tuple or list, it is passed to scipy.signal.get_window(window). If Callable, it must take the strain before padding as its argument and return the windowed array. By default, no window is applied.
Added in version 0.4.0: This parameter emphasizes the potential need of windowing before padding strains to avoid spectral leakage.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
If time arrays are present, they are also padded accordingly.
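The integer and tuple forms of the padding parameter can be sketched on a single strain as follows (illustrative pure-Python stand-in; the library pads NumPy arrays and updates ‘max_length’ as well):

```python
def pad_strain(strain, padding):
    """Zero-pad a single strain on both sides.

    `padding` may be an int (symmetric pad) or a (left_pad, right_pad)
    tuple, mirroring the `pad_strains` parameter.
    """
    left, right = (padding, padding) if isinstance(padding, int) else padding
    return [0.0] * left + list(strain) + [0.0] * right
```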
- pad_to_length(length: int, *, window=None, logpad=True) None#
Centre-pad all strains to a common target length.
Computes, for each strain, the number of samples to pad on the left and right so that its final length equals ‘length’. Once the per-strain padding dictionary is built, this method calls pad_strains().
- Parameters:
- lengthint
Target total length (in samples) for all strains after padding. Must be greater than or equal to the current length of every strain.
- windowstr | tuple | Callable, optional
Window to apply before padding, passed through to pad_strains(). See that method for details.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
- Raises:
- ValueError
If any existing strain length exceeds ‘length’. This method only pads; it does not truncate.
See also
pad_strainsApply explicit per-strain left/right padding.
Notes
If time arrays are tracked (self._track_times is True), their padding is handled by pad_strains().
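The per-strain left/right amounts for centre-padding can be computed as in this sketch (hypothetical helper name; the library builds a {id: (left, right)} dict and forwards it to pad_strains):

```python
def centre_pad_amounts(current_length, target_length):
    """Left/right padding so a strain of `current_length` reaches
    `target_length`, as evenly split as possible (any odd sample goes
    to the right side in this sketch)."""
    if current_length > target_length:
        raise ValueError("pad_to_length only pads; it does not truncate")
    total = target_length - current_length
    left = total // 2
    return left, total - left
```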
- project(*, detector: str, ra: float, dec: float, geo_time: float, psi: float)[source]#
Project strains into the chosen detector at specified coordinates.
Project strains into the chosen detector at specified coordinates, using Bilby.
This collapses the polarization layer in ‘strains’ and ‘times’ to a single strain. The times are rebuilt taking as a reference point the merger (t = 0).
- Parameters:
- detectorstr
Name of the ET arm in Bilby for InterferometerList().
- ra, decfloat
Sky position in equatorial coordinates.
- geo_timeint | float
Time of injection in GPS.
- psifloat
Polarization angle.
- resample(fs, verbose=False) None#
Resample strain and time arrays to a constant rate.
This assumes time tracking either with time arrays or with the sampling frequency provided during initialization, which is used to generate the time arrays prior to resampling.
This method updates the sampling frequency and the maximum length attributes.
- Parameters:
- fsint
The new sampling frequency in Hz.
- verbosebool
If True, print information about the resampling.
Warning
This method will generate time arrays if time tracking was not enabled. This can lead to inconsistent results when combined with padding-like operations followed by get_times(). In particular, if resampling is performed before padding, a time origin will be set automatically, and subsequent padding will preserve it. If resampling is performed after padding, however, get_times() will generate time arrays with origin at 0. Thus, the final time arrays may differ depending on the order of operations. This side effect is temporary and may be removed in a future release.
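Conceptually, resampling to a constant rate amounts to evaluating the signal on a new uniform time grid. A minimal linear-interpolation sketch (illustrative only; the library relies on SciPy-style resampling, and this helper name is hypothetical):

```python
def resample_linear(strain, times, fs_new):
    """Resample a strain onto a uniform grid at `fs_new` Hz using linear
    interpolation between the original (time, value) samples."""
    t0, t1 = times[0], times[-1]
    n_new = int((t1 - t0) * fs_new) + 1
    new_times = [t0 + k / fs_new for k in range(n_new)]
    out = []
    j = 0
    for t in new_times:
        # advance to the segment [times[j], times[j+1]] containing t
        while j + 1 < len(times) and times[j + 1] < t:
            j += 1
        dt = times[j + 1] - times[j]
        w = (t - times[j]) / dt
        out.append((1 - w) * strain[j] + w * strain[j + 1])
    return new_times, out
```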
- shrink_strains(padding: int | tuple | dict, logpad=True) None#
Shrink strains by a specified padding.
Shrink strains (and their associated time arrays if present) by the specified padding, which is understood as negative.
It also updates the ‘max_length’ attribute, and the previous padding if present.
- Parameters:
- paddingint | tuple | dict
The amount to shrink from all strains. Values must be given in absolute value (positive int). If pad is an integer, symmetric shrinking is applied to all strains. If pad is a tuple, it must be of the form (pad_left, pad_right) in samples. If pad is a dictionary, it must be of the form {id: (pad_left, pad_right)}, where id is the identifier of each strain.
Note
If extra layers below ID are present, they will be shrunk using the same pad in cascade.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
This method shrinks strains_original as well.
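Shrinking is the inverse of padding: samples are dropped from each end. A minimal sketch on a single strain (hypothetical helper name, illustrative only):

```python
def shrink_strain(strain, padding):
    """Drop `padding` samples from the ends of a strain, mirroring the
    int and (pad_left, pad_right) forms accepted by `shrink_strains`."""
    left, right = (padding, padding) if isinstance(padding, int) else padding
    return strain[left:len(strain) - right]
```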
- stack_by_id(id_list: list, length: int | None = None)#
Stack a subset of strains by their ID into a Numpy array.
Stack an arbitrary selection of strains by their original ID into a zero-padded 2d-array. The resulting order matches that of ‘id_list’.
- Parameters:
- id_listlist
The IDs of the strains to be stacked.
- lengthint, optional
The target length of the stacked array. If None, the longest signal determines the length.
- Returns:
- stacked_signalsNDArray
The array containing the stacked strains.
- lengthslist
The original lengths of each strain, following the same order as the first axis of ‘stacked_signals’.
Notes
Unlike in ‘get_xtrain_array’ and ‘get_xtest_array’, this method does not filter by ‘classes’ since it would be redundant, as IDs are unique.
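The zero-padded stacking used here (and by the get_*_array methods) can be sketched in pure Python; the library returns a NumPy 2d-array, so this list-based helper is purely illustrative:

```python
def stack_zero_padded(strains, length=None):
    """Stack variable-length strains into a zero-padded 2D structure,
    returning also the original lengths in the same order."""
    lengths = [len(s) for s in strains]
    target = length if length is not None else max(lengths)
    stacked = [list(s) + [0.0] * (target - len(s)) for s in strains]
    return stacked, lengths
```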
- trim_relative_to_merger(inspiral_span: int | None = None, postmerger_span: int | None = None, logpad: bool = False) None[source]#
Trim strains/time arrays relative to the merger.
Keeps a user-specified amount of data on each side of the merger index and discards the rest. The inspiral side is defined as samples strictly before the merger index; the postmerger side is defined as samples from the merger index onwards. After trimming, the ‘merger_pos’ metadata is re-evaluated to match the new arrays.
This operation is in-place and irreversible.
- Parameters:
- inspiral_spanint or None, default=None
Number of samples to keep on the inspiral (left) side closest to the merger. If None, keep the entire inspiral. If 0, drop all inspiral.
- postmerger_spanint or None, default=None
Number of samples to keep on the postmerger (right) side starting at the merger. If None, keep the entire postmerger. If 0, drop all postmerger.
- logpadbool, default=False
By default the trimming is not recorded in the self.padding register. Set to True to record it.
- Raises:
- TypeError
If inspiral_span or postmerger_span is not None or int.
- ValueError
If a provided span is negative.
- ValueError
If the requested trimming would remove all samples from an array.
- RuntimeError
If required metadata is missing or inconsistent.
Warning
- UserWarning
Emitted when a requested span exceeds the available samples on that side.
Notes
- Spans larger than the available samples on a side are clipped to that side’s length and a warning is emitted.
- If both spans are 0, the result is an empty array (a warning is emitted).
- Inspiral contains no merger sample; postmerger includes the merger sample. Keeping both sides includes the merger exactly once.
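The slicing logic described above can be sketched as follows (hypothetical helper name; the library also re-evaluates the ‘merger_pos’ metadata and validates types, which this sketch omits):

```python
def trim_around_merger(strain, merger_pos,
                       inspiral_span=None, postmerger_span=None):
    """Keep `inspiral_span` samples strictly before the merger index and
    `postmerger_span` samples from it onwards (merger sample included on
    the postmerger side). Spans exceeding the available data are clipped."""
    start = 0 if inspiral_span is None else max(merger_pos - inspiral_span, 0)
    stop = (len(strain) if postmerger_span is None
            else min(merger_pos + postmerger_span, len(strain)))
    return strain[start:stop]
```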
- whiten(*, flength: int, asd_array: ndarray[Any, dtype[_ScalarType_co]] | None = None, highpass: int | None = None, normed=False, shrink: int = 0, window: str | tuple = 'hann', verbose=False)#
Whiten the strains.
TODO
Calling this method performs the whitening of all strains.
If asd_array is None, the ASD will be estimated for each strain using SciPy’s Welch method with median average and the same parameters used for whitening.
- class gwadama.datasets.InjectedCoReWaves(clean_dataset: Base, *, psd: ndarray[Any, dtype[_ScalarType_co]] | Callable, detector: str, noise_length: int, freq_cutoff: int, random_seed: int)[source]#
Bases:
BaseInjected
Manage injections of GW data from the CoRe dataset.
Tracks index position of the merger.
Computes the SNR only at the ring-down starting from the merger.
Computes also the usual SNR over the whole signal and stores it for later reference (attr. ‘whole_snr_list’).
- Attributes:
- snr_listlist
Partial SNR values at which each signal is injected. This SNR is computed ONLY over the Ring-Down section of the waveform starting from the merger, hence the name ‘partial SNR’.
- whole_snrdict
Nested dictionary storing for each injection the equivalent SNR value computed over the whole signal, hence the name ‘whole SNR’. Structure: {id_: {partial_snr: whole_snr}}
- TODO
Methods
apply_window(window[, all])Apply a window to all strains.
asd(frequencies)Amplitude spectral density (ASD) of the detector at given frequencies.
bandpass(*, f_low, f_high, f_order[, verbose])Apply a forward-backward digital bandpass filter.
build_train_test_subsets(train_size)Generate a random Train and Test subsets.
export_strains_to_gwf(path, channel[, ...])Export all strains to GWF format, one file per strain.
find_class(id)Find which 'class' corresponds the strain 'id'.
gen_injections(snr[, snr_offset, ...])Inject all strains in simulated noise with the given SNR values.
get_strain(*indices[, normalize])Get a single strain from the complete index coordinates.
get_strains_array([length])Get all strains stacked in a zero-padded Numpy 2d-array.
get_times(*indices)Get a single time array from the complete index coordinates.
get_xtest_array([length, classes, snr, ...])Get the test subset stacked in a zero-padded Numpy 2d-array.
get_xtrain_array([length, classes, snr, ...])Get the train subset stacked in a zero-padded Numpy 2d-array.
get_ytest_array([classes, snr, with_id, ...])Get the filtered test labels.
get_ytrain_array([classes, snr, with_id, ...])Get the filtered training labels.
items()Return a new view of the dataset's items with unrolled indices.
keys([max_depth])Return the unrolled combinations of all strain identifiers.
normalise([mode, all_strains])Normalise strains.
pad_strains(padding[, window, logpad])Pad strains with zeros on both sides.
pad_to_length(length, *[, window, logpad])Centre-pad all strains to a common target length.
psd(frequencies)Power spectral density (PSD) of the detector at given frequencies.
resample(fs[, verbose])Resample strain and time arrays to a constant rate.
shrink_strains(padding[, logpad])Shrink strains by a specified padding.
stack_by_id(id_list[, length, snr_included])Stack a subset of strains by ID into a zero-padded 2d-array.
whiten(*, flength[, highpass, normed, ...])Whiten injected strains.
- __init__(clean_dataset: Base, *, psd: ndarray[Any, dtype[_ScalarType_co]] | Callable, detector: str, noise_length: int, freq_cutoff: int, random_seed: int)[source]#
Initializes an instance of the InjectedCoReWaves class.
- Parameters:
- clean_datasetBase
An instance of a BaseDataset class with noiseless signals.
- psdNDArray | Callable
Power Spectral Density of the detector’s sensitivity in the range of frequencies of interest. Can be given as a callable function whose argument is expected to be an array of frequencies, or as a 2d-array with shape (2, psd_length) so that
NOTE: It is also used to compute the ‘asd’ attribute (ASD).
- detectorstr
GW detector name.
- noise_lengthint
Length of the background noise array to be generated for later use. It should be longer than the longest signal expected to be injected.
- freq_cutoffint | float
Frequency cutoff for the filter applied to the signal.
- random_seedint
Random seed for generating random numbers.
- apply_window(window, all=False)#
Apply a window to all strains.
Apply a window to self.strains recursively, and optionally to self.strains_original as well.
- Parameters:
- windowstr | tuple
Window to apply, formatted to be accepted by SciPy’s get_window.
- allbool, optional
If True, apply the window also to self.strains_original.
Notes
Since strains may have different lengths, a window is generated for each one.
TODO: Generalise this method to BaseInjected for when all=True.
- asd(frequencies: float | ndarray[Any, dtype[float64]]) ndarray[Any, dtype[float64]]#
Amplitude spectral density (ASD) of the detector at given frequencies.
Interpolates the ASD at the given frequencies from its array representation. If during initialization the ASD was given as an array, the interpolant is computed using SciPy’s quadratic spline interpolation.
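The ASD is simply the square root of the PSD; a minimal sketch of that relationship (the interpolation step is omitted, and the helper name is hypothetical):

```python
import math

def asd_from_psd(psd_values):
    """Amplitude spectral density from power spectral density samples:
    ASD(f) = sqrt(PSD(f))."""
    return [math.sqrt(p) for p in psd_values]
```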
- bandpass(*, f_low: int | float, f_high: int | float, f_order: int | float, verbose=False)#
Apply a forward-backward digital bandpass filter.
Apply a forward-backward digital bandpass filter to all clean strains between frequencies ‘f_low’ and ‘f_high’ with an order of ‘f_order’.
This method is intended to be used prior to any whitening.
Warning
This is an irreversible operation. Original (non-bandpassed) strains will be lost.
- build_train_test_subsets(train_size: int | float)#
Generate random train and test subsets.
Only indices in the ‘labels’ attribute are considered independent waveforms; any extra key (layer) in the ‘strains’ dict is treated monolithically during the shuffle.
The strain values are just new views into the ‘strains’ attribute. The shuffling is performed by Scikit-Learn’s function ‘train_test_split’, with stratification enabled.
- Parameters:
- train_sizeint | float
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train subset. If int, represents the absolute number of train waves.
Ref: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- export_strains_to_gwf(path: str, channel: str, t0_gps: float = 0, verbose=False) None#
Export all strains to GWF format, one file per strain.
- find_class(id)#
Find which ‘class’ corresponds to the strain ‘id’.
Finds the ‘class’ of the strain represented by the unique identifier ‘id’.
- Parameters:
- idstr
Unique identifier of the strain, which also appears as an index in the metadata DataFrame.
- Returns:
- clasint | str
Class key associated to the strain ‘id’.
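Over the nested strains dictionary {class: {id: strain}}, this lookup can be sketched as a simple search (illustrative stand-in, not the library implementation):

```python
def find_class(strains, id_):
    """Return the class key whose group of strains contains `id_`."""
    for clas, group in strains.items():
        if id_ in group:
            return clas
    raise KeyError(f"strain id {id_!r} not found in any class")
```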
- gen_injections(snr: int | float | list, snr_offset: int = 0, randomize_noise: bool = False, random_seed: int | None = None, injections_per_snr: int = 1, verbose=False)[source]#
Inject all strains in simulated noise with the given SNR values.
See ‘BaseInjected.gen_injections’ for more details.
- Parameters:
- snrint | float | list
- snr_offsetint
An offset (relative to the position of the merger) added to the start of the segment of the clean signal used for SNR calculation. If the SNR computation needs to include a portion of signal BEFORE the merger, the offset should be negative.
- randomize_noisebool
If True, the noise segment is randomly chosen before the injection. This can be used to avoid having the same noise injected for all clean strains. False by default.
NOTE: To avoid the possibility of repeating the same noise section in different injections, the noise realization must be reasonably large, e.g:
noise_length > n_clean_strains * self.max_length * len(snr)
- random_seedint, optional
Random seed for the noise realization. Only used when randomize_noise is True.
- injections_per_snrint
Number of injections per SNR value. 1 by default.
- Raises:
- ValueError
Once injections have been performed at a certain SNR value, new injections cannot be made at the same value. Attempting to do so raises this exception.
Notes
If whitening is intended to be applied afterwards, it is useful to pad the signal in order to avoid the window vignetting produced by the whitening itself. This pad will be cropped afterwards.
New injections are stored in the ‘strains’ attribute, with the pad associated to all the injections performed at once, even when whitening is also performed right after the injections.
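The noise-length bound quoted in the note for randomize_noise can be written as a small helper (hypothetical name, directly transcribing the inequality from the docstring):

```python
def min_noise_length(n_clean_strains, max_length, n_snr_values):
    """Lower bound on the background-noise array length suggested above,
    so that randomly chosen noise segments are unlikely to repeat:
    noise_length > n_clean_strains * max_length * n_snr_values."""
    return n_clean_strains * max_length * n_snr_values
```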
- get_strain(*indices, normalize=False) ndarray[Any, dtype[_ScalarType_co]]#
Get a single strain from the complete index coordinates.
This is just a shortcut to avoid having to write several squared brackets.
NOTE: The returned strain is not a copy; if its contents are modified, the changes will be reflected inside the ‘strains’ attribute.
- Parameters:
- *indicesstr | int
The indices of the strain to retrieve.
- normalizebool
If True, the returned strain will be normalized to its maximum amplitude.
- Returns:
- strainNDArray
The requested strain.
- get_strains_array(length: int | None = None) tuple[ndarray[Any, dtype[_ScalarType_co]], list]#
Get all strains stacked in a zero-padded Numpy 2d-array.
Stacks all signals into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
- Parameters:
- lengthint, optional
Target length of the ‘strains_array’. If None, the longest signal determines the length.
- Returns:
- strains_arrayNDArray
Array of all stacked strains.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘strains_array’.
- get_times(*indices) ndarray[Any, dtype[float64]]#
Get a single time array from the complete index coordinates.
If there is no time tracking (thus no stored times), a new time array is generated using self.fs and the length of the corresponding strain stored at the same index coordinates.
Warning
The returned array is not a copy; if its contents are modified, the changes will be reflected inside the ‘times’ attribute.
- get_xtest_array(length: int | None = None, classes: str | list = 'all', snr: int | list | str = 'all', with_metadata: bool = False)#
Get the test subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the test subset into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Allows the possibility to filter by class and SNR.
NOTE: Same signals injected at different SNR are stacked continuously.
- Parameters:
- lengthint, optional
Target length of the ‘test_array’. If None, the longest signal determines the length.
- classesstr | list[str]
Whitelist of classes to include in the stack. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the stack. If more than one are selected, they are stacked zipped as follows:
```
eos0 id0 snr0
eos0 id0 snr1
...
```
All injections are included by default.
- with_metadatabool
If True, the associated metadata is returned in addition to the test array in a Pandas DataFrame instance. This metadata is obtained from the original ‘metadata’ attribute, with the former index inserted as the first column, ‘id’, and with an additional column for the SNR values. False by default.
- Returns:
- test_arrayNDArray
Test subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘test_array’.
- metadatapd.DataFrame, optional
If ‘with_metadata’ is True, the associated metadata is returned with its entries in the same order as the ‘test_array’.
- get_xtrain_array(length: int | None = None, classes: str | list = 'all', snr: int | list | str = 'all', with_metadata: bool = False)#
Get the train subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the train subset into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Allows the possibility to filter by class and SNR.
NOTE: Same signals injected at different SNR are stacked continuously.
- Parameters:
- lengthint, optional
Target length of the ‘train_array’. If None, the longest signal determines the length.
- classesstr | list[str]
Whitelist of classes to include in the stack. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the stack. If more than one are selected, they are stacked zipped as follows:
```
eos0 id0 snr0
eos0 id0 snr1
...
```
All injections are included by default.
- with_metadatabool
If True, the associated metadata is returned in addition to the train array in a Pandas DataFrame instance. This metadata is obtained from the original ‘metadata’ attribute, with the former index inserted as the first column, ‘id’, and with an additional column for the SNR values. False by default.
- Returns:
- train_arrayNDArray
Train subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘train_array’.
- metadatapd.DataFrame, optional
If ‘with_metadata’ is True, the associated metadata is returned with its entries in the same order as the ‘train_array’.
- get_ytest_array(classes='all', snr='all', with_id=False, with_index=False)#
Get the filtered test labels.
- Parameters:
- classesstr | list[str] | ‘all’
Whitelist of classes to include in the labels. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the labels. All injections are included by default.
- with_idbool
If True, return also the related IDs. False by default.
- with_indexbool
If True, also return the related GLOBAL indices, i.e. those referring to the stacked arrays returned by ‘get_xtest_array’ WITHOUT filters. False by default.
- Returns:
- NDArray
Filtered test labels.
- NDArray, optional
IDs associated to the filtered test labels.
- NDArray, optional
Indices associated to the filtered test labels.
- get_ytrain_array(classes='all', snr='all', with_id=False, with_index=False)#
Get the filtered training labels.
- Parameters:
- classesstr | list[str] | ‘all’
Whitelist of classes to include in the labels. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the labels. All injections are included by default.
- with_idbool
If True, return also the related IDs. False by default.
- with_indexbool
If True, also return the related GLOBAL indices, i.e. those referring to the stacked arrays returned by ‘get_xtrain_array’ WITHOUT filters. False by default.
- Returns:
- NDArray
Filtered train labels.
- NDArray, optional
IDs associated to the filtered train labels.
- NDArray, optional
Indices associated to the filtered train labels.
- items()#
Return a new view of the dataset’s items with unrolled indices.
Each iteration consists of a tuple containing all the nested keys in ‘self.strains’ along with the corresponding strain: (clas, id, *, strain).
It can be thought of as an extension of Python’s dict.items(). Useful to quickly iterate over all items in the dataset.
Example of usage with an arbitrary number of keys in the nested dictionary of strains:
```
for *keys, strain in self.items():
    print(f"Number of identifiers: {len(keys)}")
    print(f"Length of the strain: {len(strain)}")
    do_something(strain)
```
- keys(max_depth: int | None = None) list#
Return the unrolled combinations of all strain identifiers.
Return the unrolled combinations of all keys of the nested dictionary of strains by a hierarchical recursive search.
It can be thought of as the extended version of Python’s ‘dict().keys()’, although this returns a plain list.
- Parameters:
- max_depthint, optional
If specified, the maximum number of layers to iterate over in the nested ‘strains’ dictionary.
- Returns:
- keyslist
The unrolled combination in a Python list.
- normalise(mode='amplitude', all_strains=False)#
Normalise strains.
Normalise strains to the indicated mode, and optionally to self.strains_original as well.
- Parameters:
- modestr, optional
Normalisation method. Available: amplitude, l2
- all_strainsbool, optional
If True, normalise also self.strains_original.
Notes
TODO: Generalise this method to BaseInjected for when all=True.
- pad_strains(padding: int | ArrayLike | dict, window=None, logpad=True) None#
Pad strains with zeros on both sides.
This function pads each strain with a specific number of samples on both sides. It also updates the ‘max_length’ attribute to reflect the new maximum length of the padded strains.
- Parameters:
- paddingint | ArrayLike | dict
The padding to apply to each strain. If padding is an integer, it is applied to both sides of all strains. If padding is a tuple, it must be of the form (left_pad, right_pad) in samples. If padding is a dictionary, it must be of the form {id: (left_pad, right_pad)}, where id is the identifier of each strain.
- windowstr | tuple | list | Callable, optional
Window to apply before padding the arrays. If str, tuple or list, it is passed to scipy.signal.get_window(window). If Callable, it must take the strain before padding as its argument and return the windowed array. By default, no window is applied.
Added in version 0.4.0: This parameter emphasizes the potential need of windowing before padding strains to avoid spectral leakage.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
If time arrays are present, they are also padded accordingly.
- pad_to_length(length: int, *, window=None, logpad=True) None#
Centre-pad all strains to a common target length.
Computes, for each strain, the number of samples to pad on the left and right so that its final length equals ‘length’. Once the per-strain padding dictionary is built, this method calls pad_strains().
- Parameters:
- lengthint
Target total length (in samples) for all strains after padding. Must be greater than or equal to the current length of every strain.
- windowstr | tuple | Callable, optional
Window to apply before padding, passed through to pad_strains(). See that method for details.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
- Raises:
- ValueError
If any existing strain length exceeds ‘length’. This method only pads; it does not truncate.
See also
pad_strainsApply explicit per-strain left/right padding.
Notes
If time arrays are tracked (self._track_times is True), their padding is handled by pad_strains().
- psd(frequencies: float | ndarray[Any, dtype[float64]]) ndarray[Any, dtype[float64]]#
Power spectral density (PSD) of the detector at given frequencies.
Interpolates the PSD at the given frequencies from its array representation. If during initialization the PSD was given as an array, the interpolant is computed using SciPy’s quadratic spline interpolation.
- resample(fs, verbose=False) None#
Resample strain and time arrays to a constant rate.
This assumes time tracking either with time arrays or with the sampling frequency provided during initialization, which is used to generate the time arrays prior to resampling.
This method updates the sampling frequency and the maximum length attributes.
- Parameters:
- fsint
The new sampling frequency in Hz.
- verbosebool
If True, print information about the resampling.
Warning
This method will generate time arrays if time tracking was not enabled. This can lead to inconsistent results when combined with padding-like operations followed by get_times(). In particular, if resampling is performed before padding, a time origin will be set automatically, and subsequent padding will preserve it. If resampling is performed after padding, however, get_times() will generate time arrays with origin at 0. Thus, the final time arrays may differ depending on the order of operations. This side effect is temporary and may be removed in a future release.
- shrink_strains(padding: int | tuple | dict, logpad=True) None#
Shrink strains by a specified padding.
Shrink strains (and their associated time arrays if present) by the specified padding, which is understood as negative.
It also updates the ‘max_length’ attribute, and the previous padding if present.
- Parameters:
- paddingint | tuple | dict
The amount to shrink from all strains. Values must be given in absolute value (positive int). If pad is an integer, symmetric shrinking is applied to all strains. If pad is a tuple, it must be of the form (pad_left, pad_right) in samples. If pad is a dictionary, it must be of the form {id: (pad_left, pad_right)}, where id is the identifier of each strain.
Note
If extra layers below ID are present, they will be shrunk using the same pad in cascade.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
This method shrinks strains_original as well.
- stack_by_id(id_list: list, length: int | None = None, snr_included: int | list[int] | str = 'all')#
Stack a subset of strains by ID into a zero-padded 2d-array.
This allows, for example, grouping strains by their original ID without leaking different injections (SNRs) of the same strain into different splits.
- Parameters:
- id_listarray-like
The IDs of the strains to be stacked.
- lengthint, optional
The target length of the stacked array. If None, the longest signal determines the length.
- snr_includedint | list[int] | str, optional
The SNR injections to include in the stack. If more than one are selected, they are stacked zipped as follows:
```
id0 snr0
id0 snr1
...
```
All injections are included by default.
- Returns:
- stacked_signalsNDArray
The array containing the stacked strains.
- lengthslist
The original lengths of each strain, following the same order as the first axis of ‘stacked_signals’.
- Raises:
- ValueError
If the value of ‘snr_included’ is not valid.
Notes
Unlike in ‘get_xtrain_array’ and ‘get_xtest_array’, this method does not filter by ‘classes’ since it would be redundant, as IDs are unique.
- whiten(*, flength: int, highpass: int | None = None, normed=False, shrink: int = 0, window: str | tuple = 'hann', verbose=False)#
Whiten injected strains.
Calling this method performs the whitening of all injected strains. Strains are then cropped back to their original (pre-pad) size to remove the vignetting.
Warning
This is an irreversible action; if the original injections need to be preserved it is advised to make a copy of the instance before performing the whitening.
- Parameters:
- flengthint
Length (in samples) of the time-domain FIR whitening filter.
- highpassfloat, optional
Frequency cutoff.
- normedbool
Normalization applied after the whitening filter.
- shrinkint
Margin at each side of the strain to crop (for each strain ID), in order to avoid edge effects. The corrupted area at each side is 0.5 * flength, which corresponds to the number of samples it takes for the whitening filter to settle.
- windowstr | tuple, optional
Window to apply to the strain prior to FFT, ‘hann’ by default. See scipy.signal.get_window() for details on acceptable formats.
- class gwadama.datasets.InjectedSyntheticWaves(clean_dataset: SyntheticWaves, *, psd: ndarray[Any, dtype[_ScalarType_co]] | Callable, detector: str, noise_length: int, freq_cutoff: int, random_seed: int)[source]#
Bases:
BaseInjected
TODO
Methods
apply_window(window[, all])Apply a window to all strains.
asd(frequencies)Amplitude spectral density (ASD) of the detector at given frequencies.
bandpass(*, f_low, f_high, f_order[, verbose])Apply a forward-backward digital bandpass filter.
build_train_test_subsets(train_size)Generate a random Train and Test subsets.
export_strains_to_gwf(path, channel[, ...])Export all strains to GWF format, one file per strain.
find_class(id)Find which 'class' corresponds the strain 'id'.
gen_injections(snr[, randomize_noise, ...])Inject all strains in simulated noise with the given SNR values.
get_strain(*indices[, normalize])Get a single strain from the complete index coordinates.
get_strains_array([length])Get all strains stacked in a zero-padded Numpy 2d-array.
get_times(*indices)Get a single time array from the complete index coordinates.
get_xtest_array([length, classes, snr, ...])Get the test subset stacked in a zero-padded Numpy 2d-array.
get_xtrain_array([length, classes, snr, ...])Get the train subset stacked in a zero-padded Numpy 2d-array.
get_ytest_array([classes, snr, with_id, ...])Get the filtered test labels.
get_ytrain_array([classes, snr, with_id, ...])Get the filtered training labels.
items()Return a new view of the dataset's items with unrolled indices.
keys([max_depth])Return the unrolled combinations of all strain identifiers.
normalise([mode, all_strains])Normalise strains.
pad_strains(padding[, window, logpad])Pad strains with zeros on both sides.
pad_to_length(length, *[, window, logpad])Centre-pad all strains to a common target length.
psd(frequencies)Power spectral density (PSD) of the detector at given frequencies.
resample(fs[, verbose])Resample strain and time arrays to a constant rate.
shrink_strains(padding[, logpad])Shrink strains by a specified padding.
stack_by_id(id_list[, length, snr_included])Stack a subset of strains by ID into a zero-padded 2d-array.
whiten(*, flength[, highpass, normed, ...])Whiten injected strains.
- __init__(clean_dataset: SyntheticWaves, *, psd: ndarray[Any, dtype[_ScalarType_co]] | Callable, detector: str, noise_length: int, freq_cutoff: int, random_seed: int)[source]#
Base constructor for injected datasets.
TODO: Update docstring.
When inheriting from this class, it is recommended to run this method first in your __init__ function.
Relevant attributes are inherited from the ‘clean_dataset’ instance, which can be any class inheriting from BaseDataset whose strains have not been injected yet.
If train/test subsets are present, they too are updated when performing injections or changing units, but only through re-building them from the main ‘strains’ attribute using the already generated indices. Original train/test subsets from the clean dataset are not inherited.
Warning
Initializing this class does not perform the injections! For that use the method ‘gen_injections’.
- Parameters:
- clean_datasetBase
Instance of a Class(Base) with noiseless signals.
- psdNDArray | Callable
Power Spectral Density of the detector’s sensitivity in the range of frequencies of interest. Can be given as a callable function whose argument is expected to be an array of frequencies, or as a 2d-array with shape (2, psd_length) so that:
psd[0] = frequency_samples
psd[1] = psd_samples
Note
psd is also used to compute the ‘asd’ attribute (ASD).
- noise_lengthint
Length of the background noise array to be generated for later use. It should be at least as long as the longest signal expected to be injected.
- freq_cutoffint
Frequency cutoff below which no noise bins will be generated in frequency space; also used for the high-pass filter applied to clean signals before injection. TODO: Properly separate this parameter from the whitening frequency cutoff, which can be set to a different value.
- noise_instanceNonwhiteGaussianNoise-like, optional
[Experimental] Instead of generating random Gaussian noise, an already generated (or real) noise array can be given.
Warning
This option still needs to be properly integrated and tested.
- detectorstr, optional
GW detector name. Not used, just for identification.
- random_seedint, optional
Seed to initialize the random number generator (used for generating synthetic noise and injecting into random noise positions), as well as for calling sklearn.model_selection.train_test_split() to generate the Train and Test subsets.
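The expected (2, psd_length) layout of the psd parameter can be sketched with NumPy; the power-law curve below is purely illustrative, not a real detector sensitivity:

```python
import numpy as np

# Hypothetical power-law PSD, for illustration only (not a real detector curve).
freqs = np.linspace(20.0, 2048.0, 1000)            # frequency samples [Hz]
psd_values = 1e-46 * (freqs / 100.0) ** -4 + 1e-47

# Shape (2, psd_length): row 0 holds the frequencies, row 1 the PSD samples.
psd = np.stack([freqs, psd_values])

# The ASD attribute is derived as the square root of the PSD.
asd_values = np.sqrt(psd[1])
```

A callable taking an array of frequencies and returning PSD values is accepted as an alternative to this array form.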
- apply_window(window, all=False)#
Apply a window to all strains.
Apply a window to self.strains recursively, and optionally to self.strains_original as well.
- Parameters:
- windowstr | tuple
Window to apply, formatted to be accepted by SciPy’s get_window.
- allbool, optional
If True, apply the window also to self.strains_original.
Notes
Since strains may have different lengths, a window is generated for each one.
TODO: Generalise this method to BaseInjected for when all=True.
- asd(frequencies: float | ndarray[Any, dtype[float64]]) ndarray[Any, dtype[float64]]#
Amplitude spectral density (ASD) of the detector at given frequencies.
Interpolates the ASD at the given frequencies from their array representation. If during initialization the ASD was given as its array representation, the interpolant is computed using SciPy’s quadratic spline interpolant function.
- bandpass(*, f_low: int | float, f_high: int | float, f_order: int | float, verbose=False)#
Apply a forward-backward digital bandpass filter.
Apply a forward-backward digital bandpass filter to all clean strains between frequencies ‘f_low’ and ‘f_high’ with an order of ‘f_order’.
This method is intended to be used prior to any whitening.
Warning
This is an irreversible operation. Original (non-bandpassed) strains will be lost.
- build_train_test_subsets(train_size: int | float)#
Generate random Train and Test subsets.
Only indices in the ‘labels’ attribute are considered independent waveforms; any extra key (layer) in the ‘strains’ dict is treated monolithically during the shuffle.
The strain values are just new views into the ‘strains’ attribute. The shuffling is performed by Scikit-Learn’s function ‘train_test_split’, with stratification enabled.
- Parameters:
- train_sizeint | float
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train subset. If int, represents the absolute number of train waves.
Ref: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
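A minimal sketch of the underlying stratified split; the labels and sizes below are fabricated for illustration, whereas gwadama wires this up internally from the ‘labels’ attribute:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Fabricated IDs and class labels (illustration only).
ids = np.arange(12)
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Stratified shuffle: class proportions are preserved in both subsets.
ids_train, ids_test, y_train, y_test = train_test_split(
    ids, labels, train_size=0.5, stratify=labels, random_state=0)
```

With train_size=0.5 and six waves per class, each subset ends up with exactly three waves of each class.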
- export_strains_to_gwf(path: str, channel: str, t0_gps: float = 0, verbose=False) None#
Export all strains to GWF format, one file per strain.
- find_class(id)#
Find which ‘class’ corresponds to the strain ‘id’.
Finds the ‘class’ of the strain represented by the unique identifier ‘id’.
- Parameters:
- idstr
Unique identifier of the strain, which also appears in the index of the ‘metadata’ DataFrame.
- Returns:
- clasint | str
Class key associated to the strain ‘id’.
- gen_injections(snr: int | float | list | tuple, randomize_noise: bool = False, random_seed: int | None = None, injections_per_snr: int = 1, verbose=False, **inject_kwargs)#
Inject all strains in simulated noise with the given SNR values.
The SNR is computed using a matched filter against the noise PSD.
If the strain is in geometrized units, it will be converted first to SI units, then injected, and converted back to geometrized units.
The automatic highpass filter before each injection is not applied anymore. It is now assumed that clean signals have been properly filtered beforehand.
If the method ‘whiten’ has been already called, all further injections will automatically be whitened with the same parameters, including the unpadding (if > 0).
- Parameters:
- snrint | float | list | tuple
SNR value(s) at which to perform the injections.
- randomize_noisebool
If True, the noise segment is randomly chosen before the injection. This can be used to avoid having the same noise injected for all clean strains. False by default.
Note
To avoid the possibility of repeating the same noise section in different injections, the noise realization must be reasonably large, e.g:
noise_length > n_clean_strains * self.max_length * len(snr)
- random_seedint, optional
Random seed for noise realization, used only if randomize_noise is True. By default, the random number generator (RNG) created during initialization is used.
Warning
Setting this parameter creates a new RNG, replacing the one initialized with the class. If this is unintended, do not provide this parameter. A warning will be issued when it is used.
- injections_per_snrint, optional
Number of injections per SNR value. Defaults to 1.
This is useful to minimize the statistical impact of the noise when performing injections at a sensitive (low) SNR.
- **inject_kwargs
Additional arguments passed to the _inject method.
- Raises:
- ValueError
Once injections have been performed at a given SNR value, injecting again at the same value is not allowed and raises this exception.
Notes
If whitening is intended to be applied afterwards it is useful to pad the signals beforehand, in order to avoid the window vignetting produced by the whitening itself.
New injections are stored in the ‘strains’ attribute.
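The scaling behind a target matched-filter SNR can be sketched as follows, assuming the standard optimal-SNR formula; the damped sinusoid, flat PSD and all values are illustrative and do not reproduce gwadama's internal _inject routine:

```python
import numpy as np

fs = 4096                          # sampling rate [Hz] (illustrative)
n = 4096                           # one second of data
t = np.arange(n) / fs

# Hypothetical clean signal: a damped sinusoid (illustration only).
h = np.exp(-4 * t) * np.sin(2 * np.pi * 300 * t)

# Illustrative flat one-sided PSD over the rfft frequency grid.
freqs = np.fft.rfftfreq(n, 1 / fs)
sn = np.full_like(freqs, 1e-4)

# Optimal matched-filter SNR: rho^2 = 4 * df * sum(|h(f)|^2 / Sn(f)).
hf = np.fft.rfft(h) / fs           # approximate continuous Fourier transform
df = freqs[1] - freqs[0]
rho = np.sqrt(4 * df * np.sum(np.abs(hf) ** 2 / sn))

# Rescaling the clean strain sets the injected SNR exactly,
# since the SNR is linear in the signal amplitude.
target_snr = 20.0
h_scaled = h * (target_snr / rho)
```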
- get_strain(*indices, normalize=False) ndarray[Any, dtype[_ScalarType_co]]#
Get a single strain from the complete index coordinates.
This is just a shortcut to avoid having to write several square brackets.
NOTE: The returned strain is not a copy; if its contents are modified, the changes will be reflected inside the ‘strains’ attribute.
- Parameters:
- *indicesstr | int
The indices of the strain to retrieve.
- normalizebool
If True, the returned strain will be normalized to its maximum amplitude.
- Returns:
- strainNDArray
The requested strain.
- get_strains_array(length: int | None = None) tuple[ndarray[Any, dtype[_ScalarType_co]], list]#
Get all strains stacked in a zero-padded Numpy 2d-array.
Stacks all signals into a homogeneous numpy array whose length (axis=1) is determined either by ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
- Parameters:
- lengthint, optional
Target length of the ‘strains_array’. If None, the longest signal determines the length.
- Returns:
- strains_arrayNDArray
All strains stacked in a zero-padded 2d-array.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘strains_array’.
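The stacking logic can be sketched with plain NumPy (toy strains, for illustration only):

```python
import numpy as np

# Hypothetical ragged strains of different lengths (illustration only).
strains = [np.ones(3), np.ones(5), np.ones(4)]

lengths = [len(s) for s in strains]
length = max(lengths)              # or a caller-provided target length

# Zero-padded 2d stack: each row holds one strain, trailing space stays zeroed.
strains_array = np.zeros((len(strains), length))
for i, s in enumerate(strains):
    strains_array[i, :len(s)] = s
```

Returning the original lengths alongside the array lets callers recover each strain without the trailing zeros.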
- get_times(*indices) ndarray[Any, dtype[float64]]#
Get a single time array from the complete index coordinates.
If there is no time tracking (thus no stored times), a new time array is generated using self.fs and the length of the corresponding strain stored at the same index coordinates.
Warning
The returned array is not a copy; if its contents are modified, the changes will be reflected inside the ‘times’ attribute.
- get_xtest_array(length: int | None = None, classes: str | list = 'all', snr: int | list | str = 'all', with_metadata: bool = False)#
Get the test subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the test subset into a homogeneous numpy array whose length (axis=1) is determined either by ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Allows filtering by class and SNR.
NOTE: The same signals injected at different SNR are stacked contiguously.
- Parameters:
- lengthint, optional
Target length of the ‘test_array’. If None, the longest signal determines the length.
- classesstr | list[str]
Whitelist of classes to include in the stack. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the stack. If more than one are selected, they are stacked zipped as follows:
eos0 id0 snr0
eos0 id0 snr1
…
All injections are included by default.
- with_metadatabool
If True, the associated metadata is returned in addition to the test array in a Pandas DataFrame instance. This metadata is obtained from the original ‘metadata’ attribute, with the former index inserted as the first column, ‘id’, and with an additional column for the SNR values. False by default.
- Returns:
- test_arrayNDArray
Test subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘test_array’.
- metadatapd.DataFrame, optional
If ‘with_metadata’ is True, the associated metadata is returned with its entries in the same order as the ‘test_array’.
- get_xtrain_array(length: int | None = None, classes: str | list = 'all', snr: int | list | str = 'all', with_metadata: bool = False)#
Get the train subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the train subset into a homogeneous numpy array whose length (axis=1) is determined either by ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Allows filtering by class and SNR.
NOTE: The same signals injected at different SNR are stacked contiguously.
- Parameters:
- lengthint, optional
Target length of the ‘train_array’. If None, the longest signal determines the length.
- classesstr | list[str]
Whitelist of classes to include in the stack. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the stack. If more than one are selected, they are stacked zipped as follows:
eos0 id0 snr0
eos0 id0 snr1
…
All injections are included by default.
- with_metadatabool
If True, the associated metadata is returned in addition to the train array in a Pandas DataFrame instance. This metadata is obtained from the original ‘metadata’ attribute, with the former index inserted as the first column, ‘id’, and with an additional column for the SNR values. False by default.
- Returns:
- train_arrayNDArray
Train subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘train_array’.
- metadatapd.DataFrame, optional
If ‘with_metadata’ is True, the associated metadata is returned with its entries in the same order as the ‘train_array’.
- get_ytest_array(classes='all', snr='all', with_id=False, with_index=False)#
Get the filtered test labels.
- Parameters:
- classesstr | list[str] | ‘all’
Whitelist of classes to include in the labels. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the labels. All injections are included by default.
- with_idbool
If True, return also the related IDs. False by default.
- with_indexbool
If True, return also the related GLOBAL indices w.r.t. the stacked arrays returned by ‘get_xtest_array’ WITHOUT filters.
- Returns:
- NDArray
Filtered test labels.
- NDArray, optional
IDs associated to the filtered test labels.
- NDArray, optional
Indices associated to the filtered test labels.
- get_ytrain_array(classes='all', snr='all', with_id=False, with_index=False)#
Get the filtered training labels.
- Parameters:
- classesstr | list[str] | ‘all’
Whitelist of classes to include in the labels. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the labels. All injections are included by default.
- with_idbool
If True, return also the related IDs. False by default.
- with_indexbool
If True, return also the related GLOBAL indices w.r.t. the stacked arrays returned by ‘get_xtrain_array’ WITHOUT filters. False by default.
- Returns:
- NDArray
Filtered train labels.
- NDArray, optional
IDs associated to the filtered train labels.
- NDArray, optional
Indices associated to the filtered train labels.
- items()#
Return a new view of the dataset’s items with unrolled indices.
Each iteration consists of a tuple containing all the nested keys in ‘self.strains’ along with the corresponding strain, (clas, id, *, strain).
It can be thought of as an extension of Python’s dict.items(). Useful to quickly iterate over all items in the dataset.
Example of usage with an arbitrary number of keys in the nested dictionary of strains:
for *keys, strain in self.items():
    print(f"Number of identifiers: {len(keys)}")
    print(f"Length of the strain: {len(strain)}")
    do_something(strain)
- keys(max_depth: int | None = None) list#
Return the unrolled combinations of all strain identifiers.
Return the unrolled combinations of all keys of the nested dictionary of strains by a hierarchical recursive search.
It can be thought of as the extended version of Python’s ‘dict().keys()’, although this returns a plain list.
- Parameters:
- max_depthint, optional
If specified, the maximum number of layers of the nested ‘strains’ dictionary to iterate over.
- Returns:
- keyslist
The unrolled combination in a Python list.
- normalise(mode='amplitude', all_strains=False)#
Normalise strains.
Normalise strains according to the indicated mode, optionally including self.strains_original.
- Parameters:
- modestr, optional
Normalisation method. Available: amplitude, l2
- all_strainsbool, optional
If True, normalise also self.strains_original.
Notes
TODO: Generalise this method to BaseInjected for when all=True.
- pad_strains(padding: int | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | dict, window=None, logpad=True) None#
Pad strains with zeros on both sides.
This function pads each strain with a specific number of samples on both sides. It also updates the ‘max_length’ attribute to reflect the new maximum length of the padded strains.
- Parameters:
- paddingint | ArrayLike | dict
The padding to apply to each strain. If padding is an integer, it will be applied at both sides of all strains. If padding is a tuple, it must be of the form (left_pad, right_pad) in samples. If padding is a dictionary, it must be of the form {id: (left_pad, right_pad)}, where id is the identifier of each strain.
- windowstr | tuple | list | Callable, optional
Window to apply before padding the arrays. If str, tuple or list, it will be passed to scipy.signal.get_window(window). If Callable, it must take the strain before padding as argument and return the windowed array. By default, no window is applied.
Added in version 0.4.0: to emphasize the potential need of windowing before padding strains in order to avoid spectral leakage.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
If time arrays are present, they are also padded accordingly.
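A minimal sketch of windowing before padding, using scipy.signal.get_window as the docstring suggests (toy strain and illustrative values):

```python
import numpy as np
from scipy.signal import get_window

strain = np.ones(8)                        # hypothetical strain
left_pad, right_pad = 4, 4

# Taper the edges first, then zero-pad on both sides; windowing before
# padding avoids the discontinuity that causes spectral leakage.
windowed = strain * get_window('hann', len(strain))
padded = np.pad(windowed, (left_pad, right_pad))
```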
- pad_to_length(length: int, *, window=None, logpad=True) None#
Centre-pad all strains to a common target length.
Computes, for each strain, the number of samples to pad on the left and right so that its final length equals ‘length’. Once the per-strain padding dictionary is built, this method calls pad_strains().
- Parameters:
- lengthint
Target total length (in samples) for all strains after padding. Must be greater than or equal to the current length of every strain.
- windowstr | tuple | Callable, optional
Window to apply before padding, passed through to pad_strains(). See that method for details.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
- Raises:
- ValueError
If any existing strain length exceeds ‘length’. This method only pads; it does not truncate.
See also
pad_strainsApply explicit per-strain left/right padding.
Notes
If time arrays are tracked (self._track_times is True), their padding is handled by pad_strains().
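The per-strain padding computation can be sketched as follows; the lengths and target are hypothetical, and this is only the arithmetic the method describes, not gwadama's actual implementation:

```python
# Hypothetical strain lengths keyed by ID (illustration only).
strain_lengths = {'id0': 10, 'id1': 13}
length = 16                                 # common target length

# Per-strain (left, right) padding so every strain ends up centred.
padding = {}
for id_, n in strain_lengths.items():
    if n > length:
        raise ValueError(f"strain {id_} is longer than the target length")
    left = (length - n) // 2
    padding[id_] = (left, length - n - left)
```

The resulting dictionary has exactly the {id: (left_pad, right_pad)} shape accepted by pad_strains().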
- psd(frequencies: float | ndarray[Any, dtype[float64]]) ndarray[Any, dtype[float64]]#
Power spectral density (PSD) of the detector at given frequencies.
Interpolates the PSD at the given frequencies from their array representation. If during initialization the PSD was given as its array representation, the interpolant is computed using SciPy’s quadratic spline interpolant function.
- resample(fs, verbose=False) None#
Resample strain and time arrays to a constant rate.
This assumes time tracking either with time arrays or with the sampling frequency provided during initialization, which will be used to generate the time arrays prior to the resampling.
This method updates the sampling frequency and the maximum length attributes.
- Parameters:
- fsint
The new sampling frequency in Hz.
- verbosebool
If True, print information about the resampling.
Warning
This method will generate time arrays if time tracking was not enabled. This can lead to inconsistent results when combined with padding-like operations followed by get_times(). In particular, if resampling is performed before padding, a time origin will be set automatically, and subsequent padding will preserve it. If resampling is performed after padding, however, get_times() will generate time arrays with origin at 0. Thus, the final time arrays may differ depending on the order of operations. This side effect is temporary and may be removed in a future release.
- shrink_strains(padding: int | tuple | dict, logpad=True) None#
Shrink strains by a specified padding.
Shrink strains (and their associated time arrays if present) by the specified padding, which is understood as negative.
It also updates the ‘max_length’ attribute, and the previous padding if present.
- Parameters:
- paddingint | tuple | dict
The padding to remove from each strain, given in absolute value (positive int). If an integer, symmetric shrinking is applied to all strains. If a tuple, it must be of the form (pad_left, pad_right) in samples. If a dictionary, it must be of the form {id: (pad_left, pad_right)}, where id is the identifier of each strain.
Note
If extra layers below ID are present, they will be shrunk using the same pad in cascade.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
This method shrinks strains_original as well.
- stack_by_id(id_list: list, length: int | None = None, snr_included: int | list[int] | str = 'all')#
Stack a subset of strains by ID into a zero-padded 2d-array.
This allows, for example, grouping strains by their original ID without leaking different injections (SNRs) of the same strain into different splits.
- Parameters:
- id_listarray-like
The IDs of the strains to be stacked.
- lengthint, optional
The target length of the stacked array. If None, the longest signal determines the length.
- snr_includedint | list[int] | str, optional
The SNR injections to include in the stack. If more than one are selected, they are stacked zipped as follows:
id0 snr0
id0 snr1
…
All injections are included by default.
- Returns:
- stacked_signalsNDArray
The array containing the stacked strains.
- lengthslist
The original lengths of each strain, following the same order as the first axis of ‘stacked_signals’.
- Raises:
- ValueError
If the value of ‘snr’ is not valid.
Notes
Unlike in ‘get_xtrain_array’ and ‘get_xtest_array’, this method does not filter by ‘classes’ since it would be redundant, as IDs are unique.
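The ‘zipped’ stacking order can be sketched in a few lines; the IDs and SNR values below are fabricated for illustration:

```python
# Fabricated IDs and SNR whitelist (illustration only).
id_list = ['id0', 'id1']
snr_included = [10, 20]

# Injections of the same ID at different SNRs end up contiguous,
# so splitting by ID never separates injections of the same strain.
order = [(id_, snr) for id_ in id_list for snr in snr_included]
```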
- whiten(*, flength: int, highpass: int | None = None, normed=False, shrink: int = 0, window: str | tuple = 'hann', verbose=False)#
Whiten injected strains.
Calling this method performs the whitening of all injected strains. Strains are afterwards cut back to their original size (prior to padding) to remove the vignetting.
Warning
This is an irreversible action; if the original injections need to be preserved it is advised to make a copy of the instance before performing the whitening.
- Parameters:
- flengthint
Length (in samples) of the time-domain FIR whitening filter.
- highpassfloat, optional
Frequency cutoff.
- normedbool
If True, normalization is applied after the whitening filter.
- shrinkint
Margin at each side of the strain to crop (for each strain ID), in order to avoid edge effects. The corrupted area at each side is 0.5 * flength, which corresponds to the amount of samples it takes for the whitening filter to settle.
- windowstr | tuple, optional
Window to apply to the strain prior to FFT, ‘hann’ by default. See scipy.signal.get_window() for details on acceptable formats.
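Although this method applies a time-domain FIR filter, the core idea of whitening can be sketched in the frequency domain: dividing the data by the detector ASD flattens its spectrum. The snippet below uses already-white noise and a flat PSD, so the operation reduces (numerically) to the identity; all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 4096
n = 4096
data = rng.normal(size=n)                  # already-white noise, variance 1

# Flat one-sided PSD of unit-variance white noise: Sn = 2 / fs.
freqs = np.fft.rfftfreq(n, 1 / fs)
asd = np.full_like(freqs, np.sqrt(2 / fs))

# Divide by the ASD in the frequency domain, then rescale so the
# whitened time series has unit variance.
white = np.fft.irfft(np.fft.rfft(data) / asd, n=n) * np.sqrt(2 / fs)
```

In the real method the FIR filter needs 0.5 * flength samples at each edge to settle, which is why the ‘shrink’ parameter exists.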
- class gwadama.datasets.InjectedUnlabeledWaves(clean_dataset: UnlabeledWaves, psd: ndarray[Any, dtype[_ScalarType_co]] | Callable | None = None, noise_length: int = 0, freq_cutoff: int | float | None = None, noise_instance: NonwhiteGaussianNoise | None = None, detector: str = '', random_seed: int | None = None)[source]#
Bases:
UnlabeledBaseMixin,BaseInjectedDataset class for injected gravitational wave signals without labels.
This class extends Base, modifying its behavior to handle injections in UnlabeledWaves datasets, where gravitational wave signals are provided without associated labels.
Notes
Unlike BaseInjected, this class does not track class labels.
Train/Test split is still supported but is not stratified.
- Attributes:
- TODO
Methods
apply_window(window[, all])Apply a window to all strains.
asd(frequencies)Amplitude spectral density (ASD) of the detector at given frequencies.
bandpass(*, f_low, f_high, f_order[, verbose])Apply a forward-backward digital bandpass filter.
build_train_test_subsets(train_size)Generate random Train and Test subsets.
export_strains_to_gwf(path, channel[, ...])Export all strains to GWF format, one file per strain.
find_class(id)Find which 'class' corresponds to the strain 'id'.
gen_injections(snr[, randomize_noise, ...])Inject all strains in simulated noise with the given SNR values.
get_strain(*indices[, normalize])Get a single strain from the complete index coordinates.
get_strains_array([length])Get all strains stacked in a zero-padded Numpy 2d-array.
get_times(*indices)Get a single time array from the complete index coordinates.
get_xtest_array([length, classes, snr, ...])Get the test subset stacked in a zero-padded Numpy 2d-array.
get_xtrain_array([length, classes, snr, ...])Get the train subset stacked in a zero-padded Numpy 2d-array.
get_ytest_array([classes, snr, with_id, ...])Get the filtered test labels.
get_ytrain_array([classes, snr, with_id, ...])Get the filtered training labels.
items()Return a new view of the dataset's items with unrolled indices.
keys([max_depth])Return the unrolled combinations of all strain identifiers.
normalise([mode, all_strains])Normalise strains.
pad_strains(padding[, window, logpad])Pad strains with zeros on both sides.
pad_to_length(length, *[, window, logpad])Centre-pad all strains to a common target length.
psd(frequencies)Power spectral density (PSD) of the detector at given frequencies.
resample(fs[, verbose])Resample strain and time arrays to a constant rate.
shrink_strains(padding[, logpad])Shrink strains by a specified padding.
stack_by_id(id_list[, length, snr_included])Stack a subset of strains by ID into a zero-padded 2d-array.
whiten(*, flength[, highpass, normed, ...])Whiten injected strains.
- __init__(clean_dataset: UnlabeledWaves, psd: ndarray[Any, dtype[_ScalarType_co]] | Callable | None = None, noise_length: int = 0, freq_cutoff: int | float | None = None, noise_instance: NonwhiteGaussianNoise | None = None, detector: str = '', random_seed: int | None = None)[source]#
Initialize an InjectedUnlabeledWaves dataset.
This constructor is built from a previous UnlabeledWaves instance.
If train/test subsets are present, they too are updated when performing injections or changing units, but only through re-building them from the main ‘strains’ attribute using the already generated indices. Original train/test subsets from the clean dataset are not inherited.
Warning
Initializing this class does not perform the injections! For that use the method ‘gen_injections’.
- Parameters:
- clean_datasetUnlabeledWaves
- psdNDArray | Callable, optional
Power Spectral Density of the detector’s sensitivity in the range of frequencies of interest. Can be given as a callable function whose argument is expected to be an array of frequencies, or as a 2d-array with shape (2, psd_length) so that
psd[0] = frequency_samples
psd[1] = psd_samples
If not given, it will be assumed that the dataset lives in the whitened space.
Note
psd is also used to compute the ‘asd’ attribute, if given.
- noise_lengthint, optional
Length of the background noise array to be generated for later use. It should be at least as long as the longest signal expected to be injected.
- freq_cutoffint | float, optional
Frequency cutoff below which no noise bins will be generated in the frequency space, and also used for the high-pass filter applied to clean signals before injection.
- noise_instanceNonwhiteGaussianNoise-like, optional
[Experimental] Instead of generating random Gaussian noise, an already generated (or real) noise array can be given.
Warning
This option still needs to be properly integrated and tested.
- detectorstr, optional
GW detector name. Not used, just for identification.
- random_seedint, optional
Value passed to ‘sklearn.model_selection.train_test_split’ to generate the Train and Test subsets. Saved for reproducibility purposes, and also used to initialize Numpy’s default RandomGenerator.
Notes
A dummy class label (‘unique’: 1) is assigned for compatibility.
Metadata is omitted in this class.
The dataset structure supports train/test splitting, but labels are not relevant.
This constructor is a reimplementation of Base.__init__ adapted for a single (dummy) class.
- apply_window(window, all=False)#
Apply a window to all strains.
Apply a window to self.strains recursively, and optionally to self.strains_original as well.
- Parameters:
- windowstr | tuple
Window to apply, formatted to be accepted by SciPy’s get_window.
- allbool, optional
If True, apply the window also to self.strains_original.
Notes
Since strains may have different lengths, a window is generated for each one.
TODO: Generalise this method to BaseInjected for when all=True.
- asd(frequencies: float | ndarray[Any, dtype[float64]]) ndarray[Any, dtype[float64]]#
Amplitude spectral density (ASD) of the detector at given frequencies.
Interpolates the ASD at the given frequencies from their array representation. If during initialization the ASD was given as its array representation, the interpolant is computed using SciPy’s quadratic spline interpolant function.
- bandpass(*, f_low: int | float, f_high: int | float, f_order: int | float, verbose=False)#
Apply a forward-backward digital bandpass filter.
Apply a forward-backward digital bandpass filter to all clean strains between frequencies ‘f_low’ and ‘f_high’ with an order of ‘f_order’.
This method is intended to be used prior to any whitening.
Warning
This is an irreversible operation. Original (non-bandpassed) strains will be lost.
- build_train_test_subsets(train_size: int | float)#
Generate random Train and Test subsets.
Only indices in the ‘labels’ attribute are considered independent waveforms; any extra key (layer) in the ‘strains’ dict is treated monolithically during the shuffle.
The strain values are just new views into the ‘strains’ attribute. The shuffling is performed by Scikit-Learn’s function ‘train_test_split’, with stratification enabled.
- Parameters:
- train_sizeint | float
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train subset. If int, represents the absolute number of train waves.
Ref: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- classes: dict[str, Any]#
- export_strains_to_gwf(path: str, channel: str, t0_gps: float = 0, verbose=False) None#
Export all strains to GWF format, one file per strain.
- find_class(id)#
Find which ‘class’ corresponds to the strain ‘id’.
Finds the ‘class’ of the strain represented by the unique identifier ‘id’.
- Parameters:
- idstr
Unique identifier of the strain, which also appears in the index of the ‘metadata’ DataFrame.
- Returns:
- clasint | str
Class key associated to the strain ‘id’.
- gen_injections(snr: int | float | list | tuple, randomize_noise: bool = False, random_seed: int | None = None, injections_per_snr: int = 1, verbose=False, **inject_kwargs)#
Inject all strains in simulated noise with the given SNR values.
The SNR is computed using a matched filter against the noise PSD.
If the strain is in geometrized units, it will be converted first to SI units, then injected, and converted back to geometrized units.
The automatic highpass filter before each injection is not applied anymore. It is now assumed that clean signals have been properly filtered beforehand.
If the method ‘whiten’ has been already called, all further injections will automatically be whitened with the same parameters, including the unpadding (if > 0).
- Parameters:
- snrint | float | list | tuple
SNR value(s) at which to perform the injections.
- randomize_noisebool
If True, the noise segment is randomly chosen before the injection. This can be used to avoid having the same noise injected for all clean strains. False by default.
Note
To avoid the possibility of repeating the same noise section in different injections, the noise realization must be reasonably large, e.g:
noise_length > n_clean_strains * self.max_length * len(snr)
- random_seedint, optional
Random seed for noise realization, used only if randomize_noise is True. By default, the random number generator (RNG) created during initialization is used.
Warning
Setting this parameter creates a new RNG, replacing the one initialized with the class. If this is unintended, do not provide this parameter. A warning will be issued when it is used.
- injections_per_snrint, optional
Number of injections per SNR value. Defaults to 1.
This is useful to minimize the statistical impact of the noise when performing injections at a sensitive (low) SNR.
- **inject_kwargs
Additional arguments passed to the _inject method.
- Raises:
- ValueError
Once injections have been performed at a given SNR value, injecting again at the same value is not allowed; attempting it will raise this exception.
Notes
If whitening is intended to be applied afterwards, it is useful to pad the signals beforehand, in order to avoid the window vignetting produced by the whitening itself.
New injections are stored in the ‘strains’ attribute.
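The SNR scaling at the core of this method can be sketched as follows. This is a minimal numpy illustration assuming the standard optimal (matched-filter) SNR definition; the function names are hypothetical, not part of gwadama.

```python
import numpy as np

def optimal_snr(strain, psd, fs):
    """Optimal matched-filter SNR: rho^2 = 4 df sum(|h(f)|^2 / S(f))."""
    hf = np.fft.rfft(strain) / fs       # approximate continuous Fourier transform
    df = fs / len(strain)               # frequency resolution
    return np.sqrt(4 * df * np.sum(np.abs(hf) ** 2 / psd))

def scale_to_snr(strain, psd, fs, target_snr):
    """Rescale the strain amplitude so its optimal SNR equals target_snr."""
    return strain * (target_snr / optimal_snr(strain, psd, fs))
```

Since the SNR is linear in the strain amplitude, a single rescaling suffices before adding the noise realization.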
- get_strain(*indices, normalize=False)#
Get a single strain from the complete index coordinates.
This is just a shortcut to avoid having to write several square brackets.
NOTE: The returned strain is not a copy; if its contents are modified, the changes will be reflected inside the ‘strains’ attribute.
- Parameters:
- *indicesstr | int
The indices of the strain to retrieve.
- normalizebool
If True, the returned strain will be normalized to its maximum amplitude.
- Returns:
- strainNDArray
The requested strain.
- get_strains_array(length: int | None = None) tuple[ndarray[Any, dtype[_ScalarType_co]], list]#
Get all strains stacked in a zero-padded Numpy 2d-array.
Stacks all signals into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
- Parameters:
- lengthint, optional
Target length of the ‘strains_array’. If None, the longest signal determines the length.
- Returns:
- strains_arrayNDArray
All strains stacked in a 2d-array.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘strains_array’.
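The zero-padded stacking described above amounts to the following sketch (illustrative only; the real method operates on the nested strains dictionary):

```python
import numpy as np

def stack_zero_padded(strains, length=None):
    # Stack variable-length 1D arrays into an (n, length) array,
    # zero-filling the tail of each row, and keep the original lengths.
    lengths = [len(s) for s in strains]
    if length is None:
        length = max(lengths)
    stacked = np.zeros((len(strains), length))
    for i, s in enumerate(strains):
        stacked[i, : len(s)] = s
    return stacked, lengths
```

Returning the original lengths alongside the array lets downstream code undo the padding when needed.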
- get_times(*indices) ndarray[Any, dtype[_ScalarType_co]]#
Get a single time array from the complete index coordinates.
If there is no time tracking (thus no stored times), a new time array is generated using self.fs and the length of the corresponding strain stored at the same index coordinates.
Warning
The returned array is not a copy; if its contents are modified, the changes will be reflected inside the ‘times’ attribute.
- get_xtest_array(length: int | None = None, classes: str | list = 'all', snr: int | list | str = 'all', with_metadata: bool = False)#
Get the test subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the test subset into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Allows the possibility to filter by class and SNR.
NOTE: Same signals injected at different SNR are stacked continuously.
- Parameters:
- lengthint, optional
Target length of the ‘test_array’. If None, the longest signal determines the length.
- classesstr | list[str]
Whitelist of classes to include in the stack. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the stack. If more than one is selected, they are stacked zipped as follows:
```
eos0 id0 snr0
eos0 id0 snr1
…
```
All injections are included by default.
- with_metadatabool
If True, the associated metadata is returned in addition to the test array in a Pandas DataFrame instance. This metadata is obtained from the original ‘metadata’ attribute, with the former index inserted as the first column, ‘id’, and with an additional column for the SNR values. False by default.
- Returns:
- test_arrayNDArray
Test subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘test_array’.
- metadatapd.DataFrame, optional
If ‘with_metadata’ is True, the associated metadata is returned with its entries in the same order as the ‘test_array’.
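The “zipped” ordering of IDs and SNR variants described above can be made explicit with plain Python (names are illustrative):

```python
# For each (class, id) pair, all selected SNR variants are contiguous
# in the stacked array, in the order the SNRs were selected.
ids = ["id0", "id1"]
snrs = [10, 20]
order = [(id_, snr) for id_ in ids for snr in snrs]
# i.e. id0@10, id0@20, id1@10, id1@20
```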
- get_xtrain_array(length: int | None = None, classes: str | list = 'all', snr: int | list | str = 'all', with_metadata: bool = False)#
Get the train subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the train subset into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Allows the possibility to filter by class and SNR.
NOTE: Same signals injected at different SNR are stacked continuously.
- Parameters:
- lengthint, optional
Target length of the ‘train_array’. If None, the longest signal determines the length.
- classesstr | list[str]
Whitelist of classes to include in the stack. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the stack. If more than one is selected, they are stacked zipped as follows:
```
eos0 id0 snr0
eos0 id0 snr1
…
```
All injections are included by default.
- with_metadatabool
If True, the associated metadata is returned in addition to the train array in a Pandas DataFrame instance. This metadata is obtained from the original ‘metadata’ attribute, with the former index inserted as the first column, ‘id’, and with an additional column for the SNR values. False by default.
- Returns:
- train_arrayNDArray
Train subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘train_array’.
- metadatapd.DataFrame, optional
If ‘with_metadata’ is True, the associated metadata is returned with its entries in the same order as the ‘train_array’.
- get_ytest_array(classes='all', snr='all', with_id=False, with_index=False)#
Get the filtered test labels.
- Parameters:
- classesstr | list[str] | ‘all’
Whitelist of classes to include in the labels. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the labels. All injections are included by default.
- with_idbool
If True, return also the related IDs. False by default.
- with_indexbool
If True, return also the related GLOBAL indices w.r.t. the stacked arrays returned by ‘get_xtest_array’ WITHOUT filters.
- Returns:
- NDArray
Filtered test labels.
- NDArray, optional
IDs associated to the filtered test labels.
- NDArray, optional
Indices associated to the filtered test labels.
- get_ytrain_array(classes='all', snr='all', with_id=False, with_index=False)#
Get the filtered training labels.
- Parameters:
- classesstr | list[str] | ‘all’
Whitelist of classes to include in the labels. All classes are included by default.
- snrint | list[int] | str
Whitelist of SNR injections to include in the labels. All injections are included by default.
- with_idbool
If True, return also the related IDs. False by default.
- with_indexbool
If True, return also the related GLOBAL indices w.r.t. the stacked arrays returned by ‘get_xtrain_array’ WITHOUT filters. False by default.
- Returns:
- NDArray
Filtered train labels.
- NDArray, optional
IDs associated to the filtered train labels.
- NDArray, optional
Indices associated to the filtered train labels.
- items()#
Return a new view of the dataset’s items with unrolled indices.
Each iteration yields a tuple containing all the nested keys in ‘self.strains’ along with the corresponding strain: (clas, id, *, strain).
It can be thought of as an extension of Python’s dict.items(). Useful to quickly iterate over all items in the dataset.
Example of usage with an arbitrary number of keys in the nested dictionary of strains:
```
for *keys, strain in self.items():
    print(f"Number of identifiers: {len(keys)}")
    print(f"Length of the strain: {len(strain)}")
    do_something(strain)
```
- keys(max_depth: int | None = None) list#
Return the unrolled combinations of all strain identifiers.
Return the unrolled combinations of all keys of the nested dictionary of strains by a hierarchical recursive search.
It can be thought of as the extended version of Python’s ‘dict().keys()’, although this returns a plain list.
- Parameters:
- max_depthint, optional
If specified, it is the number of layers to iterate to at most in the nested ‘strains’ dictionary.
- Returns:
- keyslist
The unrolled combination in a Python list.
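The hierarchical recursive unrolling can be sketched as follows. This is a simplified stand-in, not the library's actual implementation:

```python
def unroll_keys(d, max_depth=None, _depth=1):
    # Recursively collect key tuples from a nested dict of strains,
    # stopping either at a leaf (non-dict value) or at max_depth.
    keys = []
    for k, v in d.items():
        deeper = isinstance(v, dict) and (max_depth is None or _depth < max_depth)
        if deeper:
            keys += [(k, *rest) for rest in unroll_keys(v, max_depth, _depth + 1)]
        else:
            keys.append((k,))
    return keys
```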
- normalise(mode='amplitude', all_strains=False)#
Normalise strains.
Normalise strains to the indicated mode, and optionally to self.strains_original as well.
- Parameters:
- modestr, optional
Normalisation method. Available: amplitude, l2
- all_strainsbool, optional
If True, normalise also self.strains_original.
Notes
TODO: Generalise this method to BaseInjected for when all=True.
- pad_strains(padding: int | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | dict, window=None, logpad=True) None#
Pad strains with zeros on both sides.
This function pads each strain with a specific number of samples on both sides. It also updates the ‘max_length’ attribute to reflect the new maximum length of the padded strains.
- Parameters:
- paddingint | ArrayLike | dict
The padding to apply to each strain. If padding is an integer, it will be applied at both sides of all strains. If padding is a tuple, it must be of the form (left_pad, right_pad) in samples. If padding is a dictionary, it must be of the form {id: (left_pad, right_pad)}, where id is the identifier of each strain.
- windowstr | tuple | list | Callable, optional
Window to apply before padding the arrays. If str, tuple or list, it will be passed to scipy.signal.get_window. If Callable, it must take the strain before padding as argument, and return the windowed array. By default, no window is applied.
Added in version 0.4.0: This parameter was added in v0.4.0 to emphasize the potential need of windowing before padding strains to avoid spectral leakage.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
If time arrays are present, they are also padded accordingly.
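A minimal sketch of the per-strain operation, assuming `window` is given in its callable form (the method also accepts scipy.signal.get_window specifications):

```python
import numpy as np

def pad_strain(strain, left, right, window=None):
    # Optionally window the strain before padding (to avoid spectral
    # leakage), then zero-pad `left`/`right` samples on each side.
    if window is not None:
        strain = window(strain)  # callable form of the `window` parameter
    return np.pad(strain, (left, right))
```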
- pad_to_length(length: int, *, window=None, logpad=True) None#
Centre-pad all strains to a common target length.
Computes, for each strain, the number of samples to pad on the left and right so that its final length equals ‘length’. Once the per-strain padding dictionary is built, this method calls pad_strains().
- Parameters:
- lengthint
Target total length (in samples) for all strains after padding. Must be greater than or equal to the current length of every strain.
- windowstr | tuple | Callable, optional
Window to apply before padding, passed through to pad_strains(). See that method for details.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
- Raises:
- ValueError
If any existing strain length exceeds ‘length’. This method only pads; it does not truncate.
See also
pad_strains : Apply explicit per-strain left/right padding.
Notes
If time arrays are tracked (self._track_times is True), their padding is handled by pad_strains().
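The per-strain (left, right) amounts are presumably derived by centring each strain within the target length; a sketch (the odd-sample convention here is an assumption):

```python
def centre_pad_amounts(n, length):
    # Split the deficit between both sides; this method only pads,
    # so a strain longer than the target raises an error.
    if n > length:
        raise ValueError("strain longer than target length")
    left = (length - n) // 2
    return left, length - n - left
```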
- psd(frequencies: float | ndarray[Any, dtype[float64]]) ndarray[Any, dtype[float64]]#
Power spectral density (PSD) of the detector at given frequencies.
Interpolates the PSD at the given frequencies from its array representation. If the PSD was provided as an array during initialization, the interpolant is computed using SciPy’s quadratic spline interpolation.
- resample(fs, verbose=False) None#
Resample strain and time arrays to a constant rate.
This assumes time tracking, either through time arrays or through the sampling frequency provided during initialization, which will be used to generate the time arrays prior to the resampling.
This method updates the sampling frequency and the maximum length attributes.
- Parameters:
- fsint
The new sampling frequency in Hz.
- verbosebool
If True, print information about the resampling.
Warning
This method will generate time arrays if time tracking was not enabled. This can lead to inconsistent results when combined with padding-like operations followed by get_times(). In particular, if resampling is performed before padding, a time origin will be set automatically, and subsequent padding will preserve it. If resampling is performed after padding, however, get_times() will generate time arrays with origin at 0. Thus, the final time arrays may differ depending on the order of operations. This side effect is temporary and may be removed in a future release.
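As a rough sketch of what resampling to a new constant rate involves, here is a Fourier-domain version (truncating or zero-padding the spectrum, similar in spirit to scipy.signal.resample; whether the library uses this exact method internally is an assumption):

```python
import numpy as np

def fourier_resample(strain, fs_old, fs_new):
    # Resample by truncating (downsample) or zero-padding (upsample)
    # the one-sided spectrum, then inverting at the new length.
    n_old = len(strain)
    n_new = int(round(n_old * fs_new / fs_old))
    spectrum = np.fft.rfft(strain)
    out = np.zeros(n_new // 2 + 1, dtype=complex)
    m = min(len(spectrum), len(out))
    out[:m] = spectrum[:m]
    # Rescale so time-domain amplitudes are preserved.
    return np.fft.irfft(out, n=n_new) * (n_new / n_old)
```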
- shrink_strains(padding: int | tuple | dict, logpad=True) None#
Shrink strains by a specified padding.
Shrink strains (and their associated time arrays if present) by the specified padding, which is understood as negative.
It also updates the ‘max_length’ attribute, and the previous padding if present.
- Parameters:
- paddingint | tuple | dict
The pad to shrink from all strains. Values must be given in absolute value (positive int). If pad is an integer, symmetric shrinking is applied to all strains. If pad is a tuple, it must be of the form (pad_left, pad_right) in samples. If pad is a dictionary, it must be of the form {id: (pad_left, pad_right)}, where id is the identifier of each strain.
Note
If extra layers below ID are present, they will be shrunk using the same pad in cascade.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
This method shrinks strains_original as well.
- stack_by_id(id_list: list, length: int | None = None, snr_included: int | list[int] | str = 'all')#
Stack a subset of strains by ID into a zero-padded 2d-array.
This allows, for example, grouping strains by their original ID without leaking different injections (SNRs) of the same strain into different splits.
- Parameters:
- id_listarray-like
The IDs of the strains to be stacked.
- lengthint, optional
The target length of the stacked array. If None, the longest signal determines the length.
- snr_includedint | list[int] | str, optional
The SNR injections to include in the stack. If more than one is selected, they are stacked zipped as follows:
```
id0 snr0
id0 snr1
…
```
All injections are included by default.
- Returns:
- stacked_signalsNDArray
The array containing the stacked strains.
- lengthslist
The original lengths of each strain, following the same order as the first axis of ‘stacked_signals’.
- Raises:
- ValueError
If the value of ‘snr’ is not valid.
Notes
Unlike in ‘get_xtrain_array’ and ‘get_xtest_array’, this method does not filter by ‘classes’ since it would be redundant, as IDs are unique.
- whiten(*, flength: int, highpass: int | None = None, normed=False, shrink: int = 0, window: str | tuple = 'hann', verbose=False)#
Whiten injected strains.
Calling this method performs the whitening of all injected strains. Strains are then cut back to their original (pre-padding) size to remove the vignetting.
Warning
This is an irreversible action; if the original injections need to be preserved it is advised to make a copy of the instance before performing the whitening.
- Parameters:
- flengthint
Length (in samples) of the time-domain FIR whitening filter.
- highpassfloat, optional
Frequency cutoff.
- normedbool
Normalization applied after the whitening filter.
- shrinkint
Margin at each side of the strain to crop (for each strain ID), in order to avoid edge effects. The corrupted area at each side is 0.5 * flength, which corresponds to the number of samples it takes for the whitening filter to settle.
- windowstr | tuple, optional
Window to apply to the strain prior to FFT, ‘hann’ by default. See scipy.signal.get_window() for details on acceptable formats.
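Conceptually, whitening flattens the spectrum by dividing it by the noise ASD. The library applies a time-domain FIR filter of length flength (hence the corrupted edges noted above), but the core idea can be sketched in the frequency domain; the normalization convention here is purely illustrative:

```python
import numpy as np

def whiten_fd(strain, asd):
    # Divide the one-sided spectrum by the amplitude spectral density
    # so the whitened output has (approximately) flat spectral amplitude.
    spectrum = np.fft.rfft(strain)
    return np.fft.irfft(spectrum / asd, n=len(strain))
```

With a flat unit ASD this reduces to the identity, which makes the operation easy to sanity-check.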
- class gwadama.datasets.SyntheticWaves(*, classes: dict, n_waves_per_class: int, wave_parameters_limits: dict, max_length: int, peak_time_max_length: float, amp_threshold: float, tukey_alpha: float, fs: int, random_seed: int | None = None)[source]#
Bases:
Base
Class for building synthetically generated waveforms and background noise.
Part of the datasets for the CLAWDIA main paper.
The classes are hardcoded:
SG: Sine Gaussian,
G: Gaussian,
RD: Ring-Down.
- Attributes:
- classesdict
Dict of strings and their integer labels, one per class (category).
- strainsdict {class: {key: gw_strains} }
Strains stored as a nested dictionary, with each strain in an independent array to provide more flexibility with data of a wide range of lengths. The class key is the name of the class, a string which must exist in the ‘classes’ attribute. The ‘key’ is an identifier of each strain. In this case it’s just the global index ranging from 0 to ‘self.n_samples’.
- labelsNDArray[int]
Indices of the classes, one per waveform. Each one points its respective waveform inside ‘strains’ to its class in ‘classes’. The order is that of the index of ‘self.metadata’, and coincides with the order of the strains inside ‘self.strains’ if unrolled to a flat list of arrays.
- metadatapandas.DataFrame
All parameters and data related to the strains. The order is the same as inside ‘strains’ if unrolled to a flat list of strains.
- train_sizeint | float
If int, total number of samples to include in the train dataset. If float, fraction of the total samples to include in the train dataset. For more details see ‘sklearn.model_selection.train_test_split’ with the flag stratified=True.
- unitsstr
Flag indicating whether the data is in ‘geometrized’ or ‘IS’ units.
- Xtrain, Xtestdict {key: strain}
Train and test subsets randomly split using SKLearn train_test_split function with stratified labels. The key corresponds to the strain’s index at ‘self.metadata’.
- Ytrain, YtestNDArray[int]
1D Array containing the labels in the same order as ‘Xtrain’ and ‘Xtest’ respectively.
Methods
apply_window(window[, all])Apply a window to all strains.
bandpass(*, f_low, f_high, f_order[, verbose])Apply a forward-backward digital bandpass filter.
build_train_test_subsets(train_size)Generate a random Train and Test subsets.
find_class(id)Find which 'class' corresponds the strain 'id'.
get_strain(*indices[, normalize])Get a single strain from the complete index coordinates.
get_strains_array([length])Get all strains stacked in a zero-padded Numpy 2d-array.
get_times(*indices)Get a single time array from the complete index coordinates.
get_xtest_array([length, classes])Get the test subset stacked in a zero-padded Numpy 2d-array.
get_xtrain_array([length, classes])Get the train subset stacked in a zero-padded Numpy 2d-array.
get_ytest_array([classes, with_id, with_index])Get the filtered test labels.
get_ytrain_array([classes, with_id, with_index])Get the filtered training labels.
items()Return a new view of the dataset's items with unrolled indices.
keys([max_depth])Return the unrolled combinations of all strain identifiers.
normalise([mode, all_strains])Normalise strains.
pad_strains(padding[, window, logpad])Pad strains with zeros on both sides.
pad_to_length(length, *[, window, logpad])Centre-pad all strains to a common target length.
resample(fs[, verbose])Resample strain and time arrays to a constant rate.
shrink_strains(padding[, logpad])Shrink strains by a specified padding.
stack_by_id(id_list[, length])Stack a subset of strains by their ID into a Numpy array.
whiten(*, flength[, asd_array, highpass, ...])Whiten the strains.
- __init__(*, classes: dict, n_waves_per_class: int, wave_parameters_limits: dict, max_length: int, peak_time_max_length: float, amp_threshold: float, tukey_alpha: float, fs: int, random_seed: int | None = None)[source]#
- Parameters:
- n_waves_per_classint
Number of waves per class to produce.
- wave_parameters_limitsdict
Min/Max limits of the waveforms’ parameters, 9 in total. Keys:
mf0, Mf0: min/Max central frequency (SG and RD).
mQ, MQ: min/Max quality factor (SG and RD).
mhrss, Mhrss: min/Max sum squared amplitude of the wave.
mT, MT: min/Max duration (only G).
- max_lengthint
Maximum length of the waves. This parameter is used to generate the initial time array with which the waveforms are computed.
- peak_time_max_lengthfloat
Time of the peak of the envelope of the waves in the initial time array (built with ‘max_length’).
- amp_thresholdfloat
Fraction w.r.t. the maximum absolute amplitude of the wave envelope below which to end the wave by shrinking the array and applying a windowing to the edges.
- tukey_alphafloat
Alpha parameter (width) of the Tukey window applied to each wave to make sure their values end at the exact duration determined by either the duration parameter or the amplitude threshold.
- fsint
- random_seedint, optional.
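For reference, a common sine-Gaussian parameterization consistent with the (f0, Q) parameters listed above; whether SyntheticWaves uses exactly this functional form and amplitude normalization is an assumption:

```python
import numpy as np

def sine_gaussian(t, f0, Q, t0=0.0, amp=1.0):
    # Gaussian-enveloped sinusoid: the quality factor Q sets the
    # envelope width tau relative to the central frequency f0.
    tau = Q / (np.sqrt(2.0) * np.pi * f0)
    env = np.exp(-((t - t0) ** 2) / tau ** 2)
    return amp * env * np.sin(2.0 * np.pi * f0 * (t - t0))
```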
- apply_window(window, all=False)#
Apply a window to all strains.
Apply a window to self.strains recursively, and optionally to self.strains_original as well.
- Parameters:
- windowstr | tuple
Window to apply, formatted to be accepted by SciPy’s get_window.
- allbool, optional
If True, apply the window also to self.strains_original.
Notes
Since strains may have different lengths, a window is generated for each one.
TODO: Generalise this method to BaseInjected for when all=True.
- bandpass(*, f_low: int | float, f_high: int | float, f_order: int | float, verbose=False)#
Apply a forward-backward digital bandpass filter.
Apply a forward-backward digital bandpass filter to all clean strains between frequencies ‘f_low’ and ‘f_high’ with an order of ‘f_order’.
This method is intended to be used prior to any whitening.
Warning
This is an irreversible operation. Original (non-bandpassed) strains will be lost.
- build_train_test_subsets(train_size: int | float)#
Generate random train and test subsets.
Only indices in the ‘labels’ attribute are considered independent waveforms; any extra key (layer) in the ‘strains’ dict is treated monolithically during the shuffle.
The strain values are just new views into the ‘strains’ attribute. The shuffling is performed by Scikit-Learn’s function ‘train_test_split’, with stratification enabled.
- Parameters:
- train_sizeint | float
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train subset. If int, represents the absolute number of train waves.
Ref: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
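The stratification guarantee delegated to sklearn's train_test_split can be mimicked in pure Python to make it explicit (an illustrative sketch, not the library's code):

```python
import random
from collections import defaultdict

def stratified_split(ids, labels, train_size, seed=0):
    # Shuffle and split each class separately so class proportions
    # are preserved in both subsets (stratification).
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in zip(ids, labels):
        by_class[lab].append(i)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        k = round(train_size * len(members))
        train += members[:k]
        test += members[k:]
    return train, test
```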
- find_class(id)#
Find which ‘class’ corresponds to the strain ‘id’.
Finds the ‘class’ of the strain represented by the unique identifier ‘id’.
- Parameters:
- idstr
Unique identifier of the strain, which must also appear in the index of the metadata DataFrame.
- Returns:
- clasint | str
Class key associated to the strain ‘id’.
- get_strain(*indices, normalize=False) ndarray[Any, dtype[_ScalarType_co]]#
Get a single strain from the complete index coordinates.
This is just a shortcut to avoid having to write several square brackets.
NOTE: The returned strain is not a copy; if its contents are modified, the changes will be reflected inside the ‘strains’ attribute.
- Parameters:
- *indicesstr | int
The indices of the strain to retrieve.
- normalizebool
If True, the returned strain will be normalized to its maximum amplitude.
- Returns:
- strainNDArray
The requested strain.
- get_strains_array(length: int | None = None) tuple[ndarray[Any, dtype[_ScalarType_co]], list]#
Get all strains stacked in a zero-padded Numpy 2d-array.
Stacks all signals into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
- Parameters:
- lengthint, optional
Target length of the ‘strains_array’. If None, the longest signal determines the length.
- Returns:
- strains_arrayNDArray
All strains stacked in a 2d-array.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘strains_array’.
- get_times(*indices) ndarray[Any, dtype[float64]]#
Get a single time array from the complete index coordinates.
If there is no time tracking (thus no stored times), a new time array is generated using self.fs and the length of the corresponding strain stored at the same index coordinates.
Warning
The returned array is not a copy; if its contents are modified, the changes will be reflected inside the ‘times’ attribute.
- get_xtest_array(length=None, classes='all')#
Get the test subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the test subset into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Optionally, classes can be filtered by specifying which to include with the classes parameter.
- Parameters:
- lengthint, optional
- classesstr | List[str], optional
Specify which classes to include. Include ‘all’ by default.
- Returns:
- test_arrayNDArray
Test subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘test_array’.
- get_xtrain_array(length=None, classes='all')#
Get the train subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the train subset into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Optionally, classes can be filtered by specifying which to include with the classes parameter.
- Parameters:
- lengthint, optional
Target length of the ‘train_array’. If None, the longest signal determines the length.
- classesstr | List[str], optional
Specify which classes to include. Include ‘all’ by default.
- Returns:
- train_arrayNDArray
Train subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘train_array’.
- get_ytest_array(classes='all', with_id=False, with_index=False)#
Get the filtered test labels.
- Parameters:
- classesstr | list[str] | ‘all’
The classes to include in the labels. All classes are included by default.
- with_idbool
If True, return also the list of related IDs.
- with_indexbool
If True, return also the related GLOBAL indices; w.r.t. the stacked arrays returned by ‘get_xtest_array’ WITHOUT filters.
- Returns:
- NDArray
Filtered test labels.
- NDArray, optional
IDs associated to the filtered test labels.
- NDArray, optional
Indices associated to the filtered test labels.
- get_ytrain_array(classes='all', with_id=False, with_index=False)#
Get the filtered training labels.
- Parameters:
- classesstr | list[str] | ‘all’
The classes to include in the labels. All classes are included by default.
- with_idbool
If True, return also the list of related IDs.
- with_indexbool
If True, return also the related GLOBAL indices; w.r.t. the stacked arrays returned by ‘get_xtrain_array’ WITHOUT filters. False by default.
- Returns:
- NDArray
Filtered train labels.
- NDArray, optional
IDs associated to the filtered train labels.
- NDArray, optional
Indices associated to the filtered train labels.
- items()#
Return a new view of the dataset’s items with unrolled indices.
Each iteration yields a tuple containing all the nested keys in ‘self.strains’ along with the corresponding strain: (clas, id, *, strain).
It can be thought of as an extension of Python’s dict.items(). Useful to quickly iterate over all items in the dataset.
Example of usage with an arbitrary number of keys in the nested dictionary of strains:
```
for *keys, strain in self.items():
    print(f"Number of identifiers: {len(keys)}")
    print(f"Length of the strain: {len(strain)}")
    do_something(strain)
```
- keys(max_depth: int | None = None) list#
Return the unrolled combinations of all strain identifiers.
Return the unrolled combinations of all keys of the nested dictionary of strains by a hierarchical recursive search.
It can be thought of as the extended version of Python’s ‘dict().keys()’, although this returns a plain list.
- Parameters:
- max_depthint, optional
If specified, it is the number of layers to iterate to at most in the nested ‘strains’ dictionary.
- Returns:
- keyslist
The unrolled combination in a Python list.
- normalise(mode='amplitude', all_strains=False)#
Normalise strains.
Normalise strains to the indicated mode, and optionally to self.strains_original as well.
- Parameters:
- modestr, optional
Normalisation method. Available: amplitude, l2
- all_strainsbool, optional
If True, normalise also self.strains_original.
Notes
TODO: Generalise this method to BaseInjected for when all=True.
- pad_strains(padding: int | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | dict, window=None, logpad=True) None#
Pad strains with zeros on both sides.
This function pads each strain with a specific number of samples on both sides. It also updates the ‘max_length’ attribute to reflect the new maximum length of the padded strains.
- Parameters:
- paddingint | ArrayLike | dict
The padding to apply to each strain. If padding is an integer, it will be applied at both sides of all strains. If padding is a tuple, it must be of the form (left_pad, right_pad) in samples. If padding is a dictionary, it must be of the form {id: (left_pad, right_pad)}, where id is the identifier of each strain.
- windowstr | tuple | list | Callable, optional
Window to apply before padding the arrays. If str, tuple or list, it will be passed to scipy.signal.get_window. If Callable, it must take the strain before padding as argument, and return the windowed array. By default, no window is applied.
Added in version 0.4.0: This parameter was added in v0.4.0 to emphasize the potential need of windowing before padding strains to avoid spectral leakage.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
If time arrays are present, they are also padded accordingly.
- pad_to_length(length: int, *, window=None, logpad=True) None#
Centre-pad all strains to a common target length.
Computes, for each strain, the number of samples to pad on the left and right so that its final length equals ‘length’. Once the per-strain padding dictionary is built, this method calls pad_strains().
- Parameters:
- lengthint
Target total length (in samples) for all strains after padding. Must be greater than or equal to the current length of every strain.
- windowstr | tuple | Callable, optional
Window to apply before padding, passed through to pad_strains(). See that method for details.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
- Raises:
- ValueError
If any existing strain length exceeds ‘length’. This method only pads; it does not truncate.
See also
pad_strains : Apply explicit per-strain left/right padding.
Notes
If time arrays are tracked (self._track_times is True), their padding is handled by pad_strains().
- resample(fs, verbose=False) None#
Resample strain and time arrays to a constant rate.
This assumes time tracking, either through time arrays or through the sampling frequency provided during initialization, which will be used to generate the time arrays prior to the resampling.
This method updates the sampling frequency and the maximum length attributes.
- Parameters:
- fsint
The new sampling frequency in Hz.
- verbosebool
If True, print information about the resampling.
Warning
This method will generate time arrays if time tracking was not enabled. This can lead to inconsistent results when combined with padding-like operations followed by get_times(). In particular, if resampling is performed before padding, a time origin will be set automatically, and subsequent padding will preserve it. If resampling is performed after padding, however, get_times() will generate time arrays with origin at 0. Thus, the final time arrays may differ depending on the order of operations. This side effect is temporary and may be removed in a future release.
- shrink_strains(padding: int | tuple | dict, logpad=True) None#
Shrink strains by a specified padding.
Shrink strains (and their associated time arrays if present) by the specified padding, which is understood as negative.
It also updates the ‘max_length’ attribute, and the previous padding if present.
- Parameters:
- paddingint | tuple | dict
The pad to shrink from all strains. Values must be given in absolute value (positive int). If pad is an integer, symmetric shrinking is applied to all strains. If pad is a tuple, it must be of the form (pad_left, pad_right) in samples. If pad is a dictionary, it must be of the form {id: (pad_left, pad_right)}, where id is the identifier of each strain.
Note
If extra layers below ID are present, they will be shrunk using the same pad in cascade.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
This method shrinks strains_original as well.
- stack_by_id(id_list: list, length: int | None = None)#
Stack a subset of strains by their ID into a Numpy array.
Stack an arbitrary selection of strains by their original ID into a zero-padded 2d-array. The resulting order follows that of 'id_list'.
- Parameters:
- id_listlist
The IDs of the strains to be stacked.
- lengthint, optional
The target length of the stacked array. If None, the longest signal determines the length.
- Returns:
- stacked_signalsNDArray
The array containing the stacked strains.
- lengthslist
The original lengths of each strain, following the same order as the first axis of ‘stacked_signals’.
Notes
Unlike in ‘get_xtrain_array’ and ‘get_xtest_array’, this method does not filter by ‘classes’ since it would be redundant, as IDs are unique.
- whiten(*, flength: int, asd_array: ndarray[Any, dtype[_ScalarType_co]] | None = None, highpass: int | None = None, normed=False, shrink: int = 0, window: str | tuple = 'hann', verbose=False)#
Whiten the strains.
TODO
Calling this method performs the whitening of all strains.
If asd_array is None, the ASD will be estimated for each strain using SciPy’s Welch method with median average and the same parameters used for whitening.
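The per-strain ASD estimation and whitening described above can be sketched as follows. This is only a minimal illustration, not gwadama's implementation: the exact normalisation and the use of flength as the Welch segment length are assumptions.

```python
import numpy as np
from scipy.signal import welch

def whiten_sketch(strain, fs, flength):
    # Estimate the one-sided ASD with Welch's method, median-averaged
    # across segments (robust against loud transients).
    freqs, psd = welch(strain, fs=fs, nperseg=flength, average='median')
    asd = np.sqrt(psd)
    # Interpolate the ASD onto the rFFT frequency grid and divide,
    # with a tiny floor to avoid division by zero at unconstrained bins.
    n = len(strain)
    fft_freqs = np.fft.rfftfreq(n, d=1/fs)
    asd_interp = np.maximum(np.interp(fft_freqs, freqs, asd), 1e-40)
    white_fft = np.fft.rfft(strain) / asd_interp
    return np.fft.irfft(white_fft, n=n)
```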
- class gwadama.datasets.UnlabeledWaves(strains: ndarray[Any, dtype[_ScalarType_co]] | dict[int | str, ndarray[Any, dtype[_ScalarType_co]]], *, fs: int, strain_limits: ndarray[Any, dtype[_ScalarType_co]] | None = None, whitened=False, random_seed=None)[source]#
Bases: UnlabeledBaseMixin, Base
Dataset class for clean gravitational wave signals without labels.
This class extends Base, modifying its behavior to handle datasets where gravitational wave signals are provided without associated labels. Unlike Base, it does not require a classification structure but retains methods for loading, storing, and managing waveform data.
The dataset consists of nested dictionaries, storing each waveform in an independent array to accommodate variable lengths.
Notes
Unlike Base, this class does not track class labels.
Train/Test split is still supported but is not stratified.
- Attributes:
- strainsdict
Dictionary of stored waveforms, indexed by unique identifiers.
- max_lengthint
Length of the longest waveform in the dataset.
- fsint, optional
The constant sampling frequency for the waveforms, if provided.
- Xtrain, Xtestdict, optional
Train and test subsets randomly split using train_test_split, if required. These are views into strains, without associated labels.
Methods
apply_window(window[, all])Apply a window to all strains.
bandpass(*, f_low, f_high, f_order[, verbose])Apply a forward-backward digital bandpass filter.
build_train_test_subsets(train_size)Generate random Train and Test subsets.
find_class(id)Find which 'class' corresponds to the strain 'id'.
get_strain(*indices[, normalize])Get a single strain from the complete index coordinates.
get_strains_array([length])Get all strains stacked in a zero-padded Numpy 2d-array.
get_times(*indices)Get a single time array from the complete index coordinates.
get_xtest_array([length, classes])Get the test subset stacked in a zero-padded Numpy 2d-array.
get_xtrain_array([length, classes])Get the train subset stacked in a zero-padded Numpy 2d-array.
get_ytest_array([classes, with_id, with_index])Get the filtered test labels.
get_ytrain_array([classes, with_id, with_index])Get the filtered training labels.
items()Return a new view of the dataset's items with unrolled indices.
keys([max_depth])Return the unrolled combinations of all strain identifiers.
normalise([mode, all_strains])Normalise strains.
pad_strains(padding[, window, logpad])Pad strains with zeros on both sides.
pad_to_length(length, *[, window, logpad])Centre-pad all strains to a common target length.
resample(fs[, verbose])Resample strain and time arrays to a constant rate.
shrink_strains(padding[, logpad])Shrink strains by a specified padding.
stack_by_id(id_list[, length])Stack a subset of strains by their ID into a Numpy array.
whiten(*, flength[, asd_array, highpass, ...])Whiten the strains.
- CLASS_NAME = 'unique'#
- __init__(strains: ndarray[Any, dtype[_ScalarType_co]] | dict[int | str, ndarray[Any, dtype[_ScalarType_co]]], *, fs: int, strain_limits: ndarray[Any, dtype[_ScalarType_co]] | None = None, whitened=False, random_seed=None)[source]#
Initialize an UnlabeledWaves dataset.
This constructor processes a NumPy array of gravitational wave signals, storing them in a structured dictionary while optionally discarding unnecessary zero-padding. Unlike Base, this class does not support labeled categories nor requires metadata, but retains support for dataset splitting and signal management.
- Parameters:
- strainsNDArray | dict[int|str, NDArray]
Gravitational wave strains. If a 2d-array is given, each row must contain a single waveform, possibly zero-padded. If a dict is given, it should be formatted as {id: strain_array}.
- fsint
The assumed constant sampling frequency for the waveforms.
- strain_limitslist[tuple[int, int]] | None, optional
A list of (start, end) indices defining the valid range for each waveform in strains. If None, waveforms are assumed to contain no unnecessary padding.
- whitenedbool, optional
If True, it is assumed that signals in strains have already been whitened. This effectively changes some of the behaviour of the class when treating data internally.
- random_seedint, optional
Seed used to initialize the random number generator (RNG), as well as for calling sklearn.model_selection.train_test_split() to generate the Train and Test subsets.
Notes
A dummy class label (‘unique’: 1) is assigned for compatibility inside the strains dict.
Metadata is omitted in this class.
The dataset structure supports train/test splitting, but labels are ignored.
TODO: Implement optional explicit time arrays as argument for time varying sampling (and all corresponding checks).
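The trimming behaviour driven by strain_limits can be sketched as below. The helper name unpack_strains and its exact semantics are illustrative, not the class's internals.

```python
import numpy as np

def unpack_strains(strains_2d, strain_limits=None):
    """Turn a zero-padded 2d-array into an {id: 1d-array} dict,
    trimming each row to its (start, end) limits if given."""
    out = {}
    for i, row in enumerate(strains_2d):
        if strain_limits is not None:
            start, end = strain_limits[i]
            row = row[start:end]
        out[i] = np.asarray(row, dtype=float)
    return out
```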
- apply_window(window, all=False)#
Apply a window to all strains.
Apply a window to self.strains recursively, and optionally to self.strains_original as well.
- Parameters:
- windowstr | tuple
Window to apply, formatted to be accepted by SciPy’s get_window.
- allbool, optional
If True, apply the window also to self.strains_original.
Notes
Since strains may have different lengths, a window is generated for each one.
TODO: Generalise this method to BaseInjected for when all=True.
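Because strains may have different lengths, a window of the matching length must be generated per strain, as the Notes state. A minimal sketch using SciPy's get_window (the helper name is hypothetical):

```python
import numpy as np
from scipy.signal import get_window

def window_strains(strains, window='hann'):
    # Generate a window of the matching length for each strain
    # and apply it sample-wise.
    return {key: s * get_window(window, len(s))
            for key, s in strains.items()}
```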
- bandpass(*, f_low: int | float, f_high: int | float, f_order: int | float, verbose=False)#
Apply a forward-backward digital bandpass filter.
Apply a forward-backward digital bandpass filter to all clean strains between frequencies ‘f_low’ and ‘f_high’ with an order of ‘f_order’.
This method is intended to be used prior to any whitening.
Warning
This is an irreversible operation. Original (non-bandpassed) strains will be lost.
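The forward-backward bandpass can be sketched with SciPy as follows. A Butterworth design applied via second-order sections is assumed here; the docstring does not specify the filter family.

```python
from scipy.signal import butter, sosfiltfilt

def bandpass(strain, fs, f_low, f_high, f_order):
    # Zero-phase (forward-backward) bandpass between f_low and f_high,
    # using second-order sections for numerical stability.
    sos = butter(f_order, [f_low, f_high], btype='bandpass',
                 fs=fs, output='sos')
    return sosfiltfilt(sos, strain)
```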
- build_train_test_subsets(train_size: int | float)#
Generate random Train and Test subsets.
Only indices in the 'labels' attribute are considered independent waveforms; any extra key (layer) in the 'strains' dict is treated monolithically during the shuffle.
The strain values are just new views into the 'strains' attribute. The shuffling is performed by Scikit-Learn's 'train_test_split' function, with stratification enabled.
- Parameters:
- train_sizeint | float
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train subset. If int, represents the absolute number of train waves.
Ref: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
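Without stratification (as in UnlabeledWaves), the split amounts to a seeded shuffle of the strain IDs. A sketch mirroring sklearn's train_test_split semantics (the helper name is illustrative):

```python
import numpy as np

def split_ids(ids, train_size, random_seed=None):
    # Shuffle the unique strain IDs and split them into train/test.
    # A float train_size is a proportion; an int is an absolute count.
    rng = np.random.default_rng(random_seed)
    ids = list(ids)
    if isinstance(train_size, int):
        n_train = train_size
    else:
        n_train = round(train_size * len(ids))
    order = rng.permutation(len(ids))
    train = [ids[i] for i in order[:n_train]]
    test = [ids[i] for i in order[n_train:]]
    return train, test
```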
- classes: dict[str, Any]#
- find_class(id)#
Finds which ‘class’ corresponds to the strain ‘id’.
Finds the ‘class’ of the strain represented by the unique identifier ‘id’.
- Parameters:
- idstr
Unique identifier of the strain, which also appears as an index in the metadata DataFrame.
- Returns:
- clasint | str
Class key associated to the strain ‘id’.
- get_strain(*indices, normalize=False)#
Get a single strain from the complete index coordinates.
This is just a shortcut to avoid having to write several squared brackets.
NOTE: The returned strain is not a copy; if its contents are modified, the changes will be reflected inside the ‘strains’ attribute.
- Parameters:
- *indicesstr | int
The indices of the strain to retrieve.
- normalizebool
If True, the returned strain will be normalized to its maximum amplitude.
- Returns:
- strainNDArray
The requested strain.
- get_strains_array(length: int | None = None) tuple[ndarray[Any, dtype[_ScalarType_co]], list]#
Get all strains stacked in a zero-padded Numpy 2d-array.
Stacks all signals into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the dataset. The remaining space is zeroed.
- Parameters:
- lengthint, optional
Target length of the ‘strains_array’. If None, the longest signal determines the length.
- Returns:
- strains_arrayNDArray
The stacked strains.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘strains_array’.
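The zero-padded stacking shared by get_strains_array, get_xtrain_array and stack_by_id can be sketched as follows (an illustrative helper, not the library's code):

```python
import numpy as np

def stack_strains(strains, length=None):
    # Stack variable-length 1d strains into a zero-padded 2d array,
    # returning the original lengths as well.
    lengths = [len(s) for s in strains.values()]
    if length is None:
        length = max(lengths)
    stacked = np.zeros((len(strains), length))
    for row, s in zip(stacked, strains.values()):
        row[:len(s)] = s
    return stacked, lengths
```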
- get_times(*indices) ndarray[Any, dtype[_ScalarType_co]]#
Get a single time array from the complete index coordinates.
If there is no time tracking (thus no stored times), a new time array is generated using self.fs and the length of the corresponding strain stored at the same index coordinates.
Warning
The returned array is not a copy; if its contents are modified, the changes will be reflected inside the ‘times’ attribute.
- get_xtest_array(length=None, classes='all')#
Get the test subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the test subset into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Optionally, classes can be filtered by specifying which to include with the classes parameter.
- Parameters:
- lengthint, optional
Target length of the ‘test_array’. If None, the longest signal determines the length.
- classesstr | List[str], optional
Specify which classes to include. All classes (‘all’) are included by default.
- Returns:
- test_arrayNDArray
test subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘test_array’.
- get_xtrain_array(length=None, classes='all')#
Get the train subset stacked in a zero-padded Numpy 2d-array.
Stacks all signals in the train subset into a homogeneous numpy array whose length (axis=1) is determined by either ‘length’ or, if None, by the longest strain in the subset. The remaining space is zeroed.
Optionally, classes can be filtered by specifying which to include with the classes parameter.
- Parameters:
- lengthint, optional
Target length of the ‘train_array’. If None, the longest signal determines the length.
- classesstr | List[str], optional
Specify which classes to include. All classes (‘all’) are included by default.
- Returns:
- train_arrayNDArray
train subset.
- lengthslist
Original length of each strain, following the same order as the first axis of ‘train_array’.
- get_ytest_array(classes='all', with_id=False, with_index=False)#
Get the filtered test labels.
- Parameters:
- classesstr | list[str] | ‘all’
The classes to include in the labels. All classes are included by default.
- with_idbool
If True, return also the list of related IDs.
- with_indexbool
If True, return also the related GLOBAL indices; w.r.t. the stacked arrays returned by ‘get_xtest_array’ WITHOUT filters.
- Returns:
- NDArray
Filtered test labels.
- NDArray, optional
IDs associated to the filtered test labels.
- NDArray, optional
Indices associated to the filtered test labels.
- get_ytrain_array(classes='all', with_id=False, with_index=False)#
Get the filtered training labels.
- Parameters:
- classesstr | list[str] | ‘all’
The classes to include in the labels. All classes are included by default.
- with_idbool
If True, return also the list of related IDs.
- with_indexbool
If True, return also the related GLOBAL indices; w.r.t. the stacked arrays returned by ‘get_xtrain_array’ WITHOUT filters. False by default.
- Returns:
- NDArray
Filtered train labels.
- NDArray, optional
IDs associated to the filtered train labels.
- NDArray, optional
Indices associated to the filtered train labels.
- items()#
Return a new view of the dataset’s items with unrolled indices.
Each iteration yields a tuple containing all the nested keys in ‘self.strains’ along with the corresponding strain: (clas, id, *, strain).
It can be thought of as an extension of Python’s dict.items(). Useful to quickly iterate over all items in the dataset.
Example of usage with an arbitrary number of keys in the nested dictionary of strains:
```
for *keys, strain in self.items():
    print(f"Number of identifiers: {len(keys)}")
    print(f"Length of the strain: {len(strain)}")
    do_something(strain)
```
- keys(max_depth: int | None = None) list#
Return the unrolled combinations of all strain identifiers.
Return the unrolled combinations of all keys of the nested dictionary of strains by a hierarchical recursive search.
It can be thought of as the extended version of Python’s ‘dict().keys()’, although this returns a plain list.
- Parameters:
- max_depthint, optional
If specified, the maximum number of layers to iterate over in the nested ‘strains’ dictionary.
- Returns:
- keyslist
The unrolled combination in a Python list.
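The hierarchical recursive search over the nested strains dictionary can be sketched as below (unroll_keys is an illustrative name, not the library's implementation):

```python
def unroll_keys(nested, max_depth=None, _depth=1):
    # Recursively collect all key combinations of a nested dict,
    # stopping at max_depth layers if given.
    keys = []
    for k, v in nested.items():
        if isinstance(v, dict) and (max_depth is None or _depth < max_depth):
            keys += [(k, *rest) for rest in unroll_keys(v, max_depth, _depth + 1)]
        else:
            keys.append((k,))
    return keys
```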
- normalise(mode='amplitude', all_strains=False)#
Normalise strains.
Normalise strains according to the indicated mode, optionally including self.strains_original.
- Parameters:
- modestr, optional
Normalisation method. Available: amplitude, l2
- all_strainsbool, optional
If True, normalise also self.strains_original.
Notes
TODO: Generalise this method to BaseInjected for when all=True.
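The two documented modes can be sketched as follows (a minimal illustration of the per-strain operation):

```python
import numpy as np

def normalise(strain, mode='amplitude'):
    # 'amplitude': divide by the maximum absolute value;
    # 'l2': divide by the Euclidean norm.
    if mode == 'amplitude':
        return strain / np.max(np.abs(strain))
    if mode == 'l2':
        return strain / np.linalg.norm(strain)
    raise ValueError(f"unknown mode: {mode}")
```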
- pad_strains(padding: int | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | dict, window=None, logpad=True) None#
Pad strains with zeros on both sides.
This function pads each strain with a specific number of samples on both sides. It also updates the ‘max_length’ attribute to reflect the new maximum length of the padded strains.
- Parameters:
- paddingint | ArrayLike | dict
The padding to apply to each strain. If padding is an integer, it will be applied at both sides of all strains. If padding is a tuple, it must be of the form (left_pad, right_pad) in samples. If padding is a dictionary, it must be of the form {id: (left_pad, right_pad)}, where id is the identifier of each strain.
- windowstr | tuple | list | Callable, optional
Window to apply before padding the arrays. If str, tuple or list, it will be passed to scipy.signal.get_window(window). If Callable, it must take the strain before padding as argument and return the windowed array. By default, no window is applied.
Added in version 0.4.0: This parameter emphasizes the potential need for windowing before padding strains to avoid spectral leakage.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
If time arrays are present, they are also padded accordingly.
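The int and tuple forms of the padding parameter map onto numpy.pad as sketched below; the dict form simply looks up a (left, right) pair per strain ID. The helper name is illustrative.

```python
import numpy as np

def pad_strain(strain, padding):
    # Accept a single int (symmetric padding) or a (left, right)
    # tuple, mirroring the `padding` parameter of pad_strains().
    if isinstance(padding, int):
        left = right = padding
    else:
        left, right = padding
    return np.pad(strain, (left, right))
```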
- pad_to_length(length: int, *, window=None, logpad=True) None#
Centre-pad all strains to a common target length.
Computes, for each strain, the number of samples to pad on the left and right so that its final length equals 'length'. Once the per-strain padding dictionary is built, this method calls pad_strains().
- Parameters:
- lengthint
Target total length (in samples) for all strains after padding. Must be greater than or equal to the current length of every strain.
- windowstr | tuple | Callable, optional
Window to apply before padding, passed through to pad_strains(). See that method for details.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
- Raises:
- ValueError
If any existing strain length exceeds 'length'. This method only pads; it does not truncate.
See also
pad_strains : Apply explicit per-strain left/right padding.
Notes
If time arrays are tracked (self._track_times is True), their padding is handled by pad_strains().
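The per-strain padding computation can be sketched as below. Splitting any odd deficit to the right is an assumption of this sketch, not necessarily the library's convention.

```python
def centre_padding(current_length, target_length):
    # Split the padding deficit as evenly as possible between the
    # left and right sides; any odd sample goes to the right.
    if current_length > target_length:
        raise ValueError("strain longer than target length")
    deficit = target_length - current_length
    left = deficit // 2
    return left, deficit - left
```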
- resample(fs, verbose=False) None#
Resample strain and time arrays to a constant rate.
This assumes time tracking, either through stored time arrays or through the sampling frequency provided during initialization, which is used to generate the time arrays prior to resampling.
This method updates the sampling frequency and the maximum length attributes.
- Parameters:
- fsint
The new sampling frequency in Hz.
- verbosebool
If True, print information about the resampling.
Warning
This method will generate time arrays if time tracking was not enabled. This can lead to inconsistent results when combined with padding-like operations followed by get_times(). In particular, if resampling is performed before padding, a time origin will be set automatically, and subsequent padding will preserve it. If resampling is performed after padding, however, get_times() will generate time arrays with origin at 0. Thus, the final time arrays may differ depending on the order of operations. This side effect is temporary and may be removed in a future release.
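One common way to resample to a new constant rate is polyphase filtering by the rational ratio of the two sampling frequencies; the library's actual method is not specified here, so this is only an illustrative sketch.

```python
from fractions import Fraction
from scipy.signal import resample_poly

def resample_strain(strain, fs_old, fs_new):
    # Resample by the rational ratio fs_new/fs_old using a
    # polyphase anti-aliasing filter.
    ratio = Fraction(fs_new, fs_old)
    return resample_poly(strain, ratio.numerator, ratio.denominator)
```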
- shrink_strains(padding: int | tuple | dict, logpad=True) None#
Shrink strains by a specified padding.
Shrink strains (and their associated time arrays if present) by the specified padding, which is understood as negative.
It also updates the ‘max_length’ attribute, and the previous padding if present.
- Parameters:
- paddingint | tuple | dict
The amount of padding to remove from each strain. Values must be given in absolute value (positive int). If pad is an integer, symmetric shrinking is applied to all samples. If pad is a tuple, it must be of the form (pad_left, pad_right) in samples. If pad is a dictionary, it must be of the form {id: (pad_left, pad_right)}, where id is the identifier of each strain.
Note
If extra layers below ID are present, they will be shrunk using the same pad in cascade.
- logpadbool, default=True
If False, the changes will not be reflected in the self.padding attribute.
Notes
This method shrinks strains_original as well.
- stack_by_id(id_list: list, length: int | None = None)#
Stack a subset of strains by their ID into a Numpy array.
Stack an arbitrary selection of strains by their original ID into a zero-padded 2d-array. The resulting order follows that of 'id_list'.
- Parameters:
- id_listlist
The IDs of the strains to be stacked.
- lengthint, optional
The target length of the stacked array. If None, the longest signal determines the length.
- Returns:
- stacked_signalsNDArray
The array containing the stacked strains.
- lengthslist
The original lengths of each strain, following the same order as the first axis of ‘stacked_signals’.
Notes
Unlike in ‘get_xtrain_array’ and ‘get_xtest_array’, this method does not filter by ‘classes’ since it would be redundant, as IDs are unique.
- whiten(*, flength: int, asd_array: ndarray[Any, dtype[_ScalarType_co]] | None = None, highpass: int | None = None, normed=False, shrink: int = 0, window: str | tuple = 'hann', verbose=False)#
Whiten the strains.
TODO
Calling this method performs the whitening of all strains.
If asd_array is None, the ASD will be estimated for each strain using SciPy’s Welch method with median average and the same parameters used for whitening.