clawdia.dictionaries#

Main module for managing all SDL models.

This module serves as the central interface for handling the dictionary models included in the CLAWDIA pipeline. It provides classes and functions to load, save, and manage the different types of dictionary models used in Sparse Dictionary Learning (SDL). Support is included for both SPAMS-based dictionaries and Low-Rank Shared Dictionary Learning (LRSDL) models, ensuring compatibility and ease of use.
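For instance, both model classes documented below can be imported directly from the module:

from clawdia.dictionaries import DictionaryLRSDL, DictionarySpams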

class clawdia.dictionaries.DictionaryLRSDL(lambd=0.01, lambd2=0.01, eta=0.0001, k=10, k0=5, updateX_iters=100, updateD_iters=100)[source]#

Bases: LRSDL

Interface for the Low-Rank Shared Dictionary Learning class.

Notes

The authors of Dictol did not provide a seed parameter for the random initialization of the dictionary. If reproducibility is important, NumPy's global seed must be set before calling LRSDL.__init__().
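A minimal sketch of this workaround (constructor arguments as documented below):

import numpy as np
from clawdia.dictionaries import DictionaryLRSDL

np.random.seed(0)  # Dictol's LRSDL has no seed argument, so fix NumPy's global seed
dico = DictionaryLRSDL(lambd=0.01, lambd2=0.01, eta=0.0001, k=10, k0=5)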

References

[1]

Vu, T. H.; Monga, V. (2017). Fast low-rank shared dictionary learning for image classification, IEEE Transactions on Image Processing, 26(11), 5160–5175. (https://doi.org/10.1109/TIP.2017.2729885)

Attributes:
t_train : float

Training time in seconds.

lambd : float

See self.__init__() for details.

lambd2 : float

See self.__init__() for details.

eta : float

See self.__init__() for details.

D : ndarray

Class-specific dictionary.

X : ndarray

Class-specific coefficient vector of the training set given when calling self.fit().

Y : ndarray

Class-specific target vector (the training set) given when calling self.fit().

k : int

See self.__init__() for details.

k0 : int

See self.__init__() for details.

updateX_iters : int

See self.__init__() for details.

updateD_iters : int

See self.__init__() for details.

D_range : list[int]

Auxiliary list containing the range of indices of each class in D.

D0 : ndarray

Shared dictionary.

Y_range : list[int]

Auxiliary list containing the range of indices of each class in Y. Derived directly from ‘train_label’, equivalent to the ‘y_true’ labels. Example: given train_label = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1], then Y_range = [0, 4, 6]. The first value is always 0, marking the start of the first class, and the list always has one more entry than the number of classes.

X0 : ndarray

Shared coefficient vector of the training set given when calling self.fit().

Methods

fit(X, *, y_true, l_atoms, iterations[, ...])

Train the LRSDL dictionary.

predict(X, *[, threshold, offset, with_losses])

Predict the class of each window in X.

save(file)

Save the dictionary to a file.

evaluate

loss

__init__(lambd=0.01, lambd2=0.01, eta=0.0001, k=10, k0=5, updateX_iters=100, updateD_iters=100)[source]#

Initialize the LRSDL dictionary.

This method sets up the parameters required for training class-specific and shared dictionaries. These dictionaries are used to represent data with sparsity and low-rank properties, which can be regularized by the parameters defined below.

Parameters:
lambd : float

Regularization parameter for the sparsity term:

\[\lambda \|X\|_1\]

This encourages sparsity in the coefficient matrix \(X\), analogous to the LASSO regularization term.

lambd2 : float

Regularization parameter for the shared-coefficient consistency term:

\[\frac{\lambda_2}{2} \|X^0 - M^0\|^2\]

This keeps each shared coefficient vector \(X^0\) (used to select shared atoms) close to the mean shared vector \(M^0\), ensuring consistency across all \(X^0\).

eta : float

Regularization parameter for the low-rank term:

\[\eta \| D^0 \|_*\]

Here, \(\|\cdot\|_*\) is the nuclear norm, which enforces the shared dictionary to have low rank.

k : int

Number of class-specific atoms for each class. The total number of atoms in the class-specific dictionary is given by \(k \times C\), where \(C\) is the number of classes.

k0 : int

Total number of shared atoms. A value of \(k_0 = 0\) indicates that no shared dictionary is used.

updateX_iters, updateD_iters : int

These parameters are passed to the parent class LRSDL.__init__(). However, they are suspected to be dummy parameters because no usage of them could be found in the original implementation. They are retained here for compatibility but appear to have no functional effect in this class.

Warning

The updateX_iters and updateD_iters parameters are inherited from the parent class LRSDL, but they appear to be unused in this implementation. Consider verifying their relevance before relying on them.

Notes

  • The parameters lambd, lambd2, and eta control the sparsity and low-rank properties of the dictionaries.

  • Setting k0 = 0 disables the shared dictionary.
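For example, a configuration without a shared dictionary (values are illustrative, using only the parameters documented above) could be set up as:

dico = DictionaryLRSDL(lambd=0.01, lambd2=0.01, eta=0.0001, k=12, k0=0)  # k0=0 disables the shared dictionary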

evaluate(data, label)#
fit(X, *, y_true, l_atoms, iterations, step=None, threshold=0, random_seed=None, verbose=False, show_after=5)[source]#

Train the LRSDL dictionary.

This method trains the dictionary using the provided data and allows for several configuration options:

  • Split the input data X into sliding windows of length l_atoms.

  • Use the entire input as a single window.

  • Discard training windows whose L2-norm is below a specified threshold.

The splitting behavior depends on the step parameter. If step is None, the entire input is treated as a single window. Otherwise, overlapping patches of size l_atoms are created with the specified step size.

Parameters:
X : ndarray of shape (n_samples, n_features)

Training samples. The number of features must be equal to or greater than the dictionary’s atom size.

y_true : ndarray of shape (n_samples,)

Labels corresponding to the samples in X. The length of y_true must equal the number of samples in X.

l_atoms : int

Length of the dictionary’s atoms.

iterations : int

Number of training iterations.

step : int, optional

The step size used to split input samples into patches of length l_atoms. If not specified, it defaults to step = l_atoms, so that all the information available in X is extracted without any repetition (overlap).

threshold : float, optional

L2-norm threshold, relative to the maximum L2-norm within each strain. Training windows whose relative L2-norm falls below this value are discarded. Default is 0 (only all-zero windows are discarded).

random_seed : int, optional

Random seed for reproducibility. Default is None.

verbose : bool, optional

If True, print verbose output during training. Default is False.

show_after : int, optional

If verbose is True, progress will be displayed every show_after iterations. Default is 5.

Returns:
None

The method trains the dictionary in place.

Notes

Per-class sufficiency is checked on the effective number of training windows (after patching and thresholding), not on the raw number of input strains per class.
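As an illustration of the windowing options above (array shapes and values are hypothetical; the signature is the one documented here):

import numpy as np

X = np.random.randn(40, 1024)     # 40 strains of 1024 samples each
y_true = np.repeat([0, 1], 20)    # two classes, 20 strains per class
dico.fit(X, y_true=y_true, l_atoms=256, iterations=30,
         step=128, threshold=0.1, random_seed=0, verbose=True)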

loss()#
predict(X, *, threshold=0, offset=0, with_losses=False)[source]#

Predict the class of each window in X.

The class of a window is the class of the closest codeword to that window in the dictionary.

Parameters:
X : 2d-array, shape=(n_signals, n_samples)

Input signals, each with at least as many samples as the dictionary’s atoms.

threshold : float, optional

Loss threshold above which signals are marked as the “unknown” class, corresponding to the label value -1. Default is 0, in which case all signals are classified.

offset : int, optional

Starting index i0 at which to crop the input signals X; the end index is i1 = offset + l_atoms. Default is 0.

with_losses : bool, optional

If True, return a tuple with the class predictions and the corresponding losses.

Returns:
y_pred : 1d-array, shape=(n_signals,)

Class predictions for each input signal.

losses : 1d-array, shape=(n_signals,), optional

Losses of the closest codewords to each input signal. Only returned if with_losses=True.
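A minimal usage sketch (X_test is a hypothetical 2d-array of test signals):

y_pred, losses = dico.predict(X_test, threshold=0.5, with_losses=True)
unknown = y_pred == -1   # boolean mask of signals whose loss exceeded the threshold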

save(file: str) None[source]#

Save the dictionary to a file.

Save the dictionary attributes using NumPy’s ‘np.savez()’.

Parameters:
file : str

Path of the file where the dictionary will be saved.
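For instance (the file name is hypothetical):

dico.save("lrsdl_model.npz")   # attributes are stored with np.savez()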

class clawdia.dictionaries.DictionarySpams(dict_init=None, model=None, signal_pool=None, a_length=None, d_size=None, wave_pos=None, patch_min=1, l2_normed=True, allow_allzeros=False, random_state=None, ignore_completeness=False, lambda1=None, batch_size=64, n_iter=None, n_train=None, trained=False, mode_traindl=0, modeD_traindl=0, mode_lasso=2, identifier='')[source]#

Bases: object

Sparse Dictionary Learning (SDL) model for waveform denoising via SPAMS.

This class provides an object-oriented implementation of a Sparse Dictionary Learning model, designed for the denoising and reconstruction of waveforms. At its core, it utilizes the trainDL function for dictionary learning and the lasso function for sparse coding from the SPAMS-python library [1].

It extends these core functionalities to arbitrarily long signals and minibatch processing for large datasets. Additionally, the class includes various utilities for signal preprocessing, composite models of denoising (such as iterative reconstruction), and the ability to easily save and load the dictionary’s state.

References

[1]

SPAMS (for Python), http://spams-devel.gforge.inria.fr/. Last accessed October 2018.

Attributes:
dict_init : ndarray

Atoms of the initial dictionary. Remains unaltered after training.

components : ndarray

Atoms of the current (trained) dictionary.

model : tuple

SPAMS’ trainDL model components in the form (A, B, iter).

d_size : int

Number of atoms in the dictionary (dictionary size).

a_length : int

Length of each atom in the dictionary (patch size).

lambda1 : float

Regularization parameter for training the dictionary.

batch_size : int

Batch size used in mini-batch training.

n_iter : int

Number of iterations performed during training.

t_train : float

Total training time in seconds.

trained : bool

Indicates whether the dictionary has been trained.

n_train : int

Number of patches used during training.

mode_traindl : int

Training mode for SPAMS’ trainDL function.

modeD_traindl : int

Dictionary mode for SPAMS’ trainDL function.

mode_lasso : int

Mode for SPAMS’ lasso function.

identifier : str

Optional identifier or note for distinguishing the dictionary.

Methods

copy()

Return a copy of the dictionary.

reconstruct(signal, sc_lambda[, step, ...])

Reconstruct a signal as a sparse combination of dictionary atoms.

reconstruct_batch(signals, sc_lambda[, out, ...])

TODO

reconstruct_iterative(signals[, sc_lambda, ...])

Reconstruct multiple signals using iterative residual subtraction.

reconstruct_loss_optimised(strain, *, reference)

Find the best reconstruction of a signal w.r.t. a reference.

reconstruct_margin_constrained(signal, *, ...)

TODO

reconstruct_minibatch(signals, *, sc_lambda)

TODO

reset()

Reset the dictionary to its initial (untrained) state.

save(file)

Save the current state of the DictionarySpams object to a file.

train(patches[, lambda1, n_iter, ...])

Train the dictionary.

__init__(dict_init=None, model=None, signal_pool=None, a_length=None, d_size=None, wave_pos=None, patch_min=1, l2_normed=True, allow_allzeros=False, random_state=None, ignore_completeness=False, lambda1=None, batch_size=64, n_iter=None, n_train=None, trained=False, mode_traindl=0, modeD_traindl=0, mode_lasso=2, identifier='')[source]#

Initialize the dictionary.

There are two ways to initialize the dictionary:

  1. By directly providing the initial dictionary with dict_init.

  2. By providing a collection of signals (signal_pool) from which atoms are randomly extracted to form the initial dictionary.

If the second option is used, a_length and d_size must be explicitly specified to define the size of the dictionary. Additional optional parameters provide more control over this process.

Parameters:
dict_init : ndarray of shape (d_size, a_length), optional

Atoms of the initial dictionary. If None, signal_pool must be provided.

model : dict, optional

SPAMS’ trainDL model components as a dictionary with elements {A, B, iter}. Must be provided if continuing training from a previous state.

signal_pool : ndarray of shape (n_signals, n_samples), optional

A collection of signals from which atoms are extracted to form the initial dictionary. Ignored if dict_init is provided.

a_length : int, optional

Length of each atom in the dictionary (patch size). Required if signal_pool is provided.

d_size : int, optional

Number of atoms in the dictionary. Required if signal_pool is provided.

wave_pos : array-like of shape (n_signals, 2), optional

Positions of waveforms within signal_pool to extract atoms from. If None, the entire array is used.

patch_min : int, default=1

Minimum number of samples for each extracted patch. Ignored if wave_pos is None.

l2_normed : bool, default=True

If True, normalize extracted atoms to their L2 norm.

allow_allzeros : bool, default=False

By default, random atoms with all zeros are excluded from the initial dictionary. If allow_allzeros=True, they are allowed.

random_state : int, optional

Seed for random sampling from signal_pool.

ignore_completeness : bool, default=False

If False, the dictionary must be overcomplete (d_size > a_length).

lambda1 : float, optional

Regularization parameter for training.

batch_size : int, default=64

Batch size used during training.

n_iter : int, optional

Total number of iterations for training. If None, this must be set when calling the train method.

n_train : int, optional

Number of patches used for training. Informational only.

trained : bool, default=False

Indicates whether the dictionary is already trained.

mode_traindl : int, default=0

Training mode for SPAMS’ trainDL function. See SPAMS documentation.

modeD_traindl : int, default=0

Dictionary mode for SPAMS’ trainDL function. See SPAMS documentation.

mode_lasso : int, default=2

Mode for SPAMS’ lasso function. See SPAMS documentation.

identifier : str, optional

A note or label for identifying the dictionary.

Notes

This method initializes the dictionary but does not train it. Use the train method for training.
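A sketch of the two initialization routes described above (array shapes and values are illustrative, not prescriptive):

import numpy as np
from clawdia.dictionaries import DictionarySpams

# 1) Directly from an initial dictionary of 128 atoms of length 64
dico = DictionarySpams(dict_init=np.random.randn(128, 64), lambda1=0.1)

# 2) From a pool of signals, extracting 128 random atoms of length 64
pool = np.random.randn(200, 4096)
dico = DictionarySpams(signal_pool=pool, a_length=64, d_size=128,
                       lambda1=0.1, random_state=0)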

copy()[source]#

Return a copy of the dictionary.

Returns a new instance of the same dictionary with the same values and state.

Returns:
dico_copy : DictionarySpams

A copy of the current dictionary.

reconstruct(signal, sc_lambda, step=1, normed=True, with_code=False, **kwargs)[source]#

Reconstruct a signal as a sparse combination of dictionary atoms.

Parameters:
signal : ndarray

Sample to be reconstructed.

sc_lambda : float

Regularization parameter of the sparse coding transformation.

step : int, default=1

Sample interval between consecutive patches extracted from signal; determines the number of patches extracted.

normed : bool, default=True

Normalize the result to the maximum absolute value.

with_code : bool, default=False

If True, also returns the coefficients array.

**kwargs

Passed directly to the external learning function.

Returns:
signal_rec : ndarray

Reconstructed signal.

code : array of shape (a_length, d_size), optional

Transformed data, encoded as a sparse combination of atoms. Returned when ‘with_code’ is True.
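A minimal usage sketch (signal is a hypothetical 1d array; the regularization value is illustrative):

signal_rec, code = dico.reconstruct(signal, sc_lambda=0.1, step=4, with_code=True)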

reconstruct_batch(signals, sc_lambda, out=None, step=1, normed=True, verbose=True, **kwargs)[source]#

TODO

Reconstruct multiple signals, each one as a sparse combination of dictionary atoms.

WARNING: Only viable for a small ‘signals’ set; it is very memory-expensive (all patches are stored in a single in-memory array).

WARNING: ‘out’ is deprecated; it is kept for backwards compatibility but ignored if given.

reconstruct_iterative(signals, sc_lambda=0.01, step=1, batchsize=64, max_iter=100, threshold=0.001, normed=True, full_output=False, verbose=True, kwargs_lasso={})[source]#

Reconstruct multiple signals using iterative residual subtraction.

This method reconstructs each signal by iteratively updating and accumulating reconstructions. In the first iteration, the original input signal is reconstructed and then subtracted from itself to obtain the initial residual. In each subsequent iteration, a new reconstruction is generated from the current residual and subtracted from it, producing an updated residual for the next iteration, while also being added to the cumulative reconstruction. The process repeats until the Euclidean norm of the difference between consecutive residuals falls below a specified threshold, which sets the convergence criterion.

NOTE: In contrast with the usual procedure, the windows into which each signal is split are not normalized. This is needed to enhance the dictionary’s discrimination. Otherwise, the residuals are amplified at each iteration, the algorithm takes longer to converge, and ad-hoc tests showed it also distorts the resulting waveform shape.

Parameters:
signals : ndarray

Input signals to be reconstructed, with each signal along the first dimension.

sc_lambda : float, optional

Sparsity control parameter for reconstruction.

step : int, optional

Step size for the reconstruction.

batchsize : int, optional

Number of signals processed in each minibatch.

max_iter : int, optional

Maximum number of iterations before stopping.

threshold : float, optional

Convergence threshold based on the relative change in residuals.

normed : bool, optional

If True, the reconstructed signals are normalized after convergence.

full_output : bool, optional

If True, returns additional output values (residuals and iteration counts).

verbose : bool, optional

If True, prints progress information at each iteration.

kwargs_lasso : dict, optional

Additional arguments for the Lasso reconstruction method.

Returns:
ndarray or tuple

The final reconstructed signals. If full_output is True, also returns the residuals and the number of iterations per signal.
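A usage sketch (signals is a hypothetical 2d array; the order of the extra outputs with full_output=True is assumed from the description above):

recs, residuals, n_iters = dico.reconstruct_iterative(
    signals, sc_lambda=0.01, step=2, batchsize=32,
    max_iter=50, threshold=1e-3, full_output=True)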

reconstruct_loss_optimised(strain, *, reference, step=1, limits=None, loss_func='match', normed=True, kwargs_minimize={'bounds': (-2, 1), 'method': 'bounded', 'options': {'maxiter': 100, 'xatol': 0.04}}, kwargs_lasso={}, verbose=False)[source]#

Find the best reconstruction of a signal w.r.t. a reference.

Find the lambda which produces a reconstruction of the input ‘strain’ closest to the given ‘reference’, according to a chosen loss function: Match, Overlap, SSIM, or a custom one.

The minimisation is performed by SciPy’s ‘minimize_scalar’, with options specified through kwargs_minimize.

Parameters:
strain: ndarray

Input strain to be reconstructed (and optimized).

reference: ndarray

Reference strain which to compare the reconstruction to.

step: int, optional

Separation in samples between each window into which the input strain is split up to be reconstructed by the dictionary. Defaults to 1.

limits: array-like, optional

Indices delimiting the region over which the loss between the reconstruction and the reference strain is computed.

loss_func: str | callable, optional

If ‘str’, it can be ‘match’ (default), ‘overlap’ or ‘ssim’; in all cases, the corresponding pseudo-distance is used. Refer to their documentation in ‘clawdia.estimators’ for more details. If ‘callable’, it must be a symmetric function of two arguments, to which the ‘reference’ signal and the denoised signal will be passed. It must return a distance-like score between 0 (best) and 1 (worst) to guide the minimisation algorithm.

normed: bool, optional

If True, returns the signal normed to its maximum absolute amplitude.

kwargs_minimize: dict

Passed to SciPy’s minimize_scalar(**kwargs_minimize). Bracket or boundary values must be passed as np.log10(bounds).

kwargs_lasso: dict, optional

Passed to Python-Spams’ lasso(**kwargs_lasso).

verbose: bool, optional

If True, set the maximum verbosity (‘disp’: 3) for SciPy’s minimize_scalar and print information about the minimization results. False by default.

Returns:
rec: ndarray

Optimum reconstruction found.

l_opt: float

Optimum value for lambda.

loss: float

Final loss between the optimized reconstruction and the reference, e.g. dOverlap = (1 - Overlap)/2 or DSSIM = (1 - SSIM)/2.
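A usage sketch (strain and template are hypothetical arrays; note that the bounds passed to minimize_scalar are log10 values of lambda):

rec, l_opt, loss = dico.reconstruct_loss_optimised(
    strain, reference=template, loss_func='ssim',
    kwargs_minimize={'method': 'bounded', 'bounds': (-2, 1)})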

reconstruct_margin_constrained(signal: ndarray[Any, dtype[_ScalarType_co]], *, margin: int | tuple | list | ndarray[Any, dtype[_ScalarType_co]], lambda_lims: tuple | list, step: int = 1, normed=True, full_output=False, kwargs_bisect={}, kwargs_lasso={}) tuple[ndarray[Any, dtype[_ScalarType_co]], ndarray[Any, dtype[_ScalarType_co]], ndarray[Any, dtype[_ScalarType_co]]] | ndarray[Any, dtype[_ScalarType_co]][source]#

TODO

reconstruct_minibatch(signals, *, sc_lambda, step=1, batchsize=4, normed=True, normed_windows=True, verbose=True, **kwargs)[source]#

TODO

Reconstruct multiple signals, each one as a sparse combination of dictionary atoms. Minibatch version.

reset()[source]#

Reset the dictionary to its initial (untrained) state.

save(file)[source]#

Save the current state of the DictionarySpams object to a file.

This method saves all attributes of the object as a .npz file. If the object has not been trained, certain attributes (lambda1, n_train, and t_train) are removed to avoid potential issues when reloading the state.

Parameters:
file : str or file-like object

The file path or file object where the state of the object will be saved. If a string is provided, it specifies the path to the .npz file. If a file-like object is given, it must be writable in binary mode.

train(patches, lambda1=None, n_iter=None, warm_start=False, verbose=False, threads=-1, **kwargs)[source]#

Train the dictionary.

Train the dictionary with the given patches.

This also allows a warm start using the previous components as initial dictionary, but only if the lambda1 parameter is the same. It can be thought of as adding more iterations to the training. Hence, providing different patches is discouraged and untested.

Parameters:
patches : 2d-array, shape=(n_signals, n_samples)

Training patches.

lambda1 : float, optional

Regularization parameter of the learning algorithm. It is not needed if already specified at initialization.

n_iter : int, optional

Total number of iterations to perform. If a negative number is provided, the computation runs for the corresponding number of seconds instead; for instance, n_iter = -5 trains the dictionary for 5 seconds.

warm_start : bool, optional

If True, use the previous components as initial dictionary. It can be thought of as adding more iterations to the training. Providing different patches is discouraged and untested.

verbose : bool, optional

If True, print the iterations (the output might not be shown in real time).

threads : int, optional

Number of threads to use during training, see [1].

**kwargs

Passed directly to ‘spams.trainDL’, see [1].

See also

clawdia.lib.extract_patches

Useful for generating the training patches.
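A training sketch (patches is a hypothetical 2d array of training patches, e.g. built with clawdia.lib.extract_patches; values are illustrative):

dico.train(patches, lambda1=0.1, n_iter=500, verbose=True)
dico.train(patches, warm_start=True, n_iter=-10)  # negative n_iter: train for ~10 more seconds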

clawdia.dictionaries.load(file)[source]#
clawdia.dictionaries.save(file, dico)[source]#

Same as using the dictionary’s save method.
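For example (the file name is hypothetical, and it is assumed that load returns the restored dictionary object):

from clawdia import dictionaries

dictionaries.save("spams_dico.npz", dico)       # equivalent to dico.save("spams_dico.npz")
dico_loaded = dictionaries.load("spams_dico.npz")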