clawdia.dictionaries#

Main module for managing all SDL models.

This module serves as the central interface for handling the dictionary models included in the CLAWDIA pipeline. It provides classes and functions to load, save, and manage the different types of dictionary models used in Sparse Dictionary Learning (SDL). Support is included for both SPAMS-based dictionaries and Low-Rank Shared Dictionary Learning (LRSDL) models, ensuring compatibility and ease of use.
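For instance, both model classes documented below can be imported directly from the module:

from clawdia.dictionaries import DictionaryLRSDL, DictionarySpams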

class clawdia.dictionaries.DictionaryLRSDL(lambd=0.01, lambd2=0.01, eta=0.0001, k=10, k0=5, updateX_iters=100, updateD_iters=100)[source]#

Bases: LRSDL

Interface for the Low-Rank Shared Dictionary Learning class.

Notes

The authors of Dictol did not provide a seed parameter for the random initialization of the dictionary. If reproducibility is important, NumPy's global seed must be set before calling LRSDL.__init__().
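A minimal sketch of this workaround (constructor arguments as documented below):

import numpy as np
from clawdia.dictionaries import DictionaryLRSDL

np.random.seed(0)  # Dictol's LRSDL has no seed argument, so fix NumPy's global seed
dico = DictionaryLRSDL(lambd=0.01, lambd2=0.01, eta=0.0001, k=10, k0=5)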

References

[1]

Vu, T. H.; Monga, V. (2017). Fast low-rank shared dictionary learning for image classification, IEEE Transactions on Image Processing, 26(11), 5160–5175. (https://doi.org/10.1109/TIP.2017.2729885)

Attributes:
t_train : float

Training time in seconds.

lambd : float

See self.__init__() for details.

lambd2 : float

See self.__init__() for details.

eta : float

See self.__init__() for details.

D : ndarray

Class-specific dictionary.

X : ndarray

Class-specific coefficient vector of the training set given when calling self.fit().

Y : ndarray

Class-specific target vector (the training set) given when calling self.fit().

k : int

See self.__init__() for details.

k0 : int

See self.__init__() for details.

updateX_iters : int

See self.__init__() for details.

updateD_iters : int

See self.__init__() for details.

D_range : list[int]

Auxiliary list containing the range of indices of each class in D.

D0 : ndarray

Shared dictionary.

Y_range : list[int]

Auxiliary list containing the range of indices of each class in Y. Derived directly from ‘train_label’, equivalent to the ‘y_true’ labels. Example: given train_label = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1], then Y_range = [0, 4, 6]. The first value is always 0, marking the start of the first class, and the list always has one more entry than the number of classes.

X0 : ndarray

Shared coefficient vector of the training set given when calling self.fit().

Methods

fit(X, *, y_true, l_atoms, iterations[, ...])

Train the LRSDL dictionary.

predict(X, *[, threshold, offset, with_losses])

Predict the class of each window in X.

save(file)

Save the dictionary to a file.

evaluate

loss

__init__(lambd=0.01, lambd2=0.01, eta=0.0001, k=10, k0=5, updateX_iters=100, updateD_iters=100)[source]#

Initialize the LRSDL dictionary.

This method sets up the parameters required for training class-specific and shared dictionaries. These dictionaries are used to represent data with sparsity and low-rank properties, which can be regularized by the parameters defined below.

Parameters:
lambd : float

Regularization parameter for the sparsity term:

\[\lambda \|X\|_1\]

This encourages sparsity in the coefficient matrix \(X\), analogous to the LASSO regularization term.

lambd2 : float

Regularization parameter for the shared-coefficient consistency term:

\[\frac{\lambda_2}{2} \|X^0 - M^0\|^2\]

This keeps each shared coefficient vector \(X^0\) (used to select shared atoms) close to the mean shared vector \(M^0\), ensuring consistency across all \(X^0\).

eta : float

Regularization parameter for the low-rank term:

\[\eta \| D^0 \|_*\]

Here, \(\|\cdot\|_*\) is the nuclear norm, which enforces the shared dictionary to have low rank.

k : int

Number of class-specific atoms for each class. The total number of atoms in the class-specific dictionary is given by \(k \times C\), where \(C\) is the number of classes.

k0 : int

Total number of shared atoms. A value of \(k_0 = 0\) indicates that no shared dictionary is used.

updateX_iters, updateD_iters : int

These parameters are passed to the parent class LRSDL.__init__(). However, they are suspected to be dummy parameters because no usage of them could be found in the original implementation. They are retained here for compatibility but appear to have no functional effect in this class.

Warning

The updateX_iters and updateD_iters parameters are inherited from the parent class LRSDL, but they appear to be unused in this implementation. Consider verifying their relevance before relying on them.

Notes

  • The parameters lambd, lambd2, and eta control the sparsity and low-rank properties of the dictionaries.

  • Setting k0 = 0 disables the shared dictionary.
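For example, a configuration without a shared dictionary (values are illustrative, using only the parameters documented above) could be set up as:

dico = DictionaryLRSDL(lambd=0.01, lambd2=0.01, eta=0.0001, k=12, k0=0)  # k0=0 disables the shared dictionary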

evaluate(data, label)#
fit(X, *, y_true, l_atoms, iterations, step=None, threshold=0, random_seed=None, verbose=False, show_after=5)[source]#

Train the LRSDL dictionary.

This method trains the dictionary using the provided data and allows for several configuration options:

  • Split the input data X into sliding windows of length l_atoms.

  • Use the entire input as a single window.

  • Discard training windows whose L2-norm is below a specified threshold.

The splitting behavior depends on the step parameter. If step is None, the entire input is treated as a single window. Otherwise, overlapping patches of size l_atoms are created with the specified step size.

Parameters:
X : ndarray of shape (n_samples, n_features)

Training samples. The number of features must be equal to or greater than the dictionary’s atom size.

y_true : ndarray of shape (n_samples,)

Labels corresponding to the samples in X. The length of y_true must equal the number of samples in X.

l_atoms : int

Length of the dictionary’s atoms.

iterations : int

Number of training iterations.

step : int, optional

The step size used to split input samples into patches of length l_atoms. If not specified, it defaults to step = l_atoms, so that all the information available in X is extracted without any repetition (overlap).

threshold : float, optional

L2-norm threshold, relative to the maximum L2-norm within each strain. Training windows whose relative L2-norm falls below this value are discarded. Default is 0 (only all-zero windows are discarded).

random_seed : int, optional

Random seed for reproducibility. Default is None.

verbose : bool, optional

If True, print verbose output during training. Default is False.

show_after : int, optional

If verbose is True, progress will be displayed every show_after iterations. Default is 5.

Returns:
None

The method trains the dictionary in place.

Notes

Per-class sufficiency is checked on the effective number of training windows (after patching and thresholding), not on the raw number of input strains per class.
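As an illustration of the windowing options above (array shapes and values are hypothetical; the signature is the one documented here):

import numpy as np

X = np.random.randn(40, 1024)     # 40 strains of 1024 samples each
y_true = np.repeat([0, 1], 20)    # two classes, 20 strains per class
dico.fit(X, y_true=y_true, l_atoms=256, iterations=30,
         step=128, threshold=0.1, random_seed=0, verbose=True)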

loss()#
predict(X, *, threshold=0, offset=0, with_losses=False)[source]#

Predict the class of each window in X.

The class of a window is the class of the closest codeword to that window in the dictionary.

Parameters:
X : 2d-array, shape=(n_signals, n_samples)

Input signals, each with at least as many samples as the dictionary’s atoms.

threshold : float, optional

Loss threshold above which signals are marked as the “unknown” class, corresponding to the label value -1. Default is 0, in which case all signals are classified.

offset : int, optional

Starting index i0 at which to crop the input signals X; the end index is i1 = offset + l_atoms. Default is 0.

with_losses : bool, optional

If True, return a tuple with the class predictions and the corresponding losses.

Returns:
y_pred : 1d-array, shape=(n_signals,)

Class predictions for each input signal.

losses : 1d-array, shape=(n_signals,), optional

Losses of the closest codewords to each input signal. Only returned if with_losses=True.
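A minimal usage sketch (X_test is a hypothetical 2d-array of test signals):

y_pred, losses = dico.predict(X_test, threshold=0.5, with_losses=True)
unknown = y_pred == -1   # boolean mask of signals whose loss exceeded the threshold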

save(file: str) None[source]#

Save the dictionary to a file.

Save the dictionary attributes using NumPy’s ‘np.savez()’.

Parameters:
file : str

Path of the file where the dictionary will be saved.
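For instance (the file name is hypothetical):

dico.save("lrsdl_model.npz")   # attributes are stored with np.savez()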

class clawdia.dictionaries.DictionarySpams(dict_init=None, model=None, signal_pool=None, a_length=None, d_size=None, wave_pos=None, patch_min=1, l2_normed=True, allow_allzeros=False, random_state=None, ignore_completeness=False, lambda1=None, batch_size=64, n_iter=None, n_train=None, trained=False, mode_traindl=0, modeD_traindl=0, mode_lasso=2, identifier='')[source]#

Bases: object

Sparse Dictionary Learning (SDL) model for waveform denoising via SPAMS.

This class provides an object-oriented implementation of a Sparse Dictionary Learning model, designed for the denoising and reconstruction of waveforms. At its core, it utilizes the trainDL function for dictionary learning and the lasso function for sparse coding from the SPAMS-python library [1].

It extends these core functionalities to arbitrarily long signals and minibatch processing for large datasets. Additionally, the class includes various utilities for signal preprocessing, composite models of denoising (such as iterative reconstruction), and the ability to easily save and load the dictionary’s state.

References

[1]

SPAMS (for Python), http://spams-devel.gforge.inria.fr/. Last accessed October 2018.

Attributes:
dict_init : ndarray

Atoms of the initial dictionary. Remains unaltered after training.

components : ndarray

Atoms of the current (trained) dictionary.

model : tuple

SPAMS’ trainDL model components in the form (A, B, iter).

d_size : int

Number of atoms in the dictionary (dictionary size).

a_length : int

Length of each atom in the dictionary (patch size).

lambda1 : float

Regularization parameter for training the dictionary.

batch_size : int

Batch size used in mini-batch training.

n_iter : int

Number of iterations performed during training.

t_train : float

Total training time in seconds.

trained : bool

Indicates whether the dictionary has been trained.

n_train : int

Number of patches used during training.

mode_traindl : int

Training mode for SPAMS’ trainDL function.

modeD_traindl : int

Dictionary mode for SPAMS’ trainDL function.

mode_lasso : int

Mode for SPAMS’ lasso function.

identifier : str

Optional identifier or note for distinguishing the dictionary.

Methods

copy()

Return a copy of the dictionary.

reconstruct(signal, sc_lambda[, step, ...])

Reconstruct a signal as a sparse combination of dictionary atoms.

reconstruct_batch(signals, sc_lambda[, out, ...])

TODO

reconstruct_iterative(signals[, sc_lambda, ...])

Reconstruct multiple signals using iterative residual subtraction.

reconstruct_loss_optimised(strain, *, reference)

Find the best reconstruction of a signal w.r.t. a reference.

reconstruct_margin_constrained(signal, *, ...)

TODO

reconstruct_minibatch(signals, *, sc_lambda)

TODO

reset()

Reset the dictionary to its initial (untrained) state.

save(file)

Save the current state of the DictionarySpams object to a file.

train(patches[, lambda1, n_iter, ...])

Train the dictionary.

__init__(dict_init=None, model=None, signal_pool=None, a_length=None, d_size=None, wave_pos=None, patch_min=1, l2_normed=True, allow_allzeros=False, random_state=None, ignore_completeness=False, lambda1=None, batch_size=64, n_iter=None, n_train=None, trained=False, mode_traindl=0, modeD_traindl=0, mode_lasso=2, identifier='')[source]#

Initialize the dictionary.

There are two ways to initialize the dictionary:

  1. By directly providing the initial dictionary with dict_init.

  2. By providing a collection of signals (signal_pool) from which atoms are randomly extracted to form the initial dictionary.

If the second option is used, a_length and d_size must be explicitly specified to define the size of the dictionary. Additional optional parameters provide more control over this process.

Parameters:
dict_init : ndarray of shape (d_size, a_length), optional

Atoms of the initial dictionary. If None, signal_pool must be provided.

model : dict, optional

SPAMS’ trainDL model components as a dictionary with elements {A, B, iter}. Must be provided if continuing training from a previous state.

signal_pool : ndarray of shape (n_signals, n_samples), optional

A collection of signals from which atoms are extracted to form the initial dictionary. Ignored if dict_init is provided.

a_length : int, optional

Length of each atom in the dictionary (patch size). Required if signal_pool is provided.

d_size : int, optional

Number of atoms in the dictionary. Required if signal_pool is provided.

wave_pos : array-like of shape (n_signals, 2), optional

Positions of waveforms within signal_pool to extract atoms from. If None, the entire array is used.

patch_min : int, default=1

Minimum number of samples for each extracted patch. Ignored if wave_pos is None.

l2_normed : bool, default=True

If True, normalize extracted atoms to their L2 norm.

allow_allzeros : bool, default=False

By default, random atoms with all zeros are excluded from the initial dictionary. If allow_allzeros=True, they are allowed.

random_state : int, optional

Seed for random sampling from signal_pool.

ignore_completeness : bool, default=False

If False, the dictionary must be overcomplete (d_size > a_length).

lambda1 : float, optional

Regularization parameter for training.

batch_size : int, default=64

Batch size used during training.

n_iter : int, optional

Total number of iterations for training. If None, this must be set when calling the train method.

n_train : int, optional

Number of patches used for training. Informational only.

trained : bool, default=False

Indicates whether the dictionary is already trained.

mode_traindl : int, default=0

Training mode for SPAMS’ trainDL function. See SPAMS documentation.

modeD_traindl : int, default=0

Dictionary mode for SPAMS’ trainDL function. See SPAMS documentation.

mode_lasso : int, default=2

Mode for SPAMS’ lasso function. See SPAMS documentation.

identifier : str, optional

A note or label for identifying the dictionary.

Notes

This method initializes the dictionary but does not train it. Use the train method for training.
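A sketch of the two initialization routes described above (array shapes and values are illustrative, not prescriptive):

import numpy as np
from clawdia.dictionaries import DictionarySpams

# 1) Directly from an initial dictionary of 128 atoms of length 64
dico = DictionarySpams(dict_init=np.random.randn(128, 64), lambda1=0.1)

# 2) From a pool of signals, extracting 128 random atoms of length 64
pool = np.random.randn(200, 4096)
dico = DictionarySpams(signal_pool=pool, a_length=64, d_size=128,
                       lambda1=0.1, random_state=0)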

copy()[source]#

Return a copy of the dictionary.

Returns a new instance of the same dictionary with the same values and state.

Returns:
dico_copy : DictionarySpams

A copy of the current dictionary.

reconstruct(signal, sc_lambda, step=1, normed=True, with_code=False, **kwargs)[source]#

Reconstruct a signal as a sparse combination of dictionary atoms.

Parameters:
signal : ndarray

Sample to be reconstructed.

sc_lambda : float

Regularization parameter of the sparse coding transformation.

step : int, default=1

Sample interval between consecutive patches extracted from signal; determines the number of patches extracted.

normed : bool, default=True

Normalize the result to the maximum absolute value.

with_code : bool, default=False

If True, also returns the coefficients array.

**kwargs

Passed directly to the external learning function.

Returns:
signal_rec : ndarray

Reconstructed signal.

code : array of shape (a_length, d_size), optional

Transformed data, encoded as a sparse combination of atoms. Returned when ‘with_code’ is True.
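A minimal usage sketch (signal is a hypothetical 1d array; the regularization value is illustrative):

signal_rec, code = dico.reconstruct(signal, sc_lambda=0.1, step=4, with_code=True)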

reconstruct_batch(signals, sc_lambda, out=None, step=1, normed=True, verbose=True, **kwargs)[source]#

TODO

Reconstruct multiple signals, each one as a sparse combination of dictionary atoms.

WARNING: Only viable for a small ‘signals’ set; it is very memory-expensive (all patches are stored in a single in-memory array).

WARNING: ‘out’ is deprecated; it is kept for backwards compatibility but ignored if given.

reconstruct_iterative(signals, sc_lambda=0.01, step=1, batchsize=64, max_iter=100, threshold=0.001, normed=True, full_output=False, verbose=True, kwargs_lasso={})[source]#

Reconstruct multiple signals using iterative residual subtraction.

This method reconstructs each signal by iteratively updating and accumulating reconstructions. In the first iteration, the original input signal is reconstructed and then subtracted from itself to obtain the initial residual. In each subsequent iteration, a new reconstruction is generated from the current residual and subtracted from it, producing an updated residual for the next iteration, while also being added to the cumulative reconstruction. The process repeats until the Euclidean norm of the difference between consecutive residuals falls below a specified threshold, which sets the convergence criterion.

NOTE: In contrast with the usual procedure, the windows into which each signal is split are not normalized. This is needed to enhance the dictionary’s discrimination. Otherwise, the residuals are amplified at each iteration, the algorithm takes longer to converge, and ad-hoc tests showed it also distorts the resulting waveform shape.

Parameters:
signals : ndarray

Input signals to be reconstructed, with each signal along the first dimension.

sc_lambda : float, optional

Sparsity control parameter for reconstruction.

step : int, optional

Step size for the reconstruction.

batchsize : int, optional

Number of signals processed in each minibatch.

max_iter : int, optional

Maximum number of iterations before stopping.

threshold : float, optional

Convergence threshold based on the relative change in residuals.

normed : bool, optional

If True, the reconstructed signals are normalized after convergence.

full_output : bool, optional

If True, returns additional output values (residuals and iteration counts).

verbose : bool, optional

If True, prints progress information at each iteration.

kwargs_lasso : dict, optional

Additional arguments for the Lasso reconstruction method.

Returns:
ndarray or tuple

The final reconstructed signals. If full_output is True, also returns the residuals and the number of iterations per signal.
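A usage sketch (signals is a hypothetical 2d array; the order of the extra outputs with full_output=True is assumed from the description above):

recs, residuals, n_iters = dico.reconstruct_iterative(
    signals, sc_lambda=0.01, step=2, batchsize=32,
    max_iter=50, threshold=1e-3, full_output=True)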

reconstruct_loss_optimised(strain, *, reference, step=1, limits=None, loss_func='match', normed=True, kwargs_minimize={'bounds': (-2, 1), 'method': 'bounded', 'options': {'maxiter': 100, 'xatol': 0.04}}, kwargs_lasso={}, verbose=False)[source]#

Find the best reconstruction of a signal w.r.t. a reference.

Find the lambda which produces a reconstruction of the input ‘strain’ closest to the given ‘reference’, according to a chosen loss function: Match, Overlap, SSIM, or a custom one.

The minimisation is performed by SciPy’s ‘minimize_scalar’, with options specified through kwargs_minimize.

Parameters:
strain: ndarray

Input strain to be reconstructed (and optimized).

reference: ndarray

Reference strain which to compare the reconstruction to.

step: int, optional

Separation in samples between each window into which the input strain is split up to be reconstructed by the dictionary. Defaults to 1.

limits: array-like, optional

Indices delimiting the region over which the loss between the reconstruction and the reference strain is computed.

loss_func: str | callable, optional

If ‘str’, it can be ‘match’ (default), ‘overlap’ or ‘ssim’; in all cases, the corresponding pseudo-distance is used. Refer to their documentation in ‘clawdia.estimators’ for more details. If ‘callable’, it must be a symmetric function of two arguments, to which the ‘reference’ signal and the denoised signal will be passed. It must return a distance-like score between 0 (best) and 1 (worst) to guide the minimisation algorithm.

normed: bool, optional

If True, returns the signal normed to its maximum absolute amplitude.

kwargs_minimize: dict

Passed to SciPy’s minimize_scalar(**kwargs_minimize). Bracket or boundary values must be passed as np.log10(bounds).

kwargs_lasso: dict, optional

Passed to Python-Spams’ lasso(**kwargs_lasso).

verbose: bool, optional

If True, set the maximum verbosity (‘disp’: 3) for SciPy’s minimize_scalar and print information about the minimization results. False by default.

Returns:
rec: ndarray

Optimum reconstruction found.

l_opt: float

Optimum value for lambda.

loss: float

Final loss between the optimized reconstruction and the reference, e.g. dOverlap = (1 - Overlap)/2 or DSSIM = (1 - SSIM)/2.
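A usage sketch (strain and template are hypothetical arrays; note that the bounds passed to minimize_scalar are log10 values of lambda):

rec, l_opt, loss = dico.reconstruct_loss_optimised(
    strain, reference=template, loss_func='ssim',
    kwargs_minimize={'method': 'bounded', 'bounds': (-2, 1)})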

reconstruct_margin_constrained(signal: ndarray[Any, dtype[_ScalarType_co]], *, margin: int | tuple | list | ndarray[Any, dtype[_ScalarType_co]], lambda_lims: tuple | list, step: int = 1, normed=True, full_output=False, kwargs_bisect={}, kwargs_lasso={}) tuple[ndarray[Any, dtype[_ScalarType_co]], ndarray[Any, dtype[_ScalarType_co]], ndarray[Any, dtype[_ScalarType_co]]] | ndarray[Any, dtype[_ScalarType_co]][source]#

TODO

reconstruct_minibatch(signals, *, sc_lambda, step=1, batchsize=4, normed=True, normed_windows=True, verbose=True, **kwargs)[source]#

TODO

Reconstruct multiple signals, each one as a sparse combination of dictionary atoms. Minibatch version.

reset()[source]#

Reset the dictionary to its initial (untrained) state.

save(file)[source]#

Save the current state of the DictionarySpams object to a file.

This method saves all attributes of the object as a .npz file. If the object has not been trained, certain attributes (lambda1, n_train, and t_train) are removed to avoid potential issues when reloading the state.

Parameters:
file : str or file-like object

The file path or file object where the state of the object will be saved. If a string is provided, it specifies the path to the .npz file. If a file-like object is given, it must be writable in binary mode.

train(patches, lambda1=None, n_iter=None, warm_start=False, verbose=False, threads=-1, **kwargs)[source]#

Train the dictionary.

Train the dictionary with the given patches.

This also allows a warm start using the previous components as initial dictionary, but only if the lambda1 parameter is the same. It can be thought of as adding more iterations to the training. Hence, providing different patches is discouraged and untested.

Parameters:
patches : 2d-array, shape=(n_signals, n_samples)

Training patches.

lambda1 : float, optional

Regularization parameter of the learning algorithm. It is not needed if already specified at initialization.

n_iter : int, optional

Total number of iterations to perform. If a negative number is provided, the computation runs for the corresponding number of seconds instead; for instance, n_iter = -5 trains the dictionary for 5 seconds.

warm_start : bool, optional

If True, use the previous components as initial dictionary. It can be thought of as adding more iterations to the training. Providing different patches is discouraged and untested.

verbose : bool, optional

If True, print the iterations (the output might not be shown in real time).

threads : int, optional

Number of threads to use during training, see [1].

**kwargs

Passed directly to ‘spams.trainDL’, see [1].

See also

clawdia.lib.extract_patches

Useful for generating the training patches.
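A training sketch (patches is a hypothetical 2d array of training patches, e.g. built with clawdia.lib.extract_patches; values are illustrative):

dico.train(patches, lambda1=0.1, n_iter=500, verbose=True)
dico.train(patches, warm_start=True, n_iter=-10)  # negative n_iter: train for ~10 more seconds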

clawdia.dictionaries.load(file)[source]#
clawdia.dictionaries.save(file, dico)[source]#

Same as using the dictionary’s save method.
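For example (the file name is hypothetical, and it is assumed that load returns the restored dictionary object):

from clawdia import dictionaries

dictionaries.save("spams_dico.npz", dico)       # equivalent to dico.save("spams_dico.npz")
dico_loaded = dictionaries.load("spams_dico.npz")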