mir_eval.alignment

Alignment models are given a sequence of events along with a piece of audio, and return a sequence of timestamps, one for each event, indicating that event's position in the audio. The events are listed in order of occurrence in the audio, so the output timestamps must be monotonically increasing. Evaluation typically compares the predicted and ground truth timestamps pairwise, e.g. by taking the median absolute error in seconds.

Conventions

Timestamps should be provided in the form of a 1-dimensional array of onset times in seconds in increasing order.
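
For illustration, here is a minimal conforming input with a monotonicity check in the spirit of mir_eval.alignment.validate() (the array values are made up):

>>> import numpy as np
>>> reference_timestamps = np.array([0.5, 1.2, 2.75, 4.1])  # 1-D, in seconds
>>> bool(np.all(np.diff(reference_timestamps) > 0))  # increasing order
True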

Metrics

  • mir_eval.alignment.absolute_error(): Median absolute error and average absolute error

  • mir_eval.alignment.percentage_correct(): Percentage of correct timestamps, where a timestamp is counted as correct if it lies within a certain tolerance window around the ground truth timestamp

  • mir_eval.alignment.percentage_correct_segments(): Percentage of correct segments: the percentage of overlap between predicted segments and ground truth segments, where segments are defined by (start time, end time) pairs

  • mir_eval.alignment.karaoke_perceptual_metric(): Metric based on human synchronicity perception as measured in the paper “User-centered evaluation of lyrics to audio alignment”, N. Lizé-Masclef, A. Vaglio, M. Moussallam, ISMIR 2021

References

mir_eval.alignment.validate(reference_timestamps: ndarray, estimated_timestamps: ndarray)

Checks that the input annotations to a metric look like valid onset time arrays, and throws helpful errors if not.

Parameters:
reference_timestamps : np.ndarray

reference timestamp locations, in seconds

estimated_timestamps : np.ndarray

estimated timestamp locations, in seconds

mir_eval.alignment.absolute_error(reference_timestamps, estimated_timestamps)

Compute the absolute deviations between estimated and reference timestamps, and then return the median and average over all events.

Parameters:
reference_timestamps : np.ndarray

reference timestamps, in seconds

estimated_timestamps : np.ndarray

estimated timestamps, in seconds

Returns:
mae : float

Median absolute error

aae : float

Average absolute error

Examples

>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> mae, aae = mir_eval.alignment.absolute_error(reference_timestamps, estimated_timestamps)
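
For intuition, the two return values are just the median and the mean of the element-wise absolute deviations. An equivalent numpy sketch, continuing the example above (not the library's implementation):

>>> deviations = np.abs(estimated_timestamps - reference_timestamps)
>>> mae, aae = np.median(deviations), np.mean(deviations)
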
mir_eval.alignment.percentage_correct(reference_timestamps, estimated_timestamps, window=0.3)

Compute the percentage of correctly predicted timestamps. A timestamp is predicted correctly if its position deviates from the ground truth timestamp by no more than the window parameter.

Parameters:
reference_timestamps : np.ndarray

reference timestamps, in seconds

estimated_timestamps : np.ndarray

estimated timestamps, in seconds

window : float

Window size, in seconds (Default value = 0.3)

Returns:
pc : float

Percentage of correct timestamps

Examples

>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> pc = mir_eval.alignment.percentage_correct(reference_timestamps, estimated_timestamps, window=0.2)
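
Equivalently, this is the fraction of events whose absolute deviation falls inside the window. A numpy sketch continuing the example above (whether a deviation exactly at the window boundary counts as correct is an assumption of this sketch):

>>> deviations = np.abs(estimated_timestamps - reference_timestamps)
>>> pc_sketch = np.mean(deviations <= 0.2)  # fraction within a 0.2 s window
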
mir_eval.alignment.percentage_correct_segments(reference_timestamps, estimated_timestamps, duration: float | None = None)

Calculate the percentage of correct segments (PCS) metric.

Segments are constructed separately from the reference and the estimated timestamp vectors, and the metric measures the percentage of the total duration during which the corresponding reference and estimated segments overlap.

WARNING: This metric behaves differently depending on whether “duration” is given!

If duration is not given (default case), the computation follows the MIREX lyrics alignment challenge 2020. For a timestamp vector with entries (t1, t2, …, tN), segments with the following (start, end) boundaries are created: (t1, t2), …, (tN-1, tN). After the segments are created, the overlap between the reference and estimated segments is determined and divided by the total duration, which is the distance between the first and last timestamp in the reference.

If duration is given, the segment boundaries are instead (0, t1), (t1, t2), …, (tN, duration). The overlap is computed in the same way, but is then divided by the duration parameter given to this function. This method follows more closely the original paper [#fujihara2011], in which the metric was proposed. As a result, this variant of the metric penalizes cases where the first estimated timestamp is too early or the last estimated timestamp is too late, whereas the MIREX variant does not. On the other hand, the MIREX variant is invariant to how long the eventless beginning and end parts of the audio are, which might be a desirable property. Both variants are illustrated in the sketch after the Examples below.

Parameters:
reference_timestamps : np.ndarray

reference timestamps, in seconds

estimated_timestamps : np.ndarray

estimated timestamps, in seconds

duration : float

Optional. Total duration of audio (seconds). WARNING: Metric is computed differently depending on whether this is provided or not - see documentation above!

Returns:
pcs : float

Percentage of time where ground truth and predicted segments overlap

Examples

>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> pcs = mir_eval.alignment.percentage_correct_segments(reference_timestamps, estimated_timestamps)
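
To make the two variants concrete, here is an illustrative re-implementation following the textual description above (a sketch, not the library's code; segment i of the estimate is compared against segment i of the reference, and a fraction in [0, 1] is returned):

import numpy as np

def pcs_sketch(reference, estimated, duration=None):
    reference, estimated = np.asarray(reference), np.asarray(estimated)
    if duration is None:
        # MIREX 2020 variant: segments (t1, t2), ..., (tN-1, tN),
        # normalized by the reference span tN - t1
        ref, est = reference, estimated
        total = ref[-1] - ref[0]
    else:
        # Original variant [#fujihara2011]: segments (0, t1), ..., (tN, duration),
        # normalized by the full audio duration
        ref = np.concatenate(([0.0], reference, [duration]))
        est = np.concatenate(([0.0], estimated, [duration]))
        total = duration
    # Overlap of each estimated segment with its corresponding reference segment
    overlaps = np.minimum(ref[1:], est[1:]) - np.maximum(ref[:-1], est[:-1])
    return np.maximum(0.0, overlaps).sum() / total
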
mir_eval.alignment.karaoke_perceptual_metric(reference_timestamps, estimated_timestamps)

Metric based on human synchronicity perception as measured in the paper “User-centered evaluation of lyrics to audio alignment” [#lizemasclef2021]

The parameters of this function were tuned on data collected through a Karaoke-like user experiment. It reflects human judgment of how “synchronous” lyrics and audio stimuli are perceived in that setup. Beware that this metric is not symmetric, and by construction it also does not equal 1 at a deviation of 0.

Parameters:
reference_timestamps : np.ndarray

reference timestamps, in seconds

estimated_timestamps : np.ndarray

estimated timestamps, in seconds

Returns:
perceptual_score : float

Perceptual score, averaged over all timestamps

Examples

>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> score = mir_eval.alignment.karaoke_perceptual_metric(reference_timestamps, estimated_timestamps)
mir_eval.alignment.evaluate(reference_timestamps, estimated_timestamps, **kwargs)

Compute all metrics for the given reference and estimated annotations.

Parameters:
reference_timestamps : np.ndarray

reference timestamp locations, in seconds

estimated_timestamps : np.ndarray

estimated timestamp locations, in seconds

**kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns:
scores : dict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

Examples

>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> duration = max(np.max(reference_timestamps), np.max(estimated_timestamps)) + 10
>>> scores = mir_eval.alignment.evaluate(reference_timestamps, estimated_timestamps, duration=duration)
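
Since evaluate() only accepts extra parameters as keyword arguments, duration is passed by name above. The returned dictionary can then be inspected directly, continuing the example (the exact metric names used as keys are not shown here):

>>> for metric_name, score in scores.items():
...     print(metric_name, score)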