mir_eval.alignment
Alignment models are given a sequence of events along with a piece of audio, and return a sequence of timestamps, one for each event, indicating the position of that event in the audio. The events are listed in order of occurrence in the audio, so the output timestamps must be monotonically increasing. Evaluation typically compares the predicted and ground truth timestamps pair-wise, e.g. by taking the median absolute error in seconds.
Conventions
Timestamps should be provided in the form of a 1-dimensional array of onset times in seconds in increasing order.
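For illustration, a valid annotation under this convention is simply a sorted 1-D numpy array (the values below are made up):

import numpy as np

# Onset times in seconds, one per event, in increasing order
timestamps = np.array([0.52, 1.30, 2.05, 3.41])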
Metrics
mir_eval.alignment.absolute_error()
: Median absolute error and average absolute error
mir_eval.alignment.percentage_correct()
: Percentage of correct timestamps, where a timestamp is counted as correct if it lies within a certain tolerance window around the ground truth timestamp
mir_eval.alignment.percentage_correct_segments()
: Percentage of correct segments: percentage of overlap between predicted segments and ground truth segments, where segments are defined by (start time, end time) pairs
mir_eval.alignment.karaoke_perceptual_metric()
: Metric based on human synchronicity perception as measured in the paper “User-centered evaluation of lyrics to audio alignment”, N. Lizé-Masclef, A. Vaglio, M. Moussallam, ISMIR 2021
References
- [#lizemasclef2021] N. Lizé-Masclef, A. Vaglio, M. Moussallam, “User-centered evaluation of lyrics to audio alignment”, ISMIR 2021.
- mir_eval.alignment.validate(reference_timestamps: ndarray, estimated_timestamps: ndarray)
Check that the input annotations to a metric look like valid onset time arrays, and throw helpful errors if not. A sketch of such checks follows the parameter list.
- Parameters:
- reference_timestamps : np.ndarray
reference timestamp locations, in seconds
- estimated_timestamps : np.ndarray
estimated timestamp locations, in seconds
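mir_eval does not spell out the exact checks here, but a minimal sketch consistent with the conventions above might look like the following (the names and the precise checks are assumptions; the real validate may differ):

import numpy as np

def validate_sketch(reference_timestamps, estimated_timestamps):
    # 1-D arrays of onset times, in increasing order (see Conventions)
    for name, ts in [("reference", reference_timestamps),
                     ("estimate", estimated_timestamps)]:
        if ts.ndim != 1:
            raise ValueError(f"{name} timestamps must be a 1-D array")
        if np.any(np.diff(ts) < 0):
            raise ValueError(f"{name} timestamps must be in increasing order")
    # Metrics compare timestamps pair-wise, so lengths must match
    if reference_timestamps.shape != estimated_timestamps.shape:
        raise ValueError("reference and estimate must have the same length")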
- mir_eval.alignment.absolute_error(reference_timestamps, estimated_timestamps)
Compute the absolute deviations between estimated and reference timestamps, then return the median and average over all events. A sketch of this computation follows the example below.
- Parameters:
- reference_timestamps : np.ndarray
reference timestamps, in seconds
- estimated_timestamps : np.ndarray
estimated timestamps, in seconds
- Returns:
- mae : float
Median absolute error
- aae : float
Average absolute error
Examples
>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> mae, aae = mir_eval.alignment.absolute_error(reference_timestamps, estimated_timestamps)
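The computation itself reduces to two numpy reductions over the per-event deviations; a minimal sketch (the function name is hypothetical, and this is not mir_eval's actual implementation):

import numpy as np

def absolute_error_sketch(reference_timestamps, estimated_timestamps):
    # Per-event absolute deviation between estimate and reference
    deviations = np.abs(estimated_timestamps - reference_timestamps)
    # Median absolute error (mae) and average absolute error (aae)
    return np.median(deviations), np.mean(deviations)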
- mir_eval.alignment.percentage_correct(reference_timestamps, estimated_timestamps, window=0.3)
Compute the percentage of correctly predicted timestamps. A timestamp is predicted correctly if it deviates from the ground truth timestamp by no more than the window parameter (see the sketch after the example below).
- Parameters:
- reference_timestamps : np.ndarray
reference timestamps, in seconds
- estimated_timestamps : np.ndarray
estimated timestamps, in seconds
- window : float
Window size, in seconds (Default value = 0.3)
- Returns:
- pc : float
Percentage of correct timestamps
Examples
>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> pc = mir_eval.alignment.percentage_correct(reference_timestamps, estimated_timestamps, window=0.2)
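A minimal sketch of this computation, assuming an inclusive tolerance window (mir_eval's exact boundary handling may differ) and a score expressed as a fraction; the function name is hypothetical:

import numpy as np

def percentage_correct_sketch(reference_timestamps, estimated_timestamps, window=0.3):
    # A timestamp is correct if it lies within +/- window seconds of its reference
    return np.mean(np.abs(estimated_timestamps - reference_timestamps) <= window)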
- mir_eval.alignment.percentage_correct_segments(reference_timestamps, estimated_timestamps, duration: float | None = None)
Calculate the percentage of correct segments (PCS) metric.
It constructs segments separately from the reference and the estimated timestamp vectors, and calculates the percentage of overlap between corresponding segments relative to the total duration.
WARNING: This metric behaves differently depending on whether “duration” is given!
If duration is not given (default case), the computation follows the MIREX lyrics alignment challenge 2020. For a timestamp vector with entries (t1, t2, …, tN), segments with the following (start, end) boundaries are created: (t1, t2), …, (tN-1, tN). After the segments are created, the overlap between the reference and estimated segments is determined and divided by the total duration, which is the distance between the first and last timestamp in the reference.
If duration is given, the segment boundaries are instead (0, t1), (t1, t2), …, (tN, duration). The overlap is computed in the same way, but divided by the duration parameter given to this function. This variant follows the original paper [#fujihara2011], in which the metric was proposed, more closely. As a result, it penalizes cases where the first estimated timestamp is too early or the last estimated timestamp is too late, whereas the MIREX variant does not. On the other hand, the MIREX variant is invariant to the length of the eventless beginning and end of the audio, which might be a desirable property. An illustrative sketch of both variants follows the example below.
- Parameters:
- reference_timestamps : np.ndarray
reference timestamps, in seconds
- estimated_timestamps : np.ndarray
estimated timestamps, in seconds
- duration : float
Optional. Total duration of the audio, in seconds. WARNING: the metric is computed differently depending on whether this is provided; see the description above!
- Returns:
- pcs : float
Percentage of time where ground truth and predicted segments overlap
Examples
>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> pcs = mir_eval.alignment.percentage_correct_segments(reference_timestamps, estimated_timestamps)
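Both variants described above reduce to the same interval-overlap computation; this is an illustrative re-implementation under those descriptions (the function name is hypothetical, and this is not mir_eval's own code):

import numpy as np

def pcs_sketch(reference_timestamps, estimated_timestamps, duration=None):
    if duration is None:
        # MIREX 2020 variant: segments (t1, t2), ..., (tN-1, tN),
        # normalized by the span of the reference timestamps
        ref_start, ref_end = reference_timestamps[:-1], reference_timestamps[1:]
        est_start, est_end = estimated_timestamps[:-1], estimated_timestamps[1:]
        total = reference_timestamps[-1] - reference_timestamps[0]
    else:
        # Paper variant: additionally use (0, t1) and (tN, duration),
        # normalized by the full audio duration
        ref_start = np.concatenate(([0.0], reference_timestamps))
        ref_end = np.concatenate((reference_timestamps, [duration]))
        est_start = np.concatenate(([0.0], estimated_timestamps))
        est_end = np.concatenate((estimated_timestamps, [duration]))
        total = duration
    # Overlap between each estimated segment and its corresponding reference segment
    overlap = np.maximum(0.0, np.minimum(ref_end, est_end) - np.maximum(ref_start, est_start))
    return overlap.sum() / total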
- mir_eval.alignment.karaoke_perceptual_metric(reference_timestamps, estimated_timestamps)
Metric based on human synchronicity perception as measured in the paper “User-centered evaluation of lyrics to audio alignment” [#lizemasclef2021]
The parameters of this function were tuned on data collected through a Karaoke-like user experiment. It reflects human judgment of how “synchronous” lyrics and audio stimuli are perceived in that setup. Beware that this metric is non-symmetric, and by construction it does not equal 1 when the error is 0.
- Parameters:
- reference_timestamps : np.ndarray
reference timestamps, in seconds
- estimated_timestamps : np.ndarray
estimated timestamps, in seconds
- Returns:
- perceptual_score : float
Perceptual score, averaged over all timestamps
Examples
>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> score = mir_eval.alignment.karaoke_perceptual_metric(reference_timestamps, estimated_timestamps)
- mir_eval.alignment.evaluate(reference_timestamps, estimated_timestamps, **kwargs)
Compute all metrics for the given reference and estimated annotations.
- Parameters:
- reference_timestamps : np.ndarray
reference timestamp locations, in seconds
- estimated_timestamps : np.ndarray
estimated timestamp locations, in seconds
- **kwargs
Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.
- Returns:
- scores : dict
Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.
Examples
>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> duration = max(np.max(reference_timestamps), np.max(estimated_timestamps)) + 10
>>> scores = mir_eval.alignment.evaluate(reference_timestamps, estimated_timestamps, duration=duration)
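Since the returned dictionary maps metric names to scores, it can be inspected directly (the exact key names depend on the metrics computed):

>>> for name, score in scores.items():
...     print(name, score)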