mir_eval.transcription

The aim of a transcription algorithm is to produce a symbolic representation of a recorded piece of music in the form of a set of discrete notes. There are different ways to represent notes symbolically. Here we use the piano-roll convention, meaning each note has a start time, a duration (or end time), and a single, constant, pitch value. Pitch values can be quantized (e.g. to a semitone grid tuned to 440 Hz), but do not have to be. Also, the transcription can contain the notes of a single instrument or voice (for example the melody), or the notes of all instruments/voices in the recording. This module is instrument agnostic: all notes in the estimate are compared against all notes in the reference.

There are many metrics for evaluating transcription algorithms. Here we limit ourselves to the simplest and most commonly used: given two sets of notes, we count how many estimated notes match the reference, and how many do not. Based on these counts we compute the precision, recall, F-measure and overlap ratio of the estimate given the reference. The default criteria for considering two notes to be a match are adopted from the MIREX Multiple Fundamental Frequency Estimation and Tracking task, Note Tracking subtask (task 2):

“This subtask is evaluated in two different ways. In the first setup, a returned note is assumed correct if its onset is within +-50ms of a reference note and its F0 is within +- quarter tone of the corresponding reference note, ignoring the returned offset values. In the second setup, on top of the above requirements, a correct returned note is required to have an offset value within 20% of the reference note’s duration around the reference note’s offset, or within 50ms whichever is larger.”

In short, we compute precision, recall, F-measure and overlap ratio twice: once ignoring note offsets, and once taking them into account.
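
The two MIREX setups quoted above can be sketched as a single predicate on one reference/estimate pair. This is only an illustration in plain Python, not mir_eval's implementation: the function name notes_match and the (onset, offset, pitch_hz) tuple layout are assumptions made for this example.

```python
import math

def notes_match(ref, est, onset_tol=0.05, pitch_tol_cents=50.0,
                offset_ratio=0.2, offset_min_tol=0.05):
    """Return True if est matches ref; each note is (onset, offset, pitch_hz)."""
    ref_on, ref_off, ref_hz = ref
    est_on, est_off, est_hz = est
    # Onset must be within +-onset_tol seconds.
    if abs(est_on - ref_on) > onset_tol:
        return False
    # Pitch distance in cents: 1200 * |log2(f_est / f_ref)|.
    if abs(1200.0 * math.log2(est_hz / ref_hz)) > pitch_tol_cents:
        return False
    # First setup: offset_ratio=None means offsets are ignored.
    if offset_ratio is None:
        return True
    # Second setup: 20% of the reference duration or 50 ms, whichever is larger.
    offset_tol = max(offset_ratio * (ref_off - ref_on), offset_min_tol)
    return abs(est_off - ref_off) <= offset_tol

ref_note = (1.00, 2.00, 440.0)
est_note = (1.02, 2.30, 445.0)   # good onset and pitch, offset 300 ms late
print(notes_match(ref_note, est_note, offset_ratio=None))  # onset + pitch only
print(notes_match(ref_note, est_note))                     # offset included
```

The same estimated note passes the first setup and fails the second, which is why the two sets of scores can differ noticeably for algorithms that estimate offsets poorly.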

For further details see Salamon, 2013 (page 186), and references therein:

Salamon, J. (2013). Melody Extraction from Polyphonic Music Signals. Ph.D. thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2013.

IMPORTANT NOTE: the evaluation code in mir_eval contains several important differences with respect to the code used in MIREX 2015 for the Note Tracking subtask on the Su dataset (henceforth “MIREX”):

  1. mir_eval uses bipartite graph matching to find the optimal pairing of reference notes to estimated notes. MIREX uses a greedy matching algorithm, which can produce sub-optimal note matching. This will result in mir_eval’s metrics being slightly higher compared to MIREX.

  2. MIREX rounds down the onset and offset times of each note to 2 decimal places using new_time = 0.01 * floor(time*100). mir_eval rounds down the note onset and offset times to 4 decimal places. This will bring our metrics down a notch compared to the MIREX results.

  3. In the MIREX wiki, the criterion for matching offsets is that they must be within 0.2 * ref_duration or 0.05 seconds from each other, whichever is greater (i.e. offset_dif <= max(0.2 * ref_duration, 0.05)). The MIREX code however only uses a threshold of 0.2 * ref_duration, without the 0.05 second minimum. Since mir_eval does include this minimum, it might produce slightly higher results compared to MIREX.

This means that differences 1 and 3 bring mir_eval’s metrics up compared to MIREX, whilst 2 brings them down. Based on internal testing, overall the effect of these three differences is that the Precision, Recall and F-measure returned by mir_eval will be higher compared to MIREX by about 1%-2%.
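
The rounding described in difference 2 can be illustrated with a small helper. This is a sketch in plain Python: round_down is a hypothetical name introduced here, and mir_eval's own rounding code may be written differently.

```python
import math

def round_down(time, decimals):
    """Round a time in seconds down (floor) to the given number of decimals."""
    scale = 10 ** decimals
    return math.floor(time * scale) / scale

onset = 1.23456
print(round_down(onset, 2))  # MIREX-style: 2 decimal places
print(round_down(onset, 4))  # mir_eval-style: 4 decimal places
```

With 2 decimal places every time lands on a 10 ms grid, so two notes can move up to 10 ms closer together or further apart before the tolerance check is applied; with 4 decimal places the shift is at most 0.1 ms.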

Finally, note that different evaluation scripts have been used for the Multi-F0 Note Tracking task in MIREX over the years. In particular, some scripts used < for matching onsets, offsets, and pitch values, whilst the others used <= for these checks. mir_eval provides both options: by default the latter (<=) is used, but you can set strict=True when calling mir_eval.transcription.precision_recall_f1_overlap() in which case < will be used. The default value (strict=False) is the same as that used in MIREX 2015 for the Note Tracking subtask on the Su dataset.

Conventions

Notes should be provided in the form of an interval array and a pitch array. The interval array contains two columns, one for note onsets and the second for note offsets (each row represents a single note). The pitch array contains one column with the corresponding note pitch values (one value per note), represented by their fundamental frequency (f0) in Hertz.
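
As a minimal sketch of this layout, the snippet below builds the two structures for three made-up notes using plain Python lists. The note values and the helper midi_to_hz are assumptions for illustration; in practice the lists would typically be wrapped with numpy.asarray to obtain arrays of shape (n, 2) and (n,).

```python
# Three notes given as (onset_s, offset_s, midi_pitch); values are made up.
notes = [(0.50, 1.00, 60), (1.00, 1.75, 64), (1.80, 2.40, 67)]

def midi_to_hz(midi_pitch, a4_hz=440.0):
    """Convert a MIDI note number to Hz on an equal-tempered grid (A4 = MIDI 69)."""
    return a4_hz * 2.0 ** ((midi_pitch - 69) / 12.0)

# Interval array: one row per note, columns are onset and offset (seconds).
intervals = [[onset, offset] for onset, offset, _ in notes]
# Pitch array: one f0 value in Hz per note, in the same order as the rows.
pitches = [midi_to_hz(midi) for _, _, midi in notes]

print(intervals)
print([round(p, 2) for p in pitches])
```

Row k of the interval array and element k of the pitch array always describe the same note, so the two must have the same length.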

Metrics

  • mir_eval.transcription.precision_recall_f1_overlap(): The precision, recall, F-measure, and Average Overlap Ratio of the note transcription, where an estimated note is considered correct if its pitch, onset and (optionally) offset are sufficiently close to a reference note.

  • mir_eval.transcription.onset_precision_recall_f1(): The precision, recall and F-measure of the note transcription, where an estimated note is considered correct if its onset is sufficiently close to a reference note’s onset. That is, these metrics are computed taking only note onsets into account, meaning two notes could be matched even if they have very different pitch values.

  • mir_eval.transcription.offset_precision_recall_f1(): The precision, recall and F-measure of the note transcription, where an estimated note is considered correct if its offset is sufficiently close to a reference note’s offset. That is, these metrics are computed taking only note offsets into account, meaning two notes could be matched even if they have very different pitch values.

mir_eval.transcription.validate(ref_intervals, ref_pitches, est_intervals, est_pitches)

Checks that the input annotations to a metric look like time intervals and a pitch list, and throws helpful errors if not.

Parameters:
ref_intervals : np.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

ref_pitches : np.ndarray, shape=(n,)

Array of reference pitch values in Hertz

est_intervals : np.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

est_pitches : np.ndarray, shape=(m,)

Array of estimated pitch values in Hertz

mir_eval.transcription.validate_intervals(ref_intervals, est_intervals)

Checks that the input annotations to a metric look like time intervals, and throws helpful errors if not.

Parameters:
ref_intervals : np.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

est_intervals : np.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

mir_eval.transcription.match_note_offsets(ref_intervals, est_intervals, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False)

Compute a maximum matching between reference and estimated notes, only taking note offsets into account.

Given two note sequences represented by ref_intervals and est_intervals (see mir_eval.io.load_valued_intervals()), we seek the largest set of correspondences (i, j) such that the offset of reference note i is within offset_tolerance of the offset of estimated note j, where offset_tolerance is equal to offset_ratio times the reference note’s duration, i.e. offset_ratio * ref_duration[i] where ref_duration[i] = ref_intervals[i, 1] - ref_intervals[i, 0]. If the resulting offset_tolerance is less than offset_min_tolerance (50 ms by default) then offset_min_tolerance is used instead.

Every reference note is matched against at most one estimated note.

Note there are separate functions match_note_onsets() and match_notes() for matching notes based on onsets only or based on onset, offset, and pitch, respectively. This is because the rules for matching note onsets and matching note offsets are different.
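
The offset tolerance rule described above amounts to a per-note maximum. The following plain-Python sketch shows the arithmetic; the helper names offset_tolerances and offsets_agree are hypothetical and are not part of mir_eval.

```python
def offset_tolerances(ref_intervals, offset_ratio=0.2, offset_min_tolerance=0.05):
    """Per-note offset tolerance in seconds for a list of [onset, offset] rows."""
    return [max(offset_ratio * (offset - onset), offset_min_tolerance)
            for onset, offset in ref_intervals]

def offsets_agree(ref_offset, est_offset, tolerance, strict=False):
    """Compare two offsets using < when strict is True and <= otherwise."""
    distance = abs(est_offset - ref_offset)
    return distance < tolerance if strict else distance <= tolerance

ref_intervals = [[0.0, 0.1], [1.0, 3.0]]   # a 100 ms note and a 2 s note
print(offset_tolerances(ref_intervals))
```

For the 100 ms note, 20% of the duration is only 20 ms, so the 50 ms minimum applies; for the 2 s note the tolerance grows to 400 ms. This duration dependence is why offsets need a different rule from onsets.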

Parameters:
ref_intervals : np.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

est_intervals : np.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

offset_ratio : float > 0

The ratio of the reference note’s duration used to define the offset_tolerance. Default is 0.2 (20%), meaning the offset_tolerance will equal the ref_duration * 0.2, or offset_min_tolerance (0.05 by default, i.e. 50 ms), whichever is greater.

offset_min_tolerance : float > 0

The minimum tolerance for offset matching. See offset_ratio description for an explanation of how the offset tolerance is determined.

strict : bool

If strict=False (the default), threshold checks for offset matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

Returns:
matching : list of tuples

A list of matched reference and estimated notes. Each element is a tuple (i, j), meaning reference note i matches estimated note j.

mir_eval.transcription.match_note_onsets(ref_intervals, est_intervals, onset_tolerance=0.05, strict=False)

Compute a maximum matching between reference and estimated notes, only taking note onsets into account.

Given two note sequences represented by ref_intervals and est_intervals (see mir_eval.io.load_valued_intervals()), we seek the largest set of correspondences (i, j) such that the onset of reference note i is within onset_tolerance of the onset of estimated note j.

Every reference note is matched against at most one estimated note.

Note there are separate functions match_note_offsets() and match_notes() for matching notes based on offsets only or based on onset, offset, and pitch, respectively. This is because the rules for matching note onsets and matching note offsets are different.
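
To see why a maximum matching can find more pairs than a greedy pass, consider the toy sketch below. It is written in plain Python under stated assumptions: the greedy rule shown is one hypothetical strategy (not the MIREX code), and the augmenting-path routine is a textbook maximum bipartite matching, not necessarily how mir_eval implements it.

```python
def compatible(ref_onsets, est_onsets, tol=0.05):
    """For each reference note i, list the estimated notes j within tol seconds."""
    return [[j for j, est in enumerate(est_onsets) if abs(ref - est) <= tol]
            for ref in ref_onsets]

def greedy_matching(ref_onsets, est_onsets, tol=0.05):
    """Hypothetical greedy rule: each estimate grabs its closest free reference."""
    taken, pairs = set(), []
    for j, est in enumerate(est_onsets):
        free = [i for i, ref in enumerate(ref_onsets)
                if i not in taken and abs(ref - est) <= tol]
        if free:
            i = min(free, key=lambda k: abs(ref_onsets[k] - est))
            taken.add(i)
            pairs.append((i, j))
    return sorted(pairs)

def maximum_matching(ref_onsets, est_onsets, tol=0.05):
    """Maximum bipartite matching via augmenting paths (Kuhn's algorithm)."""
    adj = compatible(ref_onsets, est_onsets, tol)
    ref_of_est = {}  # estimated index -> reference index currently paired with it

    def try_assign(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take j if it is free, or if its current partner can move elsewhere.
            if j not in ref_of_est or try_assign(ref_of_est[j], seen):
                ref_of_est[j] = i
                return True
        return False

    for i in range(len(ref_onsets)):
        try_assign(i, set())
    return sorted((i, j) for j, i in ref_of_est.items())

ref_onsets = [0.00, 0.04]
est_onsets = [0.04, 0.08]
print(greedy_matching(ref_onsets, est_onsets))   # one pair
print(maximum_matching(ref_onsets, est_onsets))  # two pairs
```

In this example the first estimate takes the second reference note because it is closest, leaving the second estimate with no partner in range. The maximum matching instead pairs the notes in order and recovers both.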

Parameters:
ref_intervals : np.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

est_intervals : np.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

onset_tolerance : float > 0

The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).

strict : bool

If strict=False (the default), threshold checks for onset matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

Returns:
matching : list of tuples

A list of matched reference and estimated notes. Each element is a tuple (i, j), meaning reference note i matches estimated note j.

mir_eval.transcription.match_notes(ref_intervals, ref_pitches, est_intervals, est_pitches, onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False)

Compute a maximum matching between reference and estimated notes, subject to onset, pitch and (optionally) offset constraints.

Given two note sequences represented by ref_intervals, ref_pitches, est_intervals and est_pitches (see mir_eval.io.load_valued_intervals()), we seek the largest set of correspondences (i, j) such that:

  1. The onset of reference note i is within onset_tolerance of the onset of estimated note j.

  2. The pitch of reference note i is within pitch_tolerance of the pitch of estimated note j.

  3. If offset_ratio is not None, the offset of reference note i is within offset_tolerance of the offset of estimated note j, where offset_tolerance is equal to offset_ratio times the reference note’s duration, i.e. offset_ratio * ref_duration[i] where ref_duration[i] = ref_intervals[i, 1] - ref_intervals[i, 0]. If the resulting offset_tolerance is less than offset_min_tolerance (50 ms by default), offset_min_tolerance is used instead.

  4. If offset_ratio is None, note offsets are ignored, and only criteria 1 and 2 are taken into consideration.

Every reference note is matched against at most one estimated note.

This is useful for computing precision/recall metrics for note transcription.

Note there are separate functions match_note_onsets() and match_note_offsets() for matching notes based on onsets only or based on offsets only, respectively.

Parameters:
ref_intervals : np.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

ref_pitches : np.ndarray, shape=(n,)

Array of reference pitch values in Hertz

est_intervals : np.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

est_pitches : np.ndarray, shape=(m,)

Array of estimated pitch values in Hertz

onset_tolerance : float > 0

The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).

pitch_tolerance : float > 0

The tolerance for an estimated note’s pitch deviating from the reference note’s pitch, in cents. Default is 50.0 (50 cents).

offset_ratio : float > 0 or None

The ratio of the reference note’s duration used to define the offset_tolerance. Default is 0.2 (20%), meaning the offset_tolerance will equal the ref_duration * 0.2, or offset_min_tolerance (0.05 by default, i.e. 50 ms), whichever is greater. If offset_ratio is set to None, offsets are ignored in the matching.

offset_min_tolerance : float > 0

The minimum tolerance for offset matching. See offset_ratio description for an explanation of how the offset tolerance is determined. Note: this parameter only influences the results if offset_ratio is not None.

strict : bool

If strict=False (the default), threshold checks for onset, offset, and pitch matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

Returns:
matching : list of tuples

A list of matched reference and estimated notes. Each element is a tuple (i, j), meaning reference note i matches estimated note j.
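
Because the returned list holds one (i, j) pair per matched note, unmatched notes on either side can be recovered from it. The sketch below uses a made-up matching rather than a real call, so the indices and note counts are assumptions for illustration only.

```python
# A matching in the documented format: (reference index, estimated index) pairs.
matching = [(0, 0), (2, 1)]   # made-up result for 3 reference and 3 estimated notes
n_ref, n_est = 3, 3

matched_ref = {i for i, _ in matching}
matched_est = {j for _, j in matching}

# Reference notes with no partner are misses; unpaired estimates are false alarms.
missed = [i for i in range(n_ref) if i not in matched_ref]
false_alarms = [j for j in range(n_est) if j not in matched_est]

print(missed)
print(false_alarms)
```

Note that the position of a pair in the list is not the reference index: here the second element refers to reference note 2, because reference note 1 has no match.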

mir_eval.transcription.precision_recall_f1_overlap(ref_intervals, ref_pitches, est_intervals, est_pitches, onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False, beta=1.0)

Compute the Precision, Recall and F-measure of correct vs incorrectly transcribed notes, and the Average Overlap Ratio for correctly transcribed notes (see average_overlap_ratio()). “Correctness” is determined based on note onset, pitch and (optionally) offset: an estimated note is assumed correct if its onset is within +-50ms of a reference note and its pitch (F0) is within +- quarter tone (50 cents) of the corresponding reference note. If offset_ratio is None, note offsets are ignored in the comparison. Otherwise, on top of the above requirements, a correct returned note is required to have an offset value within 20% (by default, adjustable via the offset_ratio parameter) of the reference note’s duration around the reference note’s offset, or within offset_min_tolerance (50 ms by default), whichever is larger.

Parameters:
ref_intervals : np.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

ref_pitches : np.ndarray, shape=(n,)

Array of reference pitch values in Hertz

est_intervals : np.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

est_pitches : np.ndarray, shape=(m,)

Array of estimated pitch values in Hertz

onset_tolerance : float > 0

The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).

pitch_tolerance : float > 0

The tolerance for an estimated note’s pitch deviating from the reference note’s pitch, in cents. Default is 50.0 (50 cents).

offset_ratio : float > 0 or None

The ratio of the reference note’s duration used to define the offset_tolerance. Default is 0.2 (20%), meaning the offset_tolerance will equal the ref_duration * 0.2, or offset_min_tolerance (0.05 by default, i.e. 50 ms), whichever is greater. If offset_ratio is set to None, offsets are ignored in the evaluation.

offset_min_tolerance : float > 0

The minimum tolerance for offset matching. See offset_ratio description for an explanation of how the offset tolerance is determined. Note: this parameter only influences the results if offset_ratio is not None.

strict : bool

If strict=False (the default), threshold checks for onset, offset, and pitch matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

beta : float > 0

Weighting factor for f-measure (default value = 1.0).

Returns:
precision : float

The computed precision score

recall : float

The computed recall score

f_measure : float

The computed F-measure score

avg_overlap_ratio : float

The computed Average Overlap Ratio score
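
The first three return values follow the usual definitions in terms of note counts. The sketch below shows the arithmetic in plain Python; precision_recall_f is a hypothetical helper, and returning 0.0 for empty or unmatched inputs is an assumption of this example rather than a statement about mir_eval's behaviour.

```python
def precision_recall_f(n_matched, n_ref, n_est, beta=1.0):
    """Precision, recall and F-measure from note counts (0.0 when undefined)."""
    precision = n_matched / n_est if n_est else 0.0
    recall = n_matched / n_ref if n_ref else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    # Weighted harmonic mean: beta > 1 favours recall, beta < 1 favours precision.
    f_measure = ((1 + beta ** 2) * precision * recall
                 / (beta ** 2 * precision + recall))
    return precision, recall, f_measure

# 8 matched notes out of 10 reference notes and 16 estimated notes.
print(precision_recall_f(8, 10, 16))
print(precision_recall_f(8, 10, 16, beta=2.0))
```

With these counts, precision is 0.5 and recall is 0.8; raising beta to 2 moves the F-measure closer to the recall value.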

Examples

>>> import mir_eval
>>> ref_intervals, ref_pitches = mir_eval.io.load_valued_intervals(
...     'reference.txt')
>>> est_intervals, est_pitches = mir_eval.io.load_valued_intervals(
...     'estimated.txt')
>>> (precision,
...  recall,
...  f_measure,
...  avg_overlap_ratio) = mir_eval.transcription.precision_recall_f1_overlap(
...      ref_intervals, ref_pitches, est_intervals, est_pitches)
>>> (precision_no_offset,
...  recall_no_offset,
...  f_measure_no_offset,
...  avg_overlap_ratio_no_offset) = (
...      mir_eval.transcription.precision_recall_f1_overlap(
...          ref_intervals, ref_pitches, est_intervals, est_pitches,
...          offset_ratio=None))

mir_eval.transcription.average_overlap_ratio(ref_intervals, est_intervals, matching)

Compute the Average Overlap Ratio between a reference and estimated note transcription. Given a reference and corresponding estimated note, their overlap ratio (OR) is defined as the ratio between the duration of the time segment in which the two notes overlap and the time segment spanned by the two notes combined (earliest onset to latest offset):

>>> OR = ((min(ref_offset, est_offset) - max(ref_onset, est_onset)) /
...     (max(ref_offset, est_offset) - min(ref_onset, est_onset)))

The Average Overlap Ratio (AOR) is given by the mean OR computed over all matching reference and estimated notes. The metric goes from 0 (worst) to 1 (best).

Note: this function assumes the matching of reference and estimated notes (see match_notes()) has already been performed and is provided by the matching parameter. Furthermore, it is highly recommended to validate the intervals (see validate_intervals()) before calling this function, otherwise it is possible (though unlikely) for this function to attempt a divide-by-zero operation.
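
A small worked example may help make the definition concrete. The sketch below applies the formula above to two made-up note pairs in plain Python; overlap_ratio and mean_overlap_ratio are hypothetical helpers, and returning 0.0 for an empty matching is an assumption of this example.

```python
def overlap_ratio(ref_interval, est_interval):
    """Overlap ratio of two [onset, offset] intervals, following the formula above."""
    ref_onset, ref_offset = ref_interval
    est_onset, est_offset = est_interval
    overlap = min(ref_offset, est_offset) - max(ref_onset, est_onset)
    span = max(ref_offset, est_offset) - min(ref_onset, est_onset)
    return overlap / span

def mean_overlap_ratio(ref_intervals, est_intervals, matching):
    """Mean overlap ratio over the (ref index, est index) pairs in matching."""
    ratios = [overlap_ratio(ref_intervals[i], est_intervals[j])
              for i, j in matching]
    return sum(ratios) / len(ratios) if ratios else 0.0

ref_intervals = [[0.0, 4.0], [6.0, 8.0]]
est_intervals = [[1.0, 5.0], [6.0, 8.0]]
matching = [(0, 0), (1, 1)]

print(overlap_ratio(ref_intervals[0], est_intervals[0]))  # 3 s overlap over a 5 s span
print(mean_overlap_ratio(ref_intervals, est_intervals, matching))
```

The first pair overlaps for 3 of the 5 seconds it spans (ratio 0.6), the second pair coincides exactly (ratio 1.0), so the average is 0.8. The division fails only when both notes have zero duration at the same time, which is the case that interval validation is meant to rule out.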

Parameters:
ref_intervals : np.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

est_intervals : np.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

matching : list of tuples

A list of matched reference and estimated notes. Each element is a tuple (i, j), meaning reference note i matches estimated note j.

Returns:
avg_overlap_ratio : float

The computed Average Overlap Ratio score

mir_eval.transcription.onset_precision_recall_f1(ref_intervals, est_intervals, onset_tolerance=0.05, strict=False, beta=1.0)

Compute the Precision, Recall and F-measure of note onsets: an estimated onset is considered correct if it is within onset_tolerance (50 ms by default) of a reference onset. Note that this metric completely ignores note offset and note pitch. This means an estimated onset will be considered correct if it matches a reference onset, even if the onsets come from notes with completely different pitches (i.e. notes that would not match with match_notes()).

Parameters:
ref_intervals : np.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

est_intervals : np.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

onset_tolerance : float > 0

The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).

strict : bool

If strict=False (the default), threshold checks for onset matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

beta : float > 0

Weighting factor for f-measure (default value = 1.0).

Returns:
precision : float

The computed precision score

recall : float

The computed recall score

f_measure : float

The computed F-measure score

Examples

>>> import mir_eval
>>> ref_intervals, _ = mir_eval.io.load_valued_intervals(
...     'reference.txt')
>>> est_intervals, _ = mir_eval.io.load_valued_intervals(
...     'estimated.txt')
>>> (onset_precision,
...  onset_recall,
...  onset_f_measure) = mir_eval.transcription.onset_precision_recall_f1(
...      ref_intervals, est_intervals)

mir_eval.transcription.offset_precision_recall_f1(ref_intervals, est_intervals, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False, beta=1.0)

Compute the Precision, Recall and F-measure of note offsets: an estimated offset is considered correct if it is within offset_min_tolerance (50 ms by default) of a reference offset, or within offset_ratio (20% by default) of the reference note’s duration, whichever is greater. Note that this metric completely ignores note onsets and note pitch. This means an estimated offset will be considered correct if it matches a reference offset, even if the offsets come from notes with completely different pitches (i.e. notes that would not match with match_notes()).

Parameters:
ref_intervals : np.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

est_intervals : np.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

offset_ratio : float > 0 or None

The ratio of the reference note’s duration used to define the offset_tolerance. Default is 0.2 (20%), meaning the offset_tolerance will equal the ref_duration * 0.2, or offset_min_tolerance (0.05 by default, i.e. 50 ms), whichever is greater.

offset_min_tolerance : float > 0

The minimum tolerance for offset matching. See offset_ratio description for an explanation of how the offset tolerance is determined.

strict : bool

If strict=False (the default), threshold checks for offset matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

beta : float > 0

Weighting factor for f-measure (default value = 1.0).

Returns:
precision : float

The computed precision score

recall : float

The computed recall score

f_measure : float

The computed F-measure score

Examples

>>> import mir_eval
>>> ref_intervals, _ = mir_eval.io.load_valued_intervals(
...     'reference.txt')
>>> est_intervals, _ = mir_eval.io.load_valued_intervals(
...     'estimated.txt')
>>> (offset_precision,
...  offset_recall,
...  offset_f_measure) = mir_eval.transcription.offset_precision_recall_f1(
...      ref_intervals, est_intervals)

mir_eval.transcription.evaluate(ref_intervals, ref_pitches, est_intervals, est_pitches, **kwargs)

Compute all metrics for the given reference and estimated annotations.

Parameters:
ref_intervals : np.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

ref_pitches : np.ndarray, shape=(n,)

Array of reference pitch values in Hertz

est_intervals : np.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

est_pitches : np.ndarray, shape=(m,)

Array of estimated pitch values in Hertz

**kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns:
scores : dict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

Examples

>>> import mir_eval
>>> ref_intervals, ref_pitches = mir_eval.io.load_valued_intervals(
...    'reference.txt')
>>> est_intervals, est_pitches = mir_eval.io.load_valued_intervals(
...    'estimated.txt')
>>> scores = mir_eval.transcription.evaluate(ref_intervals, ref_pitches,
...     est_intervals, est_pitches)