Replies: 6 comments 1 reply
- Hello, did you compare this to …?
- Yeah, with … I have some examples, but being in Finnish they probably don't help much. I have to watch in slow motion to see the difference. I will soon experiment with English. Instruments are suppressed.
  word, lines: lines.mp4
  word, segment: segment.mp4
- That's interesting. The problem with generalizing this is that not all text comes with newline separators, and even when they exist, they might not be in semantically meaningful positions. An alternative is to use both word and sentence splitting, and inserting …
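A minimal sketch of one reading of that idea (assuming the truncated part refers to inserting `<star>` tokens at sentence boundaries before `preprocess_text`; the naive regex split and the helper name are mine, not from the library):

```python
# Sketch: put a <star> marker at each sentence boundary instead of relying
# on newlines. The regex split is a rough assumption; a real sentence
# tokenizer (nltk, pysbd, ...) would be more robust.
import re

def star_at_sentence_boundaries(text: str) -> str:
    # Split on ., ! or ? followed by whitespace, keeping the punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "<star>".join(s for s in sentences if s)

print(star_at_sentence_boundaries("First sentence. Second one! Third?"))
# -> First sentence.<star>Second one!<star>Third?
```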
- Hmm... In my use case the verses are always separated by a newline, and it makes the most sense to just follow the lyrics provider. I think a good solution would be to give the user an option to insert the … What do you mean by "Btw, do you run the audio on the audio before instrument suppression or after it?"?
- It's already doable, you can break down …
- By applying the commit, one can set `star_frequency="custom"`. For example, I wanted to have the `<star>` tokens at every newline:

```python
import re

import torch
from ctc_forced_aligner import (
    load_audio,
    load_alignment_model,
    generate_emissions,
    preprocess_text,
    get_alignments,
    get_spans,
    postprocess_results,
)

audio_path = "test.wav"
text_path = "test.txt"
language = "fi"
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 16

alignment_model, alignment_tokenizer = load_alignment_model(
    device,
    dtype=torch.float32 if device == "cuda" else torch.float32,
)

audio_waveform = load_audio(audio_path, alignment_model.dtype, alignment_model.device)

with open(text_path, "r") as f:
    text = f.read()

# Replace every run of one or more newline characters with <star>
text = re.sub(r"\n+", "<star>", text)
print(text)

emissions, stride = generate_emissions(
    alignment_model, audio_waveform, batch_size=batch_size,
)

tokens_starred, text_starred = preprocess_text(
    text,
    romanize=True,
    language=language,
    star_frequency="custom",  # this must be set to "custom"
)

segments, scores, blank_id = get_alignments(
    emissions,
    tokens_starred,
    alignment_tokenizer,
)

spans = get_spans(tokens_starred, segments, blank_id)
word_timestamps = postprocess_results(text_starred, spans, stride, scores)
print(word_timestamps)
```
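A possible follow-up, not from this thread: grouping the word-level timestamps back into per-line segments. This sketch assumes each entry of `word_timestamps` is a dict with `"text"`, `"start"`, and `"end"` keys and that there is one entry per whitespace-separated word, in original order; check the actual return format of `postprocess_results` before relying on it.

```python
# Hypothetical post-processing: pair the aligned words with the original
# lines by counting whitespace-separated words per line.
with open(text_path, "r") as f:
    lines = [ln for ln in f.read().splitlines() if ln.strip()]

line_segments = []
idx = 0
for line in lines:
    n_words = len(line.split())
    words = word_timestamps[idx: idx + n_words]
    if words:
        line_segments.append({
            "text": line,
            "start": words[0]["start"],
            "end": words[-1]["end"],
        })
    idx += n_words

print(line_segments)
```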
- Hey, I got better accuracy of word-level timestamps for my purposes by adding `<star>` tokens per line (`\n`). My implementation is kinda incoherent, so I didn't bother to PR it. A better implementation would just replace `\n` with `<star>` and somehow escape the `<star>` so that romanization doesn't strip them. In text_utils.py: …