ColossalAI/applications/ColossalQA/tests/test_text_splitter.py

from colossalqa.text_splitter.chinese_text_splitter import ChineseTextSplitter


def test_text_splitter():
    # Unit test: split a Chinese paragraph into sentence-aligned chunks.
    splitter = ChineseTextSplitter(chunk_size=30, chunk_overlap=0)
    out = splitter.split_text(
        "移动端语音唤醒模型检测关键词为“小云小云”。模型主体为4层FSMN结构使用CTC训练准则参数量750K适用于移动端设备运行。模型输入为Fbank特征输出为基于char建模的中文全集token预测测试工具根据每一帧的预测数据进行后处理得到输入音频的实时检测结果。模型训练采用“basetrain + finetune”的模式basetrain过程使用大量内部移动端数据在此基础上使用1万条设备端录制安静场景“小云小云”数据进行微调得到最终面向业务的模型。后续用户可在basetrain模型基础上使用其他关键词数据进行微调得到新的语音唤醒模型但暂时未开放模型finetune功能。"
    )
    print(len(out))
    # ChineseTextSplitter does not break sentences, so a chunk can exceed chunk_size=30.
    assert len(out) == 4