NLP - LLM - Using the sentencepiece Package

This article introduces tokenization for LLMs and how to use the sentencepiece package.
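
As a quick orientation before the examples below, here is a minimal sketch of loading a trained SentencePiece model and encoding/decoding text with it. It assumes a trained model file named chinese_sp.model, the same file used throughout this article, and an arbitrary example sentence.

    import sentencepiece as spm

    # Load a trained model (assumed to already exist, e.g. produced by spm.SentencePieceTrainer)
    sp = spm.SentencePieceProcessor()
    sp.load('chinese_sp.model')

    text = '自然语言处理'
    print(sp.encode_as_pieces(text))              # subword pieces (depend on the trained model)
    print(sp.encode_as_ids(text))                 # the corresponding token IDs
    print(sp.decode_ids(sp.encode_as_ids(text)))  # round-trip back to text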


Printing the vocabulary encoding

  • Code

    import sentencepiece as spm

    def export_vocab_to_file(model_path, output_file):
        # Load the SentencePiece model
        sp = spm.SentencePieceProcessor()
        sp.load(model_path)

        # Open the output file for writing
        with open(output_file, 'w', encoding='utf-8') as f:
            # Iterate over every piece in the vocabulary together with its ID
            for piece_id in range(sp.get_piece_size()):
                piece = sp.id_to_piece(piece_id)
                score = sp.get_score(piece_id)
                # Write the ID, the piece, and its score to the file
                f.write(f'{piece_id}\t{piece}\t{score}\n')

    export_vocab_to_file('chinese_sp.model', 'vocab.txt')
  • vocab.txt file format

    0	<unk>	0.0
    1	<s>	0.0
    2	</s>	0.0
    3	,	-2.814912796020508
    ...
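
Besides dumping the whole vocabulary, the processor also exposes per-entry lookups, which are handy for spot-checking single tokens. A minimal sketch, assuming the same chinese_sp.model:

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.load('chinese_sp.model')

    print(sp.get_piece_size())      # total vocabulary size
    print(sp.piece_to_id('<unk>'))  # look up the ID of a given piece
    print(sp.id_to_piece(3))        # look up the piece stored at a given ID
    print(sp.get_score(3))          # and its score, matching the vocab.txt dump above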

Extending an existing vocabulary

  • Code:
    import sentencepiece as spm
    import os
    os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

    chinese_sp_model_path = './chinese_sp.model'
    new_chinese_sp_model_path = './new_chinese_sp.model'
    chinese_sp_model = spm.SentencePieceProcessor()
    chinese_sp_model.Load(chinese_sp_model_path)

    print(len(chinese_sp_model))

    chinese_sp_model_mp = sp_pb2_model.ModelProto()
    chinese_sp_model_mp.ParseFromString(chinese_sp_model.serialized_model_proto())
    tokens_set = set(p.piece for p in chinese_sp_model_mp.pieces)

    ## Add special tokens to the vocabulary
    new_pieces = ['#_#_#', '$$%%##']
    for piece in new_pieces:
        if piece not in tokens_set:
            new_p = sp_pb2_model.ModelProto().SentencePiece()
            new_p.piece = piece
            # A larger score means a higher matching priority; a score of negative
            # infinity is roughly equivalent to not adding the piece at all
            new_p.score = 0
            # Appending a piece assigns it the next token ID (IDs increase by 1)
            chinese_sp_model_mp.pieces.append(new_p)
            # print(new_p)

    ## Save
    with open(new_chinese_sp_model_path, 'wb') as f:
        f.write(chinese_sp_model_mp.SerializeToString())

    ## Load
    new_chinese_sp_model = spm.SentencePieceProcessor()
    new_chinese_sp_model.Load(new_chinese_sp_model_path)
    text = '''#_#_#,$$%%##'''
    print("Test text:\n", text)
    print(f"Tokenized by Chinese-LLaMA tokenizer:{new_chinese_sp_model.tokenize(text)}")
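
After saving, it is worth checking that the appended pieces really received the next token IDs at the tail of the vocabulary. A minimal sketch of that check, reusing new_pieces, chinese_sp_model and new_chinese_sp_model from the code above:

    # The new vocabulary should be larger by exactly len(new_pieces)
    print(len(chinese_sp_model), '->', len(new_chinese_sp_model))

    # Each appended piece should map to one of the newly assigned IDs at the tail
    for piece in new_pieces:
        print(piece, '->', new_chinese_sp_model.piece_to_id(piece))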

Summary of using the sentencepiece package

  • sentencepiece stores a tokenization model in a single file, xx.model
  • The model file contains the vocabulary information, including each piece and its score; the pieces are stored in order, so their positions implicitly define the token IDs
    • Pieces are ordered by score from highest to lowest, and scores are generally less than or equal to 0
  • To add new tokens to the vocabulary, construct SentencePiece objects and append them to the end of the deserialized pieces list
    • A SentencePiece object carries two key fields: piece and score
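
The last two points, that a piece's token ID is simply its position in the stored list and that each SentencePiece entry carries a piece and a score, can be checked directly from the serialized proto. A minimal sketch, assuming the chinese_sp.model file used throughout:

    import os
    os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

    import sentencepiece as spm
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

    sp = spm.SentencePieceProcessor()
    sp.load('chinese_sp.model')

    mp = sp_pb2_model.ModelProto()
    mp.ParseFromString(sp.serialized_model_proto())

    # A piece's token ID is its position in mp.pieces, so both views must agree
    for token_id, p in enumerate(mp.pieces[:10]):
        assert sp.id_to_piece(token_id) == p.piece
        print(token_id, p.piece, p.score)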