This article covers tokenization for LLMs and how to use the sentencepiece package.
- Source code: google/sentencepiece
- Version tested in this article: sentencepiece 0.2.0
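The examples below assume a trained model file chinese_sp.model. For completeness, here is a minimal sketch of how such a model could be trained with spm.SentencePieceTrainer; the corpus path corpus.txt and the training parameters are illustrative assumptions, not values from the original setup.

```python
import sentencepiece as spm

# Minimal training sketch (assumption: corpus.txt is a plain-text file,
# one sentence per line; vocab_size/model_type are illustrative choices)
spm.SentencePieceTrainer.train(
    input='corpus.txt',          # hypothetical training corpus
    model_prefix='chinese_sp',   # produces chinese_sp.model / chinese_sp.vocab
    vocab_size=8000,
    model_type='unigram',        # SentencePiece's default algorithm
    character_coverage=0.9995,   # a common setting for Chinese text
)
```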
Printing the vocabulary and token IDs
Code:

```python
import sentencepiece as spm

def export_vocab_to_file(model_path, output_file):
    # Load the SentencePiece model
    sp = spm.SentencePieceProcessor()
    sp.load(model_path)
    # Open the output file for writing
    with open(output_file, 'w', encoding='utf-8') as f:
        # Iterate over every piece ID in the vocabulary
        for piece_id in range(sp.get_piece_size()):
            piece = sp.id_to_piece(piece_id)
            score = sp.get_score(piece_id)
            # Write the ID, the piece, and its score to the file
            f.write(f'{piece_id}\t{piece}\t{score}\n')

export_vocab_to_file('chinese_sp.model', 'vocab.txt')
```

Format of the resulting vocab.txt:
```
0	<unk>	0.0
1	<s>	0.0
2	</s>	0.0
3	,	-2.814912796020508
4	▁	-3.7144806385040283
5	。	-3.715141534805298
6	的	-3.7526743412017822
7	、	-4.614748001098633
...
```
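As a quick sanity check (a sketch, assuming chinese_sp.model is available), the IDs in vocab.txt line up with what the processor returns when encoding text:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('chinese_sp.model')

text = '你好,世界。'  # hypothetical sample sentence
# Encode to pieces and to IDs; the IDs match the first column of vocab.txt
print(sp.encode(text, out_type=str))
print(sp.encode(text, out_type=int))
# Round-trip: decoding the IDs recovers the original text
print(sp.decode(sp.encode(text, out_type=int)))
```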
Extending an existing vocabulary
- The code is as follows:

```python
import sentencepiece as spm
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

chinese_sp_model_path = './chinese_sp.model'
new_chinese_sp_model_path = './new_chinese_sp.model'

chinese_sp_model = spm.SentencePieceProcessor()
chinese_sp_model.Load(chinese_sp_model_path)
print(len(chinese_sp_model))

# Deserialize the model into a protobuf ModelProto object
chinese_sp_model_mp = sp_pb2_model.ModelProto()
chinese_sp_model_mp.ParseFromString(chinese_sp_model.serialized_model_proto())
tokens_set = set(p.piece for p in chinese_sp_model_mp.pieces)

## Add special tokens to the vocabulary
new_pieces = ['#_#_#', '$$%%##']
for piece in new_pieces:
    if piece not in tokens_set:
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        new_p.piece = piece
        # The higher the score, the higher the matching priority; a score of
        # negative infinity is roughly equivalent to not adding the piece at all
        new_p.score = 0
        # Appending a piece assigns it the next token ID (IDs grow by 1)
        chinese_sp_model_mp.pieces.append(new_p)
        # print(new_p)

## Save
with open(new_chinese_sp_model_path, 'wb') as f:
    f.write(chinese_sp_model_mp.SerializeToString())

## Load
new_chinese_sp_model = spm.SentencePieceProcessor()
new_chinese_sp_model.Load(new_chinese_sp_model_path)
text = '''#_#_#,$$%%##'''
print("Test text:\n", text)
print(f"Tokenized by Chinese-LLaMA tokenizer: {new_chinese_sp_model.tokenize(text)}")
```
Summary of using the sentencepiece package
- The sentencepiece package stores a tokenizer model in a single file, xx.model.
- The model file contains the vocabulary information, including each piece and its score. Pieces are stored in order, so the ordering implicitly encodes the token IDs.
- Pieces are sorted by score from high to low, and scores are generally less than or equal to 0.
- To add tokens to the vocabulary, construct a SentencePiece object (its two key fields are piece and score) and append it to the tail of the deserialized pieces list, as shown in the code above.
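These points can be checked directly on the deserialized ModelProto. The snippet below is a sketch assuming the chinese_sp.model from the earlier examples:

```python
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

sp = spm.SentencePieceProcessor()
sp.load('chinese_sp.model')

mp = sp_pb2_model.ModelProto()
mp.ParseFromString(sp.serialized_model_proto())

# Each SentencePiece message carries the two key fields: piece and score
for i, p in enumerate(mp.pieces[:8]):
    print(i, p.piece, p.score)

# Skipping the control tokens (<unk>, <s>, </s>), this prints True if the
# vocabulary is sorted by score from high to low, as described above
scores = [p.score for p in mp.pieces[3:]]
print(all(a >= b for a, b in zip(scores, scores[1:])))
```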