Copy the C:\Users\tellw\AppData\Local\Programs\Python\Python310\Lib\site-packages\jieba directory into the custom-jieba directory.
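This copy step can be done by hand in Explorer or scripted. A minimal sketch with the standard library, using a hypothetical `vendor_package` helper (the source and destination paths are whatever applies on your machine, e.g. the site-packages\jieba directory and .\custom-jieba\jieba):

```python
import shutil

def vendor_package(src: str, dst: str) -> None:
    # Copy an installed package directory into the project tree
    # so it can be modified locally without touching site-packages.
    shutil.copytree(src, dst)
```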

The custom-jieba directory:

main.py

import jieba
import sys
import os
import time

# number of corpus lines to segment per run
limit = 10

with open('done.txt', encoding='utf8') as f:
    ls = f.readlines()

info = [l.split() for l in ls]
ti = None
for i in range(len(info)):
    # a corpus still needs work if its record is incomplete, the file was
    # modified after the last run, or its done flag is not set
    if len(info[i]) < 4 or os.path.getmtime(info[i][0]) > float(info[i][2]) or info[i][3] != '1':
        ti = info[i]
        break
if ti is None:
    print('Nothing to do')
    sys.exit(0)

if len(ti) < 2:
    start = 0
else:
    start = int(ti[1])
jieba.load_userdict('user_dict.txt')
with open(ti[0], encoding='utf8') as f:
    sentences = f.readlines()[start:start + limit]
for sentence in sentences:
    sent_word = list(jieba.cut(sentence))
    print(sent_word)
if len(ti) < 4:
    # first run over this corpus: append progress, timestamp and done flag
    info[i].append(start + len(sentences))
    info[i].append(time.time())
    info[i].append(0)
else:
    info[i][2] = time.time()
    info[i][1] = start + len(sentences)
    if len(sentences) == 0:
        # nothing left to read: mark the corpus as finished
        info[i][3] = 1
    else:
        info[i][3] = 0

with open('done.txt', 'w', encoding='utf8') as f:
    for j in range(len(info)):
        if j == i:
            f.write(f'{info[i][0]} {info[i][1]} {info[i][2]} {info[i][3]}\n')
        else:
            # split() stripped the newline, so put it back
            f.write(' '.join(info[j]) + '\n')

The corpus directory holds the corpus files.

done.txt stores per-corpus bookkeeping: the corpus filename, the line number read so far (e.g. read through line 10), the time the corpus was last processed (when the corpus content is updated, the new content must be segmented), and whether segmentation of the whole corpus is finished.

corpus/转生王女与天才千金.txt 10 1678893879.812169 0
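The decision main.py makes for each done.txt record, whether a corpus still needs work, can be sketched as a standalone predicate (`needs_segmentation` is a hypothetical name; the condition mirrors the one in the loop above):

```python
import os

def needs_segmentation(record: list) -> bool:
    # record is the whitespace-split fields of one done.txt line:
    # [filename, last_line, last_time, done_flag]
    if len(record) < 4:
        return True          # incomplete record: never fully processed
    if os.path.getmtime(record[0]) > float(record[2]):
        return True          # corpus modified after the last run
    return record[3] != '1'  # done flag not set
```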

user_dict.txt follows the structure of jieba's official dictionary file jieba/dict.txt: one entry per line, in the form "word frequency part-of-speech". For the part-of-speech tags, see the previous post, or just imitate jieba/dict.txt. My rule of thumb: for example, the official dictionary has no entry for 突然间 but does have 突然 with frequency 14998 and POS tag ad, so add the line "突然间 14999 ad" to user_dict.txt.
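A quick sanity check for entries in this format can catch malformed lines before loading the dictionary. `is_valid_entry` is a hypothetical helper enforcing this project's three-field convention; note that jieba.load_userdict itself is more lenient and also accepts lines where the frequency and/or POS tag are omitted:

```python
def is_valid_entry(line: str) -> bool:
    # Expects this project's convention: "word freq pos",
    # with freq a positive integer and pos an alphabetic tag.
    parts = line.split()
    if len(parts) != 3:
        return False
    word, freq, pos = parts
    return freq.isdigit() and pos.isalpha()
```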

The project is at tellw / custom-jieba.

Created 2023.3.16/0.11, last modified 2023.3.16/0.11.