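The open-source project kcws includes two models for Chinese word segmentation with part-of-speech tagging: IDCNN+CRF and BiLSTM+CRF. Their accuracy is about the same; in terms of speed, IDCNN is somewhat faster.
Using the kcws project code
Paper links
BiLSTM+CRF reference paper: http://www.aclweb.org/anthology/N16-1030
IDCNN+CRF reference paper: https://arxiv.org/abs/1702.02098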
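Building, training, and testing kcws
Corpus preparation:
The kcws project ships the 2014 People's Daily corpus; it can be downloaded from GitHub if needed: https://github.com/koth/kcws
To enlarge the corpus, I used both the 1998 and the 2014 People's Daily corpora. The 1998 corpus may be annotated less accurately, so my results were not as good as the demo trained by the kcws author.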
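Build:
Before starting this step, make sure the environment is already configured; for the setup, see my previous post. The build steps here mainly follow the original author's README, with only minor differences. Below I describe the problems I ran into:
Build the training-corpus tool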
bazel build kcws/train:generate_training
This step fails with the following error:
ERROR: /home/m/kcws/kcws/train/BUILD:1:1: Converting to Python 3: kcws/train/generate_training.py failed: 2to3 failed: error executing command bazel-out/host/bin/external/bazel_tools/tools/python/2to3 --no-diffs --nobackups --write --output-dir bazel-out/python3/kcws/train --write-unchanged-files ... (remaining 1 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
Target //kcws/train:generate_training failed to build
Use --verbose_failures to see the command lines of failed build steps.
This happens because the project only works with Python 2.7; building in a Python 3 environment produces this error.
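Since the failure comes from the interpreter version, a small guard can report the mismatch before bazel is invoked. This is only a sketch: the helper name python27_problem is made up for illustration and is not part of kcws, and it simply encodes the Python 2.7 requirement described above.

```python
import sys


def python27_problem(version_info=None):
    """Return None if the interpreter is Python 2.7, else a short message.

    Hypothetical helper for illustration; not part of kcws.
    """
    if version_info is None:
        version_info = sys.version_info
    # Only (major, minor) matters here; the patch level is ignored.
    major, minor = version_info[0], version_info[1]
    if (major, minor) == (2, 7):
        return None
    return "expected Python 2.7, found %d.%d" % (major, minor)


# Report on the running interpreter without aborting.
print(python27_problem() or "interpreter is Python 2.7")
```

Running this with the interpreter that bazel will pick up shows right away whether the 2to3 conversion step is likely to be triggered.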
Build the test interface
bazel build kcws/cc:seg_backend_api
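I did not run this step, because the target is a service and is awkward to integrate into a project. Instead I borrowed another developer's approach: rewrite seg_backend_api.cc, then compile and run it locally. For details see https://github.com/forever1dream/cplus-kcws
Compiling the interface locally produced these errors:
./kcws/cc/seg_backend_api.cc:14:23: fatal error: base/base.h: No such file or directory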
#include "base/base.h"
^
compilation terminated.
./kcws/cc/pos_tagger.cc:10:23: fatal error: base/base.h: No such file or directory
#include "base/base.h"
^
compilation terminated.
In file included from ./kcws/cc/sentence_breaker.cc:10:0:
./kcws/cc/sentence_breaker.h:16:37: fatal error: utils/basic_string_util.h: No such file or directory
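#include "utils/basic_string_util.h"
The fix is to add the line -I./ \ to build.sh, that is, to add the repository root to the include path. After compiling with that developer's build script, however, running ./seg_backend_api failed with an error saying that libtensorflow_cc.so does not exist, even though the path was written correctly and the file really was there. After some searching I found that the directory of a shared library also has to be made known to the loader at run time, not only to the linker, so the link line needs: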
-L ../tensorflow/bazel-bin/tensorflow -ltensorflow_cc -Wl,-rpath=../tensorflow/bazel-bin/tensorflow \
Note: never put a space after the comma, otherwise the build fails with an error saying the rpath command cannot be found.
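Why the space matters becomes clear when looking at how the command line is split into arguments. The sketch below uses Python's shlex module to tokenize the flag the way a POSIX shell would. The helper name tensorflow_link_flags is invented for illustration, and the code uses the -rpath spelling that GNU ld expects.

```python
import shlex


def tensorflow_link_flags(lib_dir):
    """Build linker flags for a libtensorflow_cc.so located in lib_dir.

    Hypothetical helper for illustration; not part of kcws or its build script.
    """
    return [
        "-L" + lib_dir,           # where the linker finds the library at build time
        "-ltensorflow_cc",        # link against libtensorflow_cc.so
        "-Wl,-rpath=" + lib_dir,  # where the loader looks for it at run time
    ]


lib_dir = "../tensorflow/bazel-bin/tensorflow"
good = "-Wl,-rpath=" + lib_dir
bad = "-Wl, -rpath=" + lib_dir  # note the space after the comma

print(shlex.split(good))  # a single argument, handed through to the linker
print(shlex.split(bad))   # two arguments: "-Wl," and a stray "-rpath=..."
```

With the space, the compiler driver no longer sees -rpath as part of the -Wl option, which matches the error described above.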
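With this, the test interface can be run. For the record, the training process itself raised no errors for me.
Testing a sample, however, failed, which came as a bolt from the blue:
please input query:我们是中国人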
2019-07-03 09:25:28.294904: E tensorflow/core/common_runtime/executor.cc:645] Executor failed to create kernel. Invalid argument: NodeDef mentions attr 'batch_dims' not in Op<name=GatherV2; signature=params:Tparams, indices:Tindices, axis:Taxis -> output:Tparams; attr=Tparams:type; attr=Tindices:type,allowed=[DT_INT32, DT_INT64]; attr=Taxis:type,allowed=[DT_INT32, DT_INT64]>; NodeDef: embedding_lookup_1 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@words"], _output_shapes=[[?,80,50]], batch_dims=0, _device="/job:localhost/replica:0/task:0/device:CPU:0"](words, _arg_input_placeholder_0_0, embedding_lookup_1/axis). (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
[[Node: embedding_lookup_1 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@words"], _output_shapes=[[?,80,50]], batch_dims=0, _device="/job:localhost/replica:0/task:0/device:CPU:0"](words, _arg_input_placeholder_0_0, embedding_lookup_1/axis)]]
E0703 09:25:28.295219 46256 tfmodel.cc:88] Error during inference: Invalid argument: NodeDef mentions attr 'batch_dims' not in Op<name=GatherV2; signature=params:Tparams, indices:Tindices, axis:Taxis -> output:Tparams; attr=Tparams:type; attr=Tindices:type,allowed=[DT_INT32, DT_INT64]; attr=Taxis:type,allowed=[DT_INT32, DT_INT64]>; NodeDef: embedding_lookup_1 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@words"], _output_shapes=[[?,80,50]], batch_dims=0, _device="/job:localhost/replica:0/task:0/device:CPU:0"](words, _arg_input_placeholder_0_0, embedding_lookup_1/axis). (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
[[Node: embedding_lookup_1 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@words"], _output_shapes=[[?,80,50]], batch_dims=0, _device="/job:localhost/replica:0/task:0/device:CPU:0"](words, _arg_input_placeholder_0_0, embedding_lookup_1/axis)]]
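2019-07-03 09:25:28.295281: E ./kcws/cc/tf_seg_model.cc:320] Error during inference:
Many thanks to the author of that repository for answering my question carefully and promptly. It did not solve my problem, but I am still very grateful. The GitHub link once more: https://github.com/forever1dream/cplus-kcws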
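Back to the point: this problem is caused by a TensorFlow version mismatch. I trained with tensorflow-gpu 1.14.0 and tested with a CPU-only TensorFlow 1.6; switching versions finally solved it. In other words, the TensorFlow version used for testing must be greater than or equal to the version used for training, otherwise this error is raised.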
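This rule of thumb can be turned into a check that runs before a model is loaded. The following is a sketch under the assumption stated above (inference version >= training version). The function names are hypothetical, and in practice the version strings would be read from tensorflow.__version__ on each machine. Versions are compared as integer tuples, because comparing the raw strings would rank "1.6.0" above "1.14.0".

```python
def parse_version(version):
    """Turn a dotted version such as '1.14.0' into a tuple of ints.

    Assumes purely numeric components (no 'rc' suffixes).
    """
    return tuple(int(part) for part in version.split(".")[:3])


def inference_can_load(train_version, inference_version):
    """Apply the rule of thumb: inference version >= training version."""
    return parse_version(inference_version) >= parse_version(train_version)


# The failing setup from this post: trained on 1.14.0, tested on 1.6.
print(inference_can_load("1.14.0", "1.6.0"))
# Matching versions on both sides.
print(inference_can_load("1.14.0", "1.14.0"))
```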
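With that, kcws was working. I measured the speed: on CPU, segmentation plus tagging averages about 60 ms, which is a bit slow. In the end I did not use it in my project, but I learned a lot along the way. Bugs make you grow!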
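An average latency like the one quoted here can be measured with a small timing loop. The sketch below is generic: segment is a stand-in callable (a trivial character splitter, not the kcws model), and the code does not reproduce the 60 ms figure, which is the measurement reported in this post.

```python
import time


def average_latency_ms(fn, queries, repeats=5):
    """Return the mean wall-clock time per call of fn, in milliseconds."""
    if not queries or repeats < 1:
        raise ValueError("need at least one query and one repeat")
    start = time.perf_counter()
    for _ in range(repeats):
        for query in queries:
            fn(query)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / (repeats * len(queries))


def segment(query):
    # Placeholder for a real call into the segmenter; splits into characters.
    return list(query)


print("%.3f ms per query" % average_latency_ms(segment, ["我们是中国人"]))
```

Averaging over several repeats smooths out one-off costs such as the first call, which usually takes longer than later ones.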
Finally, here is a Python wrapper for kcws written by another developer, worth a look: https://github.com/AlleyEli/kcws