The ChnSentiCorp Dataset and How to Process It
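Dataset download:
Link: https://pan.baidu.com/s/1PGCIz-yub3ugXYuNivlZzw
Extraction code: nuwl
The archive unpacks into four datasets, of which chnsenticorp is the main one.
Processing:
chnsenticorp comes in four subsets: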
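- ChnSentiCorp_htl_ba_2000: 2000 hotel reviews, balanced labels
- ChnSentiCorp_htl_ba_4000: 4000 hotel reviews, balanced labels
- ChnSentiCorp_htl_ba_6000: 6000 hotel reviews, balanced labels
- ChnSentiCorp_htl_unba_10000 (in practice it seems to hold only about 7000 reviews, and unpacking it reports errors): 7000 reviews, pos only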
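Taking the 6000-review subset as the example: it contains two folders, pos and neg, each holding 3000 .txt files, and each file is one review with the corresponding sentiment.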
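Before merging, it can be worth confirming that both folders really contain the expected number of files. The following is a minimal sketch, assuming the `pos` and `neg` folders sit in the current working directory; `count_txt_files` is a hypothetical helper introduced here for illustration, not part of the original post.

```python
import os


def count_txt_files(folder):
    """Return the number of .txt files directly inside `folder` (0 if it is missing)."""
    if not os.path.isdir(folder):
        return 0
    return sum(1 for name in os.listdir(folder) if name.lower().endswith(".txt"))


# Assumed layout: ./pos and ./neg next to this script.
for folder in ("./pos", "./neg"):
    print(folder, count_txt_files(folder))
```

For the 6000-review subset, both counts would be expected to be 3000; a different number suggests the archive did not unpack cleanly.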
The plan is to merge them into two .txt files, one per label, for easier use later:
import os
import codecs

folder = ["./neg", "./pos"]
record = dict()  # number of reviews merged per folder
for fold in folder:
    record[fold] = 0
    out_file = fold + "_6000.txt"
    with codecs.open(out_file, "w", errors="ignore", encoding="gbk") as out:
        for dirpath, _, filenames in os.walk(fold):
            for filename in filenames:
                # Join with dirpath so files in nested folders resolve correctly
                path = os.path.join(dirpath, filename)
                with codecs.open(path, "r", errors="ignore", encoding="gbk") as file:
                    context = file.read()
                # Flatten each review onto a single line
                context = context.replace('\n', '').replace('\r', '') + "\n"
                out.write(context)
                record[fold] += 1
print("record:", record)
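As a quick sanity check, the line count of each merged file should match the `record` value the script prints. A minimal sketch, assuming the merged files are named `./pos_6000.txt` and `./neg_6000.txt` as in the script; `count_lines` is a hypothetical helper, not part of the original post.

```python
import os


def count_lines(path, encoding="gbk"):
    """Count the lines in a merged review file (one review per line)."""
    with open(path, "r", encoding=encoding, errors="ignore") as f:
        return sum(1 for _ in f)


# Only inspect files that actually exist, so the check is safe to run anywhere.
for path in ("./pos_6000.txt", "./neg_6000.txt"):
    if os.path.exists(path):
        print(path, count_lines(path))
```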
Next, convert the data to JSON and give each sentence an id (I need this because I use a Transformer later on; readers who do not can skip it):
import json
import random

def shuffle2list(a: list, b: list):
    # Shuffle two lists with the same permutation; sklearn.utils.shuffle also works
    c = list(zip(a, b))
    random.shuffle(c)
    a[:], b[:] = zip(*c)
    return a, b

sen_lis = []
label_lis = []
# pos: 1
# neg: 0
res = []
with open("./pos_6000.txt", "r", errors="ignore", encoding="gbk") as pos, open(
        "./neg_6000.txt", "r", errors="ignore", encoding="gbk") as neg:
    lines = pos.readlines()
    for line in lines:
        sen_lis.append(line.strip("\n"))
        label_lis.append(1)
    lines = neg.readlines()
    for line in lines:
        sen_lis.append(line.strip("\n"))
        label_lis.append(0)

sen_lis, label_lis = shuffle2list(sen_lis, label_lis)
for i in range(len(sen_lis)):
    item = dict()
    item["guid"] = i
    item["text_a"] = sen_lis[i]
    item["label"] = label_lis[i]
    res.append(item)
print("all of %d instances" % len(res))
# Write UTF-8 explicitly: ensure_ascii=False emits raw Chinese characters
with open("./ChnSenticrop.json", "w", encoding="utf-8") as jfile:
    json.dump(res, jfile, ensure_ascii=False)
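After cleaning, the output is a single JSON list in which every record has three fields: guid, text_a, and label.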
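To inspect the cleaned file, one option is to load it back and count the labels. This is a minimal sketch, assuming the JSON file was written as UTF-8 (pass a different `encoding` if it was written with the platform default); `label_counts` is a hypothetical helper, not part of the original post.

```python
import json
import os
from collections import Counter


def label_counts(path, encoding="utf-8"):
    """Load the JSON list of records and count how often each label occurs."""
    with open(path, "r", encoding=encoding) as f:
        records = json.load(f)
    return Counter(record["label"] for record in records)


# File name taken from the script that writes the JSON output.
json_path = "./ChnSenticrop.json"
if os.path.exists(json_path):
    print(label_counts(json_path))
```

For the balanced 6000-review subset, the two labels would be expected to occur equally often.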
