The chnsenticorp Dataset and How to Process It


Reza. · Published 2020-11-23 17:04:26

Dataset download:

Link: https://pan.baidu.com/s/1PGCIz-yub3ugXYuNivlZzw
Extraction code: nuwl

The archive extracts to four datasets, of which chnsenticorp is the main one.

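Processing:

chnsenticorp comes in four subsets:

ChnSentiCorp_htl_ba_2000: 2,000 hotel reviews, balanced labels
ChnSentiCorp_htl_ba_4000: 4,000 hotel reviews, balanced labels
ChnSentiCorp_htl_ba_6000: 6,000 hotel reviews, balanced labels
ChnSentiCorp_htl_unba_10000 (in practice only about 7,000, since extraction reports errors): 7,000 reviews, pos only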

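I use the 6000 subset as the example here. It has two folders, pos and neg, each holding 3,000 .txt files, and each file is a single review with the corresponding sentiment.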

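Before merging anything, it can help to confirm that the extracted folders really hold the expected number of files, especially since extraction may report errors. The following is only a minimal sketch: the helper name `count_reviews` and the path `DATA_ROOT` are my own assumptions, not part of the dataset, so adjust them to wherever you extracted the archive.

```python
import os

def count_reviews(root):
    """Return {class_name: number of .txt files} for each sub-folder of root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if os.path.isdir(class_dir):
            counts[name] = sum(
                1 for f in os.listdir(class_dir) if f.endswith(".txt")
            )
    return counts

# Hypothetical location of the extracted 6000 subset; change to your own layout.
DATA_ROOT = "./ChnSentiCorp_htl_ba_6000"
if os.path.isdir(DATA_ROOT):
    print(count_reviews(DATA_ROOT))
```

For the 6000 subset described above, each of the two folders should report 3,000 files if extraction went cleanly.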

Next, merge them into two .txt files, one per label, to make later use easier:

import os

folders = ["./neg", "./pos"]
record = dict()  # number of reviews merged per folder

for fold in folders:
    record[fold] = 0
    out_file = fold + "_6000.txt"
    # Files are read and written as GBK; undecodable bytes are skipped
    with open(out_file, "w", errors="ignore", encoding="gbk") as out:
        for dirpath, _, filenames in os.walk(fold):
            for filename in filenames:
                # Join with dirpath (not fold) so files in sub-folders resolve correctly
                path = os.path.join(dirpath, filename)
                with open(path, "r", errors="ignore", encoding="gbk") as f:
                    context = f.read()
                # Collapse each review onto a single line
                context = context.replace("\n", "").replace("\r", "") + "\n"
                out.write(context)
                record[fold] += 1

print("record:", record)
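After the merge, a quick check on the output files can catch problems such as empty reviews. This is just a sketch under my own assumptions: `summarize_merged` is a hypothetical helper, and it assumes the merged files were written as GBK with one review per line, as in the script above.

```python
import os

def summarize_merged(path, encoding="gbk"):
    """Return (total lines, blank lines) for a merged one-review-per-line file."""
    total = 0
    blank = 0
    with open(path, "r", errors="ignore", encoding=encoding) as f:
        for line in f:
            total += 1
            if not line.strip():
                blank += 1
    return total, blank

# Only runs if the merged files from the previous step exist.
for merged in ("./pos_6000.txt", "./neg_6000.txt"):
    if os.path.exists(merged):
        print(merged, summarize_merged(merged))
```

The total should match the counts printed in `record`; any blank lines point to source files that were empty or could not be decoded.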

Then convert the result to JSON, giving each sentence an id as well (I need this because I will use a Transformer later; readers can skip it):

import json
import random

def shuffle2list(a: list, b: list):
    # Shuffle two lists with the same permutation; sklearn.utils.shuffle also works
    c = list(zip(a, b))
    random.shuffle(c)
    a[:], b[:] = zip(*c)
    return a, b

sen_lis = []
label_lis = []
# pos: 1
# neg: 0
res = []
with open("./pos_6000.txt", "r", errors="ignore", encoding="gbk") as pos, open(
    "./neg_6000.txt", "r", errors="ignore", encoding="gbk") as neg:
    for line in pos:
        sen_lis.append(line.strip("\n"))
        label_lis.append(1)

    for line in neg:
        sen_lis.append(line.strip("\n"))
        label_lis.append(0)

sen_lis, label_lis = shuffle2list(sen_lis, label_lis)

for i in range(len(sen_lis)):
    item = dict()
    item["guid"] = i
    item["text_a"] = sen_lis[i]
    item["label"] = label_lis[i]

    res.append(item)

print("all of %d instances" % len(res))

# Write UTF-8 explicitly so the Chinese text does not depend on the platform default encoding
with open("./ChnSenticrop.json", "w", encoding="utf-8") as jfile:
    json.dump(res, jfile, ensure_ascii=False)
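For later use, the JSON file can be loaded back and cut into train/dev/test parts. The sketch below is not part of the original processing: the helper names, the 80/10/10 ratios, and the assumption that the file is UTF-8 are all mine (pass another `encoding` if your file was written with a different one). Since the records were already shuffled before being written, a plain slice is enough here.

```python
import json
import os
from collections import Counter

def load_dataset(path, encoding="utf-8"):
    """Load the list of {"guid", "text_a", "label"} records from a JSON file."""
    with open(path, "r", encoding=encoding) as f:
        return json.load(f)

def split_dataset(items, train_ratio=0.8, dev_ratio=0.1):
    """Slice already-shuffled items into (train, dev, test) by the given ratios."""
    n_train = int(len(items) * train_ratio)
    n_dev = int(len(items) * dev_ratio)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test

# Only runs if the JSON file from the previous step exists.
JSON_PATH = "./ChnSenticrop.json"
if os.path.exists(JSON_PATH):
    data = load_dataset(JSON_PATH)
    print("label counts:", Counter(item["label"] for item in data))
    train, dev, test = split_dataset(data)
    print(len(train), len(dev), len(test))
```

Printing the label counts is a cheap way to confirm that the set is still balanced after merging and shuffling.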