BUAA 数据结构总结——大作业(文本摘要生成)
大作业综合作业1. 文本摘要生成(综合)综合作业1. 文本摘要生成(综合)【问题描述】在自然语言文本处理中,有一种分析文本、自动抽取文本主题思想的方法(通常用于文本摘要生成),其方法如下:首先分析文本中非停用词(stop-word)的出现频度;统计文本中每个句子中非停用词频度之和。若某个非停用词在一个句子中出现多次,则都要计算;按非停用词频度之和由高至低输出前N个句子。注:单词为仅由字母组成的字符
文章目录
综合作业
1. 文本摘要生成(综合)
【问题描述】
在自然语言文本处理中,有一种分析文本、自动抽取文本主题思想的方法(通常用于文本摘要生成),其方法如下:
-
首先分析文本中非停用词(stop-word)的出现频度;
-
统计文本中每个句子中非停用词频度之和。若某个非停用词在一个句子中出现多次,则都要计算;
-
按非停用词频度之和由高至低输出前N个句子。
注:
-
单词为仅由字母组成的字符序列。包含大写字母的单词应将大写字母转换为小写字母后进行词频统计。
-
句子是由下面符号分隔的段落:句号(.)、问号(?)和惊叹号(!)。
-
在自然语言处理中,停用词(stop-word)指的是文本分析时不会提供额外语义信息的词的列表,如英文单词a,an,he,you等就是停用词。
【输入形式】
根据当前目录下停用词文件“stopwords.txt”,打开当前目录下文件“article.txt”,并从标准输入读入需要生成至文件的句子数N。按上面要求进行文本分析,抽取相关文本主题思想。
【输出形式】
在标准输出上按频度之和由高至低输出前5个句子的频度之和与句子。输出时先输出句子的频度和,然后空一个空格再输出整个句子,每个句子最后有一个回车。同时按频度之和由高至低输出前N个句子的频度之和与句子输出到文件“results.txt”中,输出要求同标准输出。输出时,若两个句子频度和相同,则按原文本中出现次序输出。
说明:输出句子时,从第一个非空字符开始,句子中每个组成部分形态(包括单词中字母大小写、间隔)应与原文本中一致
【样例输入】
200
(其它可从课程下载区下载样例文件“article.txt”和停用词表文件“stopwords.txt”到本地作为程序输入文件。)
【样例输出】
程序运行后,屏幕上输出结果为:
50286 James’s eyes were hazel, his nose was slightly longer than Harry’s and there was no scar on his forehead, but they had the same thin face, same mouth, same eyebrows; James’s hair stuck up at the back exactly as Harry’s did, his hands could have been Harry’s and Harry could tell that, when James stood up, they would be within an inch of each other in height.
48188 I didn’t practise, I didn’t bother, I could’ve stopped myself having those dreams, Hermione kept telling me to do it, if I had he’d never have been able to show me where to go, and-Sirius wouldn’t-Sirius wouldn’t-'Something was erupting inside Harry’s head: a need to justify himself, to explain-I tried to check he’d really taken Sirius, I went to Umbridge’s office, I spoke to Kreacher in the fire and he said Sirius wasn’t there, he said he’d gone!
39986 Little Ginny’s been writing in it for months and months, telling me all her pitiful worries and woes - how her brothers tease her, how she had to come to school with secondhand robes and books, how - Riddle’s eyes glinted - how she didn’t think famous, good, great Harry Potter would ever like herAll the time he spoke, Riddle’s eyes never left Harry’s face.
39455 I mean, it was really great of you and everything,’ said Hermione quickly, looking positively petrified at the look on Harry’s face, everyone thought it was a wonderful thing to do-‘That’s funny,’ said Harry through gritted teeth, because I definitely remember Ron saying I’d wasted time acting the hero .
39438 Harry, Ginny and Neville and each of the Death Eaters turned in spite of themselves to watch the top of the tank as a brain burst from the green liquid like a leaping fish: for a moment it seemed suspended in midair, then it soared towards Ron, spinning as it came, and what looked like ribbons of moving images flew from it, unravelling like rolls of film-Ha ha ha, Harry, look at it-’ said Ron, watching it disgorge its gaudy innards, Harry, come and touch it; bet it’s weird-'RON, NO!
所生成的结果文件“results.txt”内容应与下载区样例文件“results(example).txt”完全相同。
【样例说明】
程序运行时,对给定的文本文件article.txt按要求进行文本分析,抽取相关文本主题思想。
在本样例中,按输出格式要求,在屏幕上输出了频率之和由高至低的前5个句子(即results.txt文件中的前5个句子);results.txt则包含了200个按频度之和由高到低排列的所有句子的频度之和与句子文本。
频率最高的句子中包含的非停用词及其出现频度如下:(james:78)(s:4750)(eyes:480)(hazel:1)(nose:115)(slightly:164)(longer:59)(harry:5732)(s:4750)(scar:90)(forehead:42)(face:521)(mouth:182)(eyebrows:45)(james:78)(s:4750)(hair:166)(stuck:35)(exactly:101)(harry:5732)(s:4750)(did:667)(hands:175)(harry:5732)(s:4750)(harry:5732)(tell:332)(james:78)(stood:156)(inch:28)(height:15),这些非停用词的出现频度之和为50286。
频率次高的句子中包含的非停用词及其出现频度如下:(didn:400)(t:2308)(practise:20)(didn:400)(t:2308)(bother:16)(ve:693)(stopped:120)(having:103)(dreams:33)(hermione:1615)(kept:113)(telling:93)(d:621)(able:126)(sirius:647)(wouldn:113)(t:2308)(sirius:647)(wouldn:113)(t:2308)(erupting:2)(inside:198)(harry:5732)(s:4750)(head:490)(need:183)(justify:2)(explain:38)(tried:128)(check:35)(d:621)(really:288)(taken:79)(sirius:647)(went:198)(umbridge:571)(s:4750)(office:170)(spoke:65)(kreacher:118)(said:5086)(sirius:647)(wasn:141)(t:2308)(said:5086)(d:621)(gone:129),这些非停用词的出现频度之和为48188。
算法
二维数组形式的 字典树 + 快排
性能:

(后续可以考虑桶排来进一步提高速率)
你是否debug抵到头秃?推荐俩个在线txt文本比较网站:
https://www.jq22.com/textDifference 和
https://www.qianbo.com.cn/Tool/Text-Difference/
「代码」
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#define MAXWORD 25
#define MAXSENTENCE 215000
#define MAXSENTENCEWORD 1500
int getword(FILE *bfp,char word[]);//读入一个单词
void insert_stopwords(char *str);
int search_stopwords(char *str);
void insert_articlewords(char *str);
int search_articlewords(char *str);
struct Sentence{
char sentence[MAXSENTENCEWORD];
int num;
}sentence[MAXSENTENCE];
typedef struct Sentence st;
int getsentence(FILE *bfp,struct Sentence *item);
int cmp(const void* a, const void* b)
{
st* pa = (st*)a;
st* pb = (st*)b;
return (int)pb->num - pa->num; // 从大到小
}
int main(int argc, const char * argv[]) {
/*
首先分析文本中非停用词(stop-word)的出现频度;
统计文本中每个句子中非停用词频度之和。若某个非停用词在一个句子中出现多次,则都要计算;
按非停用词频度之和由高至低输出前N个句子。
注:
大写字母转换为小写字母后进行词频统计
句子是由 句号(.) 问号(?) 惊叹号(!) 分隔
*/
//根据当前目录下停用词文件“stopwords.txt”,打开当前目录下文件“article.txt”,并从标准输入读入需要生成至文件的句子数N。
FILE *file_stopwords,*file_article;
if((file_stopwords = fopen("stopwords.txt", "r")) == NULL){ //打开停用词文件
fprintf(stderr, "stopwords.txt can’t open!\n");
return -1;
}
if((file_article = fopen("article.txt", "r")) == NULL){ //打开文章文件
fprintf(stderr, "article.txt can’t open!\n");
return -1;
}
int N;
scanf("%d",&N); //需要生成至文件的句子数N
char word[MAXWORD];
while( fscanf(file_stopwords,"%s",word) != EOF) //从文件中读入停用词
insert_stopwords(word);
while( getword(file_article,word) != EOF) //从文件中读入一个单词
if (!search_stopwords(word))
insert_articlewords(word);
fclose(file_stopwords);
if((file_article = fopen("article.txt", "r")) == NULL){ //再次打开文章文件
fprintf(stderr, "article.txt can’t open!\n");
return -1;
}
int i=0;
while( getsentence(file_article,&sentence[i]) != EOF) i++;
/* 对结构体进行快排,⚠️请使用<stdlib.h>头文件自带的qsort(),
否则会导致排序不稳定(即两个句子频度和相同,不按原文本中出现次序输出)。
进一步了解该函数可参考“CSDN“文章https://blog.csdn.net/qq_16933601/article/details/107214404 或 https://blog.csdn.net/qq_38789531/article/details/94358602 */
qsort(sentence,i,sizeof(sentence[0]),cmp);
//在标准输出上按频度之和由高至低输出前5个句子的频度之和与句子。输出时先输出句子的频度和,然后空一个空格再输出整个句子,每个句子最后有一个回车。同时按频度之和由高至低输出前N个句子的频度之和与句子输出到文件“results.txt”中,输出要求同标准输出。输出时,若两个句子频度和相同,则按原文本中出现次序输出。
FILE *file_results;
file_results = fopen("results.txt", "w");
for(int i=0;i<5;i++)
printf("%d %s\n",sentence[i].num,sentence[i].sentence);
for(int i=0;i<N;i++)
fprintf(file_results,"%d %s\n",sentence[i].num,sentence[i].sentence);
return 0;
}
//Trie字典树(停用词)
int trie_stopwords[10010][26]={0};
int num_stopwords[10010]={0};
int pos_stopwords=0;
void insert_stopwords(char *str)
{
int p=0;
for(int i=0; str[i]; i++)
{
int n=str[i]-'a';
if(trie_stopwords[p][n]==0)
trie_stopwords[p][n]=++pos_stopwords;
p=trie_stopwords[p][n];
}
num_stopwords[p]=1; //在p结尾的单词数量+1 单词唯一
}
int search_stopwords(char *str)
{
int p=0;
for(int i=0;str[i];i++)
{
int n=str[i]-'a';
if(trie_stopwords[p][n]==0) return 0;//不存在
p=trie_stopwords[p][n];
}
return num_stopwords[p]==1;
}
int getword(FILE *bfp,char word[])
{
int i=0;
char temp;
while((temp=fgetc(bfp))!=EOF){
if(isalpha(temp)){
word[i]=tolower(temp);
i++;
}
else if(i>0){ //说明i中已经至少有一个字符
word[i]='\0';
return 0;
}
}
return EOF;
}
//Trie字典树(文章词)
int trie_articlewords[1000010][26]={0};
int num_articlewords[1000010]={0};
int pos_articlewords=0;
void insert_articlewords(char *str)
{
int p=0;
for(int i=0; str[i]; i++)
{
int n=str[i]-'a';
if(trie_articlewords[p][n]==0)
trie_articlewords[p][n]=++pos_articlewords;
p=trie_articlewords[p][n];
}
num_articlewords[p]++; //在p结尾的单词数量+1 单词唯一
}
int search_articlewords(char *str)
{
int p=0;
for(int i=0;str[i];i++)
{
int n=str[i]-'a';
// if(trie_articlewords[p][n]==0) return 0;//不存在
p=trie_articlewords[p][n];
}
return num_articlewords[p];
}
int getsentence(FILE *bfp,struct Sentence *item)
{
int i=0,j=0;
char temp,word[MAXWORD];
while((temp=fgetc(bfp))!=EOF){
if (i==0&&temp==' ') continue;//注意:处理句首空格
item->sentence[i++]=temp;
if(isalpha(temp)){
word[j]=tolower(temp);
j++;
}
else if(j>0){ //说明i中已经至少有一个字符
word[j]='\0';
if (!search_stopwords(word))
item->num=item->num+search_articlewords(word);
j=0;
memset(word, 0, sizeof(word));
}
if(temp=='.'||temp=='?'||temp=='!'){ //句子结束
item->sentence[i+1]='\0';
return 0;
}
}
return EOF;
}
DAMO开发者矩阵,由阿里巴巴达摩院和中国互联网协会联合发起,致力于探讨最前沿的技术趋势与应用成果,搭建高质量的交流与分享平台,推动技术创新与产业应用链接,围绕“人工智能与新型计算”构建开放共享的开发者生态。
更多推荐


所有评论(0)