头歌 (EduCoder): Scraping Popular Website Data with Scrapy
Level 1: Crawling Maoyan TOP 100 Movie Ranking Data
- Programming requirements
  - In `items.py`, define the data to be stored: movie name `name`, starring actors `starts`, release time `releasetime`, and rating `score`.
  - In the main spider `movies.py`, use XPath to match the 10 movies on the first page, iterate over them with a `for` loop, extract each movie's name, stars, release time, and rating, and return the item with `yield`. Increment the `offset` in the URL and issue a new request to implement pagination.
  - In `pipelines.py`, connect to the database, create the `mymovies` table, insert the item data into it, and close the database connection after the insert.
  - Note: the platform proxies the link http://maoyan.com/board/4?offset=0; the link used in this exercise is "http://127.0.0.1:8080/board/4?offset=0".
- Test notes
  - The platform tests the submitted code; if the evaluation reports a Django startup failure, simply re-run the evaluation.
  - This level takes no input. The platform checks whether the `mymovies` table has been created; if it has, the output is 爬取成功 (crawl succeeded). A quick local check is sketched below.
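As a quick sanity check outside the platform, the sketch below connects with the same credentials that `pipelines.py` further down assumes (user root, password 123123, database mydb) and reports whether `mymovies` exists; adjust the values for your own environment.

import pymysql

# Minimal local check: has the mymovies table been created in mydb?
connection = pymysql.connect(host='localhost', port=3306, user='root',
                             passwd='123123', db='mydb', charset='utf8')
try:
    with connection.cursor() as cursor:
        cursor.execute("SHOW TABLES LIKE 'mymovies'")
        print('爬取成功' if cursor.fetchone() else 'mymovies table not found')
finally:
    connection.close()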
- Code examples
- `step1/maoyan/maoyan/items.py`
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MaoyanItem(scrapy.Item):
    #********** Begin **********#
    name = scrapy.Field()
    starts = scrapy.Field()
    releasetime = scrapy.Field()
    score = scrapy.Field()
    #********** End **********#
- `step1/maoyan/maoyan/pipelines.py`
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
from maoyan import settings


class MaoyanPipeline(object):
    def process_item(self, item, spider):
        #********** Begin **********#
        # 1. Connect to the database
        connection = pymysql.connect(
            host='localhost',   # local database
            port=3306,          # database port
            user='root',        # your MySQL user name
            passwd='123123',    # your MySQL password
            db='mydb',          # database name
            charset='utf8',     # default character encoding
        )
        # 2. Create the table, insert the item data, close the connection, then return the item
        name = item['name']
        starts = item['starts']
        releasetime = item['releasetime']
        score = item['score']
        try:
            with connection.cursor() as cursor:
                sql1 = 'Create Table If Not Exists mymovies(name varchar(50) CHARACTER SET utf8 NOT NULL,starts text CHARACTER SET utf8 NOT NULL,releasetime varchar(50) CHARACTER SET utf8 DEFAULT NULL,score varchar(20) CHARACTER SET utf8 NOT NULL,PRIMARY KEY(name))'
                # Insert one movie record
                sql2 = 'Insert into mymovies values (\'%s\',\'%s\',\'%s\',\'%s\')' % (name, starts, releasetime, score)
                cursor.execute(sql1)
                cursor.execute(sql2)
                # Commit the insert
                connection.commit()
        finally:
            # Close the connection
            connection.close()
        return item
        #********** End **********#
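One caveat on the pipeline above: building `sql2` with the `%` operator breaks as soon as a movie name or cast list contains a single quote, and it is open to SQL injection. A minimal sketch of a safer drop-in replacement for `sql2` and its `execute` call, letting pymysql do the quoting (the table layout is unchanged):

# Drop-in replacement for sql2: pymysql escapes the bound values itself
sql2 = 'INSERT INTO mymovies VALUES (%s, %s, %s, %s)'
cursor.execute(sql2, (name, starts, releasetime, score))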
- `step1/maoyan/maoyan/spiders/movies.py`
# -*- coding: utf-8 -*-
import scrapy
from maoyan.items import MaoyanItem


class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['127.0.0.1']
    offset = 0
    url = "http://127.0.0.1:8080/board/4?offset="

    #********** Begin **********#
    # 1. Build the start URL; the offset part is what pagination will increment
    start_urls = [url + str(offset)]

    # 2. Parse callback
    def parse(self, response):
        item = MaoyanItem()
        movies = response.xpath("//div[@class='board-item-content']")
        for each in movies:
            # movie name
            name = each.xpath(".//div/p/a/text()").extract()[0]
            # starring actors
            starts = each.xpath(".//div[1]/p/text()").extract()[0]
            # release time
            releasetime = each.xpath(".//div[1]/p[3]/text()").extract()[0]
            # rating: integer part and decimal part sit in two separate <i> tags
            score1 = each.xpath(".//div[2]/p/i[1]/text()").extract()[0]
            score2 = each.xpath(".//div[2]/p/i[2]/text()").extract()[0]
            score = score1 + score2
            item['name'] = name
            item['starts'] = starts
            item['releasetime'] = releasetime
            item['score'] = score
            yield item
        # 3. At the end, increment offset by 10 and issue a new request to crawl the next page
        if self.offset < 90:
            self.offset += 10
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
    #********** End **********#
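The pipeline only receives these items if it is registered in the project's `settings.py`, which the exercise does not show. A minimal sketch, assuming the standard layout produced by `scrapy startproject maoyan` (on the platform this configuration may already be in place):

# step1/maoyan/maoyan/settings.py (sketch; assumed, not part of the given code)
BOT_NAME = 'maoyan'
SPIDER_MODULES = ['maoyan.spiders']
NEWSPIDER_MODULE = 'maoyan.spiders'

# Route every scraped item through MaoyanPipeline
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}

# The local mirror is not governed by a robots.txt
ROBOTSTXT_OBEY = False

With this in place, the spider is started from the project root with `scrapy crawl movies`.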
Level 2: Crawling the First Page of the Fantasy Category on a Novel Site
- Programming requirements
  - In `items.py`, define the `NovelprojectItem` class to hold a novel's title `name`, author `author`, status `state`, and synopsis `description`; also define the `NovelprojectItem2` class to hold the chapter table name `tablename` and chapter title `title`.
  - In `pipelines.py`, connect to the database, create the corresponding tables, insert the novel information and chapter information into those tables, and close the database connection after the inserts.
  - In the main spider `novel.py`, use XPath to match the 3 novels on the first page of the fantasy category, iterate over them with a `for` loop, extract each novel's title, author, status, synopsis, and chapter titles, and return the corresponding items with `yield`.
  - Note: the platform proxies the link http://www.quanshuwang.com/list/1_1.html; the link used in this exercise is http://127.0.0.1:8000/list/1_1.html.
- Test notes
  - The platform tests the submitted code; if the evaluation reports a Django startup failure, simply re-run the evaluation.
  - Crawling the information and chapter titles of these 3 novels takes roughly half a minute; please wait patiently and do not switch code files (doing so stops the evaluation).
  - The test prints the number of tables in the `mydb` database, which should be 4. Expected output (the counting query is sketched after this list):

    COUNT(*)
    4
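A minimal sketch of the kind of counting query behind that expected output, assuming the checker counts the tables of `mydb` via `information_schema` (the exact command the platform runs is not shown):

import pymysql

connection = pymysql.connect(host='localhost', port=3306, user='root',
                             passwd='123123', db='mydb', charset='utf8')
try:
    with connection.cursor() as cursor:
        # novel + one chapter table per book (3 books) -> COUNT(*) = 4
        cursor.execute("SELECT COUNT(*) FROM information_schema.tables "
                       "WHERE table_schema = 'mydb'")
        print(cursor.fetchone()[0])
finally:
    connection.close()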
- Code examples
- `step2/NovelProject/NovelProject/items.py`
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# Holds the overall information of a novel
class NovelprojectItem(scrapy.Item):
    #********** Begin **********#
    name = scrapy.Field()
    author = scrapy.Field()
    state = scrapy.Field()
    description = scrapy.Field()
    #********** End **********#


# Holds the chapters of a single novel
class NovelprojectItem2(scrapy.Item):
    #********** Begin **********#
    tablename = scrapy.Field()  # needed for naming the chapter table
    title = scrapy.Field()
    #********** End **********#
- `step2/NovelProject/NovelProject/pipelines.py`
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
from NovelProject.items import NovelprojectItem, NovelprojectItem2


class NovelprojectPipeline(object):
    def process_item(self, item, spider):
        #********** Begin **********#
        # 1. Connect to the local database mydb
        connection = pymysql.connect(
            host='localhost',   # local database
            port=3306,          # database port
            user='root',        # your MySQL user name
            passwd='123123',    # your MySQL password
            db='mydb',          # database name
            charset='utf8',     # default character encoding
        )
        # 2. Handle items of type NovelprojectItem (return the item when done)
        if isinstance(item, NovelprojectItem):
            # Take the fields out of the item
            name = item['name']
            author = item['author']
            state = item['state']
            description = item['description']
            try:
                with connection.cursor() as cursor:
                    # Write the novel information
                    sql1 = 'Create Table If Not Exists novel(name varchar(20) CHARACTER SET utf8 NOT NULL,author varchar(10) CHARACTER SET utf8,state varchar(20) CHARACTER SET utf8,description text CHARACTER SET utf8,PRIMARY KEY (name))'
                    sql2 = 'Insert into novel values (\'%s\',\'%s\',\'%s\',\'%s\')' % (name, author, state, description)
                    cursor.execute(sql1)
                    cursor.execute(sql2)
                    # Commit the insert
                    connection.commit()
            finally:
                # Close the connection
                connection.close()
            return item
        # 3. Handle items of type NovelprojectItem2 (return the item when done)
        elif isinstance(item, NovelprojectItem2):
            tablename = item['tablename']
            title = item['title']
            try:
                with connection.cursor() as cursor:
                    # Write a novel chapter
                    sql3 = 'Create Table If Not Exists %s(title varchar(20) CHARACTER SET utf8 NOT NULL,PRIMARY KEY (title))' % tablename
                    sql4 = 'Insert into %s values (\'%s\')' % (tablename, title)
                    cursor.execute(sql3)
                    cursor.execute(sql4)
                    connection.commit()
            finally:
                connection.close()
            return item
        #********** End **********#
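Because `tablename` comes straight from page content, the `%`-formatted CREATE/INSERT statements above fail if it contains spaces, quotes, or backticks. A sketch of a slightly more defensive variant (the identifier still has to be interpolated, since table names cannot be bound as query parameters; the cleanup rule here is an assumption, not part of the exercise):

# Sketch: backtick-quote the identifier and bind the chapter title as a parameter
safe_table = '`%s`' % tablename.replace('`', '')
sql3 = ('CREATE TABLE IF NOT EXISTS %s('
        'title varchar(20) CHARACTER SET utf8 NOT NULL, PRIMARY KEY (title))') % safe_table
sql4 = 'INSERT INTO %s VALUES (%%s)' % safe_table
cursor.execute(sql3)
cursor.execute(sql4, (title,))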
- `step2/NovelProject/NovelProject/spiders/novel.py`
import scrapy
import re
from scrapy.http import Request
from NovelProject.items import NovelprojectItem
from NovelProject.items import NovelprojectItem2


class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['127.0.0.1']
    start_urls = ['http://127.0.0.1:8000/list/1_1.html']  # first page of the fantasy category on quanshuwang

    #********** Begin **********#
    # 1. Follow the '马上阅读' (read now) links to get the URL of each book
    def parse(self, response):
        book_urls = response.xpath('//li/a[@class="l mr10"]/@href').extract()
        three_book_urls = book_urls[0:3]  # only take the first 3 books
        for book_url in three_book_urls:
            yield Request(book_url, callback=self.parse_read)

    # 2. On the book's introduction page, extract its information, yield it to the
    #    pipeline, then follow the '开始阅读' (start reading) link to the chapter list
    def parse_read(self, response):
        item = NovelprojectItem()
        # novel title
        name = response.xpath('//div[@class="b-info"]/h1/text()').extract_first()
        # synopsis
        description = response.xpath('//div[@class="infoDetail"]/div/text()').extract_first()
        # serialization status
        state = response.xpath('//div[@class="bookDetail"]/dl[1]/dd/text()').extract_first()
        # author
        author = response.xpath('//div[@class="bookDetail"]/dl[2]/dd/text()').extract_first()
        item['name'] = name
        item['description'] = description
        item['state'] = state
        item['author'] = author
        yield item
        # follow the '开始阅读' button to the chapter list
        read_url = response.xpath('//a[@class="reader"]/@href').extract()[0]
        yield Request(read_url, callback=self.parse_info)

    # 3. On the chapter list page, extract the chapter titles and yield them
    def parse_info(self, response):
        item = NovelprojectItem2()
        tablename = response.xpath('//div[@class="main-index"]/a[3]/text()').extract_first()
        titles = response.xpath('//div[@class="clearfix dirconone"]/li')
        for each in titles:
            title = each.xpath('.//a/text()').extract_first()
            item['tablename'] = tablename
            item['title'] = title
            yield item
    #********** End **********#
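As in Level 1, this project's pipeline only runs if it is registered in `settings.py`, which the exercise does not show. A minimal sketch, assuming the standard `scrapy startproject NovelProject` layout (the platform may already provide this):

# step2/NovelProject/NovelProject/settings.py (sketch; assumed, not part of the given code)
BOT_NAME = 'NovelProject'
SPIDER_MODULES = ['NovelProject.spiders']
NEWSPIDER_MODULE = 'NovelProject.spiders'

# Items of both NovelprojectItem and NovelprojectItem2 pass through the same
# pipeline, which dispatches on the item type with isinstance()
ITEM_PIPELINES = {
    'NovelProject.pipelines.NovelprojectPipeline': 300,
}

ROBOTSTXT_OBEY = False

The spider is then started with `scrapy crawl novel` from the project root.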