Python Crawling in Practice: Understanding the Robots Protocol and Compliant Crawling Principles

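As web crawling becomes more widespread, the compliance of data collection is drawing more attention. The Robots protocol (also called the crawler protocol or Robots Exclusion Protocol) is the "agreement" between a website and crawlers: it defines which parts of the site a crawler may access and under what rules, and it is the baseline for compliant crawling. Ignoring it can get your requests blocked by the site and may also create legal risk. This article covers the protocol's core concepts, how to parse it, and the principles of compliant crawling, with practical examples of how to comply with the Robots protocol in Python crawler development.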

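编程攻城狮 · Published 2025-12-28 11:16:06

Preface

Abstract

Core content: this article explains the definition, syntax rules and parsing methods of the Robots protocol. Using the robots.txt files of two real sites (Baidu and Zhihu), it demonstrates how to parse the protocol in Python and regulate crawler behaviour accordingly, and it summarises the core principles of compliant crawling (rate control, well-formed request headers, respect for copyright, and so on). After reading it you should know the ground rules of compliant crawler development and how to avoid the legal and technical risks of crawling. Live examples: Baidu's robots.txt (https://www.baidu.com/robots.txt) and Zhihu's robots.txt (https://www.zhihu.com/robots.txt).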

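1. Core Concepts of the Robots Protocol

1.1 What is the Robots protocol

The Robots protocol is expressed as a plain-text file named robots.txt at the root of a website. It tells web crawlers which pages they may crawl and which they may not. It is an industry convention rather than a legally binding rule, but following it is basic professional conduct for crawler developers and a key part of avoiding blocks by a site's anti-crawling mechanisms.

1.2 What the Robots protocol is for

Party	Core purpose
Website	Control where crawlers may go, protect sensitive areas (such as admin pages), and reduce server load
Crawler	Know the legitimate crawling boundary, avoid pointless requests that get blocked, and reduce legal risk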

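1.3 Basic syntax of the Robots protocol

A robots.txt file consists of groups of rules. The core directives are:

Directive	Meaning	Example
User-agent	The crawler the following rules apply to (* means all crawlers)	User-agent: *
Disallow	A path prefix that must not be crawled (/ means the whole site; an empty value allows everything)	Disallow: /admin/
Allow	A path prefix that may be crawled. When Allow and Disallow both match, RFC 9309 applies the most specific (longest) matching rule, and Allow wins a tie	Allow: /public/
Crawl-delay	Interval between requests, in seconds. This is a non-standard extension (not part of RFC 9309), and some crawlers ignore it	Crawl-delay: 5
Sitemap	Location of the sitemap, which helps crawlers discover pages	Sitemap: https://www.example.com/sitemap.xml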
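
To make the directives concrete, the sketch below feeds a small, made-up robots.txt to Python's standard-library urllib.robotparser and queries the result. The rules and the www.example.com host are illustrative assumptions, not taken from a real site; parsing from a string keeps the example offline.

```python
from urllib import robotparser

# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Allow: /public/
Disallow: /admin/
Crawl-delay: 5

Sitemap: https://www.example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(SAMPLE_ROBOTS.splitlines())  # parse() accepts an iterable of lines

admin_ok = rp.can_fetch("*", "/admin/settings")      # matched by Disallow: /admin/
public_ok = rp.can_fetch("*", "/public/index.html")  # matched by Allow: /public/
other_ok = rp.can_fetch("*", "/about")               # no rule matches, so allowed
delay = rp.crawl_delay("*")                          # value of Crawl-delay, or None
sitemaps = rp.site_maps()                            # list of Sitemap URLs (Python 3.8+)

print(admin_ok, public_ok, other_ok, delay, sitemaps)
```

A path that matches no rule is allowed by default, which is why /about passes even though it is not listed under Allow.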

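2. In Practice: Parsing the Robots Protocol

2.1 Environment setup

The manual parser below uses the third-party requests library (install it with pip install requests). The ready-made parser is part of the standard library and needs no installation; in Python 3 it lives in urllib.robotparser (the top-level robotparser module existed only in Python 2):

python

# Standard-library robots.txt parser (Python 3 location)
from urllib import robotparser
from urllib.parse import urljoin, urlparse

import requests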

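2.2 Parsing robots.txt by hand (basic version)

2.2.1 How it works

Manual parsing takes three steps:

1. Build the robots.txt URL for the target site (site origin + /robots.txt);
2. Send a GET request to fetch the file;
3. Parse it line by line and extract the User-agent, Disallow, Allow and related rules.

2.2.2 Implementation

python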
def parse_robots_manual(url):
    """
    Parse a site's robots.txt by hand.
    :param url: site origin (e.g. https://www.baidu.com)
    :return: dict mapping each User-agent to its rules
    """
    # Build the robots.txt URL
    robots_url = urljoin(url, "/robots.txt")
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    }

    try:
        response = requests.get(robots_url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Failed to fetch robots.txt: {e}")
        return {}

    robots_rules = {}
    current_agents = []     # user agents of the group being read
    reading_agents = False  # True while consecutive User-agent lines are collected
    for raw_line in response.text.splitlines():
        # Drop comments (full-line and inline) and surrounding whitespace
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue  # blank or malformed line
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()  # directive names are case-insensitive

        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a new group starts here
            reading_agents = True
            current_agents.append(value)
            robots_rules.setdefault(value, {"Disallow": [], "Allow": [], "Crawl-delay": None})
            continue

        reading_agents = False
        # A rule applies to every User-agent named at the top of its group
        for agent in current_agents:
            group = robots_rules[agent]
            if field == "disallow" and value:
                group["Disallow"].append(value)
            elif field == "allow" and value:
                group["Allow"].append(value)
            elif field == "crawl-delay":
                group["Crawl-delay"] = int(value) if value.isdigit() else None

    return robots_rules

# Parse Baidu's robots.txt
baidu_robots = parse_robots_manual("https://www.baidu.com")
# Print the result
print("Baidu robots.txt, parsed (excerpt):")
# Rules that apply to all crawlers (User-agent: *)
print("General rules (User-agent: *):")
print(f"Disallowed paths: {baidu_robots.get('*', {}).get('Disallow', [])[:5]}")  # first 5 only
print(f"Allowed paths: {baidu_robots.get('*', {}).get('Allow', [])[:5]}")
print(f"Crawl delay: {baidu_robots.get('*', {}).get('Crawl-delay')}")

2.2.3 Output

Sample output; the result depends on the robots.txt the site serves when the script runs:

plaintext

Baidu robots.txt, parsed (excerpt):
General rules (User-agent: *):
Disallowed paths: ['/baidu', '/s?', '/ulink?', '/link?', '/home/news/data/']
Allowed paths: []
Crawl delay: None
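
The dictionary above only stores the rules; it does not yet decide whether a given path may be fetched. A minimal sketch of that decision follows, assuming the dictionary shape returned by parse_robots_manual and the longest-match precedence described in RFC 9309. The helper name is_allowed and the sample rules are hypothetical. It matches plain path prefixes only, ignores the * and $ wildcards, and looks up user agents by exact key.

```python
def is_allowed(rules, agent, path):
    """Return True if `path` may be fetched by `agent` under `rules`."""
    # Use the agent's own group if present, otherwise the catch-all group
    group = rules.get(agent, rules.get("*"))
    if group is None:
        return True  # no applicable group: nothing is restricted

    best_len, allowed = -1, True
    # Allow is checked first and the comparison is strict, so when an Allow
    # and a Disallow prefix have equal length, Allow keeps the win.
    for verdict, key in ((True, "Allow"), (False, "Disallow")):
        for prefix in group[key]:
            if path.startswith(prefix) and len(prefix) > best_len:
                best_len, allowed = len(prefix), verdict
    return allowed


# Hypothetical rules in the same shape as the parser's return value
sample_rules = {
    "*": {"Disallow": ["/admin/", "/s?"], "Allow": ["/admin/help/"], "Crawl-delay": None}
}

print(is_allowed(sample_rules, "*", "/admin/users"))     # blocked by /admin/
print(is_allowed(sample_rules, "*", "/admin/help/faq"))  # longer Allow prefix wins
print(is_allowed(sample_rules, "*", "/index.html"))      # no rule matches
```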

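2.3 Parsing with robotparser (advanced version)

2.3.1 How it works

urllib.robotparser is Python's built-in robots.txt parser. It parses the rules for you and provides a can_fetch() method that tells you whether a given crawler may access a given path.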
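
As a small offline illustration of how can_fetch() picks the rule group for a crawler, the sketch below parses a made-up robots.txt with one group for a named crawler and one catch-all group. The crawler names ExampleBot and OtherBot and the www.example.com URLs are hypothetical.

```python
from urllib import robotparser

# Hypothetical robots.txt: one named crawler is mostly welcome, everyone else is not
SAMPLE = """\
User-agent: ExampleBot
Disallow: /private/
Crawl-delay: 2

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(SAMPLE.splitlines())

# can_fetch() takes the crawler's user agent and a URL (or a bare path)
bot_home = rp.can_fetch("ExampleBot/1.0", "https://www.example.com/index.html")
bot_private = rp.can_fetch("ExampleBot/1.0", "https://www.example.com/private/data")
other_home = rp.can_fetch("OtherBot/1.0", "https://www.example.com/index.html")

# crawl_delay() returns the group's Crawl-delay, or None if none is declared
bot_delay = rp.crawl_delay("ExampleBot/1.0")
other_delay = rp.crawl_delay("OtherBot/1.0")

print(bot_home, bot_private, other_home, bot_delay, other_delay)
```

The user agent string passed to can_fetch() therefore matters: the same URL can be allowed for one crawler and forbidden for another.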

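2.3.2 Implementation

python

def parse_robots_robotparser(url):
    """
    Load and parse a site's robots.txt with urllib.robotparser.
    :param url: site origin
    :return: a RobotFileParser whose can_fetch() answers permission queries, or None on failure
    """
    rp = robotparser.RobotFileParser()
    # Point the parser at the site's robots.txt
    rp.set_url(urljoin(url, "/robots.txt"))
    try:
        # read() fetches with urllib's default User-Agent. A 401/403 response
        # makes the parser treat everything as disallowed; other 4xx responses
        # make it treat everything as allowed.
        rp.read()
        return rp
    except Exception as e:
        print(f"Failed to load robots.txt: {e}")
        return None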

# Parse Zhihu's robots.txt
zhihu_rp = parse_robots_robotparser("https://www.zhihu.com")

# Paths to check
test_paths = [
    "/",  # site root
    "/admin/",  # admin path
    "/question/123456/",  # question page
    "/api/v4/"  # API endpoint
]

print("\nZhihu robots.txt permission check:")
for path in test_paths:
    can_fetch = zhihu_rp.can_fetch("*", path) if zhihu_rp else False
    print(f"Path {path}: {'allowed' if can_fetch else 'disallowed'}")

2.3.3 Output

Sample output; the result depends on the robots.txt the site serves when the script runs:

plaintext

Zhihu robots.txt permission check:
Path /: allowed
Path /admin/: disallowed
Path /question/123456/: allowed
Path /api/v4/: disallowed

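3. Core Principles of Compliant Crawling

3.1 Follow the Robots protocol

This is the most basic principle: use the parsing methods above to find out which paths are disallowed, and stay away from them. For example:

Do not crawl paths that robots.txt disallows, such as /admin/ or /api/;
Honour the Crawl-delay directive and space out your requests accordingly.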
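
One way to honour Crawl-delay is sketched below: ask the parsed robots.txt for the declared delay and fall back to a conservative default when none is given. The helper name polite_delay, the 1-second floor and the sample rules are assumptions made for illustration.

```python
from urllib import robotparser


def polite_delay(rp, user_agent, min_delay=1.0):
    """Seconds to wait between requests for `user_agent`."""
    declared = rp.crawl_delay(user_agent)  # None when no Crawl-delay applies
    if declared is None:
        return min_delay
    # Never go faster than our own floor, even if the site allows it
    return max(float(declared), min_delay)


# Hypothetical rule sets, parsed offline
with_delay = robotparser.RobotFileParser()
with_delay.parse(["User-agent: *", "Disallow: /admin/", "Crawl-delay: 5"])

without_delay = robotparser.RobotFileParser()
without_delay.parse(["User-agent: *", "Disallow: /admin/"])

print(polite_delay(with_delay, "*"))
print(polite_delay(without_delay, "*"))
```

The returned value can be passed straight to time.sleep() between requests.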

3.2 Use well-formed request headers

Request headers are one of the main signals a site uses to identify crawlers, so requests should look like those of a legitimate browser:

Header	Requirement	Example
User-Agent	Avoid the default "python-requests/2.28.1"; use a real browser UA	Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36
Referer	Simulate a realistic referring page (optional)	https://www.zhihu.com/
Accept-Encoding	Support the common encodings	gzip, deflate, br
Accept-Language	Simulate the user's language preference	zh-CN,zh;q=0.9

3.2.1 Request header example

python

# Well-formed request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Referer": "https://www.zhihu.com/",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive"
}

# Send the request with these headers
response = requests.get("https://www.zhihu.com/question/123456", headers=headers, timeout=10)

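3.3 Control the crawl rate

Even when robots.txt does not specify a Crawl-delay, limit your request frequency so that you do not put pressure on the server:

Single-threaded crawlers: sleep 1-5 seconds after each request;
Multi-threaded / asynchronous crawlers: cap concurrency (for example 5-10 concurrent requests) and enforce a global request interval.

3.3.1 Rate control example

python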

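import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Single-threaded rate control
def crawl_single(url):
    headers = {"User-Agent": "Mozilla/5.0 ..."}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
        print(f"Fetched {url}")
        return response.text
    except requests.RequestException as e:
        print(f"Failed to fetch {url}: {e}")
        return None
    finally:
        # Sleep a random 1-3 seconds after every request (success or failure)
        # so the interval is not a fixed, easily recognised value
        time.sleep(random.uniform(1, 3))

# Multi-threaded rate control (capped concurrency)
def crawl_multi(urls, max_workers=5):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Collect the results instead of discarding them
        return list(executor.map(crawl_single, urls))

# Test
test_urls = [f"https://www.zhihu.com/question/{i}" for i in range(123456, 123466)]
pages = crawl_multi(test_urls, max_workers=3)  # at most 3 concurrent requests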
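
Capping the worker count limits concurrency, but each worker still sleeps on its own schedule, so the pool as a whole can send several requests in quick succession. A global request interval can be sketched with a small lock-protected limiter shared by all threads. The class name RateLimiter and the 2-second interval are assumptions for illustration.

```python
import threading
import time


class RateLimiter:
    """Allow at most one call to proceed per `min_interval` seconds, across threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # monotonic time of the next free slot

    def wait(self):
        # Reserve the next slot while holding the lock, then sleep outside it
        # so other threads can queue up their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            time.sleep(delay)


# One limiter shared by every worker thread
limiter = RateLimiter(min_interval=2.0)
```

Calling limiter.wait() at the top of crawl_single, before the request is sent, spaces requests at least two seconds apart regardless of how many workers the pool runs.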

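3.4 Respect copyright and data-use rules

Use crawled data only for learning or other non-commercial purposes; do not resell or misuse it;
Avoid collecting personal data (such as phone numbers or ID card numbers);
Where a site publishes explicit data-use terms, comply with them.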
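
When a page that is otherwise fine to collect happens to contain personal identifiers, one option is to mask them before the text is stored. The sketch below is a rough illustration only: the patterns assume mainland China formats (11-digit mobile numbers, 18-character resident ID numbers), the function name redact_personal_data is hypothetical, and simple patterns like these can produce both false positives and misses.

```python
import re

# Assumed formats: 18-character resident ID (17 digits + digit or X),
# and 11-digit mobile numbers starting with 13-19.
ID_PATTERN = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
PHONE_PATTERN = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def redact_personal_data(text):
    """Replace ID-card-like and phone-like numbers with placeholders."""
    # Mask IDs first so a phone-like run inside an ID is not partially replaced
    text = ID_PATTERN.sub("[ID]", text)
    return PHONE_PATTERN.sub("[PHONE]", text)


print(redact_personal_data("call 13812345678 now"))
print(redact_personal_data("id 11010519491231002X ok"))
```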

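3.5 Avoid malicious crawling

Prohibited behaviour	Risk
High-frequency requests	Overloads the site's servers, triggers anti-crawling blocks, and may even amount to the offence of "sabotaging a computer information system"
Forged requests (e.g. Cookie, Token)	Simulating a login to collect data you are not authorised to access is an infringement and may be illegal
Bypassing anti-crawling mechanisms	Violates the site's terms of use and may lead to legal disputes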

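4. In Practice: A Complete Compliant Crawler

4.1 Requirements

Crawl the titles of public Zhihu questions while following the Robots protocol, controlling the crawl rate, and sending well-formed request headers.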

4.2 Implementation

python

import random
import time
from urllib import robotparser
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

class ComplianceCrawler:
    def __init__(self, base_url, user_agent=None):
        self.base_url = base_url
        # Initialise the robots.txt parser
        self.rp = robotparser.RobotFileParser()
        self.rp.set_url(urljoin(base_url, "/robots.txt"))
        # Remember whether robots.txt was actually loaded; the parser object
        # itself is always truthy, so it cannot be used as the check
        self.robots_loaded = False
        try:
            self.rp.read()
            self.robots_loaded = True
        except Exception as e:
            print(f"Failed to load robots.txt, falling back to a conservative policy: {e}")
        
        # Well-formed request headers
        self.headers = {
            "User-Agent": user_agent or "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Referer": base_url,
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.9"
        }
        
    def can_crawl(self, path):
        """Return True if the given path may be crawled."""
        if not self.robots_loaded:
            # robots.txt unavailable: refuse sensitive paths by default
            sensitive_paths = ["/admin/", "/api/", "/login/"]
            return not any(path.startswith(sp) for sp in sensitive_paths)
        return self.rp.can_fetch(self.headers["User-Agent"], path)
    
    def crawl_path(self, path):
        """Fetch the given path."""
        if not self.can_crawl(path):
            print(f"Path {path} is disallowed, skipping")
            return None
        
        url = urljoin(self.base_url, path)
        try:
            # Space out requests (random 1-2 seconds)
            time.sleep(random.uniform(1, 2))
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            return response.text
        except Exception as e:
            print(f"Failed to fetch {url}: {e}")
            return None
    
    def extract_question_title(self, html):
        """Extract the question title (simplified, for illustration only)."""
        soup = BeautifulSoup(html, "html.parser")
        title_tag = soup.find("h1", class_="QuestionHeader-title")
        return title_tag.get_text().strip() if title_tag else None

# Main flow
if __name__ == "__main__":
    # Create the compliant crawler
    crawler = ComplianceCrawler("https://www.zhihu.com")
    
    # Question paths to crawl
    question_paths = [
        "/question/19552870",  # a question about Python crawlers
        "/question/20196086",  # a question about data compliance
        "/admin/",  # a disallowed path (for testing)
        "/question/30395217"   # a question about learning to program
    ]
    
    # Fetch each page and extract its title
    for path in question_paths:
        html = crawler.crawl_path(path)
        if html:
            title = crawler.extract_question_title(html)
            if title:
                print(f"Title: {title}")
            else:
                print(f"No title found for {path}")

4.3 Output

Sample output; actual results depend on the live pages and on the robots.txt the site serves when the script runs:

plaintext

Title: 如何系统地学习 Python 爬虫？
Title: 数据爬取和使用的法律边界在哪里？
Path /admin/ is disallowed, skipping
Title: 零基础如何自学编程？

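5. Responding to Anti-Crawling Measures and the Limits of Compliance

5.1 Common anti-crawling mechanisms and compliant responses

Mechanism	Compliant response	Non-compliant response (prohibited)
IP ban	Lower the crawl rate; use legitimate proxy IPs (e.g. a paid, compliant proxy service)	Rotating through a proxy pool at high frequency; attacking the site's servers
User-Agent detection	Use a real browser UA; rotate among legitimate UAs	Forging fake UAs; mass-generating invalid UAs
CAPTCHA	Stop crawling (solving it by hand does not make automated crawling compliant)	Using a CAPTCHA-solving platform to recognise CAPTCHAs automatically
Rate limiting	Strictly honour Crawl-delay; cap concurrency	Bypassing the rate limit; brute-force multi-threaded requests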
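
Lowering the crawl rate when a site pushes back can be sketched as an exponential backoff. The function below assumes the site signals overload with an HTTP 429 or 503 response and may send a Retry-After header giving a number of seconds; both are assumptions about the target site. The function name next_delay and the base and cap values are hypothetical. A Retry-After given as an HTTP date is not parsed here and falls through to the backoff.

```python
def next_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    `retry_after` is the raw Retry-After header value, if the server sent one.
    """
    if retry_after is not None and retry_after.strip().isdigit():
        # The server said how long to wait: take it at its word
        return float(retry_after.strip())
    # Otherwise double the wait on each attempt, up to `cap` seconds
    return min(cap, base * (2 ** attempt))


print([next_delay(n) for n in range(5)])
print(next_delay(1, retry_after="30"))
```

If the responses keep signalling overload after a few retries, the compliant choice is to stop rather than to keep retrying.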