Python Web Scraping and Data Analysis: From Data Collection to Analysis and Visualization


第一程序员 · Published 2026-04-04 23:22:21


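Preface

Hi everyone, I'm 第一程序员 ("Number One Programmer": big name, modest skills). I came to programming without a computer science degree and am currently learning Rust and Python, and I recently started on web scraping and data analysis in Python. Honestly, my idea of what a scraper was started out fuzzy, but the further I got, the clearer it became that scraping is an important way to obtain data, and that data analysis is the key to extracting value from it. In this post I share what I have learned about both, in the hope that it is a useful reference for others making the same career switch.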

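1. Scraping Basics

1.1 Using the requests Library

requests is a third-party Python library for sending HTTP requests (install it with pip install requests):

import requests

# Send a GET request (the timeout keeps the call from hanging indefinitely)
response = requests.get('https://api.github.com/users/octocat', timeout=10)

# Check the response status code
if response.status_code == 200:
    # Parse the JSON response body
    data = response.json()
    print(data)
else:
    print(f"Error: {response.status_code}")

# Send a POST request with form-encoded data
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('https://httpbin.org/post', data=payload, timeout=10)
print(response.text)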

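1.2 Parsing HTML with BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML:

import requests
from bs4 import BeautifulSoup

# Fetch the page
url = 'https://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on a 4xx/5xx response

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the page title (soup.title is None when the page has no <title>)
if soup.title is not None:
    print(soup.title.text)

# Extract all links
for link in soup.find_all('a'):
    print(link.get('href'))

# Extract specific elements, here every paragraph
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)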
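
The href values printed by the link loop are often relative paths such as /about, and link.get('href') returns None for an <a> tag that has no href attribute. Here is a minimal sketch of turning such values into absolute URLs with the standard library's urljoin. The helper name absolute_links and the sample values are made up for illustration; they are not part of BeautifulSoup.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve raw href values against base_url, skipping unusable entries."""
    links = []
    for href in hrefs:
        # Skip missing hrefs, in-page anchors, and non-HTTP schemes
        if not href or href.startswith(('#', 'mailto:', 'javascript:')):
            continue
        links.append(urljoin(base_url, href))
    return links


# Hypothetical values, as they might come out of link.get('href')
raw = ['/about', 'contact.html', None, '#top', 'https://www.iana.org/domains/example']
links = absolute_links('https://example.com/docs/', raw)
print(links)
```

In a real script the second argument would be something like [a.get('href') for a in soup.find_all('a')], and base_url would be the URL the page was fetched from.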

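1.3 Using Selenium

Selenium automates a real browser, which makes it suitable for pages rendered by JavaScript:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Start the browser
driver = webdriver.Chrome()
try:
    # Open the page
    driver.get('https://example.com')

    # Wait up to 10 seconds for the heading to appear, instead of a fixed sleep
    heading = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'h1'))
    )
    print(heading.text)

    # Click a button (find_element raises NoSuchElementException if the page has none)
    button = driver.find_element(By.CSS_SELECTOR, 'button')
    button.click()
finally:
    # Close the browser even if a step above fails
    driver.quit()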

2. Advanced Scraping

2.1 Using the Scrapy Framework

Scrapy is a full-featured Python scraping framework:

# Create a Scrapy project
# scrapy startproject myproject

# Define the spider
# myproject/spiders/example_spider.py
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']  # keep the crawl from wandering onto other sites
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract data
        title = response.css('h1::text').get()
        yield {'title': title}

        # Follow links and parse each linked page the same way
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, self.parse)

# Run the spider
# scrapy crawl example

2.2 Dealing with Anti-Scraping Measures

Common anti-scraping measures and ways to handle them:

import time

import requests
from fake_useragent import UserAgent

# Use a random User-Agent
ua = UserAgent()
headers = {'User-Agent': ua.random}

# Send the request
response = requests.get('https://example.com', headers=headers, timeout=10)

# Use a proxy (placeholder addresses; replace them with a real proxy)
proxies = {
    'http': 'http://your-proxy:port',
    'https': 'https://your-proxy:port'
}
response = requests.get('https://example.com', proxies=proxies, timeout=10)

# Delay between requests
time.sleep(1)  # wait 1 second
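
A fixed one-second delay is the simplest form of throttling. When a site starts rejecting requests, a common refinement is to retry with a delay that doubles each time, plus a little random jitter. Below is a minimal sketch of that idea, not a definitive implementation. The function name fetch_with_backoff, its parameters, and the fake flaky_fetch used for the demonstration are all assumptions of mine; the fetch and sleep callables are passed in so the logic can be tried without touching the network.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)


# Demonstration with a fake fetch that fails twice before succeeding
calls = {'count': 0}


def flaky_fetch(url):
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated failure')
    return f'content of {url}'


waits = []  # record the delays instead of actually sleeping
result = fetch_with_backoff(flaky_fetch, 'https://example.com', sleep=waits.append)
print(result, waits)
```

With requests, fetch could be a small function that calls requests.get(url, headers=headers, timeout=10) followed by response.raise_for_status(), so that HTTP error responses also count as failures.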

3. Data Storage

3.1 Saving to a CSV File

import csv

# Write a CSV file (the first row is the header)
data = [
    ['name', 'age', 'city'],
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'London'],
    ['Charlie', 35, 'Paris']
]

with open('data.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(data)

# Read the CSV file back
with open('data.csv', 'r', newline='', encoding='utf-8') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
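
Scraped records usually arrive as dictionaries rather than lists, and csv.DictWriter and csv.DictReader handle that shape directly. The sketch below uses an in-memory io.StringIO buffer in place of a real file so it can run anywhere; the sample records are the same made-up people as above.

```python
import csv
import io

records = [
    {'name': 'Alice', 'age': 25, 'city': 'New York'},
    {'name': 'Bob', 'age': 30, 'city': 'London'},
]

# io.StringIO stands in for a file opened with open('data.csv', 'w', newline='')
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['name', 'age', 'city'])
writer.writeheader()        # writes the header row from fieldnames
writer.writerows(records)

# Rewind and read the rows back as dictionaries keyed by the header
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows)
```

Note that everything read back from CSV is a string ('25', not 25), so numeric columns need converting before analysis.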

3.2 Saving to a JSON File

import json

# Write a JSON file
data = {
    'name': 'Alice',
    'age': 25,
    'city': 'New York'
}

with open('data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, indent=4)

# Read the JSON file back
with open('data.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
    print(data)
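
One detail that matters when the scraped text is not English: by default the json module escapes every non-ASCII character as \uXXXX, which is valid JSON but hard to read in the file. Here is a small sketch of the difference that ensure_ascii=False makes; the sample record is made up.

```python
import json

record = {'name': '张伟', 'city': '北京'}

escaped = json.dumps(record)                       # default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as written

print(escaped)
print(readable)

# Both forms decode back to the same data
same = json.loads(escaped) == json.loads(readable)
```

The same flag works with json.dump; in that case the file should be opened with encoding='utf-8' so the characters are written consistently across platforms.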

3.3 Saving to a Database

import sqlite3

# Connect to the database (the file is created if it does not exist)
conn = sqlite3.connect('data.db')
cursor = conn.cursor()

# Create the table
cursor.execute('''
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    age INTEGER,
    city TEXT
)
''')

# Insert rows using ? placeholders rather than string formatting
users = [
    ('Alice', 25, 'New York'),
    ('Bob', 30, 'London'),
    ('Charlie', 35, 'Paris')
]
cursor.executemany('INSERT INTO users (name, age, city) VALUES (?, ?, ?)', users)
conn.commit()

# Query the data
cursor.execute('SELECT * FROM users')
rows = cursor.fetchall()
for row in rows:
    print(row)

# Close the connection
conn.close()
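
Running the script above a second time inserts the same three people again, because nothing in the table says a name must be unique. One way to make repeated scraper runs safe is a UNIQUE constraint combined with INSERT OR IGNORE. The sketch below assumes that name identifies a record, which is my assumption for this toy data and not necessarily true of real data. It uses an in-memory database so it leaves no file behind.

```python
import sqlite3

# ':memory:' keeps this demo self-contained; use a file path such as 'data.db' in practice
conn = sqlite3.connect(':memory:')
conn.execute('''
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    age INTEGER,
    city TEXT
)
''')

users = [('Alice', 25, 'New York'), ('Bob', 30, 'London')]

# Simulate running the scraper twice: the second pass inserts nothing new
for _ in range(2):
    with conn:  # commits on success, rolls back on an exception
        conn.executemany(
            'INSERT OR IGNORE INTO users (name, age, city) VALUES (?, ?, ?)', users
        )

count = conn.execute('SELECT COUNT(*) FROM users').fetchone()[0]
older = conn.execute('SELECT name FROM users WHERE age > ?', (26,)).fetchall()
print(count, older)
conn.close()
```

Using the connection as a context manager handles commit and rollback, but it does not close the connection, so the explicit conn.close() is still needed.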

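4. Data Analysis

4.1 Data Analysis with Pandas

import pandas as pd

# Read the CSV file
df = pd.read_csv('data.csv')

# Inspect the data
print(df.head())
df.info()  # info() prints its report itself and returns None
print(df.describe())

# Filter rows
filtered_df = df[df['age'] > 30]
print(filtered_df)

# Group by city and average the numeric column
# (calling .mean() on every column fails on text columns such as name in pandas 2.x)
grouped_df = df.groupby('city')['age'].mean()
print(grouped_df)

# Sort rows
sorted_df = df.sort_values('age', ascending=False)
print(sorted_df)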
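
Data that comes out of a scraper is rarely as tidy as data.csv: values arrive as text, rows repeat, and some fields are junk. Here is a minimal cleaning sketch, assuming pandas is installed. The raw rows, including the 'unknown' age, are invented for illustration.

```python
import pandas as pd

# Hypothetical scraped rows: everything arrives as text, with one duplicate and one bad value
raw = [
    {'name': 'Alice', 'age': '25', 'city': 'New York'},
    {'name': 'Bob', 'age': '30', 'city': 'London'},
    {'name': 'Bob', 'age': '30', 'city': 'London'},
    {'name': 'Dana', 'age': 'unknown', 'city': 'Paris'},
]
df = pd.DataFrame(raw)

df = df.drop_duplicates()                              # remove the repeated Bob row
df['age'] = pd.to_numeric(df['age'], errors='coerce')  # unparseable text becomes NaN
clean = df.dropna(subset=['age']).reset_index(drop=True)

print(clean)
print(clean['age'].mean())
```

Whether to drop rows with a missing age or keep them and fill in a value depends on the analysis; dropping is just the simplest choice for a sketch.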

4.2 Numerical Computing with NumPy

import numpy as np

# Create an array
arr = np.array([1, 2, 3, 4, 5])
print(arr)

# Basic operations (arithmetic is applied element by element)
print(arr + 1)
print(arr * 2)
print(np.sum(arr))
print(np.mean(arr))
print(np.max(arr))
print(np.min(arr))

# Multidimensional arrays
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr_2d)
print(arr_2d.shape)
print(np.sum(arr_2d, axis=0))  # sum of each column
print(np.sum(arr_2d, axis=1))  # sum of each row
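
Two more NumPy idioms come up constantly in analysis work: selecting elements with a boolean mask, and standardizing a column to z-scores. Here is a short sketch, assuming NumPy is installed; the ages are the same made-up values used in the CSV example.

```python
import numpy as np

ages = np.array([25, 30, 35])

# A comparison produces a boolean array with one True/False per element
mask = ages > 28
selected = ages[mask]  # keeps only the elements where the mask is True

# Standardize: subtract the mean, divide by the (population) standard deviation
z_scores = (ages - ages.mean()) / ages.std()

print(mask)
print(selected)
print(z_scores)
```

The filter df[df['age'] > 30] in the Pandas section works the same way: the comparison builds a boolean mask and the indexing keeps the matching rows.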

5. Data Visualization

5.1 Using Matplotlib

import matplotlib.pyplot as plt
import pandas as pd

# Read the data
df = pd.read_csv('data.csv')

# Bar chart
plt.bar(df['name'], df['age'])
plt.xlabel('Name')
plt.ylabel('Age')
plt.title('Age by Name')
plt.show()

# Line chart
plt.plot(df['name'], df['age'])
plt.xlabel('Name')
plt.ylabel('Age')
plt.title('Age by Name')
plt.show()

# Scatter plot
plt.scatter(df['age'], df['age'])  # example only; real data should use two different variables
plt.xlabel('Age')
plt.ylabel('Age')
plt.title('Scatter Plot')
plt.show()
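
plt.show() opens a window, which does not help when a script runs on a server or as a scheduled job. In that situation the chart can be written to an image file instead. Below is a minimal sketch, assuming Matplotlib is installed; the output file name and the inline sample data are my own choices for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend: draws to files, needs no display
import matplotlib.pyplot as plt

names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30, 35]

# The object-oriented interface: one figure, one set of axes
fig, ax = plt.subplots()
ax.bar(names, ages)
ax.set_xlabel('Name')
ax.set_ylabel('Age')
ax.set_title('Age by Name')

# A temporary directory keeps the demo tidy; any writable path works
out_path = os.path.join(tempfile.mkdtemp(), 'age_by_name.png')
fig.savefig(out_path, dpi=150)
plt.close(fig)  # free the figure's memory once it has been saved
print(out_path)
```

The file extension decides the format, so changing .png to .svg or .pdf is enough to get vector output.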

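5.2 Using Seaborn

import matplotlib.pyplot as plt  # needed for plt.title and plt.show below
import seaborn as sns
import pandas as pd

# Read the data
df = pd.read_csv('data.csv')

# Bar chart
sns.barplot(x='name', y='age', data=df)
plt.title('Age by Name')
plt.show()

# Box plot
sns.boxplot(x='city', y='age', data=df)
plt.title('Age by City')
plt.show()

# Heatmap
# Build the correlation matrix (only numeric columns can be correlated)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True)
plt.title('Correlation Matrix')
plt.show()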