Learn Python in 30 Days: 26. A Guide to Web Crawler Programming in Python

admin

July 17, 2025, 21:56

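Overview

This guide introduces the core knowledge and techniques of web crawler development in Python, covering both basic theory and practical techniques.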

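1 Web Crawler Basics

1.1 Definition and How It Works

A web crawler is a program that automatically extracts information from web pages. Its core workflow consists of four steps:

Send the request: send an HTTP/HTTPS request to the target server
Receive the response: receive the HTML/JSON data returned by the server
Parse the content: extract the required information from the response data
Store the data: save the extracted data locally or in a database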
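
The four steps above can be sketched with the standard library alone. This is a minimal illustration rather than a production crawler: `TitleParser`, `fetch`, `parse`, and `crawl` are hypothetical names, the only thing extracted is the page title, and "storage" is simply a Python list.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def fetch(url):
    # Steps 1 and 2: send the request and read the response body.
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse(html):
    # Step 3: extract the piece of information we care about.
    parser = TitleParser()
    parser.feed(html)
    return {"title": parser.title.strip()}


def crawl(url, fetch_fn=fetch, store=None):
    # Step 4: "store" is just a list here; a file or database works the same way.
    store = [] if store is None else store
    store.append(parse(fetch_fn(url)))
    return store
```

Passing `fetch_fn` as a parameter keeps the parsing and storage steps testable without touching the network.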

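Notes:

Set a reasonable interval between requests (typically 1-3 seconds)
Handle HTTP status codes (200 for success, 404 for not found, and so on)
Account for page encoding (UTF-8 is recommended)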
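
One way to enforce the request interval mentioned above is a small helper that sleeps only for the time still missing since the previous request. The class name `PoliteDelay` and its parameters are assumptions made for this sketch; the clock and sleep functions are injectable so the behavior can be checked without actually waiting.

```python
import random
import time


class PoliteDelay:
    """Keeps consecutive calls to wait() a random low..high seconds apart."""

    def __init__(self, low=1.0, high=3.0, clock=time.monotonic, sleep=time.sleep):
        self.low, self.high = low, high
        self.clock, self.sleep = clock, sleep
        self.last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self.clock()
        if self.last is not None:
            target = random.uniform(self.low, self.high)
            remaining = target - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()
```

Calling `delay.wait()` immediately before each request is enough; the first call returns without sleeping.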

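1.2 Legal and Ethical Guidelines

Crawler development must comply with legal and ethical guidelines:

Concern	Compliance recommendation
robots protocol	Follow the rules in the target site's robots.txt
Request rate	Add an appropriate delay (≥1 second)
Data usage	Use the data only for lawful purposes
User authentication	Do not bypass login verification mechanisms

Principle: crawl only public data, avoid infringing on privacy, and respect the site's terms of service.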
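
The robots.txt rule can be checked programmatically with the standard library's `urllib.robotparser`. The sketch below parses a hypothetical robots.txt held in a string so it runs offline; the `is_allowed` helper and the `my-crawler` user agent are illustrative names, and a real crawler would typically load the file with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="my-crawler"):
    # True when the parsed rules permit this user agent to fetch the URL.
    return rp.can_fetch(user_agent, url)
```

`rp.crawl_delay(user_agent)` returns the site's requested delay when one is declared, which can feed directly into the request interval.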

2 Using Request Libraries

2.1 requests Basics

Requests is one of the most widely used HTTP libraries for Python and offers a concise API:

import requests

def fetch_page(url):
    try:
        # Set request headers to mimic a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
            'Accept-Language': 'zh-CN,zh;q=0.9'
        }

        # Send the GET request
        response = requests.get(url, headers=headers, timeout=5)

        # Raise an exception for 4xx/5xx status codes
        response.raise_for_status()

        # Return the response body as text
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

# Usage example
html_content = fetch_page('https://example.com')

Parameters:

headers: sets the request headers, here to mimic a browser
timeout: sets the timeout in seconds
params: passes URL query parameters as a dictionary
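
To make the `params` option concrete: requests percent-encodes the dictionary and appends it to the URL as a query string. The sketch below reproduces roughly the same result with the standard library; `build_url` is a hypothetical helper for illustration, not part of requests.

```python
from urllib.parse import urlencode


def build_url(base_url, params):
    # Percent-encode the dictionary and append it as the query string,
    # roughly what requests.get(base_url, params=params) ends up requesting.
    return f"{base_url}?{urlencode(params)}"


search_url = build_url("https://example.com/search", {"q": "python crawler", "page": 2})
```

In real code there is no need to build the URL by hand; passing the dictionary through `params=` is the simpler route.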

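2.2 Request Techniques

import requests

url = 'https://example.com'  # placeholder target URL

# 1. Session persistence (cookies are kept across requests)
session = requests.Session()
# Login forms are normally submitted with POST; data= sends the form body
session.post('https://example.com/login', data={'user': 'name', 'pass': 'secret'})

# 2. Proxy settings
proxies = {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'}
response = requests.get(url, proxies=proxies)

# 3. File download (stream large files in chunks)
with requests.get('https://example.com/large_file.zip', stream=True) as r:
    r.raise_for_status()
    with open('large_file.zip', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)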

Practical tips:

Use a Session object to stay logged in
Use proxy IPs to deal with IP blocking
Use stream=True for large downloads to avoid running out of memory

3 Data Parsing Techniques

3.1 Parsing with BeautifulSoup

BeautifulSoup provides a simple, easy-to-use interface for parsing HTML/XML:

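from bs4 import BeautifulSoup

def parse_html(html):
    soup = BeautifulSoup(html, 'lxml')  # the lxml parser is recommended

    # Extract elements with a CSS selector
    titles = [h1.text for h1 in soup.select('h1.article-title')]

    # Extract attributes (href=True skips <a> tags that have no href)
    links = [a['href'] for a in soup.find_all('a', class_='external', href=True)]

    # Extract text; find() returns None when the element is missing
    content_div = soup.find('div', id='content')
    content = content_div.get_text(strip=True, separator='\n') if content_div else ''

    return {'titles': titles, 'links': links, 'content': content}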

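Parser comparison:

Parser	Speed	Dependency	Best suited for
html.parser	Moderate	Python standard library	Simple HTML parsing
lxml	Fast	Requires installation	Efficient parsing of complex documents
html5lib	Slow	Requires installation	Parsing malformed HTML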

3.2 XPath and lxml

XPath offers more precise node selection:

from lxml import etree

def xpath_parse(html):
    tree = etree.HTML(html)

    # Extract text content
    prices = tree.xpath('//div[@class="price"]/text()')

    # Extract nested data; xpath() returns a list, so guard against empty results
    items = []
    for product in tree.xpath('//div[@class="product"]'):
        names = product.xpath('.//h2/text()')
        skus = product.xpath('./@data-sku')
        items.append({
            'name': names[0] if names else None,
            'sku': skus[0] if skus else None
        })

    return {'prices': prices, 'items': items}

Common XPath expressions:

//div: selects all div elements
//div[@class='name']: selects div elements whose class is name
//a/text(): extracts link text
//img/@src: extracts the src attribute of images
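
To see what these expressions select, the sketch below runs their closest equivalents against a tiny hypothetical snippet. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (no `text()` or `@attr` steps), so text and attributes are read from the matched elements instead. With lxml, the expressions listed above can be passed to `tree.xpath()` unchanged.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet; real-world HTML usually needs lxml.
SNIPPET = """
<div>
  <div class="name">Widget</div>
  <a href="/a">First link</a>
  <img src="/img/1.png" />
</div>
"""

root = ET.fromstring(SNIPPET)

# Counterpart of //div (findall searches below the root element only)
all_divs = root.findall(".//div")
# Counterpart of //div[@class='name']
named = root.findall(".//div[@class='name']")
# Counterpart of //a/text()
link_texts = [a.text for a in root.findall(".//a")]
# Counterpart of //img/@src
img_srcs = [img.get("src") for img in root.findall(".//img")]
```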

4 Handling Dynamic Pages

4.1 Browser Automation with Selenium

Selenium can simulate browser behavior to handle pages rendered by JavaScript:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def selenium_crawl(url):
    # Configure a headless browser
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')

    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)

        # Explicit wait: block for up to 10 seconds until the element is present
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".dynamic-content"))
        )

        # Run JavaScript to scroll to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Return the rendered HTML
        return driver.page_source
    finally:
        # Always shut the browser down, even if the wait times out
        driver.quit()

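Suggestions:

Use headless mode to save resources
Add explicit waits (WebDriverWait) so elements are not accessed before they have loaded

Limit resource loading to speed up page fetching:

options.add_experimental_option("prefs", {
    'profile.managed_default_content_settings.images': 2,  # do not load images
    # Intended to skip CSS; this key looks like a Firefox preference name,
    # so Chrome likely ignores it
    'permissions.default.stylesheet': 2
})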

4.2 Reverse-Engineering APIs

Calling the data API directly is more efficient than rendering the page:

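import requests
import time

def api_crawl():
    # Build the parameters observed in the page's XHR request
    api_url = 'https://api.example.com/data'
    params = {
        'page': 1,
        'size': 20,
        'timestamp': int(time.time()*1000)  # millisecond timestamp to defeat caching
    }

    # Add the authentication header
    headers = {'Authorization': 'Bearer token123'}

    response = requests.get(api_url, params=params, headers=headers, timeout=10)
    response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error body
    data = response.json()

    # Walk the parsed JSON
    for item in data['list']:
        print(f"Product: {item['name']}, price: {item['price']}")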

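API analysis steps:

Open the browser developer tools → Network → XHR
Find the API request that returns the target data
Analyze the request parameters and authentication method
Replay the same request to obtain structured data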
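
Once the request can be replayed, the usual next step is to walk through all pages. The sketch below reuses the `page`, `size`, and `timestamp` parameters and the `list` key from the `api_crawl` example; the stopping rule (a page shorter than `size` is the last one) and the names `build_params` and `crawl_all_pages` are assumptions that may not hold for a given API.

```python
import time


def build_params(page, size=20):
    # Same three parameters as api_crawl; the millisecond timestamp
    # only serves to defeat caching.
    return {"page": page, "size": size, "timestamp": int(time.time() * 1000)}


def crawl_all_pages(fetch_json, max_pages=50, size=20):
    """Collect items page by page until a short or empty page appears.

    fetch_json is any callable that takes the params dict and returns the
    decoded JSON, e.g. lambda p: requests.get(API_URL, params=p).json().
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_json(build_params(page, size)).get("list", [])
        items.extend(batch)
        if len(batch) < size:  # assumed signal that this was the last page
            break
    return items
```

The `max_pages` cap guards against an API that never returns a short page.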

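5 Data Storage Options

5.1 File Storage

import csv
import json

# CSV storage (suited to tabular data)
def save_to_csv(data, filename):
    if not data:
        return  # nothing to write; data[0] below would raise IndexError
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        # Column names come from the keys of the first record
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

# JSON storage (suited to structured data)
def save_to_json(data, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file
        json.dump(data, f, ensure_ascii=False, indent=2)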

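5.2 Database Storage

import sqlite3
from contextlib import closing

import pymongo

# SQLite storage (lightweight relational database)
def sqlite_save(data):
    # closing() makes sure the connection is closed even if an insert fails
    with closing(sqlite3.connect('data.db')) as conn:
        c = conn.cursor()
        c.execute('''CREATE TABLE IF NOT EXISTS products
                   (id TEXT PRIMARY KEY, name TEXT, price REAL)''')
        # REPLACE INTO overwrites any row that has the same primary key
        c.executemany('REPLACE INTO products VALUES (?,?,?)',
                     [(d['id'], d['name'], d['price']) for d in data])
        conn.commit()

# MongoDB storage (document database)
def mongo_save(data):
    if not data:
        return  # bulk_write raises on an empty operation list
    client = pymongo.MongoClient('mongodb://localhost:27017/')
    db = client['web_data']
    collection = db['products']
    # Bulk upsert: replace documents that exist, insert those that do not
    bulk_ops = [pymongo.ReplaceOne({'id': item['id']}, item, upsert=True)
                for item in data]
    collection.bulk_write(bulk_ops)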
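
The effect of `REPLACE INTO` in `sqlite_save` is easiest to see on an in-memory database. The sketch below is an illustration under assumed sample data: `upsert_products` is a hypothetical helper that uses the same table layout, and saving the same `id` twice leaves a single row holding the latest price.

```python
import sqlite3
from contextlib import closing


def upsert_products(conn, rows):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(id TEXT PRIMARY KEY, name TEXT, price REAL)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "REPLACE INTO products VALUES (:id, :name, :price)", rows
        )


with closing(sqlite3.connect(":memory:")) as conn:
    upsert_products(conn, [{"id": "a1", "name": "Widget", "price": 9.5}])
    upsert_products(conn, [{"id": "a1", "name": "Widget", "price": 8.0}])
    rows = conn.execute("SELECT id, name, price FROM products").fetchall()
```

Because the primary key decides what counts as "the same" record, re-running a crawl refreshes existing rows instead of duplicating them.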

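Choosing a storage option:

Data type	Recommended storage
Small structured data	SQLite/CSV
Large structured data	MySQL/PostgreSQL
Semi-structured data	MongoDB/JSON files
Unstructured data	File system/MinIO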

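6 Anti-Crawling Countermeasures

6.1 Common Anti-Crawling Mechanisms and Responses

6.2 Techniques for Getting Past Anti-Crawling Measures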

import random

import requests
from fake_useragent import UserAgent

url = 'https://example.com'  # placeholder target URL

# Proxy IP pool management
class ProxyPool:
    def __init__(self):
        self.proxies = self.load_proxies()
        self.current = 0

    def load_proxies(self):
        # Load the proxy list from a file or an API.
        # The entries below (including the trailing ...) are placeholders.
        return ['http://ip1:port', 'http://ip2:port', ...]

    def get_proxy(self):
        # Round-robin: hand out the proxies in turn
        proxy = self.proxies[self.current % len(self.proxies)]
        self.current += 1
        return {'http': proxy, 'https': proxy}

# Random request header generation
def get_random_headers():
    ua = UserAgent()
    return {
        'User-Agent': ua.random,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': random.choice(['zh-CN', 'en-US', 'ja-JP']),
        'Referer': 'https://www.google.com/'
    }

# Usage example
proxy_pool = ProxyPool()
response = requests.get(url,
                        headers=get_random_headers(),
                        proxies=proxy_pool.get_proxy(),
                        timeout=10)

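Proxy IP practices:

Use a paid proxy service (more stable)
Check proxy availability on a schedule
Use separate proxy pools for different sites
Add a retry mechanism for proxy failures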
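
The last item, retrying when a proxy fails, could look roughly like the sketch below. `fetch_with_proxy_retry` and its `do_request` parameter are hypothetical names; the actual HTTP call is passed in by the caller, so the retry logic stays independent of any particular library.

```python
def fetch_with_proxy_retry(url, proxies, do_request, max_attempts=3):
    """Try up to max_attempts proxies in order and return the first success.

    do_request(url, proxy) is supplied by the caller, for example
    lambda u, p: requests.get(u, proxies={'http': p, 'https': p}, timeout=10).
    """
    last_error = None
    for proxy in proxies[:max_attempts]:
        try:
            return do_request(url, proxy)
        except Exception as exc:  # in practice, narrow this to the library's error type
            last_error = exc
    raise RuntimeError(f"all proxies failed for {url}") from last_error
```

Chaining the final error with `from last_error` keeps the cause of the last failure visible in the traceback.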

7 Application Examples

Case 1: E-commerce Product Crawler

import requests
from bs4 import BeautifulSoup
import time
import random

def ecommerce_crawler(base_url, max_page=10):
    products = []

    for page in range(1, max_page+1):
        # Random delay to avoid getting blocked
        time.sleep(random.uniform(1, 3))

        url = f"{base_url}?page={page}"
        html = fetch_page(url)  # fetch_page as defined in section 2.1

        if not html:
            continue

        soup = BeautifulSoup(html, 'lxml')
        items = soup.select('.product-item')

        for item in items:
            try:
                products.append({
                    'name': item.select_one('.name').text.strip(),
                    'price': float(item.select_one('.price').text.replace('¥', '')),
                    'sku': item['data-sku'],
                    'rating': float(item.select_one('.rating')['data-score'])
                })
            except Exception as e:
                # A missing element or unparsable number skips only this item
                print(f"Parsing failed: {e}")

    # Save the results with save_to_csv from section 5.1 (skipped when nothing was parsed)
    if products:
        save_to_csv(products, 'products.csv')
    return products
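
The price conversion in the crawler above, `float(text.replace('¥', ''))`, fails on strings such as "¥1,299.00". A slightly more tolerant parser is sketched below; `parse_price` is a hypothetical helper, and it assumes commas are thousands separators, which is not true for every locale.

```python
import re


def parse_price(text):
    """Return the first number in text as a float, or None if there is none."""
    # Drop thousands separators, then pick the first integer or decimal number.
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None
```

Returning `None` instead of raising lets the caller decide whether to skip the item or record it without a price.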

Case 2: News Aggregation Crawler

import schedule
import time
from datetime import datetime

# parse_rss, parse_news_api and store_news are not defined in this article;
# they stand for your own RSS parser, API client and storage routine.

def news_monitor():
    sources = [
        {'url': 'https://news.source1.com/rss', 'type': 'rss'},
        {'url': 'https://news.source2.com/api/latest', 'type': 'api'}
    ]

    all_news = []

    for source in sources:
        try:
            if source['type'] == 'rss':
                news = parse_rss(source['url'])
            else:
                news = parse_news_api(source['url'])
            all_news.extend(news)
        except Exception as e:
            # One failing source should not stop the others
            print(f"{source['url']} crawl failed: {e}")

    # Deduplicate and store
    store_news(all_news)
    print(f"{datetime.now()} crawl finished, fetched {len(all_news)} news items")

# Scheduled task configuration
schedule.every(30).minutes.do(news_monitor)  # run every 30 minutes

# Main loop
while True:
    schedule.run_pending()
    time.sleep(60)  # check for due tasks once a minute
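
The "deduplicate and store" step is delegated to `store_news`, which the example leaves undefined. One possible shape for the deduplication half is sketched below. It assumes each news item is a dict with a `url` key, which the article does not specify, and `dedupe_news` is a hypothetical name.

```python
def dedupe_news(items, seen_urls):
    """Return the items whose 'url' has not been seen; update seen_urls in place."""
    fresh = []
    for item in items:
        url = item.get("url")
        if url and url not in seen_urls:
            seen_urls.add(url)
            fresh.append(item)
    return fresh
```

Keeping `seen_urls` outside the function (in memory, or loaded from the database at startup) lets the 30-minute job skip stories it has already stored in earlier runs.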