Python Programs for Web Scraping: Complete Code Examples from Basic to Advanced

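Why Does Web Scraping Need Proxy IPs?

Many people writing Python crawlers run into the same frustrating problem: the code is fine, but after a few runs the target site stops responding. Usually this means the site has identified your IP as a crawler and blocked it.

Picture a shop that notices the same person walking in and out again and again within a few minutes; it will naturally get suspicious. Web servers work the same way. Using proxy IPs is like sending different people to the shop in turn: each IP makes only a few visits, which greatly reduces the risk of being blocked.

Proxy IPs are close to essential for projects that collect data at scale. They help you:

Avoid IP bans – rotating IPs spreads requests across many addresses
Improve collection efficiency – multiple IPs can work in parallel
Access region-specific content – visit through IPs located in the target region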

Basic Crawler Code Example

Let's start with the simplest possible Python crawler, which uses no proxy IP at all:

import requests
from bs4 import BeautifulSoup

def simple_crawler(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raise if the request did not succeed
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup.prettify()  # return the formatted HTML
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

# Usage example
result = simple_crawler('https://httpbin.org/ip')
print(result)

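This code is simple, but after a few consecutive runs it is easily detected by the target site. Next, let's see how to add proxy IPs.

Adding Proxy IP Support to the Crawler

Using a proxy IP is straightforward: pass a proxies argument to the requests call: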

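import requests
import random

def proxy_crawler(url, proxy_list):
    """
    Crawler function that sends the request through a proxy IP.
    """
    if not proxy_list:
        print("No proxy IPs available")
        return None

    # Pick a proxy IP at random
    proxy = random.choice(proxy_list)
    # An ordinary HTTP proxy is addressed with http:// for both keys; HTTPS
    # targets are tunnelled through it. Use https:// here only if the proxy
    # itself accepts TLS connections.
    proxies = {
        'http': f'http://{proxy}',
        'https': f'http://{proxy}'
    }

    try:
        response = requests.get(url, proxies=proxies, timeout=15)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request through proxy {proxy} failed: {e}")
        # Retry logic could be added here
        return None

# Sample proxy list. These are placeholder addresses from a documentation
# range; in real use, obtain working proxies from your provider.
sample_proxies = [
    '203.0.113.100:8080',
    '203.0.113.101:8080',
    '203.0.113.102:8080'
]

# Usage example
result = proxy_crawler('https://httpbin.org/ip', sample_proxies)
print(result)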

Advanced: A Smarter Proxy IP Management Strategy

Professional crawler projects need smarter proxy IP management:

import requests
import time
from collections import defaultdict

class SmartProxyManager:
    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.proxy_stats = defaultdict(int)   # how many times each proxy has been used
        self.failed_count = defaultdict(int)  # how many times each proxy has failed

    def get_best_proxy(self):
        """Pick the best proxy IP."""
        # Simple strategy: prefer proxies with few uses and few failures
        available_proxies = [
            p for p in self.proxies
            if self.failed_count[p] < 3  # fewer than 3 failures
        ]

        if not available_proxies:
            # Every proxy has failed repeatedly: reset the failure counts
            self.failed_count.clear()
            available_proxies = self.proxies

        # Choose the least-used proxy
        return min(available_proxies, key=lambda x: self.proxy_stats[x])

    def mark_success(self, proxy):
        """Record a successful use of the proxy."""
        self.proxy_stats[proxy] += 1

    def mark_failure(self, proxy):
        """Record a failed use of the proxy."""
        self.failed_count[proxy] += 1
        # Count the failed attempt as a use as well, so that the next retry
        # moves on to a different proxy instead of picking the same one again
        self.proxy_stats[proxy] += 1

    def crawl_with_retry(self, url, max_retries=3):
        """Fetch a URL, retrying through different proxies on failure."""
        for attempt in range(max_retries):
            proxy = self.get_best_proxy()
            proxies = {
                'http': f'http://{proxy}',
                'https': f'http://{proxy}'
            }

            try:
                print(f"Attempt {attempt + 1}, using proxy: {proxy}")
                response = requests.get(url, proxies=proxies, timeout=20)
                response.raise_for_status()

                self.mark_success(proxy)
                return response.text

            except requests.exceptions.RequestException as e:
                print(f"Request failed: {e}")
                self.mark_failure(proxy)
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...

        return None

# Usage example (sample_proxies is the list defined in the previous example)
proxy_manager = SmartProxyManager(sample_proxies)
result = proxy_manager.crawl_with_retry('https://httpbin.org/ip')
print(result)
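
The retry loop above waits longer after each failure (exponential backoff). As a minimal sketch of that idea in isolation, the helper below computes the wait time with an upper cap and optional random jitter. The function name backoff_delay and its parameters base, cap, and jitter are illustrative assumptions, not part of the original class.

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.0):
    """Return the number of seconds to wait before retry number `attempt` (0-based).

    The delay doubles with every attempt (base * 2**attempt) and never
    exceeds `cap` before jitter is applied. `jitter` is a fraction of the
    delay added at random, which helps avoid many workers retrying at the
    same instant.
    """
    delay = min(cap, base * (2 ** attempt))
    if jitter:
        delay += random.uniform(0, jitter * delay)
    return delay

# Wait times for the first four retries without jitter
schedule = [backoff_delay(a) for a in range(4)]
print(schedule)
```

Capping the delay keeps a long run of failures from stalling the crawler for minutes at a time; whether 30 seconds is a sensible cap depends on the target site.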

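Key Points for Choosing a Quality Proxy IP Service

Not every proxy IP is suitable for crawling. When choosing a provider, pay attention to these key points: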

Feature	Importance	Description
IP purity	High	Ensures the IP has not been overused by other users
Stability	High	Connection success rate and response speed
Number of IPs	Medium to high	An IP pool large enough to avoid frequent reuse
Geographic location	Medium	Choose IPs in a region that suits the target site
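
The stability row above names two measurable quantities: connection success rate and response speed. A minimal sketch of how they might be measured for one proxy follows. The names ProxyHealth and measure_proxy are hypothetical, and probe stands for any function you supply that performs one request through the proxy and raises an exception on failure.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ProxyHealth:
    proxy: str
    success_rate: float            # fraction of probes that succeeded, 0.0 to 1.0
    avg_latency: Optional[float]   # mean seconds per successful probe, None if all failed

def measure_proxy(proxy: str, probe: Callable[[str], object], attempts: int = 5) -> ProxyHealth:
    """Call probe(proxy) several times and summarise the outcomes."""
    latencies = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            probe(proxy)
        except Exception:
            continue  # a failed probe counts against the success rate
        latencies.append(time.perf_counter() - start)
    avg = sum(latencies) / len(latencies) if latencies else None
    return ProxyHealth(proxy, len(latencies) / attempts, avg)

# A probe built on requests could look like this (not executed here):
# def probe(proxy):
#     p = {'http': f'http://{proxy}', 'https': f'http://{proxy}'}
#     requests.get('https://httpbin.org/ip', proxies=p, timeout=10).raise_for_status()
```

Running this against a trial batch of proxies before committing to a provider gives a rough, comparable picture of stability; a handful of probes is only an indication, not a benchmark.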

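Recommended: ipipgo Proxy IP Service

Among the many proxy providers, ipipgo is one worth recommending. Its dynamic residential proxy pool totals more than 90 million IPs covering 220+ countries and regions; all IPs come from real home networks and offer a high degree of anonymity.

For crawler projects, several of ipipgo's advantages are especially practical:

Traffic-based billing – pay for actual usage, keeping costs under control
Rotating and sticky sessions – adapt flexibly to different collection needs
Country/city targeting – pinpoint the target market
Full protocol support – both HTTP(S) and SOCKS5 are available

For projects that need higher stability, ipipgo also offers static residential proxies, with 500,000+ clean IPs to keep long-running workloads stable.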
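
Since both HTTP(S) and SOCKS5 proxies are mentioned above, here is a small sketch of how the proxies dictionary for requests differs between them. The helper name build_proxies is an assumption for illustration. Note that requests only handles socks5:// and socks5h:// URLs when the optional SOCKS extra is installed (pip install "requests[socks]").

```python
def build_proxies(address, scheme="http"):
    """Build the proxies mapping that requests expects.

    address: "host:port" of the proxy
    scheme:  "http" or "https" for an HTTP proxy, "socks5" for SOCKS5 with
             local DNS resolution, "socks5h" for SOCKS5 with DNS resolved
             on the proxy side
    """
    allowed = {"http", "https", "socks5", "socks5h"}
    if scheme not in allowed:
        raise ValueError(f"unsupported proxy scheme: {scheme}")
    proxy_url = f"{scheme}://{address}"
    # The same proxy URL is used for both plain-HTTP and HTTPS targets
    return {"http": proxy_url, "https": proxy_url}

print(build_proxies("203.0.113.10:1080", "socks5h"))
```

The resulting dictionary can be passed directly as the proxies argument of requests.get, as in the earlier examples.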

Complete Practical Example: E-commerce Price Monitoring

Below is an example of e-commerce price monitoring that uses ipipgo proxy IPs:

import requests
import json
from datetime import datetime

class PriceMonitor:
    def __init__(self, ipipgo_api_key):
        self.api_key = ipipgo_api_key
        self.base_url = "https://api.ipipgo.com/proxy"  # example API address

    def get_ipipgo_proxy(self):
        """Fetch a proxy IP from ipipgo."""
        # The endpoint, parameters, and response shape here are placeholders;
        # consult the ipipgo API documentation for the real ones
        params = {
            'key': self.api_key,
            'protocol': 'http',
            'count': 1
        }

        try:
            response = requests.get(f"{self.base_url}/get", params=params, timeout=10)
            response.raise_for_status()
            data = response.json()
            return data['proxies'][0]
        except Exception as e:
            print(f"Failed to fetch proxy IP: {e}")
            return None

    def monitor_price(self, product_url):
        """Fetch a product page through a proxy and extract its price."""
        proxy_info = self.get_ipipgo_proxy()
        if not proxy_info:
            return None

        proxies = {
            'http': f"http://{proxy_info['ip']}:{proxy_info['port']}",
            'https': f"http://{proxy_info['ip']}:{proxy_info['port']}"
        }

        try:
            # Set suitable request headers to resemble a real browser
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
            }

            response = requests.get(product_url, proxies=proxies,
                                    headers=headers, timeout=30)
            response.raise_for_status()

            # The parsing logic must be written for the specific site
            price = self.parse_price(response.text)

            return {
                'price': price,
                'timestamp': datetime.now().isoformat(),
                'proxy_used': proxy_info['ip']
            }

        except Exception as e:
            print(f"Price monitoring failed: {e}")
            return None

    def parse_price(self, html_content):
        """Parse the price (must be adapted to the target site)."""
        # Placeholder logic; in real use, write a parser for the specific
        # site, for example with BeautifulSoup
        return "99.99"  # example return value

# Usage example
monitor = PriceMonitor("your_ipipgo_api_key")
result = monitor.monitor_price("https://example.com/product/123")
print(json.dumps(result, indent=2))
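
The parse_price method above is only a placeholder. As one possible sketch of real parsing logic, the code below extracts the text of the first element whose class list contains "price". The class name "price", the comma-as-thousands-separator convention, and the names PriceExtractor and parse_price are all assumptions about a hypothetical page. It uses only the standard library's html.parser so that it runs without extra dependencies; BeautifulSoup would work equally well.

```python
import re
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collect the text inside the first element carrying class "price"."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # nesting depth inside the price element
        self._done = False   # True once the first price element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if self._depth:
            self._depth += 1  # a child tag nested inside the price element
        elif "price" in (dict(attrs).get("class") or "").split():
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self._done = True

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)

def parse_price(html_content):
    """Return the price as a string such as "1299.50", or None if not found."""
    extractor = PriceExtractor()
    extractor.feed(html_content)
    match = re.search(r"\d+(?:[.,]\d+)*", "".join(extractor.chunks))
    if not match:
        return None
    # Assumption: commas are thousands separators and the dot is the decimal point
    return match.group(0).replace(",", "")

sample = '<div><span class="price main"><b>$</b>1,299.50</span></div>'
print(parse_price(sample))
```

Real product pages vary widely, and some render prices with JavaScript, in which case the price is not present in the fetched HTML at all and this approach does not apply.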

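Frequently Asked Questions

Q: What is the difference between free and paid proxies?
A: Free proxies are usually unstable and slow, and their security cannot be guaranteed. Paid proxies such as ipipgo offer higher-quality service, with better stability, speed, and security.

Q: How can I tell whether a proxy IP is working?
A: Test it against a service such as httpbin.org/ip: the IP address returned should be the proxy server's IP, not your real IP.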
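
A minimal sketch of that check follows, using only the standard library. The function names fetch_origin_ip and proxy_masks_ip are illustrative assumptions. httpbin.org/ip returns a JSON object with an "origin" field, which can contain several comma-separated addresses when the request was forwarded.

```python
import json
import urllib.request

def fetch_origin_ip(proxy=None, url="https://httpbin.org/ip", timeout=10):
    """Return the 'origin' value reported by httpbin, optionally via a proxy.

    proxy is a "host:port" string; None means a direct connection.
    """
    if proxy:
        handler = urllib.request.ProxyHandler(
            {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        )
    else:
        handler = urllib.request.ProxyHandler({})  # ignore any system proxy settings
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["origin"]

def proxy_masks_ip(direct_ip, proxied_origin):
    """True if the real IP does not appear in what the site saw through the proxy."""
    seen = [ip.strip() for ip in proxied_origin.split(",") if ip.strip()]
    return bool(seen) and direct_ip not in seen

# Typical use (needs network access and a working proxy, so not run here):
# real_ip = fetch_origin_ip()
# via_proxy = fetch_origin_ip("203.0.113.10:8080")
# print(proxy_masks_ip(real_ip, via_proxy))
```

If the real IP still shows up in the proxied result, the proxy is forwarding it (a transparent proxy) and offers no protection against IP-based blocking.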

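Q: Will a crawler that uses proxy IPs be completely undetectable?
A: No method is 100% undetectable, but high-quality proxy IPs greatly reduce the chance of detection. Combined with a reasonable request rate and User-Agent rotation, they are generally enough for most collection needs.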
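
To make the two measures named in this answer concrete, here is a small sketch of User-Agent rotation and randomized pauses between requests. The User-Agent strings are illustrative examples only, and the names build_headers and polite_sleep, as well as the 1 to 3 second range, are assumptions to tune per target site.

```python
import random
import time

# Illustrative browser User-Agent strings; keep your own list up to date
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def build_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

def polite_sleep(min_s=1.0, max_s=3.0, sleep=time.sleep):
    """Pause for a random interval so requests are not evenly spaced.

    Returns the delay that was used. The sleep function is injectable so
    the pause can be skipped in tests.
    """
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay

# In a crawl loop:
# for url in urls:
#     requests.get(url, headers=build_headers(), proxies=proxies, timeout=15)
#     polite_sleep()
```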

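Q: Which websites are ipipgo's proxy IPs suitable for crawling?
A: ipipgo's residential proxy IPs suit most e-commerce platforms, social media sites, search engines, and similar websites. For especially strict sites, their static residential proxy service is recommended.

Q: How can I control the cost of using proxy IPs?
A: ipipgo's traffic-based billing is flexible. You can control costs by setting reasonable request intervals, using data compression, and optimizing the crawl logic.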
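
As a rough illustration of why compression matters under traffic-based billing, the sketch below measures how much smaller a repetitive HTML body becomes after gzip. The function name compression_savings is hypothetical, and whether a provider meters compressed or uncompressed bytes is an assumption to verify with that provider. requests already asks for compressed responses by default through its Accept-Encoding header, so the main thing is not to override that header with a value that disables compression.

```python
import gzip

def compression_savings(body: str) -> float:
    """Return the fraction of bytes saved by gzip-compressing `body` (0.0 to 1.0)."""
    raw = body.encode("utf-8")
    if not raw:
        return 0.0
    packed = gzip.compress(raw)
    return 1 - len(packed) / len(raw)

# Product listing pages tend to repeat the same markup many times
sample_html = "<li class='item'><span class='price'>99.99</span></li>" * 200
print(f"Saved {compression_savings(sample_html):.0%} of the bytes")
```

Actual savings depend on the page; already-compressed content such as images gains almost nothing.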
