How to Set Up Dynamic Proxy IPs in Scrapy: A Custom Downloader Middleware in Action

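I. What Is a Scrapy Proxy Middleware Actually For?

Anyone who writes crawlers has probably run into this: the program has been running for only a few minutes when the target site blocks your IP outright. That is when dynamic proxy IPs come to the rescue. It is like having unlimited respawns in a game: when one identity gets banned, you automatically switch to a new one and keep going.

Scrapy's built-in proxy support is too basic for complex scenarios, so we need to write our own downloader middleware. Think of it as a courier company's dispatch center: it intercepts every request and quietly swaps the courier's (the request's) uniform (the IP address).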

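II. Building It Step by Step: A Dynamic Proxy Middleware

First, get a reliable proxy pool. This example uses ipipgo's dynamic residential proxies. Their API returns a response in this format: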

{
  "proxy": "123.45.67.89:8888",
  "expire_time": 1800
}
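
As a rough sketch of how this response might be consumed, the helper below turns the payload into a proxy URL plus an absolute expiry timestamp. It assumes that expire_time is a lifetime in seconds, which the sample response does not actually state, and parse_proxy_payload is a hypothetical name rather than part of any ipipgo SDK.

```python
import json
import time


def parse_proxy_payload(text, now=None):
    """Return (proxy_url, expires_at) from the JSON body shown above.

    Assumption: "expire_time" is a lifetime in seconds.
    """
    data = json.loads(text)
    now = time.time() if now is None else now
    proxy_url = f"http://{data['proxy']}"
    expires_at = now + float(data["expire_time"])
    return proxy_url, expires_at


sample = '{"proxy": "123.45.67.89:8888", "expire_time": 1800}'
proxy_url, expires_at = parse_proxy_payload(sample, now=0)
print(proxy_url, expires_at)
```

Passing now explicitly keeps the helper easy to test; in real use it would default to the current time.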

Create a new middlewares.py file. The core code is only about 20 lines:

import base64

import requests


class DynamicProxyMiddleware:
    def __init__(self, api_url):
        # Replace YOUR_API_KEY with the key from your own account
        self.api = api_url + "?apikey=YOUR_API_KEY"

    @classmethod
    def from_crawler(cls, crawler):
        # Read the API endpoint from settings.py
        return cls(
            api_url=crawler.settings.get('IPIPGO_API')
        )

    def process_request(self, request, spider):
        # Fetch a fresh IP (note: requests.get is a blocking call)
        resp = requests.get(self.api, timeout=10)
        proxy = f"http://{resp.json()['proxy']}"
        # Important: set the proxy and its authentication credentials
        request.meta['proxy'] = proxy
        request.headers['Proxy-Authorization'] = 'Basic ' + base64.b64encode(b'username:password').decode()
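
The middleware above asks the API for a new IP on every single request. One possible refinement, sketched here under assumptions, is to reuse a proxy until shortly before it expires. CachedProxyProvider, its fetch callable, and the 30-second safety margin are all hypothetical; the fake fetch and fake clock exist only so the example runs without touching the network.

```python
import time


class CachedProxyProvider:
    """Reuse one proxy until shortly before it expires, then fetch another."""

    def __init__(self, fetch, clock=time.monotonic, safety_margin=30):
        self._fetch = fetch          # callable returning (proxy_url, lifetime_seconds)
        self._clock = clock
        self._margin = safety_margin
        self._proxy = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._proxy is None or now >= self._expires_at:
            proxy, lifetime = self._fetch()
            self._proxy = proxy
            # Never let the margin produce a negative lifetime.
            self._expires_at = now + max(lifetime - self._margin, 0)
        return self._proxy

    def invalidate(self):
        """Force a refetch on the next get(), e.g. after a ban or a 407."""
        self._proxy = None


# Demo with a fake fetch function and a fake clock (no network involved).
calls = []
fake_now = [0.0]


def fake_fetch():
    calls.append(1)
    return f"http://10.0.0.{len(calls)}:8888", 1800


provider = CachedProxyProvider(fake_fetch, clock=lambda: fake_now[0])
first = provider.get()
second = provider.get()   # served from the cache
fake_now[0] = 1800.0
third = provider.get()    # lifetime elapsed, so a new proxy is fetched
print(first, second, third, len(calls))
```

Inside process_request, something like request.meta['proxy'] = provider.get() could then stand in for the per-request API call, with invalidate() called when a proxy turns out to be dead.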

III. The Secret Parameters in the Settings File

Enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'your_project.middlewares.DynamicProxyMiddleware': 543,
}
IPIPGO_API = "https://api.ipipgo.com/getProxy"  # official API endpoint

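Watch out for these two pitfalls:

1. Don't pick the priority number at random. It must be lower than the default HttpProxyMiddleware priority (750) so that our process_request runs first.
2. Remember to replace the authentication credentials with the username and password from your ipipgo dashboard.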
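
To make the first pitfall concrete: Scrapy calls each middleware's process_request in ascending priority order. The snippet below only sorts a plain dict to illustrate that ordering; it does not import Scrapy, and your_project is the same placeholder path used in the settings above.

```python
# 750 is the priority Scrapy assigns to HttpProxyMiddleware by default.
middlewares = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
    "your_project.middlewares.DynamicProxyMiddleware": 543,
}

# process_request hooks run from the lowest number to the highest.
request_order = [name for name, _ in sorted(middlewares.items(), key=lambda kv: kv[1])]
for name in request_order:
    print(name)
```

With 543, the custom middleware sets request.meta['proxy'] before the built-in proxy middleware sees the request.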

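IV. A Practical Guide to Avoiding Pitfalls

Real problems encountered recently while helping customers deploy:

Symptom                          Solution
Repeated 407 errors              Check that the Basic authentication encoding is correct
IP lifetime is too short         Add &duration=600 to the API parameters to extend the validity period
Target site detects WebDriver    Enable ipipgo's header spoofing feature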
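
The first two rows could be sketched roughly as follows. Both helper names are hypothetical. The Basic scheme itself is standard (base64 of "username:password"), while the apikey and duration parameter names are taken from this article and should be checked against the provider's own documentation.

```python
import base64
from urllib.parse import urlencode


def basic_proxy_auth(username, password):
    """Build the value of a Proxy-Authorization header (HTTP Basic scheme)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_api_url(base_url, apikey, duration=None):
    """Append apikey and, optionally, duration as query parameters."""
    params = {"apikey": apikey}
    if duration is not None:
        params["duration"] = duration
    return f"{base_url}?{urlencode(params)}"


header = basic_proxy_auth("user", "pass")
url = build_api_url("https://api.ipipgo.com/getProxy", "YOUR_API_KEY", duration=600)
print(header)
print(url)
```

When chasing a 407, decoding the token back with base64.b64decode is a quick way to confirm that the colon and both credentials survived the encoding.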

V. Quick Answers to Frequently Asked Questions

Q: What should I do if my proxy IPs fail frequently?
A: ipipgo's dynamic residential plan comes with an automatic failover mechanism. Setting the API call rate to 3-5 calls per second is recommended; their IP pool is large enough to take the load.
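
One way to hold the API call rate inside that range is a small throttle that spaces calls evenly. This is a minimal sketch: MinIntervalThrottle is a hypothetical helper, the rate of 4 per second is just a value inside the suggested range, and the fake clock and sleep are there so the demo finishes instantly. With the real time.sleep it would block the crawler while waiting.

```python
import time


class MinIntervalThrottle:
    """Allow at most `rate` calls per second by spacing calls evenly."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
            now += delay
        self._next_allowed = now + self._interval


# Demo with a fake clock and a fake sleep that just advances the clock.
slept = []
fake_now = [0.0]


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = MinIntervalThrottle(rate=4, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()  # the API call would follow here
print(slept)
```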

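Q: What if the crawler suddenly slows down?
A: Check whether the CONCURRENT_REQUESTS concurrency setting is holding you back. It also helps to use ipipgo's regional optimization feature and pick proxy nodes located where the target server is.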
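
For reference, these are the Scrapy settings that usually govern throughput. The values below are purely illustrative, not recommendations from ipipgo or from this article.

```python
# settings.py (illustrative values only)
CONCURRENT_REQUESTS = 16            # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per target domain
DOWNLOAD_DELAY = 0                  # seconds to wait between requests to the same domain
```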

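Q: What if I need to handle CAPTCHAs?
A: Enable the smart CAPTCHA bypass service in the ipipgo console. It is only available on the enterprise plan; regular users are better off lowering their request rate.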

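One last little-known tip: don't turn on COOKIES_ENABLED when using dynamic proxies, otherwise the site will notice different IPs sharing the same set of cookies and you will be exposed immediately. If your workload really needs cookies, pair them with ipipgo's session persistence feature, which they call Sticky Session; it keeps the IP unchanged for a set period of time.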
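
In settings.py that tip comes down to a single line:

```python
# settings.py
COOKIES_ENABLED = False  # turn off Scrapy's cookie middleware so rotating IPs never share cookies
```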

This article was originally published or compiled by ipipgo: https://www.ipipgo.com/en-us/ipdaili/47701.html
