
Hands-On: Using Proxy IPs with PySpider
What do crawlers fear most? Getting your IP blocked is easily in the top three! Today let's talk about how to put a disguise on your crawler in PySpider: using proxy IPs to keep it safe. Don't be intimidated by the complicated tutorials out there; configuring a proxy is actually simpler than cooking instant noodles.
Why Use a Proxy IP at All?
Here's an analogy: you go to the supermarket every day to grab the discounted eggs, wearing the same red jacket three days in a row, and on day four the security guard stops you at the door. A proxy IP is a closet full of outfits for your crawler; it changes clothes every time it goes out. With ipipgo's proxies it's like renting an entire clothing store, with "outfits" from 200+ countries to choose from.
Proxy Configuration in Three Steps
```python
# First, import the required toolkit
from pyspider.libs.base_handler import *

class MyCrawler(BaseHandler):
    crawl_config = {
        # Fill in the proxy address provided by ipipgo here
        'proxy': 'http://username:password@proxy_ip:port',
        'headers': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
        }
    }

    @every(minutes=24 * 60)  # run once a day
    def on_start(self):
        self.crawl('http://目标网站.com', callback=self.index_page)  # replace with your target site
```
Key point: when grabbing the proxy address from the ipipgo dashboard, remember to select the HTTP/HTTPS protocol format. For dynamic residential IPs, the Dynamic Residential (Standard) package at $7.67/GB is especially friendly to newcomers.
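One aside on that `username:password@host:port` format: if your credentials contain characters like `@` or `:`, the URL breaks. A minimal sketch that percent-encodes the credentials before assembling the address (the `build_proxy_url` helper is my own illustration, not part of any ipipgo SDK):

```python
from urllib.parse import quote

def build_proxy_url(username, password, host, port, scheme="http"):
    """Assemble a proxy URL, percent-encoding the credentials so that
    characters like '@' or ':' in the password don't break the format."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"

# A password containing '@' and ':' gets encoded as %40 and %3A
print(build_proxy_url("alice", "p@ss:word", "203.0.113.7", 8080))
# → http://alice:p%40ss%3Aword@203.0.113.7:8080
```

The resulting string can be dropped straight into `crawl_config['proxy']`.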
The Trick to Automatic Dynamic IP Switching
If you want a fresh IP on every request, call ipipgo's API and use the addresses as you fetch them:
```python
import random

def get_proxy():
    # Call ipipgo's API here; a hard-coded list stands in as a placeholder
    proxy_list = ["ip1:port", "ip2:port", "ip3:port"]
    return random.choice(proxy_list)

class Handler(BaseHandler):
    def make_request(self, url, callback):
        # pyspider's self.crawl accepts a per-request proxy parameter
        self.crawl(url, callback=callback,
                   proxy=get_proxy())  # a new "outfit" for every request
```
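Picking blindly at random means a dead IP keeps getting re-selected. In a real project you usually want to drop failed addresses from the rotation. A minimal sketch of such a pool (the `ProxyPool` class is my own illustration; in practice the list would be refreshed from ipipgo's API rather than hard-coded):

```python
import random

class ProxyPool:
    """Rotating proxy pool that skips addresses marked as dead."""

    def __init__(self, proxies):
        self.proxies = list(proxies)
        self.bad = set()

    def get(self):
        alive = [p for p in self.proxies if p not in self.bad]
        if not alive:
            # Every proxy has failed: clear the blacklist and start over
            self.bad.clear()
            alive = list(self.proxies)
        return random.choice(alive)

    def mark_bad(self, proxy):
        self.bad.add(proxy)

pool = ProxyPool(["ip1:8080", "ip2:8080", "ip3:8080"])
pool.mark_bad("ip1:8080")
pool.mark_bad("ip3:8080")
print(pool.get())  # only "ip2:8080" is still alive
```

Call `pool.mark_bad(...)` from your error callback whenever a request through a given proxy times out.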
Pitfall Avoidance Guide (Q&A)
Q: What should I do if the proxy suddenly fails?
A: The ipipgo client has built-in heartbeat detection; when it finds that an IP has died, it automatically switches to a fresh one, just like a phone automatically reconnecting to WiFi.
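You can approximate a heartbeat check yourself: try opening a TCP connection to the proxy and mark it dead on failure. A rough sketch (`is_alive` is my own helper, not part of any ipipgo tooling; a TCP connect only proves the port answers, not that the proxy forwards traffic):

```python
import socket

def is_alive(host, port, timeout=1.0):
    """Return True if a TCP connection to the proxy succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A closed local port is reported as dead
print(is_alive("127.0.0.1", 1, timeout=0.5))
```

Run this periodically over your proxy list and feed the failures into whatever rotation logic you use.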
Q: How do I test if the proxy is working?
A: Add a test step to the crawler:
```python
    def on_start(self):
        self.crawl('http://httpbin.org/ip', callback=self.check_ip)

    def check_ip(self, response):
        print(response.text)  # the IP shown here should be the proxy's IP
```
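The eyeball check can be made explicit: httpbin.org/ip returns JSON like `{"origin": "1.2.3.4"}`, so compare the reported address against your real one. A small sketch (both function names are my own):

```python
import json

def extract_origin(body):
    """Pull the caller's IP out of an httpbin.org/ip response body."""
    return json.loads(body)["origin"]

def proxy_is_working(real_ip, body):
    """The proxy works if httpbin sees an address other than your own."""
    return extract_origin(body) != real_ip

# Simulated response body; in check_ip you would pass response.text
body = '{"origin": "198.51.100.23"}'
print(proxy_is_working("203.0.113.7", body))  # → True: the visible IP changed
```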
How to Choose a Package Without Stepping on a Mine
| Business Type | Recommended Package | Use Case |
|---|---|---|
| High-frequency data collection | Static residential | $35/IP for a whole month, suited to long-term monitoring |
| Enterprise crawlers | Dynamic Residential (Business) | $9.47/GB with a VIP channel; grab data faster! |
| Small personal projects | Dynamic Residential (Standard) | $7.67/GB at bargain prices, first choice for practice |
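To choose between the per-GB packages, it helps to estimate your traffic first. A back-of-the-envelope sketch using the $7.67/GB Standard rate from the table (the page count and average page size are made-up inputs for illustration):

```python
def estimate_cost_usd(pages, avg_page_kb, price_per_gb):
    """Rough proxy-traffic cost: pages times average size, billed per GB."""
    gb = pages * avg_page_kb / (1024 * 1024)
    return round(gb * price_per_gb, 2)

# 100,000 pages at ~50 KB each on the $7.67/GB Standard package
print(estimate_cost_usd(100_000, 50, 7.67))  # → 36.57
```

If the estimate comes out well above the monthly static-IP price, the per-IP package may be the cheaper option.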
One last bit of nagging: don't waste your time on free proxies. When I tested them, 8 out of 10 free proxies were dead. ipipgo's TK dedicated line measured under 200ms in my tests, nearly as fast as the local network. Their support team can even put together custom plans; a while back a guy who needed to crawl Southeast Asian e-commerce data got a cross-border dedicated line set up for him.

