
Hands-on: Configuring Proxies in pyspider
Anyone who has done serious crawling knows that running without proxy IPs is like running naked on the internet: the target site will blacklist you within minutes. No fluff today, just practical stuff. Let's walk through how to configure proxies in pyspider, focusing on how to use ipipgo's proxy service to stay out of trouble.
Why Does a Crawler Need a Disguise?
Here's an analogy: if you buy cigarettes at the same kiosk every day, the owner starts recognizing your face and suspects you're a reseller. A proxy IP gives your crawler a change of clothes, so the website believes each visit comes from a different person. This matters most for large-scale data collection: without proxies, your IP gets banned and the whole project grinds to a halt.
Three steps to pyspider proxy configuration
Adding a proxy to a pyspider crawler script is actually quite simple; the key is putting it in the right place. Remember the prime location: the proxy parameter of the self.crawl() method.
```python
from pyspider.libs.base_handler import *

class MySpider(BaseHandler):
    def on_start(self):
        # pyspider takes the proxy as a single string, in
        # username:password@host:port format (http proxies)
        self.crawl('http://example.com/',
                   callback=self.index_page,
                   fetch_type='js',
                   proxy='username:password@proxy-ip:port')
```
There are two pitfalls to watch out for here:
- If you use the SOCKS5 protocol, you need to install the requests[socks] package first.
- If the password contains special characters, remember to percent-encode them with urllib.parse.
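The second pitfall can be handled with `urllib.parse.quote`. A minimal sketch, where the account, password, and proxy host are made-up placeholders:

```python
from urllib.parse import quote

# Hypothetical account whose password contains special characters
user = "my_account"
raw_password = "p@ss:word#2024"
# safe="" forces every reserved character to be encoded:
# '@' -> %40, ':' -> %3A, '#' -> %23
password = quote(raw_password, safe="")
# proxy.ipipgo.example:3128 is a made-up host/port for illustration
proxy_url = f"http://{user}:{password}@proxy.ipipgo.example:3128"
print(proxy_url)
```

Without the encoding, the extra `@` and `:` in the password would confuse the URL parser about where the credentials end and the host begins.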
Proxy Pool Tips
A single proxy is easy to detect, so it's better to rotate through a proxy pool. With ipipgo's API extraction endpoint, you can automatically pull a fresh batch of IPs every hour:
```python
import requests

def get_proxies():
    api_url = "https://ipipgo.com/api/get_proxy?type=动态住宅&count=50"
    resp = requests.get(api_url).json()
    return [f"http://{item['ip']}:{item['port']}" for item in resp['data']]
```
Load the proxy pool when the crawler is initialized:
```python
class MySpider(BaseHandler):
    def __init__(self):
        self.proxy_pool = get_proxies()
        self.current_proxy = 0

    def get_proxy(self):
        # rotate through the pool round-robin
        proxy = self.proxy_pool[self.current_proxy % len(self.proxy_pool)]
        self.current_proxy += 1
        # the proxy URL keeps its http:// scheme even for https traffic
        return {"http": proxy, "https": proxy}
```
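To actually refresh the batch every hour, as suggested above, you need to track when the batch was fetched. Here's a minimal sketch; ProxyPool and the fetch_fn callback are invented names for illustration, not part of pyspider or ipipgo:

```python
import time

REFRESH_INTERVAL = 3600  # seconds; re-pull the batch every hour

class ProxyPool:
    """Round-robin pool that refreshes itself once the batch is stale.
    fetch_fn is any callable returning a fresh list of proxy URLs
    (e.g. a function like get_proxies() above)."""
    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn
        self.proxies = fetch_fn()
        self.fetched_at = time.time()
        self.index = 0

    def get(self):
        if time.time() - self.fetched_at > REFRESH_INTERVAL:
            # batch is over an hour old: pull a new one, restart rotation
            self.proxies = self.fetch_fn()
            self.fetched_at = time.time()
            self.index = 0
        proxy = self.proxies[self.index % len(self.proxies)]
        self.index += 1
        return proxy
```

Checking the timestamp lazily inside get() avoids needing a background timer thread, which keeps the sketch compatible with pyspider's own scheduling.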
A Guide to Avoiding Pitfalls (Common Q&A)
| Symptom | Solution |
|---|---|
| Proxy suddenly stops working | Set up a 3-retry mechanism that automatically switches to the next IP |
| Page loading slows down | Prefer static residential IPs; latency can drop by up to 60% |
| 407 authentication errors | Check the username:password format; API whitelist authentication is recommended |
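The retry-and-switch advice in the first row can be sketched as follows; fetch_with_retry and the fetch_fn callable are hypothetical helper names, not pyspider APIs:

```python
def fetch_with_retry(fetch_fn, proxies, max_retries=3):
    """Try up to max_retries proxies, switching to the next IP after
    each failure. fetch_fn(proxy_url) is any callable that returns
    the page content or raises when the proxy is dead."""
    last_error = None
    for attempt in range(max_retries):
        proxy = proxies[attempt % len(proxies)]
        try:
            return fetch_fn(proxy)
        except Exception as exc:
            last_error = exc  # this proxy failed; rotate to the next one
    raise RuntimeError(f"all {max_retries} attempts failed") from last_error
```

The point is that a failed request costs you one retry, not the whole crawl: only after every attempt fails does the error propagate.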
Why Do I Recommend ipipgo?
I use their proxy service myself, so here are a few real advantages:
- Dynamic residential IPs cost $7.77 for 1 GB of traffic, less than the price of a drink.
- If you're being bombarded with CAPTCHAs, switch to their TK line and you'll see immediate results.
- Customer service responds faster than a delivery courier; a ticket I filed at 3 a.m. once got answered in seconds.
Beginners should start with the dynamic residential (standard) plan to test the waters; if your business volume is serious, go straight to the enterprise plan. Don't underestimate the $2 price difference: the enterprise plan guarantees a higher IP survival rate, so it won't let you down at the critical moment.
A Few Words from the Heart
Proxy IPs are like insurance: they feel like a waste of money until the day your real IP gets blocked, and then it's too late for regrets. I've seen too many people cheap out with free proxies, only to have their entire database polluted halfway through a collection run. Remember: a reliable proxy service is a crawler's lifeline. Skimp on anything else, but not this.

