
Hands-on: using proxy IPs with PySpider
What do crawlers fear most? Getting their IP blocked is easily in the top three! Today, let's talk about how to put a disguise on a crawler in PySpider by routing it through proxy IPs. Don't be intimidated by the complicated tutorials out there; configuring a proxy is actually simpler than cooking instant noodles.
Why do I have to use a proxy IP?
Here's an analogy: you go to the supermarket every day to grab the discounted eggs, wearing the same red jacket three days in a row, and on the fourth day the security guard stops you at the door. A proxy IP is a closet full of outfits for your crawler, so it wears something different every time it goes out. With ipipgo's proxies, it's like renting a whole clothing store, with "outfits" from 200+ countries to choose from.
Proxy Configuration in Three Steps
First, import the necessary toolkit and set the proxy in `crawl_config`:

```python
from pyspider.libs.base_handler import *


class MyCrawler(BaseHandler):
    crawl_config = {
        # Proxy address provided by ipipgo
        'proxy': 'http://username:password@proxy_ip:port',
        'headers': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
        }
    }

    @every(minutes=24 * 60)  # run once a day
    def on_start(self):
        self.crawl('http://target-site.com', callback=self.index_page)
```
Heads-up: when you grab the proxy address from the ipipgo backend, remember to select the HTTP/HTTPS protocol format. For dynamic residential IPs, the Dynamic Residential (Standard) package at $7.67/GB is especially friendly for newbies.
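One detail that trips people up with the `user:pass@host:port` format above: if the password contains characters like `@` or `:`, the proxy URL breaks unless the credentials are percent-encoded. Here's a minimal sketch (the `build_proxy_url` helper and all the values are placeholders of mine, not part of ipipgo's API):

```python
from urllib.parse import quote


def build_proxy_url(username, password, host, port, scheme="http"):
    """Assemble a proxy URL, percent-encoding credentials that may
    contain special characters such as '@' or ':'."""
    user = quote(username, safe="")
    pw = quote(password, safe="")
    return f"{scheme}://{user}:{pw}@{host}:{port}"


# Placeholder values; substitute the ones from your ipipgo dashboard.
print(build_proxy_url("user", "p@ss:word", "proxy.example.com", 8080))
# -> http://user:p%40ss%3Aword@proxy.example.com:8080
```

The returned string can be dropped straight into `crawl_config['proxy']`.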
Dynamic IP automatic switching trick
To switch to a fresh IP on every request, call ipipgo's API to fetch proxies on the fly:
```python
import random


def get_proxy():
    # Call ipipgo's API here; this hard-coded list stands in for the response
    proxy_list = ["ip1:port", "ip2:port", "ip3:port"]
    return random.choice(proxy_list)


class AutoProxyHandler(BaseHandler):
    def make_request(self, url, callback):
        # A fresh proxy is attached to every request automatically
        return self.crawl(url, callback=callback, proxy=get_proxy())
```
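In practice you don't want to hit the vendor's API on every single request. A small cache with a refresh interval keeps things fast; here's a sketch under my own assumptions (the `ProxyPool` class and the `fetcher` callable are illustrative, and the real ipipgo endpoint is whatever your dashboard shows, not something I know):

```python
import random
import time


class ProxyPool:
    """Cache a proxy list and refresh it after `ttl` seconds, so the
    vendor API isn't called on every request. `fetcher` is any callable
    returning a list of 'ip:port' strings (e.g. an HTTP call to your
    proxy provider's API)."""

    def __init__(self, fetcher, ttl=60):
        self.fetcher = fetcher
        self.ttl = ttl
        self._proxies = []
        self._stamp = 0.0

    def get(self):
        # Refresh the cached list when empty or stale
        if not self._proxies or time.time() - self._stamp > self.ttl:
            self._proxies = self.fetcher()
            self._stamp = time.time()
        return random.choice(self._proxies)


# Stand-in fetcher; replace with a real call to your provider's API.
pool = ProxyPool(lambda: ["1.2.3.4:8000", "5.6.7.8:8000"], ttl=300)
print(pool.get())
```

Swap `get_proxy()` in the handler above for `pool.get()` and each request still gets a random IP, but the list is only re-fetched every five minutes.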
Pitfall guide (Q&A)
Q: What should I do if the proxy suddenly fails?
A: The ipipgo client has built-in heartbeat detection; when it finds an IP has died, it automatically switches to a new one, the same way your phone auto-connects to another WiFi network.
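If you want a belt-and-braces fallback on your own side as well, a simple retry loop that swaps in a fresh proxy on failure does the job. This is a generic sketch, not ipipgo's heartbeat mechanism; `fetch(url, proxy)` is a stand-in for whatever HTTP call you make:

```python
def fetch_with_failover(url, get_proxy, fetch, max_tries=3):
    """Try up to max_tries proxies; if one dies mid-request,
    swap in a fresh one and retry."""
    last_err = None
    for _ in range(max_tries):
        proxy = get_proxy()
        try:
            return fetch(url, proxy)
        except Exception as err:  # real code should catch specific timeout/connection errors
            last_err = err
    raise last_err


# Demo with a flaky stand-in fetcher: the first attempt fails, the second works.
attempts = []


def fake_fetch(url, proxy):
    attempts.append(proxy)
    if len(attempts) < 2:
        raise ConnectionError("proxy down")
    return "ok"


print(fetch_with_failover("http://example.com", lambda: "1.2.3.4:8000", fake_fetch))
# -> ok
```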
Q: How do I test if the proxy is working?
A: Add a test step to the crawler:
```python
def on_start(self):
    self.crawl('http://httpbin.org/ip', callback=self.check_ip)

def check_ip(self, response):
    # The IP printed here should be the proxy's IP, not your own
    print(response.text)
```
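To sanity-check the result programmatically, compare the exit IP that httpbin reports against your real IP. The helper below is my own illustration; it only parses the JSON body that `http://httpbin.org/ip` actually returns (a `{"origin": "..."}` object):

```python
import json


def proxy_is_working(my_real_ip, httpbin_body):
    """Given your real IP and the body returned by http://httpbin.org/ip
    (fetched *through* the proxy), confirm the exit IP differs."""
    exit_ip = json.loads(httpbin_body)["origin"]
    return exit_ip != my_real_ip


# Example body, as httpbin returns it when the proxy is in effect.
body = '{"origin": "203.0.113.7"}'
print(proxy_is_working("198.51.100.1", body))
# -> True, meaning traffic really went through the proxy
```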
How to pick a package without stepping on a landmine
| Business Type | Recommended Packages | Applicable Scenarios |
|---|---|---|
| High-frequency data collection | Static Residential | $35/IP for a whole month, suited to long-term monitoring |
| Enterprise crawlers | Dynamic Residential (Business) | $9.47/GB with a VIP channel, grabs data faster! |
| Small personal projects | Dynamic Residential (Standard) | $7.67/GB bargain pricing, first choice for practice |
Lastly, don't waste your time on free proxies. I've tested them, and 8 out of 10 are dead on arrival. ipipgo's TK line measured under 200ms latency in my tests, nearly as fast as the local network. Their support team also offers customized solutions; last time a user needed to crawl Southeast Asian e-commerce data, they set him up with a dedicated cross-border line.

