
I. Why do crawlers need proxy middleware?
Anyone who scrapes data for a living knows that target sites' anti-bot mechanisms are getting more and more ruthless. Just last week, a client doing e-commerce price comparison had more than 20 IPs banned in a row with a plain crawler and was at his wits' end. This is exactly where proxy middleware comes in: automatic IP switching is like giving the crawler a chameleon's skin, so the site sees a different user on every visit.
It's worth highlighting ipipgo's dynamic residential proxies here: over 90 million real home IPs covering 220+ countries. For example, say you want to scrape price data from a multinational e-commerce site; with their proxies you can automatically rotate to a different city's IP every 5 minutes, faithfully simulating the geographic distribution of real users.
II. Hands-on: integrating the ipipgo proxy
Add a new class in Scrapy's middlewares.py. The core is three things: fetching proxies, handling exceptions, and switching automatically. Fetching proxies through ipipgo's API is straightforward; just remember to configure the authentication info in settings.py:
```python
# settings.py
IPIPGO_API_KEY = 'your-own-key'
IPIPGO_ROTATE_INTERVAL = 5 * 60  # rotate every 5 minutes (in seconds)
```
The key middleware code looks like this:
```python
import json
import random

import requests
from w3lib.http import basic_auth_header


class IpProxyMiddleware:
    def __init__(self, api_url, api_key):
        self.proxy_pool = []
        # Pull the latest proxy pool from ipipgo
        response = requests.get(api_url, auth=(api_key, ''))
        self.proxy_pool = json.loads(response.text)['proxies']

    @classmethod
    def from_crawler(cls, crawler):
        # IPIPGO_API_URL is an assumed setting name, configured
        # in settings.py alongside IPIPGO_API_KEY
        return cls(
            crawler.settings.get('IPIPGO_API_URL'),
            crawler.settings.get('IPIPGO_API_KEY'),
        )

    def process_request(self, request, spider):
        current_proxy = random.choice(self.proxy_pool)
        request.meta['proxy'] = f"http://{current_proxy['ip']}:{current_proxy['port']}"
        # Automatically add authentication headers
        request.headers['Proxy-Authorization'] = basic_auth_header(
            current_proxy['username'], current_proxy['password']
        )
```
III. Clever tricks for automatic IP rotation
Being able to change IPs isn't enough; you need a strategy. An intelligent switching policy is recommended:
| Trigger | Response |
|---|---|
| 3 consecutive failed requests | Switch to a different country's node immediately |
| Response time > 5 seconds | Lower the weight of IPs from that region |
| CAPTCHA encountered | Switch browser fingerprint + change IP |
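The three rules in the table can be sketched as a small decision helper. All names and thresholds here are illustrative assumptions, not part of any ipipgo SDK:

```python
# Minimal sketch of the switching rules in the table above.
FAIL_LIMIT = 3          # 3 consecutive failures -> switch country node
SLOW_THRESHOLD = 5.0    # seconds; slower responses lower the region weight

def next_action(consecutive_failures, response_time, saw_captcha):
    """Map proxy health signals to a rotation action."""
    if saw_captcha:
        return 'switch_fingerprint_and_ip'
    if consecutive_failures >= FAIL_LIMIT:
        return 'switch_country'
    if response_time > SLOW_THRESHOLD:
        return 'lower_region_weight'
    return 'keep'
```

Your middleware can call a helper like this after every response and feed the result back into how it picks from the proxy pool.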
A shout-out here to ipipgo's Enterprise dynamic proxy for its session-hold feature. For example, if you need to stay logged in while crawling, you can pin the same IP for 30 minutes and automatically rotate to a fresh one when you're done.
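For a sense of what session hold does, here is a client-side sketch: reuse one proxy per session key for a fixed window, then rotate. The class and helper names are my own illustrations; ipipgo's enterprise plan handles this on the server side.

```python
import time

SESSION_TTL = 30 * 60  # pin the same IP for 30 minutes (seconds)

class StickyProxyPool:
    """Hand out the same proxy for a session until its window expires."""

    def __init__(self, pick_proxy):
        self.pick_proxy = pick_proxy  # callable returning a fresh proxy
        self.sessions = {}            # session_key -> (proxy, expiry time)

    def proxy_for(self, session_key, now=None):
        now = time.time() if now is None else now
        proxy, expiry = self.sessions.get(session_key, (None, 0))
        if now >= expiry:  # new session or window expired: rotate
            proxy = self.pick_proxy()
            self.sessions[session_key] = (proxy, now + SESSION_TTL)
        return proxy
```

Each logged-in spider would pass its own session key, so different accounts keep different sticky IPs.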
IV. A lifesaving guide to error handling
Proxies will inevitably fail if you use them long enough. These exceptions must be handled:
```python
def process_exception(self, request, exception, spider):
    if isinstance(exception, TimeoutError):
        self.stats.inc_value('proxy/timeout')
        return self._retry(request)
    elif isinstance(exception, ConnectionError):
        self.stats.inc_value('proxy/dead')
        return self._replace_proxy(request)
```
The key here is the playbook for a 403 ban:
- Stop using the current IP immediately
- Rotate the User-Agent and request headers
- Reduce the crawl frequency
- Switch to ipipgo's static residential IPs (their static proxies have a 99.9% availability rate)
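The first three steps of the playbook can be sketched as a small tracker. The class, stat names, and doubling back-off are illustrative assumptions, not a real ipipgo or Scrapy API:

```python
import random

class BanTracker:
    """Apply the 403 playbook: retire the IP, rotate UA, slow down."""

    def __init__(self, user_agents):
        self.user_agents = user_agents
        self.banned = set()  # IPs we will no longer hand out

    def on_response(self, status, proxy_ip, delay):
        """Return (next_delay, new_user_agent) for one response."""
        if status != 403:
            return delay, None
        self.banned.add(proxy_ip)                  # 1. retire the current IP
        new_ua = random.choice(self.user_agents)   # 2. rotate User-Agent
        # 3. back off the crawl rate; step 4 (swapping the pool to static
        # residential IPs) is handled at the pool level, not shown here
        return delay * 2, new_ua
```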
V. Performance optimization details
Used carelessly, proxies slow you down instead of speeding you up. In real-world tests, these three tips delivered roughly a 40% speedup:
- Preload the IP pool: cache 200 working proxies before the crawler starts
- Asynchronous health checks: verify proxy connectivity in a separate thread
- Geographic preference: use ipipgo's API to filter for nodes with latency under 100 ms
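The three tips above can be combined into one preload step: probe candidates in worker threads and keep only the low-latency ones. The `probe` callable here is a stand-in assumption; in practice it would make a real request through the proxy.

```python
import concurrent.futures
import time

LATENCY_LIMIT = 0.1  # 100 ms

def measure_latency(proxy, probe):
    """Time one connectivity probe; dead proxies get infinite latency."""
    start = time.perf_counter()
    ok = probe(proxy)  # e.g. a HEAD request routed through the proxy
    return proxy, (time.perf_counter() - start) if ok else float('inf')

def preload_pool(candidates, probe, limit=LATENCY_LIMIT, workers=20):
    """Probe candidates concurrently, keep nodes under the latency limit."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: measure_latency(p, probe), candidates)
        return [proxy for proxy, latency in results if latency < limit]
```

Run this once at startup to build the cached pool, then refresh it from a background thread while the spider runs.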
VI. Frequently asked questions
Q: What should I do when proxy IPs stop working after a while?
A: Enable ipipgo's auto-refresh feature; their API lets you set a threshold for automatically replacing failed IPs.
Q: What if I need IPs from several countries at the same time?
A: Add locale filtering logic to the middleware, for example:
```python
if request.meta.get('need_usa_ip'):
    proxies = [p for p in self.proxy_pool if p['country'] == 'US']
```
Q: What could cause the crawler to suddenly slow down?
A: Check proxy quality first; we recommend ipipgo's static residential proxies. If that doesn't help, tune the CONCURRENT_REQUESTS setting appropriately.
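If the proxies check out, throttling the crawl in settings.py is the usual fix. These are real Scrapy settings, but the values below are illustrative starting points, not tuned recommendations:

```python
# settings.py -- illustrative throttle values, tune for your target site
CONCURRENT_REQUESTS = 8      # down from Scrapy's default of 16
DOWNLOAD_DELAY = 0.5         # seconds of delay between requests
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay automatically
```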
VII. Pick the right plan and save real money
ipipgo's plan lineup is worth a careful look:
- Dynamic residential (Standard): ideal for a fledgling project, with painless pay-per-traffic billing
- Dynamic residential (Business): adds intelligent route optimization, a must for 10,000+ requests per day
- Static residential: first choice for long-term monitoring work; an IP stays stable for 30 days
One last reminder: when you run into CAPTCHA bombardment, don't try to brute-force through it. Check out ipipgo's TikTok solution: their intelligent route optimization can cut the CAPTCHA trigger rate by 70%, personally tested and effective.

