
Hands-On: Fitting Scrapy with a Proxy Vest
Fellow crawler folks know the drill: scraping without a proxy is like going online naked, and the target site will block your IP within minutes. Today we crack open Scrapy and walk through how to fit it with a proper proxy vest. I'll use our own proxy service, ipipgo, as the example — personally tested, no exaggeration.
The Three Basic Moves of Scrapy Proxy Configuration
Let's start with the most straightforward method for newcomers:
Register the proxy middleware in settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 543,
}
Attach the proxy to a specific request
yield scrapy.Request(
    url,
    meta={'proxy': 'http://username:password@proxy.ipipgo.com:8000'}
)
This kind of hard-coding is fine for a quick test, but for anything long-term you need a smarter approach. In practice, I've found that baking a fixed proxy into your settings makes you an easy target for anti-bot mechanisms.
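As a small step up from hard-coding, you can read the proxy URL from the environment instead. This is a minimal sketch under my own conventions — the variable name `PROXY_URL` is an assumption, not anything Scrapy or ipipgo defines:

```python
import os

# Read the proxy URL from an environment variable instead of hard-coding it.
# PROXY_URL is an assumed name; set it in your shell or deployment config.
def get_proxy_from_env(default=None):
    return os.environ.get("PROXY_URL", default)

# Demo only: in real use the variable is set outside the process.
os.environ["PROXY_URL"] = "http://user:pass@proxy.ipipgo.com:8000"
print(get_proxy_from_env())
```

The returned value is what you'd put in `meta={'proxy': ...}`, so rotating credentials no longer means editing source code.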
Dynamic Proxy Pools Are King
Advanced players use rotating proxies. Here I recommend fetching them dynamically through ipipgo's API:
import random
from w3lib.http import basic_auth_header  # w3lib ships with Scrapy

class ProxyMiddleware:
    def process_request(self, request, spider):
        proxy_list = get_ipipgo_proxies()  # call the ipipgo API endpoint
        proxy = random.choice(proxy_list)
        request.meta['proxy'] = f"http://{proxy['ip']}:{proxy['port']}"
        request.headers['Proxy-Authorization'] = basic_auth_header(
            proxy['user'], proxy['password']
        )
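For a custom downloader middleware like this to run at all, it has to be registered in settings.py. A sketch of the registration — the module path `myproject.middlewares` is an assumption, so adjust it to your project layout:

```python
# settings.py -- enable the custom proxy middleware.
# A priority below 543 means it runs before the built-in HttpProxyMiddleware,
# so the meta['proxy'] it sets is already in place.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 350,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 543,
}
```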
Pay attention to automatic switching on proxy failure: I suggest adding a retry mechanism in your exception handling. ipipgo's API responds very quickly — pulling a fresh proxy takes only milliseconds.
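That retry-on-failure idea can be sketched as a middleware `process_exception` hook. This is an illustrative sketch, not ipipgo's official client: `get_ipipgo_proxies()` is a stand-in for the real API call, and the retry cap and example IP are my own placeholders:

```python
import random

def get_ipipgo_proxies():
    # Hypothetical response shape; replace with a real ipipgo API request.
    return [{"ip": "203.0.113.10", "port": 8000}]

class ProxyRetryMiddleware:
    """On a download error, swap in a fresh proxy and re-issue the request."""
    MAX_PROXY_RETRIES = 3

    def process_exception(self, request, exception, spider):
        retries = request.meta.get("proxy_retries", 0)
        if retries >= self.MAX_PROXY_RETRIES:
            return None  # give up; fall through to Scrapy's own error handling
        proxy = random.choice(get_ipipgo_proxies())
        request.meta["proxy"] = f"http://{proxy['ip']}:{proxy['port']}"
        request.meta["proxy_retries"] = retries + 1
        request.dont_filter = True  # allow the same URL to be scheduled again
        return request  # returning a Request tells Scrapy to retry with it
```

Registering it in DOWNLOADER_MIDDLEWARES alongside the rotation middleware gives you basic failover without touching spider code.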
The Know-How Hidden in the Configuration File
Old drivers do their real tuning in settings.py. Recommended configuration:
| Setting | Recommended value |
|---|---|
| CONCURRENT_REQUESTS | Adjust to your proxy package (30-50 for dynamic proxies) |
| DOWNLOAD_TIMEOUT | 15-30 seconds is the safe range |
| RETRY_TIMES | 3 retries to stay on the safe side |
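The table above translates into settings.py roughly like this. The exact numbers within the recommended bands are my own picks, so tune them to your proxy package:

```python
# settings.py -- tuning values from the table above
CONCURRENT_REQUESTS = 40  # 30-50 works for dynamic proxy packages
DOWNLOAD_TIMEOUT = 20     # 15-30 s keeps slow proxies from hanging the crawl
RETRY_TIMES = 3           # three retries before a request is abandoned
```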
A Record of Real-World Pitfalls
The most maddening situation I hit: the proxy clearly works, yet the crawler just can't connect. It turned out to be an SSL/protocol mismatch. Adding these parameters to the request fixed it immediately:
request.meta['download_timeout'] = 30
request.meta['proxy'] = 'https://...'  # mind the protocol type (http vs https)
request.meta['dont_redirect'] = True   # keep a redirect from dropping the proxy
FAQ First-Aid Kit
Q: What do I do when a proxy suddenly fails?
A: Add exception capture in the middleware to automatically pull fresh proxies from ipipgo. I also recommend enabling proxy health checks so dead proxies get kicked out of the pool promptly.
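A minimal sketch of that health-check idea. The probe function is injectable so the pool logic is testable; in production it would be an HTTP GET through the proxy with a short timeout (that probe target and timeout are your choice, not anything ipipgo prescribes):

```python
def filter_healthy(proxies, probe):
    """Keep only proxies for which probe(proxy) returns True.

    probe is any callable that checks one proxy, e.g. an HTTP request
    through it with a short timeout; any exception counts as dead.
    """
    healthy = []
    for proxy in proxies:
        try:
            if probe(proxy):
                healthy.append(proxy)
        except Exception:
            pass  # timeout / refused connection: kick it out of the pool
    return healthy
```

Run this on a schedule (or lazily before each batch) and hand only the surviving list to your rotation middleware.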
Q: Crawling at a turtle's pace?
A: Check your proxy package type. Dynamic Residential (Enterprise Edition) runs about 30% faster than Standard; if the budget allows, go straight to Static Residential and the speed will fly.
Q: Keep hitting CAPTCHAs?
A: Switch to ipipgo's TK dedicated-line proxies — these residential IPs are far less likely to trigger verification. In my own tests, the CAPTCHA rate dropped about 70% after switching to this line.
How to Choose an ipipgo Package
My personal package comparison:
- Small-scale crawlers: Dynamic Residential (Standard) at 7.67 yuan/GB — frugal and more than enough
- Enterprise-level projects: go straight to Static Residential at 35 yuan/IP — stable, no fuss
- Special needs: cross-border lines for geo-restricted sites — those who've used them know
One last heartfelt note: proxy configuration is never a one-and-done job; adjust it flexibly to match the target site's anti-bot strategy. Brothers using ipipgo, remember to make good use of their customized services — their technical support can help you tune the parameters, which beats fumbling around on your own by a mile.

