I. Why bother with distributed crawlers?
Anyone who has done serious data collection knows the pain: a single-machine crawler is like sipping bubble tea through a thin straw, and once the data volume gets big enough you start cramping up. A plain Scrapy project can handle a few million records, but hit a site with ruthless anti-scraping and your IP gets thrown into the little black room within minutes. This is where the Scrapy-Redis + proxy IP combo turns on cheat mode: you can crawl in a distributed fashion and swap identities whenever you like.
II. Cluster Deployment Hardcore Operations Manual
First, line up three servers (virtual machines are fine if you're on a budget) and install Redis. Here's the key part: the Scrapy project on every machine needs these lines in its settings.py:
```python
REDIS_URL = 'redis://your-server-ip:6379'
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
```
In the spider file, remember to replace start_urls with reads from Redis:
```python
def start_requests(self):
    # pull seed URLs from the Redis task queue
    for url in self.server.lrange('crawler_task_queue', 0, -1):
        yield scrapy.Request(url.decode("utf-8"))
```
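The self.server handle above assumes the spider mixes in scrapy_redis's RedisSpider. If you go that route, the more idiomatic option is to skip start_requests entirely and let redis_key do the work; a minimal sketch with placeholder names:

```python
import scrapy
from scrapy_redis.spiders import RedisSpider

class DemoSpider(RedisSpider):            # hypothetical spider name
    name = 'demo'
    redis_key = 'crawler_task_queue'      # the Redis list the cluster pulls URLs from

    def parse(self, response):
        # parsing logic goes here; yield items or follow-up requests
        yield {'url': response.url, 'title': response.css('title::text').get()}
```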
III. The right way to use proxy IPs
Time to bring out the ipipgo proxy service; its API is designed to be refreshingly hassle-free. Add a middleware to middlewares.py:
```python
import random

class ProxyMiddleware:
    # better to build this list dynamically from the proxy API
    proxy_list = [
        'http://username:password@proxy.ipipgo.com:port',
    ]

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.proxy_list)
```
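Don't forget to register the middleware in settings.py, otherwise it never runs. A minimal sketch, assuming the project is called myproject (the priority value 543 is just a common default):

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 543,
}
```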
Key reminder: keep the concurrency low so you don't drain the proxy IP pool. 20-30 concurrent requests per node is a reasonable range, depending on how much traffic your package includes.
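For reference, a sketch of the throttling settings that advice maps to; the exact numbers are assumptions you should tune to your package:

```python
# settings.py -- throttle each node so the proxy pool isn't drained
CONCURRENT_REQUESTS = 20      # 20-30 per node, per the advice above
DOWNLOAD_DELAY = 0.5          # small delay between requests (assumed value)
AUTOTHROTTLE_ENABLED = True   # let Scrapy back off automatically under load
```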
| Error scenario | First-aid measure |
|---|---|
| 429 status code | Switch the proxy IP immediately + lower the crawl frequency |
| Redis connection timeout | Check firewall settings + add a retry mechanism |
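One way to wire the 429 handling into Scrapy is via the built-in retry middleware; a minimal sketch (the retry codes and count are assumptions):

```python
# settings.py -- retry throttled or unstable responses instead of dropping them
RETRY_ENABLED = True
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
RETRY_TIMES = 3
```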
IV. Practical guide to avoiding pitfalls
1. Never hard-code proxy IPs in the crawler script; pull them from ipipgo's dynamic API interface instead (see the sketch after this list). Their service can rotate 5,000+ IPs per minute.
2. Don't grind away at CAPTCHAs; set up an auto-retry policy and switch to a different IP package instead. ipipgo's dedicated IP pool is very handy here.
3. Remember to split your logs by level and send proxy-related errors to a separate file, which makes later tuning much easier.
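To illustrate point 1, here's a minimal sketch of pulling proxies at runtime instead of hard-coding them; the endpoint URL and response format are hypothetical, so check your ipipgo dashboard for the real API link:

```python
import requests

# Hypothetical extraction endpoint; replace with the API link from your dashboard
PROXY_API = 'https://api.ipipgo.com/get?format=text&count=10'

def fetch_proxies():
    """Fetch a fresh batch of proxies, assuming one 'ip:port' per line."""
    resp = requests.get(PROXY_API, timeout=5)
    resp.raise_for_status()
    return ['http://' + line.strip() for line in resp.text.splitlines() if line.strip()]
```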
V. Beginner Q&A
Q: Why do my crawler nodes keep fighting over the same tasks?
A: Check how you're using Redis's BRPOP command; it's best to split tasks into separate queues for priority triage.
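A minimal sketch of that triage idea: BRPOP checks the listed keys left to right, so putting the high-priority queue first gives a simple priority scheme (queue names are placeholders):

```python
import redis

r = redis.Redis(host='your-server-ip', port=6379)

# Blocks until a task appears; the high-priority queue is always checked first
queue_name, raw_url = r.brpop(['tasks:high', 'tasks:low'])
print(queue_name, raw_url.decode('utf-8'))
```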
Q: What should I do if I'm still getting blocked even with proxy IPs?
A: 80% of the time the request headers aren't randomized. Install the fake_useragent library, then double-check your cookie handling.
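A minimal sketch of User-Agent rotation with fake_useragent, wired up as another downloader middleware (class name is a placeholder; register it in DOWNLOADER_MIDDLEWARES like the proxy middleware above):

```python
from fake_useragent import UserAgent

class RandomUserAgentMiddleware:
    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # pick a fresh browser User-Agent for every outgoing request
        request.headers['User-Agent'] = self.ua.random
```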
Q: How do I choose the right package for ipipgo?
A: Use pay-per-volume during testing, then switch to a monthly package once things run stably. If your concurrency goes above 50, go for the enterprise-grade dynamic pool, which comes with dedicated staff handling IP maintenance.
Final thoughts: a distributed crawler is no silver bullet on its own; it only really takes off when paired with ipipgo's intelligent routing proxies. Remember to update your crawler rules regularly so an anti-scraping upgrade doesn't drop you into a pit. If you hit deployment problems, just @ their technical support directly; the response speed is N times faster than any free proxy...