
When a crawler hits an anti-crawler mechanism
Anyone who scrapes data has probably been through this: the crawler script runs fine today, and the next day the target site answers with 403 errors. Anti-scraping mechanisms behave like a spring: the harder you push, the harder they push back. This is where the combination of a distributed crawler and proxy IPs comes in, which is like giving the crawler a suit of armor.
Scrapy-Redis's signature trick
Traditional Scrapy works alone on a single machine and stalls as soon as it runs into anti-scraping measures. Scrapy-Redis stores the task queue in Redis so that multiple machines can work together. As an analogy, think of a hot pot restaurant kitchen: the cook who cuts vegetables, the one who assembles dishes, and the one who fries each have their own job, but they all work from the same central order board.
| Traditional Scrapy | Scrapy-Redis |
|---|---|
| Runs on a single machine | Multiple machines work together |
| In-memory request queue | Queue persisted in Redis |
| Interrupted crawls resumed manually | Interrupted crawls resume automatically |
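
To make the comparison concrete, here is a minimal sketch of the settings.py lines that usually switch a Scrapy project over to the Scrapy-Redis scheduler. The setting names come from the scrapy-redis project; the Redis URL is an assumption and should point at your own instance.

```python
# settings.py (sketch): share one request queue across machines via Redis.

# Replace Scrapy's in-memory scheduler with the Redis-backed one.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests through Redis so all workers share one "seen" set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue in Redis when a spider stops, so a crawl can resume later.
SCHEDULER_PERSIST = True

# Assumption: a Redis server reachable at this address.
REDIS_URL = "redis://localhost:6379/0"
```

Every machine that runs the spider with the same Redis address pulls requests from the same queue.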
The right way to use proxy IPs
Many newcomers treat proxy IPs as a master key, only to find that they get blocked faster than with no proxy at all. Here is a "three dos, three don'ts" rule of thumb:
Do rotate dynamic IPs, do use high-anonymity proxies, do target a specific region;
Don't use a fixed IP, don't use transparent proxies, don't jump around between regions.
This is where I have to recommend the ipipgo proxy service: its dynamic IP pool supports switching city lines on demand, and the success rate can jump from 40% to 92%. For example, when scraping a real estate website, use Chengdu IPs to access Chengdu listings and Shanghai IPs to fetch Shanghai data, so the site has a hard time telling a real visitor from a machine.
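The region-matching idea can be sketched as a small lookup that picks a proxy by the city named in the target URL. The proxy endpoints and the `CITY_PROXIES` table below are hypothetical placeholders, not ipipgo's actual addresses; real endpoints come from your provider.

```python
from urllib.parse import urlparse

# Hypothetical city -> proxy endpoint table (placeholders, not real endpoints).
CITY_PROXIES = {
    "chengdu": "http://chengdu.proxy.example:8000",
    "shanghai": "http://shanghai.proxy.example:8000",
}
# Hypothetical fallback used when no city matches.
DEFAULT_PROXY = "http://default.proxy.example:8000"


def proxy_for_url(url):
    """Return the proxy whose city name appears in the URL's host or path."""
    parsed = urlparse(url.lower())
    haystack = parsed.netloc + parsed.path
    for city, proxy in CITY_PROXIES.items():
        if city in haystack:
            return proxy
    return DEFAULT_PROXY
```

In a spider, the result would typically be assigned to `request.meta['proxy']` before the request is sent.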
Handy Configuration Tips
Add these key configuration lines to settings.py (replace your_username with the account name you registered with ipipgo):
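DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
}
# Proxy source for scrapy-proxies. Its README documents PROXY_LIST as a path to
# a local file, so the API response may need to be saved to disk first.
PROXY_LIST = 'https://api.ipipgo.com/proxy?username=your_username&format=txt'
PROXY_MODE = 0  # 0 = use a different random proxy for every request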
One pitfall to be aware of: ipipgo's API returns an instantly allocated proxy address, unlike platforms that hand out fixed IP ranges. The upside is that you do not have to maintain your own IP pool; the downside is that a proxy has to be fetched again for each request. Their interface responds quickly enough, though: measured latency stays within 200 ms.
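Fetching a fresh proxy from a text API could look roughly like the sketch below. It assumes the endpoint answers with one `host:port` entry per line, which the `format=txt` parameter suggests but the article does not confirm; `parse_proxy_lines` and `fetch_proxies` are illustrative helper names.

```python
from urllib.request import urlopen


def parse_proxy_lines(body, scheme="http"):
    """Turn a plain-text 'host:port' listing into proxy URLs, skipping blanks."""
    proxies = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        # Keep entries that already carry a scheme; otherwise prepend one.
        proxies.append(line if "://" in line else f"{scheme}://{line}")
    return proxies


def fetch_proxies(api_url, timeout=5.0):
    """Download the listing from the provider API and parse it."""
    with urlopen(api_url, timeout=timeout) as resp:
        return parse_proxy_lines(resp.read().decode("utf-8"))
```

Keeping the parsing separate from the network call makes it easy to check the parser against a saved response before pointing it at the live API.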
A practical guide to avoiding pitfalls
Recently, while helping a client scrape an e-commerce platform, I ran into a typical problem: even though proxy IPs were in use, the site still triggered a CAPTCHA. It turned out that the cookies were not switching along with the IP. The solution is to add a hook in the middleware:
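class ProxyCookieMiddleware:
    def process_request(self, request, spider):
        # get_new_proxy() and generate_fake_cookie() are your own helpers.
        request.meta['proxy'] = get_new_proxy()
        request.headers['Cookie'] = generate_fake_cookie()
        return None  # let Scrapy continue handling the request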
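As an alternative to building the Cookie header by hand, the same idea can be sketched with Scrapy's built-in `cookiejar` request meta key, which keeps a separate cookie session per distinct value. The class name and the `PROXY_POOL` setting below are hypothetical; only `meta['proxy']` and `meta['cookiejar']` are Scrapy's own keys.

```python
import itertools


class ProxyWithCookieJarMiddleware:
    """Downloader middleware sketch: rotate proxies, one cookie jar per proxy."""

    def __init__(self, proxies):
        # Round-robin over the configured proxy URLs.
        self._proxies = itertools.cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_POOL is a hypothetical setting holding a list of proxy URLs.
        return cls(crawler.settings.getlist("PROXY_POOL"))

    def process_request(self, request, spider):
        proxy = next(self._proxies)
        request.meta["proxy"] = proxy
        # Cookies received through one proxy are only sent back through it.
        request.meta["cookiejar"] = proxy
        return None  # let Scrapy continue handling the request
```

For this to work, the middleware would need a priority below 700 in `DOWNLOADER_MIDDLEWARES`, so that it runs before Scrapy's CookiesMiddleware reads the `cookiejar` key.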
I also recommend ipipgo's session-persistent proxies, which are especially suited to scenarios that require login. Their long-lived proxies can keep the same exit IP for 15 minutes, which is enough to complete a full login, browse, and order flow.
Frequently Asked Questions
Q: What should I do if my proxy IP is slow?
A: Prefer proxies in the same region as the target (for example, use Guangdong IPs for a Guangdong website); ipipgo supports city-level targeting. Also check whether automatic retry is enabled; a timeout of 8 to 10 seconds is reasonable.
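In Scrapy terms, the retry and timeout advice maps onto a few standard settings. The values below are a sketch: the timeout follows the 8 to 10 second range suggested above, and the retry count is an assumption to tune for your own crawl.

```python
# settings.py (sketch): retry failed requests and cap how long each may take.

RETRY_ENABLED = True    # retry failed requests automatically
RETRY_TIMES = 3         # assumption: up to 3 retries per request
DOWNLOAD_TIMEOUT = 10   # seconds; upper end of the suggested 8-10 s range
```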
Q: How can I tell if a proxy is in effect?
A: Test it in scrapy shell:
fetch('http://httpbin.org/ip', meta={'proxy': '<your ipipgo proxy address>'})
Then check whether the IP in the response differs from your own.
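The comparison can also be scripted. The sketch below assumes the usual httpbin.org/ip response shape, a JSON object with an `origin` field, and compares a response fetched directly with one fetched through the proxy; the helper names are illustrative.

```python
import json


def origin_ip(body):
    """Extract the caller IP from an httpbin.org/ip JSON body."""
    return json.loads(body)["origin"]


def proxy_in_effect(direct_body, proxied_body):
    """True when the proxied request was seen from a different IP."""
    return origin_ip(direct_body) != origin_ip(proxied_body)
```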
Q: What should I do if I encounter a website that blocks an entire IP segment?
A: That is why we recommend ipipgo: its IP resources cover the three major carriers and more than 200 cities nationwide. When a block hits, you can switch city lines right away, which is more flexible than changing IP ranges.
One last reminder: crawling has its own code of conduct. Set reasonable request intervals and pair them with a reliable proxy service such as ipipgo, and you will go further on the road of data collection. Don't wait until your account is banned and your IPs are blacklisted before you start thinking about risk control.

