
When Your Crawler Hits Anti-Scraping: Proxy IPs to the Rescue
Anyone doing data scraping in Python has surely run into Requests and Scrapy, the two old workhorses. Both look like crawler tools, but in practice they differ a lot. Today let's talk about how they pair with proxy ips, and in particular how our ipipgo proxy service plays with each of these two libraries.
Solo Soldier vs. Full Army
Requests is like a Swiss Army knife: grabbing a page on the fly takes three lines of code. But in scenarios where you need to rotate through a lot of ips, you have to write the rotation logic yourself:
```python
import requests
from ipipgo import get_proxy  # our own proxy interface

def grab_data(url):
    proxy = get_proxy()  # randomly pick a high-quality proxy
    try:
        resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        return resp.text
    except requests.RequestException:
        print("This ip may be banned; switching to the next one automatically.")
        return grab_data(url)  # recursive retry
```
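One caveat about the recursive retry above: if every proxy in the pool is dead, it will eventually blow Python's recursion limit. A plain loop with a retry cap is safer. Here's a minimal sketch; the `max_retries` cap and the injectable `get_proxy` callable are my additions, not part of any ipipgo API:

```python
import requests

def grab_data(url, max_retries=5, get_proxy=None):
    """Fetch url, rotating proxies on failure; a capped loop instead of recursion."""
    for attempt in range(max_retries):
        proxy = get_proxy() if get_proxy else None
        proxies = {"http": proxy, "https": proxy} if proxy else None
        try:
            resp = requests.get(url, proxies=proxies, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            continue  # banned or timed out: the loop picks a fresh proxy
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```

The loop bails out with an error after `max_retries` failures instead of hammering the target forever.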
Scrapy, on the other hand, is an automation factory: its built-in middleware mechanism makes proxy rotation painless. Configure the ipipgo API in settings.py and the entire crawler fleet gets proxied automatically:
```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}
IPIPGO_API = "https://api.ipipgo.com/rotate"  # dynamic ip pool endpoint

# in your downloader middleware
def process_request(self, request, spider):
    request.meta['proxy'] = self.get_proxy()  # attach a proxy to every request
```
Proxy Consumption Comparison Fact Sheet
| Scenario | Requests consumption | Scrapy consumption |
|---|---|---|
| Crawling 1,000 pages | roughly 30-50 ips | under 10, easily controlled |
| Hitting a CAPTCHA | manual ip swap | automatic circuit-break and switch |
| Distributed crawling | state is hard to sync | clusters supported out of the box |
Practical Selection Guide
If you're just starting out, go with Requests plus an ipipgo static proxy package, pinning an ip from a fixed region like this:
```python
proxies = {
    "http": "121.36.84.149:8008",   # dedicated channel copied from the ipipgo dashboard
    "https": "121.36.84.149:8008"
}
```
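To show how that fixed dict gets used in practice, here's a minimal sketch; the `fetch` helper is my own illustration, not part of ipipgo:

```python
import requests

proxies = {
    "http": "121.36.84.149:8008",
    "https": "121.36.84.149:8008",
}

def fetch(url):
    # every request exits through the same fixed-region ip
    return requests.get(url, proxies=proxies, timeout=10)
```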
When a big project comes along, remember to switch to Scrapy plus a dynamic proxy pool. ipipgo's intelligent scheduling interface can automatically match residential or datacenter ips to the anti-scraping strength of the target site, which is far more reliable than clinging to a single ip type.
Old Hand Q&A Time
Q: What should I do if my ip keeps getting blocked?
A: Check three things: 1. is the proxy's anonymity level high enough (ipipgo's Extreme Stash package helps); 2. do your request headers carry a believable browser fingerprint; 3. does your visit frequency look like a real person's.
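Point 2 above is easy to get wrong: the default `python-requests` User-Agent is a dead giveaway. A hedged sketch of a browser-like header set (the exact values are illustrative, not from any ipipgo documentation):

```python
import requests

# Headers mimicking a desktop Chrome browser; tweak to match your target.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}

session = requests.Session()
session.headers.update(BROWSER_HEADERS)
# session.get(url, proxies=...) now carries the fingerprint on every call
```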
Q: How do I control the ip rotation frequency in Scrapy?
A: Add a counter to the downloader middleware, for example rotating the ip every 5 requests. When using ipipgo's concurrency package, it is recommended to keep rotation at 200 switches per minute or fewer.
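The counter idea above can be sketched like this; `get_proxy` is an injectable stand-in for a call to the ipipgo rotate API, and `rotate_every` is the knob the answer describes:

```python
class CountingProxyMiddleware:
    """Sketch: rotate the ip every N requests via a simple counter."""

    def __init__(self, rotate_every=5, get_proxy=None):
        self.rotate_every = rotate_every
        # stand-in default; a real setup would call the ipipgo rotate API
        self.get_proxy = get_proxy or (lambda: "http://127.0.0.1:8000")
        self.count = 0
        self.current = self.get_proxy()

    def process_request(self, request, spider):
        self.count += 1
        if self.count % self.rotate_every == 0:
            self.current = self.get_proxy()  # time to switch ip
        request.meta['proxy'] = self.current
```

With `rotate_every=5`, ten requests trigger two rotations (at requests 5 and 10) on top of the initial fetch, well within a 200-per-minute budget at ordinary crawl rates.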
Q: Are free proxies okay to use?
A: Brother, you're digging a pit for yourself! 90% of free proxies are honeypots: at best you lose data, at worst you get flagged by the anti-scraping system. When ipipgo has a $5 trial package for new subscribers, why use an unreliable one?
Finally, a lesson paid for in tears: last year I scraped an e-commerce site with Requests, stubbornly refusing to use a proxy, and within half an hour our entire server room's exit ip was blocked. After switching to Scrapy + ipipgo dynamic residential proxies, it ran for three days and three nights without flipping over. So: pick the right tool and get your proxies in place — that's the royal road to a crawler that never crashes!

