
Python crawlers and data scraping: don't step in these potholes!
Recently a lot of friends doing data scraping have taken a fall: either the target site blocked their IP, or they received a lawyer's letter. One guy building an e-commerce price-comparison tool crawled on his own home broadband for three days, and the whole neighborhood's network block got banned; the neighbors came looking for him to settle the score. The lesson here is that writing code is not enough for crawling; you also have to know some "rules of the road".
Why does your crawler always get caught?
Many newbies think a random UA (user agent) is enough to slip through, but site risk control is now very fine-grained. It's like a supermarket security gate: change your vest and they still recognize you. There is a death trio here: a fixed IP, high-frequency access, and requests at perfectly regular intervals. Hit all three and the ban is a matter of minutes.
| Self-destructive behavior | Chance of a ban |
|---|---|
| Hammering away from a single IP | 99% |
| No delay between requests | 80% |
| Scraping sensitive data | Straight to a lawyer's letter |
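For the second and third members of the death trio (high frequency and a metronome-regular cadence), the cheapest fix is to slow down and break the rhythm. Here is a minimal sketch, assuming a hypothetical list of page URLs; the delay numbers are only illustrative:

```python
import random
import time

import requests

# Hypothetical target pages, just for illustration
page_urls = [f"https://example.com/list?page={i}" for i in range(1, 50)]

for url in page_urls:
    resp = requests.get(url, timeout=10)
    # ... parse resp here ...
    # A fixed time.sleep(1) still looks machine-like; jitter the pause instead
    time.sleep(random.uniform(3, 8))
```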
The right way to use proxy IPs
Here I recommend ipipgo's dynamic residential proxies. Their IP pool is particularly large, and every request automatically rotates to a new IP, like airdrop supplies in a battle-royale game: every landing is a new identity. The configuration code looks like this (remember to swap the API_KEY for your own):
```python
import requests
from itertools import cycle
import ipipgo  # ipipgo SDK as used in this snippet; set your API_KEY per their docs

proxy_pool = ipipgo.get_proxy_pool()      # fetch the latest IP pool automatically
proxy_cycler = cycle(proxy_pool)

for page in range(1, 100):
    url = f"https://example.com/items?page={page}"   # placeholder target URL
    proxy = next(proxy_cycler)            # rotate to a fresh IP for every request
    try:
        resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        # ... process the data here ...
    except requests.RequestException:
        ipipgo.report_bad_ip(proxy)       # report the dead IP so it gets rotated out
```
If you don't mind these details, even a proxy won't save you
1. Don't be a cheapskate: some friends reuse one IP over and over to save money. It's better to rotate IPs every 5-10 requests; ipipgo's pay-by-traffic billing is especially suited to this scenario.
2. Make request headers look real: don't use the requests library's default headers. Copy the full header set from a real browser, cookies and referer included.
3. Leave yourself a way out: don't touch directories explicitly disallowed in robots.txt, and set the crawl interval to at least 3 seconds (see the sketch after this list)!
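A minimal sketch of points 2 and 3 above, assuming a hypothetical target site; the header values are illustrative placeholders, not an exact browser dump:

```python
import urllib.robotparser

import requests

# Headers copied from a real browser session (values here are illustrative placeholders)
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com/",
}

# Check robots.txt before touching any path
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

def polite_get(url):
    if not robots.can_fetch(HEADERS["User-Agent"], url):
        return None                      # directory is disallowed, leave it alone
    return requests.get(url, headers=HEADERS, timeout=10)
```

Combine this with the 3-8 second jitter shown earlier and your traffic stops looking like a script stamping on the same door.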
Q&A time: questions you probably want to ask
Q: Is it absolutely safe to use a proxy IP?
A: It's like wearing gloves: it reduces the risk but is not a free pass. What really matters is how the data is used; if it involves user privacy or trade secrets, nothing will save you.
Q: What if ipipgo's IP is blocked?
A: They have a smart circuit-breaker mechanism that automatically takes failed nodes out of rotation. For high-concurrency needs, a dedicated-IP package is recommended; stability improves by more than 70%.
Q: How can I tell if a website has blocked my crawler?
A: A 403 status code, a sudden captcha challenge, or pages coming back with fake data are all danger signals. When that happens, pause immediately, check your request headers, or contact ipipgo support to switch IP segments (a rough detection sketch follows below).
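A rough sketch of that kind of health check, assuming simple keyword markers for the captcha page; real sites vary, so treat the markers and the length threshold as placeholders:

```python
import requests

CAPTCHA_MARKERS = ("captcha", "verify you are human")   # placeholder keywords, site-dependent

def looks_blocked(resp):
    """Return True if the response smells like a ban: 403, a captcha page, or a suspiciously empty body."""
    if resp.status_code == 403:
        return True
    body = resp.text.lower()
    if any(marker in body for marker in CAPTCHA_MARKERS):
        return True
    if len(body) < 500:          # far smaller than a normal page: possibly fake or empty data
        return True
    return False

resp = requests.get("https://example.com/items?page=1", timeout=10)
if looks_blocked(resp):
    # Pause the crawl, re-check headers, and rotate to a new IP segment before retrying
    ...
```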
A few words from the heart
I've seen too many programmers land in lawsuits over crawlers. In fact, most sites are not against reasonable data collection; the key is playing by the rules of the game. It's like fishing: use the right rod (proxy IPs), stay in permitted waters (public data), and catch compliant species (non-sensitive information), and the water stays calm for everyone. ipipgo recently launched a beginner-protection package with automatic compliance checks; friends who are just starting out should give it a try, it will save you at least 80% of the pitfalls.

