
What to do when your crawler runs into anti-crawler defenses? Try this.
Folks who do data collection have, nine times out of ten, run into 403 Forbidden, right? Websites are sharp these days and will block your IP the moment your traffic looks off. Proxy IP + custom headers is the golden pairing. For example, with ipipgo's proxy service, each request goes out in fresh "armor", which makes it much harder for the site to tell a person from a machine.
Hands-on: passing headers with curl
Let's get practical and go straight to the code:
curl -x http://user:pass@proxy.ipipgo.cn:8080 \
  -H "User-Agent: Mozilla/5.0 (Windows NT 10.0)" \
  -H "X-Requested-With: XMLHttpRequest" \
  https://target-site.com/api/data
Here, the -x parameter specifies the proxy server; fill in ipipgo's proxy address with your own account and password. If you need more than one header, just write more -H options, strung together like candied hawthorn on a stick.
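When the header list grows, typing every -H by hand gets tedious. Here is a minimal bash sketch that strings the options together for you. The function name build_curl_cmd and the CURL_CMD array are made up for illustration; the proxy, URL, and headers are the ones from the command above. It only prints the arguments and sends nothing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the CURL_CMD array with one -H per header.
# Usage: build_curl_cmd PROXY URL HEADER...
build_curl_cmd() {
  local proxy=$1 url=$2 h
  shift 2
  CURL_CMD=(curl -x "$proxy")
  for h in "$@"; do
    CURL_CMD+=(-H "$h")   # every header gets its own -H option
  done
  CURL_CMD+=("$url")
}

build_curl_cmd "http://user:pass@proxy.ipipgo.cn:8080" \
  "https://target-site.com/api/data" \
  "User-Agent: Mozilla/5.0 (Windows NT 10.0)" \
  "X-Requested-With: XMLHttpRequest"

# Dry run: print one argument per line instead of sending anything.
printf '%s\n' "${CURL_CMD[@]}"
```

To actually send the request, run "${CURL_CMD[@]}" in place of the final printf.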
The four kings of header camouflage
These headers tend to work best:
- User-Agent (device fingerprint)
- Accept-Language (language preference)
- Referer (referring page)
- Cookie (login credentials)
It's worth keeping a configuration file of common combinations, for example:
{
  "mobile": {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_7 like Mac OS X)",
    "Accept": "application/json"
  },
  "pc": {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "zh-CN,zh;q=0.9"
  }
}
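Reading JSON from a shell script usually needs an extra tool such as jq, so here is a rough bash-only mirror of the same two profiles. This is a swapped-in technique, not the JSON file itself: the function name profile_headers and the header_args array are assumptions for illustration, and the header values are copied from the configuration above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the headers of a named profile, one per line.
profile_headers() {
  case "$1" in
    mobile)
      printf '%s\n' \
        "User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 14_7 like Mac OS X)" \
        "Accept: application/json"
      ;;
    pc)
      printf '%s\n' \
        "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
        "Accept-Language: zh-CN,zh;q=0.9"
      ;;
    *)
      echo "unknown profile: $1" >&2
      return 1
      ;;
  esac
}

# Turn the chosen profile into -H arguments for curl.
header_args=()
while IFS= read -r line; do
  header_args+=(-H "$line")
done < <(profile_headers mobile)

# Dry run: show the arguments that would be passed to curl.
printf '%s\n' "${header_args[@]}"
```

The array can then be dropped into a real call, for example curl -x "$proxy" "${header_args[@]}" "$url", where proxy and url are whatever you are already using.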
Dynamic header rotation: the black tech
A fixed header will still get you caught, so it's time to bring in ipipgo's dynamic IP pool. Add a script that switches headers at random, and the effect is comparable to the Monkey King's seventy-two transformations:
# bash: an array of complete header lines, one picked at random per request
headers_list=(
  "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
  "User-Agent: Opera/9.80 (Windows NT 6.1; U; en) Presto/2.7.62"
  "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
)
proxy="http://user:pass@proxy.ipipgo.cn:3000"
curl -x "$proxy" -H "${headers_list[RANDOM % ${#headers_list[@]}]}" https://xxx.com
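If you send several requests in a row, the random pick should happen once per request rather than once per script. Here is a small bash sketch of that idea. The names pick_header and PICKED are made up for illustration, and the loop only prints what it would send.

```shell
#!/usr/bin/env bash
headers_list=(
  "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
  "User-Agent: Opera/9.80 (Windows NT 6.1; U; en) Presto/2.7.62"
  "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
)
proxy="http://user:pass@proxy.ipipgo.cn:3000"

# Hypothetical helper: store one random entry of headers_list in PICKED.
# RANDOM is a bash built-in that yields a new integer on every read.
pick_header() {
  PICKED=${headers_list[RANDOM % ${#headers_list[@]}]}
}

# Dry run: a fresh header is chosen for each of three requests.
for i in 1 2 3; do
  pick_header
  echo "request $i: curl -x $proxy -H \"$PICKED\" https://xxx.com"
done
```

Using the array length instead of a hard-coded 3 means the list can grow without touching the pick logic.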
A practical guide to avoiding pitfalls
Some sites check the order of the headers, so don't assume you can write them in any order. It's best to visit the site normally in a browser, capture the traffic to see the header order of the original request, and copy it faithfully; following the real thing is the safest bet.
| Wrong approach | Correct approach |
|---|---|
| Missing Content-Type | Set it according to the interface type |
| Non-standard character set | Use UTF-8 throughout |
| Unusual timestamp | Keep the time zone consistent |
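The three rows above could look roughly like this for a JSON interface. This is only a sketch under assumptions: the /api path, the ts payload field, and the choice of UTC as the consistent time zone are all made up for illustration, and the command is printed rather than sent.

```shell
#!/usr/bin/env bash
# Content-Type chosen for the interface type (a JSON API is assumed here),
# with the character set stated explicitly as UTF-8.
content_type="Content-Type: application/json; charset=UTF-8"

# UTC timestamp, so the value does not depend on the machine's local time zone.
timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# Hypothetical payload carrying the timestamp.
body=$(printf '{"ts":"%s"}' "$timestamp")

# Dry run: print the request instead of sending it.
echo "curl -X POST -H '$content_type' -d '$body' https://xxx.com/api"
```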
Q&A
Q: What should I do if I still get banned after adding headers?
A: Try ipipgo's high-anonymity proxies to hide your original IP completely, then check whether your cookies have expired or your requests are too frequent.
Q: How do I handle sites that need cookies?
A: Use curl -c to save the cookie file first, then pass the -b parameter on subsequent requests:
curl -x http://proxy.ipipgo.cn -c cookies.txt -b cookies.txt https://xxx.com/login
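Split into two steps, the flow might look like the sketch below: log in once and save the cookies, then reuse the jar on later requests. The /data path and the variable names are assumptions for illustration, and both commands are printed rather than executed.

```shell
#!/usr/bin/env bash
proxy="http://proxy.ipipgo.cn"
jar="cookies.txt"

# Step 1: log in. -c writes the cookies set by the server into the jar file.
login_cmd=(curl -x "$proxy" -c "$jar" https://xxx.com/login)

# Step 2: a later request (hypothetical /data path). -b sends the saved
# cookies, and -c keeps the jar up to date if the server sets new ones.
data_cmd=(curl -x "$proxy" -b "$jar" -c "$jar" https://xxx.com/data)

# Dry run: print both commands instead of sending them.
echo "${login_cmd[*]}"
echo "${data_cmd[*]}"
```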
Q: Why is the response slower after using a proxy?
A: It may be a problem with the node's line. In the ipipgo dashboard, switch to a BGP hybrid line and choose a data center that is physically close to you.
The Ultimate Solution
At the end of the day, if you want stable data collection, ipipgo's commercial proxy packages are the way to go. Exclusive IP pool + intelligent route switching + automatic header camouflage: a three-in-one solution. New users get a 200M traffic trial, and if it's no good, feel free to say so.
One last nag: header camouflage is not a panacea; pair it with a reasonable request interval. It's like barbecue and beer: beer on its own just doesn't taste the same, does it?
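For the request interval, a little random jitter is one simple option. A minimal bash sketch, assuming a pause of 2 to 5 seconds suits the target site; the function name random_delay and the page URLs are made up for illustration, and the real curl and sleep calls are left commented out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a whole number of seconds between MIN and MAX,
# both inclusive.
random_delay() {
  local min=$1 max=$2
  echo $(( min + RANDOM % (max - min + 1) ))
}

urls=(https://xxx.com/page/1 https://xxx.com/page/2)   # hypothetical URLs

for url in "${urls[@]}"; do
  delay=$(random_delay 2 5)
  echo "would fetch $url, then wait ${delay}s"
  # curl -x "$proxy" -H "$header" "$url"
  # sleep "$delay"
done
```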

