
I. Why do your requests keep getting blocked? You may be missing this layer of "protective shell"
Anyone who does data collection has run into this: the code is clearly correct, but the target site just won't hand over the data. That's when you have to ask yourself - is your request a little too "naked"? It's like walking into a high-end restaurant in beach shorts; if the doorman doesn't stop you, who would he stop?
This is where the life-saving skill of request header masquerading comes in. Many sites inspect request header parameters such as User-Agent and Accept-Language to decide whether you are a bot. A proxy IP acts like a cloak for the request; combine it with a well-crafted set of headers and your success rate can double.
Example of basic camouflage (too easy to spot):
curl -H "User-Agent: Mozilla/5.0" http://example.com
Advanced masquerading looks like this (with a proxy IP):
curl -x http://user:pass@gateway.ipipgo.com:9020 \
  -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" \
  -H "Accept-Language: zh-CN,zh;q=0.9,en;q=0.8" \
  -H "Sec-Fetch-Site: same-site" \
  http://target-site.com
II. The "four heavenly kings" of request header configuration (with working code)
Don't assume that tossing in a couple of parameters will fool anyone - there's more to it than that. Remember these four mandatory parameters and your requests will be as steady as an old dog (a quick sketch putting all four together follows the list):
1. User-Agent: the browser's identity (use a recent version)
2. Accept-Encoding: supported compression formats (don't claim one the site doesn't support)
3. Referer: the address of the previous page (forges a plausible access path)
4. Cookie: login credentials (keeping them dynamically updated helps)
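Here is a minimal sketch that simply puts all four on one request. The URL, Referer value, and cookie contents are illustrative placeholders of my own, not anything specific to ipipgo:
# all four "heavenly kings" on a single request; values are illustrative
# --compressed makes curl decode the gzip/deflate body it asked for
curl -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" \
  -H "Accept-Encoding: gzip, deflate" \
  -H "Referer: http://target.com/list" \
  -H "Cookie: session_id=placeholder_value" \
  --compressed \
  http://target.com/detail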
In practice I recommend ipipgo's dynamic residential proxies - their IP pool is refreshed with 5 million+ IPs every day. Pair them with this script and every request looks like it came from a real person:
PROXY="http://user:pass@rotating.ipipgo.com:9021"
UA=$(shuf -n 1 user-agents.txt)  # UA library prepared in advance
curl -x "$PROXY" \
  -H "User-Agent: $UA" \
  -H "Accept: text/html,application/xhtml+xml" \
  -H "Connection: keep-alive" \
  -H "Upgrade-Insecure-Requests: 1" \
  http://target.com
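That script assumes a user-agents.txt file with one UA string per line. A quick way to seed it - these two strings are just the ones used elsewhere in this post, so keep the file refreshed with current browser versions:
# one User-Agent string per line; shuf -n 1 above picks one at random
cat > user-agents.txt <<'EOF'
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15
EOF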
III. Special anti-detection techniques (90% of people don't know these)
Some sites go as far as checking request header order. For that kind of dirty trick, it's time to bring out curl's --proxy-header parameter:
curl -x http://user:pass@gateway.ipipgo.com:9020 \
  --proxy-header "Proxy-Authorization: Basic <base64-credentials>" \
  -H "Accept-Language: zh-CN" \
  -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15" \
  -H "X-Requested-With: XMLHttpRequest" \
  http://api.target.com/data
Here's the kicker: ipipgo's proxies support dynamic certificate validation, so once the proxy is configured, TLS fingerprinting can be bypassed automatically. Most free proxies simply can't do this - for the exact configuration, just ask their support team for a key.
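On the curl side you can at least nudge your own TLS fingerprint: the cipher list is part of what fingerprinting systems hash, and stock curl can change it with --ciphers. This is only a sketch of the idea, not ipipgo's mechanism, and a full browser-grade fingerprint (extension order and so on) is beyond plain curl:
# changing the offered cipher list changes the TLS ClientHello,
# which changes the fingerprint the site computes from it
curl --ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256 \
  -x http://user:pass@gateway.ipipgo.com:9020 \
  https://target-site.com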
IV. A pitfall-dodging guide to common problems
Q: I added the request headers and still got banned?
A: 80% of the time it's poor proxy IP quality. Use ipipgo's dedicated proxy package - each IP comes with real-browser environment simulation.
Q: What if I need to handle a CAPTCHA?
A: Add "X-Captcha-Key: ipipgo_auto" to the request headers (this is their built-in auto-solving feature).
Q: How do I keep a session coherent across requests?
A: Use ipipgo's long-lived (sticky) proxies plus persistent cookie storage - the same IP can hold a session for 30+ minutes (see the sketch below).
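For the cookie half of that answer, curl's standard cookie jar does the job. A minimal sketch, assuming a sticky-session gateway address and a hypothetical login flow - only the -c/-b flags themselves are standard curl:
STICKY="http://user:pass@gateway.ipipgo.com:9020"  # assumed sticky (long-lived) gateway
# -c saves cookies the server sets; -b sends them back on follow-up requests
curl -x "$STICKY" -c cookies.txt http://target.com/login
curl -x "$STICKY" -b cookies.txt -c cookies.txt http://target.com/account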
V. The ultimate configuration (bookmark this)
This configuration template hasn't failed me once in three years. Paired with ipipgo's enterprise-grade proxies, it stably pulls millions of records a day:
#!/bin/bash
IPPOOL=("gateway.ipipgo.com:9020" "gateway.ipipgo.com:9021" "gateway.ipipgo.com:9022")
mapfile -t UA_ARRAY < <(curl -s https://cdn.ipipgo.com/ua_pool)  # one UA per line
for i in {1..1000}; do
  RANDOM_IP=${IPPOOL[$RANDOM % ${#IPPOOL[@]}]}
  RANDOM_UA=${UA_ARRAY[$RANDOM % ${#UA_ARRAY[@]}]}
  curl -x "http://user:pass@${RANDOM_IP}" \
    -H "User-Agent: ${RANDOM_UA}" \
    -H "Accept-Encoding: gzip, deflate, br" \
    -H "Sec-Fetch-Dest: document" \
    -H "Pragma: no-cache" \
    -H "Cache-Control: no-cache" \
    --compressed \
    "http://target.com/page=$i" -o "data_$i.html"
  sleep $((RANDOM % 5 + 2))  # random delays are important!
done
One last thing: free proxies are all traps - for serious work, stick with an established provider like ipipgo. They're running a promotion at the moment where new users get 10G of traffic, which is plenty for testing. With the code in hand, the world is yours. Let's get to it, brothers!

