
PHP scraper always getting blocked? Try this trick
Lately a lot of friends have asked me the same thing: they use PHP curl to collect data, the target site keeps blocking their IP, and they are tearing their hair out. I ran into this myself three years ago and later found that using a proxy IP is like giving your program a disguise. Today I'll walk you through how it works.
Understanding how proxy IPs work
A proxy IP acts as a stand-in for your network requests. It's like buying cigarettes at a shop where the owner always recognizes you: send a friend instead and the problem goes away. There are three types of proxies on the market:
Transparent proxy - your friend goes but announces who sent them (your real IP is exposed)
Anonymous proxy - your friend goes alone but wears your clothes (your IP is hidden, but the request still shows signs of a proxy)
High-anonymity proxy - your friend is fully disguised as a passer-by (recommended)
Here's the key point: when choosing a proxy provider, pick one like ipipgo that specializes in high-anonymity proxies. Their IP pool is large, every request can go out under a different IP, and the target site has a hard time spotting a pattern.
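If you want to see which of the three types a given proxy is, one rough approach is to look at the request headers the target server receives. The sketch below assumes you already have those headers as an associative array (for example from an echo service) and that you know your own public IP. The function name classifyProxy and the list of headers it checks are my own illustrative choices, not an official standard.

```php
<?php
// Sketch: classify a proxy by the request headers the target server sees.
// $headers is an associative array of header name => value (assumed input);
// $realIp is your own public IP without the proxy.
function classifyProxy(array $headers, string $realIp): string
{
    // Header names are case-insensitive, so normalize before comparing.
    $headers = array_change_key_case($headers, CASE_LOWER);

    // Headers that commonly give a proxy away (illustrative, not exhaustive).
    $proxyHeaders = ['via', 'x-forwarded-for', 'x-real-ip', 'proxy-connection'];

    $sawProxyHeader = false;
    foreach ($proxyHeaders as $name) {
        if (!isset($headers[$name])) {
            continue;
        }
        $sawProxyHeader = true;

        // These headers may carry a comma-separated list of addresses.
        $values = array_map('trim', explode(',', (string) $headers[$name]));
        if (in_array($realIp, $values, true)) {
            return 'transparent'; // the real IP leaked through
        }
    }

    // Proxy headers without the real IP: anonymous. No traces at all: high-anonymity.
    return $sawProxyHeader ? 'anonymous' : 'high-anonymity';
}
```

A proxy that comes back as 'high-anonymity' here can still be detected by other means, so treat the result as a quick sanity check rather than a guarantee.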
Hands-on: setting up a proxy with curl
Take collecting prices from an e-commerce platform as an example. Without a proxy, the code looks like this:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://目标网站.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
With an ipipgo proxy added:
// Proxy information from the ipipgo backend
$proxy = '123.123.123.123:8888';
$auth = 'username:password';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://目标网站.com");
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $auth);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // keep the timeout short
$output = curl_exec($ch);
Note that you have to replace username and password with the credentials from the ipipgo backend. Their proxy authentication is especially beginner-friendly.
Pitfall guide: 5 common mistakes beginners make
1. Reusing the same proxy IP: consecutive requests from one IP are easy to spot, so it is best to change the IP for each request.
2. Timeout set too long: keep it within 10 seconds, and switch to the next IP when a request exceeds that.
3. Forgetting error handling: after curl_exec, check whether $output is false or empty.
4. User-Agent not disguised: remember to set a common browser UA with curl_setopt (CURLOPT_USERAGENT).
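5. HTTPS certificate errors not handled: if certificate validation is what stalls your requests, this line skips it (it also turns off the check that the response really comes from the target site, so use it knowingly):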
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
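Here is how the checklist above might look when rolled into one function: a short timeout, a browser User-Agent, and a check of the result after curl_exec. This is a minimal sketch, not the only way to do it. The names fetchViaProxy and describeFailure are mine, the User-Agent string is just an example, and the 5-second connect timeout is an assumption on top of the 10-second limit mentioned above.

```php
<?php
// Sketch of a proxied GET that applies the checklist above.
// fetchViaProxy() and describeFailure() are illustrative names, not library functions.

// Turns the raw curl result into a failure reason, or null if it looks usable.
function describeFailure($body, int $status, string $curlError): ?string
{
    if ($body === false) {
        return "transport error: $curlError"; // timeout, refused connection, etc.
    }
    if ($body === '') {
        return 'empty response';
    }
    if ($status >= 400) {
        return "HTTP $status";
    }
    return null;
}

function fetchViaProxy(string $url, string $proxy, string $auth): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_PROXY          => $proxy,   // "host:port"
        CURLOPT_PROXYUSERPWD   => $auth,    // "username:password"
        CURLOPT_CONNECTTIMEOUT => 5,        // assumed value
        CURLOPT_TIMEOUT        => 10,       // total limit, as recommended above
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    ]);

    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error  = curl_error($ch);
    // The handle is released when $ch goes out of scope at the end of the function.

    $failure = describeFailure($body, $status, $error);
    if ($failure !== null) {
        throw new RuntimeException("Request via $proxy failed: $failure");
    }
    return $body;
}
```

Throwing an exception instead of returning an empty string makes it easy for the caller to catch the failure and retry with the next IP.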
Practical Q&A: you ask, I answer
Q: What can I do about slow proxy IPs?
A: Prefer ipipgo's domestic BGP lines. In my tests the latency stays within 200 ms.
Q: How do I verify that the proxy is in effect?
A: Request http://httpbin.org/ip through the proxy and check that the IP it returns is the proxy's IP rather than your own.
Q: What should I do if I get a 403 error?
A: Three steps: 1. check whether the IP is blocked 2. change the User-Agent 3. reduce the collection frequency
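The httpbin check from the second question can be scripted. The sketch below assumes the service answers with a JSON object containing an origin field, and that you already know your own direct IP. parseOrigin and proxyIsActive are illustrative helper names.

```php
<?php
// Sketch: check that a request really leaves through the proxy.
// parseOrigin() and proxyIsActive() are illustrative helper names.

// Extracts the caller IP from a response like {"origin": "123.123.123.123"}.
function parseOrigin(string $json): ?string
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['origin']) || !is_string($data['origin'])) {
        return null;
    }
    // The origin field may hold a comma-separated list; keep the first entry.
    return trim(explode(',', $data['origin'])[0]);
}

function proxyIsActive(string $proxy, string $auth, string $directIp): bool
{
    $ch = curl_init('http://httpbin.org/ip');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_PROXYUSERPWD, $auth);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);

    $origin = is_string($body) ? parseOrigin($body) : null;
    // Active means: we got an address back and it is not our own.
    return $origin !== null && $origin !== $directIp;
}
```

Run it once per proxy before a collection job starts, so dead or misconfigured proxies are dropped early.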
Advanced: automatically switching IP pools
Use the ipipgo API to fetch IPs dynamically and put together an IP pool management script:
// Get the IP pool (decode to associative arrays so the ['ip'] lookups below work)
$ip_list = json_decode(file_get_contents('https://api.ipipgo.com/getips?num=20'), true);
// Pick a random IP
$rand_key = array_rand($ip_list);
$current_ip = $ip_list[$rand_key]['ip'].':'.$ip_list[$rand_key]['port'];
I recommend changing the IP after every 5 requests. Combined with concurrent requests, this can raise collection efficiency by as much as 10 times. But pay attention to the target site's anti-scraping strategy, and don't hammer it so hard that its servers go down.
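The "change the IP every 5 requests" rule is easy to get wrong with ad hoc counters, so here is one way it could be wrapped up. This is a sketch under my own assumptions: ProxyRotator is an illustrative class, the pool entries are assumed to be "host:port" strings, and it walks the pool round-robin instead of using array_rand so that the same IP is never picked twice in a row.

```php
<?php
// Sketch: hand out a proxy from a pool and move on after a fixed number of uses.
// ProxyRotator is an illustrative class; pool entries are assumed "host:port" strings.

final class ProxyRotator
{
    private $pool;
    private $usesPerProxy;
    private $index = 0;
    private $uses = 0;

    public function __construct(array $pool, int $usesPerProxy = 5)
    {
        if ($pool === [] || $usesPerProxy < 1) {
            throw new InvalidArgumentException('Need a non-empty pool and a positive use count.');
        }
        $this->pool = array_values($pool);
        $this->usesPerProxy = $usesPerProxy;
    }

    // Returns the proxy for the next request, advancing once the quota is used up.
    public function next(): string
    {
        if ($this->uses >= $this->usesPerProxy) {
            $this->rotate();
        }
        $this->uses++;
        return $this->pool[$this->index];
    }

    // Call this early when a proxy times out or starts returning 403.
    public function rotate(): void
    {
        $this->index = ($this->index + 1) % count($this->pool);
        $this->uses = 0;
    }
}
```

In a collection loop you would call next() before each request and rotate() whenever a request fails, which covers both the every-5-requests rule and the switch-on-timeout advice.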
One last piece of nagging: don't pick a proxy service just because it's cheap. I used a free proxy once, and the data I collected came back with phishing-site ads injected into it. These days I use ipipgo's dedicated IP plan. The stability is excellent, and I can run projects with peace of mind.

