
Why does your PHP crawler keep getting blocked? Try this trick
Lately a lot of readers have asked why the crawlers they write in PHP keep getting their IP blocked by the target site, to the point where they want to smash the keyboard. Put bluntly, your network fingerprint is too obvious. Today's trick is to use proxy IPs as cover. It's like playing hide-and-seek while constantly changing disguises, so the site can never pin down your real identity.
Pick your tools carefully
The beginner's favorite is file_get_contents, but that's no different from running around naked:
$html = file_get_contents("http://target-site.example");
Veterans use cURL, which is like putting on body armor:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://target-site.example");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
Proxy IPs are what keep you alive
Add this line to the cURL configuration and your apparent identity changes instantly:
curl_setopt($ch, CURLOPT_PROXY, 'Proxy IP:Port');
// Or, if you use ipipgo's dynamic tunnel proxy:
curl_setopt($ch, CURLOPT_PROXY, 'http://username:password@gateway.ipipgo.com:port');
Note: change the IP on every request. ipipgo's API returns fresh IPs in real time, like this:
$ip_list = json_decode(file_get_contents('https://api.ipipgo.com/get?num=5'), true);
// Pick a random entry without assuming exactly five came back
$random_ip = $ip_list[array_rand($ip_list)];
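The per-request rotation described above can be sketched as a small helper. This is only a minimal sketch: it assumes the proxy list is already a plain array of "host:port" strings, since the real shape of the ipipgo API response isn't shown in this article. The function name pick_proxy and the sample addresses (taken from the 203.0.113.0/24 documentation range) are hypothetical.

```php
<?php
/**
 * Pick one proxy at random from a list, or null when the list is empty.
 * Hypothetical helper: assumes each entry is a "host:port" string.
 */
function pick_proxy(array $proxies): ?string
{
    if (count($proxies) === 0) {
        return null; // nothing to rotate to; the caller should fetch a new batch
    }
    return $proxies[array_rand($proxies)];
}

// Placeholder addresses, not real proxies
$pool = ['203.0.113.10:8080', '203.0.113.11:8080', '203.0.113.12:8080'];
$proxy = pick_proxy($pool);
```

Returning null for an empty list lets the caller decide whether to refetch or stop, instead of failing on a missing array index.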
Real-world case: grabbing limited-edition goods
Last year I helped a friend write a script to grab sneakers. Without a proxy, it was dead within 5 minutes. Later I switched to ipipgo's exclusive IP pool, and the secret to success is right here:
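function stealth_request($url){
    // Pass the URL in; the original never told cURL where to go
    $ch = curl_init($url);
    // Get one of the day's valid IPs from ipipgo (helper not shown in this article)
    $proxy = get_ipipgo_proxy();
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10); // keep the timeout short
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0'
    ]);
    $result = curl_exec($ch); // false on failure
    curl_close($ch);
    return $result;
}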
Pitfall guide (bookmark it for later)
| Symptom | Fix |
|---|---|
| Response suddenly comes back blank | Switch to the next ipipgo IP node immediately |
| CAPTCHA appears | Reduce the request frequency and change the User-Agent |
| Connection timeout | Check whether the proxy port was entered incorrectly |
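The table above can be turned into a small decision helper. This is a rough sketch under assumptions: the function name suggest_action and its return labels are made up for illustration, and checking the body for the word "captcha" is a crude heuristic that real sites may not match. The value 28 is cURL's error code for a timed-out operation, as returned by curl_errno().

```php
<?php
/**
 * Map a response to the fix suggested in the table.
 * $body  : response body ('' if curl_exec() returned false)
 * $errno : value from curl_errno(); 28 means the operation timed out
 */
function suggest_action(string $body, int $errno): string
{
    if ($errno === 28) {
        return 'check_proxy_port';        // connection timeout
    }
    if (trim($body) === '') {
        return 'switch_ip';               // blank response
    }
    if (stripos($body, 'captcha') !== false) {
        return 'slow_down_and_change_ua'; // CAPTCHA page (heuristic, an assumption)
    }
    return 'ok';
}
```

The timeout check comes first because a timed-out request also produces an empty body, and the two symptoms call for different fixes.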
Must-read Q&A for beginners
Q: Can't I just use free proxies?
A: Nine out of ten free proxies out there are traps: they are either slow or die quickly. ipipgo's commercial-grade proxies are maintained by a dedicated team, with a measured success rate of 98% or more.
Q: How do I know the proxy is actually working?
A: Add a check to the code:
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
if (curl_exec($ch) === false) {
    // curl_error() says why the request through this proxy failed
    echo "Proxy $proxy is down (" . curl_error($ch) . "), move on to the next one!";
}
Q: What do I do when a site has anti-crawling measures?
A: Three tricks: ① use ipipgo's residential proxies ② sleep for a random 0.5-3 seconds between requests ③ mix mobile and desktop User-Agent headers
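Tricks ② and ③ can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: the function names are hypothetical, and the two User-Agent strings are just examples (the desktop one is the string used earlier in this article; the mobile one is an illustrative iPhone Safari string).

```php
<?php
/** Random pause length in microseconds, between 0.5 and 3 seconds. */
function random_delay_us(): int
{
    return random_int(500000, 3000000);
}

/** Sleep for a random 0.5-3 seconds. */
function polite_pause(): void
{
    $us = random_delay_us();
    // usleep() may not support more than one second on some systems,
    // so split the pause into whole seconds plus the remainder
    sleep(intdiv($us, 1000000));
    usleep($us % 1000000);
}

/** Pick a User-Agent at random from a mixed desktop/mobile list. */
function random_user_agent(): string
{
    $agents = [
        'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0',
        'Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Mobile/15E148 Safari/604.1',
    ];
    return $agents[array_rand($agents)];
}
```

Before each request you would call polite_pause() and then set the header with curl_setopt($ch, CURLOPT_USERAGENT, random_user_agent()).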
Advanced play: distributed crawlers
For large projects, remember to combine multithreading with a proxy pool, configured roughly like this:
// Fetch 200 IPs from ipipgo and store them in Redis
$ip_pool = get_ipipgo_batch(200);
// Give each thread a different IP
$worker->setProxy(array_pop($ip_pool));
Note that you should monitor IP availability and automatically trigger IP replacement when it falls below 90%.
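The 90% rule can be expressed as a simple check. This is a sketch under assumptions: it supposes you already record, for each proxy, whether its most recent health check succeeded. The function names and the shape of the $results array are hypothetical.

```php
<?php
/**
 * Fraction of proxies whose last health check succeeded.
 * $results maps proxy address => bool (true = last check passed).
 */
function pool_availability(array $results): float
{
    if (count($results) === 0) {
        return 0.0; // an empty pool counts as fully unavailable
    }
    $alive = count(array_filter($results));
    return $alive / count($results);
}

/** True when availability has dropped below the threshold (90% by default). */
function needs_refresh(array $results, float $threshold = 0.9): bool
{
    return pool_availability($results) < $threshold;
}
```

When needs_refresh() returns true, the crawler would fetch a new batch of IPs and swap out the entries that failed.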
Finally, to be honest, with proxy IPs you get what you pay for. Since switching to ipipgo, I no longer have to get up in the middle of the night to change IPs. The system maintains the pool automatically, and the time saved is enough for a good night's sleep. Some readers say it's expensive, but compared with the losses from a blocked account, the investment is really nothing.

