
Why does your PHP crawler keep getting blocked? Try this trick
Anyone who's done web scraping knows the biggest headache with a PHP crawler is getting your IP blocked. Last month a guy doing e-commerce price comparison came to me saying his script got shut down within half an hour of starting, and switching servers three times didn't help. Put bluntly, he just wasn't using the magic weapon that is the proxy IP.
// Typical crawler code that gets blocked: a direct connection
$html = file_get_contents('https://target-site.com');
Connecting directly like that is like grabbing a megaphone and shouting "I am a crawler" — if they don't block you, who would they block? You have to learn to use proxy IPs for cover.
Hands-on: writing a crawler with a proxy IP
First, a true story: after I switched that e-commerce guy over to a proxy IP setup, his script ran for three days without a hitch. I'll use the ipipgo proxy service as the example here; their interface is very simple:
// PHP's http stream wrapper expects the proxy as tcp://host:port;
// credentials go in a Proxy-Authorization header
$proxy = 'tcp://gateway.ipipgo.com:9020';
$auth = base64_encode('username:password');
$context = stream_context_create([
    'http' => [
        'proxy' => $proxy,
        'request_fulluri' => true,
        'header' => "Proxy-Authorization: Basic $auth\r\n",
        'timeout' => 10 // see pitfall ③ below
    ]
]);
$html = file_get_contents('destination url', false, $context);
Be careful not to step into these pitfalls:
- ① Remember to replace the username and password with the ones you got from ipipgo.
- ② Pick the right port for your proxy type (HTTP/HTTPS/SOCKS5) — and note that PHP's stream wrappers only speak HTTP proxies, so SOCKS5 needs cURL (see the sketch after this list).
- ③ The timeout setting should preferably not exceed 10 seconds.
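Pitfall ② trips people up the most. Here's a minimal cURL sketch for the SOCKS5 case; the gateway host, port, and credentials are placeholders I made up, so substitute the real values from your ipipgo dashboard:

// Minimal cURL sketch for a SOCKS5 proxy, since PHP's stream wrappers
// only handle HTTP proxies. Host, port, and credentials are placeholders.
$ch = curl_init('https://example.com');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_PROXY          => 'gateway.ipipgo.com',
    CURLOPT_PROXYPORT      => 9030, // SOCKS5 port (assumed)
    CURLOPT_PROXYTYPE      => CURLPROXY_SOCKS5,
    CURLOPT_PROXYUSERPWD   => 'username:password',
    CURLOPT_TIMEOUT        => 10,   // pitfall ③: keep it short
]);
$html = curl_exec($ch);
if ($html === false) {
    // Log curl_error($ch) and rotate to the next proxy
}
curl_close($ch);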
Practical tips: three moves to keep your crawler alive
| Move | What to do | Recommended setting |
|---|---|---|
| IP rotation | Use a different proxy for each request | A dynamic pool package from ipipgo |
| Request interval | Sleep a random 1-5 seconds | sleep(rand(1,5)) |
| Header disguise | Mimic real browser headers | Set the User-Agent |
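Moves two and three fit in a few lines. Here's a sketch that bolts a random delay and a browser-like User-Agent onto the stream-context setup from earlier (the UA string is just an example, use any real browser's):

// Move two: random 1-5 second pause between requests
sleep(rand(1, 5));

// Move three: send a browser-like User-Agent (this UA string is just an example)
$context = stream_context_create([
    'http' => [
        'proxy' => 'tcp://gateway.ipipgo.com:9020', // move one: swap this per request
        'request_fulluri' => true,
        'header' => "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36\r\n"
    ]
]);
$html = file_get_contents('destination url', false, $context);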
Here's a complete example with automatic IP rotation:
function getProxyList() {
    // Call the ipipgo API to fetch the latest proxy list
    return json_decode(file_get_contents('https://api.ipipgo.com/proxy_pool'));
}

$targetUrl = 'destination url'; // replace with your target
$retry = 3;
$done = false;
while ($retry-- > 0 && !$done) {
    $proxies = getProxyList();
    foreach ($proxies as $proxy) {
        try {
            // Set up the proxy and send the request
            $html = doRequest($targetUrl, $proxy);
            // Process the data...
            $done = true;
            break;
        } catch (Exception $e) {
            // Log the failure and move on to the next proxy
            continue;
        }
    }
}
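The example leans on a doRequest() helper that isn't defined above. Here's one hedged way to write it, reusing the stream-context setup from the first example; the $proxy field names (ip, port, user, pass) are assumptions, so check ipipgo's API docs for the real ones:

// Hypothetical helper: fetch $url through $proxy or throw on failure.
// The $proxy field names (ip, port, user, pass) are assumptions.
function doRequest($url, $proxy) {
    $auth = base64_encode($proxy->user . ':' . $proxy->pass);
    $context = stream_context_create([
        'http' => [
            'proxy' => "tcp://{$proxy->ip}:{$proxy->port}",
            'request_fulluri' => true,
            'header' => "Proxy-Authorization: Basic $auth\r\n",
            'timeout' => 10
        ]
    ]);
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        throw new Exception("Request via {$proxy->ip} failed");
    }
    return $html;
}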
Frequently Asked Questions
Q: What should I do when my proxy IPs keep dying on me?
A: Pick a provider like ipipgo whose IP pool refreshes automatically; they push out 2000+ fresh IPs every minute, so you simply can't use them up!
Q: What do I need to watch out for when crawling HTTPS sites?
A: Remember to add these two options to your code:
stream_context_set_default(['ssl' => ['verify_peer' => false, 'verify_peer_name' => false]]);
That said, the proper approach is to configure CA certificates (a sketch follows); for the specifics you can ask ipipgo technical support for a working setup.
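For reference, the verified setup looks roughly like this; the bundle path below is an assumption, so point cafile at wherever your system's CA bundle actually lives:

// Proper HTTPS verification: keep peer checks on and point PHP at a CA
// bundle. The path below is an assumption; use your system's bundle.
stream_context_set_default([
    'ssl' => [
        'verify_peer'      => true,
        'verify_peer_name' => true,
        'cafile'           => '/etc/ssl/certs/ca-certificates.crt'
    ]
]);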
Q: How can I tell whether a proxy actually works?
A: Write a heartbeat script that periodically hits https://api.ipipgo.com/check_ip; a 200 status code means the IP is still usable.
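A minimal sketch of such a heartbeat check, assuming the check_ip endpoint behaves as described above; the proxy URL is a placeholder:

// Heartbeat check: request the check_ip endpoint through the proxy and
// treat HTTP 200 as "proxy alive". The proxy URL format is a placeholder.
function proxyAlive($proxyUrl) {
    $ch = curl_init('https://api.ipipgo.com/check_ip');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_PROXY          => $proxyUrl, // e.g. 'http://user:pass@gateway.ipipgo.com:9020'
        CURLOPT_TIMEOUT        => 10,
    ]);
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $status === 200;
}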
Finally, a few words from the heart: crawling is a long tug-of-war with the target site. The right proxy IP is like a bulletproof vest, and it saves you more trouble than you'd think. If you're doing large-scale data collection, go straight for the ipipgo enterprise package; having dedicated staff help you debug the configuration beats fumbling through it on your own.

