
A hands-on guide to web scraping with PHP
The scariest thing about data collection is getting your IP blocked! Today, let's talk about how to use PHP cURL with proxy IPs to stay out of trouble. First, a true story: a buddy of mine runs a price-comparison site and scraped directly without a proxy. By the next day, the target site had blacklisted his server's IP. Since switching to ipipgo's proxy pool, he hasn't crashed and burned again.
Basic scraping template (with proxy)
function crawlWithProxy($url) {
    $ch = curl_init();
    // Here's the key part: setting up the proxy server
    curl_setopt($ch, CURLOPT_PROXY, 'proxy.ipipgo.com:9021');
    curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'username:password');
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // skip HTTPS certificate verification (insecure)
    $output = curl_exec($ch);
    if (curl_errno($ch)) {
        // Read the error message before closing the handle
        $error = curl_error($ch);
        curl_close($ch);
        throw new Exception('Crawling error: ' . $error);
    }
    curl_close($ch);
    return $output;
}
// Example usage (目标网站 means "target site"; replace it with the real URL)
try {
    $html = crawlWithProxy('http://目标网站.com');
    echo $html;
} catch (Exception $e) {
    echo $e->getMessage();
}
Look closely at the proxy settings section: it uses the proxy address provided by ipipgo, which generally takes the format domain:port. Remember to replace the username and password with the ones you registered. The advantage of their proxy is that the IP changes automatically on each request, so the target site has a hard time detecting your scraper.
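To make the proxy settings easier to swap, the host, port, and credentials can be collected into one options array. The sketch below is only an illustration: `buildProxyOptions()` is a hypothetical helper (not part of ipipgo or cURL), and the username and password are placeholders.

```php
<?php
/**
 * Build the cURL options for an authenticated proxy.
 * buildProxyOptions() is a hypothetical helper for illustration only.
 */
function buildProxyOptions(string $host, int $port, string $user, string $pass): array
{
    if ($host === '' || $port < 1 || $port > 65535) {
        throw new InvalidArgumentException('Invalid proxy host or port');
    }
    return [
        CURLOPT_PROXY        => $host . ':' . $port, // "domain:port" format
        CURLOPT_PROXYUSERPWD => $user . ':' . $pass, // "username:password" format
    ];
}

// Placeholder credentials; replace them with the ones you registered.
$options = buildProxyOptions('proxy.ipipgo.com', 9021, 'username', 'password');

// Apply all options to a handle in one call.
$ch = curl_init();
curl_setopt_array($ch, $options);
```

Keeping the proxy details in one place means a change of account or endpoint touches a single line instead of every scraping function.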
Advanced Configuration Tips
Want more stable scraping? These parameters are worth tuning:
// Set the timeouts, in seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 15);        // limit for the whole request
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);  // limit for the connection phase only
// Send browser-like headers
$headers = [
    'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language: zh-CN,zh;q=0.9',
];
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
// Automatically follow redirects
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
Special note: if you use ipipgo's long-lasting static proxies, remember to set up the whitelist in the dashboard. If you use a dynamic proxy pool, their API can return the latest proxy list directly, which will be discussed later.
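If several scripts share these tuning options, one way to keep them consistent is to bundle them into a defaults array and apply it with `curl_setopt_array()`. This is a minimal sketch; `stableCurlOptions()` is a hypothetical helper name, and the values simply mirror the settings above.

```php
<?php
/**
 * Default options for more stable scraping, with optional overrides.
 * stableCurlOptions() is a hypothetical helper name.
 */
function stableCurlOptions(array $overrides = []): array
{
    $defaults = [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 15,   // limit for the whole request, in seconds
        CURLOPT_CONNECTTIMEOUT => 5,    // limit for the connection phase, in seconds
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_HTTPHEADER     => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept-Language: zh-CN,zh;q=0.9',
        ],
    ];
    // The + operator keeps the integer CURLOPT_* keys intact;
    // array_merge() would renumber them and break the option mapping.
    return $overrides + $defaults;
}

// Override only the total timeout, keep everything else.
$options = stableCurlOptions([CURLOPT_TIMEOUT => 30]);

$ch = curl_init();
curl_setopt_array($ch, $options);
```

The array union (`+`) is the detail to watch: cURL option constants are integers, so `array_merge()` would silently turn them into 0, 1, 2 and so on.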
Common real-world pitfalls: Q&A
Q: What should I do if the proxy connection keeps timing out?
A: First check that the proxy address and port are correct, then try adjusting the CURLOPT_CONNECTTIMEOUT parameter. If you run into this with ipipgo, their customer service responds really fast: submit a ticket in the dashboard and you should hear back within 5 minutes.
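Besides raising the timeout, occasional connection timeouts can often be absorbed by retrying with a growing pause. The following is a sketch under assumptions: `retryWithBackoff()` is a hypothetical helper, and `$fetch` stands for any callable that throws an Exception on failure (such as the `crawlWithProxy()` function above). The demo uses a simulated fetch so it runs without a network.

```php
<?php
/**
 * Call $fetch up to $maxAttempts times, waiting longer after each failure.
 * retryWithBackoff() is a hypothetical helper for illustration only.
 */
function retryWithBackoff(callable $fetch, int $maxAttempts = 3, int $baseDelayMs = 500)
{
    $lastError = null;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $fetch();
        } catch (Exception $e) {
            $lastError = $e;
            if ($attempt < $maxAttempts) {
                // The pause doubles each time: base, 2x base, 4x base, ...
                usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
            }
        }
    }
    throw new RuntimeException("All $maxAttempts attempts failed", 0, $lastError);
}

// Simulated fetch: fails twice with a timeout, then succeeds.
$calls = 0;
$result = retryWithBackoff(function () use (&$calls) {
    $calls++;
    if ($calls < 3) {
        throw new Exception('Simulated connection timeout');
    }
    return 'ok';
}, 3, 10);
```

In a real script the callable would be something like `function () use ($url) { return crawlWithProxy($url); }`.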
Q: What should I pay attention to when scraping HTTPS sites?
A: Setting CURLOPT_SSL_VERIFYPEER to false and CURLOPT_SSL_VERIFYHOST to 0 is not very safe, but it gets around certificate errors. Alternatively, download the CA certificate from ipipgo's official website and specify its path (for example with CURLOPT_CAINFO), which is more secure.
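For the more secure route, verification stays on and cURL is pointed at a CA file. A minimal sketch, assuming a hypothetical helper `secureTlsOptions()`; the CA file path you pass in is whatever you downloaded, and nothing here is specific to ipipgo.

```php
<?php
/**
 * Build cURL options that keep certificate verification enabled.
 * secureTlsOptions() is a hypothetical helper; the CA file path is up to you.
 */
function secureTlsOptions(?string $caFile = null): array
{
    $options = [
        CURLOPT_SSL_VERIFYPEER => true, // verify the certificate chain
        CURLOPT_SSL_VERIFYHOST => 2,    // verify the certificate matches the host name
    ];
    if ($caFile !== null) {
        if (!is_readable($caFile)) {
            throw new InvalidArgumentException("CA file not readable: $caFile");
        }
        $options[CURLOPT_CAINFO] = $caFile; // PEM file used to verify the peer
    }
    return $options;
}

// Without a custom CA file, cURL falls back to its default CA store.
$options = secureTlsOptions();
```

Passing a path, for example `secureTlsOptions('/path/to/ca.pem')`, adds CURLOPT_CAINFO; the path shown is only a placeholder.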
Q: How do I switch proxy IPs automatically?
A: ipipgo's dynamic proxy service comes with this feature; just fetch the proxy address from their API in your code. For example:
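$proxy = file_get_contents('https://api.ipipgo.com/dynamic?token=你的令牌'); // 你的令牌 = your token
if ($proxy === false) { throw new Exception('Could not fetch a proxy from the API'); }
curl_setopt($ch, CURLOPT_PROXY, trim($proxy)); // trim any trailing newline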
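Since the API response goes straight into CURLOPT_PROXY, it can be worth validating it first. The sketch below assumes the API returns a single plain-text `host:port` line, which this article does not confirm; `parseProxyAddress()` is a hypothetical helper, and the sample addresses are made up.

```php
<?php
/**
 * Validate a "host:port" string before handing it to CURLOPT_PROXY.
 * parseProxyAddress() is a hypothetical helper; the plain-text "host:port"
 * response format is an assumption, not documented behavior.
 */
function parseProxyAddress($raw): ?string
{
    if (!is_string($raw)) {
        return null; // file_get_contents() returns false on failure
    }
    $line = trim($raw);
    if (!preg_match('/^([A-Za-z0-9.\-]+):(\d{1,5})$/', $line, $m)) {
        return null; // not in host:port form
    }
    $port = (int) $m[2];
    return ($port >= 1 && $port <= 65535) ? $line : null;
}

// Made-up sample responses for illustration.
$good = parseProxyAddress("203.0.113.10:8080\n");
$bad  = parseProxyAddress('<html>error page</html>');
```

If the helper returns null, it is safer to stop or retry than to send the request with a broken proxy setting.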
Tips for using ipipgo
Their proxies come in three plans; pick one according to your needs:
| Plan Type | Applicable Scenarios | Recommended Configuration |
|---|---|---|
| Dynamic rotation | High-frequency scraping | Automatic IP change per request |
| Static long-lasting | Fixed IP required | 24-hour validity period |
| Custom exclusive | Enterprise requirements | Exclusive IP pool + custom strategy |
New users get a 2 GB free traffic pack on registration, which is enough for testing. There is also a hidden perk: you can use their alternate domain proxy2.ipipgo.net in your code, which helps when the main domain is blocked by some sites.
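Switching to the alternate domain can be automated by trying the endpoints in order. This is only a sketch: `fetchWithFallback()` is a hypothetical helper, `$fetch` is any callable that takes a proxy address and throws an Exception on failure, and port 9021 on the alternate domain is an assumption (check your dashboard for the real one). The demo simulates a block on the main domain so it runs offline.

```php
<?php
/**
 * Try each proxy endpoint in order and return the first successful result.
 * fetchWithFallback() is a hypothetical helper for illustration only.
 */
function fetchWithFallback(array $proxies, callable $fetch)
{
    $lastError = null;
    foreach ($proxies as $proxy) {
        try {
            return $fetch($proxy);
        } catch (Exception $e) {
            $lastError = $e; // remember the failure and try the next endpoint
        }
    }
    throw new RuntimeException('All proxy endpoints failed', 0, $lastError);
}

// Port 9021 on the alternate domain is an assumption, not a documented value.
$proxies = ['proxy.ipipgo.com:9021', 'proxy2.ipipgo.net:9021'];

// Simulated fetch: the main endpoint fails, the alternate one succeeds.
$result = fetchWithFallback($proxies, function (string $proxy) {
    if ($proxy === 'proxy.ipipgo.com:9021') {
        throw new Exception('Simulated block on the main domain');
    }
    return "fetched via $proxy";
});
```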
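One last slick trick: when you run the scraping script on a schedule with crontab, remember to add a random sleep(mt_rand(1,5)) in the code. This both mimics a real person's behavior and helps avoid triggering the target site's risk-control mechanisms. Combined with ipipgo's proxies, scraping can go largely unnoticed. I've tested it myself and it works!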
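The random pause fits naturally into a loop over the URLs to scrape. A minimal sketch, assuming a hypothetical helper `crawlWithPauses()`; `$fetch` is any callable that takes a URL, for instance the `crawlWithProxy()` function from earlier. The demo passes 0 for both bounds and a fake fetch so it finishes instantly.

```php
<?php
/**
 * Fetch each URL in turn, sleeping a random number of seconds in between.
 * crawlWithPauses() is a hypothetical helper for illustration only.
 */
function crawlWithPauses(array $urls, callable $fetch, int $min = 1, int $max = 5): array
{
    $results = [];
    $urls = array_values($urls);
    $last = count($urls) - 1;
    foreach ($urls as $i => $url) {
        $results[$url] = $fetch($url);
        if ($i < $last) {
            sleep(mt_rand($min, $max)); // random pause between requests
        }
    }
    return $results;
}

// Demo with a fake fetch and no waiting (min = max = 0).
$pages = crawlWithPauses(['a', 'b'], fn ($url) => strtoupper($url), 0, 0);
```

In a cron job you would keep the default 1 to 5 second range and pass `'crawlWithProxy'` as the callable.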

