PHP crawler getting its IP blocked? Try this trick
Anyone who has done web scraping knows that the biggest headache is the target site suddenly blocking your IP. Newcomers writing crawlers in PHP are hit especially often: the script runs for a while and then the data simply stops coming in. This is where proxy IPs come in. A real case: last week a friend who runs a price comparison site wrote a collection script in native PHP, and after only two days of running, more than 20 IPs had been blocked. Adding a proxy pool solved the problem.
Hands-on: setting up a proxy in a PHP crawler
Here is an example of how to do this with the commonly used Guzzle HTTP library:
require 'vendor/autoload.php'; // Composer autoloader, needed for Guzzle

use GuzzleHttp\Client;

// ipipgo proxy configuration; username, password, and port are placeholders
$proxy = 'http://username:password@gateway.ipipgo.com:port';
$client = new Client([
    'proxy'   => $proxy,
    'timeout' => 30,
]);
try {
    // target-site.com is a placeholder for the site you want to collect
    $response = $client->get('https://target-site.com');
    echo $response->getBody();
} catch (Exception $e) {
    // It is recommended to log the error and switch to a backup proxy automatically.
    echo "Capture failed: " . $e->getMessage();
}
Three points to note: 1. The proxy address must include the account username and password. 2. Don't set the timeout too short. 3. Exception handling is a must; otherwise the whole script crashes when the proxy fails.
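The comment in the example recommends switching to a backup proxy automatically when one fails. Below is a minimal sketch of that idea in plain PHP. The function name `fetchWithFailover` and the `$fetch` callback are hypothetical helpers made up for illustration, not part of Guzzle or of ipipgo's API; the callback is assumed to perform the request through the given proxy and to throw on failure.

```php
<?php
/**
 * Try each proxy in turn until one request succeeds.
 * Hypothetical helper: $fetch is any callable that takes a proxy URL,
 * performs the request through it, returns the body, and throws on failure.
 */
function fetchWithFailover(array $proxies, callable $fetch): string
{
    $errors = [];
    foreach ($proxies as $index => $proxy) {
        try {
            return $fetch($proxy);
        } catch (Exception $e) {
            // Record the failure by index (not by URL, which contains the
            // password) and move on to the next proxy.
            $errors[] = "proxy #{$index}: " . $e->getMessage();
        }
    }
    throw new RuntimeException("All proxies failed:\n" . implode("\n", $errors));
}
```

With Guzzle, `$fetch` could be a closure that builds a client with `'proxy' => $proxy` and returns the response body cast to a string, so the request code from the example stays unchanged and only the proxy varies.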
Proxy IP selection guide: how to avoid the pitfalls
There are all sorts of proxy types on the market, so here is a comparison table for newcomers:
Type | Speed | Stability | Applicable scenarios |
---|---|---|---|
Datacenter proxies | Fast | Medium | Routine collection |
Residential proxies | Medium | High | Sites with strong anti-crawling measures |
Mobile proxies | Slow | Low | Special needs |
ipipgo's dynamic residential proxies, for example, are better suited to e-commerce data collection: more than 20% of their IP pool is refreshed every day, which makes the IPs harder to detect.
Practical experience in the field
Here are a few pitfalls that are easy to fall into:
1. Don't use free proxies! Nine out of ten don't work, and they are easily flagged by anti-crawler systems.
2. Concurrency control is very important. Newcomers are advised to start testing with 5 threads.
3. Rotate the User-Agent regularly; it works better in combination with proxy IPs.
4. Don't try to force your way past a CAPTCHA; use a CAPTCHA-solving service if you need to.
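Points 2 and 3 can be sketched in a few lines of plain PHP. This is only an illustration under stated assumptions: the names `USER_AGENTS`, `buildRequestOptions`, and `batchUrls` are hypothetical, and the User-Agent strings are sample values. The returned array uses Guzzle's `proxy`, `timeout`, and `headers` request options.

```php
<?php
// Sample User-Agent strings, chosen for illustration only.
const USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0',
];

/**
 * Build a Guzzle-style request options array that pairs a proxy
 * with a randomly chosen User-Agent.
 */
function buildRequestOptions(string $proxy, array $userAgents = USER_AGENTS): array
{
    return [
        'proxy'   => $proxy,
        'timeout' => 30,
        'headers' => ['User-Agent' => $userAgents[array_rand($userAgents)]],
    ];
}

/** Split a URL list into batches of at most $size (5 by default). */
function batchUrls(array $urls, int $size = 5): array
{
    return array_chunk($urls, $size);
}
```

Processing one batch at a time keeps the number of in-flight requests at or below the batch size, which is a simple way to start at 5 and raise the limit only after testing.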
Frequently Asked Questions
Q: What should I do if my proxy IP is slow?
A: Prioritize proxy nodes in the same geographic region as the target. ipipgo supports filtering by city, which is very useful.
Q: What should I choose when I need to collect from overseas websites?
A: Choose ipipgo's overseas nodes directly. Their Hong Kong and U.S. data centers measure in at under 200 ms.
Q: How do I choose a cost-effective proxy plan?
A: For short-term projects, choose pay-as-you-go. For long-term use, ipipgo's annual plan saves around 40% and also includes a retry-on-failure feature for requests.
Why I recommend ipipgo
I have used it for more than two years, and three things stand out: 1. After-sales support responds quickly; I once filed a ticket at three in the morning and got a reply within seconds. 2. API integration is simple, and the documentation reads like a beginner's tutorial. 3. The small hourly-billed plan is particularly economical. They recently launched an IPv6 proxy pool, which I have personally tested and found effective for collecting from certain government websites.
Finally, a reminder for newcomers: proxy IPs are not a panacea. They work best when combined with techniques such as random sleep intervals and request header spoofing. If you run into a specific problem, you can contact technical support through the ipipgo official website; their technical support is considered fairly reliable in the industry.
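The random sleep and header spoofing mentioned above can be sketched as follows. This is a minimal illustration, not a recommended configuration: the function names, the default delay range of 500 to 3000 ms, and the header values are all assumptions.

```php
<?php
/**
 * Return a random delay in microseconds, drawn uniformly from
 * the range [$minMs, $maxMs] milliseconds.
 */
function randomDelayMicros(int $minMs = 500, int $maxMs = 3000): int
{
    return random_int($minMs, $maxMs) * 1000;
}

/** Browser-like headers to send alongside the User-Agent (illustrative values). */
function browserLikeHeaders(): array
{
    return [
        'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.9',
    ];
}
```

Calling `usleep(randomDelayMicros())` before each request spreads the requests out irregularly, and the array from `browserLikeHeaders()` can be merged into the `headers` option of the request.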