
I. Why bother with proxy IPs? What to do when you keep getting blocked
Recently, a lot of readers have been asking what to do when a crawler written in PHP keeps getting its IP blocked by the target site. It's like getting your account banned in a game: if you grind dungeons on a single account all day, of course the system bans you. That's when you need proxy IPs, this cheat... oh no, I mean this tool.
To give a real case: last week a friend who runs a price comparison website needed to fetch data from an e-commerce platform 50,000 times per hour. Crawling directly from his own server, the IP landed on the blacklist in under two hours. He later switched to ipipgo's exclusive proxy pool and set up an automatic IP switch every 50 requests, and it has now been running stably for a week without a hitch.
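The "switch IP every 50 requests" idea from the case above can be sketched roughly like this. This is only a minimal illustration, not the setup that site actually used: the `ProxyRotator` class and the proxy addresses are hypothetical names made up for the example.

```php
<?php
// Hypothetical helper: hands out proxies from a pool and moves on to the
// next proxy after every $perProxy requests (50 in the case described above).
final class ProxyRotator
{
    private array $proxies;   // list of "host:port" strings (placeholders)
    private int $perProxy;    // requests served by one proxy before switching
    private int $count = 0;   // total requests handed out so far

    public function __construct(array $proxies, int $perProxy = 50)
    {
        if ($proxies === [] || $perProxy < 1) {
            throw new InvalidArgumentException('Need at least one proxy and perProxy >= 1');
        }
        $this->proxies = array_values($proxies);
        $this->perProxy = $perProxy;
    }

    // Returns the proxy to use for the next request and advances the counter.
    public function next(): string
    {
        $index = intdiv($this->count, $this->perProxy) % count($this->proxies);
        $this->count++;
        return $this->proxies[$index];
    }
}
```

Before each request you would call something like `curl_setopt($ch, CURLOPT_PROXY, $rotator->next());`. The pool wraps around once every proxy has served its share.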
II. How to choose a proxy IP? Don't just grab any one that works
There are three types of proxy on the market. A table makes the differences clear:
| Type | Characteristics | Use case |
|---|---|---|
| Transparent proxy | The target can see your real IP | Basically useless for crawling |
| Anonymous proxy | Hides your real IP but reveals that a proxy is in use | General data collection |
| High-anonymity (elite) proxy | Hides both your real IP and the fact that a proxy is in use | Sites with strict anti-crawling measures |
Take ipipgo's high-anonymity proxies as an example: in actual tests crawling a large social platform, the success rate was 37% higher than with ordinary proxies. The key is that their IP pool is refreshed quickly, and many of the IPs come from data-center ranges that have not been publicly listed, so they are harder to identify.
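To see which of the three types a given proxy falls into, you can look at the headers the target server actually receives. The sketch below is only a rough heuristic: the function name `classifyProxy` is made up for this example, and it assumes the proxy reveals itself through the conventional headers listed in the code, which not every proxy does.

```php
<?php
// Classifies a proxy by the request headers the target server received.
// Assumption: a proxy that reveals itself does so through the conventional
// headers listed below; real proxies may use other headers or none at all.
function classifyProxy(array $headersSeenByTarget, string $realIp): string
{
    $proxyHeaders = ['via', 'x-forwarded-for', 'forwarded', 'x-real-ip', 'proxy-connection'];
    $headers = array_change_key_case($headersSeenByTarget, CASE_LOWER);

    $revealsProxy = false;
    foreach ($proxyHeaders as $name) {
        if (!isset($headers[$name])) {
            continue;
        }
        $revealsProxy = true;
        // Simple substring match: good enough for a sketch, not a strict IP comparison.
        if ($realIp !== '' && strpos((string) $headers[$name], $realIp) !== false) {
            return 'transparent'; // the real IP leaked through
        }
    }
    return $revealsProxy ? 'anonymous' : 'high-anonymity';
}
```

One way to get the header array is to request an echo endpoint such as http://httpbin.org/headers through the proxy and decode the JSON, though some echo services filter certain headers, so treat the result as a hint rather than proof.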
III. Hands-on PHP code
Let's demonstrate this with the most commonly used cURL extension. Note the two key parameters: CURLOPT_PROXY and CURLOPT_PROXYUSERPWD.
$ch = curl_init();
$proxy = 'gateway.ipipgo.net:9021'; // proxy server address
$auth = 'username:password'; // authentication information obtained in the ipipgo backend
curl_setopt_array($ch, [
    CURLOPT_URL => 'https://目标网站.com/api', // placeholder for the target site URL
    CURLOPT_PROXY => $proxy,
    CURLOPT_PROXYUSERPWD => $auth,
    CURLOPT_TIMEOUT => 30, // give up after 30 seconds
    CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
    CURLOPT_SSL_VERIFYPEER => false, // test environments can turn off certificate validation
]);
$response = curl_exec($ch);
if (curl_errno($ch)) {
    // It is recommended to write an error log here
    echo 'Crawl failed: ' . curl_error($ch);
}
curl_close($ch);
The key point here is the timeout setting. Many newcomers don't set CURLOPT_TIMEOUT, so the request just hangs when the proxy can't connect. Set it to 10-60 seconds depending on your business needs, and switch straight to the next proxy after a timeout.
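"Switch to the next proxy after a timeout" could look roughly like the sketch below. The function names `fetchWithFailover` and `curlFetch` are invented for this example, the 10-second connect timeout is just one reasonable choice, and proxy authentication is left out to keep it short.

```php
<?php
// Tries each proxy in turn and returns the first successful response body,
// or null if every proxy failed. $fetch is injectable so the failover logic
// can be exercised without network access.
function fetchWithFailover(string $url, array $proxies, ?callable $fetch = null): ?string
{
    $fetch = $fetch ?? 'curlFetch';
    foreach ($proxies as $proxy) {
        $body = $fetch($url, $proxy);
        if ($body !== null) {
            return $body;
        }
        // Timed out or failed: move straight on to the next proxy.
    }
    return null;
}

// One cURL request through one proxy; returns null on any cURL error,
// including a timeout.
function curlFetch(string $url, string $proxy): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_PROXY => $proxy,
        CURLOPT_CONNECTTIMEOUT => 10, // stop trying to connect after 10 seconds
        CURLOPT_TIMEOUT => 30,        // cap the whole transfer at 30 seconds
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $body = curl_exec($ch);
    $failed = curl_errno($ch) !== 0;
    return ($failed || $body === false) ? null : $body;
}
```

Setting CURLOPT_CONNECTTIMEOUT separately means a dead proxy is dropped quickly, while a slow but working one still gets the full CURLOPT_TIMEOUT to finish the transfer.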
IV. Pitfall guide - lessons learned the hard way
1. Don't use free proxies: when tested last year, the average availability of free proxies was below 15%, and there is also a risk of data leakage
2. Remember to add a retry mechanism: something like this is recommended.
$retry = 3;
while ($retry--) {
    // execute the request code here and set $success accordingly
    if ($success) break;
    sleep(2); // after a failure, wait 2 seconds and try again
}
3. Pay attention to concurrency control: don't assume that using proxies means you can do whatever you want. Keeping the rate to 5-10 requests per second is recommended
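For the request-rate point in item 3, here is a minimal single-process throttle. The `Throttle` class is a hypothetical helper written for this example; it spaces requests evenly and does not coordinate across several worker processes.

```php
<?php
// Spaces requests so that no more than $maxPerSecond are started per second.
final class Throttle
{
    private float $interval;        // minimum seconds between two requests
    private float $nextSlot = 0.0;  // earliest time the next request may start

    public function __construct(float $maxPerSecond)
    {
        if ($maxPerSecond <= 0) {
            throw new InvalidArgumentException('maxPerSecond must be positive');
        }
        $this->interval = 1.0 / $maxPerSecond;
    }

    // Returns how many seconds a request arriving at $now has to wait,
    // and reserves the next free slot for it.
    public function reserve(float $now): float
    {
        $start = max($now, $this->nextSlot);
        $this->nextSlot = $start + $this->interval;
        return $start - $now;
    }

    // Blocks until the next request is allowed to start.
    public function wait(): void
    {
        $delay = $this->reserve(microtime(true));
        if ($delay > 0) {
            usleep((int) round($delay * 1000000));
        }
    }
}
```

With `$throttle = new Throttle(5);` you would call `$throttle->wait();` right before each `curl_exec()`, which keeps the crawler at the lower end of the 5-10 requests per second range.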
V. Q&A time - common pitfalls for beginners
Q: What should I do if my proxy IP is not working?
A: That's normal; every proxy IP expires eventually. It is recommended to fetch fresh IPs dynamically through ipipgo's API. They provide sample code.
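Whatever call you use to fetch fresh IPs, it helps to cache the list and refresh it only once it gets old. The sketch below does not use ipipgo's real API, which is not shown in this article: `$fetchFresh` is a placeholder callable you would replace with the provider's sample code, and the 300-second lifetime is an arbitrary assumption.

```php
<?php
// Caches a proxy list and refreshes it through $fetchFresh once it is older
// than $ttlSeconds. $fetchFresh is a placeholder for whatever call retrieves
// fresh proxies from your provider; it must return an array of "host:port".
final class ProxyCache
{
    private $fetchFresh;             // callable(): string[]
    private int $ttlSeconds;         // assumed lifetime of a fetched list
    private array $proxies = [];
    private ?int $fetchedAt = null;  // Unix time of the last refresh

    public function __construct(callable $fetchFresh, int $ttlSeconds = 300)
    {
        $this->fetchFresh = $fetchFresh;
        $this->ttlSeconds = $ttlSeconds;
    }

    // Returns the cached list, refreshing it first if it has expired.
    public function get(?int $now = null): array
    {
        $now = $now ?? time();
        if ($this->fetchedAt === null || $now - $this->fetchedAt >= $this->ttlSeconds) {
            $this->proxies = ($this->fetchFresh)();
            $this->fetchedAt = $now;
        }
        return $this->proxies;
    }
}
```

The optional `$now` argument is only there so the expiry logic can be tested without waiting; in normal use you would just call `$cache->get()`.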
Q: Why is the returned status code always 407?
A: That means proxy authentication failed. Check that the username and password are correct. Note that ipipgo's password is dynamically generated and has to be re-obtained every month!
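To catch a 407 in code rather than by eye, you can check the status codes cURL recorded. This is a sketch with made-up function names. For HTTPS targets the 407 typically comes back on the proxy's CONNECT reply rather than on the final response, so it checks both.

```php
<?php
// Returns true when either the final response or the proxy's CONNECT reply
// carried HTTP 407 (Proxy Authentication Required).
function isProxyAuthFailure(int $httpCode, int $connectCode): bool
{
    return $httpCode === 407 || $connectCode === 407;
}

// Reads both status codes from a cURL handle after curl_exec() has run.
function proxyAuthFailed($ch): bool
{
    return isProxyAuthFailure(
        (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
        (int) curl_getinfo($ch, CURLINFO_HTTP_CONNECTCODE)
    );
}
```

If `proxyAuthFailed($ch)` returns true, retrying with the same credentials is pointless; refresh the username and password first.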
Q: How do I test if the proxy is really working?
A: You can use this test endpoint:
curl_setopt($ch, CURLOPT_URL, 'http://httpbin.org/ip');
The returned origin field should show the proxy IP, not your local IP.
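The origin check can be automated along these lines. The function names are made up for this example, and `$localIp` is assumed to be your own public IP, which you can get by calling the same endpoint once without a proxy.

```php
<?php
// Extracts the "origin" field from an httpbin.org/ip response body,
// or returns null if the body is not the expected JSON.
function parseOrigin(string $json): ?string
{
    $data = json_decode($json, true);
    if (is_array($data) && isset($data['origin']) && is_string($data['origin'])) {
        return $data['origin'];
    }
    return null;
}

// True when the response shows the request left through an IP other than $localIp.
function proxyIsWorking(string $json, string $localIp): bool
{
    $origin = parseOrigin($json);
    if ($origin === null) {
        return false;
    }
    // origin may be a comma-separated list when several hops are reported
    $ips = array_map('trim', explode(',', $origin));
    return !in_array($localIp, $ips, true);
}
```

Pass the body returned by `curl_exec()` as `$json`. A `false` result means either the proxy is leaking your IP or the response was not what httpbin normally returns.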
Why recommend ipipgo?
After trying seven or eight proxy providers, I finally settled on ipipgo for three main reasons:
1. Fast response times, with an average latency of 200ms or less
2. The customer service staff have a real technical background and can help debug code
3. Flexible pricing, with $5-per-day packages available for small-scale testing
Their intelligent routing feature deserves a special mention: it automatically selects the optimal node. Last week, while collecting data from a government website, I went straight through their dedicated government line channel, and the success rate soared from 43% to 91%.
A final reminder: proxy IPs are not a panacea. Combine them with User-Agent rotation and request frequency control to get the most out of them. If you have any specific questions, feel free to contact technical support on the ipipgo official website. They are online 24 hours a day and can go into more detail than I have here.

