
I. Why does your crawler keep getting blocked? Try this simple trick
Anyone who has done web crawling knows the biggest headache is the target site's anti-scraping mechanisms. A script that ran fine yesterday suddenly starts returning 403 today, and you're angry enough to smash the keyboard. Don't rush to rewrite it in another language; first try putting a vest on your PHP script -- disguising your real identity with a proxy IP.
It's like going to the supermarket for free samples: show up in the same red jacket every day and of course the clerk stops you. A proxy IP is like wearing a different coat every day, so the site doesn't recognize you as the same old regular. Here we recommend ipipgo's proxy service; their IP pool is as big as the Pacific Ocean, so just pick a new identity and keep working.
II. Hands-on: putting a vest on your PHP script
First, the working code (remember to install the curl extension):
$proxy = '123.123.123.123:8888'; // proxy address provided by ipipgo
$targetUrl = 'https://target-site.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $targetUrl);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
// Important! Set proxy authentication (available in the ipipgo backend)
curl_setopt($ch, CURLOPT_PROXYUSERPWD, "username:password");
$response = curl_exec($ch);
if (curl_errno($ch)) {
    echo 'Crawl error: ' . curl_error($ch);
}
curl_close($ch);
Pay special attention to proxy authentication -- this is the big pit! Many newbies forget to set the CURLOPT_PROXYUSERPWD parameter and then can't connect at all. You'll find your proxy credentials under "Access Guide" in the ipipgo user backend; don't naively try logging in with your registration account.
III. How to choose a proxy IP type without stepping in a pit
There are three types of proxies on the market; mix them up and you're in trouble:
| Type | Suitable scenarios | ipipgo recommendation |
|---|---|---|
| Transparent proxy | Monitoring network traffic | Not recommended! Websites will recognize it |
| Anonymous proxy | Everyday data collection | Dynamic residential IP package |
| High-anonymity (elite) proxy | High-frequency / sensitive collection | Enterprise dedicated IP pool |
High-anonymity proxies deserve a closer look: they completely hide both your real IP and the proxy's own fingerprints. ipipgo's high-anonymity nodes randomly replace HTTP headers, cleanly handling even hidden fields like X-Forwarded-For.
IV. Practical guide to avoiding pitfalls
1. IP switching policy: don't wait until you're blocked to switch IPs; it's best to rotate automatically every 5-10 pages. ipipgo's API returns a list of available IPs in real time.
2. Timeout settings: some free proxies are as slow as snails, so remember to set the CURLOPT_TIMEOUT parameter and give up if there's no response within 10 seconds!
3. Exception handling: don't just throw an exception on Connection timed out; log it and retry up to 3 times.
// Smart retry example
$retry = 0;
while ($retry < 3) {
    $result = curl_exec($ch);
    if (!curl_errno($ch)) break;
    $retry++;
    sleep(2); // wait 2 seconds and try again
}
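The rotation policy from tip 1 can be sketched like this. This is a minimal sketch, not ipipgo's real SDK: fetchProxyPool() is a hypothetical stand-in for a call to the provider's IP-list API, and the IPs are placeholders.

```php
<?php
// Rotate to a fresh proxy every N pages (sketch; fetchProxyPool() is a
// hypothetical stand-in for a call to your provider's IP-list API).
function fetchProxyPool(): array {
    // In real use: query the provider API and return 'ip:port' strings.
    return ['111.111.111.111:8888', '222.222.222.222:8888'];
}

function pickProxy(array $pool, int $pageIndex, int $pagesPerIp = 5): string {
    // Integer-divide the page counter so one IP serves a block of pages,
    // then the next IP in the pool takes over, wrapping around at the end.
    $slot = intdiv($pageIndex, $pagesPerIp) % count($pool);
    return $pool[$slot];
}

$pool = fetchProxyPool();
echo pickProxy($pool, 3), "\n"; // pages 0-4 -> 111.111.111.111:8888
echo pickProxy($pool, 7), "\n"; // pages 5-9 -> 222.222.222.222:8888
```

Feed `pickProxy()` into `curl_setopt($ch, CURLOPT_PROXY, ...)` from the code in section II before each page request.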
V. Six questions you're definitely going to ask
Q1: Is it legal to use a proxy IP?
A: As long as you don't crawl sensitive data or cause damage, it's like chopping vegetables with a kitchen knife -- not a crime. All of ipipgo's IPs come from legitimate channels!
Q2: Why recommend ipipgo?
A: They have two strong points: first, IP lifetimes of up to 72 hours (others usually manage 4); second, they provide technical support for request-header disguise.
Q3: What should I do if I encounter CAPTCHA authentication?
A: A three-move combo: ① lower the request frequency ② use a headless browser ③ switch to ipipgo mobile IPs
Q4: Do I need to maintain my own IP pool?
A: No need at all! ipipgo's backend has a "smart scheduling" feature that automatically weeds out failed nodes, saving you ten times the effort of maintaining a pool yourself!
Q5: How can I tell if a proxy is high-anonymity?
A: Visit http://httpbin.org/ip through the proxy; if the returned IP matches the proxy IP you set, and no headers like X-Proxy-Id show up, it's genuinely high-anonymity.
Q6: How do I handle asynchronous collection?
A: Use Guzzle's concurrent requests plus proxy-pool polling; see the developer documentation on the ipipgo website for the exact code.
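The Guzzle version lives in their docs; the same idea can be sketched with PHP's built-in curl_multi, assigning each request a proxy from the pool round-robin. The URLs and proxy addresses here are placeholders, and in real use you would feed in the IPs returned by your provider's API.

```php
<?php
// Fetch several URLs concurrently with curl_multi, rotating through a
// proxy pool (sketch; URLs and proxies are placeholders).
function fetchConcurrently(array $urls, array $proxies = []): array {
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        if ($proxies) {
            // Round-robin proxy assignment across the pool.
            curl_setopt($ch, CURLOPT_PROXY, $proxies[$i % count($proxies)]);
        }
        curl_multi_add_handle($multi, $ch);
        $handles[$i] = $ch;
    }
    // Drive all transfers until every handle is done.
    do {
        $status = curl_multi_exec($multi, $active);
        if ($active) {
            curl_multi_select($multi); // block until there is activity
        }
    } while ($active && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $i => $ch) {
        $results[$i] = curl_errno($ch) ? null : curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $results;
}
```

Failed transfers come back as null, so you can feed those URLs straight into the retry loop from section IV.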
VI. A few honest words
Doing data collection is like fighting a guerrilla war: the key is staying flexible. Don't expect one set of parameters to work everywhere; what works today may fail tomorrow. It's worth making heavy use of ipipgo's request-header randomization feature: put User-Agent, Accept-Language and similar parameters into an array and rotate them randomly, so the anti-crawling system can't spot a pattern.
One last reminder for newcomers: never use a fixed delay like sleep(1) in a collection script; smarter sites identify crawlers by their request intervals. Random delays plus dynamic proxies are the way to go. ipipgo's SDK already wraps the relevant methods, so just call them directly -- far more reliable than reinventing the wheel.
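A random inter-request delay is a one-liner with PHP's random_int. This is a generic sketch, not the SDK method; the 1-3 second bounds are an assumption you should tune to the target site.

```php
<?php
// Sleep a random interval between requests so the spacing doesn't form a
// detectable pattern (sketch; tune the bounds for your target site).
function randomDelayMs(int $minMs = 1000, int $maxMs = 3000): int {
    $delay = random_int($minMs, $maxMs); // cryptographically random integer
    usleep($delay * 1000);               // usleep takes microseconds
    return $delay;                       // returned so callers can log it
}
```

Call it once per page fetch inside your crawl loop, e.g. `randomDelayMs();` right before `curl_exec($ch)`.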

