PHP web crawling: PHP website data collection guide

I. Why does your crawler keep getting blocked? Try this simple trick

Anyone who has done web crawling knows that the biggest headache is the target site's anti-crawling mechanism. A script that ran fine yesterday suddenly returns 403 today, and you want to smash the keyboard. Don't rush to rewrite it in another language. Try putting a disguise on your PHP script first: hide your real identity behind a proxy IP.

It's like going to the supermarket for free samples: if you show up in the same red shirt every day, of course the clerk will stop you. A proxy IP is like wearing a different-colored coat each day, so the site doesn't recognize you as a repeat visitor. Here we recommend ipipgo's proxy service; its IP pool is as big as the Pacific Ocean, so you can just pick a new identity and keep working.

II. Step by step: putting a disguise on your PHP script

Here is the working code first (make sure the cURL extension is installed):


<?php
$proxy = '123.123.123.123:8888'; // proxy address provided by ipipgo
$targetUrl = 'https://example.com'; // placeholder: replace with the target site

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $targetUrl);
curl_setopt($ch, CURLOPT_PROXY, $proxy);          // route the request through the proxy
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 30);            // overall timeout in seconds

// Important! Set proxy authentication (credentials are available in the ipipgo backend)
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'username:password');

$response = curl_exec($ch);
if (curl_errno($ch)) {
    echo 'Crawl error: ' . curl_error($ch);
}
curl_close($ch);

Pay attention to proxy authentication; this is where people trip up! Many newcomers forget to set the CURLOPT_PROXYUSERPWD option and then can't connect at all. The ipipgo proxy credentials are in the "Access Guide" in the user backend, so don't make the mistake of using your website login account.
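A 403 is an HTTP status, not a cURL error, so curl_errno() alone will not catch it. The sketch below wraps the request in a function and also reads the status code. The names fetchViaProxy and looksBlocked are made up for illustration, and treating 403 and 429 as "blocked" is an assumption; adjust the list to what the target site actually returns.

```php
<?php
/** Returns true for status codes that commonly signal blocking or rate limiting (assumed list). */
function looksBlocked(int $status): bool
{
    return in_array($status, [403, 429], true);
}

/**
 * Fetches $url through $proxy and reports status, body and transport error separately.
 * Hypothetical helper; only the curl_* calls are standard PHP.
 */
function fetchViaProxy(string $url, string $proxy, string $proxyUserPwd): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_PROXY          => $proxy,
        CURLOPT_PROXYUSERPWD   => $proxyUserPwd,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30,
    ]);

    $body   = curl_exec($ch);
    $error  = curl_errno($ch) ? curl_error($ch) : null;      // transport-level failure, if any
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);   // 0 when no response arrived
    curl_close($ch);

    return [
        'status'  => $status,
        'body'    => $body === false ? null : $body,
        'error'   => $error,
        'blocked' => looksBlocked($status),
    ];
}
```

A call such as fetchViaProxy($targetUrl, $proxy, 'username:password') then lets the script tell "the proxy failed" (error is set) apart from "the site refused us" (blocked is true), which call for different reactions.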

III. How to choose a proxy type without stepping into pits

There are three types of proxies on the market; mix them up and you'll run into trouble:

Type                  | Applicable scenarios                 | ipipgo recommendation
Transparent proxy     | Monitoring network traffic           | Not recommended! Will be recognized by the website
Anonymous proxy       | Daily data collection                | Dynamic Residential IP Package
High-anonymity proxy  | High-frequency/sensitive collection  | Enterprise Exclusive IP Pool

The one to focus on is the high-anonymity proxy, which completely hides both your real IP and the signs that a proxy is in use. ipipgo's high-anonymity nodes randomly replace HTTP headers and cleanly handle even hidden fields such as X-Forwarded-For.
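To make "proxy characteristics" concrete, here is a rough sketch that scans the request headers a server received and lists the ones that usually give a proxy away. The function name and the marker list are assumptions for illustration, and the list is not exhaustive.

```php
<?php
/**
 * Returns the names of headers that usually indicate a proxy is in use.
 * $seenHeaders is a name => value map of the request headers as the server saw them.
 * Hypothetical helper with an illustrative marker list.
 */
function findProxyLeakHeaders(array $seenHeaders): array
{
    $markers = ['via', 'forwarded', 'x-forwarded-for', 'x-real-ip', 'x-proxy-id', 'proxy-connection'];

    $leaks = [];
    foreach (array_keys($seenHeaders) as $name) {
        // Header names are case-insensitive, so compare in lowercase.
        if (in_array(strtolower((string) $name), $markers, true)) {
            $leaks[] = $name;
        }
    }
    return $leaks;
}
```

One way to feed it: request http://httpbin.org/headers through the proxy, run json_decode($body, true) on the response, and pass the 'headers' entry to findProxyLeakHeaders(). An empty result is only a hint, since the echo service's own infrastructure may add or strip headers.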

IV. Practical guide to avoiding pitfalls

1. IP switching policy: Don't wait until you're blocked to switch IPs; switching automatically every 5-10 pages is recommended. ipipgo's API returns a list of available IPs in real time.

2. Timeout setting: Some free proxies are as slow as a snail. Remember to set CURLOPT_TIMEOUT and give up if there's no response within 10 seconds!

3. Exception handling: Don't throw an exception when you hit "Connection timed out"; log it and retry up to 3 times.


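// Smart retry example ($ch is the configured cURL handle from the earlier snippet)
$retry = 0;
while ($retry < 3) {
    $result = curl_exec($ch);
    if (!curl_errno($ch)) {
        break; // success, stop retrying
    }
    error_log('Attempt ' . ($retry + 1) . ' failed: ' . curl_error($ch)); // log instead of throwing
    $retry++;
    sleep(2); // wait 2 seconds and try again
}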
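The "switch every 5-10 pages" advice from item 1 can be sketched as a small rotator class. ProxyRotator is a made-up name, and the proxy list is assumed to come from wherever you obtain it; the ipipgo API call itself is not shown because its details are not given here.

```php
<?php
/** Hypothetical helper: hands out proxies round-robin, moving to the next one every $pagesPerProxy pages. */
final class ProxyRotator
{
    private array $proxies;
    private int $pagesPerProxy;
    private int $served = 0;

    public function __construct(array $proxies, int $pagesPerProxy = 5)
    {
        if ($proxies === [] || $pagesPerProxy < 1) {
            throw new InvalidArgumentException('Need at least one proxy and a positive page count.');
        }
        $this->proxies = array_values($proxies);
        $this->pagesPerProxy = $pagesPerProxy;
    }

    /** Returns the proxy to use for the next page. */
    public function next(): string
    {
        // Every $pagesPerProxy calls advance one slot; wrap around at the end of the list.
        $index = intdiv($this->served, $this->pagesPerProxy) % count($this->proxies);
        $this->served++;
        return $this->proxies[$index];
    }
}
```

Before each page, call curl_setopt($ch, CURLOPT_PROXY, $rotator->next()) so the switch happens on a schedule and not only after a block.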

V. Six questions you definitely want to ask

Q1: Is it legal to use a proxy IP?
A: As long as you don't crawl sensitive data and don't cause any damage, yes. Chopping vegetables with a kitchen knife is not a crime. All ipipgo IPs come from legitimate channels!

Q2: Why do you recommend ipipgo?
A: It has two strengths: first, IP survival time of up to 72 hours (others usually offer 4 hours); second, technical support for request header masquerading.

Q3: What should I do if I encounter CAPTCHA authentication?
A: Use a three-part approach: ① reduce the request frequency ② use a headless browser ③ switch to ipipgo mobile IPs

Q4: Do I need to maintain my own IP pool?
A: Not at all! ipipgo's backend has a "smart scheduling" feature that automatically removes failed nodes and saves you 10 times the time of maintaining a pool yourself!

Q5: How can I tell whether a proxy is high-anonymity?
A: Visit http://httpbin.org/ip through the proxy. If the returned IP matches the proxy IP you set, and the headers the server received (see http://httpbin.org/headers) contain no proxy markers such as X-Proxy-Id, it is a true high-anonymity proxy.

Q6: How do I handle asynchronous collection?
A: Use Guzzle's concurrent requests together with proxy pool polling; see the developer documentation on the ipipgo website for the specific code.
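For Q6, here is a rough sketch of the same idea (concurrent requests plus round-robin polling of a proxy pool). Note that it swaps in PHP's built-in curl_multi functions in place of Guzzle, so it runs with nothing but the cURL extension. The function names are made up for illustration, and the URLs are assumed to be unique because they are used as array keys.

```php
<?php
/** Pairs each URL with a proxy from the pool in round-robin order. Hypothetical helper. */
function assignProxies(array $urls, array $proxies): array
{
    if ($proxies === []) {
        throw new InvalidArgumentException('Proxy pool is empty.');
    }
    $proxies = array_values($proxies);
    $plan = [];
    foreach (array_values($urls) as $i => $url) {
        $plan[$url] = $proxies[$i % count($proxies)];
    }
    return $plan;
}

/** Fetches all URLs concurrently; returns url => body, with null for failed requests. */
function fetchAllViaProxies(array $urls, array $proxies, string $proxyUserPwd): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach (assignProxies($urls, $proxies) as $url => $proxy) {
        $ch = curl_init();
        curl_setopt_array($ch, [
            CURLOPT_URL            => $url,
            CURLOPT_PROXY          => $proxy,
            CURLOPT_PROXYUSERPWD   => $proxyUserPwd,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 30,
            CURLOPT_PRIVATE        => $url, // tag the handle so results can be matched later
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);

    // Collect per-transfer results; anything not reported as OK stays null.
    $results = array_fill_keys($urls, null);
    while (($info = curl_multi_info_read($mh)) !== false) {
        if ($info['result'] === CURLE_OK) {
            $url = curl_getinfo($info['handle'], CURLINFO_PRIVATE);
            $results[$url] = curl_multi_getcontent($info['handle']);
        }
    }

    foreach ($handles as $ch) {
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $results;
}
```

Keep the batch size modest: firing dozens of requests at once through a handful of proxies looks exactly like the traffic pattern anti-crawling systems watch for.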

VI. Some honest advice

Data collection is like guerrilla warfare: the key is to stay flexible and adaptable. Don't expect one set of parameters to work everywhere; what works well today may fail tomorrow. It's worth making more use of ipipgo's request header randomization feature: put parameters such as User-Agent and Accept-Language into arrays and rotate them randomly, so the anti-crawling system can't find a pattern.
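The array-rotation idea can also be done by hand in plain PHP. The sketch below is not ipipgo's feature, just a minimal version of the same technique; the function name is made up and the sample values are illustrative, so swap in strings that match real browser traffic.

```php
<?php
/** Builds a header list for CURLOPT_HTTPHEADER with randomly chosen values. Hypothetical helper. */
function randomRequestHeaders(array $userAgents, array $languages): array
{
    return [
        'User-Agent: ' . $userAgents[array_rand($userAgents)],
        'Accept-Language: ' . $languages[array_rand($languages)],
    ];
}

// Sample pools (illustrative values only).
$userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
];
$languages = ['en-US,en;q=0.9', 'en-GB,en;q=0.8', 'de-DE,de;q=0.9,en;q=0.5'];

$headers = randomRequestHeaders($userAgents, $languages);
// Apply to a request with: curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
```

Pick a fresh set per request, or per proxy switch, so that the same IP does not appear to change browsers in the middle of a session.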

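One last reminder for newcomers: never use a fixed delay such as sleep(1) in a collection script, because smarter websites identify crawlers by the interval between requests. Random delays plus dynamic proxies are the way to go. ipipgo's SDK already wraps the relevant methods, so you can call them directly, which is far more reliable than reinventing the wheel.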
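If you are not using an SDK, a random delay takes only a few lines of plain PHP. This is a minimal sketch; the function name and the 800-3500 ms range in the usage comment are assumptions, not recommended values.

```php
<?php
/** Returns a random delay in microseconds between $minMs and $maxMs milliseconds (inclusive). */
function randomDelayMicros(int $minMs, int $maxMs): int
{
    if ($minMs < 0 || $maxMs < $minMs) {
        throw new InvalidArgumentException('Expected 0 <= minMs <= maxMs.');
    }
    return random_int($minMs, $maxMs) * 1000;
}

// Usage: pause somewhere between 0.8 and 3.5 seconds before the next request.
// usleep(randomDelayMicros(800, 3500));
```

Because the interval differs on every request, the timing no longer forms the regular pattern that a fixed sleep(1) produces.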
