Parsing HTML in PHP: a DOMDocument tutorial

A PHP scraping essential: DOMDocument, step by step

Anyone who does data collection has probably run into this problem: the target site changes its HTML structure beyond recognition, and a crawler script that used to work simply stops. Today we will use PHP's built-in DOMDocument component to take a web page's structure apart cleanly, and pair it with the ipipgo proxy IP service to keep collection stable.

I. Why use a proxy IP for data scraping?

Many sites have access controls in place: frequent visits from the same IP get blocked almost immediately. This is where a specialized proxy service such as ipipgo comes in. It is like giving the crawler a set of stand-ins. For example:


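// PHP's http wrapper expects a tcp:// proxy address; credentials go in a header
$auth = base64_encode('username:password');
$context = stream_context_create([
    'http' => [
        'proxy' => 'tcp://gateway.ipipgo.io:9020',
        'request_fulluri' => true, // send the full URL to the proxy
        'header' => "Proxy-Authorization: Basic $auth",
    ],
]);
$html = file_get_contents('destination URL', false, $context);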

The gateway.ipipgo.io host in this code is ipipgo's smart routing entry point, which automatically assigns the most suitable node. In our tests with this proxy, the probability of being blocked dropped from 80% to below 5%.
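When more control over timeouts and error handling is needed, the same request can be sketched with PHP's cURL extension. The helper names buildProxyOptions and fetchViaProxy, the timeout values, and the placeholder credentials are assumptions made for illustration; they do not come from ipipgo's documentation.

```php
<?php
// Hypothetical helper: build cURL options for an authenticated HTTP proxy.
function buildProxyOptions(string $proxyHost, int $proxyPort, string $user, string $pass): array
{
    return [
        CURLOPT_PROXY          => $proxyHost,
        CURLOPT_PROXYPORT      => $proxyPort,
        CURLOPT_PROXYUSERPWD   => $user . ':' . $pass,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 10,   // assumed value, tune for your target
        CURLOPT_TIMEOUT        => 30,   // assumed value, tune for your target
    ];
}

// Hypothetical helper: fetch a URL through the proxy, returning null on failure.
function fetchViaProxy(string $url, array $options): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $body = curl_exec($ch);
    if (!is_string($body)) {
        error_log('Proxy request failed: ' . curl_error($ch));
        return null;
    }
    return $body;
}

$options = buildProxyOptions('gateway.ipipgo.io', 9020, 'username', 'password');
// $html = fetchViaProxy('https://example.com/', $options);
```

A null return lets the caller decide whether to retry, switch proxy, or skip the page, instead of passing an empty string to the parser.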

II. DOMDocument basics: the three core moves

Once we have the page source, we can start taking it apart:


$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
$xpath = new DOMXPath($dom);

// Example: grab all product prices
$prices = $xpath->query('//span[@class="price"]');
foreach ($prices as $node) {
    echo $node->nodeValue . "\n";
}
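One caveat with the query above: @class="price" matches only when the attribute is exactly "price", so an element such as <span class="price sale"> is skipped. A minimal sketch of a token-based class match follows; the helper name classQuery and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: build an XPath that matches one class token,
// even when the element carries several classes.
function classQuery(string $tag, string $class): string
{
    return sprintf(
        '//%s[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $tag,
        $class
    );
}

// Made-up sample markup for demonstration
$html = '<div><span class="price">10</span><span class="price sale">8</span>'
      . '<span class="old-price">12</span></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$values = [];
foreach ($xpath->query(classQuery('span', 'price')) as $node) {
    $values[] = $node->nodeValue;
}
echo implode(',', $values), "\n"; // prints "10,8"; old-price is not matched
```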

Watch out for these two pitfalls:

1. Handle page encoding issues with mb_convert_encoding
2. Suppress HTML parsing warnings with the @ operator
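Both pitfalls can be handled in a single loading helper. The sketch below assumes the source encoding is known in advance. The name loadHtmlDocument is hypothetical, and libxml_use_internal_errors is used here as an alternative to the @ operator, since it silences parser warnings without hiding unrelated errors.

```php
<?php
// Hypothetical helper: load HTML of a known source encoding into a DOMDocument.
function loadHtmlDocument(string $html, string $sourceEncoding = 'UTF-8'): DOMDocument
{
    // Pitfall 1: normalize the page to UTF-8 before parsing
    if (strcasecmp($sourceEncoding, 'UTF-8') !== 0) {
        $html = mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
    }

    // Pitfall 2: collect parse errors internally instead of emitting warnings
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The meta tag tells the parser the bytes are UTF-8; without a charset hint,
    // loadHTML may fall back to ISO-8859-1 and garble non-ASCII text
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = loadHtmlDocument('<span class="price">¥99</span>');
$xpath = new DOMXPath($dom);
$price = $xpath->query('//span[@class="price"]')->item(0)->nodeValue;
echo $price, "\n";
```

For a page served in another encoding, pass it as the second argument, for example loadHtmlDocument($html, 'GBK').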

III. In practice: a collection plan for dynamic websites

When a site loads its data with JavaScript, you can run distributed collection through proxy IPs. For example, a collection cluster might be configured like this:

Node type                 Concurrency   Switching strategy
Domestic residential IP   10 threads    Change IP on every request
Overseas data center IP   5 threads     Change IP hourly
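The two switching strategies in the table can be expressed as a small decision function. This is only a sketch: the array keys, the strategy names per_request and hourly, and the helper shouldRotate are hypothetical and not part of any ipipgo API.

```php
<?php
// Illustrative cluster settings mirroring the table above (keys are hypothetical)
$cluster = [
    'domestic_residential' => ['threads' => 10, 'rotate' => 'per_request'],
    'overseas_datacenter'  => ['threads' => 5,  'rotate' => 'hourly'],
];

// Hypothetical helper: decide whether a worker should switch to a new IP.
function shouldRotate(string $strategy, int $lastRotatedAt, int $now): bool
{
    switch ($strategy) {
        case 'per_request':
            return true;                            // new IP for every request
        case 'hourly':
            return ($now - $lastRotatedAt) >= 3600; // new IP once an hour has passed
        default:
            return false;                           // unknown strategy: keep the current IP
    }
}

$now = time();
var_dump(shouldRotate($cluster['domestic_residential']['rotate'], $now, $now));      // bool(true)
var_dump(shouldRotate($cluster['overseas_datacenter']['rotate'], $now - 600, $now)); // bool(false)
```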

Use the ipipgo API to get the IP pool:


$ip_list = json_decode(file_get_contents('https://api.ipipgo.com/getips?type=http&num=20'));
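The response format of this endpoint is not shown here, so the sketch below assumes, purely for illustration, that it returns a JSON array of "host:port" strings. The helpers parseIpList and pickProxy are hypothetical, and the sample data uses documentation-only addresses; adjust the parsing to whatever the real API returns.

```php
<?php
// Hypothetical helper: decode an IP-pool response.
// ASSUMPTION: the body is a JSON array of "host:port" strings.
function parseIpList(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        return []; // invalid JSON or unexpected shape
    }
    // Keep only entries that look like host:port
    return array_values(array_filter($data, function ($entry) {
        return is_string($entry) && preg_match('/^[^:\s]+:\d+$/', $entry) === 1;
    }));
}

// Hypothetical helper: pick a random proxy from the pool, or null if it is empty.
function pickProxy(array $pool): ?string
{
    return $pool ? $pool[array_rand($pool)] : null;
}

// Made-up sample response for demonstration
$sample = '["203.0.113.10:8000", "203.0.113.11:8001", "not-a-proxy"]';
$pool = parseIpList($sample);
$proxy = pickProxy($pool);
echo count($pool), " proxies loaded\n"; // prints "2 proxies loaded"
```

Validating the list before use means a failed or malformed API response yields an empty pool rather than a fatal error later in the crawl.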

IV. First aid kit for common problems

Q: What should I do if I can't connect to the proxy IP?
A: First check the format of your authentication credentials, then use the "connection test" tool provided by ipipgo to diagnose the problem.

Q: The XPath is written correctly but returns no data?
A: Most likely the page uses an iframe. First locate the specific frame with a regular expression, then parse that frame's content.
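As an alternative to a regular expression, iframe sources can also be read with the same DOMXPath tools used earlier. A minimal sketch on made-up markup; findIframeSources is a hypothetical helper name.

```php
<?php
// Hypothetical helper: list the src URLs of all iframes in a page.
function findIframeSources(string $html): array
{
    // Collect parse errors internally so malformed pages do not emit warnings
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//iframe[@src]') as $frame) {
        $sources[] = $frame->getAttribute('src');
    }
    return $sources;
}

// Made-up sample page for demonstration
$page = '<html><body><h1>Shop</h1>'
      . '<iframe src="https://example.com/frame/prices.html"></iframe>'
      . '</body></html>';
$frames = findIframeSources($page);
print_r($frames);
```

Each returned URL can then be fetched and parsed like any other page. Relative src values would first need to be resolved against the page URL.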

Q: Collection suddenly slows down?
A: You may have triggered the site's rate limit. We suggest adding a random wait in the code:


sleep(rand(1, 3)); // randomly sleep for 1-3 seconds

V. Hidden benefits of ipipgo

In addition to the basic proxy service, ipipgo has two other standout features:

1. Intelligent retry system: automatically switches away from invalid IPs
2. Data cleaning interface: automatically filters duplicate content

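One last piece of advice: don't use sleep(0) in your collection code, because website risk controls are not to be trifled with. Only the triple protection of proxy IPs, random delays, and automatic IP switching will keep a collection script running for the long haul.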
