Parsing HTML in PHP: A DOMDocument Tutorial

Essential for PHP web scraping: a step-by-step DOMDocument guide

If you work in data collection, you have probably run into this problem: the target site changes its HTML structure beyond recognition, and a crawler script that worked fine simply stops working. In this article we use PHP's built-in DOMDocument component to take a web page's structure apart cleanly, and pair it with the ipipgo proxy IP service to keep collection stable.

I. Why use a proxy IP for data scraping?

Many sites run access controls that blacklist an IP as soon as it makes frequent requests. A dedicated proxy service such as ipipgo gives the crawler a set of stand-in addresses. For example:


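$proxy = 'tcp://gateway.ipipgo.io:9020'; // the http wrapper expects tcp://host:port
$auth = base64_encode('username:password'); // credentials go in a header, not in the proxy URI
$context = stream_context_create(['http' => [
    'proxy' => $proxy,
    'request_fulluri' => true, // some proxy servers require the full request URI
    'header' => "Proxy-Authorization: Basic $auth",
]]);
$html = file_get_contents('destination URL', false, $context);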

The gateway.ipipgo.io host in this code is ipipgo's intelligent routing entry point, which automatically assigns the most suitable node. In the author's testing with this proxy, the probability of being blocked dropped from 80% to below 5%.

II. The three basic DOMDocument operations

Once we have the page source, we can start taking it apart:


$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
$xpath = new DOMXPath($dom);

// Example: grab all product prices
$prices = $xpath->query('//span[@class="price"]');
foreach ($prices as $node) {
    echo $node->nodeValue . "\n";
}

Watch out for these two pitfalls:

1. Page encoding: convert the source with mb_convert_encoding before loading it
2. Malformed HTML: suppress parsing warnings with the @ operator
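
Both pitfalls can be handled in one small loader. The following is a minimal sketch rather than the article's own method: the function name loadHtmlDocument is hypothetical, it assumes the page's charset is already known, and it requires the mbstring extension. It swaps the @ operator for libxml_use_internal_errors, which collects parse errors instead of printing them.

```php
<?php
// Hypothetical helper: parse HTML whose charset is known, without PHP warnings.
function loadHtmlDocument(string $html, string $sourceCharset = 'UTF-8'): DOMDocument
{
    // Pitfall 1: normalize the source to UTF-8 ...
    $utf8 = mb_convert_encoding($html, 'UTF-8', $sourceCharset);
    // ... then turn every non-ASCII character into a numeric entity, so the
    // parser never has to guess the encoding.
    $ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    // Pitfall 2: collect parse errors internally instead of using @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Usage: simulate a Latin-1 page by converting a UTF-8 literal.
$latin1 = mb_convert_encoding(
    '<html><body><span class="price">Prix réduit: 199</span><p>unclosed</body></html>',
    'ISO-8859-1',
    'UTF-8'
);
$dom = loadHtmlDocument($latin1, 'ISO-8859-1');
$price = (new DOMXPath($dom))->query('//span[@class="price"]')->item(0)->nodeValue;
```

Note that entities are not decoded inside script or style elements, so this approach suits text extraction rather than pulling data out of inline scripts.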

III. In practice: collecting from dynamic websites

When a site loads its data with JavaScript, you can run distributed collection through proxy IPs. For example, a collection cluster might be configured like this:

Node type                  Concurrency    Switching strategy
Domestic residential IP    10 threads     Change IP on every request
Overseas data center IP    5 threads      Change IP every hour

Use the ipipgo API to get the IP pool:


$ip_list = json_decode(file_get_contents('https://api.ipipgo.com/getips?type=http&num=20'));
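
A "change IP on every request" strategy can be sketched as a simple round-robin over the pool. This is only an illustration, not ipipgo's client: the ProxyRotator class is hypothetical, and it assumes the pool has already been reduced to a plain list of host:port strings, because the API's actual response format is not shown in this article. The addresses are placeholders from the 192.0.2.0/24 documentation range.

```php
<?php
// Hypothetical helper: hands out proxies from a pool in round-robin order.
// Assumes the pool is a plain list of "host:port" strings (the real API
// response format is not shown in this article and may differ).
final class ProxyRotator
{
    /** @var string[] */
    private array $pool;
    private int $index = 0;

    public function __construct(array $pool)
    {
        if ($pool === []) {
            throw new InvalidArgumentException('Proxy pool must not be empty.');
        }
        $this->pool = array_values($pool);
    }

    // Return the next proxy, wrapping around at the end of the pool.
    public function next(): string
    {
        $proxy = $this->pool[$this->index];
        $this->index = ($this->index + 1) % count($this->pool);
        return $proxy;
    }

    // Build a stream context that routes one request through the given proxy.
    public function contextFor(string $proxy)
    {
        return stream_context_create([
            'http' => ['proxy' => 'tcp://' . $proxy, 'request_fulluri' => true],
        ]);
    }
}

// Usage with placeholder addresses.
$rotator = new ProxyRotator(['192.0.2.10:8000', '192.0.2.11:8000']);
$first  = $rotator->next();
$second = $rotator->next();
$third  = $rotator->next(); // wraps back to the first entry
```

For the hourly strategy, the same class could be used by calling next() on a timer instead of before every request.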

IV. First aid kit for common problems

Q: What should I do if I can't connect to the proxy IP?
A: First check the format of your authentication credentials, then use the "connection test" tool provided by ipipgo to diagnose the problem.

Q: The XPath is written correctly but returns no data?
A: Most likely the content sits inside an iframe. Locate the specific frame first (for example with a regular expression), then fetch and parse that frame's document.
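
As an alternative to a regular expression, the frame can be located with DOMXPath itself. The sketch below lists the src of every iframe in a page; the function name findIframeSources and the sample markup are made up for illustration. Each returned URL would then be fetched and parsed as its own document.

```php
<?php
// Hypothetical helper: return the src attribute of every <iframe> in the HTML.
function findIframeSources(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $sources = [];
    // Only frames that actually carry a src attribute are of interest.
    foreach ($xpath->query('//iframe[@src]') as $frame) {
        $sources[] = $frame->getAttribute('src');
    }
    return $sources;
}

// Usage with a made-up page: one frame with a src, one without.
$page = '<html><body><h1>Shop</h1>'
      . '<iframe src="/frames/prices.html"></iframe>'
      . '<iframe></iframe></body></html>';
$sources = findIframeSources($page);
```

Relative values such as /frames/prices.html still need to be resolved against the page URL before they can be requested.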

Q: Collection suddenly slows down?
A: You may have triggered the site's rate limit. We suggest adding a random wait to the code:


sleep(rand(1, 3)); // randomly sleep for 1-3 seconds
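
Whole-second sleeps give only three possible delays. A finer-grained variant, sketched below, picks a delay in milliseconds and sleeps with usleep. The function names and the default 1000 to 3000 ms range are illustrative choices, not values recommended by ipipgo.

```php
<?php
// Hypothetical helper: pick a random delay in milliseconds within [min, max].
function randomDelayMs(int $minMs, int $maxMs): int
{
    if ($minMs < 0 || $maxMs < $minMs) {
        throw new InvalidArgumentException('Expected 0 <= min <= max.');
    }
    return random_int($minMs, $maxMs);
}

// Sleep for a random interval and report how long it was; usleep() takes microseconds.
function politePause(int $minMs = 1000, int $maxMs = 3000): int
{
    $delay = randomDelayMs($minMs, $maxMs);
    usleep($delay * 1000);
    return $delay;
}

// Usage: a short range here so the example finishes quickly.
$waited = politePause(10, 30);
```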

V. Hidden benefits of ipipgo

In addition to its basic proxy service, ipipgo has two other standout features:

1. Intelligent retry system: automatically switches away from invalid IPs
2. Data cleaning interface: automatically filters out duplicate content

One final piece of advice: don't use sleep(0) in collection code, because a site's anti-bot controls should not be underestimated. Combine proxy IPs, random delays, and automatic IP switching as triple protection so that the collection script keeps running over the long term.
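
The three layers can be combined in a single retry loop. The sketch below is one possible arrangement, not ipipgo's retry system: fetchWithRetry is a hypothetical function, the fetcher is passed in as a callable so any HTTP client can be plugged in, and the attempt count and delay range are arbitrary defaults.

```php
<?php
// Hypothetical sketch: try each proxy in turn, pausing randomly between attempts.
// $fetch is any callable(string $url, string $proxy): ?string that returns the
// response body, or null when the request failed or was blocked.
function fetchWithRetry(
    string $url,
    array $proxies,
    callable $fetch,
    int $maxAttempts = 3,
    int $minDelayMs = 1000,
    int $maxDelayMs = 3000
): ?string {
    if ($proxies === []) {
        throw new InvalidArgumentException('At least one proxy is required.');
    }
    $proxies = array_values($proxies);
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        // Automatic switching: a different proxy on every attempt.
        $proxy = $proxies[$attempt % count($proxies)];
        $body = $fetch($url, $proxy);
        if ($body !== null) {
            return $body;
        }
        // Random delay before the next attempt.
        if ($attempt < $maxAttempts - 1) {
            usleep(random_int($minDelayMs, $maxDelayMs) * 1000);
        }
    }
    return null; // every attempt failed
}

// Usage with a stand-in fetcher that fails on the first proxy only.
$tried = [];
$fakeFetch = function (string $url, string $proxy) use (&$tried): ?string {
    $tried[] = $proxy;
    return $proxy === '192.0.2.10:8000' ? null : '<html>ok</html>';
};
$result = fetchWithRetry(
    'https://example.com/',
    ['192.0.2.10:8000', '192.0.2.11:8000'],
    $fakeFetch,
    3,
    10,
    20
);
```

Passing the fetcher as a callable keeps the loop testable without network access; in real use it would wrap a proxied file_get_contents or cURL call.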

This article was originally published or compiled by ipipgo: https://www.ipipgo.com/en-us/ipdaili/36343.html
