
PHP Web Scraping Essentials: A Hand-Holding DOMDocument Tutorial
Anyone who has done data collection for a while has run into this problem: the target site changes its HTML structure beyond recognition, and a carefully written crawler script simply stops working. Today we will use PHP's built-in DOMDocument component to show how to take a page's structure apart cleanly, and pair it with the ipipgo proxy IP service to keep collection rock steady.
I. Why use a proxy IP for data scraping?
Many sites run an "access control system": frequent visits from the same IP get blacklisted quickly. This is where a dedicated proxy service such as ipipgo comes in; it is like preparing a set of "stand-ins" for the crawler. Here is an example:
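// Stream contexts expect a tcp:// proxy URI; credentials go in a header
$auth = base64_encode('username:password');
$context = stream_context_create([
    'http' => [
        'proxy' => 'tcp://gateway.ipipgo.io:9020',
        'request_fulluri' => true, // send the full URL, as proxies expect
        'header' => "Proxy-Authorization: Basic $auth",
    ],
]);
$html = file_get_contents('destination URL', false, $context);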
The gateway.ipipgo.io in this code is ipipgo's intelligent routing gateway, which automatically assigns the most suitable node. In the author's tests with this proxy, the block rate dropped from 80% to under 5%.
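For anything beyond a quick test, PHP's cURL extension gives finer control over proxies than file_get_contents. The sketch below is one possible wrapper, not ipipgo's official client: the function names buildProxyCurlOptions and fetchViaProxy are hypothetical, the timeout values are arbitrary assumptions, and it requires the curl extension.

```php
<?php
// Hypothetical helper: builds cURL options for a proxied GET request.
function buildProxyCurlOptions(string $url, string $proxyHost, int $proxyPort,
                               string $user, string $pass): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_PROXY          => $proxyHost,
        CURLOPT_PROXYPORT      => $proxyPort,
        CURLOPT_PROXYUSERPWD   => $user . ':' . $pass,
        CURLOPT_CONNECTTIMEOUT => 10,     // assumed values, tune as needed
        CURLOPT_TIMEOUT        => 30,
    ];
}

// Hypothetical helper: returns the page body, or null on any failure.
function fetchViaProxy(array $options): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    $body = curl_exec($ch);
    // Treat transport errors and HTTP 4xx/5xx responses as failures.
    $ok = $body !== false && curl_getinfo($ch, CURLINFO_RESPONSE_CODE) < 400;
    return $ok ? $body : null;
}
```

A call would then look like fetchViaProxy(buildProxyCurlOptions('https://example.com/', 'gateway.ipipgo.io', 9020, 'username', 'password')). Returning null on failure makes it easy for the caller to retry through a different proxy.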
II. DOMDocument basics: the three core moves
Once we have the page source, we can start taking it apart:
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
$xpath = new DOMXPath($dom);
// Example: grab all product prices
$prices = $xpath->query('//span[@class="price"]');
foreach ($prices as $node) {
    echo trim($node->nodeValue) . PHP_EOL;
}
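Watch out for these two pitfalls:
1. Encoding: loadHTML assumes ISO-8859-1 when the page declares no charset, so convert the source with mb_convert_encoding before parsing or non-Latin text comes out garbled
2. Malformed markup: real-world HTML is rarely valid, so silence parser warnings with the @ sign (or call libxml_use_internal_errors(true) beforehand)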
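Both pitfalls can be handled in one small helper. This is a minimal sketch under stated assumptions: loadHtmlUtf8 and extractTexts are hypothetical names, the input is assumed to be UTF-8 already (convert it first with mb_convert_encoding if the site uses another charset), the mbstring extension is assumed to be available, and libxml_use_internal_errors stands in for the @ operator.

```php
<?php
// Hypothetical helper: parse UTF-8 HTML without mangling non-ASCII text.
function loadHtmlUtf8(string $html): DOMDocument
{
    // Turn every non-ASCII character into a numeric entity so the parser
    // does not have to guess the charset.
    $safe = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $previous = libxml_use_internal_errors(true); // collect parse errors quietly
    $dom = new DOMDocument();
    $dom->loadHTML($safe);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting
    return $dom;
}

// Hypothetical helper: trimmed text of every node matching an XPath query.
function extractTexts(DOMDocument $dom, string $query): array
{
    $texts = [];
    foreach ((new DOMXPath($dom))->query($query) as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

// Deliberately sloppy sample markup with a non-ASCII currency sign.
$sample = '<div><span class="price"> ¥199 </span><span class="price">¥20</span><p>unclosed';
$prices = extractTexts(loadHtmlUtf8($sample), '//span[@class="price"]');
```

Restoring the previous libxml error setting keeps the helper from changing behavior elsewhere in a larger script.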
III. In practice: a collection plan for dynamic websites
When you run into a site that loads its data with JS, you can spread collection across proxy IPs. For example, a collection cluster might be configured like this:
| Node type | Concurrency | Switching strategy |
|---|---|---|
| Domestic residential IP | 10 threads | Change IP per request |
| Overseas data center IP | 5 threads | Change IP hourly |
Use the ipipgo API to get the IP pool:
$ip_list = json_decode(file_get_contents('https://api.ipipgo.com/getips?type=http&num=20'));
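The response should be validated before it is used, because a failed request or an error payload would otherwise flow straight into the crawler. The sketch below ASSUMES the API returns a JSON array of "host:port" strings; the real response format is not shown here and may differ, so treat parseProxyList as a hypothetical helper to adapt.

```php
<?php
// ASSUMPTION: the API returns a JSON array of "host:port" strings.
// The real response format may differ; adapt the parsing accordingly.
function parseProxyList(string $json): array
{
    $decoded = json_decode($json, true);
    if (!is_array($decoded)) {
        return []; // invalid JSON or unexpected shape
    }
    $proxies = [];
    foreach ($decoded as $entry) {
        // Keep only entries that look like host:port.
        if (is_string($entry) && preg_match('/^[\w.-]+:\d{1,5}$/', $entry)) {
            $proxies[] = $entry;
        }
    }
    return $proxies;
}

// Made-up sample payload, including two entries that should be rejected.
$sample = '["10.0.0.1:8000", "10.0.0.2:8001", 42, "not a proxy"]';
$pool = parseProxyList($sample);

// Pick one proxy at random, or null when the pool is empty.
$proxy = $pool ? $pool[array_rand($pool)] : null;
```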
IV. First-aid kit for common problems
Q: What should I do if I can't connect to the proxy IP?
A: First check the format of the authentication credentials, then use the "connection test" tool provided by ipipgo to diagnose the problem.
Q: The XPath is written correctly but no data comes back?
A: Eight times out of ten the content sits inside an iframe. First locate the specific frame (for example with a regular expression), then fetch and parse that frame's own document.
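A regular expression works, but since the page is being parsed anyway, the frame URLs can also be read with XPath. A minimal sketch, assuming the frames carry a plain src attribute; findIframeSources is a hypothetical name, and resolving relative URLs is left out.

```php
<?php
// Hypothetical helper: list the src attribute of every iframe in a page.
function findIframeSources(string $html): array
{
    $previous = libxml_use_internal_errors(true); // silence malformed-HTML warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    // Only iframes that actually have a src attribute are of interest.
    foreach ((new DOMXPath($dom))->query('//iframe[@src]') as $frame) {
        $sources[] = $frame->getAttribute('src');
    }
    return $sources;
}

$page = '<html><body><h1>Shop</h1>'
      . '<iframe src="https://example.com/list.html"></iframe>'
      . '<iframe></iframe></body></html>';
$frames = findIframeSources($page);
```

Each URL in $frames can then be downloaded and run through the same XPath query that failed on the outer page.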
Q: Collection suddenly slows down?
A: You may have triggered the site's rate limit. Add a random wait between requests:
sleep(rand(1, 3)); // randomly sleep for 1-3 seconds
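If a fixed 1 to 3 second pause is not enough, one option is to lengthen the pause each time the site pushes back. The sketch below shows exponential backoff with jitter; the function names, the base delay, and the cap are all hypothetical choices, not values taken from ipipgo or any library.

```php
<?php
// Hypothetical helper: delay in milliseconds for the given retry attempt.
// Doubles the base delay per attempt, caps it, then adds random jitter.
function backoffDelayMs(int $attempt, int $baseMs = 1000, int $capMs = 30000): int
{
    // Clamp the exponent so the multiplication cannot overflow.
    $exp = min($capMs, $baseMs * (2 ** min($attempt, 20)));
    $jitter = random_int(0, intdiv($exp, 2)); // up to 50% extra
    return $exp + $jitter;
}

// Hypothetical helper: pause before retry number $attempt (0 = first retry).
function politePause(int $attempt): void
{
    usleep(backoffDelayMs($attempt) * 1000); // usleep takes microseconds
}
```

The jitter keeps several workers from retrying in lockstep, which would otherwise look like a burst to the target site.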
V. Hidden benefits of ipipgo
Beyond the basic proxy service, ipipgo has two other standout features:
1. Intelligent retry system: invalid IPs are switched out automatically
2. Data cleaning interface: duplicate content is filtered automatically
Finally, a piece of advice: don't use sleep(0) in collection code; a site's risk-control system is no pushover. Combine proxy IPs, random delays, and automatic switching as triple protection to give your collection script a long life.
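Put together, the three layers might look like the loop below. This is only a sketch: fetchWithRotation and its parameters are hypothetical, the fetcher is passed in as a callable so any HTTP client can be plugged in, and the pause is injectable so the logic can be exercised without real waiting.

```php
<?php
// Hypothetical helper: try each proxy in turn until one returns a page.
// $fetch(string $url, string $proxy): ?string  returns null on failure.
// $pause(): void  runs between attempts (random 1-3 s wait by default).
function fetchWithRotation(string $url, array $proxies, callable $fetch,
                           ?callable $pause = null): ?string
{
    $pause = $pause ?? function (): void {
        usleep(random_int(1000000, 3000000)); // random 1-3 second wait
    };
    shuffle($proxies); // avoid always hitting the same proxy first
    foreach ($proxies as $i => $proxy) {
        if ($i > 0) {
            $pause(); // delay before switching to the next proxy
        }
        $html = $fetch($url, $proxy);
        if ($html !== null) {
            return $html;
        }
    }
    return null; // every proxy failed
}

// Demo with a fake fetcher: only the made-up proxy "b:2" works.
$calls = [];
$fake = function (string $url, string $proxy) use (&$calls): ?string {
    $calls[] = $proxy;
    return $proxy === 'b:2' ? "<html>ok via $proxy</html>" : null;
};
$noWait = function (): void {}; // skip real sleeping in the demo
$result = fetchWithRotation('https://example.com/', ['a:1', 'b:2', 'c:3'], $fake, $noWait);
```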

