
What to do when your PHP crawler gets hit by anti-crawling measures? Try this trick
Anyone who has done web crawling for a while knows that a target site's anti-crawling mechanisms stick like taffy you can't shake off. 403 and 429 errors show up every day, and getting an IP blocked is routine. This is when proxy IPs become your lifesaver, especially if you crawl with PHP: they let you slip past site monitoring by becoming a "man of a thousand faces".
How do you use proxy IPs to counter anti-crawling?
Websites mainly look at three things to recognize a crawler: request frequency, behavioral characteristics, and IP trajectory. Hammering a site from a single IP is like walking through a supermarket 100 times in a row without checking out; who else is the security guard going to watch? This is where proxy IPs shine:
| Anti-crawling tactic | Proxy IP countermeasure |
|---|---|
| Per-IP rate limiting | Automatically switch between different exit IPs |
| User behavior analysis | Simulate different device fingerprints |
| IP blacklisting | Rotate through a large IP pool |
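The "switch exit IPs" and "rotate through a pool" rows can be sketched as a small round-robin pool. This is a minimal illustration, not ipipgo's client: the `ProxyPool` class and the proxy addresses are hypothetical, and typed properties assume PHP 7.4 or later.

```php
<?php
// Hypothetical round-robin proxy pool (illustrative only).
final class ProxyPool
{
    private array $proxies;
    private int $cursor = 0;

    public function __construct(array $proxies)
    {
        if ($proxies === []) {
            throw new InvalidArgumentException('Proxy pool must not be empty');
        }
        $this->proxies = array_values($proxies);
    }

    // Return the next proxy, wrapping around at the end of the list.
    public function next(): string
    {
        if ($this->proxies === []) {
            throw new RuntimeException('No proxies left in the pool');
        }
        $proxy = $this->proxies[$this->cursor];
        $this->cursor = ($this->cursor + 1) % count($this->proxies);
        return $proxy;
    }

    // Drop a proxy (for example one that was blacklisted) without
    // disturbing the rotation order of the remaining entries.
    public function remove(string $proxy): void
    {
        $index = array_search($proxy, $this->proxies, true);
        if ($index === false) {
            return;
        }
        array_splice($this->proxies, $index, 1);
        if ($index < $this->cursor) {
            $this->cursor--;
        }
        if ($this->cursor >= count($this->proxies)) {
            $this->cursor = 0;
        }
    }
}
```

Call `next()` before each request and `remove()` when a proxy starts returning 403 or 429, so blocked IPs drop out of the rotation.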
Hands-on PHP proxy configuration, step by step
Here's an example using the ipipgo proxy service, which provides an API that returns a fresh proxy directly. Start with the basic code:
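```php
// Get a proxy (using ipipgo's API as the example here)
$proxy = json_decode(file_get_contents('https://api.ipipgo.com/getproxy'));
if (!$proxy) {
    exit("Failed to fetch a proxy from the API\n");
}

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "destination URL"); // replace with the page you want to fetch
curl_setopt($ch, CURLOPT_PROXY, $proxy->ip . ':' . $proxy->port);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy->username . ':' . $proxy->password);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$result = curl_exec($ch); // false on failure; check curl_error($ch)
```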
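Because the API call can fail or return something unexpected, it may help to validate the response before handing it to cURL. The sketch below assumes the response is a JSON object with the `ip`, `port`, `username`, and `password` fields used in the snippet above; that shape is taken from the example code, not from any API documentation, and `parseProxyResponse()` is a hypothetical helper.

```php
<?php
// Hypothetical helper: validate the proxy API response and build the
// strings expected by CURLOPT_PROXY and CURLOPT_PROXYUSERPWD.
// Assumes the JSON shape used in the example above.
function parseProxyResponse(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        throw new RuntimeException('Proxy API returned invalid JSON');
    }
    foreach (['ip', 'port', 'username', 'password'] as $field) {
        if (!isset($data[$field]) || $data[$field] === '') {
            throw new RuntimeException("Proxy API response is missing '$field'");
        }
    }
    return [
        'address' => $data['ip'] . ':' . $data['port'],
        'auth'    => $data['username'] . ':' . $data['password'],
    ];
}
```

The returned `address` and `auth` values can then be passed to `CURLOPT_PROXY` and `CURLOPT_PROXYUSERPWD` respectively.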
Here's the key point: keep the timeout tight (3-5 seconds is recommended) so that a lagging proxy is dropped quickly and you can switch to the next IP right away. Adding a random delay between requests also makes the traffic look more realistic:
```php
// Randomly wait 1-3 seconds (usleep takes microseconds)
usleep(rand(1000000, 3000000));
```
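The timeout-and-switch advice could look roughly like the following. This is a sketch under assumptions: `fetchWithFailover()` and its parameters are illustrative names, the 5-second default comes from the 3-5 second recommendation above, and the optional `$attempt` callable exists only so the loop can be exercised without a network.

```php
<?php
// Sketch: short timeouts plus failover to the next proxy.
// fetchWithFailover() is a hypothetical helper, not part of any library.
function fetchWithFailover(string $url, array $proxies, int $timeoutSeconds = 5, ?callable $attempt = null)
{
    // Default attempt: one cURL request through the given proxy.
    $attempt = $attempt ?? function (string $url, string $proxy) use ($timeoutSeconds) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_PROXY          => $proxy,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // max seconds to establish the connection
            CURLOPT_TIMEOUT        => $timeoutSeconds, // max seconds for the whole transfer
        ]);
        return curl_exec($ch); // string on success, false on failure or timeout
    };

    foreach ($proxies as $proxy) {
        $body = $attempt($url, $proxy);
        if ($body !== false) {
            return $body;
        }
        // This proxy failed or timed out: fall through to the next one.
    }
    return false; // every proxy failed
}
```

A call such as `fetchWithFailover($url, $proxyList, 3)` returns the first successful body, or `false` if every proxy in the list fails.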
Advanced camouflage techniques, the full package
Changing the IP alone isn't enough; you have to put on the whole act:
- User-Agent rotation: don't rely on cURL's default UA; prepare a few dozen common browser UAs and pick one at random
- Include a Referer in the request headers so the request looks like a jump from within the site
- Keep the login session with a cookie jar instead of starting with fresh cookies on every request
Here's an example with disguised headers:
```php
$headers = [
    'Accept: text/html,application/xhtml+xml',
    'Accept-Language: zh-CN,zh;q=0.9',
    'Referer: https://target-site.com/', // placeholder: use the target site's own URL
];
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
```
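The User-Agent rotation and cookie jar points from the list above can be sketched with standard cURL options. The UA strings are illustrative samples (a real list would be longer and kept current), and `pickUserAgent()` and `buildDisguiseOptions()` are hypothetical helper names.

```php
<?php
// Illustrative sample; in practice prepare a few dozen current browser UAs.
const USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
];

// Pick one User-Agent at random from the given list.
function pickUserAgent(array $agents): string
{
    return $agents[array_rand($agents)];
}

// Build cURL options that rotate the UA, set a Referer, and persist cookies.
function buildDisguiseOptions(string $referer, string $cookieFile): array
{
    return [
        CURLOPT_USERAGENT      => pickUserAgent(USER_AGENTS),
        CURLOPT_REFERER        => $referer,
        CURLOPT_COOKIEFILE     => $cookieFile, // read previously saved cookies from this file
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies back when the handle is closed
        CURLOPT_RETURNTRANSFER => true,
    ];
}
```

Apply the result with `curl_setopt_array($ch, buildDisguiseOptions('https://target-site.com/', '/tmp/cookies.txt'))`; both the URL and the file path here are placeholders.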
Common failure scenarios: Q&A
Q: Why does my proxy IP get blocked after only a few uses?
A: Choose a high-anonymity proxy (ipipgo's mixed dial-up nodes are recommended); ordinary anonymous proxies expose the X-Forwarded-For header.
Q: Why is my crawl as slow as a snail?
A: Check the proxy response time. ipipgo's nodes average under 200 ms, much faster than self-built proxies.
Q: How do I choose a proxy service provider?
A: Focus on three things: IP pool size (ipipgo has 2 million+), protocol support (it should support SOCKS5), and API stability (including a failure-retry mechanism).
A guide to avoiding the pitfalls
A few final hard-won lessons:
- Don't hard-code proxy IPs in your code; fetch them dynamically from the API.
- For HTTPS sites, use a tunneling proxy; an ordinary proxy will throw SSL errors.
- For asynchronous requests, bind a different proxy to each request; don't let multiple concurrent requests share one IP.
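The last two lessons can be combined in one sketch using `curl_multi`. It assumes "tunnel proxy" means HTTP CONNECT tunneling, which cURL can force with `CURLOPT_HTTPPROXYTUNNEL`; `assignProxies()` and `fetchConcurrently()` are hypothetical helpers, and the 5-second timeout is an arbitrary choice.

```php
<?php
// Pair each URL with its own proxy so no two requests share an exit IP.
function assignProxies(array $urls, array $proxies): array
{
    $urls = array_values($urls);
    $proxies = array_values($proxies);
    if (count($proxies) < count($urls)) {
        throw new InvalidArgumentException('Need at least one proxy per URL');
    }
    $pairs = [];
    foreach ($urls as $i => $url) {
        $pairs[] = ['url' => $url, 'proxy' => $proxies[$i]];
    }
    return $pairs;
}

// Fetch all URLs concurrently, each through its own tunneled proxy.
function fetchConcurrently(array $urls, array $proxies): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach (assignProxies($urls, $proxies) as $i => $pair) {
        $ch = curl_init($pair['url']);
        curl_setopt_array($ch, [
            CURLOPT_PROXY           => $pair['proxy'],
            CURLOPT_HTTPPROXYTUNNEL => true, // tunnel through the proxy with HTTP CONNECT
            CURLOPT_RETURNTRANSFER  => true,
            CURLOPT_TIMEOUT         => 5,
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$i] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running) {
            curl_multi_select($multi, 1.0); // wait up to 1 second for activity
        }
    } while ($running && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $i => $ch) {
        $results[$i] = curl_multi_getcontent($ch); // response body collected for this handle
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);
    return $results;
}
```

The results come back in the same order as the input URLs, so a failed slot can be retried with a fresh proxy.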
Combine these tips with ipipgo's reliable proxy service and you can handle roughly 90% of anti-crawling mechanisms. Remember that website defenses keep being upgraded too, so adjust your crawling strategy regularly to keep pace.

