
When the Crawler Meets Anti-Scraping: What to Do When an Ordinary IP Gets Blocked?
Veteran crawler developers know the drill: anti-scraping mechanisms are getting more and more ruthless. Last week I crawled some data with my home broadband IP. It went smoothly at first, but the very next day the IP landed on a blacklist. That's when you bring out the savior: a dynamic proxy IP pool. It works like playing a game on alt accounts: each request gets a fresh identity, so the anti-scraping system can't spot a pattern.
Obligatory plug (requested by the boss): ipipgo's **short-lived proxy pool** is tested and working. It rotates the IP automatically every 5 minutes and supports all three of the http/https/socks5 protocols. The key point is its **200+ data-center nodes in cities nationwide**, so you can pose as a user from anywhere. Below is a hands-on walkthrough with NodeJS + Puppeteer.
Puppeteer Basic Configuration: Dodging the Pitfalls
Start by installing puppeteer-extra and its stealth plugin instead of using the bare puppeteer library. One pitfall here: Chromium exposes headless traits by default, so you have to add a few launch arguments to mask them:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

async function launchBrowser() {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: [
      '--disable-web-security',
      // Chromium ignores credentials embedded in --proxy-server,
      // so pass only host:port here and authenticate per page (see below)
      '--proxy-server=http://proxy.ipipgo.com:9020',
      '--lang=zh-cn',
      '--disable-blink-features=AutomationControlled'
    ]
  });
  return browser;
}
Note the **proxy-server parameter format**: replace the host and port with the ones from your own ipipgo account. Because Chromium drops credentials embedded in the URL, the username and password have to go through page.authenticate() instead. One more handy trick: hanging the proxy directly in args is more stable than configuring it per page.
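A minimal sketch of that per-page authentication step; the newAuthedPage helper name and the credential placeholders are mine, not part of any ipipgo SDK:

async function newAuthedPage(browser) {
  const page = await browser.newPage();
  // Chromium strips user:pass from --proxy-server, so supply them here
  await page.authenticate({ username: 'your-username', password: 'your-password' });
  return page;
}

Call this instead of browser.newPage() everywhere, and every tab goes out through the authenticated proxy.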
IP Rotation Strategy: The Life-and-Death Question
Just hanging a proxy on the browser isn't enough; you have to learn **intelligent IP switching**. It is recommended to set up several layers of insurance:
| Trigger condition | Response strategy |
|---|---|
| 3 consecutive failed requests | Switch to a new IP immediately |
| Single IP in use for over 10 minutes | Proactively release the connection |
| CAPTCHA block encountered | Switch to a node in another city |
Real-world code snippet:
let retryCount = 0;

async function safeVisit(page, url) {
  try {
    await page.goto(url, { timeout: 60000 });
    retryCount = 0; // a successful visit resets the failure counter
  } catch (e) {
    if (++retryCount >= 3) { // trigger condition from the table above
      await rotateProxy(); // call ipipgo's API to change IPs
      retryCount = 0;
    }
  }
}
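rotateProxy() is not defined above. Here is a minimal sketch of what it might look like, assuming ipipgo exposes an HTTP endpoint that releases the current exit IP; the URL, query parameter, and the 2-second settle delay are all placeholders, so check the real API in your ipipgo dashboard:

// Hypothetical rotation helper: endpoint and response format may differ
async function rotateProxy() {
  const res = await fetch('https://api.ipipgo.com/rotate?key=YOUR_API_KEY');
  if (!res.ok) throw new Error(`rotate failed: HTTP ${res.status}`);
  // Short-lived pools usually need a moment before the new exit IP goes live
  await new Promise(resolve => setTimeout(resolve, 2000));
}

This uses the global fetch available in Node 18+; on older versions swap in a library such as axios.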
Hands-on: An E-commerce Price Monitoring Script
Take an e-commerce platform (no names) where you need to capture product prices. Here's an **anti-anti-scraping technique**: use proxy IPs to fetch the product list pages, then use your real IP for the detail pages. The reason: risk control on list pages is strict, while detail pages are relatively relaxed. A sketch of this split routing follows below.
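A minimal sketch of the split routing, reusing launchBrowser() and newAuthedPage() from above; the selectors (a.product, .price) and the delay range are illustrative assumptions about the target site:

async function monitorPrices(listUrl) {
  const proxied = await launchBrowser();                      // proxy for the strict list pages
  const direct = await puppeteer.launch({ headless: 'new' }); // real IP for the lax detail pages

  const listPage = await newAuthedPage(proxied);
  await listPage.goto(listUrl, { timeout: 60000 });
  // Placeholder selector: adapt to the target site's markup
  const links = await listPage.$$eval('a.product', as => as.map(a => a.href));

  const detailPage = await direct.newPage();
  for (const link of links) {
    await detailPage.goto(link, { timeout: 60000 });
    const price = await detailPage.$eval('.price', el => el.textContent.trim());
    console.log(link, price);
    // Random pause so the access rhythm doesn't look mechanical
    await new Promise(r => setTimeout(r, 3000 + Math.random() * 4000));
  }

  await proxied.close();
  await direct.close();
}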
ipipgo's **pay-as-you-go package** is the best value for this setup; spend the bulk (around 80%) of your proxy traffic on the strictly risk-controlled pages. Remember to also turn on their **intelligent routing** feature, which automatically picks the lowest-latency node.
Q&A Session
Q: What should I do if my proxy IP often times out?
A: 80% of the time the culprit is a public proxy pool. Switch to ipipgo's **dedicated bandwidth lines** and enable TCP long-connection reuse in the backend; that can cut the timeout rate by about 60%.
Q: How do I get past human verification when I hit it?
A: Don't brute-force it. Switch IPs immediately and modify your browser fingerprint at the same time. With ipipgo's **multi-protocol support** you can mix socks5 and http proxies to diversify the disguise.
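For the protocol mixing, switching the exit to socks5 is just a different --proxy-server scheme; a quick sketch, with port 9030 as a placeholder for whatever your socks5 endpoint actually uses:

// Same launch as before, but through a socks5 exit instead of http
const socksBrowser = await puppeteer.launch({
  headless: 'new',
  args: ['--proxy-server=socks5://proxy.ipipgo.com:9030']
});

Note that Chromium does not support username/password authentication for socks5 at all, so socks5 endpoints typically rely on IP-whitelist authentication rather than page.authenticate().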
Q: What if I need high concurrency?
A: Use their **port aggregation** feature; a single account can hold 500+ simultaneous connections. Remember to do the distributed scheduling with puppeteer-cluster rather than spawning browsers by hand, or you'll blow up the Node.js process. A sketch follows below.
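A minimal puppeteer-cluster sketch; maxConcurrency, the credentials, and the queued URLs are illustrative, so size the pool to your plan's actual connection ceiling:

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // isolated contexts in one shared browser
    maxConcurrency: 10,                       // keep well below your plan's limit
    puppeteerOptions: {
      headless: 'new',
      args: ['--proxy-server=http://proxy.ipipgo.com:9020']
    }
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.authenticate({ username: 'your-username', password: 'your-password' });
    await page.goto(url, { timeout: 60000 });
    console.log(url, await page.title());
  });

  ['https://example.com/a', 'https://example.com/b'].forEach(u => cluster.queue(u));

  await cluster.idle();
  await cluster.close();
})();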
One final reminder: many websites now run **IP behavior analysis**, so swapping IPs alone isn't enough; you also have to control the visit frequency. Pair ipipgo's **request interval policy** with random delays in the browser, and the data will keep flowing for the long haul.
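A tiny random-delay helper along those lines; the 2-6 second range is an assumption to tune per site:

// Sleep a random interval so the access pattern has no fixed rhythm
function randomDelay(minMs = 2000, maxMs = 6000) {
  return new Promise(resolve => setTimeout(resolve, minMs + Math.random() * (maxMs - minMs)));
}

// Usage: await randomDelay() between consecutive page.goto() calls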

