
Getting hit by anti-crawling measures? Try this proxy IP trick
Recently a lot of Node.js crawler folks have been complaining that anti-scraping is getting more and more ruthless. Just the other day a friend told me his crawler ran for less than half an hour before his IP was blocked dead. I know the feeling all too well: last year, doing e-commerce data collection, we had to change IPs every two or three days, until we discovered that proxy IPs are the real deal.
How exactly does a proxy IP help you?
In a nutshell: it's an invisibility cloak for your crawler. Say you want to collect a product's price from a website:
```javascript
const axios = require('axios');

// Normal request (blocked in minutes)
async function normalRequest() {
  try {
    const response = await axios.get('Target website URL');
    console.log(response.data);
  } catch (error) {
    console.log('Damn, IP is blocked!');
  }
}
```
After switching to a proxy IP:
```javascript
// Proxy request (ipipgo's API is recommended)
const proxyConfig = {
  host: 'ipipgo dynamic residential proxy IP',
  port: 12345, // replace with your assigned port number
  auth: {
    username: 'your account',
    password: 'your password'
  }
};
```
```javascript
async function proxyRequest() {
  try {
    const response = await axios.get('Target site URL', {
      proxy: proxyConfig,
      timeout: 5000 // It's important to set a timeout
    });
    console.log('Data arrived!');
  } catch (error) {
    console.log('Change the IP and keep going');
  }
}
```
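The catch branch above says "change the IP and keep going" but doesn't show how. Here's a minimal sketch of a retry wrapper that pulls a fresh proxy on each failure. Note that `getProxy` and `doRequest` are hypothetical hooks, not part of any SDK; you'd wire them to ipipgo's API and axios yourself:

```javascript
// Retry a request, switching to a fresh proxy on every attempt.
// getProxy: async () => proxy object (e.g. from ipipgo's extraction API)
// doRequest: async (proxy) => response data (e.g. an axios.get with that proxy)
async function requestWithRotation(doRequest, getProxy, maxTries = 3) {
  let lastError;
  for (let i = 0; i < maxTries; i++) {
    const proxy = await getProxy(); // fresh IP each round
    try {
      return await doRequest(proxy);
    } catch (error) {
      lastError = error; // this IP is probably blocked; loop to the next one
    }
  }
  throw lastError; // every IP failed; surface the last error
}
```

The point of the design is that IP rotation lives in one place instead of being scattered through every crawl function.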
How the code looks in practice
The API extraction method from ipipgo is recommended; it's ten times more convenient than a traditional proxy pool:
```javascript
const { IpProxy } = require('ipipgo-sdk'); // official SDK
const puppeteer = require('puppeteer');

async function smartCrawler() {
  // Get a proxy IP dynamically (this is the key part!)
  const proxy = await IpProxy.getDynamicResidential({
    country: 'us',
    protocol: 'https'
  });

  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxy.ip}:${proxy.port}`]
  });

  const page = await browser.newPage();
  // Remember to set the page timeout
  await page.goto('Target URL', { timeout: 60000 });

  // Move the mouse around (to simulate a real user's behavior)
  await page.mouse.move(100, 100);
  await page.waitForTimeout(2000);

  const data = await page.evaluate(() => {
    return document.querySelector('.price').innerText;
  });

  await browser.close();
  return data;
}
```
Watch out when running concurrently
Use this pattern when you need multiple crawlers running at the same time:
```javascript
const { Worker } = require('worker_threads');

function createWorker(proxy) {
  return new Promise((resolve) => {
    const worker = new Worker('./crawler.js', {
      workerData: { proxy }
    });
    worker.on('message', resolve);
    worker.on('error', () => {
      console.log(`${proxy.ip} hung, moving on to the next one`);
      resolve(null); // settle the promise so Promise.all doesn't stall
    });
  });
}

// Batch-create proxy instances
const proxyList = await IpProxy.batchGet(10); // grab 10 IPs at a time
const results = await Promise.all(proxyList.map(createWorker));
```
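One caveat: `Promise.all` fires all 10 workers at the same instant, which can hammer a fragile target. A tiny concurrency limiter keeps things polite. This is a generic sketch, not part of any SDK; each entry in `tasks` is a zero-argument async function (e.g. one crawl per proxy):

```javascript
// Run at most `limit` tasks at a time instead of all of them at once.
async function runLimited(tasks, limit = 3) {
  const results = new Array(tasks.length);
  let next = 0; // index of the next unclaimed task

  async function worker() {
    while (next < tasks.length) {
      const i = next++; // claim a task (safe: no await between check and claim)
      results[i] = await tasks[i]().catch(() => null); // a failed crawl yields null
    }
  }

  // Spin up `limit` workers that drain the shared queue.
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, worker)
  );
  return results;
}
```

With this in place you could wrap each `createWorker(proxy)` call in a task function and cap the burst at, say, 3 concurrent crawls.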
Common pitfalls Q&A
Q: Why use a residential proxy?
A: Data-center IPs have long been blacklisted by the big websites, while a residential IP looks like a real user. ipipgo's dynamic residential proxies run on real home broadband; I've personally had them running steadily on a certain "East" and a certain "Treasure".
Q: Which proxy IP billing plan is cost-effective?
A: Pick a package based on your business scenario. Price list for reference:
| Package Type | Suitable Scenarios | Price |
|---|---|---|
| Dynamic residential (Standard) | Routine data collection | 7.67 Yuan/GB/month |
| Dynamic residential (Business) | High-frequency access | 9.47 Yuan/GB/month |
| Static residential | Scenarios needing a fixed IP | 35 Yuan/IP/month |
Q: How do I prevent my accounts from being linked?
A: Three steps: ① use an IP from a different country for each request ② clear browser fingerprints ③ use ipipgo's TK line for account isolation.
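For step ①, here's a minimal sketch of rotating the country per request. The country list is purely illustrative; use whatever regions your ipipgo package actually covers:

```javascript
// Example country pool (an assumption; substitute your plan's regions).
const COUNTRIES = ['us', 'de', 'jp', 'gb', 'sg'];

// Pick a country for the next request, never repeating the previous one.
function pickCountry(lastCountry) {
  const pool = COUNTRIES.filter((c) => c !== lastCountry);
  return pool[Math.floor(Math.random() * pool.length)];
}
```

You'd then feed the result into something like `IpProxy.getDynamicResidential({ country: pickCountry(prev) })`.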
Why ipipgo?
I've used seven or eight proxy providers and settled on ipipgo long-term for three reasons: ① their SERP API can scrape Google data directly (with others you have to hack it together yourself) ② I pinged their support at three in the morning and actually got a reply in seconds ③ they support the SOCKS5 protocol, which is handy for hand-rolled scripts. Recently I also found they can customize hourly billing plans, which is especially friendly to short-term projects.
One last nagging word: proxy IPs are great, but don't crawl other people's websites to death. I've seen someone open 100 threads to crawl a site and take its server down; let's not do that kind of thing.

