NodeJS Crawling: Puppeteer Headless Browser in Action

When the crawler meets anti-scraping: what to do when an ordinary IP gets rate-limited?

Experienced crawler developers know that websites' anti-scraping mechanisms keep getting more aggressive. Last week I crawled data from my home broadband IP; it went smoothly at first, but by the next day the IP had been blacklisted. This is when we bring out our savior: a dynamic proxy IP pool. It works like playing a game on alternate accounts: each request carries a new identity, so the anti-scraping system cannot find a pattern.

A quick plug (at the boss's request): ipipgo's short-lived proxy pool has been tested and works. It rotates IPs automatically every 5 minutes and supports three protocols: http, https, and socks5. The key selling point is its 200+ city data-center nodes nationwide, so you can appear as a user from almost any location. Below, we use NodeJS + Puppeteer for a hands-on walkthrough.

Puppeteer Basic Configuration Pitfalls

Start by installing puppeteer-extra and the stealth plugin rather than using plain puppeteer directly. Here's a pitfall: Chromium exposes headless characteristics by default, so you need a few extra launch parameters to disguise them:


const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin()); // patch common headless giveaways

async function launchBrowser() {
  const browser = await puppeteer.launch({
    headless: "new",
    args: [
      '--disable-web-security',
      // Chromium ignores credentials embedded in --proxy-server, so pass only
      // scheme://host:port here and call page.authenticate({ username, password })
      // on each page.
      '--proxy-server=http://proxy.ipipgo.com:9020',
      '--lang=zh-cn',
      '--disable-blink-features=AutomationControlled' // hide the automation flag
    ]
  });
  return browser;
}

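Pay attention to the format of the --proxy-server parameter, and replace the ipipgo username and password with your own. Note that Chromium does not honor credentials embedded in this flag; if the proxy requires authentication, supply them with page.authenticate() instead. A tip: set the proxy in args at launch, which is more stable than trying to configure it per page.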
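
Since the launch flag takes only a scheme, host, and port, one way to keep a single proxy URL in your config is to split it at startup. The following is a minimal sketch under that assumption; splitProxyUrl is a hypothetical helper, not part of Puppeteer or ipipgo, and the credentials are the same placeholders used above.

```javascript
// Hypothetical helper: separate a proxy URL into the Chromium launch flag and
// the credentials object expected by page.authenticate().
function splitProxyUrl(proxyUrl) {
  const u = new URL(proxyUrl);
  const credentials = u.username
    ? {
        username: decodeURIComponent(u.username),
        password: decodeURIComponent(u.password),
      }
    : null; // no credentials in the URL
  return { arg: `--proxy-server=${u.protocol}//${u.host}`, credentials };
}

// Usage inside an async function (placeholder credentials):
//   const { arg, credentials } =
//     splitProxyUrl('http://username:password@proxy.ipipgo.com:9020');
//   const browser = await puppeteer.launch({ args: [arg] });
//   const page = await browser.newPage();
//   if (credentials) await page.authenticate(credentials);
```

Keeping the split in one place means a rotated proxy URL only has to be parsed once per browser launch.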

IP Rotation Strategy: Make or Break

Attaching a proxy is not enough; you also need intelligent IP switching. It is recommended to set up layered safeguards:

Trigger condition                      Response strategy
3 consecutive failed requests          Switch to a new IP immediately
Single IP in use for over 10 minutes   Proactively release the connection
CAPTCHA block encountered              Switch city nodes

Real-world code snippet:


let retryCount = 0; // consecutive failures on the current IP

// Assumes `page` is an open Puppeteer page and rotateProxy() is your own helper.
async function safeVisit(url) {
  try {
    await page.goto(url, {timeout: 60000});
    retryCount = 0; // a success ends the failure streak
  } catch (e) {
    if (++retryCount >= 3) {
      await rotateProxy(); // call ipipgo's API to change IPs
      retryCount = 0;
    }
  }
}
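
The three triggers from the table can also be folded into one decision object that the crawl loop consults after every request. This is only a sketch: RotationPolicy, its method names, and the action strings are hypothetical, and the thresholds simply mirror the table.

```javascript
// Hypothetical policy object; thresholds mirror the table, names are illustrative.
class RotationPolicy {
  constructor({ maxFailures = 3, maxAgeMs = 10 * 60 * 1000, now = Date.now } = {}) {
    this.maxFailures = maxFailures;
    this.maxAgeMs = maxAgeMs;
    this.now = now; // injectable clock, handy for tests
    this.reset();
  }

  // Call right after a new IP has been obtained.
  reset() {
    this.failures = 0;
    this.acquiredAt = this.now();
  }

  // Report one request outcome ('ok', 'fail', or 'captcha'); returns the next action.
  record(outcome) {
    if (outcome === 'captcha') return 'switch-city';
    this.failures = outcome === 'ok' ? 0 : this.failures + 1;
    if (this.failures >= this.maxFailures) return 'switch-ip';
    if (this.now() - this.acquiredAt > this.maxAgeMs) return 'release';
    return 'keep';
  }
}
```

The caller maps each returned action to its own proxy call (for example the rotateProxy() helper above) and then calls reset().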

In practice: an e-commerce price monitoring script

As an example, suppose we need to capture product prices from an e-commerce platform (no specific name given). Here is a trick for getting around anti-scraping: access the product list page through a proxy IP first, then use your real IP for the detail pages. This works because risk control on the list page is strict, while the detail pages are relatively relaxed.

ipipgo's pay-as-you-go plan is the most cost-effective choice here: spend 80% of the proxy traffic on the steps with strict risk control. Remember to turn on their Intelligent Routing feature, which automatically selects the lowest-latency node.

Q&A

Q: What should I do if my proxy IP often times out?
A: In 80% of cases the cause is a public proxy pool. Switch to ipipgo's dedicated bandwidth lines; with TCP long-connection reuse enabled in the dashboard, the timeout rate can drop by 60%.

Q: How do I get past human verification when I run into it?
A: Don't try to force it. Switch IPs immediately and change your browser fingerprint at the same time. ipipgo's multi-protocol support lets you mix socks5 and http proxies to diversify your camouflage.
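
Changing the fingerprint together with the IP can be as simple as picking a new user agent and viewport for the next page. A rough sketch, assuming a small hand-maintained profile list: PROFILES, pickProfile, and applyProfile are hypothetical names and the user-agent strings are examples only, while page.setUserAgent() and page.setViewport() are standard Puppeteer calls.

```javascript
// Illustrative profiles only; in practice keep this list current and larger.
const PROFILES = [
  {
    userAgent:
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    viewport: { width: 1920, height: 1080 },
  },
  {
    userAgent:
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    viewport: { width: 1440, height: 900 },
  },
];

// Pick one profile at random; rng is injectable so the choice can be tested.
function pickProfile(rng = Math.random) {
  return PROFILES[Math.floor(rng() * PROFILES.length)];
}

// Apply the profile to a Puppeteer page before navigating.
async function applyProfile(page, profile) {
  await page.setUserAgent(profile.userAgent);
  await page.setViewport(profile.viewport);
}
```

Call applyProfile() on a fresh page right after the IP switch, so the new address and the new fingerprint first appear together.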

Q: How do I handle high concurrency when I need it?
A: Use their port aggregation technology: a single account can open 500+ connections at the same time. Remember to schedule the work with puppeteer-cluster so you don't overload the NodeJS process.
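
A minimal puppeteer-cluster sketch of that scheduling might look like the following. The helper names clusterOptions and crawlAll, the concurrency of 4, and the retry and timeout values are assumptions for illustration; with CONCURRENCY_CONTEXT all jobs share one browser and therefore one proxy flag.

```javascript
// Build options for Cluster.launch(); Cluster is passed in so this stays testable.
function clusterOptions(Cluster, puppeteer, proxyArg, maxConcurrency = 4) {
  return {
    concurrency: Cluster.CONCURRENCY_CONTEXT, // isolated incognito context per job
    maxConcurrency,                           // cap on parallel jobs
    puppeteer,                                // e.g. the puppeteer-extra instance
    puppeteerOptions: { args: [proxyArg] },
    retryLimit: 2,                            // re-queue a failed job up to twice
    timeout: 60000,                           // per-job timeout in ms
  };
}

// Crawl a list of URLs and collect page titles (a stand-in for real extraction).
async function crawlAll(urls, proxyArg) {
  const { Cluster } = require('puppeteer-cluster');
  const puppeteer = require('puppeteer-extra');
  const cluster = await Cluster.launch(clusterOptions(Cluster, puppeteer, proxyArg));
  const results = [];

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url, { timeout: 60000 });
    results.push({ url, title: await page.title() });
  });
  cluster.on('taskerror', (err, url) => console.error(`failed: ${url}: ${err.message}`));

  for (const url of urls) cluster.queue(url);
  await cluster.idle();  // wait until the queue is drained
  await cluster.close();
  return results;
}
```

If the proxy needs authentication, call page.authenticate() at the top of the task function before page.goto().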

One final note: many websites now use IP behavior analysis, so changing IPs alone is not enough; you also have to control how often you visit. Combine ipipgo's request interval policy with browser-side randomization to keep data collection stable over the long term.
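
On the browser side, controlling visit frequency usually comes down to a randomized pause between requests. A small sketch, assuming a base delay of 2 seconds plus up to 3 seconds of jitter; jitteredDelay, sleep, and politeLoop are hypothetical helpers, and the numbers are not taken from ipipgo's policy.

```javascript
// Delay uniformly distributed in [baseMs, baseMs + jitterMs).
function jitteredDelay(baseMs, jitterMs, rng = Math.random) {
  return baseMs + Math.floor(rng() * jitterMs);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visit URLs one by one, pausing for a randomized interval after each visit.
async function politeLoop(urls, visit, { baseMs = 2000, jitterMs = 3000 } = {}) {
  for (const url of urls) {
    await visit(url);
    await sleep(jitteredDelay(baseMs, jitterMs));
  }
}
```

Here visit can be any async function, such as the safeVisit example earlier in the article.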
