NodeJS Crawling: Puppeteer Headless Browser in Action

When the crawler meets anti-scraping: what to do when an ordinary IP gets rate-limited?

Veteran scraper writers know that anti-scraping mechanisms on websites are getting more and more aggressive. Last week I crawled data from my home broadband IP; it went smoothly at first, but the next day the site put me straight on its blacklist. This is when we bring out our savior: a dynamic proxy IP pool. It is like playing a game on alternate accounts: every request carries a new identity, so the anti-scraping system cannot find a pattern.

A plug goes here (the boss asked for it): ipipgo's short-lived proxy pool has been tested and works. It changes IP automatically every 5 minutes and supports three protocols: HTTP, HTTPS, and SOCKS5. The key point is its server-room nodes in 200+ cities nationwide, so you can pose as a user from wherever you like. Below we use NodeJS + Puppeteer for a hands-on walkthrough.

Puppeteer basic configuration: avoiding the pitfalls

Start by installing puppeteer-extra and its stealth plugin (puppeteer-extra-plugin-stealth) instead of relying on the plain puppeteer package alone. Here is a pitfall: Chromium exposes its headless characteristics by default, so you have to add a few launch arguments to disguise them:


const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

async function launchBrowser() {
  const browser = await puppeteer.launch({
    headless: "new",
    args: [
      '--disable-web-security',
      // Chromium ignores credentials embedded in --proxy-server, so only
      // scheme://host:port goes here; pass your username and password
      // to page.authenticate() on each page instead.
      '--proxy-server=http://proxy.ipipgo.com:9020',
      '--lang=zh-cn',
      '--disable-blink-features=AutomationControlled'
    ]
  });
  return browser;
}

Take note of the format of the --proxy-server parameter, and replace the ipipgo username and password with your own. Here is a tip: setting the proxy directly in args is more stable than configuring it on each page.
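Chromium's --proxy-server flag carries only the scheme, host, and port, so the credentials usually have to be supplied per page. The following is a minimal sketch of one way to do that, assuming a browser launched as above. The helper names `openProxiedPage` and `credentialsFromEnv` and the environment variables `PROXY_USER` / `PROXY_PASS` are made up for illustration and are not part of Puppeteer or ipipgo:

```javascript
// Hypothetical helper: open a tab and answer the proxy's auth challenge.
// `browser` is any Puppeteer Browser, for example the one from launchBrowser().
async function openProxiedPage(browser, credentials) {
  const page = await browser.newPage();
  // page.authenticate() supplies the username/password for HTTP authentication,
  // which covers the 407 challenge sent by an authenticating proxy.
  await page.authenticate({
    username: credentials.username,
    password: credentials.password,
  });
  return page;
}

// Assumed convention: credentials come from environment variables
// rather than being hard-coded in the script.
function credentialsFromEnv(env = process.env) {
  return { username: env.PROXY_USER || '', password: env.PROXY_PASS || '' };
}
```

A call would look like `const page = await openProxiedPage(browser, credentialsFromEnv());` and must happen before the first `page.goto()`.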

IP rotation strategy: a matter of life and death

Attaching a proxy is not enough; you also need intelligent IP switching. It is recommended to set up several layers of insurance:

Trigger condition | Response strategy
3 consecutive failed requests | Switch to a new IP immediately
A single IP in use for more than 10 minutes | Actively release the connection
CAPTCHA block encountered | Switch city nodes

Real-world code snippet:


// `page` (a Puppeteer Page) and `rotateProxy` are assumed to be defined elsewhere.
let retryCount = 0;
async function safeVisit(url) {
  try {
    await page.goto(url, {timeout: 60000});
    retryCount = 0; // a success ends the run of consecutive failures
  } catch (e) {
    // switch IPs on the 3rd consecutive failure
    if (++retryCount >= 3) {
      await rotateProxy(); // call ipipgo's API to change IPs
      retryCount = 0;
    }
  }
}
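The three triggers listed in this section (consecutive failures, IP age, CAPTCHA) can also be kept in one small policy object, so the crawl loop only has to ask whether it is time to rotate. This is a sketch under stated assumptions: the class name `RotationPolicy` and its methods are invented here, and the actual IP change is left to whatever your provider's API offers.

```javascript
// Sketch of the rotation triggers as one policy object (all names hypothetical).
class RotationPolicy {
  constructor({ maxFailures = 3, maxAgeMs = 10 * 60 * 1000, now = Date.now } = {}) {
    this.maxFailures = maxFailures; // consecutive failures tolerated per IP
    this.maxAgeMs = maxAgeMs;       // maximum time a single IP stays in use
    this.now = now;                 // injectable clock, handy for testing
    this.reset();
  }

  // Call right after switching to a new IP.
  reset() {
    this.failures = 0;
    this.startedAt = this.now();
  }

  recordSuccess() {
    this.failures = 0; // only consecutive failures count
  }

  recordFailure() {
    this.failures += 1;
  }

  // Returns the reason to rotate, or null if the current IP can stay.
  reasonToRotate({ captchaSeen = false } = {}) {
    if (captchaSeen) return 'captcha';
    if (this.failures >= this.maxFailures) return 'failures';
    if (this.now() - this.startedAt > this.maxAgeMs) return 'expired';
    return null;
  }
}
```

The returned reason lets the caller react differently, for example requesting a different city node on `'captcha'` and simply a fresh IP on `'failures'` or `'expired'`.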

Practical: e-commerce price monitoring script

Take an e-commerce platform (which I will not name) as an example, where we need to capture product prices. Here is a trick for getting around anti-scraping: access the product list pages through a proxy IP first, then fetch the detail pages from your real IP. This works because risk control is strict on the list pages and relatively loose on the detail pages.

ipipgo's pay-as-you-go plan is the best value here: spend 80% of the proxy traffic on the steps with strict risk control. Remember to turn on their Intelligent Routing feature, which automatically selects the node with the lowest latency.
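Because a launch flag such as --proxy-server applies to the whole browser process, one way to split list and detail traffic is to run two browser instances with different arguments. The sketch below only builds the argument lists; the function name `launchArgsFor` and the page kinds `'list'` and `'detail'` are illustrative assumptions.

```javascript
// Sketch: choose Chromium launch arguments by page type (names hypothetical).
// 'list' pages go through the proxy; 'detail' pages use the direct connection.
const BASE_ARGS = ['--disable-blink-features=AutomationControlled'];

function launchArgsFor(pageKind, proxyServer) {
  if (pageKind === 'list') {
    return [...BASE_ARGS, `--proxy-server=${proxyServer}`];
  }
  if (pageKind === 'detail') {
    return [...BASE_ARGS]; // no proxy flag: requests leave from the real IP
  }
  throw new Error(`unknown page kind: ${pageKind}`);
}
```

Each list would then be passed as `args` to its own `puppeteer.launch()` call, for example `puppeteer.launch({ args: launchArgsFor('list', 'http://proxy.ipipgo.com:9020') })`.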

Q&A

Q: What should I do if my proxy IP often times out?
A: Eight times out of ten, it is because you are using a public proxy pool. Switch to ipipgo's dedicated bandwidth lines; with TCP long-connection reuse enabled in the backend, the timeout rate can be reduced by 60%.

Q: How do I get past human verification when I run into it?
A: Don't force your way through. Switch IPs immediately and change your browser fingerprint at the same time. With ipipgo's multi-protocol support, you can mix SOCKS5 and HTTP proxies to make your traffic look more varied.

Q: How do I get high concurrency when I need it?
A: Use their port aggregation technology; a single account can open 500+ connections at the same time. Remember to schedule the work with puppeteer-cluster so you don't blow up the NodeJS process.
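As a rough illustration of the puppeteer-cluster approach, the sketch below queues a list of URLs and caps how many run at once. The function name `crawlAll` and the `handlePage` callback are hypothetical, and `Cluster` is passed in by the caller so the snippet does not assume how the library is installed or imported.

```javascript
// Sketch: crawl a list of URLs with bounded concurrency.
// `Cluster` is the class exported by puppeteer-cluster, supplied by the caller;
// `handlePage(page, url)` is a hypothetical callback that scrapes one page.
async function crawlAll(Cluster, urls, handlePage, maxConcurrency = 4) {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // isolated browser context per job
    maxConcurrency,                           // upper bound on parallel workers
  });
  const results = [];
  await cluster.task(async ({ page, data: url }) => {
    results.push(await handlePage(page, url));
  });
  for (const url of urls) {
    cluster.queue(url);
  }
  await cluster.idle();  // wait until the queue is drained
  await cluster.close(); // shut down the browser
  return results;
}
```

With the real library this would be called as `const { Cluster } = require('puppeteer-cluster');` followed by `await crawlAll(Cluster, urls, async (page, url) => { await page.goto(url); return page.title(); });`. Results arrive in completion order, not queue order.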

One last remark: many websites now use IP behavior analysis, so changing IPs is not enough; you also have to control how often you visit. Combine ipipgo's request interval policy with random delays on the browser side to keep the crawl running stably over the long term.
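The random-delay half of that advice can be sketched in a few lines. The helper names below are hypothetical, and the 2 to 6 second range is an arbitrary example rather than a recommended setting; tune it to the target site.

```javascript
// Sketch: wait a random time between requests so the access pattern is less regular.
function randomDelayMs(minMs = 2000, maxMs = 6000, random = Math.random) {
  // random() is in [0, 1), so the result is in [minMs, maxMs)
  return Math.floor(minMs + random() * (maxMs - minMs));
}

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// `visit` is any async function that performs one request (hypothetical callback).
async function visitPolitely(urls, visit, delayFn = randomDelayMs) {
  for (const url of urls) {
    await visit(url);
    await sleep(delayFn()); // pause before moving on to the next URL
  }
}
```

For example, `await visitPolitely(urls, (url) => page.goto(url));` visits the URLs one by one with a pause after each.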

This article was originally published or organized by ipipgo: https://www.ipipgo.com/en-us/ipdaili/35906.html
