Hands-on: Grabbing Data with node-fetch + Proxy IPs
Anyone who has been around data collection for a while knows the drill: hit a target site directly from your own server IP and you'll be blocked and blacklisted within minutes. Just yesterday an e-commerce friend complained to me that while scraping competitors' prices, their IP got banned after only 200 records. So this time we're bringing out the proxy IP approach, and quality IPs from a professional provider like ipipgo can easily double your collection efficiency.
Why do I have to use a proxy IP?
Here's an analogy 🌰: if you go to the supermarket for the discounted goods wearing the same fluorescent green jacket every time, who do you think the security guards will be watching? Likewise, if you hammer a website at high frequency from one fixed IP, the firewall won't just sit there. ipipgo's proxy IP pool is large enough that each request automatically puts on a different "vest", which neatly solves this problem.
// The original bare request, no proxy (high-risk move)
const fetch = require('node-fetch');
fetch('https://目标网站.com/api');
Practical Makeover: Putting a Proxy Vest on a Request
First, the two packages we'll use: node-fetch sends the requests, and https-proxy-agent handles the proxy configuration. Below is an example using ipipgo's HTTP proxy (they give new users 1 GB of free traffic, plenty for testing):
const fetch = require('node-fetch');
const { HttpsProxyAgent } = require('https-proxy-agent');
// Fetch the proxy information from the ipipgo backend
const proxyConfig = {
host: 'gateway.ipipgo.com',
port: 9021, // proxy port from the ipipgo backend
auth: 'account:password' // remember to change it to your own
};
const agent = new HttpsProxyAgent(
`http://${proxyConfig.auth}@${proxyConfig.host}:${proxyConfig.port}`
);
// Safe request with proxy
async function safeFetch(url) {
  try {
    const response = await fetch(url, { agent });
    console.log(await response.text());
  } catch (error) {
    console.log('Request flopped:', error.message);
  }
}
// Real-world calls
safeFetch('https://目标网站.com/api?page=1');
Pitfall guide: details that will trip you up if you ignore them
1. Timeout setting: add a timeout to your fetch calls and give up if there's no response within 5 seconds.
2. IP rotation: ipipgo supports switching IPs per request; remember to append the &change=1 parameter to the proxy address.
3. Concurrency control: don't fire off 100 threads at once; use the p-limit library to cap the number of concurrent requests (see the sketch after this list).
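Here is a minimal sketch tying points 1 and 3 together: a 5-second timeout via AbortController plus p-limit to cap concurrency. It reuses the gateway.ipipgo.com placeholder credentials from the example above, assumes p-limit v3 (later versions are ESM-only) and Node 15+ for the global AbortController, and the 20-page loop is just for illustration.
// Sketch: 5-second timeout + concurrency cap of 5 (assumes p-limit v3, the last CommonJS release)
const fetch = require('node-fetch');
const { HttpsProxyAgent } = require('https-proxy-agent');
const pLimit = require('p-limit');

const limit = pLimit(5); // at most 5 requests in flight at once
const agent = new HttpsProxyAgent('http://account:password@gateway.ipipgo.com:9021');

async function fetchWithTimeout(url, timeoutMs = 5000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, { agent, signal: controller.signal });
    return await response.text();
  } finally {
    clearTimeout(timer); // always clear the timer, success or failure
  }
}

// Queue 20 pages, but only 5 run concurrently
const tasks = Array.from({ length: 20 }, (_, i) =>
  limit(() => fetchWithTimeout(`https://目标网站.com/api?page=${i + 1}`))
);
Promise.allSettled(tasks).then((results) => console.log(`${results.length} pages attempted`));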
| Symptom | Where to look |
|---|---|
| Returns a 407 error | Check that the account/password or whitelist IP is configured correctly |
| Connection timeout | Try switching to a proxy node in a different region |
Frequently Asked Questions
Q: Can't I just use a free proxy?
A: Eight out of ten free proxies simply don't work. ipipgo's dedicated IP pool offers 98% availability, and in my own tests it works out cheaper than running your own proxies.
Q: Do I have to manually change the proxy configuration every time?
A: You can use ipipgo's API to fetch proxies dynamically; pairing it with Redis to keep the IP pool updated automatically is recommended (a rough sketch follows).
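As a rough illustration of that pattern, here is a sketch that pulls a proxy address from a provider API and caches it for a minute. The endpoint URL and the response format are hypothetical placeholders, not ipipgo's real API (check their docs for the actual format), and a simple in-memory cache stands in for Redis to keep the example self-contained.
// Hypothetical sketch: fetch a proxy from a provider API and cache it for 60 seconds.
// PROXY_API_URL and the "user:pass@host:port" response format are made-up placeholders.
const fetch = require('node-fetch');
const { HttpsProxyAgent } = require('https-proxy-agent');

const PROXY_API_URL = 'https://example.com/get-proxy'; // placeholder endpoint, not a real API
let cached = null; // { agent, expiresAt }

async function getAgent() {
  if (cached && Date.now() < cached.expiresAt) return cached.agent;
  const res = await fetch(PROXY_API_URL);
  const proxyUrl = (await res.text()).trim(); // assume the API returns e.g. "http://user:pass@host:port"
  cached = {
    agent: new HttpsProxyAgent(proxyUrl),
    expiresAt: Date.now() + 60 * 1000, // refresh every minute
  };
  return cached.agent;
}

// Usage: const agent = await getAgent(); then pass { agent } to fetch as before.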
Q: What do I do when a site's anti-crawling measures kick in?
A: The combo of ipipgo's high-anonymity IPs + random User-Agent headers + request delays has, in my experience, gotten past most basic protections (sketch below).
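A quick sketch of the random-UA-plus-delay half of that combo (the User-Agent strings are sample values, and the 1-3 second delay range is just a starting point to tune per site):
// Sketch: rotate the User-Agent header and add a random 1-3 second delay before each request
const fetch = require('node-fetch');

const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
];

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeFetch(url, agent) {
  await sleep(1000 + Math.random() * 2000); // wait 1-3 seconds before each request
  const ua = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
  return fetch(url, { agent, headers: { 'User-Agent': ua } });
}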
A few words from the heart
I've tried maintaining my own proxy servers before, and I lost half my hair just dealing with IP blocking and network jitter. Then I switched to ipipgo's off-the-shelf service, and my development efficiency took off. Their intelligent routing feature in particular, which automatically picks the fastest node, is a real blessing for projects that need heavy data collection.
Lastly, a reminder: although proxy IPs reduce the risk of getting banned, you should still keep your collection frequency in check. Set reasonable request intervals based on the target site's robots.txt. Let's all be ethical crawler engineers, shall we?