
Hands-on with Node.js to build a man-in-the-middle proxy
Recently, many friends who do data collection have complained to me that website anti-scraping measures keep getting more aggressive. This is exactly where a proxy IP relay comes in handy: it is like giving your crawler a whole stack of masks to wear. Today we'll use Node.js to build a man-in-the-middle proxy from scratch.
Don't skimp on preparation
First, make sure Node.js version ≥ 14 is installed on your computer; don't use an ancient release. I recommend managing versions with nvm, which makes switching versions as easy as changing clothes. The two core modules are http-proxy and express. The createProxyMiddleware helper used later ships in the http-proxy-middleware package, which builds on http-proxy, so install it as well. These are runtime dependencies, so install them as regular dependencies, and be careful not to mistype the command:
npm install express http-proxy http-proxy-middleware
One pitfall to be aware of: some tutorials tell you to install the request package (deprecated since 2020). In fact, with newer versions of Node.js the built-in http module is enough to get the job done.
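To make the point about the built-in module concrete, here is a minimal sketch of a GET helper that uses only http and https. The name fetchText is made up for this example, and it skips redirects and retries, so treat it as a starting point rather than a full replacement for request.

```javascript
const http = require('http');
const https = require('https');

// Fetch a URL and resolve with { status, body }.
// `fetchText` is a hypothetical helper name, not part of any library.
function fetchText(url) {
  // Pick the built-in client that matches the URL scheme.
  const client = url.startsWith('https:') ? https : http;
  return new Promise((resolve, reject) => {
    client
      .get(url, (res) => {
        res.setEncoding('utf8');
        let body = '';
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => resolve({ status: res.statusCode, body }));
      })
      .on('error', reject);
  });
}
```

Calling fetchText with any reachable URL returns a promise, so it can be awaited inside an async function.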
Building the proxy server in three steps
Create a new proxy.js file and build it in three steps. The table shows the opening line of each step:
| Step | Code snippet |
|---|---|
| 1. Basic framework | const express = require('express'); |
| 2. Middleware configuration | app.use('/api', createProxyMiddleware({ |
| 3. Starting the service | app.listen(3000, () => { |
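Since the table only gives the first line of each step, it may help to see roughly what the three steps do under the hood. The sketch below is not the express version from the table: it swaps in Node's built-in http module so it runs without any dependencies. The name createProxyServer, the /api prefix default, and the 127.0.0.1:8080 target are assumptions for illustration. It forwards the request path unchanged and handles plain HTTP targets only.

```javascript
const http = require('http');

// Step 1: basic framework. Returns an http.Server that forwards every
// request under `prefix` to `target`, e.g. 'http://127.0.0.1:8080'.
function createProxyServer(target, prefix = '/api') {
  const upstream = new URL(target);
  return http.createServer((clientReq, clientRes) => {
    // Step 2: the "middleware" part. Only paths under the prefix are proxied.
    if (!clientReq.url.startsWith(prefix)) {
      clientRes.writeHead(404);
      clientRes.end('Not proxied');
      return;
    }
    const proxyReq = http.request(
      {
        hostname: upstream.hostname,
        port: upstream.port || 80,
        path: clientReq.url,
        method: clientReq.method,
        // Rewrite Host so the upstream sees its own host name.
        headers: { ...clientReq.headers, host: upstream.host },
      },
      (proxyRes) => {
        // Relay status, headers, and body back to the client.
        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
      }
    );
    proxyReq.on('error', () => {
      if (!clientRes.headersSent) clientRes.writeHead(502);
      clientRes.end('Bad gateway');
    });
    // Stream the client's request body to the upstream.
    clientReq.pipe(proxyReq);
  });
}

// Step 3: start the service (uncomment to run; the target is a placeholder).
// createProxyServer('http://127.0.0.1:8080').listen(3000, () => {
//   console.log('Proxy listening on port 3000');
// });
```

In a real project the express plus createProxyMiddleware route from the table is the more convenient choice; this version is only meant to show that a proxy is, at its core, one incoming request piped into one outgoing request.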
Putting armor on the proxy
Being able to forward requests is not enough; you also need some protection. I recommend ipipgo's dynamic IP pool, whose IP survival rate measured above 90% in my tests. Add proxy-switching logic to the configuration:
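const proxyOptions = {
  target: 'target address', // placeholder: replace with the real target URL
  // Called for every request. ipipgo is assumed to be a client object
  // for the provider's API that has been set up elsewhere.
  router: function (req) {
    return ipipgo.getRandomIP(); // get a random premium IP
  },
};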
Note one detail: calls to ipipgo's API should be spaced at reasonable intervals so you don't overwhelm their servers. I recommend using a timer to switch IPs every 5-10 seconds.
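The timer idea could look something like the sketch below. It keeps one current address in memory and refreshes it on an interval, so the provider's API is hit once per interval instead of once per request. Both createRotatingProxy and the fetchIp callback are hypothetical names; fetchIp stands for whatever async call your provider's client actually offers.

```javascript
// Keeps one "current" proxy address and refreshes it on a timer.
// `fetchIp` is a hypothetical async function that returns a proxy URL.
function createRotatingProxy(fetchIp, intervalMs = 5000) {
  let current = null;
  let timer = null;

  async function refresh() {
    try {
      current = await fetchIp();
    } catch (err) {
      // Keep the previous address if the provider call fails.
      console.error('IP refresh failed:', err.message);
    }
    return current;
  }

  return {
    // Fetch the first address, then rotate every intervalMs milliseconds.
    async start() {
      await refresh();
      timer = setInterval(refresh, intervalMs);
      timer.unref(); // do not keep the process alive just for rotation
    },
    stop() {
      clearInterval(timer);
      timer = null;
    },
    refresh,
    get: () => current,
  };
}
```

With this in place, the router function only needs to return rotation.get(), assuming rotation is the object returned by createRotatingProxy and start() has already been awaited.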
Common pitfalls Q&A
Q: What should I do if I keep failing to connect to the proxy?
A: First check whether the IP is valid by testing it with ipipgo's ping detection interface. If the status code returned is 407, the authentication is most likely misconfigured.
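A 407 means Proxy Authentication Required, so the upstream proxy wants credentials. If your provider uses HTTP Basic authentication (an assumption here; some providers use IP whitelisting instead), the header can be built as in this sketch. The helper name and the credentials are made up.

```javascript
// Build the Proxy-Authorization header value for HTTP Basic auth.
// `proxyAuthHeader` is a hypothetical helper; the credentials are placeholders.
function proxyAuthHeader(username, password) {
  const token = Buffer.from(`${username}:${password}`, 'utf8').toString('base64');
  return `Basic ${token}`;
}

// Example: headers to send along with a request that goes through the proxy.
const headers = { 'Proxy-Authorization': proxyAuthHeader('user', 'pass') };
```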
Q: How do you handle website certificate validation?
A: Add secure: false to the configuration to skip SSL certificate verification, though a proper certificate is recommended for production environments.
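One way to avoid shipping secure: false to production by accident is to derive the flag from the environment. This is only a sketch: tlsProxyOptions is a made-up helper, and it assumes you mark production with NODE_ENV=production.

```javascript
// Skip certificate checks only outside production.
// `tlsProxyOptions` is a hypothetical helper name.
function tlsProxyOptions(env = process.env.NODE_ENV) {
  return {
    // true  -> verify the upstream certificate (production)
    // false -> accept self-signed certificates (local testing)
    secure: env === 'production',
  };
}
```

The returned object can be spread into the proxy options next to target.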
Q: What should I do if request latency is too high?
A: Switch to ipipgo's dedicated IP line, which in my tests was more than 3 times faster than shared IPs. Remember to set the timeout in the code:
timeout: 5000 // in milliseconds
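If you make requests with the built-in http module directly, keep in mind that the timeout option only emits a 'timeout' event when the socket goes idle; the request still has to be aborted by hand. A minimal sketch, with getWithTimeout as a made-up name and plain HTTP only:

```javascript
const http = require('http');

// GET a URL and resolve with the status code, rejecting if the socket
// stays idle longer than timeoutMs. `getWithTimeout` is a hypothetical helper.
function getWithTimeout(url, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    const req = http.get(url, { timeout: timeoutMs }, (res) => {
      res.resume(); // discard the body; only the status matters here
      res.on('end', () => resolve(res.statusCode));
    });
    // Node only emits 'timeout'; the request must be destroyed explicitly.
    req.on('timeout', () => {
      req.destroy(new Error(`Timed out after ${timeoutMs} ms`));
    });
    req.on('error', reject);
  });
}
```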
Performance Optimization Tips
Finally, I'd like to share a few practical tips:
- Use the cluster module to run multiple processes, so the proxy can make use of more than one CPU core
- Work with Redis to cache IP state and reduce API calls
- Don't be lazy about logging; use winston for leveled logging
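To illustrate the IP state caching tip, here is a sketch of the caching logic itself. It uses an in-memory Map instead of Redis so that it runs on its own; in a multi-process setup the Map would be replaced by Redis so all workers share the same state. The names createIpStateCache and checkIp are hypothetical, and checkIp stands for whatever health-check call your provider offers.

```javascript
// In-memory TTL cache for proxy health checks (a stand-in for Redis).
// `checkIp` is a hypothetical async function: (ip) => boolean.
// `now` is injectable so the expiry logic can be tested with a fake clock.
function createIpStateCache(checkIp, ttlMs = 30000, now = Date.now) {
  const cache = new Map(); // ip -> { alive, expiresAt }

  return async function isAlive(ip) {
    const hit = cache.get(ip);
    // Serve from cache while the entry is still fresh.
    if (hit && hit.expiresAt > now()) return hit.alive;
    // Otherwise ask the provider once and remember the answer.
    const alive = await checkIp(ip);
    cache.set(ip, { alive, expiresAt: now() + ttlMs });
    return alive;
  };
}
```

Repeated checks for the same IP within the TTL window then cost a Map lookup instead of an API call.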
Pair this whole setup with ipipgo's high-anonymity IP pool. If you run into more complex anti-scraping mechanisms, their technical support can also provide customized solutions, which is quite reassuring.

