
One Trick for Digging Into the Core Data of Job Boards
Recently a headhunter friend complained to me that poaching talent keeps getting harder. Companies keep salaries and benefits tightly under wraps, and competitors' job postings read like riddles. In fact, with the right tools this is about as easy as opening your own fridge to find food. You just need the right key.
Take the most common job boards as an example. They have three standard moves against crawlers: IP blocking, access-frequency limits, and behavior-trail detection. Last year a friend who does payroll analytics wrote his own script, ran it for two days, and got more than 20 IPs blocked. He was so angry he almost smashed his keyboard.
This is where our secret weapon comes in: the high-anonymity SOCKS5 proxy IP. The biggest difference from an ordinary proxy is that it is like going shopping in an invisibility cloak: the site sees only the proxy server's information and never gets near your real IP. With ipipgo's residential IP resources in particular, each IP is a real home network environment, so a recruitment site's anti-scraping system has a hard time telling a real visitor from a machine.
Hands-On: Building a Data Collection System
Let's start with a real case: an HR company used our ipipgo SOCKS5 proxies to collect 500,000+ job postings in three months. Their technical lead said, "We change IPs more often than we change socks, yet the success rate has held steady at 95% or above."
How exactly does it work? Remember these three points:
1. The IP rotation strategy should be "erratic" enough
Don't naively switch IPs on a fixed 5-minute schedule; a fixed rhythm is actually easier to spot. We recommend ipipgo's dynamic residential IPs with a random switching interval (anywhere from 30 seconds to 5 minutes), so the site's anti-scraping system cannot pick up a pattern.
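The random switching interval can be sketched roughly as follows. This is a minimal illustration, not ipipgo's own client: the `ProxyRotator` class, its parameters, and the proxy URLs in the usage note are all hypothetical names made up for this example.

```python
import random
import time


class ProxyRotator:
    """Hands out a proxy URL and switches to another one after a random interval.

    Hypothetical helper for illustration; the 30-300 second window follows the
    "30 seconds to 5 minutes" range suggested in the text.
    """

    def __init__(self, proxies, min_s=30.0, max_s=300.0,
                 clock=time.monotonic, rng=None):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._proxies = list(proxies)
        self._min_s = min_s
        self._max_s = max_s
        self._clock = clock
        self._rng = rng or random.Random()
        self._current = self._rng.choice(self._proxies)
        self._expires_at = self._clock() + self._rng.uniform(min_s, max_s)

    def current(self):
        """Return the active proxy, rotating first if its random lifetime is over."""
        now = self._clock()
        if now >= self._expires_at:
            # Prefer a different proxy than the one just used, if there is one.
            others = [p for p in self._proxies if p != self._current]
            self._current = self._rng.choice(others or self._proxies)
            # Draw a fresh lifetime each time so no fixed rhythm emerges.
            self._expires_at = now + self._rng.uniform(self._min_s, self._max_s)
        return self._current
```

A caller would ask `rotator.current()` before each request and pass the result to its HTTP client. With the `requests` library, for instance, a `socks5h://host:port` URL can go into the `proxies` argument once the SOCKS extra is installed (`pip install requests[socks]`); the host and port would come from your provider's dashboard.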
2. Requests need to "change face"
Changing the IP alone is not enough; you also have to randomize the User-Agent and Referer headers. It is like changing not only your clothes but also your looks every time you go out, which is what makes it safe enough.
| Parameter | Disguise technique |
|---|---|
| User-Agent | Prepare identifier strings for 20+ different browser versions |
| Access interval | Set a random interval of 0.5-3 seconds |
| Click path | Mimic the browsing habits of real people (look at the listing page before going into the details) |
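The three rows of the table could look like this in code. It is only a sketch: the function names are hypothetical, the User-Agent strings are a short illustrative sample (a real pool would hold the 20+ entries the table calls for), and the URL in the usage note is a placeholder.

```python
import random

# Illustrative sample only; extend to 20+ real browser identifier strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_headers(listing_url, rng=random):
    """Headers for a detail-page request that 'arrives' from the listing page."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        # Referer points at the listing page, mimicking list -> detail browsing.
        "Referer": listing_url,
    }


def next_delay(low=0.5, high=3.0, rng=random):
    """Random pause in seconds between two requests (0.5-3 s by default)."""
    return rng.uniform(low, high)
```

In a crawl loop you would call `time.sleep(next_delay())` between requests and pass `build_headers("https://jobs.example.com/list?page=1")` as the headers of each detail-page request.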
3. "Play dead" when handling exceptions
When you hit a CAPTCHA, don't try to force your way through; immediately suspend the task on the current IP. ipipgo's API supports automatically taking abnormal IPs offline, so you can wait a while and then quietly come back.
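If you manage the proxy list yourself, the "play dead" idea might be sketched like this. Everything here is an assumption for illustration: `looks_like_captcha` is a crude heuristic (real sites signal CAPTCHAs in different ways), `CooldownPool` and the 10-minute cooldown are invented for the example, and none of it represents ipipgo's actual API.

```python
import time


def looks_like_captcha(status_code, body):
    """Crude, hypothetical check: block-style status codes or a 'captcha' marker."""
    return status_code in (403, 429) or "captcha" in body.lower()


class CooldownPool:
    """Tracks which proxies are benched after a CAPTCHA and when they may return."""

    def __init__(self, proxies, cooldown_s=600.0, clock=time.monotonic):
        self._cooldown_s = cooldown_s
        self._clock = clock
        # Map each proxy to the time at which it becomes usable again.
        self._benched_until = {p: 0.0 for p in proxies}

    def bench(self, proxy):
        """Suspend a proxy for the cooldown period instead of retrying on it."""
        self._benched_until[proxy] = self._clock() + self._cooldown_s

    def available(self):
        """Proxies whose cooldown has expired, in their original order."""
        now = self._clock()
        return [p for p, until in self._benched_until.items() if until <= now]
```

The crawl loop would call `pool.bench(proxy)` whenever `looks_like_captcha(...)` returns true for a response, then pick its next proxy from `pool.available()`.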
Three Hard-Hitting Tips for Salary Analysis
Data you can't use is no good, so here are a few tricks of the trade:
① Job salary levels: Take the median for the same position and compare the gap between what different companies offer. For example, if a big tech company offers 35k for a Java developer role while a competitor only dares to offer 28k, that gap is the headhunter's opportunity.
② Hidden benefits mining: Count how often keywords such as "year-end bonus" and "stock options" appear; many companies' real benefits are hidden in these phrases.
③ Recruitment tempo monitoring: A sudden increase in hiring for a particular position is likely to correspond to an expansion of that business line. Last year a client used this signal to learn ahead of time that a big tech company's autonomous-driving team was being disbanded.
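The first two tips come down to a few lines of standard-library Python. The sketch below assumes each collected posting has already been parsed into a dict with `company`, `title`, `salary_k`, and `description` keys; those field names and the sample rows are made up for illustration.

```python
from collections import Counter
from statistics import median

# Made-up sample rows; real data would come from the collected postings.
POSTINGS = [
    {"company": "BigTech", "title": "Java Developer", "salary_k": 35,
     "description": "Year-end bonus and stock options included."},
    {"company": "BigTech", "title": "Java Developer", "salary_k": 33,
     "description": "Year-end bonus."},
    {"company": "Rival", "title": "Java Developer", "salary_k": 28,
     "description": "Flexible hours."},
]


def median_salary_by_company(postings, title):
    """Median offered salary (in k) per company for one job title."""
    by_company = {}
    for p in postings:
        if p["title"] == title:
            by_company.setdefault(p["company"], []).append(p["salary_k"])
    return {company: median(values) for company, values in by_company.items()}


def benefit_keyword_counts(postings, keywords=("year-end bonus", "stock options")):
    """Number of postings whose description mentions each benefit keyword."""
    counts = Counter()
    for p in postings:
        text = p["description"].lower()
        for kw in keywords:
            if kw in text:
                counts[kw] += 1
    return counts
```

Subtracting two entries of `median_salary_by_company(POSTINGS, "Java Developer")` gives the pay gap between those companies for the same role, and `benefit_keyword_counts(POSTINGS)` shows how often each benefit is advertised.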
Frequently Asked Questions
Q: Is it legal to collect data with a proxy IP?
A: As long as you don't go beyond the website's normal access permissions, collecting public information is not a problem. All ipipgo IPs come from compliant channels; it is essentially the same as browsing the same page from different phones.
Q: How to choose between dynamic IP and static IP?
A: Use dynamic residential IPs for high-frequency collection (ipipgo supports automatic rotation) and static residential IPs for long-term monitoring of specific pages. Don't be tempted by cheap data-center IPs; job sites now specifically target that kind of IP for blocking.
Q: What should I do if I encounter a CAPTCHA?
A: Three steps: ① switch to a new IP immediately, ② reduce the collection frequency, ③ use ipipgo's request-interval randomization feature. If you really can't get around it, consider a CAPTCHA-solving platform, but the cost will skyrocket.
In the end, data collection is a cat-and-mouse game. Last year a customer ran 30 crawler processes at once and used ipipgo's global node resources to wage "guerrilla warfare", pushing job-update monitoring on one job site to a real-time level. Remember, a proxy IP is not a master key, but choosing the right provider (such as our ipipgo) can at least save your crawler 80% of the detours.

