
When Recruitment Headhunters Meet Python Crawlers
Recently I was chatting with a few old friends in HR and found that their biggest headache is sourcing resumes. One junior headhunter complained that collecting data from LinkedIn by hand is slower than a snail. So I stayed up one night and put together a Python script for him; paired with ipipgo's proxy service, it sent his efficiency through the roof. Today I'm going to break this combo down piece by piece so that even a beginner can play with it.
Proxy IPs are a life preserver for crawlers
LinkedIn's anti-scraping mechanism is stricter than an airport security check. Hammer it from your own IP? You'll be blocked in minutes. Here's a slick trick: proxy IPs for crawlers. The principle is like swapping skins in a battle royale game: every request goes out from a different IP address, so the server has a much harder time telling a person from a machine.

    import requests
    from itertools import cycle

    # Proxy pool from the ipipgo backend
    proxies = [
        "http://user:pass@gateway.ipipgo.com:30001",
        "http://user:pass@gateway.ipipgo.com:30002",
        # ... prepare at least 20 IPs
    ]
    proxy_pool = cycle(proxies)

    for page in range(1, 50):
        # Take the next proxy from the pool for every request
        current_proxy = next(proxy_pool)
        try:
            response = requests.get(
                url="https://www.linkedin.com/jobs/search/",
                # The target URL is HTTPS, so the "https" key must be set too
                proxies={"http": current_proxy, "https": current_proxy},
                headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
                timeout=10,
            )
            # Treat 4xx/5xx responses as a failed attempt
            response.raise_for_status()
            print(f"Page {page} of data arrived!")
        except requests.RequestException:
            print("This IP is caught, move to the next one!")

Three Iron Rules for Choosing a Proxy IP
There are all sorts of proxy services on the market, but for LinkedIn you should look for these three things:
1. Residential IPs first: a datacenter IP is like wearing overalls into a nightclub, far too conspicuous. ipipgo's dynamic residential proxies, which come from real home network environments, are recommended.
2. Stable concurrency control: don't fire off 10 requests a second like a hothead; use ipipgo's smart scheduling API to control the request frequency automatically.
3. Accurate geolocation: want to poach Silicon Valley engineers? Remember to pick an IP node on the U.S. West Coast.
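If you want to see what rule 2 looks like on the client side, here is a minimal sketch of a throttle that enforces a minimum gap between requests. It is a stand-in, not ipipgo's scheduling API: the `Throttle` class and its parameters are names made up for this illustration.

```python
import time


class Throttle:
    """Client-side rate limiter: enforce a minimum gap between requests.

    Hypothetical helper for illustration; it does not call any proxy
    provider API.
    """

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
        return now
```

Create one `Throttle(max_per_second=0.5)` per crawler and call `wait()` right before each request; the first call returns immediately and later calls sleep only as long as needed.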
A Practical Guide to Avoiding Pitfalls
Last week I helped an e-commerce company scrape job posting data. Their homegrown script kept getting banned, and we later found three fatal flaws:
| Problem | Fix |
|---|---|
| User-Agent is fixed | Generate one at random with the fake_useragent library |
| Requests are spaced too regularly | Add a random.uniform(1, 3) delay to mimic human behavior |
| Abnormal login state | Use ipipgo's session persistence feature |
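The first two fixes in the table can be sketched with the standard library alone. This is only an illustration: the small hand-written `USER_AGENTS` list stands in for the fake_useragent library, and `random_headers` and `human_delay` are hypothetical helper names.

```python
import random

# Small hand-picked pool; an assumption standing in for fake_useragent
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


def human_delay(rng=random, low=1.0, high=3.0):
    """Return a random pause in seconds, so requests are not evenly spaced."""
    return rng.uniform(low, high)
```

In a request loop, pass `headers=random_headers()` to each request and call `time.sleep(human_delay())` between requests.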
Veteran Q&A Time
Q: What should I do if the data suddenly stops coming while I'm crawling?
A: There's an 80% chance you've tripped the risk controls. Do three things immediately: 1. clear your cookies; 2. switch to a new ipipgo IP; 3. reduce the request frequency to 3 requests per minute.
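Those three steps can be wired into one recovery routine. The sketch below models the crawler's state with a plain dataclass; `CrawlState`, `recover_from_block`, and the proxy URLs in the usage note are all hypothetical names for illustration, not part of any library.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Iterator


@dataclass
class CrawlState:
    """Minimal crawler state: proxy rotation, cookies, and pacing."""
    proxy_pool: Iterator[str]            # iterator yielding proxy URLs
    current_proxy: str = ""
    cookies: dict = field(default_factory=dict)
    min_interval: float = 1.0            # seconds between requests


def recover_from_block(state):
    """Apply the three recovery steps after a suspected block."""
    state.cookies.clear()                          # 1. clear cookies
    state.current_proxy = next(state.proxy_pool)   # 2. switch to the next IP
    state.min_interval = 60 / 3                    # 3. 3 requests per minute
    return state
```

For example, build the state with `pool = cycle(["http://proxy-a.example:8000", "http://proxy-b.example:8000"])` and `CrawlState(proxy_pool=pool, current_proxy=next(pool))`, then call `recover_from_block(state)` whenever responses come back empty or blocked.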
Q: Do free proxies work?
A: Wake up, bro! Free IP pools are like public restrooms: can something everyone has used really be safe? In our earlier tests, free IPs had an availability rate below 10%, while ipipgo's survival rate can exceed 98%.
Q: How many IPs are enough?
A: According to our stress tests, at 1,000 requests per hour it's safer to keep 50 IPs in rotation. ipipgo's plans include a dynamic IP pool that replenishes new IPs automatically.
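To see why that ratio is gentle on each address, here is the arithmetic as a quick sketch. It assumes requests are spread evenly across the pool, and `per_ip_load` is a hypothetical helper name.

```python
def per_ip_load(requests_per_hour, ip_count):
    """Return (requests per IP per hour, seconds between requests per IP).

    Assumes the load is spread evenly across all IPs in the pool.
    """
    per_ip_per_hour = requests_per_hour / ip_count
    seconds_between = 3600 / per_ip_per_hour
    return per_ip_per_hour, seconds_between


per_ip, gap = per_ip_load(1000, 50)
print(f"{per_ip:.0f} requests per IP per hour, one every {gap:.0f} seconds")
```

With 1,000 requests per hour over 50 IPs, each IP sends 20 requests an hour, or one every 180 seconds.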
An Upgraded Scraping Setup
The ultimate setup for those who want to go further:
1. Build a distributed crawler with the Scrapy framework.
2. Call ipipgo's API to fetch the latest proxy IPs.
3. Deploy to cloud servers and run on a schedule.
4. Store the data automatically in a MongoDB database.
Once the whole pipeline is running, set up a WeChat bot to send the report to your phone automatically every day before work. Beautiful.
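For steps 1 and 2, proxy rotation in Scrapy usually lives in a downloader middleware. Here is a rough sketch under stated assumptions: the class name and the `PROXY_LIST` setting are hypothetical, and the proxy list comes from settings rather than from ipipgo's API, whose interface isn't covered here. Setting `request.meta["proxy"]` is the standard way to tell Scrapy which proxy to use for a request.

```python
from itertools import cycle


class RotatingProxyMiddleware:
    """Downloader middleware sketch: give each request the next proxy."""

    def __init__(self, proxies):
        self._pool = cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting holding proxy URLs
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads request.meta["proxy"]
        request.meta["proxy"] = next(self._pool)
        return None  # let Scrapy continue processing the request
```

To try it, register the class under `DOWNLOADER_MIDDLEWARES` in `settings.py` with a priority number below 750, so it runs before the built-in proxy middleware, and define `PROXY_LIST` in the same file.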
To conclude, data collection is like guerrilla warfare: it has to be fast, accurate, and stable. Our team has been testing ipipgo's proxy service for three months, and the stability is hard to beat. Their dynamic residential IPs in particular let you access LinkedIn data as if you were local, and the anti-scraping system has a hard time catching them. If you need it, take a look at the official website; new users get a 1 GB traffic trial, enough to test the basic features.

