
A hands-on guide to scraping with Jsoup without getting your IP banned
Anyone who has written crawlers for a while knows that anti-scraping defenses keep getting stricter. Last week my apprentice had scraped only 200 records with Jsoup when his IP landed on a blacklist. So today let's talk about how to pair Jsoup with proxy IPs so your crawler lives a little longer.
A quick review of Jsoup basics
Let's start with the most basic Jsoup code, as a wake-up call for newcomers. Note: never point this snippet directly at a commercial site, or you'll be blocked within minutes:
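import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("https://目标网站.com") // placeholder for the target site
        .timeout(5000) // give up after 5 seconds
        .get();
Elements items = doc.select(".product-item"); // every element with class "product-item"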
It won't take more than half an hour for this code to trigger anti-scraping measures. Don't believe me? Don't ask how I know...
Proxy IPs to the rescue
Websites mainly rely on these three techniques to identify crawlers:
| Detection method | Countermeasure |
|---|---|
| IP request frequency | Rotating proxy IPs |
| Request header characteristics | Browser simulation |
| Behavioral pattern analysis | Randomized request intervals |
The most damaging of these is IP blocking, and that's when you need proxy IPs to slip away and leave an empty shell behind. With ipipgo's residential proxies, for example, every request goes out through a different live IP, which makes it much harder for the site to tell a machine from a real visitor.
Using Jsoup with a proxy
Straight to the good stuff. Pay attention to the proxy settings section:
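import java.net.Authenticator;
import java.net.PasswordAuthentication;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Proxy information from ipipgo
String proxyHost = "gateway.ipipgo.com";
int proxyPort = 9021;
String proxyUser = "your account";
String proxyPass = "password";
// Since Java 8u111 the JDK disables Basic auth for HTTPS tunneling through a proxy by default.
// Clearing this property before the first request re-enables it (assumes the proxy uses Basic auth).
System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "");
// Proxy authentication
Authenticator.setDefault(new Authenticator() {
    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        return new PasswordAuthentication(proxyUser, proxyPass.toCharArray());
    }
});
// Request with proxy
Document doc = Jsoup.connect("https://目标网站") // placeholder for the target site
        .proxy(proxyHost, proxyPort)
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit...")
        .timeout(30000) // 30 seconds, since proxied requests are slower
        .get();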
Watch out for a few pitfalls:
1. Don't set the timeout too short; more than 20 seconds is recommended.
2. The User-Agent should be consistent with the rest of the browser headers you send.
3. Ideally, use a different proxy IP for each request (ipipgo's API can rotate them automatically).
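If you manage your own list of proxy endpoints instead of relying on a rotating gateway, point 3 can be sketched roughly as follows. This is only a minimal round-robin sketch: the `ProxyRotator` class and the `proxy-a.example` style endpoints are made-up names for illustration, not part of Jsoup or ipipgo.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: hands out proxies from a fixed list in round-robin order. */
final class ProxyRotator {
    private final List<Proxy> proxies = new ArrayList<>();
    private final AtomicInteger cursor = new AtomicInteger();

    /** Each endpoint is "host:port"; the values you pass in are your own, none are built in. */
    ProxyRotator(List<String> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy endpoint is required");
        }
        for (String endpoint : endpoints) {
            int colon = endpoint.lastIndexOf(':');
            if (colon < 1) {
                throw new IllegalArgumentException("expected host:port but got " + endpoint);
            }
            String host = endpoint.substring(0, colon);
            int port = Integer.parseInt(endpoint.substring(colon + 1));
            // createUnresolved defers the DNS lookup until the proxy is actually used
            proxies.add(new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port)));
        }
    }

    /** Returns the next proxy, wrapping around at the end of the list (safe across threads). */
    Proxy next() {
        int index = Math.floorMod(cursor.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

Each value returned by `next()` is a plain `java.net.Proxy`, so it can be handed to Jsoup's `Connection.proxy(Proxy)` overload before calling `.get()`.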
Advanced anti-blocking tips
A proxy alone isn't enough; it has to be paired with some camouflage:
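// Wait a random 2 to 5 seconds between requests to avoid frequency detection
Thread.sleep((long) (Math.random() * 3000 + 2000));
// Send a complete set of request headers
// "br" (Brotli) is left out of Accept-Encoding: Jsoup decodes gzip and deflate, but not Brotli, out of the box
Connection conn = Jsoup.connect(url)
        .header("Accept-Language", "zh-CN,zh;q=0.9")
        .header("Accept-Encoding", "gzip, deflate")
        .header("Cache-Control", "max-age=0");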
Even the best proxy is no good if you ignore these details. It's like dressing in black for a night burglary and then showing up in glow-in-the-dark shoes...
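To keep these details consistent across every request, the random wait and the header set can be pulled into one small helper. The sketch below is just one way to do it: `RequestCamouflage` and its method names are invented for illustration, and `br` is left out of `Accept-Encoding` on the assumption that your HTTP client cannot decode Brotli.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper that bundles the random pause and the shared request headers. */
final class RequestCamouflage {
    private RequestCamouflage() {}

    /** Picks a delay in [minMillis, maxMillis) so requests are not evenly spaced. */
    static long randomDelayMillis(long minMillis, long maxMillis) {
        if (minMillis < 0 || maxMillis <= minMillis) {
            throw new IllegalArgumentException("need 0 <= minMillis < maxMillis");
        }
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis);
    }

    /** Sleeps for a random delay in the given range and returns how long it slept. */
    static long pause(long minMillis, long maxMillis) throws InterruptedException {
        long delay = randomDelayMillis(minMillis, maxMillis);
        Thread.sleep(delay);
        return delay;
    }

    /** One shared header set, so every request presents the same browser profile. */
    static Map<String, String> browserHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept-Language", "zh-CN,zh;q=0.9");
        // Assumption: only advertise encodings the client can actually decode
        headers.put("Accept-Encoding", "gzip, deflate");
        headers.put("Cache-Control", "max-age=0");
        return headers;
    }
}
```

Calling `RequestCamouflage.pause(2000, 5000)` before each request reproduces the 2 to 5 second wait, and the map from `browserHeaders()` can be passed to Jsoup's `Connection.headers(Map)`.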
Q&A
Q: What should I do if the proxy IP stops working?
A: I recommend ipipgo's dynamic residential proxies. Their IP pool refreshes more than 2 million IPs a day and switches automatically, so you don't have to worry about it.
Q: How do I get past a CAPTCHA when I run into one?
A: That belongs to a different technical field, but you can use ipipgo's fixed-session proxies to keep the same IP throughout the verification flow.
Q: Will a slow proxy hurt efficiency?
A: Choosing the right proxy type matters. ipipgo's static datacenter proxies can keep latency within 200 ms, which suits scenarios that need fast responses.
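On the first question, if you handle dead proxies on the client side instead of leaving it to the provider, one simple approach is to fall through to the next proxy whenever a request fails. The sketch below assumes exactly that; `ProxyFailover` and `Fetcher` are hypothetical names, and the actual request (for example a Jsoup call) is supplied by the caller as a lambda.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical helper: tries each proxy in order until one request succeeds. */
final class ProxyFailover {
    private ProxyFailover() {}

    /** The caller's request logic; it may throw IOException, as Jsoup's get() does. */
    @FunctionalInterface
    interface Fetcher<P, T> {
        T fetch(P proxy) throws IOException;
    }

    /** Returns the first successful result, or throws once every proxy has failed. */
    static <P, T> T firstSuccess(List<P> proxies, Fetcher<P, T> fetcher) throws IOException {
        IOException lastFailure = null;
        for (P proxy : proxies) {
            try {
                return fetcher.fetch(proxy);
            } catch (IOException e) {
                lastFailure = e; // remember the error and move on to the next proxy
            }
        }
        throw new IOException("all " + proxies.size() + " proxies failed", lastFailure);
    }
}
```

With a list of `java.net.Proxy` objects, the lambda could be something like `p -> Jsoup.connect(url).proxy(p).get()`, where `url` is whatever page you are fetching.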
Why I recommend ipipgo
After trying quite a few proxy services, I settled on ipipgo for three main reasons:
- It supports pay-as-you-go billing, so small projects don't feel the cost.
- It has an exclusive IP liveness detection feature that automatically filters out dead IPs.
- It provides complete request logs, which is especially handy for debugging.
They're running a promotion at the moment: new users get 1 GB of traffic, and entering [JSOUP2023] when you register gets you 20% more. If you're interested, have a look at their official website; I won't post the link here (so nobody can say I'm advertising).
As a final reminder, technology is a double-edged sword. When crawling data, be sure to respect the Robots protocol and the relevant laws and regulations. Don't put yourself at risk for a bit of data; it's not worth it!

