
Java Web Crawl: Jsoup Tutorials


Hands-on: scraping data with Jsoup without getting blocked

If you write crawlers, you already know that sites' anti-scraping measures keep getting stricter. Last week my apprentice had scraped just 200 records with Jsoup when his IP landed on a blacklist. So today let's talk about how to pair Jsoup with proxy IPs so your crawler lives a little longer.

Jsoup basics: a quick review

First, the most basic Jsoup code, as a wake-up call for newcomers who have just jumped in. Note: never point this snippet directly at a commercial site. You will be blocked within minutes:

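import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// "目标网站" means "target site"; replace it with the real host
Document doc = Jsoup.connect("https://目标网站.com")
           .timeout(5000)
           .get();
Elements items = doc.select(".product-item");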

This code will trigger a site's anti-scraping defenses in under half an hour. Don't believe me? Don't ask how I know...

Proxy IPs to the rescue

Websites mainly rely on these three techniques to identify crawlers:

Detection method                  Countermeasure
IP request frequency              Rotate proxy IPs
Request header characteristics    Simulate a browser
Behavioral trajectory analysis    Randomize the interval between actions

The most damaging of these is IP blocking, and that is when you need proxy IPs to slip away, like a cicada shedding its shell. For example, with ipipgo's residential proxies each request goes out through a different live IP, so the site can't tell whether it's a machine or a real person.
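If you manage your own list of proxy endpoints rather than relying on a rotating gateway, the "different IP for each request" idea can be sketched as a simple round-robin rotator. This is only a minimal sketch: the class name ProxyRotator and the hostnames in the usage note are made up for illustration and are not part of ipipgo's service or of Jsoup.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

/** Hypothetical helper: hands out proxies from a fixed list in round-robin order. */
final class ProxyRotator {
    private final List<Proxy> proxies;
    private final AtomicInteger cursor = new AtomicInteger();

    ProxyRotator(List<InetSocketAddress> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy endpoint is required");
        }
        // Wrap every endpoint as an HTTP proxy once, up front
        this.proxies = endpoints.stream()
                .map(address -> new Proxy(Proxy.Type.HTTP, address))
                .collect(Collectors.toList());
    }

    /** Returns the next proxy, wrapping around at the end of the list. */
    Proxy next() {
        // floorMod keeps the index non-negative even after the counter overflows
        int index = Math.floorMod(cursor.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

With two placeholder endpoints such as InetSocketAddress.createUnresolved("proxy-a.example", 8000) and ("proxy-b.example", 8001), successive calls to next() alternate between them. Each request could then hand the result to Jsoup's Connection.proxy(Proxy) overload.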

Using a proxy with Jsoup: the code

Straight to the practical part. Pay attention to the proxy settings:

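import java.net.Authenticator;
import java.net.PasswordAuthentication;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Proxy information from ipipgo
String proxyHost = "gateway.ipipgo.com";
int proxyPort = 9021;
String proxyUser = "your account";
String proxyPass = "password";

// Since JDK 8u111, Basic auth for HTTPS tunneling through a proxy is disabled by default.
// Clear the property before the first request so the Authenticator below is actually consulted.
System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "");

// Proxy authentication
Authenticator.setDefault(new Authenticator() {
    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        return new PasswordAuthentication(proxyUser, proxyPass.toCharArray());
    }
});

// Request with proxy ("目标网站" means "target site"; replace it with the real URL)
Document doc = Jsoup.connect("https://目标网站")
           .proxy(proxyHost, proxyPort)
           .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit...")
           .timeout(30000)
           .get();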

A few pitfalls to watch out for:
1. Don't set the timeout too short; 20 seconds or more is recommended.
2. The User-Agent should come with the full set of matching browser headers.
3. Ideally, use a different proxy IP for each request (ipipgo's API can rotate them automatically).

Advanced tips for beating anti-scraping measures

A proxy alone isn't enough; you need to pair it with some camouflage:

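// Wait a random 2-5 seconds between requests to dodge frequency detection
Thread.sleep((long) (Math.random() * 3000 + 2000));

// Send a full set of browser-like request headers.
// "br" is left out of Accept-Encoding: as far as I know, Jsoup decodes gzip and deflate
// itself but has no built-in Brotli decoder.
Connection conn = Jsoup.connect(url)
    .header("Accept-Language", "zh-CN,zh;q=0.9")
    .header("Accept-Encoding", "gzip, deflate")
    .header("Cache-Control", "max-age=0");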

Even the best proxy won't help if you ignore these details. It's like dressing in black for a burglary and then wearing fluorescent shoes...
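The random wait above can be wrapped in a small helper so the bounds are explicit and validated. This is a minimal sketch; the class name PoliteDelay and its methods are invented for illustration, and the 2000 to 5000 ms range simply mirrors the snippet above.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: random pauses between requests. */
final class PoliteDelay {
    private PoliteDelay() {}

    /** Returns a random delay in the half-open range [minMillis, maxMillis). */
    static long randomDelayMillis(long minMillis, long maxMillis) {
        if (minMillis < 0 || maxMillis <= minMillis) {
            throw new IllegalArgumentException("require 0 <= minMillis < maxMillis");
        }
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis);
    }

    /** Sleeps the current thread for a random delay within the given bounds. */
    static void pause(long minMillis, long maxMillis) throws InterruptedException {
        Thread.sleep(randomDelayMillis(minMillis, maxMillis));
    }
}
```

Calling PoliteDelay.pause(2000, 5000) before each request gives the same 2 to 5 second spread as the one-liner, with the bounds checked up front.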

Q&A

Q: What should I do if the proxy IP is not working?
A: Consider ipipgo's dynamic residential proxies. Their IP pool is refreshed with 2 million+ IPs every day and switches automatically, so you don't have to manage it yourself.

Q: What do I do when I run into a CAPTCHA?
A: That is a separate technical field. You can pair your solution with ipipgo's fixed-session proxies to keep the same IP throughout the verification flow.

Q: Will a slow proxy hurt efficiency?
A: Choosing the right proxy type matters. ipipgo's static datacenter proxies can keep latency under 200 ms, which suits scenarios that need fast responses.

Why recommend ipipgo

After trying so many proxy services, I ended up settling on ipipgo for three main reasons:

  1. It supports pay-as-you-go billing, so small projects don't hurt your wallet.
  2. It has an exclusive IP liveness detection feature that filters out dead IPs automatically.
  3. It provides complete request logs, which is especially convenient for debugging.

They are currently running a promotion: new users get 1 GB of traffic, and entering [JSOUP2023] at registration gets you 20% more. If you're interested, have a look at the official website; I won't post the link here (so nobody can say I'm advertising).

As a final reminder, technology is a double-edged sword. When crawling data, take care to respect the robots protocol and the relevant laws and regulations. Don't get yourself into trouble over a bit of data; it's not worth it!
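For the robots protocol part, a crawler can check a path against a site's robots.txt before fetching it. The sketch below is deliberately simplified and is not a complete implementation of the protocol: it only reads Disallow rules from "User-agent: *" groups and does plain prefix matching, ignoring Allow rules and wildcards. The class name RobotsCheck is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified robots.txt check (wildcard user-agent, Disallow prefixes only). */
final class RobotsCheck {
    private RobotsCheck() {}

    /** Collects the Disallow prefixes that apply to "User-agent: *". */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;
        for (String rawLine : robotsTxt.split("\\R")) {
            String line = rawLine.replaceFirst("#.*$", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group
                inWildcardGroup = (lastWasUserAgent && inWildcardGroup) || value.equals("*");
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** Returns false when the path starts with any disallowed prefix. */
    static boolean isAllowed(String path, List<String> disallowedPrefixes) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would download the site's /robots.txt once, pass its text to disallowedPrefixes, and skip any URL whose path fails isAllowed.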

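Our products are supported only for use in overseas network environments (except for the TikTok dedicated line). Any actions users take with IPIPGO do not represent the will or views of IPIPGO, and IPIPGO assumes no legal liability.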
