
Is your crawler getting its IP blocked by the site?
Recently I helped a friend scrape price data from an e-commerce platform, and the IP was blocked after only about 300 records. These days, running a crawler without knowing how to use proxy IPs is like charging onto the battlefield naked. Today let's talk through how to scrape data with Java's Jsoup library, with a focus on using ipipgo's proxy service to stay out of trouble.
Jsoup basics: the three-piece set
Let's warm up with the most basic code first:
// Remember to import the packages first!
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BasicCrawler {
    public static void main(String[] args) throws Exception {
        // "目标网站" is a placeholder meaning "target site"; substitute the real host
        Document doc = Jsoup.connect("https://目标网站.com")
                .timeout(5000) // timeout in milliseconds
                .get();
        System.out.println(doc.title());
    }
}
The problem with this code is as plain as day: it exposes your real IP directly, so you will be blocked in less than half an hour. This is where ipipgo's proxy IPs take the field.
The right way to set up a proxy IP
Adding a proxy to your code is actually easier than cooking instant noodles; the key is doing it the right way. Watch this:
// See here for the highlights!
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ProxyDemo {
    public static void main(String[] args) {
        // Proxy information from ipipgo
        String proxyHost = "gateway.ipipgo.com";
        int proxyPort = 9021;
        String username = "Your account number";
        String password = "Your password";
        // Basic credentials are "username:password", with no space around the colon
        String credentials = Base64.getEncoder().encodeToString(
                (username + ":" + password).getBytes(StandardCharsets.UTF_8));
        try {
            Document doc = Jsoup.connect("https://目标网站.com")
                    .proxy(proxyHost, proxyPort)
                    .timeout(10000) // timeout in milliseconds
                    .header("Proxy-Authorization", "Basic " + credentials)
                    .get();
            System.out.println("Successfully cloaked! Page title: " + doc.title());
        } catch (Exception e) {
            System.err.println("Request failed! Error message: " + e.getMessage());
        }
    }
}
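Here are a few points for avoiding pitfalls:
- Don't be stingy with the timeout; 8 seconds (8000 ms) is a recommended starting point
- Handle failures deliberately: .ignoreHttpErrors(true) only stops Jsoup from throwing on HTTP error status codes such as 403, and it does not fix SSL certificate problems, which need a trusted certificate chain or a custom .sslSocketFactory(...)
- The IP pool should be large enough; ipipgo's dynamic residential proxies are recommended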
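One more pitfall is worth isolating: the credentials in the proxy example must be encoded as exactly username:password, because a stray space around the colon produces a different Base64 string and the proxy will reject it. Below is a minimal sketch of a helper that builds the header value. The class name ProxyAuth and the method basicHeader are illustrative assumptions, not part of Jsoup or of any ipipgo SDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; not part of Jsoup or ipipgo.
class ProxyAuth {
    // Builds the value for a Proxy-Authorization header using the Basic scheme.
    static String basicHeader(String username, String password) {
        String raw = username + ":" + password; // no spaces around the colon
        return "Basic " + Base64.getEncoder()
                .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // prints "Basic dXNlcjpwYXNz"
        System.out.println(basicHeader("user", "pass"));
    }
}
```

The result can be passed to .header("Proxy-Authorization", ...) in the same way as in the example above.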
Hands-on: crawling e-commerce price data
Let's say we want to grab the price of an item from a certain e-commerce site, where the HTML structure looks like this:
<div class="price">
<span class="main-price">¥2999</span>
<span class="discount">500 off</span>
</div>
The corresponding Java code:
Elements prices = doc.select(".price .main-price");
for (Element price : prices) {
    // Strip the currency symbol and print only the number
    System.out.println("Current price: " + price.text().replace("¥", ""));
}
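Printing the text is enough for a demo, but comparing or storing prices usually calls for a numeric type. The sketch below assumes price strings shaped like the HTML sample ("¥2999"), possibly with thousands separators, and converts them to BigDecimal. PriceParser is a hypothetical name, and the cleanup rule (keep only digits and the decimal point) is an assumption that may not suit every site.

```java
import java.math.BigDecimal;

// Hypothetical helper for illustration: turns scraped price text into a number.
class PriceParser {
    // Keeps digits and the decimal point; drops currency symbols, commas, and spaces.
    static BigDecimal parse(String text) {
        String digits = text.replaceAll("[^0-9.]", "");
        if (digits.isEmpty()) {
            throw new IllegalArgumentException("No numeric price in: " + text);
        }
        return new BigDecimal(digits);
    }

    public static void main(String[] args) {
        System.out.println(parse("¥2999"));     // prints 2999
        System.out.println(parse("¥1,299.50")); // prints 1299.50
    }
}
```

In the loop above, this would be called as PriceParser.parse(price.text()).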
At this point, if you don't use a proxy, you'll be recognized as a crawler within minutes. Use ipipgo's intelligent rotating proxy feature, which switches IPs automatically and is far less trouble than changing IPs by hand.
Frequently Asked Questions
Q: What should I do if a proxy IP stops working while I am using it?
A: Eight times out of ten, this means the target site has blacklisted the IP. Suggestions:
1. Check whether the request frequency is too high
2. Switch to ipipgo's dynamic residential proxy package
3. Add a fail-over mechanism
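A fail-over mechanism can be as simple as trying the next proxy when one fails. The sketch below is one possible shape, not ipipgo's or Jsoup's own mechanism: ProxyFailover, Fetcher, and the proxy host names are all hypothetical, and the actual request (for example a Jsoup call configured with that proxy) would live inside the callback.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical sketch: try each proxy in order until one request succeeds.
class ProxyFailover {
    // Performs one fetch through the given proxy endpoint.
    interface Fetcher<T> {
        T fetch(String proxy) throws IOException;
    }

    static <T> T fetchWithFailover(List<String> proxies, Fetcher<T> fetcher) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("proxy list is empty");
        }
        IOException last = null;
        for (String proxy : proxies) {
            try {
                return fetcher.fetch(proxy);
            } catch (IOException e) {
                last = e; // remember the failure and move on to the next proxy
            }
        }
        throw new UncheckedIOException("All " + proxies.size() + " proxies failed", last);
    }

    public static void main(String[] args) {
        // Placeholder endpoints; the first one simulates a blacklisted IP.
        List<String> proxies = List.of("proxy-a.example:9021", "proxy-b.example:9021");
        String result = fetchWithFailover(proxies, proxy -> {
            if (proxy.startsWith("proxy-a")) {
                throw new IOException("blocked");
            }
            return "fetched via " + proxy;
        });
        System.out.println(result); // prints "fetched via proxy-b.example:9021"
    }
}
```

Keeping the fetch behind a callback means the retry loop can be exercised without any network access, which makes it easy to check before pointing it at a real site.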
Q: How do I set request headers in Jsoup?
A: Chain the calls after .connect():
.header("User-Agent", "Mozilla/5.0...")
.header("Accept-Language", "zh-CN")
Q: How do I choose an ipipgo proxy package?
A: It depends on the business scenario:
| Business Type | Recommended Package |
|---|---|
| High-frequency data acquisition | Enterprise Dynamic Proxy |
| Long-term monitoring | Exclusive Static Proxy |
| Temporary tasks | Pay-per-use package |
Anti-Blocking Strategy Bundle
A proxy alone is not enough; pair it with these combos:
- Randomize the sleep time between requests (0.5-3 seconds)
- Rotate the User-Agent
- Simulate mouse trajectories (with Selenium)
- Clear cookies regularly
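The first two items in this list are easy to sketch with the standard library alone. The example below picks a random delay in the 0.5-3 second range and a random User-Agent per request. CrawlPacing is a hypothetical name, and the User-Agent strings are shortened placeholders to be replaced with current, realistic values.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: random pacing and User-Agent rotation between requests.
class CrawlPacing {
    // Placeholder User-Agent strings; swap in full, up-to-date values.
    static final List<String> USER_AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
            "Mozilla/5.0 (X11; Linux x86_64)");

    // Random delay in milliseconds within [500, 3000].
    static long randomDelayMillis() {
        return ThreadLocalRandom.current().nextLong(500, 3001);
    }

    // Picks one User-Agent uniformly at random.
    static String randomUserAgent() {
        int index = ThreadLocalRandom.current().nextInt(USER_AGENTS.size());
        return USER_AGENTS.get(index);
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 3; i++) {
            long delay = randomDelayMillis();
            System.out.println("UA: " + randomUserAgent() + ", sleeping " + delay + " ms");
            Thread.sleep(delay); // pause before the next request
        }
    }
}
```

With Jsoup, the chosen string would go into .header("User-Agent", ...) on each request.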
A final word from the heart: in the crawler business, a stable and reliable proxy IP is your second life. Running your own proxy servers is time-consuming and labor-intensive, so why not just use a professional service like ipipgo and spend the time you save with your family?

