
I. Why parse proxy IPs with Java?
Anyone who has written a web crawler knows that hammering a site from your own IP gets you blacklisted within minutes. That is when you need a proxy IP to hide your true identity, like giving the crawler a pile of masks to wear. But the proxy IP services on the market return their lists in HTML format, and you can hardly copy and paste them by hand. So you need to write a parser to process them in batches.
II. Hand-rolling the parser: a tutorial
Let's use Jsoup as the HTML parser and practice with ipipgo's proxy service. Suppose we want to extract the IP address and port number from a page fetched from ipipgo, and the page structure looks like this:
<div class="proxy-list">
<span>101.202.3.4</span>
<em>|</em>
<span>8080</span>
</div>
The code is written this way (note the exception handling section):
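import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Route traffic through ipipgo's proxy (important!); the target URL is HTTPS, so set https.* as well
System.setProperty("http.proxyHost", "gateway.ipipgo.com");
System.setProperty("http.proxyPort", "9021");
System.setProperty("https.proxyHost", "gateway.ipipgo.com");
System.setProperty("https.proxyPort", "9021");
try {
    Document doc = Jsoup.connect("https://api.ipipgo.com/proxies")
            .timeout(10000) // give up after 10 seconds
            .get();
    Elements proxies = doc.select("div.proxy-list");
    for (Element proxy : proxies) {
        String ip = proxy.select("span:first-child").text();
        String port = proxy.select("span:last-child").text();
        System.out.println("Caught valid IP: " + ip + ":" + port);
    }
} catch (IOException e) {
    System.err.println("Failed to fetch the proxy list: " + e.getMessage());
}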
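The loop above prints whatever text the selectors return, so a change in the page layout could quietly produce garbage. Below is a minimal sketch of validating the two strings before using them. The `ProxyEntry` class and its `parse` method are hypothetical names introduced here for illustration; they are not part of Jsoup or ipipgo, and the sketch uses only the JDK.

```java
import java.util.Optional;

/** Hypothetical holder for one parsed proxy; not part of Jsoup or ipipgo. */
final class ProxyEntry {
    private final String ip;
    private final int port;

    private ProxyEntry(String ip, int port) {
        this.ip = ip;
        this.port = port;
    }

    String ip() { return ip; }
    int port() { return port; }

    /**
     * Validates the raw text taken from the two span elements.
     * Returns Optional.empty() rather than throwing, so one bad row
     * does not abort the whole batch.
     */
    static Optional<ProxyEntry> parse(String rawIp, String rawPort) {
        if (rawIp == null || rawPort == null) {
            return Optional.empty();
        }
        String ip = rawIp.trim();
        String portText = rawPort.trim();
        // limit -1 keeps trailing empty strings, so "1.2.3." is rejected
        String[] octets = ip.split("\\.", -1);
        if (octets.length != 4 || !portText.matches("\\d{1,5}")) {
            return Optional.empty();
        }
        for (String octet : octets) {
            if (!octet.matches("\\d{1,3}") || Integer.parseInt(octet) > 255) {
                return Optional.empty();
            }
        }
        int port = Integer.parseInt(portText);
        if (port < 1 || port > 65535) {
            return Optional.empty();
        }
        return Optional.of(new ProxyEntry(ip, port));
    }
}
```

Inside the loop, something like `ProxyEntry.parse(ip, port).ifPresent(validProxies::add)` would then collect only well-formed rows, where `validProxies` is a list you declare yourself.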
III. Three big pitfalls to avoid
Pitfall 1: Expired IPs are not handled - Consider ipipgo's 99% survival rate packages; their IPs are refreshed automatically every 15 minutes.
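If the IPs do rotate about every 15 minutes, the parser should stop handing out entries older than that. Here is a minimal sketch of a time-based pool. The `ExpiringProxyPool` class is a hypothetical name, and the time-to-live is a parameter you would set from the refresh interval mentioned above. The clock is injected so the expiry logic can be tested without waiting.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory pool that forgets proxies once they exceed a time-to-live. */
final class ExpiringProxyPool {
    private final Map<String, Instant> addedAt = new LinkedHashMap<>();
    private final Duration ttl;
    private final Clock clock;

    ExpiringProxyPool(Duration ttl, Clock clock) {
        this.ttl = ttl;
        this.clock = clock;
    }

    /** Stores an "ip:port" string with the current time; re-adding refreshes its timestamp. */
    synchronized void add(String address) {
        addedAt.put(address, clock.instant());
    }

    /** Drops expired entries, then returns the addresses that are still fresh. */
    synchronized List<String> fresh() {
        Instant cutoff = clock.instant().minus(ttl);
        // an entry survives only if it was added strictly after the cutoff
        addedAt.values().removeIf(added -> !added.isAfter(cutoff));
        return new ArrayList<>(addedAt.keySet());
    }
}
```

In production code you would typically construct it as `new ExpiringProxyPool(Duration.ofMinutes(15), Clock.systemUTC())` and call `fresh()` each time before picking a proxy.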
Pitfall 2: Requests that are too frequent get banned - Add a random wait time to the code:
Thread.sleep((long) (Math.random() * 3000)); // random pause of 0 to 3000 ms
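A purely random pause starting at 0 ms can sometimes be close to no pause at all. The sketch below adds a lower bound. `PoliteDelay` and its methods are hypothetical helper names, and the bounds are yours to choose; nothing here comes from ipipgo.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: picks a randomized pause between requests. */
final class PoliteDelay {
    private PoliteDelay() {}

    /** Returns a delay in milliseconds, uniformly chosen from [minMs, maxMs]. */
    static long nextDelayMs(long minMs, long maxMs) {
        if (minMs < 0 || maxMs < minMs) {
            throw new IllegalArgumentException("need 0 <= minMs <= maxMs");
        }
        // the upper bound of nextLong is exclusive, hence the + 1
        return ThreadLocalRandom.current().nextLong(minMs, maxMs + 1);
    }

    /** Sleeps for a randomized interval; restores the interrupt flag if interrupted. */
    static void pause(long minMs, long maxMs) {
        try {
            Thread.sleep(nextDelayMs(minMs, maxMs));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling `PoliteDelay.pause(1000, 3000)` between requests would then wait at least one second and at most three.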
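Pitfall 3: HTTPS certificate issues - Add this configuration during initialization:
// ipipgoSSLContext() is a helper you supply yourself; it is assumed to return a javax.net.ssl.SSLContext
Connection connection = Jsoup.connect(url)
        .sslSocketFactory(ipipgoSSLContext().getSocketFactory());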
IV. Frequently Asked Questions
| Question | Answer |
| --- | --- |
| What should I do if parsing always times out? | Set the response timeout for ipipgo requests to 15000 ms; their API's average response time is only 800 ms. |
| What if I need a highly anonymous proxy? | Go with ipipgo's Enterprise Package, which handles the X-Forwarded-For request header automatically. |
V. Performance Optimization Tips
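1. Reduce repeated handshakes by reusing a single session across requests:
// Requires Jsoup 1.14.1 or later; create the session once and share it
Connection session = Jsoup.newSession()
        .proxy("gateway.ipipgo.com", 9021);
// Each request inherits the session's proxy settings and cookies
Connection.Response res = session.newRequest()
        .url(url)
        .execute();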
2. With ipipgo's exclusive IP pool, actual parsing speed is more than 3 times faster.
3. Remember to clean up invalid IPs regularly; you can use the status detection API they provide.
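The status detection API is not described in this article, so the sketch below swaps in a plain TCP connect test as a stand-in. It only shows that the port accepts connections, not that the proxy actually forwards traffic. `ProxyCleaner`, `keepAlive`, and `tcpReachable` are hypothetical names; the check is passed in as a predicate so it can be replaced with a call to the real status API.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical helper for pruning dead proxies from a list of "ip:port" strings. */
final class ProxyCleaner {
    private ProxyCleaner() {}

    /** Keeps only the entries for which the supplied liveness check passes. */
    static List<String> keepAlive(List<String> addresses, Predicate<String> isAlive) {
        return addresses.stream().filter(isAlive).collect(Collectors.toList());
    }

    /** Stand-in liveness check: can a TCP connection be opened within timeoutMs? */
    static boolean tcpReachable(String address, int timeoutMs) {
        int colon = address.lastIndexOf(':');
        if (colon <= 0) {
            return false; // no host part or no port separator
        }
        try (Socket socket = new Socket()) {
            int port = Integer.parseInt(address.substring(colon + 1));
            socket.connect(new InetSocketAddress(address.substring(0, colon), port), timeoutMs);
            return true;
        } catch (IOException | IllegalArgumentException e) {
            // connection refused, timeout, unresolvable host, or malformed port
            return false;
        }
    }
}
```

A periodic cleanup could then be written as `pool = ProxyCleaner.keepAlive(pool, a -> ProxyCleaner.tcpReachable(a, 2000));`, where `pool` is your own list of addresses.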
VI. Honest talk
The most troublesome part of writing your own parser isn't the code, it's maintaining the quality of the proxy IPs. I used a couple of free services before, and 8 out of 10 IPs were dead. After I switched to ipipgo's Dynamic Residential IP, the parsing success rate jumped from 50% to 95%, which is a real relief: no more fiddling with retry logic all day.

