
A Hands-On Guide to Web Crawling with Jsoup
Whether you are collecting data or doing competitive analysis, sooner or later you will need to write a web crawler in Java. Today we take Jsoup as our tool and focus on how to use proxy IPs so the target site does not blacklist you. The practical examples use ipipgo's residential proxy service, whose dynamic IP pool has been stable in practice.
Jsoup Basic Configuration
First, we need to know how to attach a proxy to Jsoup. The key is to set the proxy parameters on the Connection object. The code looks like this:
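import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect("destination URL")
        .proxy("proxy.ipipgo.io", 9020)   // proxy gateway host and port
        .userAgent("Mozilla/5.0...")
        .timeout(30000)                   // timeout in milliseconds (30 seconds)
        .get();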
Note that the proxy method takes the gateway address and port provided by ipipgo. New users can get a free 20 MB traffic pack, which is enough for the testing phase. If you run into SSL certificate problems, remember to configure the certificate through connection.sslSocketFactory().
Proxy IP Practical Tips
The biggest fear when scraping is getting your IP blocked, and this is where rotating through a proxy IP pool comes in. Using ipipgo's random allocation mode, the code looks like this:
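String[] proxyPool = {"s1.ipipgo.io:9010", "s2.ipipgo.io:9012" /* ... */};
Random rand = new Random();
// Pick a random "host:port" entry and split it, because proxy() takes host and port separately
String[] hostPort = proxyPool[rand.nextInt(proxyPool.length)].split(":");
Connection conn = Jsoup.connect(url)
        .proxy(hostPort[0], Integer.parseInt(hostPort[1]));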
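Jsoup's Connection.proxy also has an overload that accepts a java.net.Proxy, so the pool entries can be parsed once up front instead of on every request. Here is a minimal sketch, assuming entries in the host:port form shown above. The ProxyPool class and its methods are hypothetical helpers for illustration, not part of Jsoup or ipipgo:

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: parses "host:port" strings once and hands out random proxies.
final class ProxyPool {
    private final List<Proxy> proxies = new ArrayList<>();

    ProxyPool(String... entries) {
        if (entries.length == 0) {
            throw new IllegalArgumentException("proxy pool must not be empty");
        }
        for (String entry : entries) {
            proxies.add(parse(entry));
        }
    }

    // Splits on the last ':' so the trailing number is always treated as the port.
    static Proxy parse(String entry) {
        int idx = entry.lastIndexOf(':');
        if (idx <= 0 || idx == entry.length() - 1) {
            throw new IllegalArgumentException("Expected host:port but got: " + entry);
        }
        String host = entry.substring(0, idx);
        int port = Integer.parseInt(entry.substring(idx + 1));
        // createUnresolved avoids a DNS lookup at construction time
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    // Returns a randomly chosen proxy from the pool.
    Proxy next() {
        return proxies.get(ThreadLocalRandom.current().nextInt(proxies.size()));
    }

    int size() {
        return proxies.size();
    }
}
```

With a pool built as new ProxyPool("s1.ipipgo.io:9010", "s2.ipipgo.io:9012"), each request could then call Jsoup.connect(url).proxy(pool.next()).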
Latency on ipipgo's residential proxies is generally kept within 200 ms, which is much more reliable than many other proxies. If you are collecting from e-commerce sites, remember to leave 3-5 seconds between requests. If you request too frequently, nothing will save you from being blocked.
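The 3-5 second gap between requests can be randomized so that requests do not arrive on a fixed beat. This is a small sketch of that idea. The Throttle class is a hypothetical helper, and the bounds are simply the values from the text:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper that spaces requests by a random delay in [minMillis, maxMillis].
final class Throttle {
    private final long minMillis;
    private final long maxMillis;

    Throttle(long minMillis, long maxMillis) {
        if (minMillis < 0 || maxMillis < minMillis) {
            throw new IllegalArgumentException("need 0 <= min <= max");
        }
        this.minMillis = minMillis;
        this.maxMillis = maxMillis;
    }

    // Picks the next delay; the upper bound is inclusive.
    long nextDelayMillis() {
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    // Blocks the calling thread for one randomized delay.
    void pause() throws InterruptedException {
        Thread.sleep(nextDelayMillis());
    }
}
```

A crawler loop would create one new Throttle(3000, 5000) and call pause() before each request.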
| Problem | Solution |
|---|---|
| Connection timeout | Switch to ipipgo's BGP line |
| 403 returned | Clear cookies and switch city nodes |
| Incomplete data | Check the CSS selector and enable JS rendering |
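For the timeout and 403 rows, a common pattern is to retry the same URL through a different proxy. The sketch below makes some assumptions: the Fetcher interface and RetryingFetcher class are hypothetical, and the real fetch would be a Jsoup call, which by default throws an IOException subclass on a timeout or an HTTP error status:

```java
import java.io.IOException;
import java.util.List;

// Hypothetical abstraction over "fetch this URL through this proxy".
interface Fetcher {
    String fetch(String url, String proxy) throws IOException;
}

final class RetryingFetcher {
    private final Fetcher fetcher;
    private final List<String> proxies;

    RetryingFetcher(Fetcher fetcher, List<String> proxies) {
        this.fetcher = fetcher;
        this.proxies = proxies;
    }

    // Tries each proxy in order until one succeeds; rethrows the last failure otherwise.
    String fetchWithRotation(String url) throws IOException {
        IOException last = new IOException("proxy list is empty");
        for (String proxy : proxies) {
            try {
                return fetcher.fetch(url, proxy);
            } catch (IOException e) {
                last = e; // remember the failure and move on to the next proxy
            }
        }
        throw last;
    }
}
```

Keeping the fetch behind an interface means the rotation logic can be exercised with a fake fetcher before any real traffic is spent.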
Common Pitfalls and How to Avoid Them
Beginners most often trip up in these areas:
- No User-Agent is set, so the request is recognized as a crawler
- Consecutive requests from the same IP get it blacklisted
- Dynamically loaded data is not captured
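Combining ipipgo's long-lived static IPs with Selenium takes care of dynamically loaded content. Attach the proxy when you start the browser. Chrome's --proxy-server flag does not accept embedded user:pass credentials, so this assumes the gateway authorizes your machine some other way, such as an IP allowlist:
ChromeOptions options = new ChromeOptions().addArguments("--proxy-server=http://s1.ipipgo.io:9010");
WebDriver driver = new ChromeDriver(options);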
Q&A
Q: What should I do if my proxy IP suddenly fails?
A: First check your plan balance in the ipipgo dashboard. Their residential plans automatically switch to a backup channel when the traffic runs out. If that does not help, contact customer service to replace the authorization key.
Q: What is the difference between a free proxy and a paid proxy?
A: Take ipipgo's trial version and commercial version as an example. IPs in the commercial version stay alive three times longer, and it comes with a dedicated API for extracting IPs, so you will not be left unable to obtain one.
Q: How do I test if a proxy IP is anonymous?
A: Visit http://httpbin.org/ip. If it returns your real IP, the proxy is not working. ipipgo's high-anonymity proxies should not leak your local information.
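The httpbin check can be automated by comparing the address seen on a direct request with the one seen through the proxy. This sketch uses only the JDK's java.net.http client (Java 11 or later) rather than Jsoup. The ProxyCheck class is hypothetical, and proxy authentication is left out:

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProxyCheck {
    // httpbin.org/ip answers with JSON of the form {"origin": "203.0.113.7"}
    private static final Pattern ORIGIN =
            Pattern.compile("\"origin\"\\s*:\\s*\"([^\"]+)\"");

    // Extracts the reported origin, or returns null if the body has no origin field.
    static String parseOrigin(String body) {
        Matcher m = ORIGIN.matcher(body);
        return m.find() ? m.group(1).trim() : null;
    }

    // Asks httpbin which IP it sees; pass null as proxyHost to connect directly.
    static String visibleIp(String proxyHost, int proxyPort) throws Exception {
        HttpClient.Builder builder = HttpClient.newBuilder();
        if (proxyHost != null) {
            builder.proxy(ProxySelector.of(new InetSocketAddress(proxyHost, proxyPort)));
        }
        HttpRequest request =
                HttpRequest.newBuilder(URI.create("http://httpbin.org/ip")).GET().build();
        HttpResponse<String> response =
                builder.build().send(request, HttpResponse.BodyHandlers.ofString());
        return parseOrigin(response.body());
    }

    // The proxy hides you only if the proxied origin does not mention the direct IP.
    // contains() is a defensive choice in case the origin lists more than one address.
    static boolean isMasked(String directIp, String proxiedOrigin) {
        return directIp != null && proxiedOrigin != null
                && !proxiedOrigin.contains(directIp);
    }
}
```

In use, call visibleIp(null, 0) once for the direct address, then visibleIp("s1.ipipgo.io", 9010) for the proxied one, and pass both to isMasked.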
Performance Optimization Tips
If you want to collect quickly, you need multi-threading. Manage the threads with a thread pool and give each thread its own proxy IP. Here is a tip: store the IP list returned by ipipgo's API in a blocking queue, and have each thread take an IP from it as needed.
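ExecutorService pool = Executors.newFixedThreadPool(10);
while (!urlQueue.isEmpty()) {
    String url = urlQueue.poll(); // remove the URL so the loop terminates
    pool.execute(() -> {
        try {
            String proxy = ipQueue.take(); // blocks until a proxy is available
            // Capture logic: fetch url through proxy
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt status
        }
    });
}
pool.shutdown(); // stop accepting tasks once everything is queued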
Remember to set connection.timeout(15000): if there is no response within 15 seconds, give up and move on to the next IP.
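Putting the thread pool and the blocking queue together, the sketch below also hands each proxy back to the queue after use so a small pool can serve many URLs. The CrawlPool class and the fetch parameter are hypothetical stand-ins for the real capture logic, and the one-minute wait is an arbitrary choice:

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BiFunction;

final class CrawlPool {
    // Runs fetch(url, proxy) for every URL on `threads` workers and collects the results.
    static Queue<String> crawl(List<String> urls, BlockingQueue<String> ipQueue,
                               int threads, BiFunction<String, String, String> fetch)
            throws InterruptedException {
        Queue<String> results = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.execute(() -> {
                String proxy = null;
                try {
                    proxy = ipQueue.take();              // wait for a free proxy
                    results.add(fetch.apply(url, proxy));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();  // preserve the interrupt status
                } finally {
                    if (proxy != null) {
                        ipQueue.offer(proxy);            // hand the proxy back for reuse
                    }
                }
            });
        }
        pool.shutdown();                                 // accept no new tasks
        pool.awaitTermination(1, TimeUnit.MINUTES);      // wait for the workers to finish
        return results;
    }
}
```

This version returns every proxy to the queue, including ones that just failed. A production crawler would more likely drop a proxy after repeated failures instead of recycling it.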
Finally, using ipipgo's custom SDK can save a lot of work, since it already wraps automatic IP rotation and a retry-on-error mechanism. For large-scale collection in particular, this is more reliable than reinventing the wheel. After all, specialized work is best left to specialists.

