Java web crawler: a Jsoup HTML parsing tutorial

Hands-on: web crawling with Jsoup

Whether you are collecting data or doing competitive analysis, writing a web crawler in Java is a common need. This article walks through Jsoup, focusing on how to use proxy IPs to avoid being blocked by the target site. The practical examples use ipipgo's proxy service, whose dynamic IP pool has proven stable in our use.

Jsoup Basic Configuration

First, we need to know how to attach a proxy to a Jsoup request. The key is to set the proxy parameters on the Connection object. The code looks like this:

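// Requires: import org.jsoup.Jsoup; import org.jsoup.nodes.Document;
Document doc = Jsoup.connect("destination URL")
               .proxy("proxy.ipipgo.io", 9020)   // proxy gateway host and port
               .userAgent("Mozilla/5.0...")
               .timeout(30000)                   // timeout in milliseconds
               .get();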

Note that the proxy method takes the gateway address and port provided by ipipgo. New users can get a free 20M traffic pack, which is enough for the testing phase. If you run into SSL certificate problems, remember to supply a custom SSLSocketFactory through connection.sslSocketFactory().

Proxy IP Practical Tips

The biggest worry when collecting data is getting your IP blocked. To avoid that, rotate through a pool of proxy IPs. Using ipipgo's random allocation mode, the code looks like this:

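String[] proxyPool = {"s1.ipipgo.io:9010", "s2.ipipgo.io:9012"...};
Random rand = new Random();
// Jsoup's proxy() takes a host and a port, so split the chosen "host:port" entry
String[] hostPort = proxyPool[rand.nextInt(proxyPool.length)].split(":");
Connection conn = Jsoup.connect(url)
                    .proxy(hostPort[0], Integer.parseInt(hostPort[1]));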
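
Jsoup's proxy() accepts either a host plus a port or a java.net.Proxy, not a single "host:port" string, so each pool entry has to be split before use. Below is a minimal sketch using only the standard library. The class name ProxyPicker and the two sample entries are illustrative assumptions, not part of Jsoup or ipipgo. The resulting java.net.Proxy could be passed to Jsoup's Connection.proxy(Proxy).

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.List;
import java.util.Random;

// Hypothetical helper: picks a random "host:port" entry and turns it into a Proxy.
class ProxyPicker {
    private final List<String> entries;
    private final Random random;

    ProxyPicker(List<String> entries, Random random) {
        if (entries.isEmpty()) throw new IllegalArgumentException("empty pool");
        this.entries = List.copyOf(entries);
        this.random = random;
    }

    /** Turns "host:port" into an HTTP proxy without doing a DNS lookup. */
    static Proxy parse(String entry) {
        int colon = entry.lastIndexOf(':');
        if (colon <= 0 || colon == entry.length() - 1) {
            throw new IllegalArgumentException("expected host:port, got " + entry);
        }
        String host = entry.substring(0, colon);
        int port = Integer.parseInt(entry.substring(colon + 1));
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    /** Picks one pool entry at random. */
    Proxy next() {
        return parse(entries.get(random.nextInt(entries.size())));
    }

    public static void main(String[] args) {
        ProxyPicker picker = new ProxyPicker(
                List.of("s1.ipipgo.io:9010", "s2.ipipgo.io:9012"), new Random());
        System.out.println(picker.next().address());
    }
}
```

Validating entries up front means a malformed pool entry fails with a clear message instead of surfacing later as a connection error.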

The latency of ipipgo's residential proxies generally stays within 200 ms, which makes them more dependable than slower proxies. If you are collecting from e-commerce websites, remember to leave 3-5 seconds between requests. Request too frequently and nothing will save you from a block.
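
The 3-5 second pause can be randomized so that requests do not arrive at a fixed rhythm. Below is a minimal sketch. The class PoliteDelay and its method names are assumptions for illustration, not part of Jsoup or ipipgo.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for spacing out requests.
class PoliteDelay {
    static final long MIN_MILLIS = 3_000;
    static final long MAX_MILLIS = 5_000;

    /** Returns a random pause between 3 and 5 seconds (inclusive), in milliseconds. */
    static long nextDelayMillis() {
        return ThreadLocalRandom.current().nextLong(MIN_MILLIS, MAX_MILLIS + 1);
    }

    /** Sleeps for one random pause; call this between two requests. */
    static void pause() throws InterruptedException {
        Thread.sleep(nextDelayMillis());
    }
}
```

Calling PoliteDelay.pause() after each request keeps a single crawler thread inside the suggested interval.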

Problem type          Fix
Connection timeout    Switch to ipipgo's BGP line
Return 403            Clear cookies and switch to a different city node
Incomplete data       Check the CSS selector and enable JS rendering

A guide to avoiding common pitfalls

Beginners most often stumble in these areas:

  1. No User-Agent is set, so the request is recognized as a crawler.
  2. Successive requests from the same IP get it blacklisted.
  3. Dynamically loaded data is not captured.

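ipipgo's long-lived static IPs combined with Selenium can handle dynamically loaded content. Attach the proxy when you start the browser. Chrome ignores user:pass credentials embedded in the --proxy-server flag, so the snippet below assumes the proxy authorizes you by IP whitelist:

ChromeOptions options = new ChromeOptions().addArguments("--proxy-server=http://s1.ipipgo.io:9010");
WebDriver driver = new ChromeDriver(options);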

Q&A

Q: What should I do if my proxy IP suddenly fails?
A: First check your package balance in the ipipgo dashboard. Their packages automatically switch to a backup channel when they run out. If that does not help, contact customer service to replace the authorization key.

Q: What is the difference between a free proxy and a paid proxy?
A: Comparing ipipgo's trial version with its commercial version: IPs in the commercial version stay alive about three times longer, and there is a dedicated API extraction interface, so you will not be stuck unable to extract an IP.

Q: How do I test if a proxy IP is anonymous?
A: Visit http://httpbin.org/ip through the proxy. If it returns your real IP, the proxy is not working. According to ipipgo, its high-anonymity proxies do not leak local information.
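
httpbin.org/ip answers with a small JSON object whose origin field holds the address the server saw. The check described above can be sketched as a comparison between that field and your real public IP. The class name AnonymityCheck and the sample addresses are assumptions for illustration. Fetching the response body through the proxy is left out of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: inspects an httpbin.org/ip response body.
class AnonymityCheck {
    private static final Pattern ORIGIN =
            Pattern.compile("\"origin\"\\s*:\\s*\"([^\"]*)\"");

    /** Extracts the origin field from an httpbin.org/ip response body. */
    static String extractOrigin(String json) {
        Matcher m = ORIGIN.matcher(json);
        if (!m.find()) throw new IllegalArgumentException("no origin field in response");
        return m.group(1);
    }

    /** True when the body fetched through the proxy does not mention the real IP. */
    static boolean hidesRealIp(String bodyViaProxy, String realIp) {
        // origin may hold several comma-separated addresses if forwarding headers were added
        for (String seen : extractOrigin(bodyViaProxy).split(",")) {
            if (seen.trim().equals(realIp)) return false;
        }
        return true;
    }
}
```

To use it, fetch the page once without the proxy to learn your real IP, then fetch it again through the proxy and pass that body to hidesRealIp.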

Performance Optimization Tips

If you want to collect quickly, you need multithreading. It is recommended to manage threads with a thread pool and give each thread its own proxy IP. One tip: store the IP list returned by ipipgo's API in a blocking queue, and let each thread take an IP from it when needed.

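ExecutorService pool = Executors.newFixedThreadPool(10);
String next;
while ((next = urlQueue.poll()) != null) {   // poll() removes the URL, so the loop ends
   final String url = next;
   pool.execute(() -> {
      try {
         String proxy = ipQueue.take();      // blocks until a proxy is available
         // Capture logic for url through proxy
      } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
      }
   });
}
pool.shutdown();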
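
The blocking-queue idea can be sketched more fully using only java.util.concurrent. Each task borrows a proxy from the queue and puts it back in a finally block so that other threads can reuse it. The class ProxiedCrawl, the Fetcher interface, and the stub used in main are illustrative assumptions. A real fetcher would issue the Jsoup request.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

class ProxiedCrawl {
    /** Hypothetical hook: fetches one URL through one proxy and returns the body. */
    interface Fetcher {
        String fetch(String url, String proxy) throws Exception;
    }

    static Map<String, String> crawl(List<String> urls, List<String> proxies,
                                     int threads, Fetcher fetcher) {
        BlockingQueue<String> ipQueue = new LinkedBlockingQueue<>(proxies);
        Map<String, String> results = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.execute(() -> {
                String proxy = null;
                try {
                    proxy = ipQueue.take();                  // wait for a free proxy
                    results.put(url, fetcher.fetch(url, proxy));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (Exception e) {
                    // failed fetch: the URL is simply skipped in this sketch
                } finally {
                    if (proxy != null) ipQueue.offer(proxy); // return it for reuse
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, String> out = crawl(
                List.of("https://example.com/a", "https://example.com/b"),
                List.of("s1.ipipgo.io:9010", "s2.ipipgo.io:9012"),
                2, (url, proxy) -> "fetched " + url + " via " + proxy);
        out.forEach((url, body) -> System.out.println(body));
    }
}
```

Because proxies are returned to the queue, the pool can have more threads than proxies. Extra threads wait in take() until one is free.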

Remember to set connection.timeout(15000): if there is no response within 15 seconds, give up and move on to the next IP.
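
The "give up and move on to the next IP" rule can be sketched as a loop over candidate proxies. The names ProxyFailover and Attempt are assumptions for illustration. In a real crawler the attempt would be a Jsoup request with timeout(15000), where a timeout surfaces as an IOException.

```java
import java.io.IOException;
import java.util.List;

class ProxyFailover {
    /** Hypothetical hook: one request through one proxy. */
    interface Attempt<T> {
        T run(String proxy) throws IOException;
    }

    /** Tries each proxy in order and returns the first successful result. */
    static <T> T firstSuccess(List<String> proxies, Attempt<T> attempt) {
        IOException last = null;
        for (String proxy : proxies) {
            try {
                return attempt.run(proxy);
            } catch (IOException e) {
                last = e;  // timed out or failed: move on to the next IP
            }
        }
        throw new IllegalStateException("all " + proxies.size() + " proxies failed", last);
    }
}
```

Keeping the last exception as the cause preserves the reason for the final failure when every proxy has been exhausted.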

Lastly, using ipipgo's customized SDK can save a lot of effort, since it already wraps automatic IP replacement and retry on errors. For large-scale collection in particular, this is more reliable than reinventing the wheel. After all, professional work is best left to professionals.

This article was originally published or compiled by ipipgo: https://www.ipipgo.com/en-us/ipdaili/32056.html
