
Don't Let IP Blocking Stop Your Crawler
Anyone who has written a web crawler knows the feeling: the crawler you worked so hard on suddenly stops working, and in all likelihood the site has blocked your IP. This is where proxy IPs come to the rescue. A provider that specializes in high-quality proxies, such as ipipgo, can make data collection much smoother.
Three steps to get started with Nokogiri
First, install the Nokogiri library by running gem install nokogiri on the command line, and that's it. Then remember the three basic moves:
1. Fetch the page content with open-uri
2. Feed the content to Nokogiri for parsing
3. Pick out data with CSS selectors, the way you would pick out clothes
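require 'nokogiri'
require 'open-uri'
# 'https://example.com' is a placeholder; replace it with the target site
doc = Nokogiri::HTML(URI.open('https://example.com'))
# Print the combined text of every <h1 class="title"> element
puts doc.css('h1.title').text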
Putting a Proxy Disguise on Your Crawler
Straight to the code, using ipipgo's proxy for the demo. Pay attention to the two parameters proxy_user and proxy_pass: just replace them with the authentication details you got from the ipipgo dashboard.
proxy_host = 'gateway.ipipgo.com'
proxy_port = 9021
proxy_user = 'Your account'
proxy_pass = 'Your password'
# open-uri does not accept http_proxyaddr-style keys; it takes the proxy URL
# and credentials together through :proxy_http_basic_authentication
options = {
  proxy_http_basic_authentication: ["http://#{proxy_host}:#{proxy_port}", proxy_user, proxy_pass]
}
# 'https://example.com' is a placeholder; replace it with the target site
doc = Nokogiri::HTML(URI.open('https://example.com', options))
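Hard-coded credentials are fine for a demo, but you may not want them in your source. Here is a minimal sketch of reading them from the environment instead. The helper name proxy_options and the variable names PROXY_HOST, PROXY_PORT, PROXY_USER and PROXY_PASS are made up for illustration and are not something ipipgo prescribes. The :proxy_http_basic_authentication option itself is part of Ruby's open-uri.

```ruby
require 'open-uri'

# Build the options hash that URI.open expects for an authenticated HTTP proxy.
# The PROXY_* names are hypothetical; use whatever your deployment defines.
def proxy_options(env = ENV)
  host = env.fetch('PROXY_HOST', 'gateway.ipipgo.com')
  port = Integer(env.fetch('PROXY_PORT', '9021'))
  user = env.fetch('PROXY_USER') # raises KeyError when the variable is missing
  pass = env.fetch('PROXY_PASS')
  {
    proxy_http_basic_authentication: ["http://#{host}:#{port}", user, pass]
  }
end
```

With the variables exported in your shell, a call such as URI.open(url, proxy_options) should go through the proxy. A missing credential fails loudly with a KeyError instead of silently falling back to a direct connection.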
| Approach | Success rate | Maintenance cost |
|---|---|---|
| Direct connection | 30% | Code changes every day |
| Generic proxy | 60% | Weekly IP changes |
| ipipgo proxy | 95%+ | Almost none |
A practical guide to avoiding pitfalls
Don't panic when you run into a CAPTCHA; try these three tricks:
1. Reduce the request frequency by adding sleep(3) between requests
2. Change the User-Agent; don't use the same one all the time
3. Use ipipgo's dynamic residential proxies so that visits look like they come from real users
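The first two tricks can be combined into one small helper. This is only a sketch: the names USER_AGENTS, with_random_user_agent and polite_fetch are invented for this example, and the User-Agent strings are sample values you should replace with ones you trust. It relies on open-uri treating string keys in the options hash as request headers.

```ruby
require 'open-uri'

# Sample User-Agent strings; illustrative values only.
USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'
].freeze

# Default fetcher: a plain open-uri request that returns the response body.
DEFAULT_FETCHER = ->(url, opts) { URI.open(url, opts).read }

# Return a copy of the options with a randomly chosen User-Agent header added.
def with_random_user_agent(options = {}, agents = USER_AGENTS)
  options.merge('User-Agent' => agents.sample)
end

# Fetch each URL in turn, sleeping `delay` seconds between requests.
# `fetcher` is injectable so the loop can be exercised without network access.
def polite_fetch(urls, options: {}, delay: 3, fetcher: DEFAULT_FETCHER)
  urls.each_with_index.map do |url, i|
    sleep(delay) if i.positive?
    fetcher.call(url, with_random_user_agent(options))
  end
end
```

Passing your proxy options through options: covers the third trick too, since every request then gets the proxy, the delay and a fresh User-Agent.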
Frequently asked questions
Q: Can't I just use free proxies?
A: Nine out of ten free proxies are traps: they are either as slow as a tortoise or die after two minutes of use. For serious work, a paid service like ipipgo is more reliable.
Q: What can I do if the proxy is slow?
A: Pick a node close to the target server. For example, to crawl Japanese websites, use ipipgo's Tokyo data center. You can also check each node's latency in the dashboard and pick the ones marked in green.
Q: How can I tell whether the proxy IP is in effect?
A: Add a test to the code:
puts URI.open('http://ipinfo.io/ip', options).read
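Printing the IP works, but you still have to compare it with your own by eye. Here is a rough sketch that does the comparison for you. The helper name proxy_in_effect? is made up for this example, and it assumes the ipinfo.io endpoint returns the caller's IP as plain text. The direct request passes proxy: false, which open-uri documents as disabling any proxy.

```ruby
require 'open-uri'

IP_ECHO_URL = 'http://ipinfo.io/ip'

# Default fetcher: ask the echo service which IP it sees for the given options.
DEFAULT_IP_FETCHER = ->(opts) { URI.open(IP_ECHO_URL, opts).read }

# Return true when the IP seen through the proxy differs from the direct one.
# `fetcher` is injectable so the check can be tested offline.
def proxy_in_effect?(proxy_opts, fetcher: DEFAULT_IP_FETCHER)
  direct_ip  = fetcher.call({ proxy: false }).strip # bypass any proxy, including env settings
  proxied_ip = fetcher.call(proxy_opts).strip
  !proxied_ip.empty? && proxied_ip != direct_ip
end
```

A call such as proxy_in_effect?(options) before a long crawl gives you an early warning when the proxy settings are wrong.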
Essential skills for leveling up
When you come across a particularly difficult site, try ipipgo's session hold feature. It keeps the same exit IP for 20 minutes, which suits scenarios where you need to stay logged in. Combined with their intelligent routing, which automatically selects the fastest line, collection efficiency can double.
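If you depend on that 20-minute window, it helps to know how much of it is left. The sketch below is a purely client-side timer: the class name StickySessionTimer is invented for this example, and it does not call any ipipgo API. It only counts down from the moment you start the session, so you can log in again before the exit IP may change.

```ruby
# Client-side countdown for a sticky proxy session (hypothetical helper).
class StickySessionTimer
  MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  # ttl_seconds defaults to the 20-minute hold window; `clock` is injectable for tests.
  def initialize(ttl_seconds = 20 * 60, clock: MONOTONIC_CLOCK)
    @ttl = ttl_seconds
    @clock = clock
    @started_at = @clock.call
  end

  # Seconds left before the hold window ends (never negative).
  def remaining
    [@ttl - (@clock.call - @started_at), 0].max
  end

  # True once the window has passed and a fresh login is advisable.
  def expired?
    remaining.zero?
  end

  # Call after re-establishing the session, for example after logging in again.
  def restart!
    @started_at = @clock.call
  end
end
```

Checking expired? before each batch of requests, and calling restart! after logging in again, keeps the crawler from reusing a login on a changed IP.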
Finally, a painful lesson: last year I took on a cross-border e-commerce project and didn't bother to buy a proxy service. As a result, maintaining my own IP pool nearly wore me out. After switching to ipipgo I save 40 hours of debugging time every month, so the money is absolutely worth it.

