
When your crawler runs into a CAPTCHA? Try this.
Recently I helped a friend put together a price-monitoring script. I wrote the crawler in Ruby, but the next day I hit a snag: the target website had blocked our IP. Only then did I remember the whole proxy IP thing, like sitting down to hot pot and realizing there's no dipping sauce, so I scrambled to find a solution on the spot.
How does this Nokogiri thing work?
Before we talk about proxies, we need the basic tool. Nokogiri is an HTML parser, and it's easy to install:
gem install nokogiri
For example, say you want to grab a product's price from an e-commerce page. The code looks roughly like this:
require 'nokogiri'
require 'open-uri'
html = URI.open('https://example.com/product').read
doc = Nokogiri::HTML(html)
price = doc.css('span.price-class').first.text
puts "Current price: {price}"
Note that getting the CSS selector right is like fitting a key into its slot. Right-clicking an element in Chrome Developer Tools and choosing Copy selector saves a lot of work.
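One more thing: if the selector misses, the .first.text call above raises NoMethodError on nil. Here is a slightly more defensive sketch (span.price-class is still a made-up selector for illustration):
require 'nokogiri'
require 'open-uri'

html = URI.open('https://example.com/product').read
doc = Nokogiri::HTML(html)

# at_css returns the first match or nil, so we can guard before calling .text
node = doc.at_css('span.price-class')
if node
  puts "Current price: #{node.text.strip}"
else
  puts "Selector matched nothing - has the page layout changed?"
end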
What to do if your IP is blocked? Proxy IP to the rescue
Here's the point! High-frequency access from a single IP is like sneaking around the neighborhood a dozen times in the middle of the night: if the security guards don't stare at you, who would they stare at? This is where a proxy service like ipipgo comes in to cover your tracks.
Here's the remodeled script:
require 'nokogiri'
require 'open-uri'

proxy_list = [
  'http://username:password@gateway.ipipgo.com:8080',
  'http://username:password@gateway.ipipgo.com:8081'
]

5.times do |i|
  begin
    html = URI.open('https://target-site.com',
      proxy: proxy_list.sample,  # pick a random exit each time
      'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0)'
    ).read
    # Parsing code is the same as above
  rescue => e
    puts "Attempt #{i + 1} failed: #{e.message}"
  end
end
This uses the multiple exit IPs that ipipgo provides, picking one at random for each request. It's like fighting a guerrilla war: fire a shot, then change position.
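One caveat: in my experience, open-uri may not pick up a username and password embedded in the proxy URL itself. If authentication fails, open-uri's documented :proxy_http_basic_authentication option passes the credentials separately (same placeholder gateway as above):
require 'open-uri'

# Credentials go in their own option instead of inside the proxy URL
html = URI.open('https://target-site.com',
  proxy_http_basic_authentication: [
    'http://gateway.ipipgo.com:8080',  # proxy address without userinfo
    'username',
    'password'
  ]
).read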
A practical guide to avoiding pitfalls
Here are a few traps newcomers commonly step in:
| Problem | Fix |
|---|---|
| SSL certificate errors | Add `ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE` to the request |
| Load timeouts | Set the `read_timeout` option; 10-30 seconds is a sensible range |
| Blocked User-Agent | Generate random browser fingerprints with the Faker gem |
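All three fixes can live in one URI.open call. A minimal sketch, assuming a recent faker gem that provides Faker::Internet.user_agent (check your version) and a 15-second timeout picked from the range above:
require 'open-uri'
require 'openssl'
require 'faker'  # gem install faker

html = URI.open('https://target-site.com',
  # Skip certificate verification (OK for testing, think twice in production)
  ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE,
  # Give up instead of hanging forever on a slow proxy
  read_timeout: 15,
  # A fresh random browser fingerprint on every request
  'User-Agent' => Faker::Internet.user_agent
).read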
A few questions you might ask:
Q: Can't I just use free proxies?
A: Free proxies are like public restrooms: anyone can use them, so they get blocked easily. For commercial scenarios I'd still recommend a professional service like ipipgo, with a large and stable IP pool.
Q: What should I do if my proxy is slow?
A: Choose a node geographically close to the target, for example crawling domestic sites from an East China data center. ipipgo's dashboard lets you pick the exit region yourself, which is quite convenient.
Q: How can I tell whether the proxy is actually in effect?
A: Add `puts html[0..100]` to the script to print the start of the page and check that the content looks right, or hit a third-party website to check the exit IP.
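For example, with a public echo service like httpbin.org (the gateway address is the same placeholder as above):
require 'open-uri'
require 'json'

# Ask an echo service which IP it sees; with a working proxy,
# this prints the proxy's exit IP instead of your own.
body = URI.open('https://httpbin.org/ip',
  proxy: 'http://gateway.ipipgo.com:8080'
).read
puts JSON.parse(body)['origin']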
Advanced play
What about dynamically loaded data? You can bring in selenium-webdriver, which makes the proxy setup even more thorough:
require 'selenium-webdriver'
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--proxy-server=http://gateway.ipipgo.com:8080')
driver = Selenium::WebDriver.for :chrome, options: options
driver.navigate.to "https://target-site.com"
This way even JS-rendered pages can be captured; it's like fitting the crawler with a telescope.
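Since driver.page_source returns the fully rendered HTML, you can hand it straight back to Nokogiri and keep the same parsing code. A sketch building on the block above (the selector and wait time are placeholders):
require 'selenium-webdriver'
require 'nokogiri'

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--proxy-server=http://gateway.ipipgo.com:8080')
driver = Selenium::WebDriver.for :chrome, options: options

driver.navigate.to 'https://target-site.com'

# Wait up to 10 seconds for the JS-rendered element to show up
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.find_element(css: 'span.price-class') }

# Hand the rendered HTML back to Nokogiri; parsing works as before
doc = Nokogiri::HTML(driver.page_source)
puts doc.at_css('span.price-class')&.text

driver.quit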
Finally, using a proxy IP is like wearing a seatbelt: a bit of a hassle day to day, but it can save your life at the critical moment. Especially for commercial crawlers, don't skimp on this budget. ipipgo has trial packages for new users, which makes the lessons a lot cheaper to learn.

