
When your crawler runs into a CAPTCHA? Try this.
Recently I helped a friend put together a price-monitoring script. I wrote the crawler in Ruby, but the next day I hit a snag: the target website had blocked our IP. Only then did I remember the whole proxy IP thing, like sitting down to hot pot and realizing there's no dipping sauce, so I scrambled to find a solution on the spot.
How does this Nokogiri thing work?
Before we talk about proxies, we need the basic tool. Nokogiri is an HTML parser, and it's easy to install:
gem install nokogiri
For example, say you want to grab a product's price from an e-commerce page. The code looks roughly like this:
require 'nokogiri'
require 'open-uri'
html = URI.open('https://example.com/product').read
doc = Nokogiri::HTML(html)
price = doc.css('span.price-class').first.text
puts "Current price: {price}"
Note that getting the CSS selector right is like fitting a key into its slot. Right-clicking an element in Chrome Developer Tools and choosing Copy selector saves a lot of work.
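One more thing: if the selector misses, the .first.text call above raises NoMethodError on nil. Here is a slightly more defensive sketch (span.price-class is still a made-up selector for illustration):
require 'nokogiri'
require 'open-uri'

html = URI.open('https://example.com/product').read
doc = Nokogiri::HTML(html)

# at_css returns the first match or nil, so we can guard before calling .text
node = doc.at_css('span.price-class')
if node
  puts "Current price: #{node.text.strip}"
else
  puts "Selector matched nothing - has the page layout changed?"
end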
What to do if your IP is blocked? Proxy IP to the rescue
Here's the point! High-frequency access from a single IP is like sneaking around the neighborhood a dozen times in the middle of the night: if the security guards don't stare at you, who would they stare at? This is where a proxy service like ipipgo comes in to cover your tracks.
Here's the remodeled script:
require 'nokogiri'
require 'open-uri'

proxy_list = [
  'http://username:password@gateway.ipipgo.com:8080',
  'http://username:password@gateway.ipipgo.com:8081'
]

5.times do |i|
  begin
    html = URI.open('https://target-site.com',
      proxy: proxy_list.sample,  # pick a random exit each time
      'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0)'
    ).read
    # Parsing code is the same as above
  rescue => e
    puts "Attempt #{i + 1} failed: #{e.message}"
  end
end
This uses the multiple exit IPs that ipipgo provides, picking one at random for each request. It's like fighting a guerrilla war: fire a shot, then change position.
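One caveat: in my experience, open-uri may not pick up a username and password embedded in the proxy URL itself. If authentication fails, open-uri's documented :proxy_http_basic_authentication option passes the credentials separately (same placeholder gateway as above):
require 'open-uri'

# Credentials go in their own option instead of inside the proxy URL
html = URI.open('https://target-site.com',
  proxy_http_basic_authentication: [
    'http://gateway.ipipgo.com:8080',  # proxy address without userinfo
    'username',
    'password'
  ]
).read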
A practical guide to avoiding pitfalls
Here are a few traps newcomers commonly step in:
| Problem | Fix |
|---|---|
| SSL certificate errors | Add `ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE` to the request |
| Load timeouts | Set the `read_timeout` option; 10-30 seconds is a sensible range |
| Blocked User-Agent | Generate random browser fingerprints with the Faker gem |
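All three fixes can live in one URI.open call. A minimal sketch, assuming a recent faker gem that provides Faker::Internet.user_agent (check your version) and a 15-second timeout picked from the range above:
require 'open-uri'
require 'openssl'
require 'faker'  # gem install faker

html = URI.open('https://target-site.com',
  # Skip certificate verification (OK for testing, think twice in production)
  ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE,
  # Give up instead of hanging forever on a slow proxy
  read_timeout: 15,
  # A fresh random browser fingerprint on every request
  'User-Agent' => Faker::Internet.user_agent
).read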
A few questions you might ask:
Q: Can't I just use free proxies?
A: Free proxies are like public restrooms: anyone can use them, so they get blocked easily. For commercial scenarios I'd still recommend a professional service like ipipgo, with a large and stable IP pool.
Q: What should I do if my proxy is slow?
A: Choose a node geographically close to the target, for example crawling domestic sites from an East China data center. ipipgo's dashboard lets you pick the exit region yourself, which is quite convenient.
Q: How can I tell whether the proxy is actually in effect?
A: Add `puts html[0..100]` to the script to print the start of the page and check that the content looks right, or hit a third-party website to check the exit IP.
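For example, with a public echo service like httpbin.org (the gateway address is the same placeholder as above):
require 'open-uri'
require 'json'

# Ask an echo service which IP it sees; with a working proxy,
# this prints the proxy's exit IP instead of your own.
body = URI.open('https://httpbin.org/ip',
  proxy: 'http://gateway.ipipgo.com:8080'
).read
puts JSON.parse(body)['origin']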
Advanced play
What about dynamically loaded data? You can bring in selenium-webdriver, which makes the proxy setup even more thorough:
require 'selenium-webdriver'
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--proxy-server=http://gateway.ipipgo.com:8080')
driver = Selenium::WebDriver.for :chrome, options: options
driver.navigate.to "https://target-site.com"
This way even JS-rendered pages can be captured; it's like fitting the crawler with a telescope.
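Since driver.page_source returns the fully rendered HTML, you can hand it straight back to Nokogiri and keep the same parsing code. A sketch building on the block above (the selector and wait time are placeholders):
require 'selenium-webdriver'
require 'nokogiri'

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--proxy-server=http://gateway.ipipgo.com:8080')
driver = Selenium::WebDriver.for :chrome, options: options

driver.navigate.to 'https://target-site.com'

# Wait up to 10 seconds for the JS-rendered element to show up
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.find_element(css: 'span.price-class') }

# Hand the rendered HTML back to Nokogiri; parsing works as before
doc = Nokogiri::HTML(driver.page_source)
puts doc.at_css('span.price-class')&.text

driver.quit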
Finally, using a proxy IP is like wearing a seatbelt: a bit of a hassle day to day, but it can save your life at the critical moment. Especially for commercial crawlers, don't skimp on this budget. ipipgo has trial packages for new users, which makes the lessons a lot cheaper to learn.

