
When Crawlers Run Into Anti-Crawling, Proxy IPs Come to the Rescue
If you write crawlers in Ruby, you know Nokogiri is the community's go-to for parsing web pages. But lately plenty of friends have complained to me that their crawler script only ran for a couple of days before the target site banned their IP - like eating hot pot without dipping sauce, almost comical.
Last week a friend building a price-comparison system was scraping prices from an e-commerce platform, and three server IPs in a row got blocked. I had him try ipipgo's dynamic residential proxies, and the crawler came right back to life on the same site. The trick here is actually simple: with real residential user IP addresses, the site can't tell whether it's a machine or a real person.
Hands-On: Wiring a Proxy into Nokogiri
Let's start with a basic configuration template (remember to replace your_api_key with the real token from the ipipgo dashboard):

```ruby
require 'nokogiri'
require 'net/http'
require 'json'
require 'uri'

# Fetch a dynamic proxy from ipipgo
def fetch_proxy
  api_url = "https://api.ipipgo.com/v1/proxy?key=your_api_key&type=rotating"
  response = Net::HTTP.get(URI(api_url))
  JSON.parse(response)['proxies'].sample
end

proxy = fetch_proxy
uri = URI('https://target-site.com/')
# Net::HTTP takes proxy settings as positional arguments
Net::HTTP.start(uri.host, uri.port,
                proxy['ip'], proxy['port'],
                proxy['username'], proxy['password'],
                use_ssl: uri.scheme == 'https') do |http|
  doc = Nokogiri::HTML(http.get(uri.path).body)
  # subsequent parsing operations go here...
end
```
A few tips for avoiding pitfalls:
- Ideally fetch a fresh proxy for each request (ipipgo's auto-rotation feature helps a lot here)
- Mind the proxy protocol type; residential proxies have a higher success rate on e-commerce sites
- Don't set the timeout above 15 seconds, or it will drag down collection efficiency
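The first tip - a fresh proxy for every request - can be sketched like this. The API response shape follows the template above; `pick_proxy` and `fetch_with_fresh_proxy` are hypothetical helper names, and the timeout values are just examples kept under the 15-second ceiling:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Parse the proxy-list JSON returned by the API and pick one entry at random
# (response shape assumed to match the template above)
def pick_proxy(json_body)
  JSON.parse(json_body)['proxies'].sample
end

# Fetch a brand-new proxy, then make one request through it
def fetch_with_fresh_proxy(url, api_key)
  api_url = "https://api.ipipgo.com/v1/proxy?key=#{api_key}&type=rotating"
  proxy = pick_proxy(Net::HTTP.get(URI(api_url)))
  uri = URI(url)
  # Positional proxy arguments; timeouts stay well under 15 seconds
  Net::HTTP.start(uri.host, uri.port,
                  proxy['ip'], proxy['port'].to_i,
                  proxy['username'], proxy['password'],
                  use_ssl: uri.scheme == 'https',
                  open_timeout: 5, read_timeout: 10) do |http|
    http.get(uri.request_uri).body
  end
end
```

Each call to `fetch_with_fresh_proxy` exits the target site through a different IP, so per-IP rate limits never accumulate.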
Slick Moves from Real Projects
Scenario 1: Breaking Through Rate Limits
I was doing opinion monitoring for a client who needed to crawl a forum for new posts every hour. Using ipipgo's pay-per-volume package, randomly switching the User-Agent header on top of the proxy IP pool pulled the collection success rate from 37% up to 92%.
Scenario 2: Cracking Geo-Blocking
A local-life-services project needed to collect merchant data from different cities. ipipgo's city-level location proxies can accurately obtain IP addresses in a specified region, which let us bypass the site's geographic filtering.
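As a sketch of how city targeting might be requested, the helper below builds the API URL with an extra region parameter. The parameter name `region` is purely an assumption for illustration - check the ipipgo docs for the real one:

```ruby
require 'uri'

# Hypothetical: request a proxy IP located in a specific city by appending
# a region parameter to the API call (parameter name is an assumption)
def city_proxy_url(api_key, city)
  params = URI.encode_www_form(key: api_key, type: 'rotating', region: city)
  "https://api.ipipgo.com/v1/proxy?#{params}"
end
```

The returned URL plugs straight into the `Net::HTTP.get` call from the template above.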
| Anti-crawl type | Countermeasure | Recommended proxy type |
|---|---|---|
| IP rate limiting | Dynamic rotation + request intervals | Data center proxies |
| Geo-blocking | Static long-lived IPs | Residential proxies |
Five Questions You'll Definitely Ask
Q: Will proxy IPs slow down collection speed?
A: ipipgo's premium lines respond in 800 ms on average - much faster than retrying after getting blocked.
Q: How often should I rotate IPs?
A: For sites with strong anti-crawling, rotate on every request; for ordinary sites, every 5 minutes is fine.
Q: What if the target is an HTTPS website?
A: ipipgo's proxies fully support SSL connections; just remember to use URLs starting with https:// in your code.
Q: How do I manage proxies with multiple crawler threads running at once?
A: Use ipipgo's API to fetch proxy pools in batches, assigning each thread its own proxy.
Q: What's the difference between free proxies and paid ones?
A: Put it this way: free proxies are like public restrooms - anyone can use them, but when you really need one, every stall is taken. ipipgo's dedicated proxies are like a private bathroom: clean and stable.
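The one-proxy-per-thread idea can be sketched with a thread-safe queue as a checkout pool. `run_workers` is a hypothetical helper; it assumes the proxy list has already been fetched from the API in batch:

```ruby
# Each worker thread checks out its own proxy from a thread-safe queue,
# so no two threads ever share an IP at the same time
def run_workers(proxies, urls)
  pool = Queue.new
  proxies.each { |p| pool << p }
  results = Queue.new
  threads = urls.map do |url|
    Thread.new do
      proxy = pool.pop            # blocks until a proxy is free
      begin
        results << [url, proxy]   # real code would crawl url through proxy here
      ensure
        pool << proxy             # return the proxy for reuse
      end
    end
  end
  threads.each(&:join)
  Array.new(results.size) { results.pop }
end
```

With, say, 10 proxies and 50 URLs, at most 10 requests are in flight on distinct IPs at any moment, and finished threads hand their proxy to waiting ones.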
A Few Words from the Heart
Honestly, using proxy IPs is like running cheats in a game: the key is to act natural. Don't let your script click and grab like crazy while sitting on a data-center IP - that's plainly telling the site you're a robot. ipipgo's mixed-dial proxy pool, which blends residential, data center, and mobile IPs, is the high-level play.
One last piece of advice: don't be lazy about the User-Agent! I've seen people grab data with Nokogiri where every request's User-Agent reads Ruby/net-http - that's just asking to be banned. When using ipipgo proxies, remember to add a random User-Agent array to your code; that's table stakes for professional players.
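A random User-Agent array can be as simple as the sketch below. The strings are just sample desktop UAs (keep your own list fresh); `random_headers` is a hypothetical helper name:

```ruby
# A small pool of realistic desktop User-Agent strings (examples only);
# pick one at random per request so headers don't all read "Ruby/net-http"
USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' \
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 ' \
    '(KHTML, like Gecko) Version/17.0 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'
].freeze

# Build per-request headers with a randomly chosen User-Agent
def random_headers
  { 'User-Agent' => USER_AGENTS.sample,
    'Accept-Language' => 'en-US,en;q=0.9' }
end
```

Pass the hash as the second argument to `http.get`, e.g. `http.get(uri.path, random_headers)`, and every request goes out with a different browser fingerprint.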

