
What to do when your Ruby crawler hits anti-crawling defenses? Try this proxy IP trick
Anyone who writes crawlers knows how brutal it is when a site blocks your IP. Last week I wrote a script to scrape e-commerce prices. It ran happily at first, but by the next day it had gone quiet: the target site had blacklisted my IP. Time to bring out the proxy IP weapon, so today let's talk about how to run a proxied crawler in Ruby.
How do you actually plug a proxy IP into Ruby?
Using a proxy in Ruby is ridiculously easy; the exact setup just depends on which HTTP library you use. With HTTParty, for example, configuring a proxy is a matter of a few extra options:
require 'httparty'

response = HTTParty.get('https://target-site.com',
  http_proxyaddr: 'proxy IP assigned by ipipgo',
  http_proxyport: 8000,            # port number assigned by ipipgo
  http_proxyuser: 'account name',
  http_proxypass: 'password'
)
Careful, there's a pitfall here: many newcomers forget to set a timeout. Add a timeout: 30 parameter, otherwise the program can hang and you'll have no idea what happened.
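For reference, a minimal sketch with the timeout added, plus a rescue for the two timeout errors that can surface through HTTParty (the host, port, and credentials here are placeholders, not real ipipgo values):

```ruby
require 'httparty'

begin
  response = HTTParty.get('https://target-site.com',
    http_proxyaddr: 'proxy.ipipgo.example',  # placeholder proxy host
    http_proxyport: 8000,                    # placeholder port
    timeout: 30                              # give up after 30 seconds instead of hanging
  )
  puts response.code
rescue Net::OpenTimeout, Net::ReadTimeout => e
  puts "Proxy too slow or unreachable: #{e.message}"
end
```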
How do you choose between dynamic and static proxies? It depends on the scenario
ipipgo offers three plans; which one to pick depends on your business needs:
| Type | Suitable scenarios | Price |
|---|---|---|
| Dynamic residential (standard) | Routine data collection | 7.67 yuan/GB |
| Dynamic residential (business) | High-frequency access | 9.47 yuan/GB |
| Static residential | Long-term fixed operations | 35 yuan/IP |
Last week I helped a friend compare airfares. With the dynamic business plan I pushed through 2,000 requests in an hour, and the IP pool was large enough that no IP ever repeated. If you're maintaining accounts, though, you have to go static: one IP per account is the only safe way, as sketched below.
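Here's a minimal sketch of that one-account-one-IP pattern; the account names and IPs are made up for illustration:

```ruby
require 'httparty'

# Pin each account to its own static residential IP so the pairing never changes.
ACCOUNT_PROXIES = {
  'shop_account_a' => { http_proxyaddr: '203.0.113.10', http_proxyport: 8000 },
  'shop_account_b' => { http_proxyaddr: '203.0.113.11', http_proxyport: 8000 }
}.freeze

def fetch_as(account, url)
  HTTParty.get(url, ACCOUNT_PROXIES.fetch(account).merge(timeout: 30))
end

fetch_as('shop_account_a', 'https://example.com/dashboard')
```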
A practical guide to avoiding the pitfalls
A real case: I once scraped data through a free proxy and the responses came back with fake content! Switching to ipipgo's TK line fixed it. Here's a trick for checking whether your proxy is actually in effect:
# proxy_params is the same options hash used above
# (http_proxyaddr / http_proxyport / http_proxyuser / http_proxypass).
def check_proxy(proxy_params)
  origin_ip = HTTParty.get('http://ip-api.com/json').parsed_response["query"]
  proxy_ip  = HTTParty.get('http://ip-api.com/json', proxy_params).parsed_response["query"]
  puts "Original IP: #{origin_ip} | Proxy IP: #{proxy_ip}"
end
If the two IPs come out the same when you run this, the proxy isn't taking effect, so go check your configuration. I recommend wiring this check into the crawler and running it automatically every half hour, as sketched below.
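A rough sketch of that half-hourly check, assuming proxy_params is the options hash you already pass to check_proxy:

```ruby
# Re-run the proxy check in a background thread every 30 minutes.
# In a real crawler you'd also want to rotate the IP or alert when it fails.
checker = Thread.new do
  loop do
    check_proxy(proxy_params)
    sleep 30 * 60
  end
end
```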
Frequently asked questions
Q: What should I do if I keep hitting CAPTCHAs?
A: Use the residential proxy + random UA header combo. ipipgo's client has built-in UA randomization; also remember to randomize the request interval to 3-10 seconds. A DIY version is sketched below.
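A minimal sketch of that combo; the User-Agent strings and URLs are just examples (ipipgo's client handles UA rotation for you, this is the do-it-yourself version):

```ruby
require 'httparty'

# Example UA pool; swap in whatever browsers you want to mimic.
USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
].freeze

urls = ['https://example.com/p1', 'https://example.com/p2']

urls.each do |url|
  HTTParty.get(url,
    headers: { 'User-Agent' => USER_AGENTS.sample },  # random UA per request
    timeout: 30
  )
  sleep rand(3..10)  # random 3-10 second pause between requests
end
```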
Q: What should I do if my proxy is slow?
A: Prefer geographically close nodes; for example, use ipipgo's Tokyo data center when scraping Japanese sites. In my tests their dedicated SERP API line keeps latency under 200 ms.
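If you want to compare nodes yourself, a quick latency check is easy to sketch; the proxy host here is a placeholder:

```ruby
require 'httparty'
require 'benchmark'

# Time one request through a given proxy to compare node latency.
def proxy_latency(url, proxy)
  Benchmark.realtime { HTTParty.get(url, proxy.merge(timeout: 30)) }
end

tokyo = { http_proxyaddr: 'tokyo-node.example', http_proxyport: 8000 }
puts "Tokyo node: #{(proxy_latency('https://example.jp/', tokyo) * 1000).round} ms"
```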
Q: What if I need multiple threads?
A: Use a connection pool to manage the proxy IPs and give each thread its own IP. Remember the thread count must not exceed the IP count, or the extra threads just sit there idle! A minimal sketch follows.
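Here's one way to sketch the one-IP-per-thread idea with a thread-safe Queue from Ruby's standard library (the IPs are placeholders; the connection_pool gem works too if you prefer):

```ruby
require 'httparty'

# Thread-safe pool of proxy configs; each thread checks one out at a time.
PROXIES = Queue.new
[
  { http_proxyaddr: '203.0.113.10', http_proxyport: 8000 },
  { http_proxyaddr: '203.0.113.11', http_proxyport: 8000 },
  { http_proxyaddr: '203.0.113.12', http_proxyport: 8000 }
].each { |p| PROXIES << p }

urls = ['https://example.com/1', 'https://example.com/2', 'https://example.com/3']

threads = urls.map do |url|
  Thread.new do
    proxy = PROXIES.pop  # blocks until an IP is free
    begin
      res = HTTParty.get(url, proxy.merge(timeout: 30))
      puts "#{url} -> #{res.code} via #{proxy[:http_proxyaddr]}"
    ensure
      PROXIES << proxy   # return the IP to the pool
    end
  end
end
threads.each(&:join)
```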
Why do I recommend ipipgo?
Their cross-border dedicated line is genuinely good. Last time I helped a client scrape Southeast Asian e-commerce data, an ordinary proxy managed only a 40% success rate; switching to their Singapore line shot it up to 92%. And a bit of inside info: their technical support is online 24 hours a day. Run into a problem, dump them the error logs, and you'll usually have a solution within ten minutes.
One last nag: don't pinch pennies with free proxies. At best you get blocked, at worst you get poisoned data or even legal trouble. For real business, use a properly licensed provider like ipipgo; data security is worth far more than the small proxy fee. Next time let's talk about building a distributed crawler on top of proxies, and I promise it'll be more practical than the usual tutorials!

