When Crawlers Meet CAPTCHAs: Try Playwright + Proxy IPs
Recently people keep asking me what to do when target sites keep banning their IP while they automate with Playwright. I know this problem all too well! Last year, while collecting e-commerce data, I had to change IP addresses every three days, and then I realized that putting a proxy IP on Playwright is the right way to go. It works like changing the license plate on a car.
First, a real case: last week I helped a friend set up price monitoring for a travel site, and a single IP got blocked after a little more than 50 visits. After switching to ipipgo's dynamic residential proxies, the job ran for three days without a failure. Let's walk through the details language by language.
Python users, look here: inject a proxy in three lines of code
If you use Python, remember this pattern:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # The key is in these three configuration lines
    proxy = {
        "server": "gateway.ipipgo.com:8000",
        "username": "Your account number",
        "password": "your key",
    }
    browser = p.chromium.launch(proxy=proxy)
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder: replace with the target site
    # ...follow-up actions...
    browser.close()
```
Take note: never hardcode the username and password in plaintext! Store them in environment variables instead. ipipgo's backend can generate this kind of authentication string directly, which is much less hassle than with providers that make you assemble it yourself.
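Here is a minimal sketch of the environment-variable approach in Python. It assumes the variable names IPIPGO_USER and IPIPGO_PASS (the same names the Node.js example later in this article uses), and the helper name build_proxy_config is made up for illustration, so adapt both to your own setup.

```python
import os


def build_proxy_config(env=os.environ):
    """Build Playwright's proxy dict from environment variables.

    IPIPGO_USER / IPIPGO_PASS are assumed variable names; use whatever
    names your deployment actually defines.
    """
    missing = [name for name in ("IPIPGO_USER", "IPIPGO_PASS") if not env.get(name)]
    if missing:
        # Fail early instead of launching a browser with empty credentials
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "server": "http://gateway.ipipgo.com:8000",
        "username": env["IPIPGO_USER"],
        "password": env["IPIPGO_PASS"],
    }
```

The result can be passed straight to `p.chromium.launch(proxy=build_proxy_config())`, so the credentials never appear in the source file.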
For JS users: asynchronous proxy configuration tips
In Node.js it is common for the proxy to silently not take effect, mostly because the asynchronous calls are mishandled. Here is the correct approach:
```javascript
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({
    proxy: {
      server: 'http://gateway.ipipgo.com:8000',
      username: process.env.IPIPGO_USER,
      password: process.env.IPIPGO_PASS
    }
  });
  const page = await browser.newPage();
  await page.goto('https://example.com'); // placeholder: the address you need to visit
  // Remember to check if the IP is active
  console.log(await page.evaluate(() => document.body.innerHTML));
  await browser.close();
})();
```
Key reminder: be sure to pass in the proxy at launch time. ipipgo's proxy channels support three protocols (HTTP, HTTPS, and SOCKS5); in my tests, SOCKS5 had the highest success rate.
A must-read for Java veterans: dynamic proxy pool switching
Enterprise applications are all about proxy pool rotation, and using ipipgo's API to fetch IPs dynamically is the way to go:
```java
import com.microsoft.playwright.*;
import com.microsoft.playwright.options.Proxy;

public class ProxyDemo {
    public static void main(String[] args) {
        try (Playwright playwright = Playwright.create()) {
            // Get the latest proxy from the ipipgo interface
            String[] currentProxy = getIpipgoProxy();
            BrowserType.LaunchOptions options = new BrowserType.LaunchOptions()
                .setProxy(new Proxy("http://" + currentProxy[0])
                    .setUsername(currentProxy[1])
                    .setPassword(currentProxy[2]))
                .setHeadless(false);
            Browser browser = playwright.chromium().launch(options);
            Page page = browser.newPage();
            page.navigate("https://example.com"); // placeholder: your business site
            System.out.println(page.title());
        }
    }

    private static String[] getIpipgoProxy() {
        // Call the ipipgo API to get a dynamic IP.
        // Return format: [ip:port, username, password].
        throw new UnsupportedOperationException("Fill in the ipipgo API call here");
    }
}
```
Key point: refresh the proxy before each Browser instance is created, and never run a single IP into the ground. ipipgo's concurrent pool mode is especially well suited to this scenario and can switch 200+ residential IPs per second.
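The same rotation idea can be sketched in Python. This is only a rough illustration: a fixed list stands in for the ipipgo API call (its response format is not shown in this article), each entry follows the [ip:port, username, password] shape from the Java example, and ProxyRotator and launch_with_fresh_proxy are hypothetical names.

```python
from itertools import cycle


class ProxyRotator:
    """Hand out the next proxy config on every call (round-robin)."""

    def __init__(self, proxies):
        # proxies: list of (host_port, username, password) tuples
        if not proxies:
            raise ValueError("proxy list must not be empty")
        self._pool = cycle(proxies)

    def next_proxy(self):
        host_port, username, password = next(self._pool)
        return {
            "server": "http://" + host_port,
            "username": username,
            "password": password,
        }


def launch_with_fresh_proxy(playwright, rotator):
    # Refresh the proxy before each browser instance is created
    return playwright.chromium.launch(proxy=rotator.next_proxy())
```

Inside a `with sync_playwright() as p:` block, calling `launch_with_fresh_proxy(p, rotator)` for every new browser gives each instance the next proxy in the list. In a real setup you would probably refill the list from the provider's API rather than cycling over a fixed one.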
The Complete Pitfall Guide: I've Already Filled In Every Pit You'll Step In
| Symptom | Cause | Fix |
|---|---|---|
| The proxy is configured but can't connect | Wrong protocol prefix (e.g. http written as https) | Use ipipgo's fully protocol-compatible channels |
| Still recognized after logging in | Browser fingerprint leak | Combine the proxy with Playwright's device emulation parameters |
| Failures in mobile environments | IP type mismatch | Switch to ipipgo's 4G/5G mobile proxy pools |
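For the first row, a small pre-flight check can catch a malformed protocol prefix before the browser even starts. This is a sketch under assumptions: check_proxy_server is a hypothetical helper, and the allowed schemes are simply the three protocols mentioned earlier. It cannot tell whether a given gateway expects http or https; it only rejects strings that are obviously wrong.

```python
from urllib.parse import urlsplit

# The three protocols this article mentions
ALLOWED_SCHEMES = {"http", "https", "socks5"}


def check_proxy_server(server):
    """Return the server string unchanged if scheme, host and port look sane."""
    parts = urlsplit(server)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError("unexpected or missing proxy scheme in " + repr(server))
    # parts.port itself raises ValueError when the port is not a valid number
    if parts.hostname is None or parts.port is None:
        raise ValueError("proxy server needs host and port: " + repr(server))
    return server
```

Calling it on the `server` value right before `launch()` turns a silent connection failure into an immediate, readable error.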
Soul-searching question: have you chosen the right proxy service provider?
I've tested no fewer than 20 proxy services on the market, and I ended up settling on ipipgo for three reasons:
- True residential IPs: unlike some providers that pass off datacenter IPs as residential ones
- No traffic cap: you don't have to worry about going over quota
- Exclusive API design: getting an IP is as easy as hailing a cab
Their intelligent routing feature deserves a special mention: it automatically selects the node with the lowest latency. The last time I collected live-streaming data, I ran 70GB of traffic in 8 hours, and the IP survival rate was still above 92%.
Q&A time: frequently asked questions in one place
Q: What should I do if my proxy IP fails frequently?
A: Switch to ipipgo's dynamic residential proxies, which automatically switch IPs on each request and give the site no chance to block you.
Q: What if I need to collect data from overseas websites?
A: ipipgo's global coverage is no exaggeration; in my tests, even Mauritius IPs could be obtained reliably. But be careful to comply with local laws and regulations.
Q: How do I manage proxies with multiple browser instances open at the same time?
A: Use their session hold mode. Each browser instance is bound to its own IP address, so the business logic stays clean and sessions don't get mixed up.
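On the client side, the one-IP-per-instance bookkeeping might look like the sketch below. How ipipgo's session hold mode is actually configured is not covered in this article, so this assumes you already have a list of proxy configs (one per held session); StickyProxyAssigner is a hypothetical name.

```python
class StickyProxyAssigner:
    """Bind each browser instance ID to its own proxy, and keep it bound."""

    def __init__(self, proxies):
        self._free = list(proxies)   # proxies not yet handed out
        self._assigned = {}          # instance_id -> proxy config

    def proxy_for(self, instance_id):
        if instance_id not in self._assigned:
            if not self._free:
                raise RuntimeError("no free proxy left for " + repr(instance_id))
            # First request from this instance: reserve the next free proxy
            self._assigned[instance_id] = self._free.pop(0)
        return self._assigned[instance_id]
```

Each worker asks `proxy_for("worker-1")`, `proxy_for("worker-2")`, and so on before launching its browser; repeated calls with the same ID return the same proxy, and two IDs never share one.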
A final word from the heart: technical tricks are just tools. Choosing the right proxy service provider is what matters most. Instead of wrestling with anti-blocking workarounds in your code, why not use a reliable service like ipipgo and focus on the business logic?