Python Parsing to JSON: Dictionary Data Handling Tips

When proxy IPs meet JSON data: how do you handle it without crashing?

Anyone who writes crawlers knows that nine times out of ten, the data pulled back from the Internet is in JSON format. It looks clean, but actually processing it can be more trouble than taking apart Russian nesting dolls. When collecting data through proxy IPs in particular, you often run into pests like mixed-up data types, encoding errors, and overly deep nesting. Last week I hit a case: while scraping an e-commerce site through ipipgo's dynamic residential proxies, the price field in the returned JSON was sometimes the string "199" and sometimes the number 199, which almost brought the database down.


import json
from requests import Session

# ipipgo proxy configuration (this is the key part)
proxy_config = {
    "http": "http://user:pass@gateway.ipipgo.com:9020",
    "https": "http://user:pass@gateway.ipipgo.com:9020"
}

session = Session()
response = session.get('https://api.example.com/products', proxies=proxy_config)

# There's a landmine buried here!
raw_data = json.loads(response.text)

Four Tips to Tame Wild JSON

First move: data type cleanup. Use this neat trick when a field arrives with mixed types:


def clean_data(item):
    # Convert the price field to float uniformly
    if 'price' in item:
        try:
            item['price'] = float(item['price'])
        except (TypeError, ValueError):
            # Unparseable price (None, "N/A", ...): fall back to 0.0
            item['price'] = 0.0
    # Flatten the nested 'specs' dictionary into the top level
    if isinstance(item.get('specs'), dict):
        item.update(item.pop('specs'))
    return item

safe_data = [clean_data(x) for x in raw_data['results']]
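If 0.0 is a risky default (it cannot be told apart from a genuinely free item) or the flattened keys might collide with top-level ones, a slightly more defensive variant could look like the sketch below. The names `to_price` and `flatten` are illustrative, and returning `None` for unparseable prices is an assumption about what the database layer expects, not something the original pipeline prescribes.

```python
from decimal import Decimal, InvalidOperation


def to_price(value):
    """Return a Decimal for numeric-looking input, or None when it cannot be parsed."""
    if value is None or isinstance(value, bool):
        return None
    try:
        # str() lets "199", 199 and 199.0 all go through the same path
        price = Decimal(str(value).strip().replace(",", ""))
    except InvalidOperation:
        return None
    # Reject "NaN" and "Infinity", which Decimal would otherwise accept
    return price if price.is_finite() else None


def flatten(item, parent_key="", sep="_"):
    """Flatten nested dicts, prefixing child keys so they cannot overwrite parent keys."""
    flat = {}
    for key, value in item.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat
```

With this version the string "199" and the number 199 compare equal after conversion, and a nested `specs` dictionary becomes keys such as `specs_color` instead of overwriting an existing `color` field.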

Second move: the exception-handling trio for proxies. Pay special attention to network fluctuations when using ipipgo's proxies:

Error type         Response strategy
ConnectionError    Automatically switch proxy nodes
Timeout            Wait 3-5 seconds before retrying
JSONDecodeError    Log the raw response content
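The three strategies in the table can be sketched as a single retry loop. To stay dependency-free, this sketch takes a hypothetical `fetch(proxy)` callable that returns the response text and catches Python's built-in `ConnectionError` and `TimeoutError`; with requests you would catch `requests.exceptions.ConnectionError` and `requests.exceptions.Timeout` instead. The function name, the `max_attempts` default, and the 500-character log limit are all assumptions.

```python
import json
import logging
import random
import time
from itertools import cycle

logger = logging.getLogger("proxy_json")


def fetch_json(fetch, proxies, max_attempts=4, sleep=time.sleep):
    """Call fetch(proxy) until it yields parseable JSON or attempts run out."""
    pool = cycle(proxies)
    proxy = next(pool)
    for _ in range(max_attempts):
        try:
            text = fetch(proxy)
        except ConnectionError:
            # Strategy 1: switch to the next proxy node and try again
            proxy = next(pool)
            continue
        except TimeoutError:
            # Strategy 2: wait 3-5 seconds before retrying on the same node
            sleep(random.uniform(3, 5))
            continue
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            # Strategy 3: record the raw response so the failure can be inspected
            logger.error("Undecodable response via %s: %r", proxy, text[:500])
            raise
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the wait out of unit tests; in production the default `time.sleep` applies.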

Real-world pitfalls and how to get out of them

Once, while capturing data through ipipgo's short-lived proxies, I ran into a strange JSON payload with emoji in the key names! In a case like this, don't try to force it through the standard library; use this approach instead:


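# demjson is a lenient parser; on recent Python versions the fork demjson3 may be needed
import demjson
from charset_normalizer import detect

# Detect the actual encoding (detect() may report None, so fall back to UTF-8)
encoding = detect(response.content)['encoding'] or 'utf-8'
dirty_json = response.content.decode(encoding, errors='replace')

# Parse with a lenient third-party library
data = demjson.decode(dirty_json)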

Remember to add "Accept-Encoding": "identity" to the request headers. Some websites return compressed data, which can get garbled when relayed through the proxy.
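In many cases the trouble with payloads like the emoji-key one lies in the decoding step rather than in the parser, so it can be worth trying a strict decode with the standard library before reaching for a third-party package. The sketch below is one way to do that; `decode_json_bytes` is an illustrative name, and the choice to fall back to replacement characters rather than fail is an assumption.

```python
import json


def decode_json_bytes(raw):
    """Parse JSON from raw bytes, tolerating a UTF-8 BOM and stray invalid bytes."""
    try:
        # utf-8-sig decodes plain UTF-8 and also strips a leading BOM if present
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Last resort: keep going, replacing undecodable bytes with U+FFFD
        text = raw.decode("utf-8", errors="replace")
    return json.loads(text)
```

Called as `decode_json_bytes(response.content)`, this parses keys containing emoji without any extra library, as long as the bytes really are UTF-8.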

Q&A (a must-read for newcomers)

Q: What should I do if I keep receiving truncated JSON through a proxy IP?
A: Most likely the transfer is being cut off midway. Suggestions: 1) check in the ipipgo dashboard whether your traffic package has run out; 2) add "Connection": "keep-alive" to the request headers; 3) increase the timeout to 10 seconds or more.
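One way to confirm that a body really was cut off, rather than malformed at the source, is to compare its length with the Content-Length header. This is only a sketch: `body_is_complete` is an illustrative name, and the check is only meaningful for uncompressed responses that carry the header, so it returns `None` when it cannot decide.

```python
def body_is_complete(headers, body):
    """Return True/False when Content-Length allows a check, None when it does not."""
    lowered = {key.lower(): value for key, value in headers.items()}
    # A compressed body is usually decoded by the HTTP client,
    # so its length no longer matches the header
    if lowered.get("content-encoding", "identity").lower() != "identity":
        return None
    declared = lowered.get("content-length")
    if declared is None or not str(declared).strip().isdigit():
        return None
    return len(body) == int(declared)
```

With requests this could be called as `body_is_complete(response.headers, response.content)`; a `False` result suggests retrying through another node before trying to parse.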

Q: Is there a trick for dealing with deeply nested JSON?
A: Pulling everything out in one pass with jsonpath is far more pleasant than writing several layers of for loops:


# Filter expressions such as [?(...)] need the extended parser
from jsonpath_ng.ext import parse

expr = parse('$..products[?(@.price > 100)].sku')
matches = [match.value for match in expr.find(data)]
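For readers who prefer not to add a dependency, roughly the same query can be sketched with a small recursive walker. The helper names `iter_key` and `skus_over` are illustrative, and the sketch assumes that `products` holds a list of dictionaries with numeric `price` values.

```python
def iter_key(node, key):
    """Yield every value stored under `key` at any depth of a parsed JSON tree."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from iter_key(value, key)
    elif isinstance(node, list):
        for element in node:
            yield from iter_key(element, key)


def skus_over(data, threshold=100):
    """Collect the sku of every product priced above `threshold`."""
    return [
        product["sku"]
        for products in iter_key(data, "products")
        if isinstance(products, list)
        for product in products
        if isinstance(product, dict)
        and "sku" in product
        and isinstance(product.get("price"), (int, float))
        and product["price"] > threshold
    ]
```

Products whose price is still a string are skipped here, so it makes sense to run the type cleanup first.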

A lesser-known way to use ipipgo

Their on-demand (pay-as-you-go) proxies are especially well suited to sudden bursts of data. For example, if you suddenly need to parse a 10 GB JSON log file, you could do it like this:


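import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def parse_chunk(chunk):
    # `ipipgo` stands for the provider's client module, which must be imported separately
    with ipipgo.create_session(duration='15min') as session:
        # Each chunk arrives as a DataFrame; expand nested dict columns
        return pd.json_normalize(chunk.to_dict(orient='records'))

# Process the large file in chunks
results = []
with ThreadPoolExecutor(max_workers=5) as executor:
    for chunk in pd.read_json('bigfile.json', lines=True, chunksize=1000):
        results.append(executor.submit(parse_chunk, chunk))

# Collect the parsed chunks once the workers finish
frames = [future.result() for future in results]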
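If pandas is not available, or the file is too messy for `pd.read_json`, a dependency-free sketch along the same lines might look like this. It assumes the log is in JSON Lines format (one object per line), as the `lines=True` argument implies; `iter_json_lines` and `chunked` are illustrative names, and silently skipping bad lines is a simplification that a real pipeline would probably replace with logging.

```python
import json
from itertools import islice


def iter_json_lines(lines):
    """Yield one parsed object per non-empty line, skipping lines that fail to parse."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def chunked(records, size):
    """Group an iterable into lists of at most `size` items without loading it all."""
    iterator = iter(records)
    while chunk := list(islice(iterator, size)):
        yield chunk
```

Because both helpers are generators, `for chunk in chunked(iter_json_lines(open('bigfile.json', encoding='utf-8')), 1000)` reads the file line by line and holds only one chunk in memory at a time.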

One final word of advice: when dealing with JSON, be sure to check the schema before parsing. If you are using ipipgo's proxies, you can send a HEAD request first to probe the response (a HEAD response carries only headers such as Content-Type and Content-Length, not the body) and avoid wasting traffic. If you hit a hard problem, remember to look up the "non-standard JSON processing guide" in their documentation; it is a lifesaver.
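A schema check does not need to be elaborate to be useful. The sketch below is a minimal hand-rolled version; `schema_errors` and `PRODUCT_SCHEMA` are illustrative names, and the field list is an assumption based on the product example used earlier.

```python
def schema_errors(item, schema):
    """Return a list of human-readable problems; an empty list means the item passes."""
    if not isinstance(item, dict):
        return [f"expected an object, got {type(item).__name__}"]
    errors = []
    for field, expected in schema.items():
        types = expected if isinstance(expected, tuple) else (expected,)
        if field not in item:
            errors.append(f"missing field: {field}")
        elif not isinstance(item[field], types):
            wanted = "/".join(t.__name__ for t in types)
            errors.append(f"{field}: expected {wanted}, got {type(item[field]).__name__}")
    return errors


# price may arrive as a string or a number, so both are accepted before cleanup
PRODUCT_SCHEMA = {"sku": str, "price": (int, float, str)}
```

Running this on each record before cleanup lets you log and drop malformed items instead of discovering them at insert time.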
