I'm reading data from a database (50k+ rows) where one column is stored as JSON, and I want to extract it into a pandas DataFrame. The snippet below works, but it is inefficient and takes a very long time when run against the whole database. Note that not all items have the same attributes and that the JSON has some nested attributes.
How could I make this faster?
    import pandas as pd
    import json

    df = pd.read_csv('http://pastebin.com/raw/7L86m9R2', \
                     header=None, index_col=0, names=['data'])

    df.data.apply(json.loads) \
           .apply(pd.io.json.json_normalize)\
           .pipe(lambda x: pd.concat(x.values))
    ###this returns a dataframe where each JSON key is a column

You can convert JSON to a pandas DataFrame by using read_json(): just pass the JSON string to the function. It takes several parameters; in this case I am using orient, which specifies the format of the JSON string. This function can also read JSON files into a pandas DataFrame.
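As a rough illustration of the read_json() route, here is a minimal sketch. The sample JSON text and the variable names are made up for this example (they are not the pastebin data), and the string is wrapped in io.StringIO because recent pandas versions deprecate passing a literal JSON string directly.

```python
import io

import pandas as pd

# Hypothetical sample: a JSON array of records with differing attributes.
json_text = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b", "extra": 3.5}]'

# orient="records" says the input is a list of {column: value} objects.
records_df = pd.read_json(io.StringIO(json_text), orient="records")

# Attributes missing from a record show up as NaN in that row.
print(records_df)
```

Note that read_json() works on a whole JSON document and does not flatten nested objects, so for a column of nested JSON strings like the one in the question, json_normalize is probably still the better fit.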
The query function seems more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seems much more efficient than the query function.
Dask runs faster than pandas for this query, even when the most inefficient column type is used, because it parallelizes the computations. pandas only uses 1 CPU core to run the query. My computer has 4 cores and Dask uses all the cores to run the computation.
json_normalize takes already-parsed JSON (a dict or a list of dicts), or a pandas Series of such objects. It does not take raw JSON strings.
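To make the expected input concrete, here is a small sketch with made-up rows standing in for the JSON column. It assumes pandas 1.0 or later, where the function is available as pd.json_normalize. The strings are parsed once, and the resulting list of dicts is normalized in a single call rather than once per row.

```python
import json

import pandas as pd

# Hypothetical rows standing in for the JSON column in the question.
raw = pd.Series([
    '{"id": 1, "user": {"name": "ann", "age": 30}}',
    '{"id": 2, "user": {"name": "bob"}, "tags": "x"}',
])

# Parse every string into a dict first; json_normalize needs parsed objects.
parsed = [json.loads(s) for s in raw]

# One call for all rows: nested keys become dotted column names,
# and attributes missing from a row are filled with NaN.
flat = pd.json_normalize(parsed)
print(flat.columns.tolist())
```

Calling json_normalize once over the whole list avoids building and concatenating one small DataFrame per row, which is most likely where the original snippet spends its time.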
    pd.io.json.json_normalize(df.data.apply(json.loads))

setup
    import pandas as pd
    import json

    df = pd.read_csv('http://pastebin.com/raw/7L86m9R2', \
                     header=None, index_col=0, names=['data'])

I think you can first convert the string column data to dicts, then build a list of those dicts from the values, and finally call DataFrame.from_records:
    df = pd.read_csv('http://pastebin.com/raw/7L86m9R2', \
                     header=None, index_col=0, names=['data'])

    a = df.data.apply(json.loads).values.tolist()

    print (pd.DataFrame.from_records(a))

Another idea:
    # the column holds JSON strings, so parse them before normalizing
    df = pd.json_normalize(df['data'].apply(json.loads))
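The two ideas above treat nested attributes differently, which matters for the data in the question. The sketch below uses made-up sample strings and hypothetical variable names, and assumes pandas 1.0 or later for pd.json_normalize.

```python
import json

import pandas as pd

# Hypothetical sample strings; the second one has an extra attribute.
raw = [
    '{"id": 1, "user": {"name": "ann"}}',
    '{"id": 2, "user": {"name": "bob"}, "score": 7}',
]
parsed = [json.loads(s) for s in raw]

# from_records keeps each nested object as a dict inside a single column.
shallow = pd.DataFrame.from_records(parsed)

# json_normalize expands nested objects into dotted columns such as "user.name".
nested = pd.json_normalize(parsed)

print(shallow.columns.tolist())
print(nested.columns.tolist())
```

If only the top-level keys are needed, from_records is likely enough. If the nested attributes should become their own columns, json_normalize does that in one pass.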