
How to create a pandas DataFrame from nested JSON with dictionaries

I'm trying to create a pandas DataFrame from a JSON file. I've seen multiple solutions to this problem that use built-in functions such as from_dict/json_normalize, yet I'm unable to apply them to my code. Here's how my data is structured in the JSON file:

"data": [
   {
      "groups": {
         "data": [
            {
               "group_name": "Math",
               "year_joined": "2009"
            },
            {
               "group_name": "History",
               "year_joined": "2011"
            },
            {
               "group_name": "Biology",
               "year_joined": "2010"
            }
         ]
      },
      "id": "12512"
   },

When I try to normalize this data with pandas like this:

import json
import pandas as pd

path = 'mypath'
with open(path) as f:  # the file is closed automatically
    data = json.load(f)

test = pd.json_normalize(data['data'], errors='ignore')

I just receive something like this:

    id      groups.data
0   12512   [{'group_name': 'Math', 'year_joined': '2009', 'gr...
1   23172   [{'group_name': 'Chemistry', 'year_joined': '2005'...

I want this data to look like this (solution 1):

    id      group     year_joined
0   12512   group1    year1
1   12512   group2    year2
2   12512   group3    year3

Or like this (solution 2):

    id      group                   year_joined
0   12512   group1,group2,group3    year1,year2,year3
1   23172   group4,group5           year4,year5

How can I achieve this? I tried passing the 'record_path' parameter to 'json_normalize', but it didn't change anything. I also tried 'DataFrame.from_dict' as a workaround, without success. The only way I got to solution 1 was with several loops that iterate through everything in the JSON file and append the values to separate lists. It kind of works, but it takes a long time on bigger datasets.

How can I use built-in pandas tools to process files where the records are nested as dictionaries in the third layer, as shown above?

Towelie asked Nov 04 '25 11:11


2 Answers

  • Given a dict with a nested list:
    1. create a DataFrame from the overall structure
    2. explode() the embedded list into one row per element
    3. expand each nested dict into columns with apply(pd.Series)
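import pandas as pd

# Note: the first entry uses the key 'group' rather than 'group_name',
# which is why the result has both columns with NaN gaps.
d = {'groups': {'data': [{'group': 'Math', 'year_joined': '2009'},
   {'group_name': 'History', 'year_joined': '2011'},
   {'group_name': 'Biology', 'year_joined': '2010'}]},
 'id': '12512'}

# flatten the outer dict, explode the list, then expand each dict into columns
pd.json_normalize(d).explode("groups.data").reset_index(drop=True).pipe(
    lambda df: df["id"].to_frame().join(df["groups.data"].apply(pd.Series))
)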

      id group year_joined group_name
0  12512  Math        2009        NaN
1  12512   NaN        2011    History
2  12512   NaN        2010    Biology
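
As a possible alternative, the same long layout can probably be reached with json_normalize alone by pointing record_path at the nested list and carrying the id along via meta. This is a minimal sketch, not a tested fix for the original file: the `data` dict below is a hypothetical stand-in shaped like the question's JSON, and it assumes every entry uses the key 'group_name'.

```python
import pandas as pd

# Hypothetical sample shaped like the question's file (values are illustrative).
data = {
    "data": [
        {
            "groups": {"data": [
                {"group_name": "Math", "year_joined": "2009"},
                {"group_name": "History", "year_joined": "2011"},
            ]},
            "id": "12512",
        },
        {
            "groups": {"data": [
                {"group_name": "Chemistry", "year_joined": "2007"},
            ]},
            "id": "23172",
        },
    ]
}

# record_path walks into groups -> data and emits one row per list element;
# meta copies the parent's id onto each of those rows.
flat = pd.json_normalize(
    data["data"],
    record_path=["groups", "data"],
    meta=["id"],
)

# Put id first to match the layout asked for in the question.
flat = flat[["id", "group_name", "year_joined"]]
print(flat)
```

Note that record_path is given as a list of keys here, because the list of records sits two levels below each top-level entry.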
Rob Raymond answered Nov 06 '25 03:11


You need to collect the information from the data dictionary.

solution 1

d = {}
for group in data["data"]:
    groups = [x["group_name"] for x in group['groups']["data"]]
    d['id'] = d.get('id', []) + [group['id']] * len(groups)
    d['group'] = d.get('group', []) + groups
    d['year_joined'] = d.get('year_joined', []) + [x["year_joined"] for x in group['groups']["data"]]

df = pd.DataFrame(d)

Output

      id      group year_joined
0  12512       Math        2009
1  12512    History        2011
2  12512    Biology        2010
3  23172  Chemistry        2007
4  23172  Economics        2008
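
If the loop above turns out to be slow on larger files, one likely reason is that every `d.get(...) + [...]` builds a brand-new list on each iteration. Here is a minimal sketch of a variant that collects one dict per row and constructs the DataFrame once. The `data` dict is a hypothetical stand-in shaped like the question's file, with values borrowed from the output above.

```python
import pandas as pd

# Hypothetical sample shaped like the question's file.
data = {
    "data": [
        {
            "groups": {"data": [
                {"group_name": "Math", "year_joined": "2009"},
                {"group_name": "History", "year_joined": "2011"},
                {"group_name": "Biology", "year_joined": "2010"},
            ]},
            "id": "12512",
        },
        {
            "groups": {"data": [
                {"group_name": "Chemistry", "year_joined": "2007"},
                {"group_name": "Economics", "year_joined": "2008"},
            ]},
            "id": "23172",
        },
    ]
}

# One dict per (record, group) pair; nothing is copied or re-concatenated.
rows = [
    {"id": rec["id"], "group": g["group_name"], "year_joined": g["year_joined"]}
    for rec in data["data"]
    for g in rec["groups"]["data"]
]

df = pd.DataFrame(rows, columns=["id", "group", "year_joined"])
print(df)
```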

solution 2

d = {}
for group in data["data"]:
    d['id'] = d.get('id', []) + [group['id']]
    d['group'] = d.get('group', []) + [','.join(x["group_name"] for x in group['groups']["data"])]
    d['year_joined'] = d.get('year_joined', []) + [','.join(x["year_joined"] for x in group['groups']["data"])]

df = pd.DataFrame(d)

Output

      id                 group     year_joined
0  12512  Math,History,Biology  2009,2011,2010
1  23172   Chemistry,Economics       2007,2008
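
For comparison, the comma-joined layout can probably also be produced without an explicit loop, by flattening with json_normalize and then joining the strings per id with groupby. This is a sketch under assumptions: the `data` dict is a hypothetical stand-in shaped like the question's file, and all values are assumed to be strings, since ",".join would fail on numbers.

```python
import pandas as pd

# Hypothetical sample shaped like the question's file.
data = {
    "data": [
        {
            "groups": {"data": [
                {"group_name": "Math", "year_joined": "2009"},
                {"group_name": "History", "year_joined": "2011"},
                {"group_name": "Biology", "year_joined": "2010"},
            ]},
            "id": "12512",
        },
        {
            "groups": {"data": [
                {"group_name": "Chemistry", "year_joined": "2007"},
                {"group_name": "Economics", "year_joined": "2008"},
            ]},
            "id": "23172",
        },
    ]
}

# Long format first: one row per group, with the parent id attached.
flat = pd.json_normalize(data["data"], record_path=["groups", "data"], meta=["id"])

# Then collapse back to one row per id, joining the strings with commas.
# sort=False keeps the ids in their order of first appearance.
combined = (
    flat.groupby("id", sort=False)[["group_name", "year_joined"]]
    .agg(lambda s: ",".join(s))
    .reset_index()
    .rename(columns={"group_name": "group"})
)
print(combined)
```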
Guy answered Nov 06 '25 01:11