I'm trying to convert a JSON file with multiple levels of nesting into a dataframe in R. I've looked at some existing questions/answers on this problem (e.g., Convert JSON to R dataframe and Flatten nested JSON to dataframe in R and Nested JSON to dataframe in R), but as this is my first time handling a JSON file I'm really struggling.
The JSON file is pretty large and contains twitter data. Here's a sample to show what the nesting structure is like:
[
{
"Corpus": "ALM",
"Tweets": [
{
"tweet_id": "521033092132503552",
"tweet_text": "no tweet text available",
"date": "no date available",
"annotations": [
{
"annotator": "annotator00",
"annotation": "care"
},
{
"annotator": "annotator01",
"annotation": "care,purity"
},
{
"annotator": "annotator02",
"annotation": "care,purity"
},
{
"annotator": "annotator03",
"annotation": "care"
}
]
},
{
"tweet_id": "537681598989475841",
"tweet_text": "Wholeheartedly support these protests & acts of civil disobedience & will join when I can! #Ferguson #AllLivesMatter",
"date": "Wed Nov 26 18:57:37 +0000 2014",
"annotations": [
{
"annotator": "annotator00",
"annotation": "subversion"
},
{
"annotator": "annotator01",
"annotation": "subversion"
},
{
"annotator": "annotator02",
"annotation": "loyalty"
},
{
"annotator": "annotator03",
"annotation": "loyalty,subversion"
}
]
},
Here's some dput of this same data file, but different observations/tweets (note that because the file is so large the largest nesting level gets cut off):
list(tweet_id = "500745903054258177", tweet_text = "@MichaelSkolnik Thank you for joining the solidarity effort #DearMikesMom #AllLivesMatter",
date = "Sat Aug 16 20:48:21 +0000 2014", annotations = list(
list(annotator = "annotator01", annotation = "loyalty"),
list(annotator = "annotator02", annotation = "loyalty"),
list(annotator = "annotator03", annotation = "loyalty"))),
list(tweet_id = "621859689270120448", tweet_text = "no tweet text available",
date = "no date available", annotations = list(list(annotator = "annotator01",
annotation = "betrayal"), list(annotator = "annotator02",
annotation = "non-moral"), list(annotator = "annotator03",
annotation = "fairness"))), list(tweet_id = "551227029874438145",
My desired output would look like this:
corpus tweet_id tweet_text date annotator annotation
1 ALM 5210... no tweet text available no date available 00 care
2 ALM 5210... no tweet text available no date available 01 care, purity
3 ALM 5210... no tweet text available no date available 02 care, purity
4 ALM 5210... no tweet text available no date available 03 care
5 ALM 5376... Wholeheartedly suppo... Wed Nov 26 18:... 00 subversion
6 ALM 5376... Wholeheartedly suppo... Wed Nov 26 18:... 01 subversion
7 ALM 5376... Wholeheartedly suppo... Wed Nov 26 18:... 02 loyalty
8 ALM 5376... Wholeheartedly suppo... Wed Nov 26 18:... 03 loyalty, subversion
...
How can I convert the JSON file to my desired dataframe output?
I've imported the JSON file and tried to flatten it like so:
myData <- fromJSON(file = "my_json_file.json")
myData_flat <- as.data.frame(myData)
But this clearly does not suffice:
str(myData_flat)
'data.frame': 1 obs. of 352130 variables:
$ Corpus : Factor w/ 1 level "ALM": 1
$ Tweets.tweet_id : Factor w/ 1 level "521033092132503552": 1
$ Tweets.tweet_text : Factor w/ 1 level "no tweet text available": 1
$ Tweets.date : Factor w/ 1 level "no date available": 1
$ Tweets.annotations.annotator : Factor w/ 1 level "annotator00": 1
$ Tweets.annotations.annotation : Factor w/ 1 level "care": 1
$ Tweets.annotations.annotator.1 : Factor w/ 1 level "annotator01": 1
$ Tweets.annotations.annotation.1 : Factor w/ 1 level "care,purity": 1
$ Tweets.annotations.annotator.2 : Factor w/ 1 level "annotator02": 1
$ Tweets.annotations.annotation.2 : Factor w/ 1 level "care,purity": 1
$ Tweets.annotations.annotator.3 : Factor w/ 1 level "annotator03": 1
$ Tweets.annotations.annotation.3 : Factor w/ 1 level "care": 1
$ Tweets.tweet_id.1 : Factor w/ 1 level "537681598989475841": 1
$ Tweets.tweet_text.1 : Factor w/ 1 level "Wholeheartedly support these protests & acts of civil disobedience & will join when I can! #Ferguson #A"| __truncated__: 1
$ Tweets.date.1 : Factor w/ 1 level "Wed Nov 26 18:57:37 +0000 2014": 1
$ Tweets.annotations.annotator.4 : Factor w/ 1 level "annotator00": 1
$ Tweets.annotations.annotation.4 : Factor w/ 1 level "subversion": 1
$ Tweets.annotations.annotator.5 : Factor w/ 1 level "annotator01": 1
$ Tweets.annotations.annotation.5 : Factor w/ 1 level "subversion": 1
$ Tweets.annotations.annotator.6 : Factor w/ 1 level "annotator02": 1
$ Tweets.annotations.annotation.6 : Factor w/ 1 level "loyalty": 1
$ Tweets.annotations.annotator.7 : Factor w/ 1 level "annotator03": 1
$ Tweets.annotations.annotation.7 : Factor w/ 1 level "loyalty,subversion": 1
...
There are several packages in R
to read in JSON data and have a fromJSON()
function. RJSONOIO
, jsonlite
and rjson
are the ones I know about. It seems you are using the rjson::fromJSON()
in your code.
Since the json
format is very flexible in how it can store data and is able to store complex, nested structures, when we convert it into the relatively under complex format of a data.frame
, we have to be very explicit about the structure of the data. Your case is rather simple, but it can be very tedious, to capture the full range of data representations in a json file when converting to the rectangular shape of a data.frame
.
rjson::fromJSON()
generates a nested list, that resembles the structure of the JSON file.
The structure of your myData
list
looks something like this:
myData[[1]]:
- $Corpus
- $Tweets
- $[[1]]
- tweet data
- $[[2]]
- tweet data
In order to extract the data you are looking for, you want to cycle through the entries of the mydata[[1]]$tweets
sub-list, convert them to a data.frame
and then bind together all those data.frame
s to one big data.frame
. You can do that with a for
loop or something like lapply()
. I'd suggest using purrr::map_dfr()
as it automatically binds together the results of each nested operation into one data.frame
.
I found that the following should work for your data. Luckily tibble::as.tibble()
works well for your data. You apply it once to each tweet entry and after that apply it to each set of annotator
, annotion
and you'll have the result you are looking for.
library(rjson)
myData <- fromJSON(file = "my_json_file.json")
library(purrr)
library(dplyr)
myData_df <- map_dfr(myData[[1]]$Tweets, as.tibble)
annotations_df <- map_dfr(myData_df$annotations, as.tibble)
myData_df %>%
select(-annotations) %>%
bind_cols(annotations_df)
># A tibble: 8 x 5
> tweet_id tweet_text date annotator annotation
<chr> <chr> <chr> <chr> <chr>
1 521033092132… no tweet text available no date available annotato… care
2 521033092132… no tweet text available no date available annotato… care,purity
3 521033092132… no tweet text available no date available annotato… care,purity
4 521033092132… no tweet text available no date available annotato… care
5 537681598989… Wholeheartedly support these protests & acts of civ… Wed Nov 26 18:57… annotato… subversion
6 537681598989… Wholeheartedly support these protests & acts of civ… Wed Nov 26 18:57… annotato… subversion
7 537681598989… Wholeheartedly support these protests & acts of civ… Wed Nov 26 18:57… annotato… loyalty
8 537681598989… Wholeheartedly support these protests & acts of civ… Wed Nov 26 18:57… annotato… loyalty,subv…
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With