I'm trying to flatten a deeply/irregularly nested list/JSON object to a dataframe in R.
The key names are consistent, but the number of nested elements differ from one element to the next.
I've tried flattening the list using jsonlite and the tidyr::unnest function, but tidyr::unnest isn't able to unnest a list-column that contains multiple new columns. I've also tried using the map functions in the purrr package, but can't get anything to work.
A subset of the JSON data is below, and a list object is included at the end of this post.
[
{
"name": ["Hillary Clinton"],
"type": ["PERSON"],
"metadata": {
"mid": ["/m/0d06m5"],
"wikipedia_url": ["http://en.wikipedia.org/wiki/Hillary_Clinton"]
},
"salience": [0.2883],
"mentions": [
{
"text": {
"content": ["Clinton"],
"beginOffset": [132]
},
"type": ["PROPER"]
},
{
"text": {
"content": ["Mrs."],
"beginOffset": [127]
},
"type": ["COMMON"]
},
{
"text": {
"content": ["Clinton"],
"beginOffset": [403]
},
"type": ["PROPER"]
},
{
"text": {
"content": ["Mrs."],
"beginOffset": [398]
},
"type": ["COMMON"]
},
{
"text": {
"content": ["Hillary Clinton"],
"beginOffset": [430]
},
"type": ["PROPER"]
}
]
},
{
"name": ["Trump"],
"type": ["PERSON"],
"metadata": {
"mid": ["/m/0cqt90"],
"wikipedia_url": ["http://en.wikipedia.org/wiki/Donald_Trump"]
},
"salience": [0.245],
"mentions": [
{
"text": {
"content": ["Trump"],
"beginOffset": [24]
},
"type": ["PROPER"]
},
{
"text": {
"content": ["Mr."],
"beginOffset": [20]
},
"type": ["COMMON"]
}
]
}
]
And the desired output would be a dataframe like the one below, where the outer elements are repeated and each of the inner-most element have their own rows.
name type metadata.mid metadata.wikipedia_url salience mentions.text.content mentions.text.beginOffset mentions.type
Hillary Clinton PERSON /m/0d06m5 http://en.wikipedia.org/wiki/Hillary_Clinton 0.2883 Clinton 132 PROPER
Hillary Clinton PERSON /m/0d06m5 http://en.wikipedia.org/wiki/Hillary_Clinton 0.2883 Mrs. 127 COMMON
Hillary Clinton PERSON /m/0d06m5 http://en.wikipedia.org/wiki/Hillary_Clinton 0.2883 Clinton 403 PROPER
Hillary Clinton PERSON /m/0d06m5 http://en.wikipedia.org/wiki/Hillary_Clinton 0.2883 Mrs. 398 COMMON
Hillary Clinton PERSON /m/0d06m5 http://en.wikipedia.org/wiki/Hillary_Clinton 0.2883 Hillary Clinton 430 PROPER
Trump PERSON /m/0cqt90 http://en.wikipedia.org/wiki/Donald_Trump 0.245 Trump 24 PROPER
Trump PERSON /m/0cqt90 http://en.wikipedia.org/wiki/Donald_Trump 0.245 Mr. 20 COMMON
Is there a general/scalable way to flatten this type of data?
An R list object:
nested_list <- list(structure(list(name = "Hillary Clinton", type = "PERSON",
metadata = structure(list(mid = "/m/0d06m5", wikipedia_url = "http://en.wikipedia.org/wiki/Hillary_Clinton"), .Names = c("mid",
"wikipedia_url")), salience = 0.28831193, mentions = list(
structure(list(text = structure(list(content = "Clinton",
beginOffset = 132L), .Names = c("content", "beginOffset"
)), type = "PROPER"), .Names = c("text", "type")), structure(list(
text = structure(list(content = "Mrs.", beginOffset = 127L), .Names = c("content",
"beginOffset")), type = "COMMON"), .Names = c("text",
"type")), structure(list(text = structure(list(content = "Clinton",
beginOffset = 403L), .Names = c("content", "beginOffset"
)), type = "PROPER"), .Names = c("text", "type")), structure(list(
text = structure(list(content = "Mrs.", beginOffset = 398L), .Names = c("content",
"beginOffset")), type = "COMMON"), .Names = c("text",
"type")), structure(list(text = structure(list(content = "Hillary Clinton",
beginOffset = 430L), .Names = c("content", "beginOffset"
)), type = "PROPER"), .Names = c("text", "type")))), .Names = c("name",
"type", "metadata", "salience", "mentions")), structure(list(
name = "Trump", type = "PERSON", metadata = structure(list(
mid = "/m/0cqt90", wikipedia_url = "http://en.wikipedia.org/wiki/Donald_Trump"), .Names = c("mid",
"wikipedia_url")), salience = 0.24501903, mentions = list(
structure(list(text = structure(list(content = "Trump",
beginOffset = 24L), .Names = c("content", "beginOffset"
)), type = "PROPER"), .Names = c("text", "type")), structure(list(
text = structure(list(content = "Mr.", beginOffset = 20L), .Names = c("content",
"beginOffset")), type = "COMMON"), .Names = c("text",
"type")))), .Names = c("name", "type", "metadata", "salience",
"mentions")))
One approach:
map_df(nested_list, function(x) {
df <- flatten_df(x[c("name", "type", "metadata", "salience")])
map_df(x$mentions, ~c(as.list(.$text), mentions_type=.$type)) %>%
mutate(name=df$name, type=df$type, mid=df$mid,
wikipedia_url=df$wikipedia_url, salience=df$salience)
}) %>% glimpse()
## Observations: 7
## Variables: 8
## $ content <chr> "Clinton", "Mrs.", "Clinton", "Mrs.", "Hillary Clinton", "Trump", "Mr."
## $ beginOffset <int> 132, 127, 403, 398, 430, 24, 20
## $ mentions_type <chr> "PROPER", "COMMON", "PROPER", "COMMON", "PROPER", "PROPER", "COMMON"
## $ name <chr> "Hillary Clinton", "Hillary Clinton", "Hillary Clinton", "Hillary Clinton", "Hillary Clinton", "Trump", "Trump"
## $ type <chr> "PERSON", "PERSON", "PERSON", "PERSON", "PERSON", "PERSON", "PERSON"
## $ mid <chr> "/m/0d06m5", "/m/0d06m5", "/m/0d06m5", "/m/0d06m5", "/m/0d06m5", "/m/0cqt90", "/m/0cqt90"
## $ wikipedia_url <chr> "http://en.wikipedia.org/wiki/Hillary_Clinton", "http://en.wikipedia.org/wiki/Hillary_Clinton", "http://en.wikipedia.org/wiki/Hillary_Clinton", "http://en.wikiped...
## $ salience <dbl> 0.2883119, 0.2883119, 0.2883119, 0.2883119, 0.2883119, 0.2450190, 0.2450190
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With