Is there a simpler version of this cypher query?

Question

I have constructed a query to find the people who follow each other and who have read books in the same genre. Here it is:

MATCH (u1:User)-[:READ]->(b1:Book)
WITH collect(DISTINCT b1.genre) AS genres,u1 AS user1
MATCH (u2:User)-[:READ]->(b2:Book)
WHERE (user1)<-[:FOLLOWS]->(u2) AND b2.genre IN genres
RETURN DISTINCT user1.username AS user1,u2.username AS user2

The idea is that we collect all the book genres for one of them, and if a book read by the other is in that list of genres (and they follow each other), then we return those users. This seems to work: we get a list of distinct pairs of individuals. I wonder, though, whether there is a quicker way to do this. My solution seems somewhat clumsy, but I found it surprisingly finicky to specify that they have read a book in the same genre without getting back all the pairs of books and duplicating individuals. For example, I first wrote the following:

MATCH (b1:Book)<-[:READ]-(u1:User)-[:FOLLOWS]-(u2:User)-[:READ]->(b2:Book)
WHERE b1.genre = b2.genre
RETURN DISTINCT u1.username AS user1, u2.username AS user2

Which seems simpler, but in fact it returned repeated names for all the books that were read in the same genre. Is my solution the simplest, or is there a simpler one?

Christophe Willemsen · Accepted Answer

This is one way of rewriting the query:

MATCH (n1:User)-[:FOLLOWS]-(n2:User)
MATCH (n1)-[:READ]->(book), (n2)-[:READ]->(book2)
WHERE book.genre = book2.genre
RETURN n1.username, n2.username, count(*)

Here is another, which collects the genres for each user:

MATCH (n1:User)-[:FOLLOWS]-(n2:User)
WITH n1, n2, 
[(n1)-[:READ]->(book) | book.genre] AS g1,
[(n2)-[:READ]->(book) | book.genre] AS g2
WHERE ANY(x IN g1 WHERE x IN g2)
RETURN n1, n2, count(*)

Note that the length of a query is not, by itself, a measure of its quality: what matters is that the way the data is retrieved makes sense to you.

Your model, however, clearly shows that you would benefit from a bit of graph refactoring: extracting the genre into its own node, for example:

MATCH (n:Book)
MERGE (g:Genre {name: n.genre})
MERGE (n)-[:HAS_GENRE]->(g)
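
A slightly more defensive sketch of the same refactoring follows. It assumes Neo4j 4.4 or later for the constraint syntax, and the constraint name genre_name_unique is made up for illustration. The constraint keeps Genre names unique, and the IS NOT NULL filter skips books that have no genre property, since MERGE cannot match on a null property value.

```cypher
// Assumption: Neo4j 4.4+ (older versions use ASSERT ... IS UNIQUE instead of REQUIRE).
// genre_name_unique is a hypothetical constraint name.
CREATE CONSTRAINT genre_name_unique IF NOT EXISTS
FOR (g:Genre) REQUIRE g.name IS UNIQUE;

// Run as a separate statement: create one Genre node per distinct genre value
// and link every book that has a genre to it.
MATCH (b:Book)
WHERE b.genre IS NOT NULL
MERGE (g:Genre {name: b.genre})
MERGE (b)-[:HAS_GENRE]->(g);
```

Because both steps use MERGE, running the refactoring a second time should not create duplicate Genre nodes or HAS_GENRE relationships.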

And this would be the new query, which leverages the refactored graph model:

PROFILE
MATCH (n1:User)-[:FOLLOWS]-(n2:User)
WHERE (n1)-[:READ]->()-[:HAS_GENRE]->()<-[:HAS_GENRE]-()<-[:READ]-(n2)
RETURN n1.username, n2.username, count(*)
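
In the query above, count(*) counts the FOLLOWS matches per pair rather than anything about the books. If what you want is the number of genres the two users have in common, a possible variant (a sketch only, assuming the Genre refactoring has been applied) is:

```cypher
// Bind each genre read by n1, keep it only if n2 has also read a book in it,
// then count the distinct genres per pair.
MATCH (n1:User)-[:FOLLOWS]-(n2:User)
MATCH (n1)-[:READ]->(:Book)-[:HAS_GENRE]->(g:Genre)
WHERE (n2)-[:READ]->(:Book)-[:HAS_GENRE]->(g)
RETURN n1.username, n2.username, count(DISTINCT g) AS sharedGenres
```

Since the FOLLOWS pattern is undirected, each pair still shows up in both orders here.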

cybersam · Answer

As was suggested by @ChristopheWillemsen, you should consider creating unique Genre nodes and adding a relationship between each Book and its Genre.

Not only would that make your data model "more graphy" by directly storing (and making visible) the relationships between genres and books, but it can optimize your use case as well.

Here is a fast query that returns the names of all unique pairs of users who read at least one book with the same genre.

MATCH (u1:User)-[:FOLLOWS]-(u2:User)
WHERE
  ID(u1) < ID(u2) AND
  (u1)-[:READ]->()-[:HAS_GENRE]->()<-[:HAS_GENRE]-()<-[:READ]-(u2)
RETURN DISTINCT u1.username, u2.username

Explanations (there is a lot going on):

The above MATCH relationship pattern is "undirected" (specifies no relationship direction), so it will match a relationship in either direction. That is well and good, but a symmetrical undirected relationship pattern (where both end nodes have the same node pattern, or at least one node pattern is just ()) causes the same pair of nodes to be returned twice (except in opposite order). For your use case, presumably you do not want to treat 'Alice'/'Bob' and 'Bob'/'Alice' as different pairs of users.
- (A) One potential way to fix this is to use a directional relationship pattern instead. This is not acceptable for all use cases, but will work for yours. (However, if u2 also FOLLOWS u1, then you will still get duplicate pairs.) The DISTINCT option is only needed here if it is possible for there to be multiple FOLLOWS relationships in the same direction.
```
MATCH (u1:User)-[:FOLLOWS]->(u2:User)
WHERE (u1)-[:READ]->()-[:HAS_GENRE]->()<-[:HAS_GENRE]-()<-[:READ]-(u2)
RETURN DISTINCT u1.username, u2.username
```
- (B) The query presented at the top of this answer uses a different method of enforcing the order of the returned nodes -- that is, by native ID. With that query the DISTINCT option is probably wise to keep, since it will eliminate duplicate pairs no matter how many FOLLOWS relationships exist between the same 2 nodes, in either direction. When profiling (A) and (B) with my own test data, (B) turned out to use slightly fewer DB hits. But you should profile (A) and (B) yourself to see which one is better with your own actual data, if (A) is at all acceptable.
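
A variant of (B) that does not depend on native IDs is to order each pair by a property instead. The sketch below assumes that username is unique per User, which the question does not state. It may be worth considering on Neo4j 5, where the id() function is deprecated.

```cypher
// Assumption: username is unique per User. Two distinct users sharing a
// username would be dropped by the strict < comparison.
MATCH (u1:User)-[:FOLLOWS]-(u2:User)
WHERE
  u1.username < u2.username AND
  (u1)-[:READ]->()-[:HAS_GENRE]->()<-[:HAS_GENRE]-()<-[:READ]-(u2)
RETURN DISTINCT u1.username, u2.username
```

As with (A) and (B), profile it against your own data before preferring it.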
The WHERE expression (u1)-[:READ]->()-[:HAS_GENRE]->()<-[:HAS_GENRE]-()<-[:READ]-(u2) is a path pattern. A path pattern expression in a WHERE clause is a predicate that evaluates to true if and only if the pattern is found at least once. As soon as a single instance of the pattern is found, no effort is made to look further, so it is very efficient when the path pattern can have multiple matches.
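
One caveat, as far as I understand Cypher's matching rules: a relationship cannot be used twice within a single pattern. If the only same-genre book two users have in common is literally the same book, the single long pattern would need that book's HAS_GENRE relationship twice and so would not match. A sketch that avoids this by splitting the check into two patterns, using an existential subquery (available from Neo4j 4.0), could look like this:

```cypher
// The subquery binds a genre g read by u1; the inner WHERE checks, as a
// separate pattern, that u2 has read some book (possibly the same one) in g.
MATCH (u1:User)-[:FOLLOWS]-(u2:User)
WHERE
  ID(u1) < ID(u2) AND
  EXISTS {
    MATCH (u1)-[:READ]->(:Book)-[:HAS_GENRE]->(g:Genre)
    WHERE (u2)-[:READ]->(:Book)-[:HAS_GENRE]->(g)
  }
RETURN DISTINCT u1.username, u2.username
```

Like the pattern predicate, the subquery only has to establish that at least one match exists. Whether it is faster or slower than the single-pattern form is something to check with PROFILE.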

Tags: neo4j, cypher

Dan Öz
