I have a graph database which looks like this (simplified) diagram:

Each unique ID has many properties, which are represented as edges from the ID to unique values of that property. Basically that means that if two ID nodes have the same email, then their has_email edges will both point to the same node. In the diagram, the two shown IDs share both a first name and a last name.
I'm having difficulty writing an efficient Gremlin query to find matching IDs, for a given set of "matching rules". A matching rule will consist of a set of properties which must all be the same for IDs to be considered to have come from the same person. The query I'm currently using to match people based on their first name, last name, and email looks like:
g.V().match(
__.as("id").hasId("some_id"),
__.as("id")
.out("has_firstName")
.in("has_firstName")
.as("firstName"),
__.as("id")
.out("has_lastName")
.in("has_lastName")
.as("lastName"),
__.as("id")
.out("has_email")
.in("has_email")
.as("email"),
where("firstName", eq("lastName")),
where("firstName", eq("email")),
where("firstName", neq("id"))
).select("firstName")
The query returns a list of IDs which match the input some_id.
When this query tries to match an ID with a particularly common first name, it becomes very, very slow. I suspect that the match step is the problem, but I've struggled to find an alternative with no luck so far.
The performance of this query will depend on the edge degrees in your graph. Since many people share the same first name, you will most likely have a huge amount of edge going into a specific firstName vertex.
You can make assumptions, like: there are fewer people with the same last name than people with the same first name. And of course, there should be even fewer people who share the same email address. With that knowledge you just can start to traverse to the vertices with the lowest degree first and then filter from there:
g.V().hasId("some_id").as("id").
out("has_email").in("has_email").where(neq("id")).
filter(out("has_lastName").where(__.in("has_lastName").as("id"))).
filter(out("has_firstName").where(__.in("has_firstName").as("id")))
With that, the performance will mostly depend on the vertex with the lowest edge degree.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With