What's the best way to store and search a database of natural language sentence structure trees?
Using OpenNLP's English Treebank Parser, I can get fairly reliable sentence structure parsings for arbitrary sentences. What I'd like to do is create a tool that can extract all the doc strings from my source code, generate these trees for all sentences in the doc strings, store these trees and their associated function name in a database, and then allow a user to search the database using natural language queries.
So, given the sentence "This uploads files to a remote machine." for the function upload_files(), I'd have the tree:
(TOP
  (S
    (NP (DT This))
    (VP
      (VBZ uploads)
      (NP (NNS files))
      (PP (TO to) (NP (DT a) (JJ remote) (NN machine))))
    (. .)))
If someone entered the query "How can I upload files?", equating to the tree:
(TOP
  (SBARQ
    (WHADVP (WRB How))
    (SQ (MD can) (NP (PRP I)) (VP (VB upload) (NP (NNS files))))
    (. ?)))
how would I store and query these trees in a SQL database?
I've written a simple proof-of-concept script that can perform this search using a mix of regular expressions and network graph parsing, but I'm not sure how I'd implement this in a scalable way.
And yes, I realize my example would be trivial to retrieve using a simple keyword search. The idea I'm trying to test is how I might take advantage of grammatical structure, so I can weed-out entries with similar keywords, but a different sentence structure. For example, with the above query, I wouldn't want to retrieve the entry associated with the sentence "Checks a remote machine to find a user that uploads files." which has similar keywords, but is obviously describing a completely different behavior.
Relational databases cannot store knowledge in a natural way, what you actually need is a knowledge base or ontology (though it may be constructed on top of relational database). It holds data in triplets <subject, predicate, object>, so your phrase will be stored as <upload_file(), upload, file>. There's a lot of tools and methods to search inside such KBs (for example, Prolog is a language that was designed to do it). So, all you have to do is to translate sentences from natural language to KB triplets/ontology graph, translate user query to incomplete triplets (your question will look like <?, upload, file>) or conjunctive queries and then search on your KB. OpenNLP will help you with translating, and the rest depends on concrete technique and technologies you decide to use.  
I agree with ffriend that you need to take a different approach that builds on existing work on knowledge bases and natural language search. Storing context-free parse trees in a relational database isn't the problem, but it is going to be very difficult to do a meaningful comparison of parse trees as part of a search. When you are just interested taking advantage of a little knowledge about grammatical relations, parse trees are really too complicated. If you simplify the parse into dependency triples, you can make the search problem much easier and get at the grammatical relations you were interested in in the first place. For instance, you could use the Stanford dependency parser, which generates a context-free parse and then extracts dependency triples from it. It produces output like this for "This function uploads files to a remote machine":
det(function-2, This-1)
nsubj(uploads-3, function-2)
dobj(uploads-3, files-4)
det(machine-8, a-6)
amod(machine-8, remote-7)
prep_to(uploads-3, machine-8)
In your database, you could store a simplified subset of these triples associated with the function, e.g.:
upload_file(): subj(uploads, function)
upload_file(): obj(uploads, file)
upload_file(): prep(uploads, machine)
When people search, you can find the function that has the most overlapping triples or something along those lines, where you probably also want to weight the different dependency relations or allow partial matches, etc. You probably also want to reduce the words in the triples to lemmas, maybe POS depending on what you need.
There are plenty of people who have worked on natural language search (like Powerset), so be sure to search for existing approaches. My proposed approach here is really minimal and I can think of tons of examples where it will have problems, but I think something along these lines could work reasonably well for a restricted domain.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With