
Smalltalk and tf-idf algorithm

Can anyone show a simple implementation or usage example of a tf-idf algorithm in Smalltalk for natural language processing? I've found an implementation in a package called NaturalSmalltalk, but it seems too complicated for my needs. A simple implementation in Python is like this one.

I've noticed there is another tf-idf implementation in Hapax, but it seems geared toward analyzing the vocabularies of software systems, and I didn't find any examples of how to use it.

user1000565 asked Mar 08 '26 12:03



2 Answers

I am the author of the original Hapax package for VisualWorks. Hapax is a general-purpose information retrieval package, so it should be able to work with any kind of text file. It just so happens that I used it to analyze source code files.

The class you are looking for is TermDocumentMatrix. It should have two methods, globalWeighting: and localWeighting:, to which you pass an instance of InverseDocumentFrequency and an instance of either LogTermFrequency or TermFrequency, respectively, depending on your needs. When people refer to tf-idf, they typically mean the variant with logarithmic term frequencies.
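For readers who want to see what the two weightings compute, here is a minimal, language-neutral sketch in Python. It is not Hapax code and does not use the Hapax API. The function name `tf_idf`, the whitespace tokenizer, and the exact formulas (local weight `1 + ln(count)`, global weight `ln(N / df)`) are assumptions chosen for illustration. Hapax's LogTermFrequency and InverseDocumentFrequency may use different variants.

```python
import math
from collections import Counter


def tf_idf(documents, log_tf=True):
    """Return one {term: weight} dict per document (illustrative sketch)."""
    # Assumption: naive lowercase/whitespace tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = {}
        for term, count in counts.items():
            # Local weighting: logarithmic or raw term frequency.
            local = 1.0 + math.log(count) if log_tf else float(count)
            # Global weighting: inverse document frequency.
            global_weight = math.log(n_docs / df[term])
            row[term] = local * global_weight
        weights.append(row)
    return weights


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
weights = tf_idf(corpus)
```

Under these assumed formulas, a term that occurs in every document gets a weight of zero, and a term unique to one document gets the largest global weight. Passing `log_tf=False` switches the local weighting to raw counts, which mirrors the choice between LogTermFrequency and TermFrequency described above.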

There should be tests demonstrating the TermDocumentMatrix class on a small example corpus. If the tests have not been ported to Squeak, please let me know so I can provide you with an example.

akuhn answered Mar 11 '26 09:03



TextLint is a system based on PetitParser for parsing and matching patterns in natural language. It doesn't provide what you ask for, but it shouldn't be too difficult to extend the model to compute word frequencies.

Lukas Renggli answered Mar 11 '26 08:03



