I have written the code below using the Stanford NLP packages.
GenderAnnotator myGenderAnnotation = new GenderAnnotator();
myGenderAnnotation.annotate(annotation);
But for the sentence "Annie goes to school", it cannot identify the gender of Annie.
The output of the application is:
[Text=Annie CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NNP Lemma=Annie NamedEntityTag=PERSON]
[Text=goes CharacterOffsetBegin=6 CharacterOffsetEnd=10 PartOfSpeech=VBZ Lemma=go NamedEntityTag=O]
[Text=to CharacterOffsetBegin=11 CharacterOffsetEnd=13 PartOfSpeech=TO Lemma=to NamedEntityTag=O]
[Text=school CharacterOffsetBegin=14 CharacterOffsetEnd=20 PartOfSpeech=NN Lemma=school NamedEntityTag=O]
[Text=. CharacterOffsetBegin=20 CharacterOffsetEnd=21 PartOfSpeech=. Lemma=. NamedEntityTag=O]
What is the correct approach to get the gender?
If your named entity recognizer outputs PERSON for a token, you might use (or build, if you don't have one) a gender classifier based on first names. As an example, see the Gender Identification section of the NLTK library tutorial pages. They use the following features:
That said, I have a hunch that using character n-gram frequencies (possibly up to character trigrams) will give you pretty good results.
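To make the n-gram idea concrete, here is a minimal sketch of a feature extractor that collects every character n-gram up to trigrams. The function name char_ngram_features, the '^' and '$' padding markers, and the 'ng_' key prefix are assumptions made for illustration; it records presence rather than frequency, and it is untested against any particular corpus.

```python
def char_ngram_features(name, max_n=3):
    """Return presence features for all character n-grams up to max_n.

    The name is padded with '^' and '$' (an illustrative choice) so that
    n-grams at the start or end of the name stay distinguishable from
    the same letters occurring in the middle.
    """
    padded = '^' + name.lower() + '$'
    features = {}
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            features['ng_' + padded[i:i + n]] = True
    return features


if __name__ == '__main__':
    # Print the n-gram keys extracted for the name from the question.
    print(sorted(char_ngram_features('Annie')))
```

The resulting dictionary has the same shape as the feature dictionaries NLTK classifiers are trained on, so it could be swapped in for a hand-picked feature function.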
There are many approaches, and one of them is outlined in the NLTK cookbook.
Basically, you build a classifier that extracts some features from a name (first letter, last letter, first two letters, last two letters and so on) and makes a prediction based on those features.
import random

import nltk

# Requires the NLTK names corpus: nltk.download('names')

def extract_features(name):
    # Prefix and suffix features of the lowercased name.
    name = name.lower()
    return {
        'last_char': name[-1],
        'last_two': name[-2:],
        'last_three': name[-3:],
        'first': name[0],
        'first2': name[:2],  # first two letters (was name[:1])
    }

f_names = nltk.corpus.names.words('female.txt')
m_names = nltk.corpus.names.words('male.txt')

all_names = [(i, 'm') for i in m_names] + [(i, 'f') for i in f_names]
random.shuffle(all_names)

# Train on the first 500 shuffled names and test on the rest.
test_set = all_names[500:]
train_set = all_names[:500]

test_set_feat = [(extract_features(n), g) for n, g in test_set]
train_set_feat = [(extract_features(n), g) for n, g in train_set]

classifier = nltk.NaiveBayesClassifier.train(train_set_feat)

print(nltk.classify.accuracy(classifier, test_set_feat))
This basic test gives approximately 77% accuracy; the exact figure varies from run to run because the names are shuffled randomly.
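To connect this back to the question, one possible way to use such a classifier is to apply it only to tokens that the named entity recognizer tagged as PERSON. The sketch below assumes the annotated tokens have already been exported as (text, NER tag) pairs; the function name genders_for_persons and the classify_name callable are hypothetical, not part of Stanford NLP or NLTK.

```python
def genders_for_persons(tokens, classify_name):
    """Map each PERSON token to the label returned by classify_name.

    tokens: iterable of (text, ner_tag) pairs (assumed export format).
    classify_name: any callable taking a name and returning a label.
    """
    return {text: classify_name(text)
            for text, tag in tokens
            if tag == 'PERSON'}


# Tokens and NER tags copied from the output shown in the question.
tokens = [('Annie', 'PERSON'), ('goes', 'O'), ('to', 'O'),
          ('school', 'O'), ('.', 'O')]

# With the classifier trained above, the call might look like:
# genders_for_persons(
#     tokens, lambda n: classifier.classify(extract_features(n)))
```

Passing the classifier in as a callable keeps the filtering step independent of how the prediction itself is made.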