In my project there is a module that takes a URL and determines whether it belongs to an "Ecommerce" or "Non-Ecommerce" website.
I have tried the following approach:
Using Apache Mahout classification: URL ---> take the HTML dump ---> pre-process the HTML dump by
a) removing all HTML tags,
b) removing stop words (a.k.a. common words) like CDATA, href, value, and, of, between, etc.,
c) training the model and then testing it.
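Steps a) and b) could be sketched roughly as follows. This is only an illustration in Python using the standard library, not the code actually used with Mahout; the `STOP_WORDS` set, the `TextExtractor` class and the `preprocess` function are hypothetical names, and skipping `<script>`/`<style>` contents is an extra assumption.

```python
import re
from html.parser import HTMLParser

# Hypothetical stop-word list: a few common English words plus the
# markup residue mentioned above. A real list would be much longer.
STOP_WORDS = {"cdata", "href", "value", "and", "of", "between", "the", "a", "to", "in"}


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def preprocess(html):
    """Strip tags, lowercase, tokenize, and drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Keep '$' so that price-like tokens are not lost during tokenization.
    tokens = re.findall(r"[a-z0-9$]+", text)
    return [t for t in tokens if t not in STOP_WORDS]
```

The resulting token list would then be written out, one document per line, in whatever input format the trainer expects.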
I used the following parameters for training:
bin/mahout trainclassifier \
  -i training-data \
  -o bayes-model \
  -type bayes -ng 1
Testing:
/bin/mahout testclassifier \
  -d test-data \
  -m bayes-model \
  -type bayes -source hdfs -ng 1 -method sequential
I am getting 73% accuracy with the bayes algorithm and 52% with the cbayes algorithm.
I am thinking of improving the pre-processing stage by extracting information that is typically found on ecommerce websites, like a "Checkout" button, a PayPal link, prices / dollar symbols, and text like "Cash on delivery", "30 day guarantee", etc.
Any suggestions on how to extract this information, or any other ways to predict whether a site is Ecommerce or Non-Ecommerce?
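One simple way to pull out cues of this kind is a set of regular expressions run over the raw HTML, before the tags are stripped, so that links such as a PayPal URL are still visible. The sketch below is only an assumption about how this might look; `CUE_PATTERNS` and `ecommerce_features` are hypothetical names, and the patterns are illustrative rather than exhaustive.

```python
import re

# Hypothetical cue patterns; each one targets a signal named in the question.
CUE_PATTERNS = {
    "checkout": re.compile(r"\bcheck\s?out\b", re.I),
    "cart": re.compile(r"\b(?:add to cart|shopping cart|add to basket)\b", re.I),
    "paypal": re.compile(r"pay\s?pal", re.I),
    "price": re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?"),
    "cash_on_delivery": re.compile(r"\bcash on delivery\b", re.I),
    "guarantee": re.compile(r"\b\d+[- ]day (?:money[- ]back )?guarantee\b", re.I),
}


def ecommerce_features(html):
    """Count how often each cue occurs in the raw HTML of a page."""
    return {name: len(pattern.findall(html)) for name, pattern in CUE_PATTERNS.items()}


if __name__ == "__main__":
    sample = (
        '<a href="https://www.paypal.com">PayPal</a> '
        "<button>Checkout</button> <span>$19.99</span> <p>30 day guarantee</p>"
    )
    print(ecommerce_features(sample))
```

The counts could be used as extra numeric features, or turned into marker tokens and appended to the document text so that a bag-of-words classifier can pick them up.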
I am quite astonished that you get such good accuracy with just plain HTML extraction and a Bayes classifier.
But you seem to be on the right track with features like a checkout button and prices.
Here is a paper I found yesterday while reading about Yandex:
"To find out or to buy? Product review vs. Web shop classifier"
It is about how to distinguish these two kinds of sites and some of the techniques they used. They also used an SVM instead of naive Bayes.