I am trying to crawl a website, specifically a Google Site protected by SAML authentication, using ManifoldCF, and to index the crawled data into Apache Solr. But when I crawl the URL, I get a 302 redirect to the login page, and the job then reports RESPONSECODENOTINDEXABLE.
I am not sure whether I have authenticated correctly. ManifoldCF offers HTTP basic authentication, NTLM authentication, and session-based access credentials. I used the session-based method, which looks more like form-based authentication than SAML authentication.
Has anybody crawled a SAML-protected website using ManifoldCF? And if not ManifoldCF, has anyone been able to accomplish this with Apache Nutch? As far as I can tell, it also provides only HTTP basic, Digest, and NTLM authentication.
Any insight would be helpful. I can provide more information about the issue if anyone here thinks it can be accomplished. Basically, when I crawl https://sites.google.com/a/my-sub-domain.com, it redirects to the SSO login page and the crawler refuses to go any further, reporting a 302 error. It's an intranet site.
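To confirm that the 302 really is an SSO bounce and not an ordinary on-site redirect, one can compare the host in the `Location` header with the host being crawled. The following is only a rough sketch: `is_login_redirect` is a hypothetical helper of my own, not part of ManifoldCF or Nutch, and the identity-provider host used in the usage note is a placeholder.

```python
from urllib.parse import urlparse

# HTTP status codes that carry a Location header pointing elsewhere.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def is_login_redirect(status, location, site_host):
    """Return True when a response redirects away from the crawled host.

    Assumption: a redirect to a different host is treated as an SSO/login
    redirect. Real deployments may need a stricter check.
    """
    if status not in REDIRECT_CODES or not location:
        return False
    target_host = urlparse(location).hostname
    # A relative Location has no hostname and stays on the same site.
    return target_host is not None and target_host.lower() != site_host.lower()
```

For example, `is_login_redirect(302, "https://idp.example.com/sso?SAMLRequest=abc", "sites.google.com")` would be true, while a relative redirect such as `/a/my-sub-domain.com/home` would not.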
There is no support in Nutch for SSO authentication using SAML. You need to handle it by writing a custom plugin. We have extended the protocol-selenium plugin to handle SAML flows.
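To give an idea of what any custom handler has to deal with: with the SAML HTTP-POST binding, after login the identity provider returns an HTML page containing a form with a hidden `SAMLResponse` field (and usually `RelayState`) that must be posted back to the service provider. The sketch below is not our Selenium plugin and not a Nutch API; it is a minimal standard-library illustration of extracting that form, and `SamlFormParser` and `extract_saml_post` are hypothetical names.

```python
from html.parser import HTMLParser


class SamlFormParser(HTMLParser):
    """Collects the first form's action URL and all hidden input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def extract_saml_post(html):
    """Return (action_url, fields) if the page is a SAML POST form, else None."""
    parser = SamlFormParser()
    parser.feed(html)
    if parser.action and "SAMLResponse" in parser.fields:
        return parser.action, parser.fields
    return None
```

A crawler-side handler would then POST the returned fields to the action URL and keep the resulting session cookies for later requests. Login pages that depend on JavaScript will not work with plain form parsing, which is one reason to drive a real browser through Selenium instead.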