MOSS 2007 Crawl

I'm trying to get crawl to work on two separate farms, but I can't get it to work on either one. Each farm has two WFEs, plus an additional WFE configured as an Index server. There is one more server dedicated to Query, and two clustered SQL 2005 back-end servers for the database. I have tried solutions from at least 50 different websites found through a search engine, without success. I have configured (extended) my Web App to use http://servername:12345 as the default zone and http://abc.companyname.com as the custom and intranet zones. When I enter each of those into the content source and then run a crawl, I get a couple of errors in the crawl log:

http://servername:12345 returns:
"Could not connect to the server. Please make sure the site is accessible."

http://abc.companyname.com returns:
"Deleted by the gatherer. (The start address or content source that contained this item was deleted and hence this item was deleted.)"

However, I can click both URLs and the pages are accessible.

Any ideas?


More info:

I wiped the slate clean, so to speak, and ran another crawl to provide an updated sample.

My content sources are as such:

http://servername:33333
http://sharepoint.portal.fake.com
sps3://servername:33333

My current crawl log errors are:

sps3://servername:33333
Error in PortalCrawl Web Service.

http://servername:33333/mysites
Content for this URL is excluded by the server because a no-index attribute.

http://servername:33333/mysites
Crawled

sts3://servername:33333/contentdbid={62a647a...
Crawled

sts3://servername:33333
Crawled

http://servername:33333
Crawled

http://sharepoint.portal.fake.com
The Crawler could not communicate with the server. Check that the server is available and that the firewall access is configured correctly.

I double-checked the above for typos and don't see any, so this should be an accurate reflection.

RJ Russell asked Dec 06 '25 04:12


2 Answers

One thing to remember is that crawling SharePoint sites is different from crawling file shares or non-SharePoint websites.

A few other quick pointers:

  • The sps3: protocol is for crawling user profiles for People Search. You can disregard any errors the crawler reports for it until you're ready to set up user profiles.
  • Your crawl account is supposed to have access to your entire farm. If you see permissions errors, find the KB article that tells you how to reset your crawl account (it's a specific stsadm.exe command). If you're trying to crawl another farm's content, you'll have to find another way to grant your crawl account access. I think this is your biggest issue at present.
  • The crawler (running from the index server) will attempt to visit the public URL. I've had inter-server communication issues before; make sure all three servers can ping each other, and make sure the index server can reach the public URL (open IE on the index server and browse to it). If you have problems, it's time to dirty up your index server's hosts file. This is something SharePoint does for you anyway, so don't feel too bad about doing it. If you've set up anything other than Integrated Windows Authentication, you'll have to work harder to get your crawler working.
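The reachability check in the last pointer can be sketched as a small script run on the index server. This is only an illustration, assuming Python 3 is available there; the function names (`host_and_port`, `can_connect`, `hosts_line`, `report`) are made up for this example, and the URLs are the placeholder addresses from the question. It tests name resolution and a plain TCP connection only, so it says nothing about authentication or crawl account permissions.

```python
import socket
from urllib.parse import urlsplit

# Default ports for start addresses that do not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_and_port(url):
    """Return (hostname, port) for a content-source start address."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme, 80)
    return parts.hostname, port


def can_connect(host, port, timeout=5.0):
    """Return True if the name resolves and a TCP connection succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False


def hosts_line(ip, hostname):
    """Format one hosts-file entry that maps hostname to ip."""
    return f"{ip}\t{hostname}"


def report(urls):
    """Print one reachability line per start address."""
    for url in urls:
        host, port = host_and_port(url)
        status = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{url} -> {host}:{port} {status}")
```

Calling `report(["http://servername:33333", "http://sharepoint.portal.fake.com"])` on the index server would show which start address fails at the network level. If the public name is the one that fails, `hosts_line(ip, hostname)` produces the entry to add to `%SystemRoot%\System32\drivers\etc\hosts`, with the IP address being whatever your WFE actually uses.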

Anyway, there's been a lot of back and forth in the responses, so I'm just shotgunning a bunch of suggestions out there. Maybe one of them is on target.

I'm a little confused about your farm topology. A machine installed as just a WFE cannot be an indexer. A machine installed as "complete" can be an indexer, a query server, and/or a WFE...

Also, instead of changing the default content access account, you may want to add a crawl rule (once everything is up and running).

Can you see if there is anything helpful in the %commonprogramfiles%/microsoft shared/web server extensions/12/logs folder on your indexer?

The log files may be a bit verbose. Searching for "started" or "full" will usually get you to the line in the log where your crawl started.
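Searching the logs for "started" or "full" can be sketched as a short script instead of scrolling by hand. This is a rough illustration, assuming Python 3 is available on the indexer; `find_crawl_lines` is a hypothetical helper, and both the `*.log` file pattern and the `utf-8` default encoding are assumptions, so adjust them if your log files turn out to be different.

```python
from pathlib import Path

# Keywords suggested above for locating the start of a crawl.
KEYWORDS = ("started", "full")


def find_crawl_lines(log_dir, keywords=KEYWORDS, encoding="utf-8"):
    """Yield (file name, line number, text) for lines containing any keyword.

    Matching is case-insensitive. Undecodable bytes are replaced rather
    than raising, because the log encoding is an assumption here.
    """
    lowered = [k.lower() for k in keywords]
    for path in sorted(Path(log_dir).glob("*.log")):
        with path.open(encoding=encoding, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                text = line.rstrip("\r\n")
                if any(k in text.lower() for k in lowered):
                    yield path.name, number, text
```

Pass it the expanded logs folder mentioned above and print each tuple it yields. These keywords are broad, so expect some unrelated matches along with the crawl start lines.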

Also, on your SQL machine, you may be able to get more information from the MSScrawlurlhistory table.

RedDeckWins answered Dec 09 '25 00:12