 

Get a list of URLs from a site [closed]

Tags:

web-crawler

I'm deploying a replacement site for a client, but they don't want all their old pages to end up as 404s. Keeping the old URL structure wasn't possible because it was hideous.

So I'm writing a 404 handler that should look for an old page being requested and do a permanent redirect to the new page. Problem is, I need a list of all the old page URLs.
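For illustration, a minimal sketch of such a handler (assuming a Python/Flask front end and a hand-maintained old-to-new path map; the real site's stack may well differ) could look like this:

    from flask import Flask, redirect, request

    app = Flask(__name__)

    # Hypothetical mapping from old paths to their new equivalents.
    REDIRECTS = {
        "/old/about-us.html": "/about",
        "/old/products.php": "/products",
    }

    @app.errorhandler(404)
    def handle_404(error):
        new_path = REDIRECTS.get(request.path)
        if new_path:
            # 301 tells browsers and search engines the move is permanent.
            return redirect(new_path, code=301)
        return "Page not found", 404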

I could do this manually, but I'd be interested if there are any apps that would give me a list of relative URLs (e.g. /page/path, not http://.../page/path) given just the home page. Like a spider, but one that doesn't care about the content other than to find deeper pages.
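For illustration, a spider like that can be sketched with Python's standard library alone (the start URL is a placeholder and error handling is deliberately minimal; this is a sketch, not a polished tool):

    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    START = "http://www.oldsite.com/"  # placeholder home page

    class LinkParser(HTMLParser):
        """Collects href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    host = urlparse(START).netloc
    seen, queue = set(), ["/"]
    while queue:
        path = queue.pop()
        if path in seen:
            continue
        seen.add(path)
        try:
            html = urlopen(urljoin(START, path)).read().decode("utf-8", "replace")
        except Exception:
            continue  # skip pages that fail to load or aren't HTML
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urlparse(urljoin(urljoin(START, path), href))
            if target.netloc == host:  # stay on the same site
                queue.append(target.path or "/")

    for path in sorted(seen):
        print(path)  # relative URLs, e.g. /page/path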

Asked by Oli, May 13 '09



2 Answers

I didn't mean to answer my own question, but it occurred to me to run a sitemap generator. The first one I found, http://www.xml-sitemaps.com, has a nice plain-text output. Perfect for my needs.

Answered by Oli


Do wget -r -l0 www.oldsite.com to mirror the whole site (recursive, unlimited depth).

Then running find www.oldsite.com over the downloaded tree would reveal all the URLs, I believe.
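If the goal is a list of relative URLs rather than file paths, a small script can post-process the mirror. A minimal sketch (assuming wget saved the pages under a www.oldsite.com directory, which is its default for -r):

    import os

    MIRROR_DIR = "www.oldsite.com"  # directory created by wget -r

    paths = []
    for root, _dirs, files in os.walk(MIRROR_DIR):
        for name in files:
            full = os.path.join(root, name)
            # Strip the mirror directory prefix to get a site-relative path.
            rel = "/" + os.path.relpath(full, MIRROR_DIR).replace(os.sep, "/")
            # wget saves directory indexes as index.html; map them back to "/".
            if rel.endswith("/index.html"):
                rel = rel[: -len("index.html")]
            paths.append(rel)

    for p in sorted(paths):
        print(p)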

Alternatively, just serve that custom not-found page on every 404 request! That is, if someone follows a stale link, they get a page saying it wasn't found, along with some hints about the site's content.

Answered by alamar