Thread: wget problem
02-19-2008, 05:20 PM
Mark Hill
 
Re: wget problem

On 16 Dec 2004 03:58:21 -0800,
guruteck@gmail.com <guruteck@gmail.com> wrote:
> Thanx u very much .It works greatly for me..
> But I didnt completely understand why it is working for me now??
> I didnt get complete information from man pages


The man page doesn't say much about robots.txt. (Perhaps this is
intentional.) It's worth reading /etc/wgetrc, as that will give you
some more ideas about what wget can do.

If the '-erobots=off' option (that is, '-e robots=off') worked for you,
it told wget to ignore the http://some.example.com/robots.txt file on
the website. robots.txt is part of the Robots Exclusion Standard, which
well-behaved web robots (like wget) follow. It tells web robots which
part(s) of the site should not be downloaded or indexed. There is more
information on the robotstxt site:
<http://www.robotstxt.org/>
<http://www.robotstxt.org/wc/norobots.html#introduction>
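
To make that concrete, here is a rough sketch of what a robots.txt might
contain and what '-e robots=off' changes. The file contents and the
/private/ path are made up for illustration. As far as I know, wget only
consults robots.txt during recursive (-r) downloads, which is why -r
appears in the commented command.

```shell
# A made-up robots.txt, written to a temp file purely for illustration.
robots_file=$(mktemp)
cat > "$robots_file" <<'EOF'
User-agent: *
Disallow: /private/
EOF

# Extract the paths that compliant robots are asked to stay out of.
disallowed=$(sed -n 's/^Disallow: *//p' "$robots_file")
echo "Disallowed: $disallowed"

# With robots=off, a recursive wget would no longer skip /private/
# (not run here, since it needs a real site):
#   wget -r -e robots=off http://some.example.com/
rm -f "$robots_file"
```

Please use robots=off with some care; site owners usually have a reason
for the paths they list.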

If the '-U' option worked for you, the website you're downloading from
is blocking requests from any client that identifies itself as
"Wget/1.9.1" in its User-Agent header. The -U option lets wget identify
itself as another client, such as Firefox:
wget -U \
"Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0" \
http://some.example.com
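
Since /etc/wgetrc came up earlier, here is a sketch of how both settings
could be kept in a personal wgetrc rather than typed each time. The
temp-file location is only for illustration (normally this would be
~/.wgetrc), and I'm assuming your wget honours the WGETRC environment
variable and the 'robots' and 'user_agent' wgetrc commands, so check the
documentation for your version.

```shell
# Write a throwaway wgetrc; normally this would live in ~/.wgetrc.
rc=$(mktemp)
cat > "$rc" <<'EOF'
# same effect as -e robots=off
robots = off
# same effect as -U "..."
user_agent = Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
EOF

# Show the active (non-comment) settings.
settings=$(grep -v '^#' "$rc")
echo "$settings"

# To try it, point wget at the file via WGETRC before it is removed
# (not run here, since it needs a real site):
#   WGETRC="$rc" wget -r http://some.example.com/
rm -f "$rc"
```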

--
Mark Hill