wget - How do I download web pages based on a URL pattern?

2024-06-24

Consider a website, www.music.com, with the following directory structure.

/piano
   /covers
     /Chopin 
        apple.html
        bannan.js
        balloon.html
        index.html
     /Franz Liszt
        index.html
        roses.js
        Love Dream.html

     /Frodo
        index.html
        linkenso.html

/violin
   /covers
      /David
         Viva.html
      /Ross
         index.html

I want to fetch the index.html files from the nested directories under music.com/piano/covers whose subdirectory names start with "Fr". In the example above, that means downloading only two files:

www.music.com/piano/covers/Franz Liszt/index.html
www.music.com/piano/covers/Frodo/index.html

I tried the following with wget:

$ wget \
   --mirror \
   --header="Accept: text/html" \
   --page-requisites \
   --html-extension \
   --convert-links \
   --restrict-file-names=windows \
   --domains=www.music.com/piano/covers \
   --accept-regex='/piano/covers/Fr.*/index.html' \
        http://www.music.com

I ran the same thing against my own site, but the only file I got was the wrong one:

www.music.com/index.html

Why I used the options above:

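--recursive or -r: not used, because it kept running into errors. Also, --page-requisites seemed the better choice, since I don't need the server to hand me everything.
--domains: makes sure nothing outside the specified URL is downloaded. This is needed here because I don't want any resources from outside the piano/covers folder.
--header: I want to override Accept: */* so that my request does not ask for everything.
--html-extension: download HTML files only.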
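
One thing that may be worth checking about the --domains line above: the wget manual describes its value as a comma-separated list of domains, not URL paths. The sketch below is only a simplified model of a host suffix check, not wget's actual code. The helper name matches_domain and the sample URL are assumptions made for illustration.

```shell
#!/bin/sh
# Simplified model (an assumption, not wget's real implementation):
# a host is accepted when it ends with the value given to --domains.
matches_domain() {
  # $1 = host name, $2 = one value from the --domains list
  case "$1" in
    *"$2") return 0 ;;
    *)     return 1 ;;
  esac
}

# Hypothetical URL modeled on the example tree.
url='http://www.music.com/piano/covers/Frodo/index.html'

# Extract the host part: everything between "://" and the next "/" or ":".
host=$(printf '%s\n' "$url" | sed -E 's#^[a-z]+://([^/:]+).*#\1#')

for value in 'www.music.com/piano/covers' 'www.music.com'; do
  if matches_domain "$host" "$value"; then
    echo "--domains=$value: host $host accepted"
  else
    echo "--domains=$value: host $host rejected"
  fi
done
```

Under this model, a value that contains a path can never match a bare host name, which would be one possible reason why nothing beyond the start page is followed.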

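The --accept-regex part does not seem to be taken into account at all. Still, I would rather use it than -A, because the files I need are spread across several directories. Any ideas on how to get just those two files with the --accept-regex parameter?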
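
The wget manual describes --accept-regex as being matched against the complete URL, so the pattern can be tried offline before running a crawl. The sketch below uses grep -E as a stand-in for wget's POSIX regex matching. The URL list is hypothetical, modeled on the example tree, with the space in "Franz Liszt" written as %20 on the assumption that it would be percent-encoded. The pattern is a slightly tightened variant of the one in the command, not the original.

```shell
#!/bin/sh
# Hypothetical candidate URLs, modeled on the example directory tree.
urls='http://www.music.com/index.html
http://www.music.com/piano/covers/Chopin/index.html
http://www.music.com/piano/covers/Franz%20Liszt/index.html
http://www.music.com/piano/covers/Franz%20Liszt/roses.js
http://www.music.com/piano/covers/Frodo/index.html
http://www.music.com/piano/covers/Frodo/linkenso.html
http://www.music.com/violin/covers/Ross/index.html'

# Variant of the pattern from the command: the dot is escaped, the match
# is anchored at the end, and "Fr..." may not span several directories.
pattern='/piano/covers/Fr[^/]*/index\.html$'

# Keep only the URLs that the pattern accepts.
matched=$(printf '%s\n' "$urls" | grep -E "$pattern")
printf '%s\n' "$matched"
```

Only the two wanted index.html URLs pass this filter. The start URL and any intermediate listing pages do not match it, so if wget applies the same filter to the links it would follow, it may never reach the deeper pages; that part is a guess worth checking against the manual.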

Edit 1:

The URLs used in the example above return a 404 error, so here is the context for my own website, where I am actually trying to do this.

Directory structure of www.ajayhalthor.com:

/piano
   /nightwish-sahara
   /nightwish-amaranth
   /skillet-hero
   /skillet-the-last-night
   /breaking-benjamin-diary-of-jane
   /skillet-comatose
   /one-republic-counting-stars
   /skillet-falling-inside-the-black
   /63/index.html
   /a/few/more/links/index.html

/about
   /other/links/index.html
/Home
   /main/links/index.html

Given this structure, I want to fetch the pages under www.ajayhalthor.com/piano whose names start with "sk", that is, the following:

www.ajayhalthor.com/piano/skillet-hero
www.ajayhalthor.com/piano/skillet-the-last-night
www.ajayhalthor.com/piano/skillet-comatose
www.ajayhalthor.com/piano/skillet-falling-inside-the-black

I run the following command:

$ wget \
   --mirror \
   --header="Accept: text/html" \
   --page-requisites \
   --html-extension \
   --convert-links \
   --restrict-file-names=windows \
   --domains=www.ajayhalthor.com/piano \
   --accept-regex="piano/sk.*" \
        http://www.ajayhalthor.com

I get the following output:

Resolving www.ajayhalthor.com (www.ajayhalthor.com)... 23.229.213.7
Connecting to www.ajayhalthor.com (www.ajayhalthor.com)|23.229.213.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.ajayhalthor.com/index.html’

www.ajayhalthor.com/index.html              [ <=>                                                                          ]  24.28K  --.-KB/s    in 0.1s    

Last-modified header missing -- time-stamps turned off.
2017-01-19 01:56:11 (245 KB/s) - ‘www.ajayhalthor.com/index.html’ saved [24862]

FINISHED --2017-01-19 01:56:11--
Total wall clock time: 1.4s
Downloaded: 1 files, 24K in 0.1s (245 KB/s)
Converting links in www.ajayhalthor.com/index.html... 11-1
Converted links in 1 files in 0.003 seconds.

Only one file, www.ajayhalthor.com/index.html, was downloaded. Am I using --accept-regex correctly?
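
If the recursive crawl cannot be made to reach those pages, one possible fallback is to build the list of wanted URLs explicitly and hand it to wget. The sketch below is only an illustration under that assumption: the page names are copied from the directory listing above, and the variable names are made up for the example.

```shell
#!/bin/sh
# Base URL of the section to fetch from (the site named in the question).
base='http://www.ajayhalthor.com/piano'

# Page names copied from the directory listing in the question.
names='nightwish-sahara
nightwish-amaranth
skillet-hero
skillet-the-last-night
breaking-benjamin-diary-of-jane
skillet-comatose
one-republic-counting-stars
skillet-falling-inside-the-black'

# Keep only the names starting with "sk" and turn them into full URLs.
url_list=$(printf '%s\n' "$names" | grep -E '^sk' | sed "s#^#$base/#")
printf '%s\n' "$url_list"

# The list could then be piped to wget, which reads URLs from standard
# input when the input file is "-", for example:
#   printf '%s\n' "$url_list" | wget --page-requisites --convert-links --input-file=-
```

This avoids recursion filters altogether, at the cost of having to know the page names in advance.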
