Matching a fixed number of digits when crawling web content

2024-06-20

I am trying to parse the source of some web pages and find every href that looks like this:

href='http://example.org/index.php?showtopic=509480

Here the number after showtopic= is random, but it is always exactly six digits (e.g. 123456 - 654321).

while read -r line
do
    source=$(curl -L "$line") # is this the right way to parse the source?
    printf '%s\n' "$source" | grep "href='http://example.org/index.php?showtopic=" >> output.txt
done < file.txt # file.txt contains a list of web pages

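If I don't know which number it will be, how can I catch all of the matching lines? Can I run a second grep with a regular expression? I was thinking of using a range in awk, something like this: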

awk "'/href='http://example.org/index.php?showtopic=/,/^\s/'" >> file.txt

Or chaining two greps, like this:

grep "href='http://example.org/index.php?showtopic=" | grep -e ^[0-9]{1,6}$ >> output.txt

Best answer 1

cat input.txt |grep "href='http://example.org/index.php?showtopic=" > output.txt

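cat prints the contents of the file, which are piped to grep. grep checks the input line by line and writes every line that contains the pattern, in full, to the output file.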
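The pattern above matches any showtopic link, whatever follows the equals sign. As a minimal sketch of the "two greps" idea from the question, assuming a grep that supports -o and -E (GNU and BSD grep both do), the first grep can cut out just the link and its digits, and the second can keep only those with exactly six digits. The function name extract_topics is made up for this illustration.

```shell
# Hypothetical helper: read HTML on stdin, print each showtopic link
# whose number is exactly six digits long.
extract_topics() {
    # Step 1: -o prints only the matched text (link plus all its digits),
    #         one match per output line; -E enables + and {6}.
    grep -oE "href='http://example\.org/index\.php\?showtopic=[0-9]+" |
        # Step 2: keep only matches that end in exactly six digits.
        grep -E 'showtopic=[0-9]{6}$'
}

# Example: one anchor tag on stdin
printf '%s\n' "<a href='http://example.org/index.php?showtopic=509480'>t</a>" | extract_topics
# prints: href='http://example.org/index.php?showtopic=509480
```

Because the first grep prints each match on its own line, the anchored $ in the second grep is what rejects numbers with seven or more digits.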

Or you can use sed:

 sed -n "\#href='http://example.org/index.php?showtopic=#p"  input.txt >  output-sed.txt
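Both commands above read a page that has already been saved as input.txt. To apply the same filter to every URL in a list, as the loop in the question tries to do, something along these lines may work. This is only a sketch: fetch_page and crawl_topics are names invented here, and the list file is assumed to hold one URL per line.

```shell
# Hypothetical helper: download one page quietly, following redirects.
fetch_page() {
    curl -sL "$1"
}

# Hypothetical helper: read URLs (one per line) from the file named in $1
# and print every showtopic link found on each page.
crawl_topics() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue   # skip blank lines in the list
        # </dev/null stops the fetch from consuming the URL list on stdin.
        # [0-9]{6} matches six digits; it would also match the first six
        # digits of a longer number.
        fetch_page "$url" </dev/null |
            grep -oE "href='http://example\.org/index\.php\?showtopic=[0-9]{6}"
    done < "$1"
}
```

With those definitions, crawl_topics file.txt > output.txt would collect the links from every listed page into one file. Piping curl straight into grep avoids storing the page in a variable first.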
