awk, sed, grep, perl... which one should I use to print the output in this case?

Answer 1

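If you know the format of the HTML, have control over it, and errors are not a concern, you can use regular expressions, but as already noted, that is not recommended.

I do use them directly myself, but mostly for one-off extraction of simple data.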
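As an illustration of that kind of one-off regex extraction, here is a minimal sketch using only core Perl. The sample markup, the helper name `extract_cells`, and the exact attribute layout (`class="x"` or `class="R"` as the only attribute, in double quotes) are assumptions for the example, not taken from the original page. It breaks as soon as the markup deviates from that shape, which is the reason regular expressions are discouraged here. Entities such as `&lt;` are also left undecoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample markup; the real page layout is an assumption.
my $html = <<'HTML';
<table>
<tr><th class="x">seconds</th><th class="R">reqs</th></tr>
<tr><td class="x">0</td><td class="R">10927</td></tr>
<tr><td class="x">0.01-0.02</td><td class="R">535390</td></tr>
</table>
HTML

# Return the text of every th/td cell whose class is exactly x or R.
# Assumes class is the only attribute and the cell holds plain text.
sub extract_cells {
    my ($markup) = @_;
    my @cells;
    while ($markup =~ m{<t([hd])\s+class="(?:x|R)"\s*>([^<]*)</t\1>}g) {
        push @cells, $2;
    }
    return @cells;
}

my @cells = extract_cells($html);

# Print the cells two per line: label first, then count.
while (my ($label, $count) = splice(@cells, 0, 2)) {
    printf "%10s %8s\n", $label, $count;
}
```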

For a parser-based approach, you could use Perl with, for example, HTML::TokeParser::Simple.

Very simplified:

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;
use HTML::Entities;
use utf8;

die "Usage: $0 [file | url]\n" unless defined $ARGV[0];

# Treat the argument as a URL if it starts with http:// or https://,
# otherwise as a local file.
my $tp;
if ($ARGV[0] =~ m{^https?://}) {
    $tp = HTML::TokeParser::Simple->new(url => $ARGV[0]);
} else {
    $tp = HTML::TokeParser::Simple->new(file => $ARGV[0]);
}

if (!$tp) {
    die "No HTML file found.\n";
}

# Array to store data.
my @val;
# Index
my $i = 0;

# A bit mixed code with some redundancy.
# Could be done much simpler, or much safer.
# E.g. check for thead, tbody etc. and call a sub to parse those.
# You could of course also print directly (not save to array),
# but you might want to use the data for something?
#
# "// ''" (Perl 5.10+) turns a missing class attribute into an empty
# string, so the comparison does not warn about an undefined value.
while (my $token = $tp->get_token) {
    if ($token->is_start_tag('th') && ($token->get_attr('class') // '') eq 'x') {
        $val[$i++] = $tp->get_token->as_is;
    } elsif ($token->is_start_tag('th') && ($token->get_attr('class') // '') eq 'R') {
        $val[$i++] = $tp->get_token->as_is;
    } elsif ($token->is_start_tag('td') && (
            (($token->get_attr('class') // '') eq 'x') ||
            (($token->get_attr('class') // '') eq 'R'))) {
        $val[$i++] = decode_entities($tp->get_token->as_is);
    }
}

# Column widths: label column, count column.
my @width_col = (10, 8);

# Expect a header pair plus at least one data pair.
if ($i > 2 && !($i % 2)) {
    $i = 0;
    # Header row: both columns as strings.
    printf("%*s %*s\n",
        $width_col[0], "$val[$i++]",
        $width_col[1], "$val[$i++]"
    );
    # Data rows: label as string, count as integer.
    while ($i < $#val) {
        printf("%*s %*d\n",
            $width_col[0], "$val[$i++]",
            $width_col[1], "$val[$i++]"
        );
    }
} else {
    die "ERR. Unable to extract data.\n";
}

Example output:

$ ./extract htmlsample 
   seconds     reqs
         0    10927
   <= 0.01  1026471
 0.01-0.02   535390
 0.02-0.05    93298
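The comments in the script suggest handing the per-row work to a sub. A rough sketch of that idea follows, parsing from a string so that it can be tried without a file. The sub names `parse_row` and `parse_table` and the sample markup are hypothetical, and the sketch assumes each matching cell holds plain text directly after its start tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser::Simple;
use HTML::Entities;

# Read tokens up to the closing </tr> and return the x/R cell texts.
sub parse_row {
    my ($tp) = @_;
    my @cells;
    while (my $token = $tp->get_token) {
        last if $token->is_end_tag('tr');
        next unless $token->is_start_tag('th') || $token->is_start_tag('td');
        my $class = $token->get_attr('class') // '';
        next unless $class eq 'x' || $class eq 'R';
        # The token right after the start tag is taken as the cell text.
        my $next = $tp->get_token or last;
        push @cells, decode_entities($next->as_is) if $next->is_text;
    }
    return \@cells;
}

# Return one array reference per table row that has matching cells.
sub parse_table {
    my ($html) = @_;
    my $tp = HTML::TokeParser::Simple->new(string => $html)
        or die "Unable to parse HTML.\n";
    my @rows;
    while (my $token = $tp->get_token) {
        next unless $token->is_start_tag('tr');
        my $row = parse_row($tp);
        push @rows, $row if @$row;
    }
    return @rows;
}

# Hypothetical sample input; the real page may be laid out differently.
my $sample = <<'HTML';
<table>
<thead><tr><th class="x">seconds</th><th class="R">reqs</th></tr></thead>
<tbody>
<tr><td class="x">0</td><td class="R">10927</td></tr>
<tr><td class="x">&lt;= 0.01</td><td class="R">1026471</td></tr>
</tbody>
</table>
HTML

for my $row (parse_table($sample)) {
    next unless @$row == 2;    # skip rows that are not label/count pairs
    printf "%10s %8s\n", @$row;
}
```

Keeping the rows as separate array references, rather than one flat array, means a row with a missing cell can be skipped without shifting every later pair.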
