I have three comma-separated reference fields:
last crawled,linking page,domain
"Nov 17, 2018","https://allestoringen.be/problemen/bwin/antwerpen","allestoringen.be"
"Aug 11, 2017","http://casino.linkplek.be/","linkplek.be"
"Nov 17, 2018","http://pronoroll.blogspot.com/p/blog-page_26.html","pronoroll.blogspot.com"
etc
I need to remove duplicate entries from the date field and, for each unique date, find the number of unique linking pages (column $2) and the number of unique domains (column $3). I tried:
awk '{A[$1 OFS $2]++} END {for(k in A) print k, A[k]}' FPAT='([^,]*)|("[^"]+")' file
awk '{A[$1 OFS $3]++} END {for(k in A) print k, A[k]}' FPAT='([^,]*)|("[^"]+")' file
However, I am a bit confused about how to get all three columns at once.
Best Answer 1
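# script.awk (requires GNU awk, because FPAT is a gawk extension)
# usage: gawk -f script.awk file
BEGIN {
    # a field is any run of characters other than a double quote, so in a
    # quoted row $1 is the date, $3 the linking page and $5 the domain;
    # $2 and $4 are the commas between the quoted values
    FPAT = "[^\"]+"
}
# NF > 1 skips the unquoted title row and rows that do not contain fields
NF > 1 {
    # count each (date, linking page) pair only once
    if ($3 != "" && !(($1, $3) in unique_linking_pages)) {
        unique_linking_pages[$1, $3]
        unique_linking_page_counts[$1]++
    }
    # count each (date, domain) pair only once
    if ($5 != "" && !(($1, $5) in unique_domains)) {
        unique_domains[$1, $5]
        unique_domain_counts[$1]++
    }
    # in case of empty fields in columns 2 and/or 3 of the input file,
    # this ensures that all the dates are recorded
    dates[$1]
}
END {
    # send the header and the rows through column(1) together so they line up;
    # "|" is the delimiter because the dates themselves contain commas.
    # A pipe is used instead of a bash-only here-string (<<<), since system()
    # runs /bin/sh, which is not always bash.
    cmd = "column -t -s '|'"
    print "Date|Unique Linking Page Count|Unique Domain Count" | cmd
    for (date in dates)
        print date "|" (unique_linking_page_counts[date] + 0) "|" (unique_domain_counts[date] + 0) | cmd
    close(cmd)
}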
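If GNU awk is not available, the same single-pass idea can be sketched in POSIX awk by using the double quote itself as the field separator instead of FPAT. This is only a minimal sketch under assumptions: `count_unique` is a hypothetical helper name, every data row is assumed to have the shape "date","linking page","domain", and no value is assumed to contain a double quote. With that separator the date is $2, the linking page is $4 and the domain is $6.

```shell
# Hypothetical helper (not part of the answer above): prints one
# "date|unique pages|unique domains" line per date found in the file "$1".
count_unique() {
    awk -F'"' '
        # Splitting on the double quote gives 7 fields for a quoted row;
        # the unquoted title row has NF == 1 and is skipped.
        NF >= 6 {
            dates[$2]
            # count each (date, linking page) pair only once
            if ($4 != "" && !(($2, $4) in seen_page)) {
                seen_page[$2, $4] = 1
                pages[$2]++
            }
            # count each (date, domain) pair only once
            if ($6 != "" && !(($2, $6) in seen_domain)) {
                seen_domain[$2, $6] = 1
                domains[$2]++
            }
        }
        END {
            # iteration order of "for (x in array)" is unspecified in awk
            for (date in dates)
                print date "|" (pages[date] + 0) "|" (domains[date] + 0)
        }
    ' "$1"
}
```

It could be run as `count_unique file | sort | column -t -s '|'`. For the three sample rows in the question, Nov 17, 2018 gets 2 unique linking pages and 2 unique domains, and Aug 11, 2017 gets 1 of each.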