Escaping nested double quotes in a broken CSV file

Question

I created a sample input file with 10 fields per line, where the 4th and 9th fields may or may not be quoted:

$ cat file
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25

Then I wrote this script (using GNU awk for the 3rd argument to match()) to identify the type of each input line and modify the quoted fields accordingly:

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    # The 4th and 9th fields may or may not be quoted so we are looking
    # for one of these patterns of fields:
    #
    #    1,2,3,4,5,6,7,8,9,10           - type A
    #    1,2,3,"4",5,6,7,8,9,10         - type B
    #    1,2,3,4,5,6,7,8,"9",10         - type C
    #    1,2,3,"4",5,6,7,8,"9",10       - type D
    #
    # If we can determine which type of record we have then we can
    # identify the fields.

    delete f
    if ( match($0,/^(([^",]+,){9}[^",]+)$/,a) ) {
        type = "A"
        split(a[0],f)
    }
    else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){5}[^",]+)$/,a) ) {
        type = "B"
        split(a[1],f)
        f[4] = a[3]
        split(a[4],tmp)
        for (i in tmp) {
            f[4+i] = tmp[i]
        }
    }
    else if ( match($0,/^(([^",]+,){8})(".*"),([^",]+)$/,a) ) {
        type = "C"
        split(a[1],f)
        f[9] = a[3]
        f[10] = a[4]
    }
    else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){4})(".*"),([^",]+)$/,a) ) {
        type = "D"
        split(a[1],f)
        f[4] = a[3]
        split(a[4],tmp)
        for (i in tmp) {
            f[4+i] = tmp[i]
        }
        f[9] = a[6]
        f[10] = a[7]
    }
    else {
        type = "Unknown"
        split($0,f)
        printf "Warning, could not classify file \"%s\", line %d: %s\n", FILENAME, FNR, $0 | "cat>&2"
    }

    # Uncomment the following lines to see what the above is doing:
    #print ORS "################" ORS "Type " type ":\t" $0
    #for (i=1; i in f; i++) {
        #print i, "<" f[i] ">"
    #}

    gsub(/^"|"$/,"",f[4])
    gsub(/"/,"\"\"",f[4])
    f[4] = "\"" f[4] "\""

    gsub(/^"|"$/,"",f[9])
    gsub(/"/,"\"\"",f[9])
    f[9] = "\"" f[9] "\""

    $0 = ""
    for (i in f) {
        $i = f[i]
    }
    print
}


$ awk -f tst.awk file
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25
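One way to sanity-check the result is to count fields with a quote-aware scan: a comma only separates fields when it is outside double quotes, and a doubled quote ("") toggles the in-quotes state twice, so it leaves the state unchanged. The following is a minimal sketch under those assumptions; count_csv_fields is a hypothetical helper name and is not part of tst.awk. It only needs POSIX awk.

```shell
#!/bin/sh
# count_csv_fields (hypothetical helper): print the number of CSV fields
# found on each input line, treating commas inside double quotes as data.
count_csv_fields() {
    awk '
    {
        n = 1          # a non-empty record has at least one field
        inq = 0        # 1 while the scan is inside a quoted field
        len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (c == "\"")
                inq = !inq     # "" toggles twice, so the state is unchanged
            else if (c == "," && !inq)
                n++            # only unquoted commas separate fields
        }
        print n
    }'
}

# The first line has broken nested quotes, the second has them doubled.
printf '%s\n' 'a,"b,",c,",d",e' 'a,"b,"",c,"",d",e' | count_csv_fields
# prints 5, then 3
```

Piping the repaired output through it, as in `awk -f tst.awk file | count_csv_fields`, should print 10 for each of the four lines, while the broken input lines report more than 10 fields.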

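The output always quotes the two fields that may be quoted in the input. If you don't like that, it's easy to tweak and is left as an exercise. I also used the more traditional way of "escaping" double quotes in CSV, which is to double them (""). If you need \" instead, that is also a minor change. See https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for more information on using awk with CSV and on the CSV "standards".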
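As an illustration of the backslash variant, here is a minimal sketch of the re-quoting step applied to a single field per line. The names requote_backslash and escape_quotes are hypothetical and do not appear in tst.awk. The sketch joins the pieces around each embedded quote with split() rather than gsub(), which sidesteps the special handling of backslashes in gsub() replacement text.

```shell
#!/bin/sh
# requote_backslash (hypothetical helper): read one field per line, strip any
# outer double quotes, escape embedded double quotes as \" and re-quote.
requote_backslash() {
    awk '
    # Rebuild s with esc in place of every double quote.
    function escape_quotes(s, esc,    n, parts, i, out) {
        n = split(s, parts, "\"")
        out = parts[1]
        for (i = 2; i <= n; i++)
            out = out esc parts[i]
        return out
    }
    {
        field = $0
        gsub(/^"|"$/, "", field)       # drop the outer quotes, if present
        print "\"" escape_quotes(field, "\\\"") "\""
    }'
}

printf '%s\n' 'here is an unquoted string' \
              '"here is a,",string,","within,", quotes."' | requote_backslash
# prints:
# "here is an unquoted string"
# "here is a,\",string,\",\"within,\", quotes."
```

In tst.awk itself, the equivalent change would be to replace the two gsub(/"/,"\"\"",...) calls with a call to a function like escape_quotes. Many CSV readers expect doubled quotes by default, so the backslash form only makes sense when the consuming tool asks for it.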
