Escaping nested double quotes in a corrupted CSV file


I have a corrupted "CSV" file that contains nested double quotes, for example:

123,"I wonder how to escape "these" quotes with backslashes.",123,456
456,"I wonder how to escape "these" quotes with backslashes.",456,789

Does anyone know how to fix this?

Here is a real example that needs fixing:

n9sih438,4994fa72322,PMC,Rapid Identification of Malaria Vaccine Candidates Based on alpha-Helical Coiled Coil Protein Motif,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"To identify malaria antigens for vaccine development, we selected alpha-helical coiled coil domains of proteins predicted to be present in the parasite erythrocytic stage. The corresponding synthetic peptides are expected to mimic structurally "native" epitopes. Indeed the 95 chemically synthesized peptides were all specifically recognized by human immune sera, though at various prevalence. Peptide specific antibodies were obtained both by affinity-purification from malaria immune sera and by immunization of mice. These antibodies did not show significant cross reactions, i.e., they were specific for the original peptide, reacted with native parasite proteins in infected erythrocytes and several were active in inhibiting in vitro parasite growth. Circular dichroism studies indicated that the selected peptides assumed partial or high alpha-helical content. Thus, we demonstrate that the bioinformatics/chemical synthesis approach described here can lead to the rapid identification of molecules which target biologically active antibodies, thus identifying suitable vaccine candidates. This strategy can be, in principle, extended to vaccine discovery in a wide range of other pathogens.",2007-07-25

The title field (the 4th field) and the abstract field (the 9th field) may contain nested double quotes.

Best answer

I created a sample input file with 10 fields per line, in which fields 4 and 9 may or may not be quoted:

$ cat file
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25

Then I wrote this script (using GNU awk for the 3rd argument to match()) to identify the type of each input line and modify the quoted fields accordingly:

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    # The 4th and 9th fields may or may not be quoted so we are looking
    # for one of these patterns of fields:
    #
    #    1,2,3,4,5,6,7,8,9,10           - type A
    #    1,2,3,"4",5,6,7,8,9,10         - type B
    #    1,2,3,4,5,6,7,8,"9",10         - type C
    #    1,2,3,"4",5,6,7,8,"9",10       - type D
    #
    # If we can determine which type of record we have then we can
    # identify the fields.

    delete f
    if ( match($0,/^(([^",]+,){9}[^",]+)$/,a) ) {
        type = "A"
        split(a[0],f)
    }
    else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){5}[^",]+)$/,a) ) {
        type = "B"
        split(a[1],f)
        f[4] = a[3]
        split(a[4],tmp)
        for (i in tmp) {
            f[4+i] = tmp[i]
        }
    }
    else if ( match($0,/^(([^",]+,){8})(".*"),([^",]+)$/,a) ) {
        type = "C"
        split(a[1],f)
        f[9] = a[3]
        f[10] = a[4]
    }
    else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){4})(".*"),([^",]+)$/,a) ) {
        type = "D"
        split(a[1],f)
        f[4] = a[3]
        split(a[4],tmp)
        for (i in tmp) {
            f[4+i] = tmp[i]
        }
        f[9] = a[6]
        f[10] = a[7]
    }
    else {
        type = "Unknown"
        split($0,f)
        printf "Warning, could not classify file \"%s\", line %d: %s\n", FILENAME, FNR, $0 | "cat>&2"
    }

    # Uncomment the following lines to see what the above is doing:
    #print ORS "################" ORS "Type " type ":\t" $0
    #for (i=1; i in f; i++) {
        #print i, "<" f[i] ">"
    #}

    gsub(/^"|"$/,"",f[4])
    gsub(/"/,"\"\"",f[4])
    f[4] = "\"" f[4] "\""

    gsub(/^"|"$/,"",f[9])
    gsub(/"/,"\"\"",f[9])
    f[9] = "\"" f[9] "\""

    $0 = ""
    for (i in f) {
        $i = f[i]
    }
    print
}

$ awk -f tst.awk file
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25
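
One way to sanity-check output like the above is to run it through a small validator for the doubled-quote convention. The following is a minimal sketch in portable awk, not part of the original answer. The function name check_csv_line is hypothetical, and the sketch assumes that no field contains an embedded newline. For each input line it prints the number of fields, or "invalid" if a quote appears where that convention does not allow one.

```shell
# Hypothetical helper (not from the original answer): for each line on stdin,
# print the field count if the line is well-formed CSV with doubled quotes,
# otherwise print "invalid". Assumes fields never contain newlines.
check_csv_line() {
    awk '{
        n = 1; state = "start"; ok = 1
        len = length($0)
        for (i = 1; i <= len && ok; i++) {
            c = substr($0, i, 1)
            if (state == "start") {            # at the first character of a field
                if (c == "\"")     state = "quoted"
                else if (c == ",") n++         # empty field
                else               state = "bare"
            }
            else if (state == "bare") {        # inside an unquoted field
                if (c == ",")       { n++; state = "start" }
                else if (c == "\"") ok = 0     # stray quote in an unquoted field
            }
            else if (state == "quoted") {      # inside a quoted field
                if (c == "\"") state = "quote_seen"
            }
            else {                             # previous character was a quote
                if (c == "\"")     state = "quoted"   # doubled quote, literal
                else if (c == ",") { n++; state = "start" }
                else               ok = 0      # unescaped quote inside a field
            }
        }
        if (state == "quoted") ok = 0          # quoted field was never closed
        print (ok ? n : "invalid")
    }'
}
```

Under these assumptions, `awk -f tst.awk file | check_csv_line` should print 10 for every line of the sample output, while the 2nd to 4th lines of the original sample file should be reported as invalid.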

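The output always quotes the two fields that may be quoted in the input. If you don't want that, it is an easy tweak, left as an exercise. I also used the more traditional way of "escaping" double quotes in CSV, which is to double them. If you need \" instead of "", that is also a minor change. See https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for more information on using awk with CSV and on the CSV "standard".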
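
The switch to backslash escaping mentioned above might look like the following. This is only a sketch applied to a single field value rather than to the whole script, and the function name requote_backslash is hypothetical. The replacement string "\\\"" reaches gsub() as a backslash followed by a double quote, which awk emits literally.

```shell
# Hypothetical helper (not from the original answer): re-quote one field value,
# escaping inner double quotes with backslashes instead of doubling them.
requote_backslash() {
    printf '%s\n' "$1" | awk '{
        gsub(/^"|"$/, "")       # strip the outer quotes, if present
        gsub(/"/, "\\\"")       # turn each remaining double quote into \"
        print "\"" $0 "\""      # wrap the result in quotes again
    }'
}

# Prints: "mimic structurally \"native\" epitopes"
requote_backslash '"mimic structurally "native" epitopes"'
```

In tst.awk the analogous change would be to use "\\\"" in place of "\"\"" as the replacement in the two gsub(/"/,...) calls for f[4] and f[9].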
