Extracting the child elements of a specific XML element type

Question

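If you really need sed- or awk-style command-line processing of XML files, you should consider using a command-line tool built for XML processing, such as xmlstarlet (used below).

You should also be aware that there are several XML-related programming and query languages, XPath among them.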

To be well-formed XML, your data needs a single root node, and attribute values must be quoted. In other words, your data file should look like this:

<!-- data.xml -->

<instances>

    <instance ab='1'>
        <a1>aa</a1>
        <a2>aa</a2>
    </instance>

    <instance ab='2'>
        <b1>bb</b1>
        <b2>bb</b2>
    </instance>

    <instance ab='3'>
        <c1>cc</c1>
        <c2>cc</c2>
    </instance>

</instances>
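
A quick way to check whether a file meets these two requirements is to hand it to a strict parser and see whether it is rejected. The following is a minimal sketch using Python's standard library; the helper name `is_well_formed` and the three sample strings are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical inputs, for illustration only.
GOOD = "<instances><instance ab='1'><a1>aa</a1></instance></instances>"
# Two top-level elements, so there is no single root node.
NO_ROOT = "<instance ab='1'><a1>aa</a1></instance><instance ab='2'><b1>bb</b1></instance>"
# The attribute value is not quoted.
UNQUOTED = "<instances><instance ab=1><a1>aa</a1></instance></instances>"

for name, text in [("GOOD", GOOD), ("NO_ROOT", NO_ROOT), ("UNQUOTED", UNQUOTED)]:
    print(name, is_well_formed(text))
```

Only the first string is accepted; the other two are rejected for the missing root node and the unquoted attribute value respectively.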

If your data is well-formed XML, you can use XPath with xmlstarlet to get what you want with a very concise command:

xmlstarlet sel -t -m '//instance' -c "./*" -n data.xml

This produces the following output:

<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>

Alternatively, you can use Python (my personal preference). Here is a Python script that does the same thing:

#!/usr/bin/env python3
"""extract_instance_children.py"""

import sys
import xml.etree.ElementTree as ET

# Load the data
tree = ET.parse(sys.argv[1])
root = tree.getroot()

# Extract and output the child elements
for instance in root.iter("instance"):
    # encoding="unicode" makes tostring() return str rather than bytes
    print("".join(ET.tostring(child, encoding="unicode").strip() for child in instance))

Here is how to run the script:

python3 extract_instance_children.py data.xml

This uses the xml package from the Python standard library, which is also a strict XML parser.

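If you don't care whether your XML is well-formed and just want to parse a text file that looks roughly like the one you presented, you can get what you want with a shell script and standard command-line tools. Here is an awk script (as requested):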

#!/usr/bin/awk -f

# extract_instance_children.awk

BEGIN {
    addchild=0;
    children="";
}

{
    # Opening tag for "instance" element (but not the "instances" root) - set the "addchild" flag
    if ($0 ~ /^[ \t]*<instance([ \t][^<>]*)?>/) {
        addchild=1;
    }

    # Closing tag for "instance" element - reset "children" string and "addchild" flag, print children
    else if($0 ~ "^ *</instance>" && addchild == 1) {
        addchild=0;
        printf("%s\n", children);
        children="";
    }

    # Concatenating child elements - strip whitespace
    else if (addchild == 1) {
        gsub(/^[ \t]+/,"",$0);
        gsub(/[ \t]+$/,"",$0);
        children=children $0;
    }
}

To run the script from a file, you can use the following command:

awk -f extract_instance_children.awk data.xml

Here is a Bash script that produces the desired output:

#!/bin/bash

# extract_instance_children.bash

# Keep track of whether or not we're inside of an "instance" element
instance=0

# Loop through the lines of the file (read trims the indentation; -r keeps backslashes intact)
while read -r line; do

    # Set the instance flag to true if we come across an opening tag (but not the "instances" root)
    if echo "${line}" | grep -q '<instance[ >]'; then
        instance=1

    # Set the instance flag to false and print a newline if we come across a closing tag
    elif echo "${line}" | grep -q '</instance>'; then
        instance=0
        echo

    # If we're inside an instance tag then print the child element
    elif [[ ${instance} == 1 ]]; then
        printf '%s' "${line}"
    fi

done < "${1}"

You can run it like this:

bash extract_instance_children.bash data.xml

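Alternatively, going back to Python once more, you can use the Beautiful Soup package. Beautiful Soup is far more flexible when parsing invalid XML than the standard Python XML module (and every other XML parser I have come across). Here is a Python script that uses Beautiful Soup to get the desired result: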

#!/usr/bin/env python3
"""Extract the direct children of each <instance> element with Beautiful Soup."""

import sys
from bs4 import BeautifulSoup as Soup

with open(sys.argv[1], 'r') as xmlfile:
    soup = Soup(xmlfile.read(), "html.parser")
    for instance in soup.find_all('instance'):
        # recursive=False keeps only direct children, so nested elements are not repeated
        print(''.join(str(child) for child in instance.find_all(True, recursive=False)))
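
To illustrate the flexibility claim, here is a small sketch that feeds Beautiful Soup input with no root element and unquoted attribute values, which a strict XML parser would reject. The sample string and the helper name `instance_children` are hypothetical, and the sketch assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed input: no single root element, unquoted attribute values.
MALFORMED = """
<instance ab=1>
    <a1>aa</a1>
    <a2>aa</a2>
</instance>
<instance ab=2>
    <b1>bb</b1>
    <b2>bb</b2>
</instance>
"""


def instance_children(text):
    """Return one string per <instance>, built from its direct child elements."""
    soup = BeautifulSoup(text, "html.parser")
    return [
        "".join(str(child) for child in instance.find_all(True, recursive=False))
        for instance in soup.find_all("instance")
    ]


lines = instance_children(MALFORMED)
for line in lines:
    print(line)
```

Note that "html.parser" is an HTML parser, so it lowercases tag names; that is harmless for this data but would matter for mixed-case element names.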

Answer 1