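An aside: Converting text files to XML with XSLT 2.0

Here I would like to show, with working examples, that a stylesheet can read a text file through unparsed-text(). The goal, of course, is to turn plain text into XML.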

■ First, a very simple example
The text file is split at line feeds (U+000A), and each resulting line is wrapped in an element on output.

[Input file]
<?xml version="1.0" encoding="UTF-8" ?>
<document/>

[Input text file: input.txt]
The design of the Darwin Information Typing Architecture (DITA) is based on deriving multiple information types, or topic types, from a common, generic topic. This language reference describes the elements that comprise the topic DTD and its initial, information-typed descendents: concept, reference, task, and glossentry. It also describes the DITA map DTD and its current specialization (bookmap), as well as various topic and map based DITA domains.
This specification describes specific details of each element in the OASIS DITA language. The separate DITA Architectural Specification includes detailed information about DITA specialization, when to use each topic type, how topics and maps interact, details of complex behaviors such as conref and conditional processing, and many other best practices for working with DITA.
The elements that make up the DITA design represent a set of different authoring concerns, each of which is grouped into its own chapter. Major sections include:

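[Stylesheet]
<xsl:output encoding="UTF-8" indent="yes"/>
<xsl:template match="document">
  <!-- Read input.txt, resolved relative to the input XML document -->
  <xsl:variable name="inputText"
    select="unparsed-text(resolve-uri('input.txt', base-uri(/)), 'UTF-8')"/>
  <xsl:copy>
    <!-- Split on U+000A; each non-matching run is one line of text -->
    <xsl:analyze-string select="$inputText" regex="\n">
      <xsl:non-matching-substring>
        <!-- Wrap each line in an element (the name p is an assumption) -->
        <p>
          <xsl:value-of select="."/>
        </p>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:copy>
</xsl:template>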

[Result]
<?xml version="1.0" encoding="UTF-8"?>
<document>
  <p>The design of the Darwin Information Typing Architecture (DITA) is based on deriving multiple information types, or topic types, from a common, generic topic. This language reference describes the elements that comprise the topic DTD and its initial, information-typed descendents: concept, reference, task, and glossentry. It also describes the DITA map DTD and its current specialization (bookmap), as well as various topic and map based DITA domains.</p>
  <p>This specification describes specific details of each element in the OASIS DITA language. The separate DITA Architectural Specification includes detailed information about DITA specialization, when to use each topic type, how topics and maps interact, details of complex behaviors such as conref and conditional processing, and many other best practices for working with DITA.</p>
  <p>The elements that make up the DITA design represent a set of different authoring concerns, each of which is grouped into its own chapter. Major sections include:</p>
</document>

Each line comes out neatly wrapped in tags. <xsl:analyze-string select="$inputText" regex="\n"> splits the text at U+000A, and <xsl:non-matching-substring> takes each run of text other than U+000A and outputs it wrapped in an element. Simple, isn't it?
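
As a rough alternative sketch, the same line split can also be written with the XPath 2.0 tokenize() function instead of xsl:analyze-string. The wrapper element name p and the predicate that drops empty strings (for example, the one produced by a trailing line feed) are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output encoding="UTF-8" indent="yes"/>

  <xsl:template match="document">
    <!-- Read input.txt, resolved relative to the input XML document -->
    <xsl:variable name="inputText"
      select="unparsed-text(resolve-uri('input.txt', base-uri(/)), 'UTF-8')"/>
    <xsl:copy>
      <!-- tokenize() returns the substrings between U+000A separators;
           [. ne ''] skips empty strings such as one after a final line feed -->
      <xsl:for-each select="tokenize($inputText, '\n')[. ne '']">
        <!-- p is an assumed element name -->
        <p>
          <xsl:value-of select="."/>
        </p>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

tokenize() is convenient when only the text between separators matters; xsl:analyze-string is the better fit when the matched text itself also has to be processed.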

■ Processing a CSV file
Next, let's process a comma-separated CSV file.

[Input file]
<?xml version="1.0" encoding="UTF-8" ?>
<data/>

[Input text file: input.csv]
鉛筆,三菱,100
鉛筆,トンボ,120
鉛筆,コーリン,150

[Stylesheet]
<xsl:template match="data">
  <xsl:variable name="inputText"
    select="unparsed-text(resolve-uri('input.csv', base-uri(/)), 'UTF-8')"/>
  <xsl:copy>
    <xsl:analyze-string select="$inputText" regex="\n">
      <xsl:non-matching-substring>
        <xsl:analyze-string select="." regex="([^,]*),([^,]*),([^,]*)">
          <xsl:matching-substring>
            <product>
              <name>
                <xsl:value-of select="regex-group(1)"/>
              </name>
              <maker>
                <xsl:value-of select="regex-group(2)"/>
              </maker>
              <price>
                <xsl:value-of select="regex-group(3)"/>
              </price>
            </product>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:copy>
</xsl:template>

[Result]
<?xml version="1.0" encoding="UTF-8"?>
<data>
  <product>
    <name>鉛筆</name>
    <maker>三菱</maker>
    <price>100</price>
  </product>
  <product>
    <name>鉛筆</name>
    <maker>トンボ</maker>
    <price>120</price>
  </product>
  <product>
    <name>鉛筆</name>
    <maker>コーリン</maker>
    <price>150</price>
  </product>
</data>

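How about that? The text is first split at U+000A. Each resulting line is then matched against a regular expression whose parenthesized groups capture the fields between the commas, and regex-group(n) extracts the n-th captured group.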
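
For comparison, here is a minimal sketch of the same CSV conversion that uses tokenize() for both the lines and the fields, so no capture groups are needed. It assumes every record has exactly three fields in the order name, maker, price, and that no field is quoted or contains a comma. The variable name fields is introduced here only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output encoding="UTF-8" indent="yes"/>

  <xsl:template match="data">
    <xsl:variable name="inputText"
      select="unparsed-text(resolve-uri('input.csv', base-uri(/)), 'UTF-8')"/>
    <xsl:copy>
      <!-- One iteration per non-empty line of the CSV file -->
      <xsl:for-each select="tokenize($inputText, '\n')[. ne '']">
        <!-- Split the current line into its comma-separated fields -->
        <xsl:variable name="fields" select="tokenize(., ',')"/>
        <product>
          <name><xsl:value-of select="$fields[1]"/></name>
          <maker><xsl:value-of select="$fields[2]"/></maker>
          <price><xsl:value-of select="$fields[3]"/></price>
        </product>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Unlike the regular expression version, this sketch still emits a product element for a line with fewer than three fields (the missing ones come out empty), so the pattern-based approach is stricter about malformed records.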

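Regular expression syntax is confusing until you get used to it, but I never expected converting text to XML to be this easy. I think this feature greatly widens what XSLT 2.0 can do in text processing.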

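# For these examples, the editor was configured so that the input text files use U+000A alone as the line ending. On Windows the line ending is normally U+000D, U+000A. Unless that setting is changed, U+000D remains in the output.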
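
If changing the editor setting is not an option, one possible workaround is to let the regular expression absorb an optional U+000D in front of each U+000A. The following is only a sketch under that assumption: it reuses the input.txt example, and the wrapper element name p is assumed for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output encoding="UTF-8" indent="yes"/>

  <xsl:template match="document">
    <xsl:variable name="inputText"
      select="unparsed-text(resolve-uri('input.txt', base-uri(/)), 'UTF-8')"/>
    <xsl:copy>
      <!-- \r?\n matches both U+000A and the pair U+000D, U+000A,
           so the carriage return is consumed as part of the separator -->
      <xsl:analyze-string select="$inputText" regex="\r?\n">
        <xsl:non-matching-substring>
          <!-- p is an assumed element name -->
          <p>
            <xsl:value-of select="."/>
          </p>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The same regex change should apply to the outer xsl:analyze-string of the CSV stylesheet as well; without it, a trailing U+000D would end up inside the last field of each record.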