loading xml files with xml db

XML DB is Oracle's native XML-processing application. Introduced in 9i, it provides a wide range of functionality for generating, loading, storing and manipulating XML inside the database. Oracle's XML capabilities have changed at an alarming rate from 8i to 10g, but with XML DB, the methodology appears to be stabilising. The technology itself, however, is continually being improved and extended.

This article is an introduction to XML DB in relation to one very specific area; converting in-bound XML files into relational datasets. In my limited experience with XML, this is the only requirement I've had for XML processing (all of it batch-based). Before XML DB, this type of processing was achievable through a variety of methods (and still is), but they were cumbersome, slow and continually changing. XML DB makes batch-loading much simpler and more efficient.

oracle versions

The examples in this article have been tested on 10.2.0.1 (XE), 10.1.0.5 and 9.2.0.8 instances on Windows XP. To enable the examples to run on 9i, a small concession has been made in the sample XML schema document (see the xml db objects section of this article for details). To run the initial administrative tasks, users must have the XDBADMIN role granted in addition to the "usual" CREATE privileges (TABLE, TRIGGER etc). To be able to create the directories used in this article, users will also require CREATE ANY DIRECTORY. Users can check to ensure that XML DB is installed by simply checking for the XDB user. This account is usually locked (and should ideally remain so).

sample files

For the examples, I have created a very simple XML file based on DBA_OBJECTS and created a corresponding schema document (XSD). See the end of this article for links to the source files. The files are named as follows (we will need to reference these names in our example code).

db_objects.xsd (schema document)
db_objects.xml (data file)

getting started: creating folders

XML DB is a database application that manages its metadata in a repository inside the host database. Areas within the repository are addressed via a "folder" structure, which are named locations for the files we eventually wish to load into XML DB and process.

To begin the demonstration of XML DB, therefore, we must first create the folder structure in the repository. For this we use DBMS_XDB as follows. For completeness only, the example includes an exception-handler to cater for the event that a folder already exists. Note that there are several security options we can apply to these folders using the DBMS_XDB APIs, but these have been overlooked for the simple examples herein.

SQL> DECLARE
  2     TYPE folder_ntt IS TABLE OF VARCHAR2(1024);
  3     nt_folders folder_ntt := folder_ntt(
  4                                 '/public/',
  5                                 '/public/demo/',
  6                                 '/public/demo/xsd/',
  7                                 '/public/demo/xml/'
  8                                 );
  9     v_return BOOLEAN;
 10     folder_exists EXCEPTION;
 11     PRAGMA EXCEPTION_INIT(folder_exists, -31003);
 12  BEGIN
 13     FOR i IN nt_folders.FIRST .. nt_folders.LAST LOOP
 14        BEGIN
 15           v_return := DBMS_XDB.CREATEFOLDER(nt_folders(i));
 16        EXCEPTION
 17           WHEN folder_exists THEN
 18              NULL;
 19        END;
 20     END LOOP;
 21     COMMIT;
 22  END;
 23  /

PL/SQL procedure successfully completed.

We now have a repository structure. We can browse the contents of the folders using either of the following methods.

http://[hostname_or_ip]:8080/; or
WebDAV in Windows Explorer (for XP Professional users).

WebDAV can be setup from "Start->Settings->Network Connections->My Network Connections->Add a network place". Follow the wizard to map a Windows Explorer link to the XML DB repository. The address will be the same as the http:// option provided above. The username and password prompt will be for a database account that has the XDBADMIN role.

registering an xml schema with xml db

As stated earlier, this article is focussed on loading an XML document into XML DB and accessing it via SQL. To do this, we are going to register an XML schema document (XSD) with XML DB. This XSD describes the content and structure of our XML files and enables XML DB to "understand" the incoming data.

To register our XSD with XML DB, we must first "upload" the db_objects.xsd source document to the repository. We can do this using any of the following methods:

ftp://[hostname_or_ip]:2100/
WebDAV HTTP protocol (via Windows Explorer using copy-paste or cut-paste)
DBMS_XDB.CREATERESOURCE API

We will use the DBMS_XDB.CREATERESOURCE API. We must load the XSD from an Oracle directory (which represents the staging location on the server). The directory is created as follows.

SQL> CREATE OR REPLACE DIRECTORY xsd_dir AS 'o:\xmldb\xsd';

Directory created.

Once we've ensured that that the source XSD is in the relevant directory on the server, we can upload and register it. Uploading is done via the DBMS_XDB.CREATERESOURCE API and schema-registration is performed by DBMS_XMLSCHEMA.REGISTERSCHEMA. The following example demonstrates both calls. Readers might prefer to separate the operations, but either way, a schema cannot be registered until the source file is uploaded into XML DB.

SQL> BEGIN
  2    IF DBMS_XDB.CREATERESOURCE(
  3       abspath => '/public/demo/xsd/db_objects.xsd',
  4       data    => BFILENAME('XSD_DIR','db_objects.xsd')
  5       )
  6    THEN
  7       DBMS_XMLSCHEMA.REGISTERSCHEMA(
  8          schemaurl => 'http://localhost:8080/public/demo/xsd/db_objects.xsd',
  9          schemadoc => sys.UriFactory.getUri('/public/demo/xsd/db_objects.xsd')
 10          );
 11       COMMIT;
 12    END IF;
 13  END;
 14  /

PL/SQL procedure successfully completed.

When we upload a source document into XML DB (we loaded our XSD into the /public/demo/xsd folder), it is known as a "resource". In the above example, the resource was our XSD, from which XML DB was able to generate a schema.

We can see our schema details in the XXX_XML_SCHEMAS view family. The SCHEMA column is an XMLType column that contains the XSD. The following SQL just displays a few bytes.

SQL> desc user_xml_schemas

 Name                          Null?    Type
 ----------------------------- -------- --------------------
 SCHEMA_URL                             VARCHAR2(700)
 LOCAL                                  VARCHAR2(3)
 SCHEMA                                 SYS.XMLTYPE(XMLSchem
                                        a "http://xmlns.orac
                                        le.com/xdb/XDBSchema
                                        .xsd" Element "schem
                                        a")
 INT_OBJNAME                            VARCHAR2(4000)
 QUAL_SCHEMA_URL                        VARCHAR2(767)
 HIER_TYPE                              VARCHAR2(11)

SQL> SELECT schema_url, schema
  2  FROM   user_xml_schemas;

SCHEMA_URL                                              SCHEMA
------------------------------------------------------- --------------------------------
http://localhost:8080/public/demo/xsd/db_objects.xsd    <?xml version="1.0"?>
                                                        <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
                                                        xml

xml db objects

When we register a schema, XML DB creates a range of supporting object-relational database structures based on the elements, sequences and complex types described by the XSD. We are able to influence the naming of these system-generated objects. For examples, search the supplied db_objects.xsd file for "xdb:SQLType", "xdb:defaultTable" and "xdb:SQLCollType". If such tags are omitted, Oracle will use system-generated names for the objects.

Now that our schema is registered, we can see what objects were created for us (note that these examples were run on an empty schema, hence the unconstrained select from USER_OBJECTS).

SQL> SELECT object_name, object_type
  2  FROM   user_objects;

OBJECT_NAME                    OBJECT_TYPE
------------------------------ ------------------
DB_OBJECTS_ROW_OT              TYPE
DB_OBJECTS_ROW_NTT             TYPE
DB_OBJECTS_OT                  TYPE
DB_OBJECTS                     TABLE
SYSNTjakx4WHmQ4u5f36ct7Nawg==  TABLE
SYS_C004041                    INDEX
EXTRADATA401_L                 LOB
NAMESPACES402_L                LOB
SYS_C004042                    INDEX
DB_OBJECTS$xd                  TRIGGER

10 rows selected.

Our "primary" table is DB_OBJECTS (we know this because we tagged the XSD accordingly). We can now describe the structure of this table as follows. The USER_TAB_COLS view shows us the hidden columns, which for DB_OBJECTS is actually everything after the first column.

SQL> SELECT column_name
  2  ,      data_type
  3  FROM   user_tab_cols
  4  WHERE  table_name = 'DB_OBJECTS';

COLUMN_NAME                    DATA_TYPE
------------------------------ -------------------------
SYS_NC_OID$                    RAW
SYS_NC_ROWINFO$                XMLTYPE
XMLEXTRA                       XMLTYPEEXTRA
SYS_NC00004$                   XMLTYPEPI
SYS_NC00005$                   XMLTYPEPI
XMLDATA                        DB_OBJECTS_OT
SYS_NC00007$                   RAW
SYS_NC00008$                   DB_OBJECTS_ROW_NTT
SYS_NC0000800009$              RAW
ACLOID                         RAW
OWNERID                        RAW

11 rows selected.

We can see a range of unfamiliar data types in the DB_OBJECTS table. Even the column that is exposed to us (SYS_NC_OID$) is a RAW. The clue to where our data resides, however, seems to be in the column XMLDATA of data type DB_OBJECTS_OT. This type was another result of our XSD annotation, else this would have a system-generated name. It is the type that we used to describe one record or our XML data.

We can examine the structure of this object type in addition to the other underlying data types that describe our data.

SQL> break on type_name skip 1
 
SQL> SELECT type_name, attr_name, attr_type_name
  2  FROM   user_type_attrs
  3  WHERE  type_name LIKE 'DB_OBJECTS%'
  4  ORDER  BY
  5         type_name, attr_no;

TYPE_NAME                      ATTR_NAME                      ATTR_TYPE_NAME
------------------------------ ------------------------------ ----------------------
DB_OBJECTS_OT                  THISROW                        DB_OBJECTS_ROW_NTT

DB_OBJECTS_ROW_OT              OWNER                          VARCHAR2
                               OBJECT_NAME                    VARCHAR2
                               OBJECT_TYPE                    VARCHAR2
                               OBJECT_ID                      NUMBER
                               CREATED                        TIMESTAMP

6 rows selected.

SQL> SELECT type_name, elem_type_name
  2  FROM   user_coll_types
  3  WHERE  type_name LIKE 'DB_OBJECTS%';

TYPE_NAME                ELEM_TYPE_NAME
------------------------ ------------------------
DB_OBJECTS_ROW_NTT       DB_OBJECTS_ROW_OT

1 row selected.

From our XSD tags and the above metadata, we know that we have an object type (DB_OBJECTS_ROW_OT) that describes a single row of our data (this has five attributes). We then have a sequence of these rows, which we annotated as a collection type named DB_OBJECTS_ROW_NTT. Finally, at the "surface", we have an object type named DB_OBJECTS_OT that contains all the attributes needed to present a single sequence of XML data (which actually contains just one attribute of DB_OBJECTS_ROW_NTT).

The introduction to this article stated that we had to make a concession for the examples to run on 9i. If we look at the attribute list for the DB_OBJECTS_ROW_OT above, we can see the result of this concession highlighted. The underlying source data for the sample files (DBA_OBJECTS) defines the CREATED column as a DATE. The XML schema defines the corresponding complex type attribute as an xs:dateTime. In Oracle 10g we can annotate the CREATED attribute in the XSD with xdb:SQLType="DATE". This will enable XML DB to correctly define the CREATED attribute as a DATE in the object type it creates. This annotation doesn't work in 9i (tested on 9.2.0.8), so XML DB creates this attribute as a TIMESTAMP in the database. If necessary, this can easily be converted to a DATE in SQL using the CAST function.

loading xml files into xml db

We are now ready to load an XML file into the database, via XML DB. We'll load the file into the XML DB repository from a new Oracle directory (remember we created one for our XSD files earlier). The following directory identifies the staging area where the db_objects.xml sample file is located.

SQL> CREATE OR REPLACE DIRECTORY xml_dir AS 'o:\xmldb\xml';

Directory created.

We are going to load the db_objects.xml file using the same DBMS_XDB.CREATERESOURCE API that we used to upload the XSD earlier. Again, the same upload alternatives (ftp or WebDAV) apply. Creating a resource is a transaction that must be committed. Note that resources can optionally be locked during loading.

SQL> DECLARE
  2     v_return BOOLEAN;
  3  BEGIN
  4     v_return := DBMS_XDB.CREATERESOURCE(
  5                    abspath => '/public/demo/xml/db_objects.xml',
  6                    data    => BFILENAME('XML_DIR', 'db_objects.xml')
  7                    );
  8     COMMIT;
  9  END;
 10  /

PL/SQL procedure successfully completed.

When we successfully load an XML resource into XML DB, some magic happens. Oracle reads through our XML data file and "shreds" the data into the object-relational structures we created on schema registration. Hence in creating a resource as the example above, we are actually loading data into the database.

querying xml db

Having loaded the XML file, we can begin to query it, either in its XML format, or by "flattening" the data into a relational structure. Remember that the focus of this article is to process XML data to a relational format, so this is how we will proceed.

There are different methods available to us depending on how we access our DB_OBJECTS table. First we can use the XMLSEQUENCE and EXTRACT functions to transform the single XML document of repeating records (stored in DB_OBJECTS as an XMLType column) into a sequence of XMLType records (i.e. one XMLType per record). We then address the attributes via XPath expressions (in the XSD, we named the repeating records "THISROW") and the EXTRACTVALUE function. These functions and expressions are used as follows.

SQL> SELECT EXTRACTVALUE(xseq.column_value, '/THISROW/OWNER')       AS owner
  2  ,      EXTRACTVALUE(xseq.column_value, '/THISROW/OBJECT_NAME') AS object_name
  3  ,      EXTRACTVALUE(xseq.column_value, '/THISROW/OBJECT_TYPE') AS object_type
  4  ,      EXTRACTVALUE(xseq.column_value, '/THISROW/OBJECT_ID')   AS object_id
  5  ,      EXTRACTVALUE(xseq.column_value, '/THISROW/CREATED')     AS created
  6  FROM   db_objects do
  7  ,      TABLE(XMLSEQUENCE(EXTRACT(VALUE(do), '/ROWSET/THISROW'))) xseq
  8  WHERE  ROWNUM <= 10;

OWNER      OBJECT_NAME        OBJECT_TYPE        OBJECT_ID CREATED
---------- ------------------ ------------------ --------- ------------------------------
SYS        SUMKEY$            TABLE                    400 14-SEP-05 00.00.00.000000
SYS        I_SUMKEY$_1        INDEX                    401 14-SEP-05 00.00.00.000000
SYS        SUMAGG$            TABLE                    402 14-SEP-05 00.00.00.000000
SYS        I_SUMAGG$_1        INDEX                    403 14-SEP-05 00.00.00.000000
SYS        SUMJOIN$           TABLE                    404 14-SEP-05 00.00.00.000000
SYS        I_SUMJOIN$_1       INDEX                    405 14-SEP-05 00.00.00.000000
SYS        SUMDEP$            TABLE                    406 14-SEP-05 00.00.00.000000
SYS        I_SUMDEP$_1        INDEX                    407 14-SEP-05 00.00.00.000000
SYS        I_SUMDEP$_2        INDEX                    408 14-SEP-05 00.00.00.000000
SYS        SUMPRED$           TABLE                    409 14-SEP-05 00.00.00.000000

10 rows selected.

Some points to note are:

VALUE. We can see lots of nested functions, so we'll start at the lowest-level. The DB_OBJECTS table is an object table. The VALUE function accepts a table alias and returns object instances associated with each row of the object table. In our example, the VALUE function will return each instance of an XMLType held in our DB_OBJECTS table.
EXTRACT. This function returns a fragment of each instance of XMLType data according to the XPath address we supply.
XMLSEQUENCE. Due to the way the XSD was designed, the /ROWSET/THISROW XPath contains repeating rows of data. Hence we need to use the XMLSEQUENCE function to transform the extracted XML data into a VARRAY.
TABLE. As most readers will be aware, this enables us to read collections of data as virtual tables. This is commonly used with table/pipelined functions and to access collections stored in tables. In our example, we have a collection of XML data fragments in "THISROW" format.
EXTRACTVALUE. This function is used to access the data attributes within an XMLType fragment. We extracted data at the "THISROW" level, hence this becomes the top-level for the XPath expressions we use at this point.

The second access method is courtesy of Adrian Christie. With this method, we access the hidden XMLDATA column directly. Remember from our dictionary queries above that the DB_OBJECTS.XMLDATA column is of DB_OBJECTS_OT type. This type is an object type with a single nested table attribute. The nested table itself has five attributes (it is based on an object type of our underlying record structure). Hence, using the TABLE operator, we can easily select the data from each object record as follows.

SQL> SELECT xml.owner
  2  ,      xml.object_name
  3  ,      xml.object_type
  4  ,      xml.object_id
  5  ,      xml.created
  6  FROM   db_objects do
  7  ,      TABLE(do.xmldata.thisrow) xml
  8  WHERE  ROWNUM <= 10;

OWNER      OBJECT_NAME        OBJECT_TYPE        OBJECT_ID CREATED
---------- -----------------  ------------------ --------- ------------------------------
SYS        SUMKEY$            TABLE                    400 14-SEP-05 00.00.00.000000
SYS        I_SUMKEY$_1        INDEX                    401 14-SEP-05 00.00.00.000000
SYS        SUMAGG$            TABLE                    402 14-SEP-05 00.00.00.000000
SYS        I_SUMAGG$_1        INDEX                    403 14-SEP-05 00.00.00.000000
SYS        SUMJOIN$           TABLE                    404 14-SEP-05 00.00.00.000000
SYS        I_SUMJOIN$_1       INDEX                    405 14-SEP-05 00.00.00.000000
SYS        SUMDEP$            TABLE                    406 14-SEP-05 00.00.00.000000
SYS        I_SUMDEP$_1        INDEX                    407 14-SEP-05 00.00.00.000000
SYS        I_SUMDEP$_2        INDEX                    408 14-SEP-05 00.00.00.000000
SYS        SUMPRED$           TABLE                    409 14-SEP-05 00.00.00.000000

10 rows selected.

This method is far more simple to both use and understand. Some developers might be uncomfortable using a hidden (and as far as I know undocumented) method to access the data, but the Autotrace examples in the next section indicate that the underlying data access for these two queries is generally the same.

performance characteristics

We will now compare the two methods for speed and resource usage using our full XML file, Autotrace and the "wall-clock". We'll begin with the first SQL method that uses the various functions described earlier.

SQL> set autotrace traceonly
SQL> set timing on
 
SQL> SELECT EXTRACTVALUE(xseq.column_value, '/THISROW/OWNER')       AS owner
  2  ,      EXTRACTVALUE(xseq.column_value, '/THISROW/OBJECT_NAME') AS object_name
  3  ,      EXTRACTVALUE(xseq.column_value, '/THISROW/OBJECT_TYPE') AS object_type
  4  ,      EXTRACTVALUE(xseq.column_value, '/THISROW/OBJECT_ID')   AS object_id
  5  ,      EXTRACTVALUE(xseq.column_value, '/THISROW/CREATED')     AS created
  6  FROM   db_objects do
  7  ,      TABLE(XMLSEQUENCE(EXTRACT(VALUE(do), '/ROWSET/THISROW'))) xseq;

38420 rows selected.

Elapsed: 00:00:00.64

Execution Plan
----------------------------------------------------------
Plan hash value: 1787826955

----------------------------------------------------------------------------------------------------
| Id  | Operation          | Name                          | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                               | 38777 |   224M|   125   (4)| 00:00:02 |
|*  1 |  HASH JOIN         |                               | 38777 |   224M|   125   (4)| 00:00:02 |
|*  2 |   TABLE ACCESS FULL| DB_OBJECTS                    |     1 |    30 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| SYSNTjakx4WHmQ4u5f36ct7Nawg== | 38777 |   223M|   121   (3)| 00:00:02 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("NESTED_TABLE_ID"="DB_OBJECTS"."SYS_NC0000800009$")
   2 - filter(SYS_CHECKACL("ACLOID","OWNERID",xmltype(''))=1)
   3 - filter("SYS_NC_TYPEID$" IS NOT NULL)

Note
-----
   - dynamic sampling used for this statement


Statistics
----------------------------------------------------------
         15  recursive calls
          0  db block gets
       3438  consistent gets
          0  physical reads
          0  redo size
    1735982  bytes sent via SQL*Net to client
      28551  bytes received via SQL*Net from client
       2563  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
      38420  rows processed

Now we can compare the second technique as follows.

SQL> SELECT xml.owner
  2  ,      xml.object_name
  3  ,      xml.object_type
  4  ,      xml.object_id
  5  ,      xml.created
  6  FROM   db_objects do
  7  ,      TABLE (do.xmldata.thisrow) xml;

38420 rows selected.

Elapsed: 00:00:00.62

Execution Plan
----------------------------------------------------------
Plan hash value: 1787826955

----------------------------------------------------------------------------------------------------
| Id  | Operation          | Name                          | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                               | 38777 |   224M|   125   (4)| 00:00:02 |
|*  1 |  HASH JOIN         |                               | 38777 |   224M|   125   (4)| 00:00:02 |
|*  2 |   TABLE ACCESS FULL| DB_OBJECTS                    |     1 |    30 |     3   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| SYSNTjakx4WHmQ4u5f36ct7Nawg== | 38777 |   223M|   121   (3)| 00:00:02 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("NESTED_TABLE_ID"="DB_OBJECTS"."SYS_NC0000800009$")
   2 - filter(SYS_CHECKACL("ACLOID","OWNERID",xmltype(''))=1)

Note
-----
   - dynamic sampling used for this statement


Statistics
----------------------------------------------------------
         15  recursive calls
          0  db block gets
       3243  consistent gets
          0  physical reads
          0  redo size
    1735982  bytes sent via SQL*Net to client
      28551  bytes received via SQL*Net from client
       2563  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
      38420  rows processed

We can see that the two methods are directly comparable for speed and are in fact almost the same SQL statement. We say almost because there is no filter at step 3 in the execution plan of the second statement. This step appears to some kind of a semantic check against the type in the object table (this is the object with the system-generated name - see the dictionary query from earlier in the article). Whatever this type check might be, it has little impact on the overall timings and the workload is the same.

dropping objects from xml db

Finally, we'll remove the various XML DB objects we created during this article. We will work from the bottom upwards, so in the following example we will remove the XML and XSD resources (db_objects.xml and db_objects.xsd), the registered schema and the XML DB folders in that order. Note that the procedures are DML rather than DDL and as such require a COMMIT to complete the removal. Note also that the URL in schema de-registration must be the same as that used during registration. IP addresses and domain names are not interchangeable else we can expect an ORA-31000 error.

SQL> BEGIN
  2     DBMS_XDB.DELETERESOURCE(
  3        abspath => '/public/demo/xml/db_objects.xml'
  4        );
  5     DBMS_XDB.DELETERESOURCE(
  6        abspath => '/public/demo/xsd/db_objects.xsd'
  7        );
  8     DBMS_XMLSCHEMA.DELETESCHEMA(
  9        schemaurl => 'http://localhost:8080/public/demo/xsd/db_objects.xsd',
 10        delete_option => 4
 11        );
 12     DBMS_XDB.DELETERESOURCE(
 13        abspath       => '/public',
 14        delete_option => 4
 15        );
 16     COMMIT;
 17  END;
 18  /

PL/SQL procedure successfully completed.

source code

The source code for the examples in this article can be downloaded from here.

Adrian Billington, March 2006

oracle-developer.net