pipelined functions in oracle 9i
Since Oracle 8.0, it has been possible to select from a collection of data (usually loaded via a function) as a "virtual table". This technique became popular in the 8i timeframe thanks to Tom Kyte and numerous web forums where "SELECT FROM TABLE(CAST(plsql_function AS collection_type))" became a common technique for binding user-generated lists of data. However, as a technique for processing large volumes of data, "table functions" of this sort are limited due to their potentially large memory footprints.
In 9i Release 1 (9.0), Oracle introduced pipelined table functions (known simply as pipelined functions). These build on the table function principle but with some critical differences, three of which stand out in particular:
- first, data is piped (or streamed) to the caller in small arrays of prepared data, rather than fully materialised as with original table functions;
- second, pipelined functions can be parallelised by Oracle which means that PL/SQL can be executed by multiple slaves for the first time; and
- third, pipelined functions make it easy to convert PL/SQL procedures into rowsources for bulk SQL operations, combining complex transformation logic with the benefits of SQL.
This article introduces the fundamentals of pipelined functions, starting with a simple example.
a simple pipelined function
The best way to visualise a pipelined function is to see a simple example. For any pipelined function we require the following elements:
- a collection type; and
- the pipelined PL/SQL function (usually contained inside a package).
In the following example, we will create a simple row-generator that pipes out a given number of records. To begin, we require a collection type that is structured according to the data that we wish to pipe. In our row-generator function, we will pipe a single number attribute, so we can simply create a collection type of Oracle's NUMBER datatype.
SQL> CREATE OR REPLACE TYPE number_ntt AS TABLE OF NUMBER;
     /
Type created.
Oracle will use this collection type to "buffer" small numbers of records as it pipes data to the function caller. We can now create our first pipelined function. For simplicity we will create a standalone function (we will normally use packaged functions); some of the syntax will be unfamiliar, so comments follow the code.
SQL> CREATE FUNCTION row_generator (
        rows_in IN PLS_INTEGER
     ) RETURN number_ntt PIPELINED IS
     BEGIN
        FOR i IN 1 .. rows_in LOOP
           PIPE ROW (i);
        END LOOP;
        RETURN;
     END;
     /
Function created.
Note in particular the following:
- the PIPELINED keyword in the function definition. There are several extensions to this that we will see later in this article;
- the return type must be a collection type (in our example we are using the simple NUMBER_NTT collection type based on NUMBER);
- the new PIPE ROW syntax to add a single record into the return collection stream. The record we pipe must be based on the same underlying type as the collection type. In our row-generator, we pipe out records of a single number attribute;
- the "empty" RETURN statement. In a pipelined function, the PIPE ROW statement is responsible for "returning" data and not the RETURN statement.
Now that we have created a simple pipelined function, we can use it to generate 10 rows of data, as follows.
SQL> SELECT *
     FROM   TABLE( row_generator(10) );
COLUMN_VALUE
------------
           1
           2
           3
           4
           5
           6
           7
           8
           9
          10

10 rows selected.
Note how we wrap the function call in the TABLE operator. This "virtualises" the collection and allows us to select from it as a datasource such as a table or view. The TABLE operator has been available since Oracle 8.0 (when it was named "THE" - it was renamed to "TABLE" in 8i).
Note also the name of the column generated by our function. Remember we based our collection type on the built-in datatype NUMBER. In situations such as this (i.e. when the collection type is not based on any explicitly-named attributes), Oracle defaults to COLUMN_VALUE. In practice, we are more likely to base our collection types on more complex structures than a single scalar built-in, as we will see later in this article.
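If the default COLUMN_VALUE name is unhelpful, we can of course alias it in the usual way. A trivial sketch using our row-generator:

SELECT t.column_value AS generated_number
FROM   TABLE( row_generator(3) ) t;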
streaming vs materialising: a visual example
It is quite difficult to visualise how a pipelined function differs from a table function using a static example in an article. In the following example, we will attempt to demonstrate the "streaming" effect of pipelined functions and compare this to a standard table function. First we will create a simple table function that returns a small collection of our previous collection type.
SQL> CREATE FUNCTION table_function RETURN number_ntt AS
        nt number_ntt := number_ntt();
     BEGIN
        FOR i IN 1 .. 5 LOOP
           DBMS_LOCK.SLEEP(1);
           nt.EXTEND;
           nt(nt.LAST) := i;
        END LOOP;
        RETURN nt; --<-- return whole collection
     END table_function;
     /
Function created.
Note how the table function loads a collection variable in full before returning it to the "consumer". Next we create a pipelined function to generate the same number of rows as the table function.
SQL> CREATE FUNCTION pipelined_function RETURN number_ntt PIPELINED AS
     BEGIN
        FOR i IN 1 .. 5 LOOP
           DBMS_LOCK.SLEEP(1);
           PIPE ROW (i); --<-- send row to consumer
        END LOOP;
        RETURN;
     END pipelined_function;
     /
Function created.
We can see that both functions include a 1 second sleep inside their respective loops and that the overall duration will be approximately 5 seconds. We will use this to demonstrate the difference between the execution of the functions and the delivery of the first rows. To help us with this, we will create a small function to return a TIMESTAMP against each generated row (using the built-in LOCALTIMESTAMP function directly will not work as it returns a constant value for the entire recordset).
SQL> CREATE FUNCTION get_time RETURN TIMESTAMP IS
     BEGIN
        RETURN LOCALTIMESTAMP;
     END get_time;
     /
Function created.
Finally, we set up our session's timestamp format and set the SQL*Plus arraysize to 1 to ensure that the pipelined function streams a single record at a time.
SQL> ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'HH24:MI:SS.FF3';
Session altered.
SQL> set arraysize 1
We are now ready to demonstrate the effect of the pipelined function. We will begin by selecting from the simple table function.
SQL> SELECT get_time() AS ts FROM DUAL;
TS
----------------------------------------------------------
07:02:25.000

1 row selected.
SQL> SELECT column_value
     ,      get_time() AS ts
     FROM   TABLE(table_function);
COLUMN_VALUE TS
------------ ---------------------------------------------
           1 07:02:30.062 --<-- first row takes 5 seconds
           2 07:02:30.062
           3 07:02:30.062
           4 07:02:30.062
           5 07:02:30.062

5 rows selected.
We can see quite clearly that the table function did not deliver any data until the function completed. The elapsed time between the call to the function and the delivery of the data was a little over 5 seconds. We can compare this now to the pipelined function, as follows.
SQL> SELECT get_time() AS ts FROM dual;
TS
----------------------------------------------------------
07:02:35.359

1 row selected.
SQL> SELECT column_value
     ,      get_time() AS ts
     FROM   TABLE(pipelined_function);
COLUMN_VALUE TS
------------ ---------------------------------------------
           1 07:02:36.390 --<-- first row after 1 second
           2 07:02:37.390
           3 07:02:38.390
           4 07:02:39.390
           5 07:02:40.390 --<-- last row after 5 seconds

5 rows selected.
The pipelined function starts to return data as soon as it is prepared (in this case after a 1 second sleep). This means that the consumer is provided with data almost as soon as it is ready and not when the entire rowsource is materialised (in Oracle literature, pipelined functions often feed other pipelined functions). In actual fact, no rowsource is ever materialised with pipelined functions and this leads to memory savings and performance improvements. We will now see the impact on memory.
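As an aside, such a "daisy-chain" is expressed in SQL by nesting TABLE and CURSOR expressions. The following sketch assumes two hypothetical pipelined functions, first_stage and second_stage, where second_stage accepts a ref cursor parameter (a style we will see later in this article):

SELECT *
FROM   TABLE(
          second_stage(
             CURSOR( SELECT * FROM TABLE( first_stage ) ) ) );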
streaming vs materialising: memory usage
To continue our comparison of pipelined functions with table functions, we will now examine memory usage and its effect on timings. In the following example, we will create two functions: one table function and one pipelined function. Both will generate 1,000,000 records of 10 bytes. We start by creating a general VARCHAR2 collection type and the two functions.
SQL> CREATE OR REPLACE TYPE varchar2_ntt AS TABLE OF VARCHAR2(4000);
     /
Type created.
SQL> CREATE OR REPLACE FUNCTION table_function RETURN varchar2_ntt AS
        nt varchar2_ntt := varchar2_ntt();
     BEGIN
        FOR i IN 1 .. 1000000 LOOP
           nt.EXTEND;
           nt(nt.LAST) := RPAD('x',10);
        END LOOP;
        RETURN nt; --<-- return whole collection
     END table_function;
     /
Function created.
SQL> CREATE OR REPLACE FUNCTION pipelined_function RETURN varchar2_ntt PIPELINED AS
     BEGIN
        FOR i IN 1 .. 1000000 LOOP
           PIPE ROW (RPAD('x',10)); --<-- send row to consumer
        END LOOP;
        RETURN;
     END pipelined_function;
     /
Function created.
Next we create a small function to give us the maximum PGA memory usage in our session (the hard-coded statistic# of 20 corresponds to 'session pga memory max' on the demonstration database; statistic numbers can vary between versions, so joining to V$STATNAME by name is more robust).
SQL> CREATE FUNCTION get_memory RETURN NUMBER IS
        n NUMBER;
     BEGIN
        SELECT value INTO n
        FROM   v$mystat
        WHERE  statistic# = 20; --<-- 'session pga memory max'
        RETURN n;
     END get_memory;
     /
Function created.
We can now compare the memory "footprint" of the functions. For each function we will capture the memory usage, run a full select from the function using autotrace (to limit screen output) and output the memory delta. We will begin with the table function.
SQL> var v_memory NUMBER;
SQL> exec :v_memory := get_memory();
PL/SQL procedure successfully completed.
SQL> set timing on
SQL> set autotrace traceonly statistics
SQL> SELECT *
     FROM   TABLE(table_function);
1000000 rows selected.

Elapsed: 00:00:14.03

Statistics
----------------------------------------------------------
         21  recursive calls
          0  db block gets
         23  consistent gets
          0  physical reads
          0  redo size
   13067026  bytes sent via SQL*Net to client
     733825  bytes received via SQL*Net from client
      66668  SQL*Net roundtrips to/from client
          5  sorts (memory)
          0  sorts (disk)
    1000000  rows processed
SQL> BEGIN
        DBMS_OUTPUT.PUT_LINE(
           TO_CHAR(get_memory() - :v_memory) || ' bytes of PGA used.'
        );
     END;
     /
38736696 bytes of PGA used.

PL/SQL procedure successfully completed.
We can see that the table function used approximately 37Mb of PGA memory to materialise the entire collection. For this reason, table functions are neither scalable nor suitable for concurrency. This somewhat restricts their suitability to smaller array "helper" functions, such as in-list binding of user-inputs (a common technique that has become an FAQ on some online forums; a sketch of it follows).
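For reference, a minimal sketch of that in-list technique, re-using our NUMBER_NTT type (the in_list parsing function is hypothetical and deliberately simple):

CREATE FUNCTION in_list( csv_in IN VARCHAR2 ) RETURN number_ntt IS
   nt    number_ntt := number_ntt();
   v_csv VARCHAR2(32767) := csv_in || ',';
   v_pos PLS_INTEGER;
BEGIN
   WHILE v_csv IS NOT NULL LOOP
      v_pos := INSTR(v_csv, ',');           --<-- next delimiter
      nt.EXTEND;
      nt(nt.LAST) := TO_NUMBER(SUBSTR(v_csv, 1, v_pos - 1));
      v_csv := SUBSTR(v_csv, v_pos + 1);    --<-- discard parsed element
   END LOOP;
   RETURN nt;
END in_list;
/

-- Bind a user-generated list of IDs as a "virtual table"...
SELECT *
FROM   emp
WHERE  empno IN (SELECT column_value
                 FROM   TABLE(CAST(in_list('7369,7499,7521') AS number_ntt)));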
Pipelined functions, on the other hand, are targeted directly at high-volume processing of complex data transformations, which means they are designed to be scalable and efficient. To be scalable, they must be intelligent with memory. To test whether this is the case, we can repeat the simple memory test with our single-attribute pipelined function as follows.
SQL> exec :v_memory := get_memory();
PL/SQL procedure successfully completed.
SQL> set autotrace traceonly statistics
SQL> SELECT *
     FROM   TABLE(pipelined_function);
1000000 rows selected.

Elapsed: 00:00:12.01

Statistics
----------------------------------------------------------
         21  recursive calls
          0  db block gets
         26  consistent gets
          0  physical reads
          0  redo size
   13067026  bytes sent via SQL*Net to client
     733825  bytes received via SQL*Net from client
      66668  SQL*Net roundtrips to/from client
          5  sorts (memory)
          0  sorts (disk)
    1000000  rows processed
SQL> BEGIN
        DBMS_OUTPUT.PUT_LINE(
           TO_CHAR(get_memory() - :v_memory) || ' bytes of PGA used.'
        );
     END;
     /
196608 bytes of PGA used.

PL/SQL procedure successfully completed.
We can see a huge saving in the memory footprint. In fact, it doesn't really matter if we process 1 row or 1 million rows from the pipelined function. The memory usage will be low (assuming we don't pipe massive LOBs of course).
complex pipelined functions
So far we have seen a simple, single-attribute pipelined function based on a built-in datatype (we have also seen a lot of detail on the mechanics of pipelined functions). Of course, in "real" systems, we are more likely to be processing more complex data structures (i.e. records with multiple attributes). To process complex records, a pipelined function requires three elements as follows:
- a defining record structure. This can be an object type or a PL/SQL record type defined in a package specification;
- a collection type of the underlying record structure. This can be a collection type created via the CREATE TYPE... command or a collection type declared in a PL/SQL package specification; and
- the pipelined function itself, which returns the collection type (as we saw with our simple row-generator example).
Therefore, we have two ways to define the underlying types; either by creating them or by declaring them in a package. We will examine both methods below.
pipelined functions with user-defined types
This method requires us to create an object type to define the record structure that is returned by our pipelined function. The object types we create can have methods if required, but in most cases we will use them purely as persistent record definitions.
In the following example, we will create a pipelined function based on the ubiquitous EMP table. We will define our record and collection structures using an object and collection type, respectively. The pipelined function will simply fetch and pipe EMP data. Obviously this is not the intended use for pipelined functions, but the focus here is to demonstrate the mechanics as clearly as possible.
We will begin by creating our EMP "record" as an object type.
SQL> CREATE TYPE emp_ot AS OBJECT
     ( empno    NUMBER(4)
     , ename    VARCHAR2(10)
     , job      VARCHAR2(9)
     , mgr      NUMBER(4)
     , hiredate DATE
     , sal      NUMBER(7,2)
     , comm     NUMBER(7,2)
     , deptno   NUMBER(2)
     );
     /
Type created.
Next we create a collection type based on this structure. This will be the return type of our pipelined function.
SQL> CREATE TYPE emp_ntt AS TABLE OF emp_ot;
     /
Type created.
Using these types, we will now create our pipelined function to pipe out EMP data as follows. Note the use of the object type constructor to prepare a single row in the correct format for piping.
SQL> CREATE FUNCTION pipelined_emp RETURN emp_ntt PIPELINED AS
     BEGIN
        FOR r_emp IN (SELECT * FROM emp) LOOP
           PIPE ROW ( emp_ot( r_emp.empno,
                              r_emp.ename,
                              r_emp.job,
                              r_emp.mgr,
                              r_emp.hiredate,
                              r_emp.sal,
                              r_emp.comm,
                              r_emp.deptno ) );
        END LOOP;
        RETURN;
     END pipelined_emp;
     /
Function created.
We can now select from this function as follows. The column names are derived from the underlying object type that we used to define a single record (EMP_OT).
SQL> SELECT pe.empno
     ,      pe.ename
     ,      pe.job
     ,      pe.sal
     FROM   TABLE(pipelined_emp) pe;
     EMPNO ENAME      JOB              SAL
---------- ---------- --------- ----------
      7369 SMITH      CLERK            800
      7499 ALLEN      SALESMAN        1600
      7521 WARD       SALESMAN        1250
      7566 JONES      MANAGER         2975
      7654 MARTIN     SALESMAN        1250
      7698 BLAKE      MANAGER         2850
      7782 CLARK      MANAGER         2450
      7788 SCOTT      ANALYST         3000
      7839 KING       PRESIDENT       5000
      7844 TURNER     SALESMAN        1500
      7876 ADAMS      CLERK           1100
      7900 JAMES      CLERK            950
      7902 FORD       ANALYST         3000
      7934 MILLER     CLERK           1300

14 rows selected.
pipelined functions with pl/sql types
In addition to creating our own types to support a pipelined function, we can also let Oracle do it for us. This alternative method enables us to define our underlying record and collection types inside a PL/SQL package specification. Oracle uses these types as the basis for creating a system-generated set of object/collection types to support the pipelined function. We can see a small example of this below.
The following package specification contains an EMP record type and associated collection type. We declare our pipelined function to use these types.
SQL> CREATE PACKAGE pipelined_pkg AS

        TYPE emp_rt IS RECORD
        ( empno    NUMBER(4)
        , ename    VARCHAR2(10)
        , job      VARCHAR2(9)
        , mgr      NUMBER(4)
        , hiredate DATE
        , sal      NUMBER(7,2)
        , comm     NUMBER(7,2)
        , deptno   NUMBER(2)
        );

        TYPE emp_ntt IS TABLE OF pipelined_pkg.emp_rt;

        FUNCTION pipelined_emp
           RETURN pipelined_pkg.emp_ntt PIPELINED;

     END pipelined_pkg;
     /
Package created.
Remember that Oracle uses these PL/SQL types as the basis for creating SQL types. We can see this in the dictionary as follows. The types are system-named according to the object ID of the package specification.
SQL> SELECT type_name
     ,      typecode
     ,      attributes
     FROM   user_types
     WHERE  INSTR( type_name, (SELECT object_id
                               FROM   user_objects
                               WHERE  object_name = 'PIPELINED_PKG'
                               AND    object_type = 'PACKAGE') ) > 0;
TYPE_NAME                    TYPECODE           ATTRIBUTES
---------------------------- ------------------ ----------
SYS_PLSQL_33433_71_1         COLLECTION                  0
SYS_PLSQL_33433_9_1          OBJECT                      8
SYS_PLSQL_33433_DUMMY_1      COLLECTION                  0

3 rows selected.
SQL> SELECT type_name
     ,      attr_no
     ,      attr_name
     ,      attr_type_name
     FROM   user_type_attrs
     WHERE  INSTR( type_name, (SELECT object_id
                               FROM   user_objects
                               WHERE  object_name = 'PIPELINED_PKG'
                               AND    object_type = 'PACKAGE') ) > 0
     ORDER  BY
            attr_no;
TYPE_NAME                 ATTR_NO ATTR_NAME        ATTR_TYPE_NAME
---------------------- -------- ---------------- -------------------
SYS_PLSQL_33433_9_1           1 EMPNO            NUMBER
SYS_PLSQL_33433_9_1           2 ENAME            VARCHAR2
SYS_PLSQL_33433_9_1           3 JOB              VARCHAR2
SYS_PLSQL_33433_9_1           4 MGR              NUMBER
SYS_PLSQL_33433_9_1           5 HIREDATE         DATE
SYS_PLSQL_33433_9_1           6 SAL              NUMBER
SYS_PLSQL_33433_9_1           7 COMM             NUMBER
SYS_PLSQL_33433_9_1           8 DEPTNO           NUMBER

8 rows selected.
SQL> SELECT type_name
     ,      coll_type
     ,      elem_type_name
     FROM   user_coll_types
     WHERE  INSTR( type_name, (SELECT object_id
                               FROM   user_objects
                               WHERE  object_name = 'PIPELINED_PKG'
                               AND    object_type = 'PACKAGE') ) > 0;
TYPE_NAME                   COLL_TYPE           ELEM_TYPE_NAME
--------------------------- ------------------- ---------------------------
SYS_PLSQL_33433_71_1        TABLE               SYS_PLSQL_33433_9_1
SYS_PLSQL_33433_DUMMY_1     TABLE               NUMBER

2 rows selected.
Note that the record type in this example actually matches the EMP table, so we could instead do the following, substituting emp%ROWTYPE for emp_rt wherever needed and dispensing with the explicit record type for brevity.
SQL> CREATE OR REPLACE PACKAGE pipelined_pkg AS

        TYPE emp_ntt IS TABLE OF emp%ROWTYPE;

        FUNCTION pipelined_emp
           RETURN pipelined_pkg.emp_ntt PIPELINED;

     END pipelined_pkg;
     /
Package created.
The main difference between this and an explicit record structure is that Oracle uses the EMP table as the basis for creating the underlying object type (i.e. we can search the dictionary for the system-generated types with the EMP table's OBJECT_ID in the name).
Either way, we now have a defining record type (emp%ROWTYPE or emp_rt) and a collection type (emp_ntt). We can now implement our pipelined function in the package body. Note that this package body matches our original specification that included the explicit emp_rt record type.
SQL> CREATE PACKAGE BODY pipelined_pkg AS

        FUNCTION pipelined_emp RETURN pipelined_pkg.emp_ntt PIPELINED IS

           r_emp pipelined_pkg.emp_rt;

        BEGIN

           FOR r_tmp IN (SELECT * FROM emp WHERE ROWNUM <= 5) LOOP

              /*
              || Our implicit cursor-for-loop record matches the target
              || record type so we can actually pipe it out...
              */
              PIPE ROW (r_tmp);

              /*
              || We are more likely to use an explicit record variable.
              || Let's pipe out another record...
              */
              r_emp := r_tmp;
              r_emp.ename := LOWER(r_emp.ename); --<-- dummy transformation
              r_emp.job := LOWER(r_emp.job);     --<-- dummy transformation
              PIPE ROW (r_emp);

           END LOOP;

           RETURN;

        END pipelined_emp;

     END pipelined_pkg;
     /
Package body created.
The key difference to note is that we do not need to "understand" the underlying object type. We work only with our PL/SQL record type, which is possibly more familiar to many developers. Oracle is responsible for casting the records to the underlying type that the SQL engine understands.
For demonstration purposes only, we are piping two records for every source record in this example. This enables us to see that as long as the record variable matches the type that underpins the returning collection, we can pipe it. In our package body we pipe out the implicit record from our cursor-for-loop in addition to the explicit record variable in our function declaration. These happen to be of the same structure. The following SQL statement shows clearly that we have piped two output records for every input record (remember we transformed some of the attributes of the explicit record variable).
SQL> SELECT pe.empno
     ,      pe.ename
     ,      pe.job
     ,      pe.sal
     FROM   TABLE(pipelined_pkg.pipelined_emp) pe;
     EMPNO ENAME      JOB              SAL
---------- ---------- --------- ----------
      7369 SMITH      CLERK            800
      7369 smith      clerk            800
      7499 ALLEN      SALESMAN        1600
      7499 allen      salesman        1600
      7521 WARD       SALESMAN        1250
      7521 ward       salesman        1250
      7566 JONES      MANAGER         2975
      7566 jones      manager         2975
      7654 MARTIN     SALESMAN        1250
      7654 martin     salesman        1250

10 rows selected.
recap: complex pipelined functions
To recap therefore, we can use one of two methods to define the underlying record structures for use in our pipelined functions. We can create object and collection types explicitly (CREATE TYPE...) or we can use standard PL/SQL declarations in a package specification (record and collection types). If we choose the former, we create additional dependencies and more source objects, but we have the full implementation of the application under our control. If we choose the latter, we can use familiar PL/SQL record syntax but need to be aware that Oracle will create additional database objects on our behalf.
The examples we have seen so far have been contrived and extremely basic. We have not seen anything in these functions that actually warrants their use! For the remainder of this article we will examine pipelined functions from a more real-life perspective, including options for performance and their practical application.
parallel pipelined functions
Pipelined functions have a unique capability over any other form of PL/SQL processing - they are able to be parallelised by Oracle. There are certain conditions required for this to happen, but it basically means that Oracle can divide a unit of serialised PL/SQL processing among a set of parallel slaves.
To create a parallel pipelined function we require two additional elements to our code:
- additional parallel-enabling syntax in our function definition; and
- source data supplied via a cursor variable (i.e. not statically compiled within the function itself as in our previous examples).
We will see a simple example of a parallel pipelined function as follows. We will use explicit types rather than PL/SQL types and, to keep the code simple, we will create a standalone function. We will use a larger dataset based on a multiple of ALL_SOURCE (we will generate approximately 1 million records) to compare the impact of parallel processing.
First, we create our types as follows.
SQL> CREATE TYPE all_source_ot AS OBJECT
     ( owner VARCHAR2(30)
     , name  VARCHAR2(30)
     , type  VARCHAR2(12)
     , line  NUMBER
     , text  VARCHAR2(4000)
     );
     /
Type created.
SQL> CREATE TYPE all_source_ntt
        AS TABLE OF all_source_ot;
     /
Type created.
We create our parallel-enabled pipelined function as follows. Note how we include the session SID in the output. This will enable us to see the effect of parallelism (i.e. each slave will have its own session).
SQL> CREATE FUNCTION parallel_pipelined_function(
  2     cursor_in IN SYS_REFCURSOR
  3  ) RETURN all_source_ntt
  4  PIPELINED
  5  PARALLEL_ENABLE (PARTITION cursor_in BY ANY) IS
  6
  7     TYPE incoming_data_ntt IS TABLE OF all_source%ROWTYPE;
  8     v_incoming incoming_data_ntt;
  9
 10     v_outgoing all_source_ot;
 11
 12     v_sid INTEGER;
 13
 14  BEGIN
 15
 16     /*
 17     || This will help us to see parallelism in action...
 18     */
 19     SELECT sid INTO v_sid
 20     FROM   v$mystat
 21     WHERE  ROWNUM = 1;
 22
 23     /*
 24     || Process the incoming datasource...
 25     */
 26     LOOP
 27
 28        FETCH cursor_in BULK COLLECT INTO v_incoming LIMIT 500;
 29
 30        FOR i IN 1 .. v_incoming.COUNT LOOP
 31
 32           v_outgoing := all_source_ot( v_sid,
 33                                        v_incoming(i).name,
 34                                        v_incoming(i).type,
 35                                        v_incoming(i).line,
 36                                        v_incoming(i).text );
 37
 38           PIPE ROW (v_outgoing);
 39
 40        END LOOP;
 41
 42        EXIT WHEN cursor_in%NOTFOUND;
 43
 44     END LOOP;
 45     CLOSE cursor_in;
 46
 47     RETURN;
 48
 49  END parallel_pipelined_function;
 50  /
Function created.
We can see some new syntax in our parallel pipelined function. In particular, note the following:
- Line 2: the function accepts a cursor parameter. This is a prerequisite of parallel pipelined functions, i.e. we cannot parallelise cursors that are embedded within the function itself (such as those in our earlier examples). In our example we have used the built-in weak refcursor type SYS_REFCURSOR, but we can alternatively use our own weak or strong refcursor types (defined in a package specification);
- Line 5: we declare the function parallel-enabled (non-pipelined functions can also be declared with PARALLEL_ENABLE) and the syntax in parentheses is specific to pipelined functions. The PARTITION BY clause tells Oracle how to divide the source dataset (pointed to by the cursor parameter) among the parallel slaves. For weak refcursors such as SYS_REFCURSOR, we can only use the ANY keyword (i.e. Oracle will decide how to divide the source data). If we use a strong refcursor type, however, we can partition our source dataset by HASH or RANGE on a named cursor attribute or list of attributes (the attribute names are derived from the strong refcursor's return record type). This is useful when the incoming data needs to be processed in related groups and maybe in a certain order; using RANGE or HASH partitioning ensures that inter-dependent data is sent to the same parallel slave for processing (see the sketch after this list). We will not see any worked examples of RANGE or HASH parallel pipelined functions in this article, but note that performance will degrade slightly with the PARTITION BY RANGE option as Oracle will need to sort the source data first. The PARTITION BY HASH option displays similar performance to PARTITION BY ANY;
- Lines 7-8 and 28: this is purely optional but seeing as we are coding for performance, we will fetch data from the source cursor in arrays using BULK COLLECT;
- Lines 19-21 and 32: we have included the session SID in the output. This will enable us to see the effect of parallelism when we use this function later (i.e. each slave will have its own session ID). This is purely for demonstration purposes and has nothing to do with parallel pipelined function syntax; and
- Line 45: the cursor parameter is already open when it is passed into the pipelined function. As our function is the only user of this cursor, we need to close it.
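As promised above, the following sketch shows how a strong refcursor type supports HASH partitioning, re-using the EMP_OT and EMP_NTT types from earlier (the package, type and function names are illustrative only):

CREATE PACKAGE strong_refcursor_types AS
   TYPE emp_rc IS REF CURSOR RETURN emp%ROWTYPE; --<-- strong refcursor type
END strong_refcursor_types;
/

CREATE FUNCTION hash_partitioned_emp(
   cursor_in IN strong_refcursor_types.emp_rc
) RETURN emp_ntt
PIPELINED
-- All rows with the same DEPTNO are sent to the same parallel slave...
PARALLEL_ENABLE (PARTITION cursor_in BY HASH(deptno)) IS
   r_emp emp%ROWTYPE;
BEGIN
   LOOP
      FETCH cursor_in INTO r_emp;
      EXIT WHEN cursor_in%NOTFOUND;
      PIPE ROW ( emp_ot( r_emp.empno, r_emp.ename, r_emp.job, r_emp.mgr,
                         r_emp.hiredate, r_emp.sal, r_emp.comm, r_emp.deptno ) );
   END LOOP;
   CLOSE cursor_in;
   RETURN;
END hash_partitioned_emp;
/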
Moving on, we now have a parallel pipelined function ready for testing. To test it, we will create a large input dataset based on ALL_SOURCE. We will create a single table of this dataset for simplicity as follows.
SQL> CREATE TABLE source_table
     NOLOGGING
     AS
        SELECT a.*
        FROM   all_source a
        ,      TABLE(row_generator(20));
Table created.
SQL> SELECT COUNT(*) FROM source_table;
  COUNT(*)
----------
    998960

1 row selected.
To test our parallel pipelined function, we will select from it twice; once in parallel and once in serial. We will time each query and verify that Oracle parallelised the query in two ways: first using the returned data itself (remember that we piped the session ID from our function) and second using V$PQ_SESSTAT. First we will run the parallel version.
SQL> SELECT nt.owner
     ,      COUNT(*)
     FROM   TABLE(
               parallel_pipelined_function(
                  CURSOR(SELECT * FROM source_table))) nt
     GROUP  BY
            nt.owner;
OWNER                            COUNT(*)
------------------------------ ----------
10                                 496553
13                                 502407

2 rows selected.

Elapsed: 00:00:38.07
SQL> SELECT *
     FROM   v$pq_sesstat
     WHERE  statistic = 'Queries Parallelized';
STATISTIC                      LAST_QUERY SESSION_TOTAL
------------------------------ ---------- -------------
Queries Parallelized                    1             1

1 row selected.
We can see that Oracle used two parallel slaves and divided the work relatively evenly between them. Note how we passed in the cursor parameter using the CURSOR expression. This opens the cursor for the embedded SQL statement and passes it through to the pipelined function as a cursor variable.
Finally we can test the parallel pipelined function in serial as follows.
SQL> ALTER SESSION DISABLE PARALLEL QUERY;
Session altered.
SQL> set timing on
SQL> SELECT nt.owner
     ,      COUNT(*)
     FROM   TABLE(
               parallel_pipelined_function(
                  CURSOR(SELECT * FROM source_table))) nt
     GROUP  BY
            nt.owner;
OWNER                            COUNT(*)
------------------------------ ----------
11                                 998960

1 row selected.

Elapsed: 00:00:49.01
SQL> SELECT *
     FROM   v$pq_sesstat
     WHERE  statistic = 'Queries Parallelized';
STATISTIC                      LAST_QUERY SESSION_TOTAL
------------------------------ ---------- -------------
Queries Parallelized                    0             2

1 row selected.
This time we can see that the function has executed in serial. It is interesting to note that parallel execution did not save us much time on the demonstration database. On larger database servers, where the number of CPUs/slaves is higher and the degree of parallelism can be controlled with appropriate hints, this approach will yield much greater gains, as sketched below.
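For example, the degree of parallelism can be influenced by hinting the source query inside the CURSOR expression. A sketch (the degree of 4 is arbitrary):

SELECT nt.owner
,      COUNT(*)
FROM   TABLE(
          parallel_pipelined_function(
             CURSOR(SELECT /*+ PARALLEL(s,4) */ * FROM source_table s))) nt
GROUP  BY
       nt.owner;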
pipelined functions and the optimizer
We have seen that pipelined functions are alternative rowsources to tables. Our examples in this article have been simple, but usually pipelined functions will be full of complex transformation logic that turns input row A into output row B (and maybe B2, B3 etc). It is usually the case that they will generate significant volumes of data (as they are primarily an ETL tool).
Because pipelined functions generate data, the CBO needs to know how much, especially if a function is one of several rowsources in a SQL statement (i.e. it is joined to tables/views/other pipelined functions). The following demonstrates the execution plan for a simple select from our parallel pipelined function using Autotrace.
SQL> set autotrace traceonly explain
SQL> SELECT *
     FROM   TABLE(
               parallel_pipelined_function(
                  CURSOR(SELECT * FROM source_table))) nt;
Execution Plan
--------------------------------------------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=11 Card=8168 Bytes=898480)
   1    0   VIEW* (Cost=11 Card=8168 Bytes=898480)                                  :Q84001
   2    1     COLLECTION ITERATOR* (PICKLER FETCH) OF 'PARALLEL_PIPELINED_FUNCTION' :Q84001
   3    2       TABLE ACCESS* (FULL) OF 'SOURCE_TABLE' (Cost=360 Card=998960 Bytes=87908480) :Q84000

   1 PARALLEL_TO_SERIAL            SELECT C0 FROM (SELECT VALUE(A2) C0 FROM TABLE("SCOTT"."PARALLEL_PIPELINED_FUNCT
   2 PARALLEL_COMBINED_WITH_PARENT
   3 PARALLEL_TO_PARALLEL          SELECT /*+ NO_EXPAND ROWID(A1) */ A1."OWNER" C0,A1."NAME" C1,A1."TYPE" C2,A1."LI
Two things stand out from this plan. First, the COLLECTION ITERATOR PICKLER FETCH. This is Oracle's mechanism for materialising the collection data from memory. Second, and more important, is the CBO's estimated rowcount. This defaults to 8,168, which is clearly incorrect. There is no way for Oracle to correctly identify the number of rows that will be generated from the execution of a pipelined function, even though it is clear that it is accessing all the source table's data (Step 3). The CBO cannot see inside the function's logic and even if it could, the function might generate multiple rows from one input or might discard most rows due to certain conditions; the possibilities are endless.
Note that Step 3 in the plan is related to parallel execution only. This is the step used by Oracle to assign the incoming data to parallel slaves for processing. It disappears from the plan with parallel query disabled (we will see this below).
There is an undocumented CARDINALITY hint that enables us to tell the CBO how many rows will be generated by a pipelined function. The problem with this hint though is that its behaviour is seemingly erratic (and is not explicitly supported by Oracle). In the following Autotrace explain plans, we can see the effect (or lack of effect) of this hint. First we can see the effect on our simple pipelined_emp function.
SQL> set autotrace traceonly explain
SQL> SELECT *
     FROM   TABLE(pipelined_emp);
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   COLLECTION ITERATOR (PICKLER FETCH) OF 'PIPELINED_EMP'
SQL> SELECT --+ CARDINALITY(e,1000)
            *
     FROM   TABLE(pipelined_emp) e;
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=11 Card=1000 Bytes=2000)
   1    0   COLLECTION ITERATOR (PICKLER FETCH) OF 'PIPELINED_EMP'
In the hinted version, we have told the CBO that our function will generate 1,000 records and we can see this in the plan. In the unhinted version, Oracle does not even make an estimate (this appears to be the case with selects from pipelined functions without either a join or an input cursor). Next we can see the effect of this hint with a join.
SQL> SELECT *
     FROM   dept d
     ,      TABLE(pipelined_emp) e
     WHERE  d.deptno = e.deptno;
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=14 Card=8168 Bytes=163360)
   1    0   HASH JOIN (Cost=14 Card=8168 Bytes=163360)
   2    1     TABLE ACCESS (FULL) OF 'DEPT' (Cost=2 Card=4 Bytes=72)
   3    1     COLLECTION ITERATOR (PICKLER FETCH) OF 'PIPELINED_EMP'
SQL> SELECT --+ CARDINALITY(e,100000)
            *
     FROM   dept d
     ,      TABLE(pipelined_emp) e
     WHERE  d.deptno = e.deptno;
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=14 Card=100000 Bytes=2000000)
   1    0   HASH JOIN (Cost=14 Card=100000 Bytes=2000000)
   2    1     TABLE ACCESS (FULL) OF 'DEPT' (Cost=2 Card=4 Bytes=72)
   3    1     COLLECTION ITERATOR (PICKLER FETCH) OF 'PIPELINED_EMP'
We can see the 8,168 figure appearing again in our first example. It is interesting that this time the cardinality is applied to the join and not the pipelined function fetch. We can see this more clearly with the second example where we told the CBO that the function would generate 100,000 rows. Oracle has assumed a join for every row and applied this cardinality to Step 1 (hash join between the two rowsources). If we return to our parallel pipelined function, we can see that this hint has no such effect; in fact, it is not recognised at all (note that this is a serial plan, rather than parallel, which explains why the CURSOR SQL does not appear).
SQL> SELECT *
     FROM   TABLE(
               parallel_pipelined_function(
                  CURSOR(SELECT * FROM source_table))) nt;
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=11 Card=8168 Bytes=898480)
   1    0   VIEW (Cost=11 Card=8168 Bytes=898480)
   2    1     COLLECTION ITERATOR (PICKLER FETCH) OF 'PARALLEL_PIPELINED_FUNCTION'
SQL> SELECT --+ CARDINALITY(nt, 100000)
            *
     FROM   TABLE(
               parallel_pipelined_function(
                  CURSOR(SELECT * FROM source_table))) nt;
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=11 Card=8168 Bytes=898480)
   1    0   VIEW (Cost=11 Card=8168 Bytes=898480)
   2    1     COLLECTION ITERATOR (PICKLER FETCH) OF 'PARALLEL_PIPELINED_FUNCTION'
As a strategy, therefore, the undocumented CARDINALITY hint does not seem very suitable! Its behaviour is inconsistent and applies to very limited scenarios. For "sensitive" queries involving joins to pipelined functions, we would be far better using some other form of plan stability, whether by hints or stored outlines (a sketch of the latter follows).
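As a sketch of the stored outline alternative, we could preserve an acceptable plan for the join query we saw earlier (the outline name and category are arbitrary, and the CREATE ANY OUTLINE privilege is required):

CREATE OR REPLACE OUTLINE pipelined_emp_join
FOR CATEGORY pipelined_plans
ON
SELECT *
FROM   dept d
,      TABLE(pipelined_emp) e
WHERE  d.deptno = e.deptno;

The outline category would then be invoked with ALTER SESSION SET USE_STORED_OUTLINES = pipelined_plans.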
practical uses for pipelined functions
As has been stated, the examples in this article are deliberately simple for the purposes of demonstration. In "real" systems and applications, pipelined functions have far more flexibility and can solve a number of problems, including performance issues. To complete this article, I will briefly describe some examples of my pipelined function implementations.
usage 1: pl/sql-based etl
This is by far the most important and exciting application for pipelined functions and is how Oracle markets the technology. In many cases, PL/SQL ETL routines seem to follow a standard "cursor-for-loop" process that can be expressed by the following pseudo-code:
PROCEDURE load_business_data IS

   CURSOR cursor_name IS
      SELECT some_columns
      FROM   some_staging_tables
      WHERE  they_join_and_match_certain_criteria;

BEGIN

   FOR record IN cursor_name LOOP

      --[do transformations, lookups, validations]--

      INSERT INTO parent_table
      VALUES (record.attributes, variables, etc...);

      INSERT INTO child_table
      VALUES (record.attributes, variables, etc...);

      --[and so on]--

   END LOOP;

   COMMIT;

END;
Sometimes, the "do transformations, lookups, validations" stage is just too complex to be expressed as a bulk SQL statement (i.e. the "best practice" for ETL in Oracle). Pipelined functions can therefore transform these ETL processes to provide a middle-ground between the slow PL/SQL-only implementation and the fast SQL-only implementation. The pseudo-code for a pipelined function-based ETL process is as follows:
PIPELINED FUNCTION
------------------
FUNCTION pipelined_transformation(
   cursor_in IN refcursor_type
) RETURN collection_type
PIPELINED
PARALLEL_ENABLE (PARTITION cursor_in BY ANY) IS

   fetch_array array_type;
   pipe_record record_type;

BEGIN

   LOOP

      FETCH cursor_in BULK COLLECT INTO fetch_array LIMIT arraysize;
      EXIT WHEN fetch_array.COUNT = 0;

      FOR i IN 1 .. fetch_array.COUNT LOOP

         --[do transformations, lookups, validations]--
         --[prepare pipe_record]--

         PIPE ROW (pipe_record);

      END LOOP;

   END LOOP;

   CLOSE cursor_in;
   RETURN;

END;

LOAD PROCEDURE
--------------
PROCEDURE load_business_data IS
BEGIN

   INSERT ALL
      INTO parent_table VALUES (...)
      INTO child_table VALUES (...)
      --[and so on]--
   SELECT *
   FROM   TABLE(
             pipelined_transformation(
                CURSOR( SELECT some_columns
                        FROM   some_staging_tables
                        WHERE  they_join_and_match_certain_criteria )));

   COMMIT;

END;
I have used this technique several times and it is very easy to convert existing row-by-row PL/SQL routines to use this method. In general, the main body of the existing code (the looping PL/SQL parts) becomes the pipelined function logic. Insert statements are replaced with PIPE ROW statements, and the embedded SQL cursor is removed and a cursor parameter added (if the aim is to use parallel SQL and DML). The existing load procedure itself becomes an INSERT..SELECT from the new pipelined function, as seen in the pseudo-code above. The performance gains from this re-factoring are two-fold: first, the performance benefits of a bulk SQL INSERT statement; and second, the potentially massive gains of parallel SQL and DML. With the former in serial mode I have achieved up to a 30% gain in performance, but with the latter I have achieved up to an 80% reduction in load time.
Note that it is also possible to have the pipelined function return different record types for loads involving multiple tables. This will be the subject of a future article.
usage 2: wrappers over dictionary views with long columns
Several views in the data dictionary have LONG columns. These are problematic as we cannot search inside or copy these columns easily. The common views where this causes problems are XXX_VIEWS, XXX_TRIGGERS, XXX_TAB_COLUMNS, XXX_CONSTRAINTS, XXX_TAB_PARTITIONS and XXX_TAB_SUBPARTITIONS (where XXX = USER/ALL/DBA as appropriate). Using pipelined functions, we can create a simple wrapper over each view to overcome these problems, taking advantage of PL/SQL's ability to implicitly convert LONG columns to VARCHAR2 (up to 32,767 bytes).
For the object type underlying each pipelined function, we take a copy of the respective dictionary view structure but with the LONG changed to a CLOB for the relevant attribute. During processing, any LONG values that exceed 32,767 bytes (some large view-texts, for example) are handled as special cases using DBMS_SQL. As these are exceptional cases, they do not generally slow down the processing of the majority of records whose LONG values are under 32,767 bytes in length. We can then add a view over the pipelined function and read and query the CLOB column as required. In 9i, CLOB columns can be interrogated using Oracle's standard string functions such as INSTR, SUBSTR and LIKE. For additional functionality, we can also add filters to the views using application contexts. This prevents us from returning every record from the underlying views with every query we issue.
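A minimal sketch of the approach for USER_VIEWS follows (the type, function and view names are illustrative, and the DBMS_SQL special-casing of very large view texts is omitted):

CREATE TYPE user_views_ot AS OBJECT
( view_name VARCHAR2(30)
, text      CLOB --<-- LONG in the dictionary view
);
/

CREATE TYPE user_views_ntt AS TABLE OF user_views_ot;
/

CREATE FUNCTION pipelined_user_views RETURN user_views_ntt PIPELINED IS
BEGIN
   /*
   || PL/SQL implicitly converts the LONG column to a VARCHAR2 of up to
   || 32,767 bytes; larger view texts would raise an exception and need
   || to be handled separately with DBMS_SQL...
   */
   FOR r IN (SELECT view_name, text FROM user_views) LOOP
      PIPE ROW ( user_views_ot( r.view_name, r.text ) );
   END LOOP;
   RETURN;
END pipelined_user_views;
/

CREATE VIEW user_views_clob AS
   SELECT *
   FROM   TABLE(pipelined_user_views);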
usage 3: querying objects with stale/empty statistics
The DBMS_STATS.GATHER_SCHEMA_STATS procedure contains an option to list tables and indexes with stale or empty statistics. As this is a PL/SQL procedure that returns a PL/SQL collection of records, its use is restricted to PL/SQL. However, we can easily add a pipelined function (or even a simple table function) wrapper to this procedure to present a SQL view of the objects. We can also create a view to encapsulate the pipelined function call, as the following sketch shows.
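In the sketch below, the type and view names are illustrative; note that, despite its name, GATHER_SCHEMA_STATS only lists objects when called with the 'LIST STALE' option:

CREATE TYPE stale_object_ot AS OBJECT
( ownname  VARCHAR2(30)
, objtype  VARCHAR2(6)
, objname  VARCHAR2(30)
, partname VARCHAR2(30)
);
/

CREATE TYPE stale_object_ntt AS TABLE OF stale_object_ot;
/

CREATE FUNCTION list_stale_objects(
   schema_in IN VARCHAR2
) RETURN stale_object_ntt PIPELINED IS
   v_objects DBMS_STATS.ObjectTab;
BEGIN
   DBMS_STATS.GATHER_SCHEMA_STATS( ownname => schema_in,
                                   options => 'LIST STALE', --<-- list, don't gather
                                   objlist => v_objects );
   FOR i IN 1 .. v_objects.COUNT LOOP
      PIPE ROW ( stale_object_ot( v_objects(i).ownname,
                                  v_objects(i).objtype,
                                  v_objects(i).objname,
                                  v_objects(i).partname ) );
   END LOOP;
   RETURN;
END list_stale_objects;
/

CREATE VIEW stale_objects AS
   SELECT *
   FROM   TABLE(list_stale_objects(USER));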
further reading
For an alternative overview of table and pipelined functions, see the online PL/SQL User's Guide and Reference.
acknowledgements
My original parallel_pipelined_function example relied on the V$PQ_SESSTAT view to demonstrate that parallelism was happening. I have since updated the example to include each slave's SID from V$MYSTAT in the output. This method is found in Tom Kyte's Expert Oracle Database Architecture.
source code
The source code for the examples in this article can be downloaded from here.
Adrian Billington, May 2002 (updated May 2007)