Wednesday, July 15, 2015

A Few Lines of SQL Tames a Weird Requirement

A few days back I answered a question on the OTN SQL Discussion Forum and thought it blog worthy.

The requirement itself is kind of strange, but I'll try and explain it. Basically there is an outbound trip and an inbound trip. Each record has a direction identifier, a sequence and various waypoints aka tp points.

The data looks like this:

create table a(dirid number, seq number, tp number);

insert into a values (1,1,101);
insert into a values (1,2,143);
insert into a values (1,3,120);
insert into a values (1,4,119);

insert into a values (4,1,61);
insert into a values (4,2,119);
insert into a values (4,3,120);
insert into a values (4,4,143);
insert into a values (4,5,101);


The requirement is for the data to be transformed 
from this                                 to this
Dirid Seq  Tr                     Dirid Seq  Tr
1     1    101                    1     1    101
1     2    143                    1     2    143
1     3    120                    1     3    120
1     4    119                    1     4    119
4     1    61                     1     5    NULL
4     2    119                    4     5    61
4     3    120                    4     4    119
4     4    143                    4     3    120
4     5    101                    4     2    143
                                  4     1    101

So I see a couple of things right off the bat. First off, we've added a row. It appears to be an endpoint to dirid = 1. Secondly we've changed the sequence values for dirid = 4. Thirdly, the sort order are different for the various dirid's. Ascending for dirid=1 and descending for dirid=4

Hmmm, how do we attack this.

Since, we're manufacturing rows, I'm thinking MODEL clause.
Next, I'll need to populate that NULL row with the maximum seq value.
I'll need to modify the seq values where dirid=4. I can do that with a CASE statement around multiple ROW_NUMBER analytics.
Lastly, I'll need to again put a CASE statement around multiple ROW_NUMBER analytics, because he sorting rules change based on dirid.

I warned you the requirement seemed a little sketchy, but fun nonetheless. 

So let's start to put it all together.

First, let's add that row using the MODEL clause

SELECT dirid,seq, tp
  FROM a
 MODEL
DIMENSION BY ( dirid,seq )
    MEASURES ( tp )
RULES UPSERT ( tp [1,0] = NULL)

which gives us the following output...so far so good.

Dirid Seq  Tr
1     1    101
1     2    143
1     3    120
1     4    119
4     1    61
4     2    119
4     3    120
4     4    143
4     5    101
1     0    NULL

Next we need to populate the new row with the correct seq value. Let's use a MAX analytic function

SELECT dirid,DECODE(seq,0,MAX(seq) OVER (PARTITION BY NULL),seq) seq,tp
  FROM (SELECT dirid,seq, tp
          FROM a
         MODEL
        DIMENSION BY ( dirid,seq )
            MEASURES ( tp )
        RULES UPSERT ( tp [1,0] = NULL))

which gives us the following output...again, so far so good.

Dirid Seq  Tr
1     1    101
1     2    143
1     3    120
1     4    119
4     1    61
4     2    119
4     3    120
4     4    143
4     5    101
1     5    NULL

Next we need to modify the seq values where dirid = 4. Let's place a CASE statement around a few ROW_NUMBER analytics.

SELECT dirid,seq,tp,
       CASE 
        WHEN dirid = 1 THEN  
             row_number() OVER (PARTITION BY dirid ORDER BY seq) 
        ELSE  
             row_number() OVER (PARTITION BY dirid ORDER BY seq DESC) 
        END rn
  FROM (SELECT dirid,DECODE(seq,0,MAX(seq) OVER (PARTITION BY NULL),seq) seq,tp
          FROM (SELECT dirid,seq, tp
                  FROM a
                 MODEL
                DIMENSION BY ( dirid,seq )
                    MEASURES ( tp )
                RULES UPSERT ( tp [1,0] = NULL)))

Cool. The rn column contains the values needed for seq.

Dirid Seq  Tr    rn
1     1    101   1
1     2    143   2
1     3    120   3
1     4    119   4
1     5    NULL  5
4     1    61    5
4     2    119   4
4     3    120   3
4     4    143   2
4     5    101   1

Let's replace seq with rn and again wrap a CASE statement to handle the different sorting requirements of dirid=1 and dirid=4


SELECT dirid,rn seq,tp,
       CASE 
         WHEN dirid = 1 THEN  
              row_number() OVER (PARTITION BY dirid ORDER BY rn) 
         ELSE  
              row_number() OVER (PARTITION BY dirid ORDER BY rn DESC) 
       END rn2
FROM (SELECT dirid,seq,tp,
             CASE 
                WHEN dirid = 1 THEN  
                     row_number() OVER (PARTITION BY dirid ORDER BY seq) 
                ELSE  
                     row_number() OVER (PARTITION BY dirid ORDER BY seq DESC) 
             END rn
       FROM (SELECT dirid,DECODE(seq,0,MAX(seq) OVER (PARTITION BY NULL),seq)seq
                    ,tp
              FROM (SELECT dirid,seq, tp
                    FROM a
                    MODEL
                    DIMENSION BY ( dirid,seq )
                    MEASURES ( tp )
                    RULES UPSERT ( tp [1,0] = NULL))))

Almost there.

Dirid Seq  Tr    rn2
1     1    101   1
1     2    143   2
1     3    120   3
1     4    119   4
1     5    NULL  5
4     1    101   5
4     2    143   4
4     3    120   3
4     4    119   2
4     5    61    1


The final step is to sort on dirid, rn2.

SELECT dirid,seq,tp
FROM (   SELECT dirid,rn seq,tp,
                CASE 
                   WHEN dirid = 1 THEN  
                        row_number() OVER (PARTITION BY dirid ORDER BY rn) 
                   ELSE  
                        row_number() OVER (PARTITION BY dirid ORDER BY rn DESC) 
                END rn2
          FROM (SELECT dirid,seq,tp,
                      CASE 
                        WHEN dirid = 1 THEN  
                             row_number() OVER (PARTITION BY dirid ORDER BY seq) 
                        ELSE  
                             row_number() OVER (PARTITION BY dirid ORDER BY seq DESC) 
                      END rn
                  FROM (SELECT dirid,DECODE(seq,0,MAX(seq) OVER (PARTITION BY NULL),seq) seq,tp
                          FROM (SELECT dirid,seq, tp
                                  FROM a
                                 MODEL
                                DIMENSION BY ( dirid,seq )
                                    MEASURES ( tp )
                                RULES UPSERT ( tp [1,0] = NULL)))))
ORDER BY dirid,rn2   


Sweetness! The outbound trip (dirid=1) goes out to the following waypoint in seq order, then the inbound trip (dirid=4)  comes back to the starting point in reverse.

Dirid Seq  Tr    
1     1    101   
1     2    143   
1     3    120   
1     4    119   
1     5    NULL  
4     5    61    
4     4    119   
4     3    120   
4     2    143   
4     1    101   


I warned you it was a weird requirement, but those are the kind of things that make you think "I can do that in a few lines of SQL"....now with a smattering of analytic functions you can too.


...our journey continues


Thursday, July 9, 2015

NTILE vs WIDTH_BUCKET and the Winner is...PIVOT

At a recent meeting of developers, a requirement was discussed where a finite number of columns are to be populated in alphabetical order - think the opposite of telephone book.

As is often the case, there are multiple ways to do this, but at it's heart, this really is a pure pivot. I think we should rule out both NTILE and WIDTH_BUCKET.

NTILE creates equiheight histograms, while WIDTH_BUCKET creates equiwidth histograms. We could probably force the use of either, but we would still need to pivot based on the quartile values.

NTILE, in reality, allows us to specify the number of rows not the number of columns across. For example,

SELECT ename , NTILE(5) OVER ( ORDER BY ename ) AS quartile FROM emp

ENAMEQUARTILE
ADAMS1
ALLEN1
BLAKE1
CLARK2
FORD2
JAMES2
JONES3
KING3
MARTIN3
MILLER4
SCOTT4
SMITH4
TURNER5
WARD5

Modifying the query actually adds to the number of columns. For example,

SELECT ename , NTILE(3) OVER ( ORDER BY ename ) AS quartile FROM emp

ENAMEQUARTILE
ADAMS1
ALLEN1
BLAKE1
CLARK1
FORD1
JAMES2
JONES2
KING2
MARTIN2
MILLER2
SCOTT3
SMITH3
TURNER3
WARD3


At first glance, WIDTH_BUCKET would seem to make better sense, because we want to specify a number of columns, but these columns are assigned based on numerical ranges, not strings.


So the most straightforward way to do this, I feel, is a pure PIVOT...something like this.

SELECT col_1 , col_2 , col_3 
  FROM  ( SELECT ename ,
                 TRUNC ( ( ROW_NUMBER() OVER ( ORDER BY ename ) - 1 ) / 3 ) AS row_num ,
                     MOD ( ROW_NUMBER() OVER ( ORDER BY ename ) - 1, 3 )    AS col_num 
            FROM emp ) PIVOT
          ( MAX(ename) FOR col_num IN ( 0 as col_1 , 1 as col_2 , 2 as col_3 )
        )
ORDER BY row_num

COL_1COL_2COL_3
ADAMSALLENBLAKE
CLARKFORDJAMES
JONESKINGMARTIN
MILLERSCOTTSMITH
TURNERWARD

To display 4 columns instead of 3....

SELECT col_1 , col_2 , col_3 , col_4
  FROM  ( SELECT ename ,
                 TRUNC ( ( ROW_NUMBER() OVER ( ORDER BY ename ) - 1 ) / 4 ) AS row_num ,
                     MOD ( ROW_NUMBER() OVER ( ORDER BY ename ) - 1, 4 )    AS col_num 
            FROM emp ) PIVOT
          ( MAX(ename) FOR col_num IN ( 0 as col_1 , 1 as col_2 , 2 as col_3 , 3 as col_4 )
        )
ORDER BY row_num

COL_1COL_2COL_3COL_4
ADAMSALLENBLAKECLARK
FORDJAMESJONESKING
MARTINMILLERSCOTTSMITH
TURNERWARD



To completely eliminate the PIVOT, one could use the MODEL clause, but that can get pretty tedious, pretty fast.

SELECT col_1, col_2, col_3
  FROM   emp
 MODEL
  RETURN UPDATED ROWS
  DIMENSION BY
  ( TRUNC ( ( ROW_NUMBER() OVER ( ORDER BY ename ) - 1 ) / 3 ) AS row_num ,
        MOD ( ROW_NUMBER() OVER ( ORDER BY ename ) - 1, 3 )    AS col_num
  )
  MEASURES ( ename AS col_1, ename AS col_2, ename AS col_3 )
  RULES UPSERT
...not this kind of MODEL
  (
    col_2[ANY,0] = col_1[CV(),1] ,
    col_3[ANY,0] = col_1[CV(),2]
  )
ORDER BY row_num 


COL_1COL_2COL_3
ADAMSALLENBLAKE
CLARKFORDJAMES
JONESKINGMARTIN
MILLERSCOTTSMITH
TURNERWARD



...our journey continues