Description
1. What are the two major challenges in the problem of text analysis? (7 points)
2. What is a reverse index? (7 points)
3. Why is the corpus metrics dynamic. Provide an example and a scenario that explains the dynamism of the corpus metrics. (7 points)
4. How does tf idf enhance the relevance of a search result? (7 points)
5. List and discuss a few methods that are deployed in text analysis to reduce the dimensions. (7 points)
Advanced Analytics – Technology and Tools
Copyright © 2014 EMC Corporation. All Rights Reserved.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
1
Module 5: Advanced Analytics – Technology and Tools
1
Advanced Analytics – Technology and Tools
Advanced SQL and MADlib
During this lesson the following topics are covered:
Advanced SQL and MADlib:
• Window functions
• User defined functions and aggregates
• Ordered Aggregates
• MADlib
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools
2
This lesson covers advanced SQL and MADlib functions
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
2
Window Functions
• About Window Functions
Returns a value per row, unlike aggregate functions
Has its results interpreted in terms of the current row and its
corresponding window partition or frame
Is characterized by the use of the OVER clause
Defines the window partitions, or groups of rows to apply
the function
Defines ordering of data within a window
Defines the positional or logical framing of a row with
respect to its window
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools
3
A window function performs a calculation across a set of table rows that are somehow related
to the current row. This is comparable to the type of calculation that can be done with an
aggregate function. But unlike regular aggregate functions, use of a window function does not
cause rows to become grouped into a single output row — the rows retain their separate
identities. Behind the scenes, the window function is able to access more than just the current
row of the query result.
Window functions are a new class of functions introduced in Greenplum.
Window functions allow application developers to more easily compose complex OLAP queries
using standard SQL commands. For example:
• Moving averages or sums can be calculated over various intervals.
• Aggregations and ranks can be reset as selected column values change.
• Complex ratios can be expressed in simple terms.
Window functions can only be used in the SELECT list, between the SELECT and FROM
keywords of a query.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
3
Defining Window Specifications (OVER Clause)
When defining the window function:
Include an OVER() clause
Specify the window of data to which the function applies
• Define:
Window partitions, using the PARTITION BY clause
Ordering within a window partition, using the ORDER BY clause
Framing within a window partition, using ROWS and RANGE
clauses
The ORDER BY clause also defines a frame of unbounded
preceding to current in the partition
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools
5
All window functions must have an OVER() clause. The window function specifies the window
of data to which the function applies
It defines:
•
•
•
•
Window partitions using the PARTITION BY clause.
Ordering within a window partition using the ORDER BY clause.
Framing within a window partition (ROWS/RANGE clauses).
The ORDER BY clause also defines a frame of unbounded preceding to current in the
partition.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
5
About the PARTITION BY Clause
The PARTITION BY clause:
• Can be used by all window functions
• Organizes result sets into groupings based on unique values
• Allows the function to be applied to each partition
independently
A fly and a flea in a flue
Were imprisoned, so what could they do
Said the fly, let us flee. Let us fly said the flee
So they flew through a flaw in the flue.
A canner exceedingly canny
One morning remarked to his granny
A canner can can anything that he can
But a canner can’t can a can can he?
Note: If the PARTITION BY clause is omitted, the entire
result set is treated as a single window partition.
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools
6
About the PARTITION BY Clause
The PARTITION BY clause:
• Can be used by all window functions. However, it is not a required clause. Windows that
do not use the PARTITION BY clause present the entire result set as a single window
partition.
• Organizes the result set into groupings based on the unique values of the specified
expression or column.
• Allows the function to be applied to each partition independently.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
6
Window Partition Example
SELECT * ,
row_number()
OVER()
FROM sale
ORDER BY cn;
row_number | cn | vn | pn | dt
| qty | prc
————+—-+—-+—–+————+——+—1
| 1 | 10 | 200 | 1401-03-01 | 1
| 0
2
| 1 | 30 | 300 | 1401-05-02 | 1
| 0
3
| 1 | 50 | 400 | 1401-06-01 | 1
| 0
4
| 1 | 30 | 500 | 1401-06-01 | 12
| 5
5
| 1 | 20 | 100 | 1401-05-01 | 1
| 0
6
| 2 | 50 | 400 | 1401-06-01 | 1
| 0
7
| 2 | 40 | 100 | 1401-01-01 | 1100 | 2400
8
| 3 | 40 | 200 | 1401-04-01 | 1
| 0
(8 rows)
SELECT * ,
row_number()
row_number | cn | vn | pn | dt
| qty | prc
————+—-+—-+—–+————+——+—1
| 1 | 10 | 200 | 1401-03-01 | 1
| 0
2
| 1 | 30 | 300 | 1401-05-02 | 1
| 0
3
| 1 | 50 | 400 | 1401-06-01 | 1
| 0
4
| 1 | 30 | 500 | 1401-06-01 | 12
| 5
5
| 1 | 20 | 100 | 1401-05-01 | 1
| 0
1
| 2 | 50 | 400 | 1401-06-01 | 1
| 0
2
| 2 | 40 | 100 | 1401-01-01 | 1100 | 2400
1
| 3 | 40 | 200 | 1401-04-01 | 1
| 0
(8 rows)
OVER(PARTITION
BY cn)
FROM sale
ORDER BY cn;
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools
7
Window Partition Example
The example on the slide uses the row_number window function. This function returns a row
number for each unique row in the result set.
In the first example, the OVER clause does not have a PARTITION BY. The entire result set is
treated as one window partition.
In the second example, the window is partitioned by the customer number. Note that the
result of row number is calculated within each window partition.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
7
RANK and ORDER BY
The ORDER BY clause:
• Can always be used by window functions
• Is required by some window functions such as RANK
• Specifies ordering within a window partition
The RANK built-in function:
• Calculates the rank of a row
• Gives rows with equal values for the specified criteria the same
rank
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools
8
The ORDER BY clause is used to order the resulting data set based on an expression or
column. It is always allowed in windows functions and is required by some window functions,
including RANK. The ORDER BY clause specifies ordering within a window partition.
The RANK function is a built-in function that calculates the rank of a row in an ordered group of
values. Rows with equal values for the ranking criteria receive the same rank. The number of
tied rows are added to the rank number to calculate the next rank value. In this case, ranks may
not be consecutive numbers.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
8
Using the OVER (ORDER BY…) Clause
SELECT vn, sum(prc*qty)
FROM sale
GROUP BY vn
ORDER BY 2 DESC;
vn | sum
—-+——–40 | 2640002
30 | 180
50 | 0
20 | 0
10 | 0
(5 rows)
SELECT vn, sum(prc*qty), rank()
vn | sum
| rank
—-+———+—–40 | 2640002 | 1
30 | 180
| 2
50 | 0
| 3
20 | 0
| 3
10 | 0
| 3
(5 rows)
OVER (ORDER BY sum(prc*qty)
DESC)
FROM sale
GROUP BY vn
ORDER BY 2 DESC;
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools
9
Using the OVER (ORDER BY…) Clause
The slide shows an example of two queries that rank vendors by sales totals.
The first query shows a window function grouped on the vendor column, vn.
The second query uses the RANK function to output a ranking number for each row. Note that
the PARTITION BY clause is not used in this query. The entire result is one window partition.
Also, do not confuse ORDER BY of a window specification with the ORDER BY of a query.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
9
Designating a Sliding (Moving) Window
A moving window:
• Defines a set or rows in a window partition
• Allows you to define the first row and last row
• Uses the current row as the reference point
• Can be expressed in rows with the ROWS clause
• Can be expressed as a range with the RANGE clause
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 10
A moving or rolling window defines a set of rows within a window partition. When you
define a window frame, the window function is computed with respect to the contents
of this moving frame, rather than against the fixed content of the entire window
partition. Window frames can be row-based, represented by the ROWS clause, or value
based, represented by a RANGE.
When the window frame is row-based, you define the number of rows offset from the
current row. If the window frame is range-based, you define the bounds of the window
frame in terms of data values offset from the value in the current row.
If you specify only a starting row for the window, the current row is used as the last row
in the window.
The window frame can be defined as:
• UNBOUNDED or expression PRECEDING – This clause defines the first row of the
window using the current row as a reference point. The starting row is expressed
in terms of the number of rows preceding the current row. If you define a ROWS
window frame as 5 PRECEDING, the window frame starts at the fifth row
preceding the current row. If the definition is for a RANGE window frame, the
window starts with the first row whose ordering column value precedes that of
the current row by 5. If the term UNBOUNDED is used, the first row of the
partition acts as the first row of the window.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
10
Window Framing Example
A rolling window moves through
a partition of data, one row at a
time.
SELECT vn, dt, prc * qty
ma = AVG(prc*qty)
OVER (PARTITION BY vn
ORDER BY dt
ROWS BETWEEN
2 PRECEDING AND
2 FOLLOWING)
FROM sale;
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 12
Window Framing Example
While window framing clauses require an ORDER BY clause, not all window functions allow
framing.
The ROWS and RANGE clauses specify a positional or logical rolling window that moves through
a window partition of data.
In the example shown on the slide, the rolling frame applies to its partition, in this case,
vendor, and ordering within that partition, date.
The example shows positional framing using the ROWS BETWEEN clause where the result is
interpreted with respect to the CURRENT ROW position in the partition. The focus of the
window frame moves from row to row within its partition only.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
12
Window Framing Example (Continued)
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 13
Window Framing Example (Continued)
The focus of the frame moves from the first selectable row in the window partition using the
criteria, 2 preceding and 2 following, for the rolling window.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
13
General Syntax of Window Function
A moving window:
• Is defined as part of a window with the ORDER BY clause as
follows:
WINDOW window_name AS (window_specification)
where window_specification can be:
[window_name]
[PARTITION BY expression [, …]]
[ORDER BY expression [ASC | DESC | USING operator] [, …]
[{RANGE | ROWS}
{ UNBOUNDED PRECEDING
| expression PRECEDING
| CURRENT ROW
| BETWEEN window_frame_bound AND window_frame_bound }]]
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 14
Syntax of a moving window is specified here.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
14
Built-In Window Functions
Built-In Function
Description
dist()
Calculates the cumulative distribution of a value in a group of
values. Rows with equal values always evaluate to the same
cumulative distribution value.
dense_rank()
Computes the rank of a row in an ordered group of rows
without skipping rank values. Rows with equal values are given
the same rank value.
first_value(expr)
Returns the first value in an ordered set of values.
lag(expr [,offset]
[,default])
Provides access to more than one row of the same table
without doing a self join. Given a series of rows returned from a
query and a position of the cursor, LAG provides access to a row
at a given physical offset prior to that position. If offset is not
specified, the default offset is 1. default sets the value that is
returned if the offset goes beyond the scope of the window. If
default is not specified, the default value is null.
A fly and a flea in a flue
Were imprisoned, so what could they do
Said the fly, let us flee. Let us fly said the flee
So they flew through a flaw in the flue.
A canner exceedingly canny
One morning remarked to his granny
A canner can can anything that he can
But a canner can’t can a can can he?
Note: Any aggregate function used with the OVER clause
can also be used as a window function.
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 15
Built-In Window Functions
The slide shows built-in window functions supported within Greenplum. These built-in
functions require an OVER clause.
For more detailed information on the functions, refer to the Greenplum Database
Administrator Guide.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
15
Built-In Window Functions (Continued)
Built-In Function
Description
last_value(expr)
Returns the last value in an ordered set of values.
lead()
Provides access to more than one row of the same table
without doing a self join. Given a series of rows returned from a
query and a position of the cursor, LEAD provides access to a
row at a given physical offset after that position. If offset is not
specified, the default offset is 1. default sets the value that is
returned if the offset goes beyond the scope of the window. If
default is not specified, the default value is null.
ntile(expr)
Divides an ordered dataset into a number of buckets (as defined
by expr) and assigns a bucket number to each row.
percent_rank()
Calculates the rank of a hypothetical row R minus 1, divided by
1 less than the number of rows being evaluated (within a
window partition).
row_number()
Assigns a unique number to each row to which it is applied
(either each row in a window partition or each row of the
query).
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 16
Built-In Window Functions (Continued)
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
16
Check Your Knowledge
• Describe how this code will work:
Your Thoughts?
SELECT dt, region, revenue,
count(*) OVER (twdw) AS moving_count,
avg(revenue) OVER (twdw) AS moving_average
FROM moving_average_data mad
WINDOW twdw AS (PARTITION BY region
ORDER BY dt RANGE BETWEEN
‘7 days’::interval PRECEDING AND
‘0 days’::interval FOLLOWING)
ORDER BY region, dt
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 17
Let us understand the code above. Describe below how this code will work:
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
17
User Defined Functions and Aggregates
Greenplum supports several function types, including:
• Query language functions where the functions are written in SQL
• Procedural language functions where the functions are written in:
PL/pgSQL
PL/TcL
Perl
Python
R
• Internal functions
• C-language functions
• Use Case examples:
Second largest element in a column?
Online auction: Who is the second highest bidder?
Module 5: Advanced Analytics – Technology and Tools 18
Copyright © 2014 EMC Corporation. All Rights Reserved.
Greenplum supports a variety of methods for developing functions, including:
• Query language support for functions developed in SQL.
• Procedural language support for functions written in languages such as PL/PGSQL,
which is a subset of PL/SQL, PL/TcL, Perl, Python, and R, a programming.
• Language for statistical computing and graphics.
• Internal functions.
• C-language functions.
The Data Scientist may need to create a function that could be used in the downstream
analysis. Some use case examples are shown in the slide.
Note: Greenplum supports PL/pgSQL, PL/Perl, and PL/Python out of the box. Other languages
can be added with the createlang utility.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
18
Anatomy of a User-Defined Function
• Example:
CREATE FUNCTION times2(INT)
RETURNS INT
Start function body
AS $$
SELECT 2 * $1
Function body
$$ LANGUAGE sql
End function body
SELECT times2(1);
times2
——-2
(1 row)
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 19
Here we present a simple function that you can create.
Whenever you pass in a parameter, you can identify it as:
•
A base or primitive type, such as integer, char, or varchar
In this example, when you call this function you pass the parameter INT. The value is multiplied by 2
and returned as numeric. $1 indicates the first parameter.
Creating, Modifying, and Dropping Functions
Functions that operate on tables must be created in the same schema. If you modify a table, you must
have access to a schema. You:
•
Create a function with the CREATE FUNCTION command. You must have CREATE access to
the schema to create a function. A function can be created with or without parameters.
•
Replace an existing function with the CREATE OR REPLACE FUNCTION command. This
command either creates a function if one did not exist before, or replaces an existing function.
If you are replacing an existing function, you must specify the same number of parameters and
the same data types found in the original function. If not, you are actually creating a new
function.
•
Change a function with the ALTER FUNCTION command. You must own the function before
you can modify it. If the function is to be created in another schema, you must have CREATE
privilege on that schema.
•
Drop or remove a function with the DROP FUNCTION command. Because you can have
multiple functions with the same name but different number of parameters and/or parameter
types, you must include the appropriate number of parameters and parameter types as part of
the command. You must also be the owner of the function to remove the function from the
schema.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
19
User-Defined Aggregates
• Perform a single table scan
• Example: Second largest number
Keep a state: maximum 2 numbers
New number can displace the smaller one in the state
Greenplum extension: Merge two states
• Example :Create a sum of cubes aggregate:
CREATE FUNCTION scube_accum(numeric, numeric) RETURNS numeric
AS ‘select $1 + $2 * $2 * $2’
LANGUAGE SQL
IMMUTABLE
RETURNS NULL ON NULL INPUT;
CREATE AGGREGATE scube(numeric) (
SFUNC = scube_accum,
STYPE = numeric,
INITCOND = 0 );
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 20
User defined aggregates performs a single table scan and it keeps state. A state is a maximum of two
numbers.
In the example we create a user defined aggregate that returns a maximum of two numbers. We will
learn more about it in the Lab.
CREATE AGGREGATE defines a new aggregate function. Some basic and commonly-used aggregate
functions such as count, min, max, sum, avg and so on are already provided in the Greenplum Database.
If one defines new types or needs an aggregate function not already provided, then CREATE
AGGREGATE can be used to provide the desired features.
An aggregate function is made from one, two or three ordinary functions (all of which must be
IMMUTABLE functions): a state transition function sfunc, an optional preliminary segment-level
calculation function prefunc, and an optional final calculation function ffunc. These are used as follows:
sfunc( internal-state, next-data-values ) —> next-internal-state
prefunc( internal-state, internal-state ) —> next-internal-state
ffunc( internal-state ) —> aggregate-value
In the example shown above we only have the sfunc.
To test this aggregate you can try the following code:
CREATE TABLE x(a INT);
INSERT INTO x VALUES (1),(2),(3);
SELECT scube(a) FROM x;
Correct answer for reference:
SELECT sum(a*a*a) FROM x;
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
20
Ordered Aggregates
• Output of aggregates may depend on order
Example:
SELECT array_agg(letter) FROM alphabet
SQL does not guarantee a particular order
Output could be {a,b,c} or {b,c,d} or … depending on query optimizer,
distribution of data, …
• Sample Use Case:
Maximum value of discrete derivative? For example:
Largest single-day stock increase during last year?
• Greenplum 4.1 introduces ordered aggregates:
SELECT array_agg(column ORDER BY expression [ASC|DESC]) FROM table
• Median can be implemented using an ordered call of array_agg()
This will be covered in the Lab
Module 5: Advanced Analytics – Technology and Tools 21
Copyright © 2014 EMC Corporation. All Rights Reserved.
Support has been added for ordered aggregate functions in Greenplum, providing a method for
controlling the order in which values are fed to an aggregate function.
In a Greenplum Database, only aggregate functions defined as ORDERED can be called with an ORDER
BY clause. This can be followed by other arguments to specify a system-defined ordering.
The three built-in ordered aggregates and optional ORDER BY clauses that have been implemented in
4.1, are shown in the following table :
Aggregate Function
Description
array_agg(any element)
Concatenates any element into an array.
Example: SELECT array_agg(anyelement ORDER
BY anyelement) FROM table;
string_agg(text)
Concatenates text into a string. Example:
SELECT string_agg(text ORDER BY text) FROM table;
string_agg(text, delimiter)
Concatenates text into a string delimited by
delimiter. Example: SELECT string_agg(text, ‘,’
ORDER BY text) FROM table;
The columns in an ORDER BY clause are not necessarily the same as the aggregated column, as shown
in the following statement that references a table named product with columns store_id,
product_name, and quantity.
SELECT store_id, array_agg(product_name ORDER BY quantity desc) FROM product GROUP BY store_id;
Note: There can only be one aggregated column. Multiple columns can be specified in the ORDER BY
clause.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
21
MADlib: Definition
• MAD stands for:
• lib stands for library of:
• advanced (mathematical, statistical, machine learning)
• parallel & scalable
• in-database functions
• Mission: to foster widespread development of scalable analytic
skills, by harnessing efforts from commercial practice,
academic research, and open-source development.
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 22
MADlib was first reported at VLDB 2009 in which MAD Skills: New Analysis Practices for Big
Data was presented
• http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf
MADlib: Getting started
1) Visit http://MADlib.net
2) Download the latest release:
3) Click the MADlib Wiki link and…
4) Follow the installation guide for PostgreSQL or Greenplum
We have installed MADlib for you in the lab environment and you used MADlib functions in
Module 4 labs.
Many functions are implemented as User defined functions and User defined aggregates and
some examples are as follows:
•
•
•
•
•
•
•
•
testdb=# \da madlib.mreg*
List of aggregate functions
Schema | Name
|
Argument data types
|
———–+———————+————————————————–+—-madlib | mregr_coef
| double precision, double precision[] |
madlib | mregr_pvalues | double precision, double precision[] |
madlib | mregr_r2
| double precision, double precision[] |
madlib | mregr_tstats | double precision, double precision[] |
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
22
MADlib: Getting Help…
• Check out the user guide with examples at: http://doc.madlib.net
• Need more help?
Try: http://groups.google.com/group/madlib-user-forum
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 23
The slide provides additional details on obtaining help with MADlib functions.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
23
Greenplum In-database Analytical Functions
Descriptive Statistics
Modeling
Quantile
Association Rule Mining
Profile
K-Means Clustering
CountMin (Cormode-Muthukrishnan) Sketchbased Estimator
Naïve Bayes Classification
FM (Flajolet-Martin) Sketch-based Estimator
Linear Regression
MFV (Most Frequent Values) Sketch-based
Estimator
Logistic Regression
Frequency
Support Vector Machines
Histogram
SVD Matrix Factorization
Bar Chart
Decision Trees/CART
Box Plot Chart
Neural Networks
Correlation Matrix
Parallel Latent Dirichlet Allocation
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 24
Listed are the in-database analytic functions available natively in Greenplum and as Madlib
functions (MADlib functions in bold). This list keeps expanding with every update and as the
user community contributes to the MADlib.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
24
Advanced Analytics – Technology and Tools
Summary
During this lesson the following advanced functions were covered:
• Window functions
• User defined functions and aggregates
• Ordered Aggregates
• MADlib
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 25
Advanced SQL and MADlib functions were covered in this lesson.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
25
Advanced Analytics – Technology and Tools
Summary
The Key Points covered in this module were:
• MapReduce , Hadoop and Hadoop ecosystems
• In-database analytics with advanced SQL functions and MADlib
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools
26
These are the key topics covered in this module.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
26
Advanced Analytics – Technology and Tools
Copyright © 2014 EMC Corporation. All Rights Reserved.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
1
Module 5: Advanced Analytics – Technology and Tools
1
Advanced Analytics – Technology and Tools
In-database Analytics SQL essentials
During this lesson the following topics are covered:
•
SQL Essentials
•
•
•
•
•
SET Operations
Online analytical processing (OLAP) features
GROUPING SETS, ROLLUP,CUBE
GROUPING, GROUP_ID functions
Text processing, Pattern matching
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools
2
These topics are covered in this lesson.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
2
Set Operations
Greenplum supports the following set operations as part of a
SELECT statement:
• INTERSECT – Returns rows that appear in all answer sets
• EXCEPT – Returns rows from the first answer set and excludes
those from the second
• UNION ALL – Returns a combination of rows from multiple
SELECT statements with repeating rows
• UNION – Returns a combination of rows from multiple SELECT
statements with no repeating rows
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools
3
Set Operations
Set operators:
• Manipulate the results sets of two or more queries by combining the results of
individual queries into a single results set.
• Do not perform row level filtering.
Set operations supported by Greenplum are:
• INTERSECT which returns rows that appear in all answer sets generated by individual
SELECT statements.
• EXCEPT returns all rows from the first SELECT except for those which also selected
by the second SELECT. This operation is the same as the MINUS operation.
• UNION ALL combines all the results of two or more SELECT statements. There may
be repeating rows.
• UNION combines the results of two or more SELECT statements. There will be no
repeating rows.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
3
Set Operations – INTERSECT
INTERSECT:
• Returns only the rows that appear in both SQL queries
• Removes duplicate rows SELECT t.transid,
c.custname
FROM
facts.transaction t
JOIN dimensions.customer c
ON c.customerid = t.customerid
Intersect
Table A
Table B
Copyright © 2014 EMC Corporation. All Rights Reserved.
INTERSECT
SELECT t1.transid,
c1.custname
FROM
facts.transaction t1
JOIN dimensions.customer c1
ON c1.customerid = t1.customerid
WHERE t1.transdate BETWEEN
‘2008-01-01’ AND ‘2008-01-21’
Module 5: Advanced Analytics – Technology and Tools
4
Set Operations – INTERSECT
A set operation takes the results of two queries and returns only the results that appear in both
result sets. Duplicate rows are removed from the final set returned.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
4
Set Operations – EXCEPT
EXCEPT:
• Returns all rows from the first SELECT statement
• Omits all rows that
SELECT
t.transid,
appear in the second
c.custname
FROM
facts.transaction
SELECT statement
t
JOIN dimensions.customer c
ON c.customerid = t.customerid
Table A
Table B
Results of
first query
Results of
second query
Table A minus Table B
Copyright © 2014 EMC Corporation. All Rights Reserved.
EXCEPT
SELECT t1.transid,
c1.custname
FROM
facts.transaction t1
JOIN dimensions.customer c1
ON c1.customerid = t1.customerid
WHERE t1.transdate BETWEEN
‘2008-01-01’ AND ‘2008-01-21’
Module 5: Advanced Analytics – Technology and Tools
5
Set Operations – EXCEPT
The EXCEPT set operation takes the distinct rows of the first query and returns all of the rows
that do not appear in the result set of the second query.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
5
Set Operations – UNION ALL
UNION ALL:
• Combines rows from
the first query with rows
from the second query
• Does not remove
duplicates rows
Table A
Table B
Results of
first query
Results of
second query
Table A plus Table B
Copyright © 2014 EMC Corporation. All Rights Reserved.
SELECT t.transid,
c.custname
FROM
facts.transaction t
JOIN dimensions.customer c
ON c.customerid = t.customerid
WHERE t.transdate BETWEEN
‘2008-01-01’ AND ‘2008-05-17’
UNION ALL
SELECT t1.transid,
c1.custname
FROM
facts.transaction t1
JOIN dimensions.customer c1
ON c1.customerid = t1.customerid
WHERE t1.transdate BETWEEN
‘2008-01-01’ AND ‘2008-01-21’
Module 5: Advanced Analytics – Technology and Tools
6
Set Operations – UNION ALL
The UNION ALL set operation is like the UNION operation but it does not remove duplicate or
repeating rows.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
6
Set Operations – UNION
UNION:
• Combines rows from
the first query with rows
from the second query
• Removes duplicates or
repeating rows
Table A
Table B
Results of
first query
Results of
second query
Duplicate
Table A plus Table B minus duplicates
Copyright © 2014 EMC Corporation. All Rights Reserved.
SELECT t.transid,
c.custname
FROM
facts.transaction t
JOIN dimensions.customer c
ON c.customerid = t.customerid
WHERE t.transdate BETWEEN
‘2008-01-01’ AND ‘2008-05-17’
UNION
SELECT t1.transid,
c1.custname
FROM
facts.transaction t1
JOIN dimensions.customer c1
ON c1.customerid = t1.customerid
WHERE t1.transdate BETWEEN
‘2008-01-01’ AND ‘2008-01-21’
Module 5: Advanced Analytics – Technology and Tools
7
Set Operations – UNION
A union operation combines the results of the SELECT statement from the first table with the
results from the query on the second table. The result set does not contain any repeating rows.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
7
SET Operations
• Types of Join
• Inner
• Left outer
• Right outer
• Full outer
• Cross
Inner Join
Left Outer Join
Full Outer Join
Right Outer Join
Cross Join
Module 5: Advanced Analytics – Technology and Tools
Copyright © 2014 EMC Corporation. All Rights Reserved.
8
The type of SET operations you can perform are:
• Inner join – The inner join is possibly the most common type of join. The resulting data
set is obtained by combining two tables on a common column. Each row of the left
table is compared against each row of the right table. All matching rows are returned as
part of the result set. An equijoin is an inner join that uses only equality comparisons in
the join predicate.
• Left outer join – Left outer join returns all of the rows from the left table even if there is
no matching row in the right table. It also returns matching rows from the right table.
Rows in the right table that do not match are not included as part of the result set.
• Right outer join – Right outer join returns all of the rows from the right table even if
there is no matching row in the left table. It also returns the matching rows from the
left table.
• Full outer join – Full outer join returns all rows from both tables where there is a match
and returns NULL for rows that do not have a match.
• Cross join – Cross join returns the Cartesian product of rows from tables in the join. The
resulting data set consists of a combination of each row in the left table with each row
in the right table. Two tables, each with five rows, will produce a resulting data set that
contains twenty-five rows.
.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
8
Left Outer Join
• Correlated sub-queries do not run efficiently in Greenplum
though support has been introduced in Version 4.2
SELECT * FROM transaction t
WHERE NOT EXISTS (
SELECT 1 FROM customer c
WHERE c.customerid = t.customerid)
• Use LEFT OUTER JOIN
SELECT t.* FROM transaction t
LEFT OUTER JOIN
customer c ON t.customerid=c.customerid
WHERE c.customerid IS NULL
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools
9
Correlated sub queries are supported in a Greenplum environment 4.2 onwards. The first
example shown in the slide is a correlated sub query in which we have a nested correlated sub
query using the first variable from the first SELECT statement (on table “transaction” ) used for
selection with table “Customer” in the query nested with a WHERE clause. In the context of
MPP architecture of Greenplum supporting correlated sub queries are not efficient.
The code example shown at the bottom accomplishes the same with a LEFT OUTER JOIN. It is
recommended that the multidimensional queries that are traditionally done with sub queries
and correlated sub queries be optimally coded with the proper use of the MPP architecture of
Greenplum.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
9
Sub-query vs. Inner Join
• IN clause is fully supported …
SELECT *
FROM transaction t
WHERE t.customerid IN
(SELECT customerid FROM customer)
• However, generally better idea as long as c.customerid is unique:
SELECT t.*
FROM transaction t
INNER JOIN customer c
ON c.customerid = t.customerid
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 10
Sub queries and nested queries are commonly used for multi-dimensional queries in which we
use IN clause with another SELECT statement. Sub-queries such as the one shown above are
supported in a Greenplum environment.
They are supported and valid but it is generally a good idea to use a INNER JOIN to accomplish
the same result. The performance query with INNER JOIN is far superior to the nested query in
the first example.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
10
Greenplum SQL OLAP Grouping Extensions
Greenplum supports the following grouping extensions:
• Standard GROUP BY
• ROLLUP
• GROUPING SETS
• CUBE
• grouping(column [, …]) function
• group_id() function
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 11
Greenplum SQL OLAP Grouping Extensions
Greenplum introduced support for extensions to the standard GROUP BY clause, which is fully
supported. These clauses can simplify the expression of complex groupings:
• ROLLUP – This extension provides hierarchical grouping.
• CUBE – Complete cross-tabular grouping, or all possible grouping combinations, is
provided with this extension.
• GROUPING SETS – Generalized grouping is provided with the GROUPING SETS
clause.
• grouping function – This clause helps identify super-aggregated rows from regular
grouped rows.
• group_id function – This clause is used to identify duplicate rows in grouped output.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
11
Standard GROUP BY Example
GROUP BY:
• Group results based on one or more specified columns
• Is used with aggregate statements
The following example summarizes
pn | vn | sum
product sales by vendor:
—–+—-+——SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY pn, vn
ORDER BY 1,2,3;
Copyright © 2014 EMC Corporation. All Rights Reserved.
100
100
200
200
300
400
500
600
700
800
(10
| 20 |
| 40 |
| 10 |
| 40 |
| 30 |
| 50 |
| 30 |
| 30 |
| 40 |
| 40 |
rows)
0
2640000
0
0
0
0
120
60
1
1
Module 5: Advanced Analytics – Technology and Tools 12
Standard GROUP BY Example
The standard GROUP BY clause groups results based on one or more columns specified. It is
used in conjunction with aggregate statements, such as SUM, MIN, or MAX. This helps to make
the resulting data set more readable.
The slide shows an example of a standard GROUP BY clause used to summarize product sales
by vendor.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
12
Standard GROUP BY Example with UNION ALL
This example extends the previous
example by adding
sub-totals and a grand total :
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY pn, vn
UNION ALL
SELECT pn, null, sum(prc*qty)
FROM sale
GROUP BY pn
UNION ALL
SELECT null, null,
sum(prc*qty)
FROM SALE
ORDER BY 1,2,3;
Copyright © 2014 EMC Corporation. All Rights Reserved.
pn | vn | sum
—–+—-+——–100 | 20 | 0
100 | 40 | 2640000
100 |
| 2640000
200 | 10 | 0
200 | 40 | 0
200 |
| 0
300 | 30 | 0
300 |
| 0
400 | 50 | 0
400 |
| 0
500 | 30 | 120
500 |
| 120
600 | 30 | 60
600 |
| 60
700 | 40 | 1
700 |
| 1
800 | 40 | 1
800 |
| 1
|
| 2640182
(19 rows)
Module 5: Advanced Analytics – Technology and Tools 13
Standard GROUP BY Example with UNION ALL
In this follow-on example, the requirements for the query have been extended to include subtotals and a grand total. You would need to use a UNION ALL to continue the grouping and
provide for the additional requirements.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
13
ROLLUP Example
The following example meets the
requirement where the sub-total
and grand totals are to be included:
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY ROLLUP(pn, vn)
ORDER BY 1,2,3;
Copyright © 2014 EMC Corporation. All Rights Reserved.
pn | vn | sum
—–+—-+——–100 | 20 | 0
100 | 40 | 2640000
100 |
| 2640000
200 | 10 | 0
200 | 40 | 0
200 |
| 0
300 | 30 | 0
300 |
| 0
400 | 50 | 0
400 |
| 0
500 | 30 | 120
500 |
| 120
600 | 30 | 60
600 |
| 60
700 | 40 | 1
700 |
| 1
800 | 40 | 1
800 |
| 1
|
| 2640182
(19 rows)
Module 5: Advanced Analytics – Technology and Tools 14
ROLLUP Example
This slide meets the requirements provided in the previous slide, but uses the ROLLUP
grouping extension. ROLLUP allows you to perform hierarchical grouping and helps to reduce
the code.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
14
GROUPING SETS Example
The following example shows how
to achieve the same results with the
GROUPING SETS clause:
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY GROUPING SETS
( (pn, vn), (pn), () )
ORDER BY 1,2,3;
Subtotals for
each product
Summarize sales
by product & vendor
Copyright © 2014 EMC Corporation. All Rights Reserved.
Grand total
pn | vn | sum
—–+—-+——–100 | 20 | 0
100 | 40 | 2640000
100 |
| 2640000
200 | 10 | 0
200 | 40 | 0
200 |
| 0
300 | 30 | 0
300 |
| 0
400 | 50 | 0
400 |
| 0
500 | 30 | 120
500 |
| 120
600 | 30 | 60
600 |
| 60
700 | 40 | 1
700 |
| 1
800 | 40 | 1
800 |
| 1
|
| 2640182
(19 rows)
Module 5: Advanced Analytics – Technology and Tools 15
GROUPING SETS Example
The GROUPING SETS extension allows you to specify grouping sets. If you use the
GROUPING SETS clause to meet the earlier requirements so that it produced the same output
as ROLLUP, it would use the following groups:
• (pn,vn) – This grouping summarizes product sales by vendor.
• (pn) – This grouping provides subtotal sales for each product.
• () – This grouping provides the grand total for all sales for all vendors and products.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
15
CUBE Example
CUBE creates subtotals for all possible
combinations of grouping columns.
The following example
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY CUBE(pn, vn)
ORDER BY 1,2,3;
is the same as
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY GROUPING SETS
( (pn, vn), (pn),
(vn), () )
ORDER BY 1,2,3;
Copyright © 2014 EMC Corporation. All Rights Reserved.
pn | vn | sum
—–+—-+——–100 | 20 | 0
100 | 40 | 2640000
100 |
| 2640000
200 | 10 | 0
200 | 40 | 0
200 |
| 0
300 | 30 | 0
300 |
| 0
400 | 50 | 0
400 |
| 0
500 | 30 | 120
500 |
| 120
600 | 30 | 60
600 |
| 60
700 | 40 | 1
700 |
| 1
800 | 40 | 1
800 |
| 1
| 10 | 0
| 20 | 0
| 30 | 180
| 40 | 2640002
| 50 | 0
|
| 2640182
(24 rows)
Module 5: Advanced Analytics – Technology and Tools 16
CUBE Example
A CUBE grouping creates subtotals for all of the possible combinations of the given list of
grouping columns, or expressions.
In terms of multidimensional analysis, CUBE generates all the subtotals that could be calculated
for a data cube with the specified dimensions.
In the example shown on the slide, the additional grouping set of (vn) – subtotaling the sales
by vendor, is included as part of the cube.
Note that n elements of a CUBE translate to 2n grouping sets. Consider using CUBE in any
situation requiring cross-tabular reports. CUBE is typically most suitable in queries that use
columns from multiple dimensions rather than columns representing different levels of a
single dimension. For instance, a commonly requested cross-tabulation might need subtotals
for all the combinations of month, state, and product.
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
16
GROUPING Function Example
Grouping distinguishes NULL from summary markers.
store | customer | product | price
SELECT * FROM dsales_null;
——-+———-+———+——s2
| c1
| p1
| 90
s2
| c1
| p2
| 50
s2
|
| p1
| 44
store | customer | product | sum | grouping
s1
| c2
| p2
| 70
——-+———-+———+—-+———s1
| c3
| p1
| 40
s1
| c2
| p2
| 70 | 0
(5 rows)
s1
| c2
|
| 70 | 0
s1
| c3
| p1
| 40 | 0
s1
| c3
|
| 40 | 0
SELECT
s1
|
|
| 110 | 1
store,customer,product,
s2
| c1
| p1
| 90 | 0
sum(price),
s2
| c1
| p2
| 50 | 0
s2
| c1
|
| 140 | 0
grouping(customer)
s2
|
| p1
| 44 | 0
FROM dsales_null
s2
|
|
| 44 | 0
GROUP BY
s2
|
|
| 184 | 1
ROLLUP(store,customer,
|
|
| 294 | 1
(12 rows)
product);
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 5: Advanced Analytics – Technology and Tools 17
GROUPING Function Example
When you use grouping extensions to calculate summary rows, such as sub-totals and grand
totals, the output can become confusing if the data in a grouping column contains NULL
values. It is hard to tell if a row is supposed to be a subtotal row or a regular row containing a
NULL.
In the example shown on the slide, one of the rows shown where the customer field is NULL.
Without the grouping id, you could misinterpret the sum of 44 as a subtotal row for store 2.
The GROUPING function returns a result for each output row, where:
• 1 represents a summary row
• 0 represents grouped rows
Copyright © 2014 EMC Corporation. All rights reserved.
Module 5: Advanced Analytics – Technology and Tools
17
GROUP_ID Function
GROUP_ID:
• Returns 0 for each output row in a unique grouping set
• Assigns a serial number >0 to each duplicate grouping set found
• Can be used to filter output rows of duplicate grouping sets, such
as in the following example:
SELECT a, b, c, sum(p*q), group_id()
FROM sales
GROUP BY ROLLUP(a,b), CUBE(b,c)
HAVING group_id()
Purchase answer to see full
attachment