Pre-requisites:
This blog is aimed at an Oracle RDBMS user with a beginner to intermediate level understanding of SQL and PL/SQL languages, and basic Oracle Concepts. If you are an analyst weaned on Excel or other GUI workbenches with some basic programming ability, but know next to nothing about databases, read this friendly intro (ignore the author's controversial opinions), kick the tires on a free edition of Oracle, and return here.
Database:
This blog is based on Oracle10gR2 and higher. Oracle gives away an awful lot of analytics features as part of its free (XE/Express) or low-cost (Standard) editions - but if a highlighted feature is available only in a particular higher-end edition, we will mention it.
Datasets:
We will use the data in the Sample Schema that ships with the Oracle database for most of our examples. Enable SELECT privileges on all the tables in the various sample schemas. When this data is inadequate to explain a concept, we will import data from other public sources.
Loading data:
For small datasets outside the sample schema, we will use INSERT statements. For larger tables, we'll use SQL*Loader and/or External Tables. Here is a quick primer on data load (Note: the extract_files_dir directory object should be created by the DBA using the CREATE DIRECTORY statement). For more details, consult Oracle docs.
DevEnv:
The baseline is Xemacs- SQL*Plus - Oracle10gR2 Enterprise Edition - RHEL on a 2-processor Core duo Dell COTS (common off-the-shelf) machine. Over time, we will try the free SQL*Developer tool for all development. Software downloads of the database and related products are available here.
Output, Visualization of results:
One major weakness of Oracle compared to best-of-breed statistical/scientific products is the charting and visualization capability in its disparate GUI products. Oracle has several of them - SQL*Developer, Application Express, Discoverer, Warehouse Builder, Data Miner, Enterprise Manager, and a whole slew of tools from the Siebel BI (a k a BI EE) and Hyperion acquisitions. These tools address specific areas of data or application management. As the #1 database company, it would be fantastic if Oracle makes a strategic investment to build modular, data visualization solutions across its product stack - integrating the best elements from it existing portfolio of GUI-based products, and third party tools like Spotfire, Tableau, or Dundas (now part of MSFT). But Visualization for BI is still in its nascent stages, so this may not happen soon. For purposes of this blog, we will embed screenshots from Excel, SQL*Dev, App Express, Data Miner, R and other Oracle GUIs as applicable. We will use Dia for diagramming.
Code Quality:
I will test every code fragment I publish here, but I am not the best SQL or PL/SQL developer around - so I welcome comments/corrections on better-expressed and/or more efficient queries for a particular problem. Consult AskTom or search for Celko's articles for SQL tips, and Steve Feuerstein's articles/books for PL/SQL coding tips.
Dev Essentials
Preface
Greetings.
This blog is a personal quest to learn/discover/enable the Oracle RDBMS as an analytical/scientific computing engine. While Oracle does not (yet) match the breadth and depth of a popular, best-of-breed platform like SAS for analytics, or a scientific computing suite like Matlab, it offers enough facilities in Oracle SQL and PL/SQL to build a competent business analytics platform, if not yet a comprehensive platform for scientific research.
The answer to the central question of "I am an Oracle database professional ...Why Should I Know Analytics?" is simply "To keep yourself relevant ten years from now". A more lucid, lengthy answer can be found in the books in the Readings section. After a 10-15 year journey through database design, E-R diagrams, Certifications on every database technology, managing terabyte warehouses, you are bound to come to a juncture in your professional life where you ask "We store at least 7-10 years worth of transactional data (as required by Sarbanes-Oxley).. is there anything I can learn about my customers/products/employees implicitly from all this stored data - beyond what I code into my applications?" At this point in time, you begin to become an analyst.
The central question of "Why In-Database Analytics?" deserves a dedicated post rather than a glib answer in this column. But the basic premise is this: The wider and deeper you deploy intelligence across a company, the smarter the company. Any data-driven intelligence that you compute at the source - i.e. where your enterprise data resides - becomes corporate intelligence consumable across the enterprise. The higher you go in your stack to compute this intelligence, the more insular this intelligence becomes to the rest of the company - SOA and other mid-tier technologies notwithstanding. So the proposition is simple - 'If you are an Oracle DBA/Developer/Analyst/CIO, explore the capabilities of the database as an analytical platform before you think about spending on tiered, best-of-breed products. Save that money for the best business analysis and data analysis talent that you can find'.
Over the years, customers/consultants have wanted to incorporate their own algorithms into the database. Oracle has been extensible since 8i, offering a framework to link your C-based Math/Analytics libraries into the database. But unlike Illustra (the competitive Object-relational database of the 8i era), Oracle (wisely) places a user application in its own process sandbox - it does not allow a user to link their "untrusted" C library directly into the Oracle binary - mainly to prevent some buggy/malicious C fragment from crashing a million dollar production system.
The consequence is that your application has to marshal the data between your library and the database through an ExtProc interface. Obviously, this critical data pinch-point has not helped spawn a large collection of extensible third-party applications. However, internal to Oracle, the framework has been a great success, enabling an evolutionary development of products that manage unstructured data - Oracle's Spatial Option, Text, XML, Enterprise Search, PL/SQL Table functions and such features directly or indirectly use this framework or its design concepts.
There is a secondary motivation to this blog. 47% of the world's corporate relational data is stored in Oracle. If even a fraction of this installed base begins to demand more foundational analytics infrastructure such as high performance matrix computation, concurrent processing, math libraries, and such building blocks for scientific computing - this may nudge Oracle towards becoming a world-class analytics engine.
Returning to more humbler pragmatics for the blog, rather than take the classic extensibility route, we'll code techniques using Oracle10gR2 (and later) SQL and PL/SQL. PL/SQL as a language has dramatically matured over the past 5 years, with support for native compilation, IEEE floating point arithmetic, function/rowset caching, and improved scalability and performance. Where expedient/convenient, we will package any open-source Java libraries as Java stored procedures.
If you have read thus far, and have the slightest inkling of the breadth and depth of quantitative techniques out there, you will agree that the mission statement for this blog is downright audacious, or Quixotic, or stupid, or pick your adjective...
Wish me luck, and hope you find this site interesting and useful.
Best regards, Ram