Databases are at the core of virtually all modern information
technology systems. Sometimes these databases are exposed, as in data
warehouse systems. The centrepiece of such a system
is a powerful industrial database such as Oracle, or
Sybase, or Microsoft SQL Server, or something similar. The presence
of a database is pretty obvious in that case – only these powerful
databases can crunch the immense amounts of data processed by data
warehouses. In other cases databases are hidden. For example, Apple
iTunes uses a SQLite database to store the details of music tracks on
your computer. That database is not obvious; it does not advertise
itself, but it’s there. It makes sure that the ratings you assigned
to songs are saved and can be synchronised between all your iPhones
and iPads. It counts the number of times every song was played so
that you don’t listen to the same song twice when you put your iPhone
on shuffle. Databases are everywhere. They all serve the same purpose
– to store data and make its retrieval as easy and fast as possible
– but they are also vastly different from each other.
Let’s take another example. When you log into your Google account
and bring up your Gmail inbox, all the emails you see are actually
stored in a remote database. That database is called BigTable, and
it contains not only your emails, but all the emails of all Gmail
users in the world, and also virtually all of Google’s data. While
your iTunes SQLite database may be about 50 megabytes in size (and
that’s assuming you have A LOT of songs), Google’s BigTable contains
petabytes of data. That’s tens of millions of times the size of your iTunes database.
If you think about it, it becomes obvious that these databases
require vastly different approaches to the way the data are stored
and retrieved: You can fit an iTunes database into memory and query
it whichever way you like without a performance penalty. At the same
time, no machine has been built yet that could apply the same
approach to Google’s BigTable.
Unfortunately, not all software developers understand that.
Databases were once an inspiring topic, but in recent years they have
gone out of fashion. Software developers are geeks; they like new toys;
they all want to work on the latest, greatest, cutting-edge thing.
So many exciting things are happening in the area
of Information Technology – Web 2.0, HTML5 and Apple iOS to name
just a few – that databases pale in comparison, despite the
fact that they make all these new shiny things tick. Most developers
these days treat a database as just generic data storage:
“We’ll just stuff the data in and we don’t care what’s inside.”
Ten years ago, SQL was a necessary skill for database
application developers. Nowadays the majority of programmers don’t
know SQL; they rely on frameworks such as Hibernate to produce SQL
statements for them. They think that all databases are the same and
that, if necessary, they can take a system that uses MS SQL
Server as a backend, move it onto Oracle, and it will work just
fine.
Well, that may be true for very simple applications, but the myth that
all databases are the same falls apart as soon as it comes down
to performance. Today cloud computing is a buzzword, and Google is
the patriarch of the cloud. Google’s servers process billions of
requests every day, crunching petabytes of data. Yet every request
made to the Google search engine is served within seconds. This places
such high demands on Google’s database layer that Google’s
engineers couldn’t afford to use even the most powerful of industrial
databases; they had to develop their own – the aforementioned
BigTable. If you told these guys that “all databases are the same”,
they would laugh in your face, and rightfully so, because Google
knows that performance matters, and they squeeze every bit of
performance out of their systems.
Databases matter, and if you consider yourself a decent software
developer, you need to learn how to tame them. Learn the differences
between them. Learn what they are, what makes them tick and, most
importantly, how to make them tick faster, because writing
applications that are slow is just bad taste.
As I have already shown you, implicit type conversion is one of the most dangerous features of Oracle SQL and PL/SQL. It is dangerous because it happens automatically, without your knowledge, and may lead to unpredictable results. These problems are most common with DATE conversions, but are not limited to them.
For example, let’s have a look at this Stackoverflow.com question by
James Collins. James had a problem: the following query was slow:
SELECT a1.*
FROM people a1
WHERE a1.ID LIKE '119%'
AND ROWNUM < 5
Although column A1.ID was indexed, the index wasn’t used, and
the explain plan looked like this:
SELECT STATEMENT ALL_ROWS  Cost: 67  Bytes: 2,592  Cardinality: 4
  2 COUNT STOPKEY
    1 TABLE ACCESS FULL TABLE people  Cost: 67  Bytes: 3,240  Cardinality: 5
James was wondering why.
Well, the key to the issue lies, as it often does with Oracle, in
an implicit data type conversion. Because Oracle is capable of performing
automatic data conversions in certain cases, it sometimes does so
without you knowing. And as a result, performance may suffer, or code
may behave not exactly as you expect.
In our case that happened because the ID column was a NUMBER.
You see, the LIKE pattern-matching condition expects to see character
types as both its left-hand and right-hand operands. When it encounters a
NUMBER, it implicitly converts it to a VARCHAR2. Hence, that query was
basically silently rewritten to this:
SELECT a1.*
FROM people a1
WHERE To_Char(a1.ID) LIKE '119%'
AND ROWNUM < 5
That was bad for two reasons:
- The conversion was executed for every row, which was slow;
- Because of the (albeit implicit) function in the WHERE predicate,
Oracle was unable to use the index on the A1.ID column.
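By the way, you don’t have to guess when Oracle does this: the predicate section of the execution plan shows the conversion explicitly. Here is a minimal way to see it for yourself, assuming the PEOPLE table from the question:

EXPLAIN PLAN FOR
SELECT a1.*
FROM people a1
WHERE a1.ID LIKE '119%'
AND ROWNUM < 5;

-- The "Predicate Information" section of the output exposes the implicit
-- conversion, e.g.: filter(TO_CHAR("A1"."ID") LIKE '119%')
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);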
If you come across a problem like that, there are a number of
ways to resolve it. Some of the possible options are:
- Create a function-based index on the A1.ID column:
CREATE INDEX people_idx5 ON people (To_char(ID));
- If you need to match records on the first 3 characters of the ID
column, create another column of type NUMBER containing just these 3
characters and use a plain = operator on it.
- Create a separate column ID_CHAR of type VARCHAR2 and fill it with
TO_CHAR(id). Then index it and use it instead of ID in your WHERE
condition.
- Or, as David Aldridge pointed out: “It might also be possible to
rewrite the predicate as ID BETWEEN 1190000 AND 1199999, if the values
are all of the same order of magnitude. Or if they’re not, then
ID = 119 OR ID BETWEEN 1190 AND 1199, etc.”
Of course, if you choose to create an additional column based on the
existing ID column, you need to keep the two synchronized. You can do
that in batch as a single UPDATE, or in an ON-UPDATE trigger, or by
adding that column to the appropriate INSERT and UPDATE statements in
your code.
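For illustration, here is a minimal sketch of the trigger-based approach; the column size, trigger name and index name are my assumptions, not part of the original question:

ALTER TABLE people ADD (id_char VARCHAR2(40));

-- One-off batch backfill of the new column
UPDATE people SET id_char = To_char(id);

-- Keep ID_CHAR synchronised whenever ID is inserted or updated
CREATE OR REPLACE TRIGGER people_id_char_trg
BEFORE INSERT OR UPDATE OF id ON people
FOR EACH ROW
BEGIN
  :NEW.id_char := To_char(:NEW.id);
END;
/

CREATE INDEX people_idx6 ON people (id_char);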
James chose to create a function-based index, and it worked like a charm.
SYSDATE is one of the most commonly used Oracle functions. Indeed,
whenever you need the current date or time, you just type SYSDATE and
you’re done. However, sometimes it’s not all that simple. There are a
few confusions associated with SYSDATE that are pretty common and, if
not understood, can cause a lot of damage.
First of all, SYSDATE returns not just the current date, but the date
and time combined. More precisely, the current date and time down to a
second. If just a date is needed, the TRUNC function has to be applied,
that is, TRUNC(SYSDATE).
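You can see the difference for yourself:

-- SYSDATE carries a time component; TRUNC(SYSDATE) resets it to midnight
SELECT To_char(SYSDATE, 'DD-MON-YYYY HH24:MI:SS')        AS now,
       To_char(Trunc(SYSDATE), 'DD-MON-YYYY HH24:MI:SS') AS today_midnight
FROM dual;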
For the sake of good database design, a date should not be confused
with a date/time. For example, if a column in a table is called
transaction_date, it would be natural for it to contain a date but not
a date/time; mixing the two can lead to major confusion. Let’s imagine
there is a table BANK_TRANSACTIONS containing the following fields:
txn_no INTEGER,
txn_amount NUMBER(14,2),
txn_date DATE
The last field is of the most interest to us. Apparently its data type
is DATE, but is it a date or a date/time? We can’t tell by just looking
at the table definition. Nonetheless, it is a very important thing to
know. A common case for using DATE columns is including them in
date-range queries. For example, if we wanted to get all the bank
transactions from 1 January 2009 to 31 July 2009, we could write this:
SELECT txn_no,
txn_amount
FROM bank_transactions
WHERE txn_date BETWEEN To_date('01-JAN-2009','DD-MON-YYYY')
AND To_date('31-JUL-2009','DD-MON-YYYY')
And that would be fine if TXN_DATE were a date column. But if it is
a date/time, we would just have missed a whole day’s worth of data. That
is because, as I said, the DATE data type can hold date/time down to a
second. That means that for 31 July 2009 it could hold values ranging
from 0:00am all the way to 11:59:59pm. But because TO_DATE('31-JUL-2009', 'DD-MON-YYYY')
is basically equivalent to TO_DATE('31-JUL-2009 00:00:00', 'DD-MON-YYYY HH24:MI:SS'),
all the transactions that happened after 0:00am on 31 July 2009 would be missed.
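To see the trap in action, consider a single hypothetical afternoon transaction:

-- A transaction at 2:30pm on the last day of the range
INSERT INTO bank_transactions (txn_no, txn_amount, txn_date)
VALUES (1, 100.00, To_date('31-JUL-2009 14:30:00', 'DD-MON-YYYY HH24:MI:SS'));

-- The naive range query misses it: its upper bound is 31-JUL-2009 00:00:00,
-- which is earlier than 14:30 on that day
SELECT txn_no, txn_amount
FROM bank_transactions
WHERE txn_date BETWEEN To_date('01-JAN-2009','DD-MON-YYYY')
                   AND To_date('31-JUL-2009','DD-MON-YYYY');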
That kind of mistake is pretty common. Sometimes it’s hard to tell by
just looking at the data whether a particular DATE column can have a
time portion. Even if all the values in there are rounded to 0:00 hours,
that doesn’t mean that a different time value can’t appear there in the
future. The data dictionary can’t help us here either – the DATE type is
always the same whether it contains time or not. (By the way, Oracle
recommends using the TIMESTAMP type for new
projects, but that is a whole different story.)
If you are working with an existing table and you are not sure, you can
use a fool-proof method like this:
SELECT txn_no,
txn_amount
FROM bank_transactions
WHERE txn_date BETWEEN To_date('01-JAN-2009','DD-MON-YYYY')
AND To_date('31-JUL-2009','DD-MON-YYYY') + 1 - 1/24/3600
+1 - 1/24/3600 here means “plus 1 day minus 1 second”. That is because
in DATE arithmetic “1” means 1 day and “1/24” means 1 hour; since there
are 3600 seconds in an hour, “1/24/3600” is exactly 1 second.
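You can verify the arithmetic in one line:

-- 1/24 of a day is an hour, 1/24/3600 of a day is a second
SELECT To_char(To_date('31-JUL-2009','DD-MON-YYYY') + 1 - 1/24/3600,
               'DD-MON-YYYY HH24:MI:SS') AS upper_bound
FROM dual;
-- returns 31-JUL-2009 23:59:59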
The above expression will retrieve all the transactions from “01 January
2009 0:00am” to “31 July 2009 0:00am plus 1 day minus 1 second”, i.e. to
“31 July 2009 11:59:59pm”.
If you are charged with designing an application and need to create a
table with a DATE column, it is worth keeping yourself and others from
future confusion with a simple trick: name columns that contain only a
date portion with DATE, and add TIME to the names of columns that you
know will contain time components. In our case it would be prudent to
call the date/time column TXN_DATE_TIME.
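Applied to our example, the table definition might look like this:

CREATE TABLE bank_transactions (
  txn_no        INTEGER,
  txn_amount    NUMBER(14,2),
  txn_date_time DATE  -- the TIME suffix warns readers of a time component
);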
The second issue I’d like to discuss is much more subtle, but can do
even more damage.
Imagine that you are charged with developing a report that returns all
the transactions for the previous month. It looks like a job for SYSDATE!
You fetch your trusty keyboard and after a few minutes of typing you
come up with something like this:
SELECT txn_no,
txn_amount
FROM bank_transactions
WHERE txn_date BETWEEN Last_day(Add_months(Trunc(SYSDATE),-2)) + 1
AND Last_day(Add_months(Trunc(SYSDATE),-1))
You create a few rows in the BANK_TRANSACTIONS table, run a few unit
tests to make sure your code works, and check the code into source control. Job
done! You congratulate yourself on the productive work and spend the
rest of the day reading your friends' blogs and dreaming about your next
vacation. And the next day you move on to another task and get as busy
as ever.
After some time, which may be a few days or months, depending on the
pace of the project, the code you wrote gets migrated into the UAT
environment. And a task force of a few testers and end users is assigned
to test the report you wrote. And as often happens in UAT, they are
going to test it on real data extracted from the production
system – that is, last year’s data.
Got it? Last year’s.
The final stages of testing, such as UAT, have to prove that the system
does what it is expected to do in conditions that resemble
production as closely as possible. And the best way to do that is to
test it on retrospective production data – data that is proven.
That makes it possible to compare the outcome to the actual production
system, and thus prove or disprove that the new system works.
That sounds reasonable. But one of the implications for you is that the
BANK_TRANSACTIONS table is not going to contain the previous month’s
transactions. Hence, your report will come up blank. You can’t rewind
time, because you hard-coded SYSDATE, which has only one meaning: “right
now”. Test failed.
If you had known that when you wrote the report, you wouldn’t have used
SYSDATE. You would have used a parameter, something like v_run_date, which
you could set to whatever date you wanted. And that would have done the
trick. Well, now you know.
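For instance, a parameterised sketch of the same report might look like this (v_run_date is a hypothetical bind variable: production jobs would pass SYSDATE, while UAT could pass any date inside the period being tested):

SELECT txn_no,
       txn_amount
FROM bank_transactions
WHERE txn_date BETWEEN Last_day(Add_months(Trunc(:v_run_date),-2)) + 1
                   AND Last_day(Add_months(Trunc(:v_run_date),-1))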
You only need a single look at the Sydney Opera House to recognise that
it is a work of art. Any masterpiece is like that – you don’t need a
thorough examination of Mona Lisa’s smile to realise its beauty – you
see it instantly. Perfection needs no explanation; it works on a
subconscious level.
The same applies to software engineering. Great code always
looks good. It is always carefully formatted, indented and commented. Just
by looking at it you can tell that it is a work of art. Such code will
always work, do what it is supposed to do, and have very few bugs,
because whoever wrote that code cared a lot about it. And you can safely
assume that if someone has put a lot of effort into making the code
look good, they have put at least as much effort into designing and
debugging it.
What is even more important, carefully crafted code is easier to
maintain. In modern software projects any single procedure gets tweaked
and rewritten tens of times. If you are a programmer, chances are that
even in the project you are working on right now, you have inherited code
that was written years ago, perhaps by people who left the company long
ago. And when you are finished with it, that will not be the end of the
story – the code will be passed to QA and finally to production support.
And then the cycle will start again. Hence, whatever you program, it’s not
just about you. You don’t know how many people will be looking at your
code trying to make sense of it. And you can help them immensely by
making it easy to read and understand now. Whether you will be
remembered as a good programmer or cursed depends on it.
So, you have made the effort to write code that works. Now take one
extra step – make it beautiful.
This is a follow-up to the previous post.
As it turned out, implicit date conversions may also prevent Oracle from
performing partition pruning.
For example, if you have a table INVOICES with a range partition on the
INVOICE_DATE field, then this expression:
SELECT
...
WHERE invoice_date >= '01-MAR-09'
AND invoice_date < '02-MAR-09'
will not perform partition pruning, whereas
SELECT
...
WHERE invoice_date >= TO_DATE('01/03/2009', 'DD/MM/YYYY')
AND invoice_date < TO_DATE('02/03/2009', 'DD/MM/YYYY')
will.
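For reference, here is a hypothetical DDL for such an INVOICES table; the partition names and boundaries are illustrative, not from the original post:

CREATE TABLE invoices (
  invoice_no   INTEGER,
  invoice_date DATE
)
PARTITION BY RANGE (invoice_date) (
  PARTITION p2009_02 VALUES LESS THAN (To_date('01/03/2009','DD/MM/YYYY')),
  PARTITION p2009_03 VALUES LESS THAN (To_date('01/04/2009','DD/MM/YYYY')),
  PARTITION p_max    VALUES LESS THAN (MAXVALUE)
);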
Because the efficiency of partition pruning is usually why partitioning
is used in the first place, the choice is obvious.
Personally, though, I’d use
SELECT
...
WHERE invoice_date BETWEEN TO_DATE( '01/03/2009', 'DD/MM/YYYY')
AND TO_DATE( '01/03/2009', 'DD/MM/YYYY') + 1 - 1/24/3600
since the BETWEEN operation is specifically tailored for such situations.
“1/24/3600” here represents 1 second, and the whole statement should be
read as “from 01 March 2009 0:00am to 01 March 2009 11:59:59pm”.