• I just read an interesting article over on Java Lobby by Dennis Forbes titled Out of Bounds : Avoiding Career Protection Faults.

    The article really rang some bells in my mind. I didn’t really identify with the concerns about job and career security — I guess that I have been fortunate to work with development groups that are fairly self-confident, and managed by people who can see past the little bits of political posturing that does take place. No, what caught my attention was more related to the behaviour of the new generation of programmers when it comes to anything but the newest code and techniques.

    I guess I need to be a bit more concrete. A very good friend of mine, who will remain nameless to protect the innocent, was involved in a web application (yeah, I know, another one). They were using Hibernate to provide an “object” view of the database.

    Now, before everyone jumps all over me, let me state quite clearly that I like Hibernate. I like it a lot. This is not a criticism of Hibernate. Are we all clear on that? OK, moving right along…

    Recently, they needed to implement a report. Nothing fancy, just your usual run-of-the-mill business report, pulling a few thousand line items and doing some basic totalling and so on. It came back from the dev team, with a note that it now works, so the story is complete, but probably needs to be optimised.

    Boy does it need to be optimised. It takes over 4 hours to produce the report.

    Now, these are not dumb developers — nothing could be further from the truth. They are very bright, and they write first-class code. But they are young and idealistic. I’m going to sound like my father for saying this, but the reality is that they haven’t had enough real-world production experience to make the right decisions on gut instinct. Therefore, in the absence of that, they always apply the pearls of wisdom gathered from the latest “best practices” guru or article without filtering it through some basic sanity and appropriateness checks.

    Let’s get back to the report in question as a case in point. There are two reasons that the report was running so slowly. The first is a general reluctance to use the database for what it does so well: querying data. Instead, they use Hibernate to get object graphs. And while Hibernate is pretty good at doing sensible things with lazy loading, there is a limit to how well it can optimise this sort of thing, and it seems, more often than not in these sorts of scenarios, that it is pulling more data out of the database than it needs to. I think that creating reports, or lists of items to display in a list view, is generally better done using a simple SQL query that returns just an ID with the set of columns actually required for displaying the list or report. Hibernate even has a mechanism for doing essentially this, using report queries and HQL, so I am prepared to accept that as a viable alternative, but I would prefer SQL because, quite frankly, it is better understood.

    Hibernate is actually pretty well documented, especially for an open-source project. Indeed, it is pretty well documented, period. Many commercial packages could learn a thing or two from Hibernate’s documentation. But compared to the huge body of knowledge that exists about SQL, and the decades of real-world experience, frankly, it just doesn’t cut it. Not unless you happen to have a Hibernate guru on hand.

    So, I think that using tools such as Hibernate, and refusing to entertain using raw SQL under any circumstances, is part of the problem, but only a very small part. I doubt you could make Hibernate take 4 hours to do the report even if you were trying very hard. No, that is not where I think the problem lies.

    We can get closer to the problem by stepping back a little. The reason that they can’t just run a query to get the report data is that the data is not stored, but calculated on demand. The system stores transactional data, and all balances are calculated as required using a bunch of (often nested) calculators that walk the database. So if I want a balance, I walk all the transactions that affect the balance (and there are many types of transactions in several tables with various dependencies) and do a calculation, often quite complex.

    From a coding perspective, it’s a one-line call to a calculator method to get the data I want, so it looks quite elegant, and the calculators are tested separately and I know that they are correct.

    The calculators do some basic filtering of the transactions that they process, but they still need to read a lot of them, and sometimes they end up using some lazy fetching of related data. It’s just the way the data mapping is done, because there are conflicting requirements for different functions in the system. The end result is that doing a calculation is quite expensive, which is not a problem when you are doing just a single calculation.

    But what if I want to get month-end balances for the past year? Simple, just loop through the required dates, pass them to the calculator, and get back the balance on that date. It’s really easy for the calculator to just stop processing transactions when it gets to a particular date.

    And if I need DAILY balances? Do you see where this is going?

    Inside what looks to the developer like a simple loop, there is a huge amount of database activity going on, and done in such a way that the database has no opportunity to do anything but act as a dumb file store. That is what is bringing the database (and the system) to its knees when this report is run.
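    To make the cost of that loop concrete, here is a minimal sketch in Ruby. It is not the system described above: an in-memory array stands in for the database, and Txn, balance_on, balances_by_loop and balances_in_one_pass are hypothetical names invented for the illustration. The first method does what the report did (one full calculator run per date); the second walks the transactions once and carries a running total.

```ruby
require "date"

# Hypothetical transaction record: a posting date and a signed amount.
Txn = Struct.new(:date, :amount)

# Made-up sample data standing in for the transaction tables.
TXNS = [
  Txn.new(Date.new(2024, 1, 1), 100),
  Txn.new(Date.new(2024, 1, 2), -30),
  Txn.new(Date.new(2024, 1, 4), 50)
].freeze

DATES = (Date.new(2024, 1, 1)..Date.new(2024, 1, 5)).to_a.freeze

# Stand-in for a calculator: rescans the transactions on every call.
def balance_on(date, txns)
  txns.select { |t| t.date <= date }.sum(&:amount)
end

# The innocent-looking loop: one full calculator run per report date.
def balances_by_loop(dates, txns)
  dates.map { |d| [d, balance_on(d, txns)] }.to_h
end

# Alternative: walk the sorted transactions once, carrying a running total.
def balances_in_one_pass(dates, txns)
  sorted = txns.sort_by(&:date)
  index = 0
  running = 0
  result = {}
  dates.sort.each do |d|
    while index < sorted.size && sorted[index].date <= d
      running += sorted[index].amount
      index += 1
    end
    result[d] = running
  end
  result
end

# Both approaches agree on the figures; only the amount of work differs.
puts balances_by_loop(DATES, TXNS) == balances_in_one_pass(DATES, TXNS)
```

    The loop version does work proportional to the number of dates times the number of transactions, while the single pass touches each transaction once. With real calculators that hit the database on every call, that difference is what turns a daily-balance report into hours of waiting.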

    Now, at this time, gentle reader, you are probably thinking to yourself: why was the database structured that way in the first place? Well, do you remember in school, when you were told not to store something in the database if it can be calculated? Well, that’s what blindly following that advice ultimately leads to.

    Now, I am not advocating that we forget about things like normalisation and removal of redundancy. But all general rules need a context. Just because something can be calculated does not mean that it should never, under any circumstances, be stored.

    Take this calculated balance as a case in point. Calculating it is expensive, so storing it should at least be considered, and implemented if the time taken to recalculate it makes it impractical to use recalculation whenever it is needed. Even more importantly, some data items, even if they can be calculated, have a distinct “point-in-time” value and therefore need to be stored regardless of the time it takes to recalculate them.

    Let me explain that, because it is important.

    Let’s say that a calculation involves a set of input values, transactions, exchange rates, interest rates, taxation rates… you get the idea. It’s more than just the initial and transactional data. Calculating a value means you need to get, for each transaction, the applicable rates that were in effect when that transaction occurred, so you need to keep historical data for each of them. OK, that much is obvious, and there is a clear requirement to store that history if you are going to recalculate as required.

    What most people forget, however, is that there is another component to the calculation that can change over time — the code of the calculation itself. If there is a different way to calculate sales tax, for example, or a new type of tax is introduced, or simply a business rule changes, then the calculation logic itself will change and it will generate different results for the same input data.

    So if you want to recalculate an invoice amount for a past date, you better have a copy of the correct version of the calculator code around too, and you better have an infrastructure in the code to handle identifying and instantiating the correct version of the calculator. And it’s even more complex than that, because if the output of a given calculation feeds into the next period’s data, then you need to identify, instantiate and use the correct version of the calculation logic for each period as you are looping forward.

    You’d better be able to handle this correctly, because if you don’t, your client-facing staff will see one figure on the screen while the customer will have a different figure on the invoice. Trust me when I say that this is not what you want.
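    As a rough illustration of what “identifying and instantiating the correct version of the calculator” might involve, here is a small Ruby sketch. Everything in it is an assumption made for the example: the TAX_CALCULATORS registry, the dates, the rates, and the calculator_for and tax_on methods. Amounts are in integer cents to keep the arithmetic exact.

```ruby
require "date"

# Hypothetical registry: each entry pairs an effective-from date with the
# calculation logic that applied from that date onward.
TAX_CALCULATORS = [
  [Date.new(2020, 1, 1), ->(net_cents) { net_cents * 10 / 100 }], # assumed original rule
  [Date.new(2023, 7, 1), ->(net_cents) { net_cents * 15 / 100 }]  # assumed later rule change
].freeze

# Pick the latest version whose effective date is on or before the given date.
def calculator_for(date)
  entry = TAX_CALCULATORS.select { |from, _| from <= date }.max_by(&:first)
  raise ArgumentError, "no calculator in effect on #{date}" unless entry
  entry.last
end

# Recalculate a figure using the logic that was in force on that date.
def tax_on(net_cents, date)
  calculator_for(date).call(net_cents)
end

puts tax_on(20_000, Date.new(2022, 5, 1))
puts tax_on(20_000, Date.new(2024, 1, 1))
```

    Even this toy version needs a registry, a lookup rule and a policy for dates that no version covers. Storing the figure at the time it was produced avoids all of that for anything that has already been shown to a customer.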

    Whenever a figure has a meaning at a point in time, it is a data point all by itself. In my mind, it is a no-brainer: it needs to be stored. In a product order line item, for example, you don’t store the extended price, because you can always calculate it as unit price times number of units.

    But you do store the unit price and description, even if you could always look them up from the products table. Why? Because they can change, and the line item is a point-in-time data value. In this case, we are shielding the point-in-time data value from changes in the data inputs. We are not concerned about the extended price, because the point-in-time data value’s data inputs (ie the unit price and number of units) are captured at that point in time, and we do not foresee (and will not support) any changes to the trivial calculation logic. If the calculation logic were ever to change, then I would see no real alternative but to store the extended price in the line item.
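    The line-item case can be sketched in a few lines of Ruby. Product and LineItem here are hypothetical structures invented for the example, with prices in integer cents. The line item copies the description and unit price when it is created, and derives the extended price from its own stored values.

```ruby
# Hypothetical catalogue entry; its price can change at any time.
Product = Struct.new(:description, :unit_price_cents)

# The line item captures description and unit price at the point in time
# the order is placed, instead of looking them up later.
LineItem = Struct.new(:description, :unit_price_cents, :quantity) do
  def self.from_product(product, quantity)
    new(product.description, product.unit_price_cents, quantity)
  end

  # Not stored: trivially derived from the captured inputs.
  def extended_price_cents
    unit_price_cents * quantity
  end
end

widget = Product.new("Widget", 250)
item = LineItem.from_product(widget, 4)

widget.unit_price_cents = 300 # a later price change in the catalogue

puts item.unit_price_cents     # still the price at order time
puts item.extended_price_cents # derived from the captured values
```

    The later catalogue change does not touch the line item, which is the whole point of treating it as a point-in-time value.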

    Could someone have pointed this out earlier? Yes. Did anyone do so? Yes. Did anyone listen? Unfortunately, no.

    You see, that was “old world” thinking. We have a calculator! Why would we want to store the data value when we can always calculate it?

    This post is long enough. If anyone ever reads it, I fully expect to be flamed. Just before you hit that comment button, however, please take the time to understand the point I am trying to make. I am not trying to discount modern best practices, or the tools that are currently in vogue. I actually pride myself as being pretty good at picking up new ideas and technology, and leading rather than following in their adoption. I am simply pointing out that, in my opinion, we need to apply a real-world filter over all the stuff that we are constantly being bombarded with, and realise that there are no absolutes — no rule can be applied blindly without some thought.

    And sometimes, just sometimes, we older folks might know a thing or two that can be useful.

  • I’m going to get off the topic of the Apple for today — not that nothing has happened, but because in reading over the blog I sound like some Mac fanatic. Today, Chris, a good friend of mine, showed me his new HP laptop. Huge, 17″ monster, very powerful, but battery life of about an hour, and he couldn’t get it set up to access the network. Sigh!

    But I said, no Apple today.

    Over the past couple of weeks, I have been working on a particular Java application, and I needed to extract a whole bunch of data into flat files for a particular client requirement. Cutting a long story short, I ended up writing a set of scripts to generate an XML specification of an extract that is going to be used to control the total extract process, and this gave me a chance to try my hand at Ruby.

    Now, I have heard a lot of good things about Ruby, but had not really used it before. Everyone I knew, who I respected as a programmer, and who had tried Ruby, raved about it. So, even though I knew Python, I made a point of nutting my way round Ruby.

    Obviously, it took me a little while to get moving — there is always a bit to learn when starting with a new language. But I bought a PDF copy of the Pickaxe book and zoomed through the highlights. I have to say, I like Ruby a LOT.

    Ruby is OO to the core. Everything is an object, and it has a remarkably convenient set of built-in functionality. I am not going to put together a tutorial on Ruby, at least not here, but here are a few examples to whet your appetite.

    In Java, to define a class with a set of accessor methods, you do something like this:

    public class Dog
    {
        private String name;

        public Dog(String name)
        {
            super();
            this.name = name;
        }

        public String getName()
        {
            return name;
        }

        public void setName(String value)
        {
            name = value;
        }
    }

    Here’s the same thing in Ruby:

    class Dog
      attr_accessor :name

      def initialize(name)
        @name = name
      end
    end

    Creating an instance in Java:

    Dog dog = new Dog("Rover");

    and in Ruby:

    dog = Dog.new("Rover")

    so the classes are pretty much equivalent, except that the Ruby one is (a) much shorter and (b) eliminates the need to write a whole lot of plumbing, no-brain code. Now I know that any modern IDE generates this boilerplate code for you, but it is still there and needs to be navigated and mentally discounted while you work on the stuff that DOES matter. In Ruby, the only code you write is what you need for the application (well, most of the time anyway :)
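    For completeness, here is how the generated accessors are used. The single attr_accessor line gives the class both a reader (name) and a writer (name=), which play the role of getName and setName in the Java version. The class is repeated so the snippet runs on its own.

```ruby
class Dog
  attr_accessor :name # generates both name and name=

  def initialize(name)
    @name = name
  end
end

dog = Dog.new("Rover")
puts dog.name    # reader, the equivalent of getName()
dog.name = "Rex" # writer, the equivalent of setName("Rex")
puts dog.name
```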

    Here’s a really cool thing you can do in Ruby. When you call a function, as well as passing a number of arguments to it, you can also, optionally, attach a code block to it. A code block is delimited by either the keywords do and end, or braces (they do the same job). Inside the called function, the code can determine whether a code block has been attached to it and, if so, call that block any number of times. Here is an example:

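    # Note: the name send shadows Ruby's built-in Object#send; fine for a sketch.
    def send(message)
      if block_given?
        yield "connecting"
      end
      connect(...)        # arguments elided
      if block_given?
        yield "sending"
      end
      transmit(message)   # placeholder for the real transport call
      if block_given?
        yield "sent"
      end
      disconnect(...)     # arguments elided
      if block_given?
        yield "done"
      end
    end

    This is a dummy, skeletal procedure. We assume that it sends a message somewhere, and there are several steps: connecting, sending and disconnecting. The connect, transmit and disconnect calls are placeholders for the real work.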

    If you call it like this:

    send("Hello world")

    it just does its thing. But you can optionally attach a code block like this:

    send("Hello world") {|stage| puts "... now #{stage}" }

    Let’s look at this line. The braces define a code block — the convention seems to be that short blocks like this use braces, while long, multi-line blocks use do/end. The two vertical bars delineate a parameter list; here, the parameter is called “stage”. The single line inside the code block uses puts to display a string. I’ll get to the string in a moment, but for now just accept that this results in the following printout:

    ... now connecting
    ... now sending
    ... now sent
    ... now done

    The string that is displayed is delimited by double-quote characters, which means that the string is processed by Ruby. One of the effects of this is that the #{x} construct embedded in the string is replaced with the value of the expression x (here, simply a variable); this works everywhere, not just in these attached code blocks.
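    A quick sketch of the interpolation rules, using a hypothetical progress method: double-quoted strings evaluate whatever sits inside #{...}, and single-quoted strings leave it alone.

```ruby
# Hypothetical helper that builds the progress line used above.
def progress(stage)
  "... now #{stage}"
end

count = 3

puts progress("sending")
puts "#{count} of #{count + 1} steps" # any expression can be interpolated
puts '... now #{stage}'               # single quotes: no interpolation
```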

    This mechanism is used to implement a really simple, generic and pervasive iterator-like mechanism. For example, to allow arrays to be iterated, the Array built-in class implements a method “each” which, you guessed it, takes a code block. So, to iterate over an array, you use this sort of code:

    my_array.each {|element| puts element }

    The beauty of this is that any object can exhibit this behaviour — just implement an “each” method that expects a code block, and “yield” once for each element your object contains. There is no need to be in any other way related to an array.
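    Here is a minimal sketch of that idea. Kennel is a hypothetical class with no relationship to Array; it just implements each and yields once per element. Including Ruby’s standard Enumerable module is optional, but it builds methods such as map and select on top of each for free.

```ruby
# Hypothetical container, unrelated to Array, that supports iteration.
class Kennel
  include Enumerable # optional: adds map, select, etc. on top of each

  def initialize(*names)
    @names = names
  end

  # Yield once for each element the object contains.
  def each
    return enum_for(:each) unless block_given?

    @names.each { |name| yield name }
  end
end

kennel = Kennel.new("Rover", "Fido")
kennel.each { |name| puts name }
puts kennel.map(&:upcase).inspect
```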

    Which leads me to the topic of Duck Typing. This is the Ruby philosophy about object typing. While Ruby does implement a single-inheritance object hierarchy model, you can actually use unrelated objects polymorphically as long as they implement a common subset of methods. The idea is that if it walks like a duck, and looks like a duck, and quacks like a duck, then it can be treated like a duck. Yes, this is NOT as bullet-proof as a statically typed language like Java, but in reality I don’t actually end up assigning a Debit object instance to an Animal object reference very often, and if I do, I will rely on my tests to pick that up. In return, I save myself a lot of unnecessary casting and fiddling in perfectly good code just to tell the compiler what I already know.
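    A small sketch of duck typing in action. Duck, Robot and chorus are hypothetical names; the two classes share no ancestor beyond Object, yet both can be passed to the same method because both respond to speak.

```ruby
# Two unrelated classes that happen to implement the same method.
class Duck
  def speak
    "Quack"
  end
end

class Robot
  def speak
    "Beep"
  end
end

# Works with anything that responds to speak; no common base class needed.
def chorus(things)
  things.map(&:speak)
end

puts chorus([Duck.new, Robot.new]).inspect
```

    The trade-off is that an object without a speak method is only caught at run time, with a NoMethodError, which is where the tests come in.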

    Ruby also has mix-ins, called modules. A module is a bit like an interface and a bit like an abstract class. Like an interface, a module aggregates a set of methods; these are included by classes that want to, regardless of their position in the object inheritance hierarchy. But unlike interfaces, modules have code in them too: implemented methods. These methods become part of any class that includes the module, and have access to that class’s own methods, exactly as if the code had been copied and pasted into that class. Also like interfaces, a single class can mix in, or include, any number of modules.

    Like an abstract class, it implements some code, and through access to non-coded variables and methods, can set up an expectation on the classes that include it, but unlike an abstract class, the class that includes it does NOT need to descend from it (indeed, it can’t do so, because modules are not classes per se).
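    Here is a minimal sketch of a mix-in. Describable, Dog and Ship are hypothetical names for the example. The module implements describe and expects the including class to supply a name method, which is the kind of expectation an abstract class would set up, except that neither class descends from the module.

```ruby
# Hypothetical mix-in: implements describe, and relies on the including
# class to provide a name method.
module Describable
  def describe
    "I am #{name}"
  end
end

class Dog
  include Describable
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

# Unrelated to Dog in the inheritance hierarchy, but mixes in the same code.
class Ship
  include Describable
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

puts Dog.new("Rover").describe
puts Ship.new("Beagle").describe
```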

    Very powerful indeed, and I don’t claim to fully appreciate all the implications of how these can be used, but just intuitively it seems to be really useful. And just plain cool.

    Anyway, that’s more than enough for one post. Tomorrow is going to be a busy day. Toodles.
