Archive for March, 2008

Identity in Software Design

Friday, March 28th, 2008

Identity

Over the years I have been designing and building software I have noticed one recurring set of problems that keeps cropping up, regardless of company, product domain or programming language. Software developers often have a naïve understanding of identity (myself included!), and this leads to all sorts of bugs, hacks and design compromises. You’d think something as fundamental as how to identify a Thing would have been settled by now! To make matters even worse, changing how you identify a Thing after you’ve already amassed a lot of data (Thing Instances) is typically very complicated and expensive.

The philosophy of identity has a long and rich history, so software developers are in good company when it comes to struggling with these issues. What I find particularly interesting is that many of the classic identity thought experiments are very concrete issues for software developers. You may lie awake at night and philosophize as to whether you are identical to your clone in a parallel universe. Programmers, however, frequently write code to clone and move objects between two systems separated by space and time. For example, every time you synchronize your iPod, a program, and by extension a programmer, applies some quite sophisticated identity management rules.

Who’s Asking?

One of the first things we notice when we start to identify a real-life object is that the attributes we use to identify the object typically depend on the identity of the person asking, or the overall client context. For example, if someone asks me, “Who are you?” I might answer “Daniel”, “Daniel Selman” or “Rule Studio for Java Team Lead”; I might identify myself by my relationship to other people, as is often done in the Bible; or I might use biometric data such as DNA or fingerprints. This is a performance optimization, as it is not feasible to list all our identifying characteristics to all clients! It is the client that imposes their identity requirements on us: if we give the client too little information they simply ask for more; if we supply too much, they probably ignore what they don’t need or cannot interpret.

Types of Identity

Gottfried Wilhelm Von Leibniz

Gottfried Leibniz famously stated that “x is the same as y if and only if every predicate true of x is true of y as well.” If you tug gently at this little philosophical thread you quickly become entangled in the fascinating and complex questions related to identity – many of which are still actively debated today.

There are two definitions of identity: numerical identity and qualitative identity.

Objects a and b are numerically identical if a and b are one and the same thing. It is the relation an object has with itself and nothing else – a circular definition, as “nothing else” means “no numerically non-identical thing”. For example, I will be numerically identical to myself for as long as I exist.

Objects a and b can be said to be qualitatively identical if a and b are duplicates, that is if a and b are exactly similar in all respects. This implies that things can be more or less qualitatively identical. Twins may be qualitatively equal even though they are numerically different.

I-Predicates

I-predicates are used to express qualitative identity relationships, taking into account the richness of a given theory or application context. For example, “having the same income as” will be an I-predicate in a theory in which persons with the same income are indistinguishable, but not in a richer theory. For example, within the “Selman Family Theory” I can safely use “has the same first name as” as an I-predicate to identify people. This I-predicate would be a foolish choice for the “ILOG Employees Theory” however!

Some philosophers contend that there is no absolute identity and that identity is always relative. This position is controversial and contested, however.

Criteria of Identity

Similar to I-predicates is the concept of criteria of identity. For example, the criterion of identity for directions is parallelism of lines. The criterion of identity for numbers is equinumerosity of concepts, that is, the number of F’s is identical with the number of G’s if and only if there are exactly as many F’s as G’s.

Identity over Time

Heraclitus, Johannes Moreelse

Identity over time is particularly controversial, because time involves change. For example, Heraclitus famously argued that one could not bathe in the same river twice – as the water continuously flowing through the river changes its identity.

Take a simple statement such as “Tabby was fat on Monday.” Endurance theorists state that persisting things endure and change through time but extend only through space, not through time; i.e. Things are different from events or processes. Perdurance theorists reject this and do not distinguish between Things and processes.

For an endurance theorist, if Tabby is fat on Monday, that is a relation between Tabby and Monday. Perdurance theorists would instead state that Tabby-on-Monday is intrinsically fat.

Personal Identity

It is very useful to also consider the questions applicable to personal identity when designing software systems. These questions are:

  • Who am I? What are the attributes that make me, me?
  • Personhood: What is required to be a person? What is the definition of person?
  • Persistence: What events can you survive? What brings your existence to an end?
  • Evidence: How do we find out who is who?
  • Population: What determines how many of us there are now?
  • What am I? What am I composed of?
  • How could I have been? Which of my properties do I have essentially, and which only accidentally? Could I have had different parents for example?

For example, take a business rule, copy it, rename it, update some of its properties and delete its history. Is it the same business rule as the original? If I now deploy the business rule from a development server to a cloned staging server how many business rules do I have? How about if I download the business rules from both the development and the staging servers into separate projects within an Eclipse workspace on my local computer? The point is that there can be fairly complex answers to some of these questions, particularly when you have multiple software systems interacting over space and time.

Metaphysical Questions

The metaphysical questions below are also very useful to consider as you design software systems:

  • What does it mean for an object to be the same as itself?
  • If x and y are identical (are the same thing), must they always be identical? Are they necessarily identical?
  • What does it mean for an object to be the same, if it changes over time? I.e. is x at time t the same as x at time t+1?
  • If an object’s parts are entirely replaced over time, in what way is it the same?
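The "same object after it changes" question has a very concrete counterpart in Java collections. The sketch below is only an illustration: VersionedRule is a hypothetical class invented here, and it assumes (unwisely) that a mutable field takes part in equals and hashCode. Once the rule changes, a HashSet that already holds it can no longer find it.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical class for illustration: identity depends on a mutable field.
final class VersionedRule {
    final String name;
    int version; // mutable, yet used by equals and hashCode below

    VersionedRule(String name, int version) {
        this.name = name;
        this.version = version;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (other == null || getClass() != other.getClass()) return false;
        VersionedRule that = (VersionedRule) other;
        return name.equals(that.name) && version == that.version;
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, version);
    }
}

final class IdentityOverTimeDemo {
    // Returns whether the set still recognises the rule after it has changed.
    static boolean setStillContainsRuleAfterChange() {
        Set<VersionedRule> rules = new HashSet<>();
        VersionedRule rule = new VersionedRule("discount", 1);
        rules.add(rule);
        rule.version = 2;            // x at time t+1
        return rules.contains(rule); // looked up with the new hash code
    }
}
```

The set still holds the very same (numerically identical) object, but the lookup uses the new hash code and misses it. Basing equals and hashCode on attributes that never change is one way to sidestep this.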

Best Practices?

Qualitative identity in Java is expressed using the equals method as well as the compare method of a Comparator. The equals method allows you to test for qualitative identity (which can include numerical identity), whereas the compare method is used to order a list of objects using a comparison predicate.
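As a minimal sketch of the distinction, the == operator on references tests numerical identity within a JVM, while equals tests whatever qualitative identity the class author chose. The Person class below is hypothetical and exists only for this illustration.

```java
import java.util.Objects;

// Hypothetical value class, invented for this illustration.
final class Person {
    final String firstName;
    final String lastName;

    Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    // Qualitative identity: same class and same attribute values.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true; // numerical identity implies qualitative identity
        if (other == null || getClass() != other.getClass()) return false;
        Person that = (Person) other;
        return firstName.equals(that.firstName) && lastName.equals(that.lastName);
    }

    // Must agree with equals so hash-based collections behave correctly.
    @Override
    public int hashCode() {
        return Objects.hash(firstName, lastName);
    }
}

final class EqualsDemo {
    public static void main(String[] args) {
        Person a = new Person("Daniel", "Selman");
        Person b = new Person("Daniel", "Selman");
        System.out.println(a == b);      // false: two distinct objects
        System.out.println(a.equals(b)); // true: qualitatively identical
    }
}
```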

Determine your I-predicates and in Java perhaps code them as Comparators. Does your domain model require several I-predicates? If yes, you will need something other than a single equals method. In one software system I designed we had a dedicated object comparison service that could compare different types of objects using different criteria, based on the client as well as the objects. For example, you might compare an Integer with a Float (with or without rounding), two Doubles (with precision), or two EJBObject instances. Note that most equals methods also test that the classes for the two instances are identical. The JVM determines that two classes are identical if they have the same fully qualified class name and were loaded using the same ClassLoader.
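To make the Comparator suggestion concrete, here is a rough sketch with two I-predicates over the same class. Employee, IPredicates and the identical helper are hypothetical names made up for this example; the only assumption is that a comparison result of zero means "indistinguishable in this theory".

```java
import java.util.Comparator;

// Hypothetical domain class for illustration.
final class Employee {
    final String firstName;
    final String lastName;

    Employee(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }
}

final class IPredicates {
    // "has the same first name as": adequate within a small family theory.
    static final Comparator<Employee> SAME_FIRST_NAME =
            Comparator.comparing((Employee e) -> e.firstName);

    // A richer theory: last name first, then first name.
    static final Comparator<Employee> SAME_FULL_NAME =
            Comparator.comparing((Employee e) -> e.lastName)
                      .thenComparing((Employee e) -> e.firstName);

    // Two objects are identical under a predicate when it cannot tell them apart.
    static boolean identical(Comparator<Employee> predicate, Employee a, Employee b) {
        return predicate.compare(a, b) == 0;
    }
}
```

A client then picks the predicate that matches its context: IPredicates.identical(IPredicates.SAME_FIRST_NAME, a, b) may be true for two employees where IPredicates.identical(IPredicates.SAME_FULL_NAME, a, b) is false.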

Think in terms of namespaces. Many identity schemes rely on namespaces; however, namespaces must be rooted and managed to prevent copying or cloning from corrupting the namespace. Internet domain names are a popular basis for namespaces precisely because they are globally managed and controlled. For example, Java, XML Schema and the Semantic Web all use variations of Internet domain name namespace identifiers.

Decide whether you need numerical identity. How will you determine numerical identity? Object references within a JVM? Generated statistically unique identifiers such as UUIDs? Automatically generated database row IDs? What about object serialization? Object cloning? Database replication?

For the reasons above, numerical identity is very difficult to apply in computer systems. It is often used as an optimization, however, where the scope of the optimization is well understood, such as within a single JVM/ClassLoader or within a single database table. It is usually hidden from end users because it is machine generated and has no inherent business sense. End users typically find opaque machine-generated identifiers difficult to work with, as they cannot understand why two artifacts that appear to be superficially qualitatively identical are numerically different. Even a UUID, which should be universally unique, is problematic because it is trivial to create exact clones of objects in computer systems, rendering the uniqueness property useless.
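How trivial is it to clone a supposedly unique identifier? The sketch below round-trips an object through standard Java serialization. StoredDocument and CloneDemo are hypothetical names used only here; the point is that the copy is a different object in the JVM but carries exactly the same UUID.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.UUID;

// Hypothetical document class for illustration.
final class StoredDocument implements Serializable {
    private static final long serialVersionUID = 1L;

    final UUID id;
    final String title;

    StoredDocument(UUID id, String title) {
        this.id = id;
        this.title = title;
    }
}

final class CloneDemo {
    // Serializes the document to bytes and reads it back as a new object.
    static StoredDocument copyViaSerialization(StoredDocument original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (StoredDocument) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Serialization round trip failed", e);
        }
    }
}
```

After StoredDocument copy = CloneDemo.copyViaSerialization(original), the expression copy == original is false while copy.id.equals(original.id) is true: two numerically distinct objects now share one "unique" identifier.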

A common problem scenario is deciding that you are going to put a UUID in a document and identify documents by UUID. The end user then copies the document on the file system and ends up with two artifacts that have the same UUID but different file names. When the documents are loaded into your system one of four things can happen:

  1. The documents are both stored but they are retrieved non-deterministically. Your user interface makes it impossible for the user to understand which document they are editing (very bad!)
  2. The documents are both stored but the first or last document loaded is always returned. Your user interface makes it impossible for the user to understand which document they are editing (bad!)
  3. An error is detected as a duplicate UUID was loaded and the end user must intervene to fix the document they did not realize was broken — because your identifier is opaque (bad!)
  4. The second document silently overwrites the first, typically because they are being stored in a Map using the UUID as a key (very bad!)
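Outcomes 3 and 4 in the list above differ by a single method call. This is a minimal sketch, assuming documents are tracked in an in-memory Map keyed by UUID; DocumentStore and its method names are hypothetical and not taken from any real product.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical store for illustration: maps a document UUID to the file it came from.
final class DocumentStore {
    private final Map<UUID, String> fileNameById = new HashMap<>();

    // Outcome 4: put() silently replaces any earlier entry with the same UUID.
    void loadSilently(UUID id, String fileName) {
        fileNameById.put(id, fileName);
    }

    // Outcome 3: detect the duplicate and name both files so the user can act.
    void loadStrictly(UUID id, String fileName) {
        String existing = fileNameById.putIfAbsent(id, fileName);
        if (existing != null) {
            throw new IllegalStateException(
                "Duplicate UUID " + id + ": " + existing + " and " + fileName);
        }
    }

    String fileNameFor(UUID id) {
        return fileNameById.get(id);
    }

    int size() {
        return fileNameById.size();
    }
}
```

Including both file names in the error message at least translates the opaque identifier back into something the end user recognises.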

The scenario above happened because the application developer’s identifier conflicts with the underlying storage mechanism’s identifier — typically a fully-qualified (case sensitive?) name for a file system. The file system will happily create exact clones of the developer’s supposedly numerically unique resources. Identity criteria mismatch scenarios are a very common source of identity related bugs – particularly across space and time, as in distributed systems.

Conclusion

I hope this entry has helped you understand how those pesky identity bugs keep cropping up in your products and code. I Am Not a Philosopher (IANAP!) so if I have piqued your interest I encourage you to look at some of the references below for far more detail.

What identity-related bug did you work around or fix this week?

References

Eclipse and VisualStudio in 2010

Saturday, March 22nd, 2008

This talk attempted to zoom out and present some of the impending challenges for IDE design, particularly around the GUI. The current MDI-style IDE interface has remained essentially unchanged since inception, while using 2+ large monitors has become increasingly common. Many people now develop while on the road (train tracks in my case!) using a laptop, where screen space is much more constrained. Input devices are also changing, with support for multitouch and gestures already in mainstream use. Developers are also building larger systems and require more focused and efficient filtering of information. Multiple CPUs allow the IDE to be more proactive in offering developer assistance, without disrupting the developer’s chain of thought with modal operations. Apparently developers are also spending far more time exploring and reading code, rather than “just” editing source files.

Babel

Saturday, March 22nd, 2008

I had a very enjoyable couple of drinks with Gabe from the Eclipse Foundation. Amongst many other things Gabe has been implementing the localization server for the Babel project. The server allows any Eclipse user to log in and supply translations for localized strings. These strings are then built and can be downloaded as a language pack. I took the opportunity to pick Gabe’s brain to understand how hard it would be to install a Babel server within ILOG to help us manage the localization for Rule Studio, which is currently a considerable challenge as we support English, French, German, Spanish, Korean, Japanese and Chinese.

Ganymede Packaging

Saturday, March 22nd, 2008

The talk on the packaging efforts for Eclipse 3.4 (Ganymede) was interesting in that it described the process the Eclipse Foundation uses to create the master Update Site from the 30+ individual project Update Sites that compose the Ganymede release. Buckminster is used to resolve project dependencies while some custom scripts are capable of creating the master Update Site. The source code is (of course) Open Source — so I will have to take a look to see if there is something we can use to improve our internal build processes.

What is new in the Eclipse 3.4 JDT

Saturday, March 22nd, 2008

This short talk showed off the enhancements to JDT coming in Eclipse 3.4: code completion for classes that have not been imported and automatic addition of casts after “instanceof” tests were my favourites. The breadcrumb navigation bar also looked useful, allowing you to navigate to classes from within the Java source editor.

Eclipse 4.0

Saturday, March 22nd, 2008

The talk on Eclipse 4.0 (e4) made it clear from the outset that discussions were just starting. Details were sketchy, but one theme that emerged was web-enabling the Eclipse platform. There was talk of being able to implement plug-ins using JavaScript, supporting CSS styling rules for all user interface elements and even supporting server-side deployment of the platform runtime to enable web applications. This ambitious effort will provide a Flash/HTML/AJAX port of SWT, allowing graphical Eclipse applications to run within a web browser.

The presenters stressed however that they were still in the brainstorming phase and aim to produce more concrete plans and demos for EclipseCon 2009. They also tried to dispel any compatibility fears by saying that Eclipse 3.x would be maintained and enhanced for many years to come.

Cloudsmith

Saturday, March 22nd, 2008

Stefan Daume, a fellow University of Edinburgh AI graduate, was kind enough to give me a demo of the Cloudsmith software distribution solution. Cloudsmith is currently in beta and allows you to define custom software distributions, assembled from components published in Maven repositories or from Eclipse plug-ins. The very slick Cloudsmith GUI makes managing distributions easy, while the powerful runtime, based on the Eclipse Buckminster project, performs all the heavy lifting to ensure component dependencies are resolved. Distributions can be easily “materialized” into an Eclipse workspace. I’m definitely going to take a second look at Cloudsmith. Perhaps it can help to manage the 80+ plug-ins that compose a Rule Studio distribution?

Services vs. Extensions

Saturday, March 22nd, 2008

This panel discussion delved into the use cases for Services vs. Extensions. The most memorable analogy was that Extensions are like the relationship between a parent and their children, while Services are like a peer-to-peer relationship between consenting adults. All the panelists agreed that both serve a valuable purpose; however, there is some technical work to be done to ensure that the Extension lifecycle is as rich as the Service lifecycle and that the programming model for Services is as simple as the programming model for Extensions.

A Love Supreme?

Wednesday, March 19th, 2008

Yesterday afternoon I sat in on two sessions on IBM Jazz. This incredibly ambitious project aims to break through the functional silos between the tools commonly used by developers: source code control, bug tracking, automated build, agile planning and development process support. It also exposes all these tools through Eclipse with an integrated user interface. At the core of this approach is a unified data model and central repository that stores all development artifacts: source code, definitions of teams, projects, milestones, executable development process definitions, etc. There was some talk of connectors to external systems (such as bi-directional synchronization for SVN) but my impression was that you would not get the same unified Jazz experience in a heterogeneous environment.

The process section of the demo was impressive — allowing a project leader to define the artifacts, roles, permissions and allowable state transitions for development artifacts. For example, during the demo the development process prevented a developer from submitting code without referencing a bug report during a stabilization iteration, submitting code with unused package imports or code that contained strings that were not localizable (externalized). Process definitions are defined in XML documents and are hierarchical, allowing a sub-team to specialize its development process while still adhering to the process of its parent team. Jazz supplies a couple of process definition templates, including one that models the “Eclipse Way” process used by the Eclipse Foundation as well as the Scrum development process.

Of course, the $1M question is: how much will it all cost? Jazz is not open source and given the functional breadth of the offering you’ve got to imagine it is going to be pricey. Can it displace Rational ClearCase et al. for large IBM shops? Almost certainly. Can it displace SVN+Jira/Bugzilla+XPlanner/Rally for cost-sensitive companies? Unlikely.

AMD CodeSleuth

Tuesday, March 18th, 2008

I saw a cool demo of a new (Open Source) profiler from AMD this morning. The profiler uses the hardware performance counters in the CPU (it supports both AMD and Intel chips) and profiles at the machine instruction set level. The CodeSleuth Eclipse plug-in can relate the machine instructions back to Java source code statements. If you want a deep view into the instructions generated by the JVM’s JIT, and how they relate to your Java application, this could be very useful. They also claim that the overhead is very low because the profiler does not need to perform bytecode injection to implement counters, for example; instead it uses counters built into the CPU cores.