Posts Tagged ‘relational database’

Null

One of my favorite videos is Null Island from Minute Earth. I frequently link to it when I get into a discussion about whether null is an acceptable value for a certain use case.

What I really like is the definition around null and the focus of null having a concrete definition, that is: “we don’t know”. When used in this way, we have a clear understanding of what null is and the context in which it’s used (whether some field in a relational table, JSON object, value object, etc.) inherits this definition (i.e. “it’s either this value or we don’t know”), yielding something that’s fairly easy to reason about.

When nulls are ill-defined or have multiple definitions, complexity and confusion grow. Null is not:

  • Zero
  • Empty set
  • Empty string
  • Invalid value
  • A flag value for an error

Equating null to any of the above means that if you come across a null, you need to dig deeper into your code or database to figure out what that null actually means.

The flip side of this is avoiding nulls altogether, and there are really 2 cases here:

  • There is no need for null (i.e. we do know what the value is, in every use case)
  • Architect the system such that a null isn’t surfaced

In the first case, null doesn’t fit the use case, so there’s no need for it. When possible, this is ideal, and you avoid the necessity for null checks.

For the second case, architecting this way always seems to involve adding more complexity, to the point where it’s questionable if there’s a net benefit.

Pivot tables

One of the more interesting things I’ve learned about recently, that’s proven itself incredibly useful, is the pivot table. A pivot table turns rows into columns. This may seem odd, but the utility of this transformation becomes apparent when you have data that can’t be modeled precisely by a set of attributes because there are attributes which apply to some pieces of data and not others. The typical solution to this problem is simply to have additional columns, allowing for NULL values or defining an appropriate default value (e.g. empty string). However, with a large number of attributes or with user-defined attributes, this solution becomes unattractive, and constructing a pivot table is preferable.

Here I’ll present an example of constructing a pivot table for a schema in which a number of optional attributes, stored as key-value pairs, are attributed to entities. Here’s an ER diagram of the simple schema used in this example:

Pivot Table example schema

  • entities holds a list of entities (companies, users, etc.), with name being the only attribute required for all entities.
  • keyvals hold a list of key-value pairs to associate with entities.
  • entity_attributes maps keyvals to entities.

An SQL query to grab all entities and all associated keyvals would look something like this:

SELECT *
FROM entities
LEFT JOIN entity_attributes ea ON ea.entity_id=entities.id
LEFT JOIN keyvals kv ON kv.id = ea.keyval_id
WHERE 1;

… this would return every entity LEFT JOINed with any associated keyval:

No pivot

We can turn each entry in the key column (keyvals.key) into its own column, with rows having the corresponding entry from the value column (keyvals.value), using a simple conditional statement and alias as shown below:

SELECT entities.*,
IF(kv.key='location', kv.value, '') AS location,
    
IF(kv.key='fax_num', kv.value, '') AS fax_num
FROM entities
LEFT JOIN entity_attributes ea ON ea.entity_id=entities.id
LEFT JOIN keyvals kv ON kv.id = ea.keyval_id
WHERE 1;

No pivot

The keys, location and fax_num, are now represented as columns. For entities with 2+ associated keyvals, we still have multiple rows for each entity, but the data is such that each row only holds a single value entry (keyvals.value) for a single key (keyvals.key), under the respective column. To get a single row per entity, we GROUP BY the entity and take the MAX of each key column.

SELECT entities.*,
MAX(IF(kv.key='location', kv.value, '')) AS location,
    
MAX(IF(kv.key='fax_num', kv.value, '')) AS fax_num
FROM entities
LEFT JOIN entity_attributes ea ON ea.entity_id=entities.id
LEFT JOIN keyvals kv ON kv.id = ea.keyval_id
WHERE 1
GROUP BY entities.id;

pivoted

The result is a pivot table.

Batching, a basis for optimization

It’s interesting that in 3 distinct domains I’ve run across the same underlying basis for optimization:

  • Graphics: Modern GPUs depend heavily on batching primitives, typically triangles. Instead of rendering triangles individually, you get much better performance by batching primitives together in a list, sending it to the GPU via a single call, then letting the GPU pipelines to do their thing. Even before modern GPUs existed, graphics cards supported techniques like BitBlt which, essentially, performed operations on batched blocks of pixels, to take advantage of the embarrassingly parallel nature of computer graphics.
  • Relational Databases: Issuing lots of small queries can kill performance. A better strategy is, usually, to issue fewer queries, joining and returning as much data as possible with each query. Even if these queries becomes complex and costly, the cost of a complex query will usually still be less than the aggregate cost of numerous simpler queries.
  • Networking: The speed of light sucks… server and packet switching latencies make things worse. I usually assume ~50ms baseline latency to send a request packet + get a reply packet back from an internet server (I use the term “packet” loosely, referring to programmer-defined, application-level “packets” or messages, or whatever you like to call them, not necessarily TCP/IP packets). Note that this baseline is regardless of the amount of information in a packet and is bound by the travel time between server and client. So, to optimize communication and bandwidth, a good strategy is to transfer as much as possible per-packet instead of depending upon numerous requests/responses to/from a server, which would mean lots of packets and lots of wasted time.

hotspotdot

hotspotdot is my entry into Microsoft’s My App is Better Challenge. In a nutshell, hotspotdot allows you to locate and tag “hot spots” that are important to you, allowing you to create a personal database of hot spots.

It’s a very simple app, utilizing PHP, SQL Server 2008, jQuery, jQuery UI, and the Google Maps API. I wrapped thing up in under 2 weeks, but in order to make the August 25th deadline there were a few features I didn’t get a chance to put in, such as preloaders, icons on the map, and searching by tag. I’ll probably devote some time to these in the coming weeks.

hotspotdot login

The goal of the contest was to create something showing off the power of SQL Server and Microsoft’s new PHP for SQL Server 2.0 drivers. I used the new PDO driver which was pretty sweet – very simple, elegant API. Two things in particular that impressed me were transactions (very nice for doing multiple inserts or deletions) and a consistent and well designed exception model (makes error handling much easier and especially powerful when combined with transactions as failed queries don’t effect the database [code jumps to exception handler before commit() is called], so no chance of junk being inserted). Error messages, in general, were also much more descriptive than those I’ve encountered with MySQL.

Finding a server proved difficult. I got a shared hosting solution, but the host was unable (though I suspect unwilling) to install the PHP for SQL drivers. So I ended up getting a virtual private server, but this only came with SQL Server Express and I used some features (see below) that prevented a migration. In order to avoid purchasing an SQL Server 2008 license (way out of my budget), my final solution was use the shared hosting server for the DB and the VPS for everything else. Hence the reason for the site being located at the my. subdomain, which maps to the IP of the VPS.

SQL Server 2008 is a fine system (despite some annoyances with the management studio)… though it’s really just a solid database system, which isn’t bad, but there’s nothing really impressive or creative about that. It’s not really leaps and bounds above a cheap solution like MySQL. The contest was about SQL Server, so I tried to do something that utilized a fairly unique aspect of the system: the geography data type. I wasn’t too impressed. It simply holds a (longitude, latitude) pair in a certain format, nothing more. The one big advantage of having a vector type like this would be doing comparisons based on distance but, as far as I could tell, this isn’t supported (my queries failed). Worse yet, in SQL Server Management Studio the display of the geography type is in hexadecimal, making things very cryptic… I couldn’t help but wonder why this is any better than using 2 columns and storing the longitude and latitude values independently. That said, the idea of richer/more-complex data types within a relational database is a pretty cool idea and it would interesting to see it taken further and beyond its current, primitive state.

hotspotdot map

hotspotdot map

Leave a comment if you find a bug or have any questions, comments, etc.