Posts Tagged ‘mysql’

A more relational dictionary

As I started looking to add more functionality to Lexiio, I realized the Wiktionary definitions database dump I was using wasn’t going to cut it; specifically, I needed a normalized schema, or I’d have data duplication all over the place. I started normalizing in MySQL, but whether it was MySQL or MySQL Workbench, I kept running into character encoding issues. Using a simple INSERT-SELECT, in MySQL 5.7, to transfer words from the existing table to a new table resulted in lost characters:

MySQL losing characters

I dumped the data into PostgreSQL, didn’t encounter the issue, and just kept working from there.

The normalized schema can be downloaded here: LexiioDB normalized
(released under the Creative Commons Attribution-ShareAlike License)

LexiioDB schema

The unknown_words and unknown_to_similar_words tables are specific to Lexiio and serve as places to store unknown words entered by the user and close/similar matches to known words (via the Levenshtein distance).
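For reference, the Levenshtein distance mentioned above is the minimum number of single-character insertions, deletions, and substitutions needed to turn one word into another. A minimal sketch of the classic dynamic-programming computation (just to illustrate the metric; the LexiioDB tables only store the resulting matches):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b: the minimum number of
    insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion from a
                cur[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```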

ISO 8601 and MySQL

I was curious what to name a variable holding a string with a date formatted for MySQL. Something like “startDateMysql” seemed silly, so I went digging as to what MySQL’s date format was based on.

The YYYY-MM-DD format used by MySQL is defined by ISO 8601.

The HH:MM:SS format is also defined by ISO 8601, but MySQL never appends a time zone designator (a ‘Z’ for UTC or an offset from UTC). Without a designator, the standard says the time is to be interpreted as local time, but MySQL may return non-local time without a designator, which is non-conformant.

When combining a date and time, MySQL omits the ‘T’ delimiter, which the standard permits “by mutual agreement” between the parties exchanging the data.
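The difference is easy to see side by side. A small sketch using Python’s datetime (the pivot date here is made up): the first line is the full ISO 8601 form, with the ‘T’ delimiter and a UTC designator; the second is the MySQL-style DATETIME string, with a space and no designator.

```python
from datetime import datetime, timezone

dt = datetime(2013, 1, 15, 22, 15, 45, tzinfo=timezone.utc)

# Full ISO 8601: 'T' delimiter plus an explicit UTC offset designator.
print(dt.isoformat())                    # 2013-01-15T22:15:45+00:00

# MySQL-style DATETIME string: space delimiter, no designator.
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2013-01-15 22:15:45
```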

Numbering items within groups in MySQL

In my previous post, I described computing row numbers for items within a result set. If items in the result set can be ordered into groups, you can use the same technique to number items within the groups. This is valuable as it allows you to get the top 3, top 5, top N … items within each group (note that GROUP BY isn’t appropriate here; we don’t want to aggregate items within a group, we want an ordering of all the items within a group).

Consider the following sample of data where we’ve done an ordering by the location and age fields (ORDER BY location DESC, age DESC). Every user’s email address is within a group (location) and within that group users are sorted by age in descending order.

SELECT mytable.*
FROM mytable
ORDER BY location DESC, age DESC

ordered data

By using two variables, one to keep track of the item number (@itm) and another to keep track of the current group (@grp), and incrementing the item number if the current group has not changed, else resetting it to 1, we get a numbering of each item within each group. In this example, the group item number tells us who is the 1st oldest, 2nd oldest, etc. user in each location.

SELECT
    @itm := IF(@grp = location, @itm + 1, 1) AS group_item_number,
    @grp := location,
    mytable.*
FROM mytable,
     (SELECT @itm := 0) AS itm_num_init,
     (SELECT @grp := NULL) AS grp_init
ORDER BY location DESC, age DESC

group item numbers
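As of MySQL 8.0, window functions give the same result without user variables: ROW_NUMBER() with a PARTITION BY clause restarts the count for each group. A sketch using Python’s sqlite3 (SQLite 3.25+ shares the window-function syntax; the sample rows are made up to mirror the email/location/age columns above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (email TEXT, location TEXT, age INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", [
    ("a@x.com", "NY", 40), ("b@x.com", "NY", 35),
    ("c@x.com", "LA", 50), ("d@x.com", "LA", 28),
])

# ROW_NUMBER() restarts at 1 for each location group,
# numbering users by age descending within the group.
rows = conn.execute("""
    SELECT email, location,
           ROW_NUMBER() OVER (PARTITION BY location ORDER BY age DESC)
               AS group_item_number
    FROM mytable
    ORDER BY location DESC, age DESC
""").fetchall()

for row in rows:
    print(row)
```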

Row numbers in MySQL

Surprisingly straightforward. Row numbers are useful in cases where you need the rank of a row within a result set.

SET @cur_row := 0;
SELECT mytable.*,
@cur_row := @cur_row + 1 AS row_num
FROM mytable
ORDER BY whatever;

The variable initialization can be done without a SET declaration by doing it as a JOIN:

SELECT mytable.*,
@cur_row := @cur_row + 1 AS row_num
FROM mytable, (SELECT @cur_row := 0) AS row_num_init
ORDER BY whatever;

Solution via Daniel Vassallo’s answer on StackOverflow. A comment mentioned the row numbers being calculated before ordering is done, but I haven’t found this to be the case.

While the above queries will give you the row numbers as expected, putting a HAVING or GROUP BY clause in the same statement might be dangerous. From Xaprb:

The result of assigning to a variable and using it in the same statement (in the HAVING, for example) depends on the query plan the server chooses, the phase of the moon, and probably other things too.

So, in such cases, a subquery is the safe bet:

SELECT sub.*
FROM (
    SELECT mytable.*,
           @cur_row := @cur_row + 1 AS row_num
    FROM mytable, (SELECT @cur_row := 0) AS row_num_init
    ORDER BY whatever
) AS sub
HAVING sub.row_num ... ;
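The same compute-inside, filter-outside pattern can be sketched with Python’s sqlite3, using ROW_NUMBER() (available in MySQL 8.0+ and SQLite 3.25+) in place of user variables; the table and its contents are made up. The inner query assigns row numbers, and the outer query can then safely filter on them, here keeping the top 2 by score:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                 [("ann", 90), ("bob", 75), ("cat", 85), ("dan", 60)])

# Row numbers are assigned in the inner query; the outer query
# filters on the already-computed value.
rows = conn.execute("""
    SELECT sub.name, sub.row_num
    FROM (SELECT name,
                 ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num
          FROM mytable) AS sub
    WHERE sub.row_num <= 2
    ORDER BY sub.row_num
""").fetchall()

print(rows)  # [('ann', 1), ('cat', 2)]
```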

Generating a random timestamp in MySQL

Random timestamps are fairly simple to generate and useful when creating test data or fake data for demos. The technique involves converting a MySQL timestamp into unix time, adding or subtracting a certain time range (in seconds), then converting back to a MySQL timestamp.

I’ve used this technique presented on underdog-projects.net before, where a starting timestamp is specified and a random time span is added to the timestamp. However, I wanted to try something slightly different and, in my opinion, a bit more robust. Instead of specifying a starting timestamp, I wanted a pivot timestamp from which a random time span is randomly added or subtracted.

SET @pivot_ts = '2013-01-15 22:15:45';
SET @max_span = 432000; /* 5 days in seconds */
SET @bias = SIGN(-0.5 + RAND());

SELECT FROM_UNIXTIME(
    UNIX_TIMESTAMP(@pivot_ts) + (@bias * FLOOR(RAND() * @max_span))
);

Of course, all the variables can be folded into the SELECT statement to make everything more succinct.
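The same pivot-and-bias idea works outside the database too, e.g. for generating test data in application code. A sketch in Python mirroring the SQL above (the five-day span matches @max_span):

```python
import random
from datetime import datetime, timedelta

def random_timestamp(pivot: datetime, max_span_seconds: int) -> datetime:
    """Return a timestamp within +/- max_span_seconds of pivot,
    mirroring UNIX_TIMESTAMP(pivot) + bias * FLOOR(RAND() * span)."""
    bias = random.choice((-1, 1))              # SIGN(-0.5 + RAND())
    span = random.randrange(max_span_seconds)  # FLOOR(RAND() * max_span)
    return pivot + timedelta(seconds=bias * span)

pivot = datetime(2013, 1, 15, 22, 15, 45)
print(random_timestamp(pivot, 432000))  # within 5 days of the pivot
```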

Pivot tables

One of the more interesting things I’ve learned about recently, that’s proven itself incredibly useful, is the pivot table. A pivot table turns rows into columns. This may seem odd, but the utility of this transformation becomes apparent when you have data that can’t be modeled precisely by a set of attributes because there are attributes which apply to some pieces of data and not others. The typical solution to this problem is simply to have additional columns, allowing for NULL values or defining an appropriate default value (e.g. empty string). However, with a large number of attributes or with user-defined attributes, this solution becomes unattractive, and constructing a pivot table is preferable.

Here I’ll present an example of constructing a pivot table for a schema in which a number of optional attributes, stored as key-value pairs, are attributed to entities. Here’s an ER diagram of the simple schema used in this example:

Pivot Table example schema

  • entities holds a list of entities (companies, users, etc.), with name being the only attribute required for all entities.
  • keyvals holds a list of key-value pairs to associate with entities.
  • entity_attributes maps keyvals to entities.

An SQL query to grab all entities and all associated keyvals would look something like this:

SELECT *
FROM entities
LEFT JOIN entity_attributes ea ON ea.entity_id=entities.id
LEFT JOIN keyvals kv ON kv.id = ea.keyval_id
WHERE 1;

… this would return every entity LEFT JOINed with any associated keyval:

No pivot

We can turn each entry in the key column (keyvals.key) into its own column, with rows having the corresponding entry from the value column (keyvals.value), using a simple conditional statement and alias as shown below:

SELECT entities.*,
    IF(kv.key='location', kv.value, '') AS location,
    IF(kv.key='fax_num', kv.value, '') AS fax_num
FROM entities
LEFT JOIN entity_attributes ea ON ea.entity_id = entities.id
LEFT JOIN keyvals kv ON kv.id = ea.keyval_id
WHERE 1;

No pivot

The keys, location and fax_num, are now represented as columns. For entities with 2+ associated keyvals, we still have multiple rows for each entity, but the data is such that each row only holds a single value entry (keyvals.value) for a single key (keyvals.key), under the respective column. To get a single row per entity, we GROUP BY the entity and take the MAX of each key column.

SELECT entities.*,
    MAX(IF(kv.key='location', kv.value, '')) AS location,
    MAX(IF(kv.key='fax_num', kv.value, '')) AS fax_num
FROM entities
LEFT JOIN entity_attributes ea ON ea.entity_id = entities.id
LEFT JOIN keyvals kv ON kv.id = ea.keyval_id
WHERE 1
GROUP BY entities.id;

pivoted

The result is a pivot table.
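The whole pivot can be reproduced end to end with Python’s sqlite3. SQLite lacks IF(), so this sketch uses the equivalent CASE expression; the table contents are made up to match the schema above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE keyvals (id INTEGER PRIMARY KEY, key TEXT, value TEXT);
    CREATE TABLE entity_attributes (entity_id INTEGER, keyval_id INTEGER);

    INSERT INTO entities VALUES (1, 'Acme'), (2, 'Widgets Inc');
    INSERT INTO keyvals VALUES (1, 'location', 'NYC'),
                               (2, 'fax_num', '555-0100'),
                               (3, 'location', 'LA');
    INSERT INTO entity_attributes VALUES (1, 1), (1, 2), (2, 3);
""")

# MAX(CASE ...) collapses each entity's keyval rows into columns,
# mirroring the MAX(IF(...)) pivot above.
rows = conn.execute("""
    SELECT entities.name,
           MAX(CASE WHEN kv.key = 'location' THEN kv.value ELSE '' END)
               AS location,
           MAX(CASE WHEN kv.key = 'fax_num' THEN kv.value ELSE '' END)
               AS fax_num
    FROM entities
    LEFT JOIN entity_attributes ea ON ea.entity_id = entities.id
    LEFT JOIN keyvals kv ON kv.id = ea.keyval_id
    GROUP BY entities.id
    ORDER BY entities.id
""").fetchall()

for row in rows:
    print(row)
```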

Broken comments

h/t to Dave, the guy who contacted me and mentioned that he was unable to post comments on this blog; I didn’t realize the comments system was not functioning.

Now, why the WordPress comment system went down was confusing and frustrating to diagnose. WP simply sent you to a blank page (wp-comments-post.php#comment-). After ruling out an issue with the theme and a blank line in wp-comments-post.php (the first 2 cases mentioned here), I tried repairing the wp_comments table in the DB, which didn’t work (the post linked to mentions going further and wiping out the data and recreating the table, but I didn’t want to go that far). I ran across a few posts (such as this one) on the WP forums about similar issues, but nothing that led to a solution.

I decided to dig into wp-comments-post.php and started playing around with a few echo statements to see where things were going wrong. It turns out wp_new_comment() was always returning a comment_id of 0. Not good. Assuming the WP code was correct (I had no reason to suspect otherwise), it had to be a problem with the database. The problem and solution turned out to be pretty trivial: the comment_ID field of the wp_comments table was missing the auto_increment attribute (which needs to be set; you can see the WP database schema here). So all new comments were being assigned the default value of 0.

I’m not sure why this would happen; maybe an issue resulting from an upgrade my host did a while back or perhaps an issue with the WP auto-update process.