Coder Perfect

[closed] SQL database architecture recommendation for tags or tagging

Problem

I’ve heard of a few approaches to tagging: utilizing a mapping table between TagID and ItemID (which makes sense to me, but does it scale? ), adding a fixed amount of TagID columns to ItemID (which seems like a horrible idea), and keeping tags in a comma separated text column (sounds crazy but could work). I’ve even seen a sparse matrix recommended, but how can the tag names expand gracefully then?

Is there a best practise for tags that I’m overlooking?

Asked by dlamblin

Solution #1

Three tables (one for all things, one for all tags, and one for the relationship between the two), properly indexed, with foreign keys specified, and operating on a decent database, should function and scale well.

Table: Item
Columns: ItemID, Title, Content

Table: Tag
Columns: TagID, Title

Table: ItemTag
Columns: ItemID, TagID

Answered by Yaakov Ellis

Solution #2

Normally, I would agree with Yaakov Ellis, but there is another feasible option in this case:

Use two tables:

Table: Item
Columns: ItemID, Title, Content
Indexes: ItemID

Table: Tag
Columns: ItemID, Title
Indexes: ItemId, Title

This offers a number of significant advantages:

To begin with, it simplifies development: in the three-table method, when inserting or updating an item, you must first check the Tag table to determine if there are any existing entries. After that, you must replace them with new ones. This is not a simple task.

It then simplifies queries (and perhaps faster). There will be three major database queries that you will execute: Create a Tag-Cloud and choose all items for one Tag Title by printing all Tags for one item.

Tags for a single item:

3-Table:

SELECT Tag.Title 
  FROM Tag 
  JOIN ItemTag ON Tag.TagID = ItemTag.TagID
 WHERE ItemTag.ItemID = :id

2-Table:

SELECT Tag.Title
FROM Tag
WHERE Tag.ItemID = :id

Tag-Cloud:

3-Table:

SELECT Tag.Title, count(*)
  FROM Tag
  JOIN ItemTag ON Tag.TagID = ItemTag.TagID
 GROUP BY Tag.Title

2-Table:

SELECT Tag.Title, count(*)
  FROM Tag
 GROUP BY Tag.Title

Tags for one item:

3-Table:

SELECT Item.*
  FROM Item
  JOIN ItemTag ON Item.ItemID = ItemTag.ItemID
  JOIN Tag ON ItemTag.TagID = Tag.TagID
 WHERE Tag.Title = :title

2-Table:

SELECT Item.*
  FROM Item
  JOIN Tag ON Item.ItemID = Tag.ItemID
 WHERE Tag.Title = :title

However, there are certain disadvantages: It may take up more space in the database (leading to slower disk operations), and it isn’t normalized, which could lead to inconsistencies.

The size argument is weak because tags are generally quite small, therefore the size increase is not significant. One could argue that querying for the tag title in a tiny table with only one tag is much faster, and this is certainly true. However, taking into account the savings from not having to join and the fact that you can create a strong index on them, this might easily be offset. This, of course, is very dependent on the size of the database you’re working with.

The point about inconsistency is also debatable. Tags are free text fields, therefore operations like “rename all tags “foo” to “bar” aren’t expected.

So, in a nutshell, I’d go with the two-table solution. (In fact, I intend to do so.) I came across this article to see whether there were any valid counter-arguments.)

Answered by Scheintod

Solution #3

If you’re using a database that supports map-reduce, such as couchdb, the ideal approach to store tags is in a plain text or list column. Example:

tagcloud: {
  map: function(doc){ 
    for(tag in doc.tags){ 
      emit(doc.tags[tag],1) 
    }
  }
  reduce: function(keys,values){
    return values.length
  }
}

When you run this with group=true, the results will be grouped by tag name, and you’ll even get a tally of how many times that tag was found. It’s akin to counting the number of times a word appears in a manuscript.

Answered by Nick Retallack

Solution #4

Use a sophisticated full text search engine to index the tags, which should be stored in a single structured text column[1]. If you don’t, you’ll run into scaling issues when implementing boolean queries.

If you require more information about your tags, you can either keep track of them in an incrementally maintained table or execute a batch process to retrieve the data.

[1] Some RDBMS even include a native array type, which may be better for storage because it doesn’t require parsing, but may cause issues with full text searches.

Answered by David Schmitt

Solution #5

I’ve always kept the tags in their own database, followed by a mapping table. Of course, I’ve never attempted anything on such a grand scale.

Because you can easily put together SQL to get a list of tags with counts of how often each tag is used, having a “tags” table and a map table makes it relatively simple to build tag clouds and other things.

Answered by Mark Biek

Post is based on https://stackoverflow.com/questions/20856/recommended-sql-database-design-for-tags-or-tagging