Tuesday, July 30, 2013

Understanding char, nchar, varchar and nvarchar

What do they store?

  • nchar and nvarchar can store Unicode characters.
  • char and varchar cannot store arbitrary Unicode characters; they are limited to the characters in the code page of the column's collation.
  • char and nchar are fixed-length: they reserve storage for the number of characters you specify, even if you don't use all of that space.
  • varchar and nvarchar are variable-length: they use only as much space as the characters you store, and do not reserve storage the way char and nchar do.
  • nchar and nvarchar typically take up twice as much storage space, so it may be wise to use them only if you need Unicode support.
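
The bullets above boil down to simple arithmetic, which can be sketched outside the database. The Python snippet below is not SQL Server code: `data_bytes` is a hypothetical helper, and it assumes a single-byte code page for char/varchar and two bytes per character for nchar/nvarchar. It only approximates the data bytes a value would occupy, ignoring any per-row overhead.

```python
def data_bytes(sql_type: str, declared_length: int, value: str) -> int:
    """Approximate the data bytes one value occupies (hypothetical helper)."""
    if len(value) > declared_length:
        raise ValueError("value does not fit in the declared length")
    if sql_type not in ("char", "varchar", "nchar", "nvarchar"):
        raise ValueError(f"unsupported type: {sql_type}")

    # Assumption: 1 byte per character for char/varchar, 2 for nchar/nvarchar.
    bytes_per_char = 2 if sql_type.startswith("n") else 1

    # Fixed-length types always occupy the declared length;
    # variable-length types occupy only what the value needs.
    is_fixed = sql_type in ("char", "nchar")
    char_count = declared_length if is_fixed else len(value)
    return char_count * bytes_per_char


for sql_type in ("char", "varchar", "nchar", "nvarchar"):
    print(sql_type, data_bytes(sql_type, 10, "abc"))
```

For the three-character value "abc" in a column declared with length 10, this model gives 10, 3, 20 and 6 bytes respectively.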

Sizes they need

  • char   =  fixed-length non-Unicode character data with a maximum length of 8000 characters.
  • nchar  =  fixed-length Unicode data with a maximum length of 4000 characters (the same 8000 bytes).
  • char   =  8 bits (1 byte) per character
  • nchar  =  16 bits (2 bytes) per character

Summary

First, char and nchar will always use a fixed amount of storage space, even when the string to be stored is smaller than the available space, whereas varchar and nvarchar will use only as much storage space as is needed to store that string (plus two bytes of overhead, presumably to store the string length). So remember, "var" means "variable", as in variable space.
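
To see when the fixed/variable distinction matters, here is a rough Python estimate of the total space a column of values would take. It is a sketch under stated assumptions, not a description of SQL Server's actual row format: `column_bytes` is a hypothetical helper, it assumes one byte per character, and it charges the two bytes of overhead mentioned above to every variable-length value.

```python
def column_bytes(values, declared_length, variable):
    """Estimate total bytes for a one-byte-per-character column (hypothetical helper)."""
    if variable:
        # varchar: actual length plus 2 bytes of length overhead per value.
        return sum(len(v) + 2 for v in values)
    # char: every value is padded out to the declared length.
    return declared_length * len(values)


names = ["Ann", "Bartholomew", "Li"]
print(column_bytes(names, 50, variable=False))  # 150
print(column_bytes(names, 50, variable=True))   # 22

country_codes = ["US", "DE", "FR"]
print(column_bytes(country_codes, 2, variable=False))  # 6
print(column_bytes(country_codes, 2, variable=True))   # 12
```

Under these assumptions the variable-length type wins easily for names of uneven length, while the fixed-length type is smaller for values that always fill the declared length, such as two-letter codes.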

The second major point to understand is that nchar and nvarchar store strings using exactly two bytes per character, whereas char and varchar use an encoding determined by the collation code page, which will usually be exactly one byte per character (though there are exceptions, see below). Because two bytes per character can represent a very wide range of characters, the basic thing to remember here is that nchar and nvarchar tend to be a much better choice when you want internationalization support, which you probably do.
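
The difference between a code-page encoding and a two-byte encoding can be illustrated with Python's codecs. This is only an analogy: it assumes code page 1252 (a common Western European code page) stands in for the char/varchar side and UTF-16 little-endian for the nchar/nvarchar side, and the sample strings are made up for the example.

```python
# Hypothetical sample strings; escapes make the exact code points unambiguous.
cafe = "caf\u00e9"               # "café"
japanese = "\u65e5\u672c\u8a9e"  # "日本語"

# A Western European string fits the code page at one byte per character,
# and takes two bytes per character in the two-byte encoding.
cafe_single_byte = cafe.encode("cp1252")
cafe_two_byte = cafe.encode("utf-16-le")
print(len(cafe_single_byte), len(cafe_two_byte))  # 4 8

# Characters missing from the code page cannot be represented at all;
# here Python substitutes "?" for each of them.
japanese_single_byte = japanese.encode("cp1252", errors="replace")
japanese_two_byte = japanese.encode("utf-16-le")
print(japanese_single_byte, len(japanese_two_byte))  # b'???' 6
```

The question marks are what makes the one-byte saving risky: text that falls outside the column's code page is lost rather than stored.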

Now for some finer points.

First, unless a supplementary-character collation is used, nchar and nvarchar columns treat their data as UCS-2. This means that exactly two bytes per character will be used, and any Unicode character in the Basic Multilingual Plane (BMP) can be stored by an nchar or nvarchar field. However, it is not the case that any Unicode character is supported. For example, according to Wikipedia, the code points for Egyptian hieroglyphs fall outside of the BMP. There are, therefore, Unicode strings that can be represented in UTF-8 and other true Unicode encodings but cannot be stored and handled correctly as single characters in such a SQL Server nchar or nvarchar field, and strings written in Egyptian hieroglyphs would be among them. Fortunately your users probably don't write in that script, but it's something to keep in mind!
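
The hieroglyph example can be checked outside SQL Server. The Python sketch below (with `utf16_code_units` as a hypothetical helper) shows that a BMP character fits in one 16-bit unit, while a character outside the BMP needs two 16-bit units in UTF-16, which is exactly what a fixed two-bytes-per-character scheme has no single slot for.

```python
def utf16_code_units(ch: str) -> int:
    """Number of 16-bit code units UTF-16 needs for one character."""
    return len(ch.encode("utf-16-le")) // 2


latin_a = "A"              # U+0041, inside the BMP
hieroglyph = "\U00013000"  # U+13000, first code point of the Egyptian Hieroglyphs block

for ch in (latin_a, hieroglyph):
    in_bmp = ord(ch) <= 0xFFFF
    print(f"U+{ord(ch):04X}", in_bmp, utf16_code_units(ch))
# U+0041 True 1
# U+13000 False 2
```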

Another confusing but interesting point that other posters have highlighted is that char and varchar fields may use two bytes per character for certain characters if the collation code page requires it. (Martin Smith gives an excellent example in which he shows how Chinese_Traditional_Stroke_Order_100_CS_AS_KS_WS exhibits this behavior. Check it out.)


UPDATE: As of SQL Server 2012, there are finally collations that support UTF-16 supplementary characters, for example Latin1_General_100_CI_AS_SC, which can truly cover the entire Unicode range.