7.1 String Types

Table of Contents

The commonly used string data types are CHAR for fixed-length and VARCHAR for variable-length strings. These keywords are basically abbreviations for CHARACTER and CHARACTER VARYING (or CHAR VARYING). VARCHARACTER as the last theoretical keyword is not in the SQL standard. You rarely see people use any of the longer forms. There is a special kind of string that uses the NATIONAL keyword, which is explained at the end of this section.

7.1.1 Lengths

All string types have a length or size parameter in the form VARCHAR(n). For fixed-length string types, this is the exact amount of bytes or characters that will be occupied by each string, for variable-length string types, the length means the maximum allowed length. An attempt to store a string that exceeds this length will result in an error. For fixed-length strings, if excess characters are all spaces, the string will be truncated to the length and stored (this rather obscure behavior is required by the SQL standard). If the string to be stored is shorter than the declared length, values of fixed-length string types will be space-padded, while values of variable-length string types will simply store the string without overhead.

String types should always be specified with a length argument, like VARCHAR(50). If you do not specify a maximum length, the DBMS will either fall back to its default or generate an error. VARCHAR and CHAR used to have their upper limits at 255, but that has changed several years ago. Don't try to be smart and specify VARCHAR(255) or VARCHAR(250). It's obsolete.

The following table lists the maximum CHAR and VARCHAR string lengths plus the possible and default length types per DBMS as of mid 2009 (please also check your DBMS's manual, as they constantly evolve):

Maximum and Default String Lengths per DBMS
DBMSMaximum LengthUnit
ISO/ANSI SQL?Characters (default)
PostgreSQL1GB charactersCharacters (default) and bytes
MySQL65535 characters (as of MySQL 5.0.3)Characters
Oracle4000 bytesBytes (default) or characters
SQL Server8000 bytesBytes (default) or characters

You should always strive to find the optimum number of characters based on the requirements. For example, if you want to store email addresses or URLs, use the maximum defined by their standards (RFC). It's usually harder to determine appropriate lengths for attributes that are not backed by a standard, like names, streets etc. Always using VARCHAR(255) is the wrong approach and likely an indicator for an incompetent database designer.

For general-purpose, medium-sized strings, e.g. short descriptions, a maximum of 100, 200, or 500 is probably okay. If you need slightly larger strings, e.g. for small articles, a good portable maxiumum is at 2000 characters. This seemingly arbitrary value was effectively derived from Oracle SQL's limit of 4000 bytes, divided by two for the average Unicode character size. (I assumed a divider of two to be a good compromise from my naive western world view.)

If you just need an approximate rule of thumb (like myself), the chosen divider of two seems to be okay for most international texts (including a sufficient amount of tolerance). To be on the absolutely safest side, you should divide 4000 bytes by three or even four and take that as maximum (some multi-byte characters need that much storage). Unless you don't use a lot of fancy characters, 2000 sounds like a good overall compromise.

For very large strings, you can choose the standard SQL large object types (if available), which allow up to several gigabytes of storage. Alternatively, your DBMS might be allowed you to specify very higher maximum lengths. Some DBMSs actually allow pretty large numbers going into the gigabytes, like PostgreSQL. The character large object types have an extended length syntax, like CLOB(n[K|M|G]), with K meaning kilo (1024), M meaning mega (1048576), and G meaning giga (1073741824). The keywords T for tera and P for peta were added in SQL:2008. The length n is multiplied by the respective values in parentheses.

The SQL standard states the following about character large objects:

There are some restrictions associated with using CLOBs. The only comparison supported for CLOB values is using the NULL predicate.

A CLOB column may not be part of any primary key, index, unique constraint or primary key constraint.

The comparison restrictions also prevent CLOB columns from being used in DISTINCT, GROUP BY and ORDER BY clauses and UNION statements.

The same restrictions apply for binary large objects, too.

Try to avoid types like TINYCLOB, MEDIUMCLOB, and LONGCLOB as well as TEXT, TINYTEXT, MEDIUMTEXT, and LONGTEXT. They're all DBMS-specific and not portable. If no character large object type is available, TEXT appears to be the most common non-standard large string type among the DBMSs I've examined.

7.1.2 National String Types

The SQL standard defines additional string types using the NATIONAL keyword. The valid keyword combinations to specify fixed-length strings are NATIONAL CHARACTER, NATIONAL CHAR7, and NCHAR. For variable-length strings the keyword combinations are NATIONAL CHARACTER VARYING, NATIONAL CHAR VARYING, and NATIONAL VARCHAR. NVARCHAR isn't in the SQL standard, but many DBMSs offer it nontheless (which is correct). For very large strings, you can use the types NATIONAL CHARACTER LARGE OBJECT, NATIONAL CHAR LARGE OBJECT, NCHAR LARGE OBJECT, NATIONAL CLOB, or NCLOB. PostgreSQL, Oracle, and SQL Server support national character types, but not MySQL. Note, that not every keyword combination might be available on your DBMS of choice. You have to check your DBMSs manual.

The SQL standard states the following about national character strings:

A character string may be specified as being based on a specific character set by specifying CHARACTER SET in the data type; a particular character set chosen by the implementation to be the national character set may be specified by specifying NATIONAL CHARACTER, NATIONAL CHARACTER VARYING, or NATIONAL CHARACTER LARGE OBJECT (or one of several syntactic equivalents) as the data type.

This Oracle source states:

The alternative Unicode data types allow you to store Unicode characters in columns of NCHAR SQL data types (NCHAR, NVARCHAR2, NCLOB), irrespective of the database character set. The NCHAR SQL data type has been redefined in Oracle9i to be a Unicode data type exclusively. In other words, it stores data in the Unicode encoding only, and can be used with a non-Unicode database. You can use the NCHAR SQL data types in the same way you use the CHAR SQL data types.

So, what does the NATIONAL keyword provide, knowing that you can usually set the character set for the whole database, a table, or each string column? The "national" idea used to confuse me, because "nationality" suggests that a DBMS has to know a "current nation" in some way. Indeed, handling it is left to database implementations. As an example, Oracle XE has the following policy: The national character set, which is used for all NCHAR datatypes, is defined when the database is created. The national character set can be either UTF8 or AL16UTF16. The default is AL16UTF16. I'm pretty sure the same holds true for the "regular" Oracle DBMS, too.

According to the previous quote, the current national character set isn't equal to the database's character set. The default character set on Oracle doesn't seem to be Unicode: Your Oracle 9i db has been created with the default character set, so all fields of varchar2 type are not in Unicode. (The national character set is AL16UTF16 - full Unicode). As a consequence, there is no need to use any of the national string types if your database's character set is the same as the current national character set. You might want to look into setting the database's character set instead of using any of the national string types. After all, most databases tend to become international anyway. (That said, I wonder why the keyword became "NATIONAL" and not "INTERNATIONAL".)

You can use the NATIONAL string types when you have any of the following requirements (which were taken from this Oracle source):

  • To incrementally add multilingual support.
  • For performance concerns, because fixed-width and multibyte character sets perform more efficiently.
  • For packaged applications, that need to run on several different customer databases.
  • Better support for UTF-16 with Windows clients to match with Visual C/C++ wchar_t and string data types.

Note, the described principles are applicable to other DBMSs, too. You have to check your DBMS's manual for details on how and when the database's default and national character sets are determined.

7.1.3 Storage Requirements

The actual storage amount for strings largely depends on the character set that is used. The SQL standard assumes characters, not bytes, but some DBMSs, like MySQL default to bytes. Some DBMSs, like Oracle, allow you to specify bytes or characters explicitly, e.g. by using VARCHAR(20 BYTE) or CHAR(4 CHAR). This syntax is not standard SQL however. On PostgreSQL, the string length type (characters or bytes) is determined by the database character set. In practice, determining the exact storage requirements for strings isn't trivial. It would be best to gather representative string data up front and measure.

7.1.4 Oracle VARCHAR2

Ever since I've been using Oracle, it was recommended to use VARCHAR2 (and NVARCHAR2) instead of VARCHAR (and NVARCHAR). Both types used to have different implementations, limiting VARCHAR2 to 4000 bytes and VARCHAR to 2000 bytes. This was changed long ago (with Oracle 8). You sometimes read, that VARCHAR2 uses Unicode by default, whereas VARCHAR uses the ASCII character set. This is not true. It's either old or incorrect information.

VARCHAR has always just been an alias for VARCHAR2. If you ask me, there is no reason to use VARCHAR2 anymore - even though Oracle still recommends the latter. Officially, VARCHAR is still reserved for future versions and its exact behavior of may change without notice "someday". Because that statement is around since 1989(!), I doubt Oracle will ever change the behavior of VARCHAR. It's pretty safe to stick with that instead of VARCHAR2. If Oracle should really change VARCHAR someday, you will probably be willing to make the right adjustments.

References for String Types:

Last updated: 2010-10-13