VBScript Characters

VBScript characters involve two concerns: the character set, which defines all valid characters, and the character code, which determines how a character is stored when writing scripts, processing data, and presenting text to the user.

VBScript Character Code

In general, Visual Basic uses Unicode to store and manipulate strings. Unicode, as used by Visual Basic, stores each character in 16-bit (2-byte) code units. However, some other programs, such as 16-bit object libraries, use ANSI (American National Standards Institute) or DBCS (Double-Byte Character Set) to store and manipulate strings. Therefore, passing strings between different environments may expose differences between Unicode and ANSI/DBCS. For example,

Environment	Character set(s) used
Visual Basic	Unicode
32-bit object libraries	Unicode
16-bit object libraries	ANSI and DBCS
Windows NT API	Unicode
Automation in Windows NT	Unicode
Windows 95/98 API	ANSI and DBCS
Automation in Windows 95/98	Unicode

Besides, Visual Basic's own handling of the byte data of a string may also cause problems in bytewise operations, for example with the Chr, ChrB, Asc, AscB, LeftB, MidB, RightB, and InStrB functions.
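The difference between character-wise and bytewise functions can be seen in a minimal sketch. It assumes the script runs under Windows Script Host (cscript.exe); in an ASP page, Response.Write would take the place of WScript.Echo.

```vbscript
Option Explicit

Dim s
s = "AB"

WScript.Echo Len(s)     ' 2: number of characters
WScript.Echo LenB(s)    ' 4: number of bytes, two per character in memory

WScript.Echo Asc("A")   ' 65: character code of "A"
WScript.Echo AscB("A")  ' 65: value of the first byte of the string

' The second byte of "A" is the high byte of the 16-bit code unit 0x0041.
WScript.Echo AscB(MidB("A", 2, 1))   ' 0
```

Mixing the two families, for example passing a LenB result to Mid, is a common source of off-by-half errors.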

Character Set

Although the ANSI character set can handle most Western European languages with 256 characters, using individual 8-bit character sets, some languages such as Chinese (Traditional and Simplified), Japanese, and Korean (Hangeul) require more than 256 characters. DBCS and Unicode have become the most popular character sets for representing such text strings.

ASCII

ASCII (American Standard Code for Information Interchange) uses 7 bits to represent each of its 128 characters, which include control characters and printable characters. ASCII is the most important character set because it is usually the common part included in other common character sets. In other words, the ASCII character codes are usually a subset of other common character sets.
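As a small illustration of the 7-bit range, the sketch below checks whether a string contains only ASCII characters. IsAscii is a hypothetical helper name, and the script assumes Windows Script Host (cscript.exe).

```vbscript
Option Explicit

' Hypothetical helper: True when every character of s has a code
' in the 7-bit ASCII range 0-127.
Function IsAscii(s)
    Dim i, code
    IsAscii = True
    For i = 1 To Len(s)
        code = AscW(Mid(s, i, 1))
        ' AscW may return a negative number for code units above 0x7FFF.
        If code < 0 Or code > 127 Then
            IsAscii = False
            Exit Function
        End If
    Next
End Function

WScript.Echo Asc("A")    ' 65
WScript.Echo Chr(97)     ' a
WScript.Echo IsAscii("VBScript")          ' value True
WScript.Echo IsAscii("caf" & ChrW(233))   ' value False (U+00E9 is outside ASCII)
```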

Windows ANSI

Although ANSI stands for American National Standards Institute, ANSI is also a generic term used by Microsoft Windows for a character code standard. The ANSI standard uses only a single byte (8 bits) to represent each of its 256 characters. A Windows ANSI character code can therefore be divided into two parts: the lower 128 codes are identical to ASCII, and the upper 128 codes are assigned to various international character sets. A code page number is used as the identifier of each individual character set in Windows. For example,

code page	1250	1251	1252	1253	1254	1255	1256	1257	1258
upper 128	ANSI Central European	ANSI Cyrillic	ANSI Latin 1 /Western European	ANSI Greek	ANSI Turkish	ANSI Hebrew	ANSI Arabic	ANSI Baltic	OEM Vietnamese
lower 128	ASCII	ASCII	ASCII	ASCII	ASCII	ASCII	ASCII	ASCII	ASCII

In other words, the Windows ANSI character set is a kind of SBCS (Single-Byte Character Set).

DBCS

DBCS stands for Double-Byte Character Set. Although an SBCS is adequate for English, languages of the East Asian region typically require many more than 256 characters. A 2-byte value is a common solution to this problem. Unlike an SBCS, a DBCS is actually a multi-byte encoding, a mix of 8-bit and 16-bit characters. In general, the 8-bit characters in a DBCS are restricted to the ASCII character set. In other words, only the lower 128 codes of an SBCS are used (0-127). The upper 128 codes are reserved and function as the lead byte of a 16-bit character. In the DBCS data streams used on PCs, the lead byte of a 16-bit character is restricted to the upper 128 values of a single byte, while the trail byte can come from either the lower or the upper part of a single byte. Similar to ANSI, a DBCS code page number is used as the identifier of each individual DBCS character set for a language. However, each DBCS code page has a different predefined lead-byte and trail-byte range. For example,

Microsoft Windows code pages	Range of lead byte of DBCS
932 - Japanese Shift-JIS	X'81'-X'9F',X'E0'-X'FC'
936 - Simplified Chinese GBK	X'81'-X'FE'
949 - Korean	X'81'-X'FE'
950 - Traditional Chinese Big5	X'81'-X'FE'
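The role of the lead byte can be sketched with the code page 932 ranges from the table. IsLeadByte932 and CountChars932 are hypothetical helper names, the input is assumed to be an array of byte values rather than a VBScript string, and the script assumes Windows Script Host (cscript.exe).

```vbscript
Option Explicit

' Hypothetical helper: True when byteValue (0-255) falls in a lead-byte
' range of code page 932 (Shift-JIS): X'81'-X'9F' or X'E0'-X'FC'.
Function IsLeadByte932(byteValue)
    IsLeadByte932 = (byteValue >= &H81 And byteValue <= &H9F) _
                 Or (byteValue >= &HE0 And byteValue <= &HFC)
End Function

' Counts the characters in a sequence of byte values encoded in code page 932.
Function CountChars932(bytes)
    Dim i, n
    i = LBound(bytes)
    n = 0
    Do While i <= UBound(bytes)
        If IsLeadByte932(bytes(i)) Then
            i = i + 2   ' lead byte plus trail byte form one character
        Else
            i = i + 1   ' single-byte character
        End If
        n = n + 1
    Loop
    CountChars932 = n
End Function

' "A", full-width "A" (0x82 0x60), "B": four bytes, three characters.
WScript.Echo CountChars932(Array(&H41, &H82, &H60, &H42))   ' 3
```

Note that the trail byte 0x60 is also a valid ASCII code, which is why a DBCS stream has to be scanned from the start instead of being inspected at an arbitrary byte.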

Besides the sheer number of ideographic characters, an ideographic character also does not fit in the existing character cell. A form called the full-width (or wide) character was introduced for 2-byte characters, primarily ideographs. The form used for 1-byte characters is sometimes called the half-width character. With this arrangement, a full-width character may have a half-width variant, and a half-width character may have a full-width variant. For example,

Character Description	Code	Character by Unicode
ANSI character "A"	0x41	A
DBCS 932 Japanese Shift_JIS wide-width character A	0x8260	Ａ
DBCS 936 Simplified Chinese GBK wide-width character A	0xA3C1	Ａ
DBCS 949 Korean wide-width character A	0xA3C1	Ａ
DBCS 950 Traditional Chinese Big5 wide-width character A	0xA2CF	Ａ
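In Unicode, the full-width variants of the printable ASCII characters U+0021-U+007E sit at a fixed offset, in the range U+FF01-U+FF5E. The sketch below uses that offset; ToFullWidth and CodeUnit are hypothetical helper names, and the script assumes Windows Script Host (cscript.exe).

```vbscript
Option Explicit

' Hypothetical helper: maps printable ASCII U+0021-U+007E to the full-width
' forms U+FF01-U+FF5E by adding the fixed offset 65248 (hex FEE0).
Function ToFullWidth(s)
    Dim i, code, result
    result = ""
    For i = 1 To Len(s)
        code = AscW(Mid(s, i, 1))
        If code >= &H21 And code <= &H7E Then
            result = result & ChrW(code + 65248)
        Else
            result = result & Mid(s, i, 1)   ' leave other characters unchanged
        End If
    Next
    ToFullWidth = result
End Function

' AscW returns a signed 16-bit value, so negative results are shifted by 65536.
Function CodeUnit(ch)
    Dim n
    n = AscW(ch)
    If n < 0 Then n = n + 65536
    CodeUnit = n
End Function

WScript.Echo Hex(CodeUnit(ToFullWidth("A")))   ' FF21
```

The decimal offset is written out because a four-digit hexadecimal literal such as &HFEE0 is read by VBScript as a negative 16-bit Integer.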

Unicode

Although the first versions of Unicode (1991-1995) defined a 16-bit character-encoding scheme (U+0000-U+FFFF), which is now the Basic Multilingual Plane, the code space was extended to 21 bits (U+0000-U+10FFFF) starting with Unicode 2.0 (July 1996). Unicode tries to contain all commonly used characters of different languages within this code space. Besides the basic plane, more planes are defined, but not all possible code points are assigned. Unlike code pages, Unicode is language-independent: code points are assigned by agreement so that all languages can share common characters regardless of differences in glyph design. The Unicode code space is divided into planes of 65,536 code points each, and each plane is subdivided into blocks according to assignment. For example,

Plane 0, Basic Multilingual Plane (BMP), U+0000-U+FFFF contains characters for almost all modern languages, and a large number of symbols
Plane 1, Supplementary Multilingual Plane (SMP), U+10000-U+1FFFF
Plane 2, Supplementary Ideographic Plane (SIP), U+20000-U+2FFFF
Plane 3, Tertiary Ideographic Plane (TIP), U+30000-U+3FFFF
Plane 4-13, Unassigned, U+40000-U+DFFFF
Plane 14, Supplementary Special Purpose Plane (SSP), U+E0000-U+EFFFF
Plane 15-16, Supplementary Private Use Area Plane (SPUA-A/B), U+F0000-U+10FFFF
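Because each plane holds 65,536 code points, the plane of a code point follows from one integer division. PlaneOf is a hypothetical helper name, and the script assumes Windows Script Host (cscript.exe).

```vbscript
Option Explicit

' Hypothetical helper: returns the plane number (0-16) of a code point.
' Each plane holds 65536 (hex 10000) code points.
Function PlaneOf(codePoint)
    PlaneOf = codePoint \ 65536
End Function

' The trailing & forces a Long literal; without it, &HA08C would be
' read as a negative 16-bit Integer.
WScript.Echo PlaneOf(&H41)       ' 0  (BMP)
WScript.Echo PlaneOf(&HA08C&)    ' 0  (BMP)
WScript.Echo PlaneOf(&H252A6)    ' 2  (SIP)
WScript.Echo PlaneOf(&H10FFFF)   ' 16 (last plane)
```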

Because of the expansion of the code space, the Unicode coded character set cannot be manipulated in a computer directly as fixed-width values in the way that other coded character sets, such as ASCII or DBCS, are. A Unicode value serves only as a unique code point for each assigned character, so that each character can be referred to by a simple number.

In order to avoid ambiguity in data processing, Unicode code points should be encoded. Common encoding forms are UTF-8, which uses one to four 8-bit bytes; UTF-16, which uses one or two 16-bit code units; and UTF-32, which uses a single 32-bit code unit.

In Unicode, the character at code point U+FEFF is defined as the byte order mark (BOM), while its byte-reversed counterpart is either a noncharacter (U+FFFE) in the UTF-16 encoding form or a value outside the code space (0xFFFE0000) in UTF-32. The BOM is used to specify the order of bytes in a code unit. A code unit stored with the most significant byte (MSB) first is called big-endian, while a code unit stored with the least significant byte (LSB) first is called little-endian. Therefore, if all Unicode code points of a data stream lie within the Basic Multilingual Plane, the code points can be stored as 16-bit values with the byte order mark at the beginning of the data stream without causing any ambiguity. However, whenever the data stream contains a code point outside the Basic Multilingual Plane (greater than U+FFFF), the data stream must be encoded in one of the encoding forms.
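How a reader of a data stream might use the BOM can be sketched as follows. DetectBom is a hypothetical helper name, the input is assumed to be a zero-based array holding the first byte values of the stream, and the script assumes Windows Script Host (cscript.exe).

```vbscript
Option Explicit

' Hypothetical helper: inspects the first byte values of a data stream
' and names the encoding signalled by a byte order mark, if any.
Function DetectBom(b)
    Dim n
    n = UBound(b) + 1
    DetectBom = "no BOM"
    If n >= 4 Then
        ' UTF-32 is tested first: its little-endian BOM also starts with FF FE.
        If b(0) = 0 And b(1) = 0 And b(2) = &HFE And b(3) = &HFF Then
            DetectBom = "UTF-32 big-endian"
            Exit Function
        End If
        If b(0) = &HFF And b(1) = &HFE And b(2) = 0 And b(3) = 0 Then
            DetectBom = "UTF-32 little-endian"
            Exit Function
        End If
    End If
    If n >= 3 Then
        If b(0) = &HEF And b(1) = &HBB And b(2) = &HBF Then
            DetectBom = "UTF-8"
            Exit Function
        End If
    End If
    If n >= 2 Then
        If b(0) = &HFE And b(1) = &HFF Then
            DetectBom = "UTF-16 big-endian"
        ElseIf b(0) = &HFF And b(1) = &HFE Then
            DetectBom = "UTF-16 little-endian"
        End If
    End If
End Function

WScript.Echo DetectBom(Array(&HFE, &HFF, &H00, &H41))   ' UTF-16 big-endian
WScript.Echo DetectBom(Array(&HEF, &HBB, &HBF, &H41))   ' UTF-8
WScript.Echo DetectBom(Array(&H41, &H42))               ' no BOM
```

A UTF-16 little-endian stream whose first character is U+0000 would be reported as UTF-32 little-endian by this sketch; the two cannot be told apart from the BOM alone.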

On Windows, Unicode is used by the Component Object Model (COM), the basis for OLE and ActiveX technologies, on all 32-bit versions of Windows, and it is fully supported by Windows NT.

UTF-8

In order to make the Unicode coded character set feasible for information representation and manipulation in a computer, the coded character set must be mapped to an unambiguous form that computing software can recognize. UTF-8 is one of the encoding forms commonly used with Unicode. The UTF-8 encoding form uses one to four 8-bit bytes to represent a Unicode code point according to standard rules. Since UTF-8 is a sequence of single bytes whose order is fixed, and the role of each byte can be determined from its leading bits, a BOM is usually not necessary to identify byte order. The rules for encoding a Unicode code point in UTF-8 can be summarized as follows:

Unicode Code Value (Range of Unicode Code)	Byte Width of UTF-8	Form of UTF-8
00000000 0xxxxxxx (0x00-0x7F)	1 byte	0xxxxxxx
00000yyy yyxxxxxx (0x80-0x7FF)	2 bytes	110yyyyy 10xxxxxx
zzzzyyyy yyxxxxxx (0x800-0xD7FF, 0xE000-0xFFFF)	3 bytes	1110zzzz 10yyyyyy 10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx (0x10000-0x10FFFF)	4 bytes	11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
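The bit patterns in the table translate into integer divisions and bit masks. The following is a sketch only: Hex2 and Utf8Hex are hypothetical helper names, surrogate code points are not rejected, and the script assumes Windows Script Host (cscript.exe).

```vbscript
Option Explicit

' Formats a byte value (0-255) as two hexadecimal digits.
Function Hex2(b)
    Hex2 = Right("0" & Hex(b), 2)
End Function

' Encodes one code point as UTF-8 and returns the bytes as hex text.
' Dividing by 64, 4096, and 262144 shifts right by 6, 12, and 18 bits.
Function Utf8Hex(cp)
    If cp < &H80 Then
        Utf8Hex = Hex2(cp)
    ElseIf cp < &H800 Then
        Utf8Hex = Hex2(&HC0 Or (cp \ 64)) & " " & _
                  Hex2(&H80 Or (cp And 63))
    ElseIf cp < 65536 Then
        Utf8Hex = Hex2(&HE0 Or (cp \ 4096)) & " " & _
                  Hex2(&H80 Or ((cp \ 64) And 63)) & " " & _
                  Hex2(&H80 Or (cp And 63))
    Else
        Utf8Hex = Hex2(&HF0 Or (cp \ 262144)) & " " & _
                  Hex2(&H80 Or ((cp \ 4096) And 63)) & " " & _
                  Hex2(&H80 Or ((cp \ 64) And 63)) & " " & _
                  Hex2(&H80 Or (cp And 63))
    End If
End Function

WScript.Echo Utf8Hex(&H41)      ' 41
WScript.Echo Utf8Hex(&HA08C&)   ' EA 82 8C
WScript.Echo Utf8Hex(&H252A6)   ' F0 A5 8A A6
```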

Example of UTF-8 with BOM

Data	A𥊦Aꂌ
Unicode Code Point Sequence	U+0041 U+252A6 U+0041 U+A08C
UTF-8 w/BOM	EF BB BF 41 F0 A5 8A A6 41 EA 82 8C

UTF-16

UTF-16 is another encoding form for Unicode. The UTF-16 encoding form uses one or two 16-bit code units to represent a Unicode code point according to standard rules. Unlike UTF-8, a BOM at the head of the data stream is usually used to identify the byte order of the code units. The rules for encoding a Unicode code point in UTF-16 can be summarized as follows:

Unicode Code Value (Range of Unicode Code)	Width of UTF-16	Form of UTF-16
xxxxxxxxxxxxxxxx (0x0000-0xD7FF, 0xE000-0xFFFF)	1 code unit (2 bytes)	xxxxxxxxxxxxxxxx
000uuuuuxxxxxxxxxxxxxxxx (0x10000-0x10FFFF)	2 code units (4 bytes)	110110wwwwxxxxxx 110111xxxxxxxxxx, where wwww = uuuuu - 1
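The two-code-unit form is a surrogate pair. A minimal sketch of the arithmetic follows; SurrogatePairHex is a hypothetical helper name, and the script assumes Windows Script Host (cscript.exe).

```vbscript
Option Explicit

' Hypothetical helper: splits a supplementary code point (U+10000-U+10FFFF)
' into its UTF-16 high and low surrogate code units, returned as hex text.
Function SurrogatePairHex(cp)
    Dim v, hi, lo
    v = cp - 65536              ' 20-bit value; subtracting hex 10000 gives wwww = uuuuu - 1
    hi = 55296 + (v \ 1024)     ' hex D800 plus the upper 10 bits
    lo = 56320 + (v Mod 1024)   ' hex DC00 plus the lower 10 bits
    SurrogatePairHex = Hex(hi) & " " & Hex(lo)
End Function

WScript.Echo SurrogatePairHex(&H252A6)   ' D854 DEA6
```

VBScript strings hold UTF-16 code units, so a character such as U+252A6 occupies two positions when measured with Len.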

Example of UTF-16 with BOM (big endian/little endian)

Data	A𥊦Aꂌ
Unicode Code Point Sequence	U+0041 U+252A6 U+0041 U+A08C
UTF-16 big endian	FE FF 00 41 D8 54 DE A6 00 41 A0 8C
UTF-16 little endian	FF FE 41 00 54 D8 A6 DE 41 00 8C A0

UTF-32

UTF-32 is another encoding form for Unicode. The UTF-32 encoding form uses a single 32-bit code unit to represent a Unicode code point from the 21-bit code space. Unlike UTF-8, a BOM at the head of the data stream is usually used to identify the byte order of the code units. The rules for encoding a Unicode code point in UTF-32 can be summarized as follows:

Unicode Code Value	Width of UTF-32	Form of UTF-32
000uuuuuxxxxxxxxxxxxxxxx	1 code unit (4 bytes)	00000000000uuuuuxxxxxxxxxxxxxxxx
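Since UTF-32 stores the code point value directly, the only variable is byte order. The sketch below lists the four bytes of one code unit in either order; Hex2 and Utf32Hex are hypothetical helper names, and the script assumes Windows Script Host (cscript.exe).

```vbscript
Option Explicit

' Formats a byte value (0-255) as two hexadecimal digits.
Function Hex2(b)
    Hex2 = Right("0" & Hex(b), 2)
End Function

' Returns the four bytes of one UTF-32 code unit as hex text,
' most significant byte first when bigEndian is True.
Function Utf32Hex(cp, bigEndian)
    Dim b0, b1, b2, b3
    b3 = Hex2(cp \ 16777216)          ' bits 31-24, always 00 for Unicode
    b2 = Hex2((cp \ 65536) And 255)   ' bits 23-16
    b1 = Hex2((cp \ 256) And 255)     ' bits 15-8
    b0 = Hex2(cp And 255)             ' bits 7-0
    If bigEndian Then
        Utf32Hex = b3 & " " & b2 & " " & b1 & " " & b0
    Else
        Utf32Hex = b0 & " " & b1 & " " & b2 & " " & b3
    End If
End Function

WScript.Echo Utf32Hex(&H252A6, True)    ' 00 02 52 A6
WScript.Echo Utf32Hex(&H252A6, False)   ' A6 52 02 00
```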

Example of UTF-32 with BOM (big endian/little endian)

Data	A𥊦Aꂌ
Unicode Code Point Sequence	U+0041 U+252A6 U+0041 U+A08C
UTF-32 big endian	00 00 FE FF 00 00 00 41 00 02 52 A6 00 00 00 41 00 00 A0 8C
UTF-32 little endian	FF FE 00 00 41 00 00 00 A6 52 02 00 41 00 00 00 8C A0 00 00

VBScript Character Set

Although VBScript can manipulate characters of different character sets, the characters used in VBScript scripting are restricted to

English Alphabet Capital Letters: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
English Alphabet Small Letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Arabic numerals: 0 1 2 3 4 5 6 7 8 9
Special Symbols of ASCII: (space) !(exclamation mark) "(quotation mark) #(number sign) $(dollar sign) %(percent sign) &(ampersand) '(apostrophe) ((left parenthesis) )(right parenthesis) *(asterisk) +(plus) ,(comma) -(hyphen-minus) .(period-full stop) /(solidus-slash) :(colon) ;(semicolon) <(less-than sign - left angle bracket) =(equals) >(greater-than sign - right angle bracket) ?(question mark) @(commercial at sign) [(left square bracket) \(reverse solidus-backslash) ](right square bracket) ^(circumflex accent) _(low line - underline) `(grave accent) {(left curly bracket - left brace) |(vertical line - vertical bar) }(right curly bracket - right brace) ~(tilde)
Control Characters: Chr(0) nullchar, Chr(8) backspace, Chr(9) horizontal tab, Chr(10) linefeed, Chr(11) verticaltab, Chr(12) formfeed, and Chr(13) carriage return

The ASCII control characters Chr(8) backspace, Chr(9) horizontal tab, Chr(10) linefeed, and Chr(13) carriage return are supported by Microsoft Windows. They have no graphical representation but may affect the visual display of text, depending on the application. Chr(0) nullchar is the null character, whose value is 0. However, Chr(11) vertical tab and Chr(12) form feed are not useful in Microsoft Windows, for example for forming the layout of a VBScript script.
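VBScript provides built-in string constants for several of these control characters, which can be used in place of the Chr calls. A short sketch, assuming Windows Script Host (cscript.exe):

```vbscript
Option Explicit

Dim text
' vbTab is Chr(9) and vbCrLf is Chr(13) & Chr(10); both forms can be mixed.
text = "Name" & vbTab & "Value" & vbCrLf & "num" & Chr(9) & "10"
WScript.Echo text

' Each comparison below yields the value True.
WScript.Echo (vbTab = Chr(9))
WScript.Echo (vbLf = Chr(10))
WScript.Echo (vbCr = Chr(13))
WScript.Echo (vbCrLf = Chr(13) & Chr(10))
WScript.Echo (vbNullChar = Chr(0))
```

How the tab and line break are rendered depends on where the text is displayed; in HTML output, for instance, both normally collapse to ordinary white space.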

The printable characters Chr(32)-Chr(126) are supported and used in VBScript scripting, but Chr(127) delete is not supported. The alphabetic and numeric characters play a key role in VBScript. The special symbol characters in the Visual Basic character set also serve various functions in VBScript, for example organizing the script and defining tasks, but not specifying an operation to be performed.

As for the extended range, Chr(128)-Chr(255) are ANSI characters that depend on the code page configured on the local computer system. These characters are supported by Microsoft Windows but are not used in VBScript scripting.

Examples

Examples of Chr function

ASP VbScript Command:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
<script runat="server" language="VBScript">
Function printdata
cmda=array("Chr(num)"):ucnt=UBound(cmda):call heading
vars="num= 8":dsp_x
vars="num= 9":dsp_x
vars="num= 10":dsp_x
vars="num= 13":dsp_x
For z=32 to 126
vars="num= "&z:dsp_x
Next
End Function
Dim abb,cnt,cmda,cmp,dai,dlt,dte,dad,e,fnd,flg,fir,f_dw,gbd,hor,i,ild,k,mth,min,num,pnn,pi,sec,str1,str2,str,strs,tme,tmp,ucnt,var,vars,varx,wkd,x,z:varx=0
Function heading:If var=1 Then:tmp="":Else:tmp="<b>vars</b>":End If:For i=0 to ucnt:tmp=tmp&"<b>"&rmv_ctrl(cmda(i))&", TypeName</b>":Next:Response.Write "Results on Microsoft Windows Server 2008 R2 Standard  x64, Microsoft-IIS/7.5, VB Version 14.0<br />"&"<i>"&tmp&"</i>":End Function
Function dsp_x:Dim tmp_a,tmp_b,i:If varx=0  Then:Execute vars:End If:If var=1 Then:tmp_a="":Else:tmp_a=rmv_ctrl(vars):End If:For i=0 to ucnt:tmp_b=cmd_b(cmda(i)):if i/2=fix(i/2) Then:tmp_b="<b>"&tmp_b&"</b>":End If:tmp_a=tmp_a&tmp_b:Next:Response.Write "<i>"&tmp_a&"</i>":End Function
Function rmv_ctrl(blk_str):Select Case blk_str:Case chr(0):blk_str="NUL":Case chr(1):blk_str="SOH":Case chr(2):blk_str="STX":Case chr(3):blk_str="ETX":Case chr(4):blk_str="EOT":Case chr(5):blk_str="ENQ":Case chr(6):blk_str="ACK":Case chr(7):blk_str="BEL":Case chr(8):blk_str="BS":Case chr(9):blk_str="HT":Case chr(10):blk_str="LF":Case chr(11):blk_str="VT":Case chr(12):blk_str="FF":Case chr(13):blk_str="CR":Case chr(14):blk_str="SO":Case chr(15):blk_str="SI":Case chr(16):blk_str="DLE":Case chr(17):blk_str="DC1":Case chr(18):blk_str="DC2":Case chr(19):blk_str="DC3":Case chr(20):blk_str="DC4":Case chr(21):blk_str="NAK":Case chr(22):blk_str="SYN":Case chr(23):blk_str="ETB":Case chr(24):blk_str="CAN":Case chr(25):blk_str="EM":Case chr(26):blk_str="SUB":Case chr(27):blk_str="ESC":Case chr(28):blk_str="FS":Case chr(29):blk_str="GS":Case chr(30):blk_str="RS":Case chr(31):blk_str="US":Case chr(127):blk_str="DEL":End Select:rmv_ctrl=blk_str:If IsNull(blk_str) OR IsEmpty(Blk_str) OR blk_str="" OR IsNumeric(Blk_str) Then:Else:rmv_ctrl=Replace(Replace(Replace(Replace(Replace(blk_str,"&","&amp;"),"  "," &nbsp;"),"""","&quot;"),"<","&lt;"),">","&gt;"): End If :End Function
Function cmd_b(blk_str):cmd_b="["&cmd_x(blk_str)&"], "&cmd_x("TypeName("&blk_str&")"):End Function
Function cmd_x(blk_str):On Error Resume Next:Dim xans, tmp:xans="":tmp="":execute "xans="&blk_str:If Err.Number <> 0  Then:xans=cmd_x("TypeName("&blk_str&")"):Select Case xans:Case 9::Case Else:xans="error":End Select:End If:If IsArray(xans)=False Then:cmd_x=rmv_ctrl(xans): Else:tmp=lst_arry(xans):cmd_x=trim(tmp):End If:End Function
Function lst_arry(blk_str):Dim tmp,i,q:q="""":tmp="":For i=0 to UBound(blk_str):if VarType(blk_str(i))>1 And VarType(blk_str(i))<6 Then:q="'":End If:tmp=tmp&" "&q&rmv_ctrl(blk_str(i))&q:Next:lst_arry=trim(tmp):End Function
</script>
        <title>Sample Page</title>
        <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> 
        <style type="text/css">
           i {display:table-row}
           u,b {display:table-cell}
        </style>
    </head>
    <body>
<%call printdata %>
    </body>
</html>

HTML Web Page Embedded Output: