Sideway
output.to from Sideway





Draft for Information Only

VBScript Characters

VBScript characters raise two concerns: the character set that defines all valid characters, and the character code used to store each character when a script is written, processed by the computer, and presented to the user.

VBScript Character Code

In general, Visual Basic uses Unicode to store and manipulate strings. Unicode is a character set whose code points span a 21-bit range; Visual Basic holds them in memory as 16-bit code units. However, some other programs, such as 16-bit object libraries, use ANSI (American National Standards Institute) or DBCS (Double-Byte Character Set) to store and manipulate strings. Moving strings between these environments may therefore expose differences between Unicode and ANSI/DBCS. For example,

Environment                     Character set(s) used
Visual Basic                    Unicode
32-bit object libraries         Unicode
16-bit object libraries         ANSI and DBCS
Windows NT API                  Unicode
Automation in Windows NT        Unicode
Windows 95/98 API               ANSI and DBCS
Automation in Windows 95/98     Unicode

Besides, the way Visual Basic itself handles the byte data of a string may also cause problems in bytewise operations. Character-based functions such as Chr and Asc work on whole characters, while their byte-based counterparts ChrB, AscB, LeftB, MidB, RightB, and InStrB work on individual bytes.
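The gap between character counts and byte counts can be sketched outside VBScript. The following Python snippet is only an illustration: the helper name byte_lengths is hypothetical, and Python's "utf-16-le" and "cp1252" codecs are used as stand-ins for Unicode string storage and a Windows ANSI code page.

```python
def byte_lengths(text):
    """Return (character count, bytes as UTF-16, bytes as Windows-1252)."""
    return (
        len(text),
        len(text.encode("utf-16-le")),  # two bytes per BMP character
        len(text.encode("cp1252")),     # one byte per character
    )

print(byte_lengths("Abc"))                 # (3, 6, 3)
print("Abc".encode("utf-16-le").hex(" "))  # 41 00 62 00 63 00
```

A byte-based function applied to the six-byte form sees "A" as the two bytes 41 00, which is why bytewise and characterwise results can disagree.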

Character Set

Although the ANSI character sets can handle most Western European languages with 256 characters each, using individual 8-bit character sets, some languages such as Chinese (Traditional and Simplified), Japanese, and Korean (Hangeul) require more than 256 characters. DBCS and Unicode have become the most popular character sets for representing such text strings.

ASCII

ASCII (American Standard Code for Information Interchange) uses 7 bits to represent each of its 128 characters, which include control characters and printable characters. ASCII is the most important character set because it usually forms the common part of other character sets. In other words, the character codes of the ASCII characters are usually a subset of other common character sets.

Windows ANSI

Although ANSI stands for American National Standards Institute, ANSI is also a generic term used by Microsoft Windows for a character code standard. The ANSI standard uses a single byte, 8 bits, to represent each of 256 characters. A Windows ANSI character set can therefore be divided into two parts: the lower 128 codes are identical to ASCII, and the upper 128 codes are assigned to one of various international character sets. A code page number identifies each individual character set in Windows. For example,

Code page   Upper 128                            Lower 128
1250        ANSI Central European                ASCII
1251        ANSI Cyrillic                        ASCII
1252        ANSI Latin 1 / Western European      ASCII
1253        ANSI Greek                           ASCII
1254        ANSI Turkish                         ASCII
1255        ANSI Hebrew                          ASCII
1256        ANSI Arabic                          ASCII
1257        ANSI Baltic                          ASCII
1258        ANSI/OEM Vietnamese                  ASCII

In other words, each Windows ANSI character set is a kind of SBCS (Single-Byte Character Set).
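The split between the shared lower half and the code-page-specific upper half can be checked with a short Python sketch. It relies on Python's built-in cp125x codecs; the helper name decode_in_code_pages and the choice of four code pages are assumptions made for illustration.

```python
CODE_PAGES = ["cp1250", "cp1251", "cp1252", "cp1253"]

def decode_in_code_pages(raw):
    """Decode the same bytes under several Windows ANSI code pages."""
    return {name: raw.decode(name) for name in CODE_PAGES}

# A lower-128 byte means the same character everywhere.
print(decode_in_code_pages(b"\x41"))

# An upper-128 byte depends on the code page in use.
for name, char in decode_in_code_pages(b"\xe0").items():
    print(name, f"U+{ord(char):04X}", char)
```

The byte 0x41 decodes to "A" in every code page, while 0xE0 yields a different letter in each: for example U+00E0 under code page 1252 and U+0430 under code page 1251.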

DBCS

DBCS stands for Double-Byte Character Set. Although an SBCS is adequate for English, the languages of the East Asian region typically require many more than 256 characters. A 2-byte value is a common solution to this problem. Unlike SBCS, DBCS are actually multi-byte encodings, a mix of 8-bit and 16-bit characters. In general, the 8-bit characters in a DBCS are restricted to the ASCII character set. In other words, only the lower 128 codes of an SBCS (0-127) are used as single-byte characters. The upper 128 codes are reserved and function as the lead byte of a 16-bit character. In a DBCS data stream on a PC, the lead byte of a 16-bit character is therefore restricted to the upper 128 codes of a single byte, while the trail byte can come from either the lower or the upper half. As with ANSI, a DBCS code page number identifies each individual DBCS character set for a language. However, each DBCS code page has its own predefined lead-byte and trail-byte ranges. For example,

Microsoft Windows code page          Range of lead byte of DBCS
932 - Japanese Shift-JIS             X'81'-X'9F', X'E0'-X'FC'
936 - Simplified Chinese GBK         X'81'-X'FE'
949 - Korean                         X'81'-X'FE'
950 - Traditional Chinese Big5       X'81'-X'FE'
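How a lead-byte range drives the parsing of a DBCS stream can be sketched in Python. This is a minimal illustration using the code page 932 range from the table; the names CP932_LEAD_RANGES, is_lead_byte, and split_dbcs are hypothetical, and the sketch does not validate trail bytes.

```python
# Lead-byte ranges for code page 932 (Japanese Shift-JIS).
CP932_LEAD_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]

def is_lead_byte(value, ranges=CP932_LEAD_RANGES):
    """True if the byte value opens a 2-byte character."""
    return any(low <= value <= high for low, high in ranges)

def split_dbcs(data, ranges=CP932_LEAD_RANGES):
    """Split a DBCS byte stream into 1-byte and 2-byte characters."""
    chars = []
    i = 0
    while i < len(data):
        # A lead byte takes the following byte with it as the trail byte.
        width = 2 if is_lead_byte(data[i], ranges) and i + 1 < len(data) else 1
        chars.append(data[i:i + width])
        i += width
    return chars

print(split_dbcs(b"A\x82\x60B"))  # [b'A', b'\x82`', b'B']
```

Note that the trail byte 0x60 lies in the lower (ASCII) half, so a stream cannot be split correctly by looking at single bytes in isolation; it has to be scanned from a known character boundary.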

Besides being numerous, ideographic characters also do not fit in the existing character form. A form called full-width, or wide-width, character was introduced to contain 2-byte characters, primarily ideographs. The form containing 1-byte characters is sometimes called half-width character. Under this arrangement, a full-width character may have a half-width variant, and a half-width character may also have a full-width variant. For example,

Character description                                              Code      Character by Unicode
ANSI character "A"                                                 0x41      A (U+0041)
DBCS 932 Japanese Shift_JIS wide-width character A                 0x8260    Ａ (U+FF21)
DBCS 936 Simplified Chinese GBK wide-width character A             0xA3C1    Ａ (U+FF21)
DBCS 949 Korean wide-width character A                             0xA3C1    Ａ (U+FF21)
DBCS 950 Traditional Chinese Big5 wide-width character A           0xA2CF    Ａ (U+FF21)
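The codes in the table can be checked with Python's built-in codecs for these code pages. The snippet below is a sketch for illustration only; the dictionary name FULL_WIDTH_A and the helper decode_full_width_a are hypothetical.

```python
import unicodedata

# Code of the wide-width letter A in each DBCS code page (hex).
FULL_WIDTH_A = {
    "cp932": "8260",  # Japanese Shift_JIS
    "cp936": "A3C1",  # Simplified Chinese GBK
    "cp949": "A3C1",  # Korean
    "cp950": "A2CF",  # Traditional Chinese Big5
}

def decode_full_width_a():
    """Decode each code with its own code page."""
    return {cp: bytes.fromhex(code).decode(cp) for cp, code in FULL_WIDTH_A.items()}

for cp, char in decode_full_width_a().items():
    # Compatibility normalization maps the full-width form to its half-width variant.
    half = unicodedata.normalize("NFKC", char)
    print(cp, f"U+{ord(char):04X}", "->", half)
```

Every code page maps its own 2-byte code to the same Unicode code point, U+FF21, which Unicode treats as a compatibility variant of the half-width "A" at U+0041.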

Unicode

Although the first versions of Unicode (1991-1995) defined a 16-bit character-encoding scheme (U+0000-U+FFFF), which is now the Basic Multilingual Plane, the code space was extended to a 21-bit range (U+0000-U+10FFFF) starting with Unicode 2.0 (July 1996). Unicode assigns a unique number to every character, aiming to cover all commonly used characters of every language within the available code points. Besides the basic plane, more planes are defined, but not all possible code points are assigned. Unlike code pages, Unicode is language-independent: code points are assigned by agreement so that all languages can share common characters, without regard to differences in artistic design. In general, the Unicode code space is divided into planes of 65,536 code points (a 2-byte range) each, and each plane is subdivided into blocks according to assignment. For example,

  • Plane 0, Basic Multilingual Plane (BMP), U+0000-U+FFFF contains characters for almost all modern languages, and a large number of symbols
  • Plane 1, Supplementary Multilingual Plane (SMP), U+10000-U+1FFFF
  • Plane 2, Supplementary Ideographic Plane (SIP), U+20000-U+2FFFF
  • Planes 3-13, Unassigned, U+30000-U+DFFFF
  • Plane 14, Supplementary Special Purpose Plane (SSP), U+E0000-U+EFFFF
  • Planes 15-16, Supplementary Private Use Area Planes (SPUA-A/B), U+F0000-U+10FFFF
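Since every plane holds 65,536 code points, the plane number is simply the code point with its low 16 bits dropped. A minimal Python sketch, with the hypothetical helper name plane_of:

```python
def plane_of(code_point):
    """Return the Unicode plane number (0-16) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16  # drop the 16 bits that index within the plane

for cp in (0x0041, 0xA08C, 0x252A6, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```

For instance, U+A08C falls in plane 0 (BMP) and U+252A6 in plane 2 (SIP).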

Because of the expansion of the code space, Unicode code points cannot be manipulated in a computer directly in the way that other coded character sets, such as ASCII or DBCS, can. A Unicode value serves only as a unique code point for each assigned character, so that every character can be referred to by a simple number.

To avoid ambiguity in data processing, Unicode code points should be encoded. The common encoding forms are UTF-8, which uses one to four 8-bit bytes; UTF-16, which uses one or two 16-bit code units; and UTF-32, which uses a single 32-bit code unit.

In Unicode, the character at code point U+FEFF is defined as the byte order mark (BOM). Its byte-reversed counterpart is a noncharacter (U+FFFE) in the UTF-16 encoding form and lies outside the code space (0xFFFE0000) in UTF-32, so the two byte orders can always be told apart. The BOM is used to specify the order of bytes within a code unit. A code unit stored with the most significant byte (MSB) first is called big-endian, while a code unit stored with the least significant byte (LSB) first is called little-endian. Therefore, if all Unicode code points of a data stream lie within the Basic Multilingual Plane, the code points can be stored as 16-bit values with the byte order mark at the beginning of the data stream without causing any ambiguity. However, whenever a code point of the data stream lies outside the Basic Multilingual Plane (code point > U+FFFF), the data stream must be encoded in one of the encoding forms.
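Reading the BOM at the head of a data stream can be sketched as follows. The BOM constants come from Python's standard codecs module; the function name detect_bom and the returned labels are assumptions made for illustration.

```python
import codecs

# UTF-32 marks are tested first because the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32 little-endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 little-endian"),
]

def detect_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(bytes.fromhex("FEFF0041")))          # UTF-16 big-endian
print(detect_bom(bytes.fromhex("FFFE000041000000")))  # UTF-32 little-endian
print(detect_bom(b"A"))                               # None
```

The ordering is a design choice with one known limit: a UTF-16 little-endian stream whose first character after the BOM is U+0000 would be reported as UTF-32 little-endian.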

For Windows, Unicode has already been used by the Component Object Model (COM) on all 32-bit versions of Windows, used as the basis for OLE and ActiveX technologies, and fully supported by Windows NT.

UTF-8

To make the Unicode coded character set usable for representing and manipulating information in a computer, the coded character set must be mapped to an unambiguous form that computing software can recognize. UTF-8 is one of the encoding forms commonly used with Unicode. The UTF-8 encoding form uses one to four 8-bit bytes to represent a Unicode code point according to standard rules. Since the layout of each byte sequence can be determined from its first byte, a BOM is usually not needed to identify the byte order of the code units. The rules for encoding a Unicode code point in UTF-8 can be summarized as follows:

Unicode code value (range of Unicode code)                     Byte width of UTF-8    Form of UTF-8
00000000 0xxxxxxx (0x00-0x7F)                                  1 byte                 0xxxxxxx
00000yyy yyxxxxxx (0x80-0x7FF)                                 2 bytes                110yyyyy 10xxxxxx
zzzzyyyy yyxxxxxx (0x800-0xD7FF, 0xE000-0xFFFF)                3 bytes                1110zzzz 10yyyyyy 10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx (0x10000-0x10FFFF)                  4 bytes                11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

Example of UTF-8 with BOM

Data                           A 𥊦 A ꂌ
Unicode code point sequence    U+0041 U+252A6 U+0041 U+A08C
UTF-8 w/BOM                    EF BB BF   41   F0 A5 8A A6   41   EA 82 8C
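The UTF-8 bit patterns can be turned into code almost line by line. The following Python function is a minimal sketch of those rules, not a replacement for a library encoder; the name utf8_encode is hypothetical.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:                                  # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                                 # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                               # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),               # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x0041, 0x252A6, 0xA08C):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" ").upper())
```

Running it on the sample code points gives 41, F0 A5 8A A6, and EA 82 8C respectively, matching what Python's own str.encode("utf-8") produces.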

UTF-16

UTF-16 is another encoding form for Unicode. The UTF-16 encoding form uses one or two 16-bit code units to represent a Unicode code point according to standard rules. Unlike UTF-8, a BOM at the head of the data stream is usually used to identify the byte order of the code units. The rules for encoding a Unicode code point in UTF-16 can be summarized as follows:

Unicode code value                                    Width of UTF-16                 Form of UTF-16
xxxxxxxxxxxxxxxx (U+0000-U+D7FF, U+E000-U+FFFF)       1 code unit (2 bytes)           xxxxxxxxxxxxxxxx
000uuuuuxxxxxxxxxxxxxxxx (U+10000-U+10FFFF)           2 code units (2 bytes each)     110110wwwwxxxxxx 110111xxxxxxxxxx
where wwww = uuuuu - 1

Example of UTF-16 with BOM (big endian/little endian)

Data                           A 𥊦 A ꂌ
Unicode code point sequence    U+0041 U+252A6 U+0041 U+A08C
UTF-16 big endian              FE FF   00 41   D8 54 DE A6   00 41   A0 8C
UTF-16 little endian           FF FE   41 00   54 D8 A6 DE   41 00   8C A0
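The surrogate-pair rule can be sketched in Python. Subtracting 0x10000 from the code point is what turns the five u bits into the four w bits. The function names utf16_code_units and utf16_bytes are hypothetical.

```python
def utf16_code_units(cp):
    """Return the list of 16-bit code units for one code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x10000:
        return [cp]                      # one code unit, stored as-is
    value = cp - 0x10000                 # 20 bits remain: wwww xxxxxx xxxxxxxxxx
    return [0xD800 | (value >> 10),      # 110110 + high 10 bits
            0xDC00 | (value & 0x3FF)]    # 110111 + low 10 bits

def utf16_bytes(cp, byteorder="big"):
    """Serialize the code units in the given byte order ('big' or 'little')."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in utf16_code_units(cp))

print([f"{u:04X}" for u in utf16_code_units(0x252A6)])  # ['D854', 'DEA6']
print(utf16_bytes(0xA08C, "big").hex(" ").upper())      # A0 8C
print(utf16_bytes(0xA08C, "little").hex(" ").upper())   # 8C A0
```

Byte order applies within each code unit only; the high surrogate always precedes the low surrogate in either byte order.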

UTF-32

UTF-32 is another encoding form for Unicode. The UTF-32 encoding form uses a single 32-bit code unit to represent a Unicode code point from the 21-bit code space. Unlike UTF-8, a BOM at the head of the data stream is usually used to identify the byte order of the code units. The rule for encoding a Unicode code point in UTF-32 can be summarized as follows:

Unicode code value            Width of UTF-32           Form of UTF-32
000uuuuuxxxxxxxxxxxxxxxx      1 code unit (4 bytes)     00000000000uuuuuxxxxxxxxxxxxxxxx

Example of UTF-32 with BOM (big endian/little endian)

Data                           A 𥊦 A ꂌ
Unicode code point sequence    U+0041 U+252A6 U+0041 U+A08C
UTF-32 big endian              00 00 FE FF   00 00 00 41   00 02 52 A6   00 00 00 41   00 00 A0 8C
UTF-32 little endian           FF FE 00 00   41 00 00 00   A6 52 02 00   41 00 00 00   8C A0 00 00
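Because UTF-32 stores each code point unchanged in one 32-bit code unit, an encoder reduces to writing four bytes per character in the chosen byte order. A minimal Python sketch, with the hypothetical function name utf32_stream:

```python
def utf32_stream(text, byteorder="big"):
    """Encode text as UTF-32 with a leading BOM ('big' or 'little' byte order)."""
    out = (0xFEFF).to_bytes(4, byteorder)       # byte order mark
    for char in text:
        out += ord(char).to_bytes(4, byteorder)  # code point zero-padded to 32 bits
    return out

sample = "A\U000252A6A\uA08C"
print(utf32_stream(sample, "big").hex(" ", 4).upper())
print(utf32_stream(sample, "little").hex(" ", 4).upper())
```

The little-endian output is the big-endian output with the four bytes of every code unit reversed, including those of the BOM.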

VBScript Character Set

Although VBScript can manipulate characters of different character sets, the characters used in VBScript scripting are restricted to

  • English Alphabet Capital Letters: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  • English Alphabet Small Letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
  • Arabic numerals: 0 1 2 3 4 5 6 7 8 9
  • Special Symbols of ASCII:  (space) !(exclamation mark) "(quotation mark) #(number sign) $(dollar sign) %(percent sign) &(ampersand) '(apostrophe) ((left parenthesis) )(right parenthesis) *(asterisk) +(plus) ,(comma) -(hyphen-minus) .(period-full stop) /(solidus-slash) :(colon) ;(semicolon) <(less-than sign - left angle bracket) =(equals) >(greater-than sign - right angle bracket) ?(question mark) @(commercial at sign) [(left square bracket) \(reverse solidus-backslash) ](right square bracket) ^(circumflex accent) _(low line - underline) `(grave accent) {(left curly bracket - left brace) |(vertical line - vertical bar) }(right curly bracket - right brace) ~(tilde)
  • Control Characters: Chr(0) nullchar, Chr(8) backspace, Chr(9) horizontal tab, Chr(10) linefeed, Chr(11) verticaltab, Chr(12) formfeed, and Chr(13) carriage return

The ASCII control characters Chr(8) backspace, Chr(9) horizontal tab, Chr(10) linefeed, and Chr(13) carriage return are supported by Microsoft Windows. They have no graphical representation but may affect the visual display of text, depending on the application, for example when forming the layout of a VBScript script. Chr(0) nullchar is the null character, whose value is 0. However, Chr(11) verticaltab and Chr(12) formfeed are not useful in Microsoft Windows.

The printable characters, Chr(32)-Chr(126), are supported and used in VBScript scripting, but Chr(127) delete is not supported. The alphabetical and numeric characters play a key role in VBScript. The special symbolic characters in the Visual Basic character set also serve various functions in VBScript, for example organizing a script and defining its tasks, but not specifying an operation to be performed.

The extended part, Chr(128)-Chr(255), consists of ANSI characters that depend on the code page specified on the local computer system. These characters are supported by Microsoft Windows but are not used in VBScript scripting.
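The grouping of character codes described in this section can be summarized as a small lookup. The Python sketch below only restates the classification given in the text; the function name classify_code and the category labels are hypothetical, and codes the text does not mention are labelled as such.

```python
SUPPORTED_CONTROLS = {0: "nullchar", 8: "backspace", 9: "horizontal tab",
                      10: "linefeed", 13: "carriage return"}
UNUSED_CONTROLS = {11: "verticaltab", 12: "formfeed"}

def classify_code(code):
    """Classify a character code 0-255 following the groups described in the text."""
    if not 0 <= code <= 255:
        raise ValueError("code must be in the range 0-255")
    if code in SUPPORTED_CONTROLS:
        return "control: " + SUPPORTED_CONTROLS[code]
    if code in UNUSED_CONTROLS:
        return "control, not useful in Windows: " + UNUSED_CONTROLS[code]
    if code < 32:
        return "control, not covered in the text"
    if code <= 126:
        return "printable ASCII: " + chr(code)
    if code == 127:
        return "delete, not supported"
    return "extended ANSI, depends on code page"

for code in (9, 11, 65, 127, 200):
    print(code, "->", classify_code(code))
```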

Examples

Examples of Chr function

HTML Web Page In-line Output:
Results on Microsoft Windows 8.1 Pro x64, Microsoft-IIS/8.5, VBScript Version 5.8 of page /sideway/default.asp
vars        Chr(num), TypeName
num= 8      [BS], String
num= 9      [HT], String
num= 10     [LF], String
num= 13     [CR], String
num= 32     [ ], String
num= 33     [!], String
num= 34     ["], String
num= 35     [#], String
num= 36     [$], String
num= 37     [%], String
num= 38     [&], String
num= 39     ['], String
num= 40     [(], String
num= 41     [)], String
num= 42     [*], String
num= 43     [+], String
num= 44     [,], String
num= 45     [-], String
num= 46     [.], String
num= 47     [/], String
num= 48     [0], String
num= 49     [1], String
num= 50     [2], String
num= 51     [3], String
num= 52     [4], String
num= 53     [5], String
num= 54     [6], String
num= 55     [7], String
num= 56     [8], String
num= 57     [9], String
num= 58     [:], String
num= 59     [;], String
num= 60     [<], String
num= 61     [=], String
num= 62     [>], String
num= 63     [?], String
num= 64     [@], String
num= 65     [A], String
num= 66     [B], String
num= 67     [C], String
num= 68     [D], String
num= 69     [E], String
num= 70     [F], String
num= 71     [G], String
num= 72     [H], String
num= 73     [I], String
num= 74     [J], String
num= 75     [K], String
num= 76     [L], String
num= 77     [M], String
num= 78     [N], String
num= 79     [O], String
num= 80     [P], String
num= 81     [Q], String
num= 82     [R], String
num= 83     [S], String
num= 84     [T], String
num= 85     [U], String
num= 86     [V], String
num= 87     [W], String
num= 88     [X], String
num= 89     [Y], String
num= 90     [Z], String
num= 91     [[], String
num= 92     [\], String
num= 93     []], String
num= 94     [^], String
num= 95     [_], String
num= 96     [`], String
num= 97     [a], String
num= 98     [b], String
num= 99     [c], String
num= 100    [d], String
num= 101    [e], String
num= 102    [f], String
num= 103    [g], String
num= 104    [h], String
num= 105    [i], String
num= 106    [j], String
num= 107    [k], String
num= 108    [l], String
num= 109    [m], String
num= 110    [n], String
num= 111    [o], String
num= 112    [p], String
num= 113    [q], String
num= 114    [r], String
num= 115    [s], String
num= 116    [t], String
num= 117    [u], String
num= 118    [v], String
num= 119    [w], String
num= 120    [x], String
num= 121    [y], String
num= 122    [z], String
num= 123    [{], String
num= 124    [|], String
num= 125    [}], String
num= 126    [~], String


ID: 180400011 Last Updated: 2018/4/11 Revision:


Copyright © 2000-2018 Sideway. All rights reserved. Last modified on 08 Mar 2018.