Sideway
output.to from Sideway
Draft for Information Only

Content

Regular Expression Best Practices
   Input Source
  Handle Object Instantiation Appropriately
   Static Regular Expressions
   Interpreted vs. Compiled Regular Expressions
   Regular Expressions: Compiled to an Assembly
  Take Charge of Backtracking
  Use Time-out Values
  Capture Only When Necessary
 Scanning of Valid Email Format
 See also
 Source/Referencee

Regular Expression Best Practices

The regular expression engine in .NET is a text searching tool that processes text based on pattern matches rather than on comparing and matching literal text. However, in some cases, the regular expression engine can appear to be very slow due to the excessive backtracking processes caused by the regular expression pattern design.

 Input Source

In general, there are two types of input string, constrained or unconstrained. Constrained input is text that originates from a known or reliable source and follows a predefined format. Unconstrained input is text that originates from an unreliable source, such as a web user, and may not follow a predefined or expected format.

Regular expression patterns are typically written to match valid input. That is, developers examine the text that they want to match and then write a regular expression pattern that matches it. Developers then determine whether this pattern requires correction or further elaboration by testing it with multiple valid input items. When the pattern matches all presumed valid inputs, it is declared to be production-ready and can be included in a released application. This makes a regular expression pattern suitable for matching constrained input. However, it does not make it suitable for matching unconstrained input.

To match unconstrained input, a regular expression must be able to efficiently handle three kinds of text:

  • Text that matches the regular expression pattern.

  • Text that does not match the regular expression pattern.

  • Text that nearly matches the regular expression pattern.

The last text type is especially problematic for a regular expression that has been written to handle constrained input. If that regular expression also relies on extensive backtracking, the regular expression engine can spend an inordinate amount of time (in some cases, many hours or days) processing seemingly innocuous text.

Because this regular expression was developed solely by considering the format of input to be matched, it fails to take account of input that does not match the pattern. This, in turn, can allow unconstrained input that nearly matches the regular expression pattern to significantly degrade performance.

To solve this problem, you can do the following:

  • When developing a pattern, you should consider how backtracking might affect the performance of the regular expression engine, particularly if your regular expression is designed to process unconstrained input. For more information, see the Take Charge of Backtracking section.

  • Thoroughly test your regular expression using invalid and near-valid input as well as valid input. To generate input for a particular regular expression randomly, you can use Rex, which is a regular expression exploration tool from Microsoft Research.

Handle Object Instantiation Appropriately

At the heart of .NET’s regular expression object model is the System.Text.RegularExpressions.Regex class, which represents the regular expression engine. Often, the single greatest factor that affects regular expression performance is the way in which the Regex engine is used. Defining a regular expression involves tightly coupling the regular expression engine with a regular expression pattern. That coupling process, whether it involves instantiating a Regex object by passing its constructor a regular expression pattern or calling a static method by passing it the regular expression pattern along with the string to be analyzed, is by necessity an expensive one.

For a more detailed discussion of the performance implications of using interpreted and compiled regular expressions, see Optimizing Regular Expression Performance, Part II: Taking Charge of Backtracking in the BCL Team blog.

You can couple the regular expression engine with a particular regular expression pattern and then use the engine to match text in several ways:

  • You can call a static pattern-matching method, such as Regex.Match(String, String). This does not require instantiation of a regular expression object.

  • You can instantiate a Regex object and call an instance pattern-matching method of an interpreted regular expression. This is the default method for binding the regular expression engine to a regular expression pattern. It results when a Regex object is instantiated without an options argument that includes the Compiled flag.

  • You can instantiate a Regex object and call an instance pattern-matching method of a compiled regular expression. Regular expression objects represent compiled patterns when a Regex object is instantiated with an options argument that includes the Compiled flag.

  • You can create a special-purpose Regex object that is tightly coupled with a particular regular expression pattern, compile it, and save it to a standalone assembly. You do this by calling the Regex.CompileToAssembly method.

The particular way in which you call regular expression matching methods can have a significant impact on your application. The following sections discuss when to use static method calls, interpreted regular expressions, and compiled regular expressions to improve your application's performance.

The form of the method call (static, interpreted, compiled) affects performance if the same regular expression is used repeatedly in method calls, or if an application makes extensive use of regular expression objects.

Static Regular Expressions

Static regular expression methods are recommended as an alternative to repeatedly instantiating a regular expression object with the same regular expression. Unlike regular expression patterns used by regular expression objects, either the operation codes or the compiled Microsoft intermediate language (MSIL) from patterns used in instance method calls is cached internally by the regular expression engine.

A very inefficient implementation of the IsValidCurrency method is shown in the following example. Note that each method call reinstantiates a Regex object with the same pattern. This, in turn, means that the regular expression pattern must be recompiled each time the method is called.

You should replace this inefficient code with a call to the static Regex.IsMatch(String, String) method. This eliminates the need to instantiate a Regex object each time you want to call a pattern-matching method, and enables the regular expression engine to retrieve a compiled version of the regular expression from its cache.

By default, the last 15 most recently used static regular expression patterns are cached. For applications that require a larger number of cached static regular expressions, the size of the cache can be adjusted by setting the Regex.CacheSize property.

Interpreted vs. Compiled Regular Expressions

Regular expression patterns that are not bound to the regular expression engine through the specification of the Compiled option are interpreted. When a regular expression object is instantiated, the regular expression engine converts the regular expression to a set of operation codes. When an instance method is called, the operation codes are converted to MSIL and executed by the JIT compiler. Similarly, when a static regular expression method is called and the regular expression cannot be found in the cache, the regular expression engine converts the regular expression to a set of operation codes and stores them in the cache. It then converts these operation codes to MSIL so that the JIT compiler can execute them. Interpreted regular expressions reduce startup time at the cost of slower execution time. Because of this, they are best used when the regular expression is used in a small number of method calls, or if the exact number of calls to regular expression methods is unknown but is expected to be small. As the number of method calls increases, the performance gain from reduced startup time is outstripped by the slower execution speed.

Regular expression patterns that are bound to the regular expression engine through the specification of the Compiled option are compiled. This means that, when a regular expression object is instantiated, or when a static regular expression method is called and the regular expression cannot be found in the cache, the regular expression engine converts the regular expression to an intermediary set of operation codes, which it then converts to MSIL. When a method is called, the JIT compiler executes the MSIL. In contrast to interpreted regular expressions, compiled regular expressions increase startup time but execute individual pattern-matching methods faster. As a result, the performance benefit that results from compiling the regular expression increases in proportion to the number of regular expression methods called.

To summarize, we recommend that you use interpreted regular expressions when you call regular expression methods with a specific regular expression relatively infrequently. You should use compiled regular expressions when you call regular expression methods with a specific regular expression relatively frequently. The exact threshold at which the slower execution speeds of interpreted regular expressions outweigh gains from their reduced startup time, or the threshold at which the slower startup times of compiled regular expressions outweigh gains from their faster execution speeds, is difficult to determine. It depends on a variety of factors, including the complexity of the regular expression and the specific data that it processes. To determine whether interpreted or compiled regular expressions offer the best performance for your particular application scenario, you can use the Stopwatch class to compare their execution times.

Regular Expressions: Compiled to an Assembly

.NET also enables you to create an assembly that contains compiled regular expressions. This moves the performance hit of regular expression compilation from run time to design time. However, it also involves some additional work: You must define the regular expressions in advance and compile them to an assembly. The compiler can then reference this assembly when compiling source code that uses the assembly’s regular expressions. Each compiled regular expression in the assembly is represented by a class that derives from Regex.

To compile regular expressions to an assembly, you call the Regex.CompileToAssembly(RegexCompilationInfo[], AssemblyName) method and pass it an array of RegexCompilationInfo objects that represent the regular expressions to be compiled, and an AssemblyName object that contains information about the assembly to be created.

We recommend that you compile regular expressions to an assembly in the following situations:

  • If you are a component developer who wants to create a library of reusable regular expressions.

  • If you expect your regular expression's pattern-matching methods to be called an indeterminate number of times -- anywhere from once or twice to thousands or tens of thousands of times. Unlike compiled or interpreted regular expressions, regular expressions that are compiled to separate assemblies offer performance that is consistent regardless of the number of method calls.

If you are using compiled regular expressions to optimize performance, you should not use reflection to create the assembly, load the regular expression engine, and execute its pattern-matching methods. This requires that you avoid building regular expression patterns dynamically, and that you specify any pattern-matching options (such as case-insensitive pattern matching) at the time the assembly is created. It also requires that you separate the code that creates the assembly from the code that uses the regular expression.

Take Charge of Backtracking

Ordinarily, the regular expression engine uses linear progression to move through an input string and compare it to a regular expression pattern. However, when indeterminate quantifiers such as *, +, and ? are used in a regular expression pattern, the regular expression engine may give up a portion of successful partial matches and return to a previously saved state in order to search for a successful match for the entire pattern. This process is known as backtracking.

For more information on backtracking, see Details of Regular Expression Behavior and Backtracking. For a detailed discussion of backtracking, see Optimizing Regular Expression Performance, Part II: Taking Charge of Backtracking in the BCL Team blog.

Support for backtracking gives regular expressions power and flexibility. It also places the responsibility for controlling the operation of the regular expression engine in the hands of regular expression developers. Because developers are often not aware of this responsibility, their misuse of backtracking or reliance on excessive backtracking often plays the most significant role in degrading regular expression performance. In a worst-case scenario, execution time can double for each additional character in the input string. In fact, by using backtracking excessively, it is easy to create the programmatic equivalent of an endless loop if input nearly matches the regular expression pattern; the regular expression engine may take hours or even days to process a relatively short input string.

Often, applications pay a performance penalty for using backtracking despite the fact that backtracking is not essential for a match.

Because a word boundary is not the same as, or a subset of, a word character, there is no possibility that the regular expression engine will cross a word boundary when matching word characters. This means that for this regular expression, backtracking can never contribute to the overall success of any match -- it can only degrade performance, because the regular expression engine is forced to save its state for each successful preliminary match of a word character.

If you determine that backtracking is not necessary, you can disable it by using the (?>subexpression) language element. The following example parses an input string by using two regular expressions. The first, \b\p{Lu}\w*\b, relies on backtracking. The second, \b\p{Lu}(?>\w*)\b, disables backtracking. As the output from the example shows, they both produce the same result.

In many cases, backtracking is essential for matching a regular expression pattern to input text. However, excessive backtracking can severely degrade performance and create the impression that an application has stopped responding. In particular, this happens when quantifiers are nested and the text that matches the outer subexpression is a subset of the text that matches the inner subexpression.

In addition to avoiding excessive backtracking, you should use the timeout feature to ensure that excessive backtracking does not severely degrade regular expression performance. For more information, see the Use Time-out Values section.

In these cases, you can optimize regular expression performance by removing the nested quantifiers and replacing the outer subexpression with a zero-width lookahead or lookbehind assertion. Lookahead and lookbehind assertions are anchors; they do not move the pointer in the input string, but instead look ahead or behind to check whether a specified condition is met.

The regular expression language in .NET includes the following language elements that you can use to eliminate nested quantifiers. For more information, see Grouping Constructs.

Language element Description
(?= subexpression ) Zero-width positive lookahead. Look ahead of the current position to determine whether subexpression matches the input string.
(?! subexpression ) Zero-width negative lookahead. Look ahead of the current position to determine whether subexpression does not match the input string.
(?<= subexpression ) Zero-width positive lookbehind. Look behind the current position to determine whether subexpression matches the input string.
(?<! subexpression ) Zero-width negative lookbehind. Look behind the current position to determine whether subexpression does not match the input string.

Use Time-out Values

If your regular expressions processes input that nearly matches the regular expression pattern, it can often rely on excessive backtracking, which impacts its performance significantly. In addition to carefully considering your use of backtracking and testing the regular expression against near-matching input, you should always set a time-out value to ensure that the impact of excessive backtracking, if it occurs, is minimized.

The regular expression time-out interval defines the period of time that the regular expression engine will look for a single match before it times out. The default time-out interval is Regex.InfiniteMatchTimeout, which means that the regular expression will not time out. You can override this value and define a time-out interval as follows:

If you have defined a time-out interval and a match is not found at the end of that interval, the regular expression method throws a RegexMatchTimeoutException exception. In your exception handler, you can choose to retry the match with a longer time-out interval, abandon the match attempt and assume that there is no match, or abandon the match attempt and log the exception information for future analysis.

Capture Only When Necessary

Regular expressions in .NET support a number of grouping constructs, which let you group a regular expression pattern into one or more subexpressions. The most commonly used grouping constructs in .NET regular expression language are (subexpression), which defines a numbered capturing group, and (?<name>subexpression), which defines a named capturing group. Grouping constructs are essential for creating backreferences and for defining a subexpression to which a quantifier is applied.

However, the use of these language elements has a cost. They cause the GroupCollection object returned by the Match.Groups property to be populated with the most recent unnamed or named captures, and if a single grouping construct has captured multiple substrings in the input string, they also populate the CaptureCollection object returned by the Group.Captures property of a particular capturing group with multiple Capture objects.

Often, grouping constructs are used in a regular expression only so that quantifiers can be applied to them, and the groups captured by these subexpressions are not subsequently used.

When you use subexpressions only to apply quantifiers to them, and you are not interested in the captured text, you should disable group captures. For example, the (?:subexpression) language element prevents the group to which it applies from capturing matched substrings.

You can disable captures in one of the following ways:

  • Use the (?:subexpression) language element. This element prevents the capture of matched substrings in the group to which it applies. It does not disable substring captures in any nested groups.

  • Use the ExplicitCapture option. It disables all unnamed or implicit captures in the regular expression pattern. When you use this option, only substrings that match named groups defined with the (?<name>subexpression) language element can be captured. The ExplicitCapture flag can be passed to the options parameter of a Regex class constructor or to the options parameter of a Regex static matching method.

  • Use the n option in the (?imnsx) language element. This option disables all unnamed or implicit captures from the point in the regular expression pattern at which the element appears. Captures are disabled either until the end of the pattern or until the (-n) option enables unnamed or implicit captures. For more information, see Miscellaneous Constructs.

  • Use the n option in the (?imnsx:subexpression) language element. This option disables all unnamed or implicit captures in subexpression. Captures by any unnamed or implicit nested capturing groups are disabled as well.

 

 

Scanning of Valid Email Format

Examples of Scanning of Valid Email Format.
ASP.NET Code Input:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
       <title>Sample Page</title>
       <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
       <script runat="server">
           Sub Page_Load()
               Dim xstring As String = "dahhhvid.johgnes@prghjjosehware.com"
               Dim xmatchstr As String = ""
               lbl01.Text = xmatchstr
           End Sub
           Function showresult(xstring,xpattern)
               Dim xmatch As Match
               Dim xcaptures As CaptureCollection
               Dim ycaptures As CaptureCollection
               Dim xgroups As GroupCollection
               Dim xmatchstr As String = ""
               Dim xint As Integer
               Dim yint As Integer
               Dim zint As Integer
               xmatchstr = xmatchstr & "<br />Result of Regex.Match(string,""" & Replace(xpattern,"<","&lt;") & """): """
               xmatch = Regex.Match(xstring,xpattern)
               xmatchstr = xmatchstr & xmatch.value & """<br />"
               xcaptures = xmatch.Captures
               xmatchstr = xmatchstr & "->Result of CaptureCollection.Count: """
               xmatchstr = xmatchstr & xcaptures.Count & """<br />"
               For xint = 0 to xcaptures.Count - 1
                   xmatchstr = xmatchstr & "->->Result of CaptureCollection("& xint & ").Value, Index, Length: """
                   xmatchstr = xmatchstr & xcaptures(xint).Value & ", " & xcaptures(xint).Index & ", " & xcaptures(xint).Length & """<br />"
                   xgroups = xmatch.Groups
                   xmatchstr = xmatchstr & "->Result of GroupCollection.Count: """
                   xmatchstr = xmatchstr & xgroups.Count & """<br />"
                   For yint = 0 to xgroups.Count - 1
                       xmatchstr = xmatchstr & "->->Result of GroupCollection("& yint & ").Value, Index, Length: """
                       xmatchstr = xmatchstr & xgroups(yint).Value & ", " & xgroups(yint).Index & ", " & xgroups(yint).Length & """<br />"
                       ycaptures = xgroups(yint).Captures
                       xmatchstr = xmatchstr & "->->->Result of CaptureCollection.Count: """
                       xmatchstr = xmatchstr & ycaptures.Count & """<br />"
                       For zint = 0 to ycaptures.Count - 1
                           xmatchstr = xmatchstr & "->->->->Result of CaptureCollection("& zint & ").Value, Index, Length: """
                           xmatchstr = xmatchstr & ycaptures(zint).Value & ", " & ycaptures(zint).Index & ", " & ycaptures(zint).Length & """<br />"
                       Next
                   Next
               Next
               Return xmatchstr
           End Function
       </script>
    </head>
    <body>
       <% Response.Write ("<h1>This is a Sample Page of Scanning of of Valid Email Format</h1>") %>
       <p>
           <%-- Set on Page_Load --%>
           <asp:Label id="lbl01" runat="server" />
       </p>
    </body>
</html>
HTML Web Page Embedded Output:

See also

Source/Referencee


©sideway

ID: 190800010 Last Updated: 8/10/2019 Revision: 0 Ref:

close

References

  1. Active Server Pages,  , http://msdn.microsoft.com/en-us/library/aa286483.aspx
  2. ASP Overview,  , http://msdn.microsoft.com/en-us/library/ms524929%28v=vs.90%29.aspx
  3. ASP Best Practices,  , http://technet.microsoft.com/en-us/library/cc939157.aspx
  4. ASP Built-in Objects,  , http://msdn.microsoft.com/en-us/library/ie/ms524716(v=vs.90).aspx
  5. Response Object,  , http://msdn.microsoft.com/en-us/library/ms525405(v=vs.90).aspx
  6. Request Object,  , http://msdn.microsoft.com/en-us/library/ms524948(v=vs.90).aspx
  7. Server Object (IIS),  , http://msdn.microsoft.com/en-us/library/ms525541(v=vs.90).aspx
  8. Application Object (IIS),  , http://msdn.microsoft.com/en-us/library/ms525360(v=vs.90).aspx
  9. Session Object (IIS),  , http://msdn.microsoft.com/en-us/library/ms524319(8v=vs.90).aspx
  10. ASPError Object,  , http://msdn.microsoft.com/en-us/library/ms524942(v=vs.90).aspx
  11. ObjectContext Object (IIS),  , http://msdn.microsoft.com/en-us/library/ms525667(v=vs.90).aspx
  12. Debugging Global.asa Files,  , http://msdn.microsoft.com/en-us/library/aa291249(v=vs.71).aspx
  13. How to: Debug Global.asa files,  , http://msdn.microsoft.com/en-us/library/ms241868(v=vs.80).aspx
  14. Calling COM Components from ASP Pages,  , http://msdn.microsoft.com/en-us/library/ms524620(v=VS.90).aspx
  15. IIS ASP Scripting Reference,  , http://msdn.microsoft.com/en-us/library/ms524664(v=vs.90).aspx
  16. ASP Keywords,  , http://msdn.microsoft.com/en-us/library/ms524672(v=vs.90).aspx
  17. Creating Simple ASP Pages,  , http://msdn.microsoft.com/en-us/library/ms524741(v=vs.90).aspx
  18. Including Files in ASP Applications,  , http://msdn.microsoft.com/en-us/library/ms524876(v=vs.90).aspx
  19. ASP Overview,  , http://msdn.microsoft.com/en-us/library/ms524929(v=vs.90).aspx
  20. FileSystemObject Object,  , http://msdn.microsoft.com/en-us/library/z9ty6h50(v=vs.84).aspx
  21. http://msdn.microsoft.com/en-us/library/windows/desktop/ms675944(v=vs.85).aspx,  , ADO Object Model
  22. ADO Fundamentals,  , http://msdn.microsoft.com/en-us/library/windows/desktop/ms680928(v=vs.85).aspx
close

Latest Updated LinksValid XHTML 1.0 Transitional Valid CSS!Nu Html Checker Firefox53 Chromena IExplorerna
IMAGE

Home 5

Business

Management

HBR 3

Information

Recreation

Hobbies 8

Culture

Chinese 1097

English 339

Reference 79

Computer

Hardware 249

Software

Application 213

Digitization 32

Latex 52

Manim 205

KB 1

Numeric 19

Programming

Web 289

Unicode 504

HTML 66

CSS 65

SVG 46

ASP.NET 270

OS 429

DeskTop 7

Python 72

Knowledge

Mathematics

Formulas 8

Algebra 84

Number Theory 206

Trigonometry 31

Geometry 34

Coordinate Geometry 2

Calculus 67

Complex Analysis 21

Engineering

Tables 8

Mechanical

Mechanics 1

Rigid Bodies

Statics 92

Dynamics 37

Fluid 5

Fluid Kinematics 5

Control

Process Control 1

Acoustics 19

FiniteElement 2

Natural Sciences

Matter 1

Electric 27

Biology 1

Geography 1


Copyright © 2000-2024 Sideway . All rights reserved Disclaimers last modified on 06 September 2019