# Regex Injection

*Regular expressions* (regex) are a way of describing the order and type of characters that occur in a string. They are often used to validate input or to search for “wildcard” matches within a set of strings. If a regular expression (rather than the string it is testing) is generated from untrusted input, or a regex that exists in your codebase is poorly designed, an attacker can perform a **regex injection** attack by sending malicious input that takes a huge amount of computing power to evaluate. This technique is often used to perform *denial-of-service* attacks on vulnerable web servers.

## Regex Injection in Python

The following Python application naively allows a “wildcard” search expression to be sent from the client, and evaluates the string as a regular expression against a list of potential matches:

```python
import re

@app.route('/search/<pattern>')
def search(pattern):
    # Vulnerable: the client-supplied string is compiled as a regex.
    regex = re.compile(pattern)
    matches = (item for item in ITEMS if regex.match(item))
    return matches
```

Regular expression matching can be very slow if the matching engine has to perform a lot of *backtracking*. This occurs when the regular expression contains a repeated group that itself contains a repeating symbol: the same text can then be divided between repetitions in many different ways, and the engine may have to try exponentially many of them while scanning a string.

In this case, an “evil” regex in the following form can be supplied:

```
(.*a){20}
```

This pattern means “twenty occurrences of: zero or more characters followed by the letter a”. This expression will require an enormous amount of compute time to evaluate against a string such as:

```
aaaaaaaaaaaaaaaaaaaa!
```

Sending many such search requests to your server gives an attacker an easy way to perform a denial-of-service attack.
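The growth in matching time can be observed directly. The following is a minimal sketch, assuming Python's built-in `re` module; the helper name `time_evil_match` is made up for this illustration, and the repeat counts are kept small so the script finishes quickly:

```python
import re
import time

def time_evil_match(n: int) -> float:
    """Time one match of (.*a){n} against n letter a's followed by '!'."""
    regex = re.compile(f"(.*a){{{n}}}")
    text = "a" * n + "!"
    start = time.perf_counter()
    regex.match(text)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Timings depend on the machine; only the growth between rows matters.
    for n in (8, 12, 16):
        print(f"n={n:2d}: {time_evil_match(n):.4f}s")
```

On a backtracking engine such as Python's `re`, each step up in `n` should take noticeably longer than the last, which is why a repeat count of 20 is already expensive.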

### Regex Patterns in Validation

Attackers can take advantage of inefficient regexes even when they do not have control of the form of the regular expression itself. By passing a maliciously crafted “email” parameter to a sign-up page, for instance, they can probe for slow-running validation expressions and attempt to take your website offline.
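As an illustration, here is a sketch of a validation pattern that an attacker could exploit without controlling the regex at all. The pattern `SLOW_EMAIL` and the helpers `is_valid_email` and `crafted_email` are hypothetical, written only for this example; the nested quantifier in the local part is the weakness:

```python
import re

# Hypothetical validation pattern, written for this illustration only.
# The nested quantifier "([a-zA-Z0-9]+\.?)+" lets a run of letters be
# split between iterations of the group in many different ways.
SLOW_EMAIL = re.compile(r"^([a-zA-Z0-9]+\.?)+@[a-zA-Z0-9]+\.com$")

def is_valid_email(value: str) -> bool:
    """Return True if the value matches the (inefficient) pattern."""
    return SLOW_EMAIL.match(value) is not None

def crafted_email(length: int) -> str:
    """Build an input that almost matches, forcing the engine to backtrack."""
    return "a" * length + "!"

if __name__ == "__main__":
    print(is_valid_email("jane.doe@example.com"))
    # Kept short on purpose; longer inputs take rapidly increasing time.
    print(is_valid_email(crafted_email(16)))
```

A legitimate address is accepted quickly, while the crafted value is rejected only after the engine has tried the many ways of splitting the run of letters. Each extra character roughly doubles that work, so an attacker only needs a modestly long input.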

## Mitigation

* Don’t generate regular expressions directly from untrusted input – define them as string literals in your codebase.

* Use a search index like Elasticsearch or Lucene for complex searches, rather than running regular expression matches on large datasets. For instance, this is how you run a search query against an Elasticsearch server:

```python
import os

from elasticsearch import Elasticsearch

SEARCH_INDEX = Elasticsearch(hosts=os.getenv("ES_HOSTS").split(","))

@app.route('/search/<pattern>')
def search(pattern):
    # The client's text is passed as data to a match query, never compiled as a regex.
    query = {
        "bool": {
            "must": {
                "match": {
                    "name": pattern
                }
            }
        }
    }

    matches = SEARCH_INDEX.search(index="main-index", query=query)

    # Each hit stores the indexed document under the "_source" key.
    return (document["_source"]["name"] for document in matches['hits']['hits'])
```

* Check any regexes within your codebase for repeating grouped patterns or ambiguous patterns. “Catastrophic” backtracking can be avoided if you follow these rules of thumb:

* Avoid nested quantifiers like `(a+)+`, where a pattern potentially matching multiple characters can be applied multiple times.
* Avoid quantified overlapping disjunctions like `(a|a)+`.
* Avoid quantified overlapping adjacencies like `\d+\d+`.
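The first and last mitigations can be sketched in a few lines. This is an illustrative sketch, not a complete defense: `literal_search`, `REWRITES`, and `accept_same_strings` are hypothetical names, and `re.escape` is offered as one option for cases where user text must appear inside a pattern, so that it is matched literally rather than interpreted as regex syntax. The rewrites pair each ambiguous pattern from the rules of thumb with an unambiguous pattern that accepts the same strings.

```python
import re

def literal_search(user_text: str, items: list) -> list:
    """Match user text literally; re.escape neutralizes regex metacharacters."""
    regex = re.compile(re.escape(user_text))
    return [item for item in items if regex.search(item)]

# Ambiguous patterns paired with unambiguous rewrites (assumed equivalents).
REWRITES = {
    r"(a+)+": r"a+",        # nested quantifier
    r"(a|a)+": r"a+",       # overlapping disjunction
    r"\d+\d+": r"\d{2,}",   # overlapping adjacency
}

def accept_same_strings(slow: str, fast: str, samples: list) -> bool:
    """Check that both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(slow, s)) == bool(re.fullmatch(fast, s))
        for s in samples
    )

if __name__ == "__main__":
    print(literal_search("(.*a){20}", ["aaa", "x(.*a){20}y"]))
    samples = ["", "a", "aaa", "ab", "1", "12", "123", "1a"]
    for slow, fast in REWRITES.items():
        print(slow, "->", fast, accept_same_strings(slow, fast, samples))
```

Checking a rewrite against sample strings does not prove the two patterns are equivalent, but it is a cheap regression test to keep next to any regex that has been simplified.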

## CWEs

* [CWE-185](https://cwe.mitre.org/data/definitions/185.html)
