# Regex Injection

*Regular expressions* (regex) are a way of describing the order and type of characters that occur in a string. They are often used to validate input or search for “wildcard” matches within a set of strings. If the regular expression itself (rather than the string it is tested against) is generated from untrusted input, or a regex that exists in your codebase is poorly designed, an attacker can perform a **regex injection** attack by sending malicious input that takes a huge amount of computing power to evaluate. This technique is often used to perform *denial-of-service* attacks on vulnerable web servers.

## Regex Injection in Python

The following Python application naively allows a “wildcard” search expression to be sent from the client, and evaluates the string as a regular expression against a list of potential matches:

```python
import re

@app.route('/search/<pattern>')
def search(pattern):
    # DANGEROUS: the pattern comes straight from the URL and is compiled as-is.
    regex = re.compile(pattern)
    matches = (item for item in ITEMS if regex.match(item))
    return matches
```

Regular expression matching can be very slow if the expression matching engine has to perform a lot of *backtracking*. This occurs when the regular expression contains repeated matching groups, each of which contains a repeating symbol – which means the engine has to evaluate exponentially more different logical branches while scanning a string.

In this case, an “evil” regex in the following form can be supplied:

```
(.*a){20}
```

This pattern means “twenty occurrences of: zero or more characters followed by the letter a”. This expression will require an enormous amount of compute time to evaluate against a string such as:

```
aaaaaaaaaaaaaaaaaaaa!
```
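The growth in work can be observed with a small timing sketch. It assumes Python's built-in `re` module and uses two hypothetical helpers, `evil_pattern` and `time_failed_match`, that are not part of the application above. To force the engine through every branch, the input here contains one `a` fewer than the pattern requires, so the match can never succeed. The repeat counts are kept small on purpose.

```python
import re
import time


def evil_pattern(groups: int) -> str:
    # Same shape as the pattern above, with a configurable repeat count.
    return "(.*a){%d}" % groups


def time_failed_match(n: int) -> float:
    """Time re.match on an input with one 'a' fewer than the pattern needs."""
    regex = re.compile(evil_pattern(n))
    text = "a" * (n - 1) + "!"  # cannot match, so every branch gets explored
    start = time.perf_counter()
    result = regex.match(text)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed


if __name__ == "__main__":
    # Keep n small: each extra character roughly doubles the work.
    for n in (8, 12, 16):
        print(f"n={n:2d}  {time_failed_match(n):.4f}s")
```

Absolute timings depend on the machine and Python version, so the useful signal is how quickly the numbers climb as `n` grows.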

Sending many such search requests to your server gives an attacker an easy way to perform a denial-of-service attack.
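One way to close this particular hole, assuming the search only needs to match the user's text literally, is to escape the input before compiling it. The sketch below uses `re.escape` from the standard library. The `ITEMS` list and the `literal_prefix_search` helper are hypothetical stand-ins for the application's own data and route logic.

```python
import re

# Hypothetical stand-in for the application's ITEMS list.
ITEMS = ["apple", "a.c", "abc", "(.*a){20}"]


def literal_prefix_search(term: str, items=ITEMS) -> list:
    """Return items that start with `term`, treating it as plain text."""
    # re.escape backslash-escapes regex metacharacters, so user input
    # cannot introduce groups or quantifiers.
    regex = re.compile(re.escape(term))
    return [item for item in items if regex.match(item)]
```

With this in place, a request for `(.*a){20}` is compared character by character against the items instead of being interpreted as a pattern. If only prefix matching is needed, `str.startswith` would likely do the same job without involving the regex engine at all.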

### Regex Patterns in Validation

Attackers can take advantage of inefficient regexes even when they do not have control of the form of the regular expression itself. By passing a maliciously crafted “email” parameter to a sign-up page, for instance, they can probe for slow-running validation expressions and attempt to take your website offline.
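Because backtracking cost grows with the length of the input, capping that length before the regex runs limits how much work a crafted value can cause. The following is a minimal sketch of that idea, not a complete email validator. The `MAX_EMAIL_LENGTH` value, the `EMAIL_SHAPE` pattern, and the `looks_like_email` helper are all assumptions made for illustration.

```python
import re

MAX_EMAIL_LENGTH = 254  # assumed cap; adjust to your own requirements

# Deliberately simple shape check: something@something.something
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(value: str) -> bool:
    # Reject oversized input before the regex engine ever sees it.
    if len(value) > MAX_EMAIL_LENGTH:
        return False
    return EMAIL_SHAPE.fullmatch(value) is not None
```

The pattern is intentionally loose and contains no quantified groups. Stricter checks, such as sending a confirmation email, can happen after this cheap first filter.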

## Mitigation

* Don’t generate regular expressions directly from untrusted input – define them as string literals in your codebase.

* Use a search index like Elasticsearch or Lucene for complex searches, rather than running regular expression matches on large datasets. For instance, this is how you run a search query against an Elasticsearch server:

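```python
import os

from elasticsearch import Elasticsearch

# ES_HOSTS is read as a comma-separated list of Elasticsearch hosts.
SEARCH_INDEX = Elasticsearch(hosts=os.getenv("ES_HOSTS").split(','))

@app.route('/search/<pattern>')
def search(pattern):
    # The user's text is passed as data to a match query, never compiled as a regex.
    query = {
        "bool": {
            "must": {
                "match": {
                    "name": pattern
                }
            }
        }
    }

    matches = SEARCH_INDEX.search(index="main-index", query=query)

    # Each hit keeps the indexed document's fields under "_source".
    return (document["_source"]["name"] for document in matches['hits']['hits'])
```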

* Check any regexes within your codebase for repeating grouped patterns or ambiguous patterns. “Catastrophic” backtracking can be avoided if you follow these rules of thumb:

  * Avoid nested quantifiers like `(a+)+`, where a pattern potentially matching multiple characters can be applied multiple times.
  * Avoid quantified overlapping disjunctions like `(a|a)+`.
  * Avoid quantified overlapping adjacencies like `\d+\d+`.
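Each of the risky shapes above can often be replaced by a simpler pattern that accepts the same strings. The sketch below pairs each example with one possible rewrite and checks that both give the same verdict on a handful of short inputs. The `REWRITES` table, the `SAMPLES` list, and the `same_verdicts` helper are illustrative assumptions. Agreement on a few samples is a sanity check, not a proof of equivalence.

```python
import re

# (risky pattern, safer pattern intended to accept the same strings)
REWRITES = [
    (r"(a+)+", r"a+"),        # nested quantifier
    (r"(a|a)+", r"a+"),       # overlapping alternatives
    (r"\d+\d+", r"\d{2,}"),   # overlapping adjacent quantifiers
]

# Short samples only, so the risky patterns stay cheap to evaluate here.
SAMPLES = ["", "a", "aa", "aaa", "aab", "b", "1", "12", "123", "12x", "x"]


def same_verdicts(risky: str, safer: str, samples=SAMPLES) -> bool:
    """True if both patterns accept and reject exactly the same samples."""
    return all(
        bool(re.fullmatch(risky, s)) == bool(re.fullmatch(safer, s))
        for s in samples
    )


if __name__ == "__main__":
    for risky, safer in REWRITES:
        print(risky, "->", safer, same_verdicts(risky, safer))
```

A check like this could sit in a test suite next to each pattern that gets rewritten, with samples drawn from the inputs the application actually expects.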

## CWEs

* [CWE-185](https://cwe.mitre.org/data/definitions/185.html)
