# Regex Injection
*Regular expressions* (regex) are a way of describing the order and type of characters that occur in a string. They are often used to validate input or search for “wildcard” matches within a set of strings. If the regular expression (rather than the string it is testing) is generated from untrusted input, or a regex that exists in your codebase is poorly designed, an attacker can perform a **regex injection** attack by sending malicious input that takes a huge amount of computing power to evaluate. This technique is often used to perform *denial-of-service* attacks on vulnerable web servers.
## Regex Injection in Python
The following Python application naively allows a “wildcard” search expression to be sent from the client, and evaluates the string as a regular expression against a list of potential matches:
```python
import re

@app.route('/search/<pattern>')
def search(pattern):
    # DANGER: the client-supplied pattern is compiled as-is.
    regex = re.compile(pattern)
    matches = (item for item in ITEMS if regex.match(item))
    return matches
```
Regular expression matching can be very slow if the matching engine has to perform a lot of *backtracking*. This occurs when the expression contains a repeated group that itself contains a repeated symbol: the engine then has to try exponentially many ways of dividing the string between the repetitions before it can settle on a result.
In this case, an “evil” regex in the following form can be supplied:
```
(.*a){20}
```
This pattern means “twenty occurrences of: zero or more characters followed by the letter a”. This expression will require an enormous amount of compute time to evaluate against a string such as:
```
aaaaaaaaaaaaaaaaaaaa!
```
Sending many such search requests to your server gives an attacker an easy way to perform a denial-of-service attack.
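
To get a feel for how quickly the cost grows, the following sketch times the pattern above for a few smaller repetition counts. The helper name `time_match` and the chosen counts are illustrative assumptions, not part of the application shown earlier. Absolute timings depend on the machine and the Python version, but each extra repetition tends to roughly double the work.

```python
import re
import time


def time_match(n):
    """Return (matched, seconds) for (.*a){n} against n 'a's followed by '!'."""
    regex = re.compile(r"(.*a){%d}" % n)
    subject = "a" * n + "!"
    start = time.perf_counter()
    matched = regex.match(subject) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Small counts only: the work grows steeply as n rises.
    for n in (10, 12, 14, 16):
        matched, seconds = time_match(n)
        print(f"n={n:2d} matched={matched} took {seconds:.4f}s")
```

The match does eventually succeed, but only after the engine has abandoned a very large number of greedy attempts, which is where the time goes.
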
### Regex Patterns in Validation
Attackers can take advantage of inefficient regexes even when they do not have control of the form of the regular expression itself. By passing a maliciously crafted “email” parameter to a sign-up page, for instance, they can probe for slow-running validation expressions and attempt to take your website offline.
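
As an illustration, consider a hypothetical sign-up validator whose pattern nests one quantifier inside another. The pattern `EMAIL_RE`, the functions `is_valid_email` and `probe`, and the `example.com` domain are invented for this sketch and are not taken from any real application. The point is only that the attacker controls the input string, not the regex.

```python
import re
import time

# Hypothetical, deliberately flawed pattern: ([a-z0-9]+)* nests a
# quantifier inside a quantifier, so a run of letters can be split
# between iterations in exponentially many ways.
EMAIL_RE = re.compile(r"^([a-z0-9]+)*@example\.com$")


def is_valid_email(value):
    """Return True if value matches the (flawed) validation pattern."""
    return EMAIL_RE.match(value) is not None


def probe(length):
    """Time validation of a crafted input: a run of letters with no '@'."""
    crafted = "a" * length + "!"
    start = time.perf_counter()
    is_valid_email(crafted)
    return time.perf_counter() - start


if __name__ == "__main__":
    print(is_valid_email("alice@example.com"))
    # Keep lengths small; each extra character roughly doubles the work.
    for length in (12, 14, 16, 18):
        print(length, f"{probe(length):.4f}s")
```

A legitimate address is accepted almost instantly, while the crafted inputs are rejected more and more slowly as they grow, which is the behaviour an attacker probes for.
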
## Mitigation
* Don’t generate regular expressions directly from untrusted input – define them as string literals in your codebase.
* Use a search index like Elasticsearch or Lucene for complex searches, rather than running regular expression matches on large datasets. For instance, this is how you run a search query against an Elasticsearch server:
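  ```python
  import os

  from elasticsearch import Elasticsearch

  SEARCH_INDEX = Elasticsearch(hosts=os.getenv("ES_HOSTS").split(','))

  @app.route('/search/<pattern>')
  def search(pattern):
      # Assumption: a simple match query on the "name" field.
      query = {"match": {"name": pattern}}
      matches = SEARCH_INDEX.search(index="main-index", query=query)
      # Each hit carries the stored document under "_source".
      return (document["_source"]["name"] for document in matches['hits']['hits'])
  ```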
* Check any regexes within your codebase for repeating grouped patterns or ambiguous patterns. “Catastrophic” backtracking can be avoided if you follow these rules of thumb:
  * Avoid nested quantifiers like `(a+)+`, where a pattern potentially matching multiple characters can be applied multiple times.
  * Avoid quantified overlapping disjunctions like `(a|a)+`.
  * Avoid quantified overlapping adjacencies like `\d+\d+`.
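
The first and last bullets can be sketched as follows. `re.escape` is part of Python's standard library; the function names `literal_search` and `same_verdict`, the sample `ITEMS`, and the rewritten patterns are assumptions made for this example. The rewrites are one possible way to remove the ambiguity while accepting the same strings, and they are checked here only against a handful of samples.

```python
import re

ITEMS = ["apple", "a.c", "abc", "banana"]  # sample data for this sketch


def literal_search(user_text, items=ITEMS):
    """Treat untrusted text as a literal prefix, not as a regex."""
    # re.escape neutralises metacharacters such as ( ) . * + { }
    regex = re.compile(re.escape(user_text))
    return [item for item in items if regex.match(item)]


# Ambiguous patterns from the rules of thumb, paired with unambiguous
# rewrites that accept the same strings.
REWRITES = [
    (r"(a+)+", r"a+"),      # nested quantifier
    (r"(a|a)+", r"a+"),     # overlapping alternatives
    (r"\d+\d+", r"\d\d+"),  # overlapping adjacent quantifiers
]


def same_verdict(ambiguous, rewrite, samples):
    """Check that both patterns accept or reject each whole sample alike."""
    return all(
        bool(re.fullmatch(ambiguous, s)) == bool(re.fullmatch(rewrite, s))
        for s in samples
    )


if __name__ == "__main__":
    print(literal_search("a.c"))        # the dot is matched literally
    print(literal_search("(.*a){20}"))  # the "evil" pattern is just text here
    samples = ["", "a", "aaaa", "ab", "1", "12", "123", "12a"]
    for ambiguous, rewrite in REWRITES:
        print(ambiguous, "->", rewrite, same_verdict(ambiguous, rewrite, samples))
```

Escaping gives up wildcard support entirely, so it only fits cases where a literal match is acceptable; for richer searches the search-index approach above is the better fit.
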
## CWEs
* [CWE-185](https://cwe.mitre.org/data/definitions/185.html)