test - regular expression in python for beginners

Creating a new regex based on the returned results and rules of a previous regex | Indexing a regex and seeing how the regex has matched a substring (3)

awk solution. The requirements are not that complicated: a simple script can do the trick. There's just one complication: every regex that is a result from your first match has to be matched against all lines of the second file. Here's where we use xargs to solve that.

Now, whatever language you pick, it looks like the number of matches being made is going to be extensive, so some remarks about the regexes need to be made first.

The regex for the first file is going to be slow, because in

[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)

the number of possibilities for the first part, [AGCT]{1,12000}, is huge. Actually, it only says: pick any element from A, C, G, T and do that between 1 and 12000 times. Then match the rest. Couldn't we do a

AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA$

instead? The speed gain is considerable.
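As a rough check of that claim, here is a minimal Python sketch (not part of the awk solution) comparing the question's full pattern with the shortened, end-anchored one. The test sequence is the third example substring from the question; the extra grouping around p3 and p4 in the full pattern is added here only so the two results can be compared.

```python
import re

# Full pattern from the question (with p3 and p4 captured) and the
# shortened variant that drops the expensive leading [ACGT]{1,12000}.
FULL = re.compile(r"[ACGT]{1,12000}(AAC)([AG]{2,5})([ACGT]{2,5})(CTGTGTA)")
SHORT = re.compile(r"AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA$")

seq = "TAACAAGGACCCTGTGTA"  # T + AAC + AAGGA + CC + CTGTGTA

full = FULL.search(seq)
short = SHORT.search(seq)

# Both patterns capture the same two variable parts for this sequence.
print(full.group(2), full.group(3))    # AAGGA CC
print(short.group(1), short.group(2))  # AAGGA CC
```

For this input the two patterns agree on the parts that matter (p3 and p4); whether that holds for every line of a real data set depends on where the sequences end, so it is worth verifying on a sample first.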

A similar remark can be made about the regex for the second file. If you replace

[AG]{10,5000}

with

[AG]*

you will experience some improvement.

Since I opened this answer by saying the requirements are not that complicated, let's see some code:

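$ cat tst.awk
# needs GNU awk: the three-argument match() fills array a with the groups
match($0, /AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA$/, a) {
   r = sprintf("(CTAAA)[AC]{5,100}(TTTGGG)(%s)(CTT)[AG]*(%s)$", translate(a[1]), translate(a[2]));
   print r
}

# return the complement of word (A<->T, C<->G)
function translate(word) {
   cmd = "echo '" word "' | tr 'ACGT' 'TGCA'";
   res = ((cmd | getline line) > 0 ? line : "");
   close(cmd);
   return res
}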

What this will do is produce the regex for your second file. (I've added extra grouping for demo purposes). Now, let's take a look at the second script:

$ cat tst2.awk
match($0, regex, a){ printf("Found matches %s and %s\n", a[3], a[5]) }

What this will do is take a regex and match it against every line read from the second input file. We need to provide this script with a value for regex, like this:

$ awk -f tst.awk input1.txt | xargs -I {} -n 1 awk -v regex={} -f tst2.awk input2.txt

The -v option of awk lets us define the variable regex, whose value is fed into this call by the first script.
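For comparison, the same two-step pipeline can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the inputs are in-memory lists rather than files, and the line in input2 is a made-up string built to satisfy the generated pattern.

```python
import re

# Shortened first regex and complement table, mirroring tst.awk.
FIRST = re.compile(r"AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA$")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def second_regexes(file1_lines):
    """Yield one compiled second regex per line of file 1 that matches FIRST."""
    for line in file1_lines:
        m = FIRST.search(line.strip())
        if m:
            rule1 = m.group(1).translate(COMPLEMENT)
            rule2 = m.group(2).translate(COMPLEMENT)
            yield re.compile(
                f"(CTAAA)[AC]{{5,100}}(TTTGGG)({rule1})(CTT)[AG]*({rule2})$"
            )


def find_matches(file1_lines, file2_lines):
    """Match every generated regex against every line of file 2 (like xargs + tst2.awk)."""
    lines2 = [line.strip() for line in file2_lines]
    found = []
    for regex in second_regexes(file1_lines):
        for line in lines2:
            m = regex.search(line)
            if m:
                found.append((m.group(3), m.group(5)))
    return found


input1 = ["TAACAAGGACCCTGTGTA"]               # third example from the question
input2 = ["CTAAAACACCTTTGGGTTCCTCTTAGAGGG"]   # made-up line for illustration

for g3, g5 in find_matches(input1, input2):
    print(f"Found matches {g3} and {g5}")
```

Keeping everything in one process avoids starting a new awk for every generated regex, which may matter with thousands of patterns.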

$ cat input1.txt

$ cat input2.txt

and the result is:

$ awk -f tst.awk input1.txt | xargs -I {} -n 1 awk -v regex={} -f tst2.awk input2.txt
Found matches TTCCT and GG

In conclusion: should you use regexes to solve your problem? Yes, but don't be too ambitious and try to match the whole string in one go. Quantifiers like {1,12000} are going to slow you down, whatever language you pick.

I am particularly looking at R, Perl, and shell. But any other programming language would be fine too.


Is there a way to visually or programmatically inspect and index a matched string based on the regex? This is intended for referencing back to the first regex and its results inside of a second regex, so as to be able to modify a part of the matched string and write new rules for that particular part.

https://regex101.com does visualize how a certain string matches the regular expression. But it is far from perfect and is not efficient for my huge dataset.


I have around 12000 matched strings (DNA sequences) for my first regex, and I want to process these strings and based on some strict rules find some other strings in a second file that go well together with those 12000 matches based on those strict rules.


This is my first regex (a simplified, shorter version of my original regex) that runs through my first text file:

[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)

Let's suppose that it finds the following three sub-strings in my large text file:


Now I have a second file which includes a very large string. From this second file, I am only interested in extracting those sub-strings that match a new (second) regex which itself is dependent on my first regex in few sections. Therefore, this second regex has to take into account the substrings matched in the first file and look at how they have matched to the first regex!

Allow me, for the sake of simplicity, to index my first regex for better illustration in this way:

first.regex.p1 = [ACGT]{1,12000}
first.regex.p2 = (AAC)
first.regex.p3 = [AG]{2,5}
first.regex.p4 = [ACGT]{2,5}
first.regex.p5 = (CTGTGTA)

Now my second (new) regex which will search the second text file and will be dependent on the results of the first regex (and how the substrings returned from the first file have matched the first regex) will be defined in the following way:

second.regex = (CTAAA)[AC]{5,100}(TTTGGG){**rule1**} (CTT)[AG]{10,5000}{**rule2**}

In here rule1 and rule2 are dependent on the matches coming from the first regex on the first file. Hence;

rule1 = look at the matched strings from file1 and complement the pattern of first.regex.p3 that is found in the matched substring from file1 (the complement should of course have the same length)
rule2 = look at the matched strings from file1 and complement the pattern of first.regex.p4 that is found in the matched substring from file1 (the complement should of course have the same length)

You can see that second regex has sections that belong to itself (i.e. they are independent of any other file/regex), but it also has sections that are dependent on the results of the first file and the rules of the first regex and how each sub-string in the first file has matched that first regex!

Now again for the sake of simplicity, I use the third matched substring from file1 (because it is shorter than the other two) to show you what a possible match from the second file looks like and how it satisfies the second regex:

This is what we had from our first regex run through the first file:

TAACAAGGACCCTGTGTA

So in this match, we see that:

T has matched first.regex.p1
AAC has matched first.regex.p2
AAGGA has matched first.regex.p3
CC has matched first.regex.p4
CTGTGTA has matched first.regex.p5
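This kind of indexing can also be done programmatically. The following Python sketch uses named groups, with the names p1 to p5 chosen here to mirror first.regex.p1 to first.regex.p5; the helper name describe is hypothetical. It reports which text each part matched and at which offsets.

```python
import re

# The first regex with each part wrapped in a named group.
FIRST = re.compile(
    r"(?P<p1>[ACGT]{1,12000})"
    r"(?P<p2>AAC)"
    r"(?P<p3>[AG]{2,5})"
    r"(?P<p4>[ACGT]{2,5})"
    r"(?P<p5>CTGTGTA)"
)


def describe(seq):
    """Return (name, matched text, (start, end)) for every part, or [] if no match."""
    m = FIRST.search(seq)
    if m is None:
        return []
    # Order the group names by their position in the pattern.
    names = sorted(FIRST.groupindex, key=FIRST.groupindex.get)
    return [(name, m.group(name), m.span(name)) for name in names]


for name, text, span in describe("TAACAAGGACCCTGTGTA"):
    print(name, text, span)
```

The spans are 0-based, end-exclusive offsets into the input string, so they can be used to slice out or modify a single part.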

Now in our second regex for the second file we see that when looking for a substring that matches the second regex, we are dependent on the results coming from the first file (which match the first regex). Particularly we need to look at the matched substrings and complement the parts that matched first.regex.p3 and first.regex.p4 (rule1 and rule2 from second.regex).

complement means:
A will be substituted by T
T -> A
G -> C
C -> G

So if you have TAAA, the complement will be ATTT.
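As a small illustration, the substitution table above can be written in Python with a translation table (the function name complement is just a label chosen here):

```python
# Map each base to its complement: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Return the complement of seq; the length is unchanged."""
    return seq.translate(COMPLEMENT)


print(complement("TAAA"))  # ATTT
```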

Therefore, going back to this example:

TAACAAGGACCCTGTGTA

We need to complement the following to satisfy the requirements of the second regex:

AAGGA has matched first.regex.p3
CC has matched first.regex.p4

And complements are:

TTCCT (based on rule1)
GG (based on rule2)

So an example of a substring that matches second.regex is this:


This is only one example! But in my case I have 12000 matched substrings!! I cannot figure out how to even approach this problem. I have tried writing pure regex but I have completely failed to implement anything that properly follows this logic.. Perhaps I shouldn't be even using regex?

Is it possible to do this entirely with regex? Or should I look at another approach? Is it possible to index a regex and in the second regex reference back to the first regex and force the regex to consider the matched substrings as returned by first regex?

This can be done programmatically in Perl, or any other language.

Since you need input from two different files, you cannot do this in pure regex, as regex cannot read files. You cannot even do it in one pattern, as no regex engine remembers what you matched before on a different input string. It has to be done in the program surrounding your matches, which should very well be regex, as that's what regex is meant for.

You can build the second pattern up step by step. I've implemented a more advanced version in Perl that can easily be adapted to suit other pattern combinations as well, without changing the actual code that does the work.

Instead of file 1, I will use the DATA section. It holds all three example input strings. Instead of file 2, I use your example output for the third input string.

The main idea behind this is to split up both patterns into sub-patterns. For the first one, we can simply use an array of patterns. For the second one, we create anonymous functions that we will call with the match results from the first pattern to construct the second complete pattern. Most of them just return a fixed string, but two actually take a value from the arguments to build the complements.
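For readers who do not know Perl, here is a rough Python sketch of the same idea, not a translation of the answer's code. The names FIRST_PARTS, SECOND_PARTS and build_second are made up for this illustration, and the test string at the end is a synthetic example constructed to fit the generated pattern.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    return seq.translate(COMPLEMENT)


# First regex, split into sub-patterns (p1..p5 from the question).
FIRST_PARTS = [r"[ACGT]{1,12000}", r"AAC", r"[AG]{2,5}", r"[ACGT]{2,5}", r"CTGTGTA"]

# Second regex as callbacks; each receives the tuple of parts matched by the
# first regex (0-based, so p3 is parts[2] and p4 is parts[3]).
SECOND_PARTS = [
    lambda parts: "CTAAA",
    lambda parts: "[AC]{5,100}",
    lambda parts: "TTTGGG",
    lambda parts: complement(parts[2]),  # rule1: complement of p3
    lambda parts: "CTT",
    lambda parts: "[AG]{10,5000}",
    lambda parts: complement(parts[3]),  # rule2: complement of p4
]

FIRST = re.compile("".join(f"({p})" for p in FIRST_PARTS))


def build_second(seq):
    """Build the second pattern from how seq matched the first one, or None."""
    m = FIRST.search(seq)
    if m is None:
        return None
    parts = m.groups()
    return "".join(f"({fn(parts)})" for fn in SECOND_PARTS)


pattern2 = build_second("TAACAAGGACCCTGTGTA")
print(pattern2)

# Synthetic string for illustration only.
print(re.search(pattern2, "CTAAAACACCTTTGGGTTCCTCTTAGAGAGAGAGGG").groups())
```

Swapping in other pattern combinations then only means editing the two lists, which is the point of splitting the patterns up.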

use strict;
use warnings;

sub complement {
    my $string = shift;
    $string =~ tr/ATGC/TACG/; # this is a transliteration, faster than s///
    return $string;
}

# first regex, split into sub-patterns
my @first = (
    qr([ACGT]{1,12000}),
    qr(AAC),
    qr([AG]{2,5}),
    qr([ACGT]{2,5}),
    qr(CTGTGTA),
);

# second regex, split into sub-patterns as callbacks
my @second = (
    sub { return qr(CTAAA) },
    sub { return qr([AC]{5,100}) },
    sub { return qr(TTTGGG) },
    sub {
        my (@matches) = @_;

        # complement the pattern of first.regex.p3
        return complement( $matches[3] );
    },
    sub { return qr(CTT) },
    sub { return qr([AG]{10,5000}) },
    sub {
        my (@matches) = @_;

        # complement the pattern of first.regex.p4
        return complement( $matches[4] );
    },
);

# contents of file 2 (assumption: read from STDIN here; the answer used the
# question's example output string for the third input)
my $file2 = do { local $/; <STDIN> };

while ( my $file1 = <DATA> ) {

    # this pattern will match the full thing in $1, and each sub-section in $2, $3, ...
    # @matches will contain (full, $2, $3, $4, $5, $6)
    my @matches = ( $file1 =~ m/(($first[0])($first[1])($first[2])($first[3])($first[4]))/g );
    next unless @matches;    # skip lines that do not match the first regex

    # iterate the list of anonymous functions and call each of them,
    # passing in the match results of the first match
    my $pattern2 = join q{}, map { '(' . $_->(@matches) . ')' } @second;
    print "$pattern2\n";

    my @matches2 = ( $file2 =~ m/($pattern2)/ );
}

# example input for file 1 (the third example substring from the question)
__DATA__
TAACAAGGACCCTGTGTA


These are the generated second patterns for your three input substrings.


If you're not familiar with this, it's what happens if you print a pattern that was constructed with the quoted regex operator qr//.

The pattern matches your example output for the third case. The resulting @matches2 looks like this when dumped out using Data::Printer.

    [1] "CTAAA",
    [2] "ACACC",
    [3] "TTTGGG",
    [4] "TTCCT",
    [5] "CTT",
    [7] "GG"

I cannot say anything about the speed of this implementation, but I believe it will be reasonably fast.

If you want to find other combinations of patterns, all you have to do is replace the entries in those two arrays (the qr() sub-patterns and the sub { ... } callbacks). If the first match has a number of sub-patterns other than five, you would also construct that pattern programmatically. I've not done that above to keep things simpler. Here's what it would look like.

my @matches = ( $file1 =~ join q{}, map { "($_)" } @first);

If you want to learn more about this kind of strategy, I suggest you read Mark Jason Dominus' excellent Higher Order Perl, which is available for free as a PDF.

Using stringr in R

Extract matches to regex_1: "[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)"

reg_1_matches = stringr::str_extract_all(sequences, "[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)")
reg_1_matches = unlist(reg_1_matches)

Let's assume the matches were:


Use stringr::str_match with capturing groups (...)

df_ps = stringr::str_match(reg_1_matches, "[ACGT]{1,12000}AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA")

p3 = df_ps[,2]
p4 = df_ps[,3]


rule_1 = chartr(old= "ACGT", "TGCA", p3)
rule_2 = chartr(old= "ACGT", "TGCA", p4)

Construct regex_2

  paste("(CTAAA)[AC]{5,100}(TTTGGG)", rule_1, "(CTT)[AG]{10,5000}", rule_2, sep="") 

all in one go:

reg_1_matches = unlist(stringr::str_extract_all(sequences, "[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)"))
df_ps = stringr::str_match(reg_1_matches, "[ACGT]{1,12000}AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA")
p3 = df_ps[,2]
p4 = df_ps[,3]
rule_1 = chartr(old= "ACGT", "TGCA", p3)
rule_2 = chartr(old= "ACGT", "TGCA", p4)
paste("(CTAAA)[AC]{5,100}(TTTGGG)", rule_1, "(CTT)[AG]{10,5000}", rule_2, sep="")