Remove small words using Python





Is it possible to use a regex to remove small words from a text? For example, I have the following string (text):

anytext = " in the echo chamber from Ontario duo "

I would like to remove all words that are 3 characters or shorter. The result should be:

"echo chamber from Ontario"

Is it possible to do that using a regular expression or any other Python function?

Thanks.


Answers

Certainly, it's not that hard either:

import re
shortword = re.compile(r'\W*\b\w{1,3}\b')

The above expression selects any word that is preceded by zero or more non-word characters (essentially whitespace or the start of the string), is between 1 and 3 characters long, and ends on a word boundary.
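To make the pieces of the pattern easier to see, here is the same expression written with re.VERBOSE so each part can carry a comment. This is only a sketch; the name shortword_verbose is mine and is not part of the original answer.

```python
import re

# Same pattern as above, split up with re.VERBOSE (whitespace and comments are ignored).
shortword_verbose = re.compile(r"""
    \W*        # zero or more preceding non-word characters (spaces, punctuation)
    \b         # word boundary: the short word starts here
    \w{1,3}    # one to three word characters
    \b         # word boundary: the short word ends here
""", re.VERBOSE)

anytext = " in the echo chamber from Ontario duo "
result = shortword_verbose.sub('', anytext)
print(repr(result))  # ' echo chamber from Ontario '
```

The leading and trailing spaces survive; call .strip() on the result if you want exactly the string from the question.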

>>> shortword.sub('', anytext)
' echo chamber from Ontario '

The \b boundary matches are important here; they ensure that you don't match just the first or last 3 characters of a word.
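A quick way to convince yourself that the boundaries matter is to run the pattern with and without them. The input string below is just an illustration I made up; neither word in it is short, so nothing should be removed.

```python
import re

text = "echo chamber"  # illustrative input: no word is 3 characters or shorter

# With \b on both sides, only whole words of 1-3 characters can match.
with_boundaries = re.sub(r'\W*\b\w{1,3}\b', '', text)

# Without \b, the pattern eats every word in chunks of up to 3 characters.
without_boundaries = re.sub(r'\W*\w{1,3}', '', text)

print(repr(with_boundaries))     # 'echo chamber'
print(repr(without_boundaries))  # ''
```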

The \W* at the start lets you remove both the word and the preceding non-word characters so that the rest of the sentence still matches up. Note that punctuation is included in \W; use \s if you only want to remove preceding whitespace.
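The difference between \W and \s only shows up when punctuation precedes a short word. A small sketch, using a made-up sentence and two hypothetical pattern names:

```python
import re

shortword_nonword = re.compile(r'\W*\b\w{1,3}\b')  # pattern from this answer
shortword_space = re.compile(r'\s*\b\w{1,3}\b')    # whitespace-only variant

text = "well, it is done"  # illustrative input with a comma before a short word

removed_with_W = shortword_nonword.sub('', text)
removed_with_s = shortword_space.sub('', text)

print(repr(removed_with_W))  # 'well done'   (the comma goes with " it")
print(repr(removed_with_s))  # 'well, done'  (the comma is kept)
```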

For what it's worth, this regular expression solution preserves extra whitespace between the rest of the words, while mgilson's version collapses multiple whitespace characters into one space. Not sure if that matters to you.
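To illustrate the whitespace difference, here is a small comparison on an input with uneven spacing. The input is invented for the example; the two expressions are the ones from the answers on this page.

```python
import re

shortword = re.compile(r'\W*\b\w{1,3}\b')

text = "echo   chamber  of  sound"  # illustrative input with runs of spaces

via_regex = shortword.sub('', text)
via_split = ' '.join(word for word in text.split() if len(word) > 3)

print(repr(via_regex))  # 'echo   chamber  sound'  (remaining spacing untouched)
print(repr(via_split))  # 'echo chamber sound'     (runs collapsed to one space)
```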

His solution, which splits the string and rejoins the long words, is the faster of the two in my timings:

>>> import timeit
>>> def re_remove(text): return shortword.sub('', text)
... 
>>> def lc_remove(text): return ' '.join(word for word in text.split() if len(word)>3)
... 
>>> timeit.timeit('remove(" in the echo chamber from Ontario duo ")', 'from __main__ import re_remove as remove')
7.0774190425872803
>>> timeit.timeit('remove(" in the echo chamber from Ontario duo ")', 'from __main__ import lc_remove as remove')
6.4250049591064453
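Those numbers are totals for timeit's default of one million calls, and they will differ by machine and Python version. If you want to re-measure, something along these lines should work; the repeat and number values are arbitrary choices to keep the run short, and no particular outcome is implied.

```python
import re
import timeit

shortword = re.compile(r'\W*\b\w{1,3}\b')

def re_remove(text):
    return shortword.sub('', text)

def lc_remove(text):
    return ' '.join(word for word in text.split() if len(word) > 3)

sample = " in the echo chamber from Ontario duo "

# Take the best of a few runs to reduce noise from other processes.
re_best = min(timeit.repeat(lambda: re_remove(sample), number=20_000, repeat=3))
lc_best = min(timeit.repeat(lambda: lc_remove(sample), number=20_000, repeat=3))

print(f"regex:      {re_best:.4f} s for 20,000 calls")
print(f"split/join: {lc_best:.4f} s for 20,000 calls")
```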

I don't think you need a regex for this simple example anyway ...

' '.join(word for word in anytext.split() if len(word)>3)
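If you need this in more than one place, the one-liner can be wrapped in a small function. The function name and the min_length parameter are my own additions for illustration, not something from the question.

```python
def remove_short_words(text, min_length=4):
    """Keep only whitespace-separated words with at least min_length characters."""
    return ' '.join(word for word in text.split() if len(word) >= min_length)

anytext = " in the echo chamber from Ontario duo "
cleaned = remove_short_words(anytext)
print(repr(cleaned))  # 'echo chamber from Ontario'
```

One caveat: split() counts attached punctuation as part of the word, so "duo." has length 4 and would be kept. If that matters, the regex approach (or stripping punctuation first) may suit you better.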

Basically, you want to find a substring in a string. There are two ways to do this in Python.

Method 1: in operator

You can use Python's in operator to check for a substring. It's quite simple and intuitive. It returns True if the substring is found in the string and False otherwise.

>>> "King" in "King's landing"
True

>>> "Jon Snow" in "King's landing"
False

Method 2: str.find() method

The second method is to use the str.find() method. Here, we call .find() on the string in which the substring is to be found, pass it the substring, and check the return value. If the value is anything other than -1, the substring was found, and the value is the index where it starts; a value of -1 means it was not found.

>>> some_string = "valar morghulis"

>>> some_string.find("morghulis")
6

>>> some_string.find("dohaeris")
-1

I would recommend using the first method, as it is more Pythonic and intuitive.




