Extracting and replacing elements of a list (array) of strings that satisfy conditions in Python

To generate a new list from a list (array) of strings, either by extracting only the elements that satisfy certain conditions or by replacing or converting elements, use list comprehensions.

After a brief explanation of list comprehensions, the following topics are covered with sample code.

  • Extract by whether a specific string is contained (partial match)
  • Replace a specific string
  • Extract by whether a string starts with a specific string
  • Extract by whether a string ends with a specific string
  • Extract by upper or lower case
  • Convert between upper and lower case
  • Extract by whether the characters are alphabetic or numeric
  • Multiple conditions
  • Regular expressions

Note that lists can store elements of different types and are strictly different from arrays. If you need arrays for processing that depends on memory size or memory addresses, or for numerical processing of large data, use array (standard library) or NumPy.

List comprehensions

When generating a new list from a list, list comprehensions are simpler to write than for loops.

[expression for any variable name in iterable object if conditional expression]

If elements are only selected by the conditional expression and not otherwise processed, the expression is simply the variable itself, so the form is as follows.

[variable name for variable name in original list if conditional expression]

If the if conditional expression is changed to if not conditional expression, the condition is negated and the elements that do not satisfy it are extracted.
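As a minimal sketch of the forms described above, the following compares a for loop with the equivalent list comprehension and then shows the if not form. The list words and the condition len(s) > 3 are hypothetical, chosen only for illustration.

```python
# Hypothetical sample data, used only for this illustration.
words = ['one', 'two', 'three', 'four']

# for loop version: append the elements that satisfy the condition.
result_loop = []
for s in words:
    if len(s) > 3:
        result_loop.append(s)
print(result_loop)
# ['three', 'four']

# Equivalent list comprehension.
result_comp = [s for s in words if len(s) > 3]
print(result_comp)
# ['three', 'four']

# "if not" extracts the elements that do NOT satisfy the condition.
result_not = [s for s in words if not len(s) > 3]
print(result_not)
# ['one', 'two']
```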

Contains / does not contain a specific string (partial match): in, not in

The expression “specific string in original string” evaluates to True if the original string contains the specific string, so it can be used as the conditional expression.

Use not in for the negation.

l = ['oneXXXaaa', 'twoXXXbbb', 'three999aaa', '000111222']

l_in = [s for s in l if 'XXX' in s]
print(l_in)
# ['oneXXXaaa', 'twoXXXbbb']

l_in_not = [s for s in l if 'XXX' not in s]
print(l_in_not)
# ['three999aaa', '000111222']

Replace a specific string

To replace part of the string in each list element, apply the string method replace() to each element in a list comprehension.

replace() leaves a string unchanged if it does not contain the string to be replaced, so there is no need to select elements with an if conditional expression.

l_replace = [s.replace('XXX', 'ZZZ') for s in l]
print(l_replace)
# ['oneZZZaaa', 'twoZZZbbb', 'three999aaa', '000111222']

If you want to replace the entire element when it contains a specific string, test it with in and use the ternary operator. The ternary operator is written in the following form.

true_value if conditional_expression else false_value

The expression part of a list comprehension can be a ternary operator.

l_replace_all = ['ZZZ' if 'XXX' in s else s for s in l]
print(l_replace_all)
# ['ZZZ', 'ZZZ', 'three999aaa', '000111222']

The same expression is shown below with parentheses added. If you are not used to this form, the parentheses may make it easier to read and help avoid mistakes. They are grammatically optional and do not change the result.

[('ZZZ' if ('XXX' in s) else s) for s in l]

Using in as a condition can be confusing next to the in of the list comprehension itself, but it is not difficult once you keep the syntactic forms of the list comprehension and the ternary operator in mind.

Starts / does not start with a specific string: startswith()

The string method startswith() returns True if the string begins with the string specified in the argument.

l_start = [s for s in l if s.startswith('t')]
print(l_start)
# ['twoXXXbbb', 'three999aaa']

l_start_not = [s for s in l if not s.startswith('t')]
print(l_start_not)
# ['oneXXXaaa', '000111222']

Ends / does not end with a specific string: endswith()

The string method endswith() returns True if the string ends with the string specified in the argument.

l_end = [s for s in l if s.endswith('aaa')]
print(l_end)
# ['oneXXXaaa', 'three999aaa']

l_end_not = [s for s in l if not s.endswith('aaa')]
print(l_end_not)
# ['twoXXXbbb', '000111222']
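Both startswith() and endswith() also accept a tuple of strings, in which case they return True if the string starts (or ends) with any of them. The following is a small sketch using the same sample list; the particular prefixes and suffixes are chosen only for illustration.

```python
l = ['oneXXXaaa', 'twoXXXbbb', 'three999aaa', '000111222']

# A tuple of prefixes: True if the string starts with any of them.
l_start_any = [s for s in l if s.startswith(('one', '000'))]
print(l_start_any)
# ['oneXXXaaa', '000111222']

# A tuple of suffixes: True if the string ends with any of them.
l_end_any = [s for s in l if s.endswith(('bbb', '222'))]
print(l_end_any)
# ['twoXXXbbb', '000111222']
```

Note that the argument must be a tuple; passing a list raises TypeError.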

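Extract by upper or lower case

The string methods isupper() and islower() determine whether all cased characters in a string are upper case or lower case. Characters without case, such as digits, are ignored, but a string with no cased characters at all returns False for both.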

l_lower = [s for s in l if s.islower()]
print(l_lower)
# ['three999aaa']
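The sample list above contains no all-upper-case string, so the following sketch uses a hypothetical list, l_case, to show isupper() alongside islower(), including how digits and strings without any letters are treated.

```python
# Hypothetical sample data, used only for this illustration.
l_case = ['ABC', 'abc', 'Abc', 'ABC123', '123']

# Digits are ignored, so 'ABC123' counts as upper case.
l_case_upper = [s for s in l_case if s.isupper()]
print(l_case_upper)
# ['ABC', 'ABC123']

l_case_lower = [s for s in l_case if s.islower()]
print(l_case_lower)
# ['abc']

# Mixed case, or no cased characters at all: neither method returns True.
l_case_neither = [s for s in l_case if not s.isupper() and not s.islower()]
print(l_case_neither)
# ['Abc', '123']
```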

Convert between upper and lower case

To convert all characters to upper or lower case, use the string methods upper() and lower(). Other methods include capitalize(), which upper-cases the first character and lower-cases the rest, and swapcase(), which swaps upper and lower case letters.

As in the replacement example above, use the ternary operator if you want to convert only the elements that satisfy a condition.

l_upper_all = [s.upper() for s in l]
print(l_upper_all)
# ['ONEXXXAAA', 'TWOXXXBBB', 'THREE999AAA', '000111222']

l_lower_to_upper = [s.upper() if s.islower() else s for s in l]
print(l_lower_to_upper)
# ['oneXXXaaa', 'twoXXXbbb', 'THREE999AAA', '000111222']
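capitalize() and swapcase() can be applied in the same way. The following sketch uses the same sample list; note that capitalize() also converts the remaining characters to lower case.

```python
l = ['oneXXXaaa', 'twoXXXbbb', 'three999aaa', '000111222']

# First character upper case, all other characters lower case.
l_capitalize = [s.capitalize() for s in l]
print(l_capitalize)
# ['Onexxxaaa', 'Twoxxxbbb', 'Three999aaa', '000111222']

# Upper case becomes lower case and vice versa.
l_swapcase = [s.swapcase() for s in l]
print(l_swapcase)
# ['ONExxxAAA', 'TWOxxxBBB', 'THREE999AAA', '000111222']
```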

Extract by whether the characters are alphabetic or numeric

The string methods isalpha() and isnumeric() determine whether all characters in a string are alphabetic or numeric, respectively. Both return False for an empty string.

l_isalpha = [s for s in l if s.isalpha()]
print(l_isalpha)
# ['oneXXXaaa', 'twoXXXbbb']

l_isnumeric = [s for s in l if s.isnumeric()]
print(l_isnumeric)
# ['000111222']
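A related method, isalnum(), returns True when every character is alphabetic or numeric. The following sketch uses a hypothetical list, l_mixed, to show how a symbol or an empty string affects the result.

```python
# Hypothetical sample data, used only for this illustration.
l_mixed = ['abc', '123', 'abc123', 'abc-123', '']

# Letters and digits may be mixed, but '-' and the empty string are rejected.
l_isalnum = [s for s in l_mixed if s.isalnum()]
print(l_isalnum)
# ['abc', '123', 'abc123']

# Elements that are neither purely alphabetic nor purely numeric.
l_other = [s for s in l_mixed if not s.isalpha() and not s.isnumeric()]
print(l_other)
# ['abc123', 'abc-123', '']
```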

Multiple conditions

The conditional expression part of a list comprehension can combine multiple conditions with and and or. Negation with not can also be used.

When combining three or more conditions, it is safer to enclose each group in parentheses (), because and binds more tightly than or and the result depends on how the conditions are grouped.

l_multi = [s for s in l if s.isalpha() and not s.startswith('t')]
print(l_multi)
# ['oneXXXaaa']

l_multi_or = [s for s in l if (s.isalpha() and not s.startswith('t')) or ('bbb' in s)]
print(l_multi_or)
# ['oneXXXaaa', 'twoXXXbbb']
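To illustrate why grouping matters, the following sketch evaluates the same three conditions with and without explicit parentheses. The conditions themselves are chosen only for illustration.

```python
l = ['oneXXXaaa', 'twoXXXbbb', 'three999aaa', '000111222']

# Without parentheses, "and" is evaluated before "or":
# s.isalpha() or ('aaa' in s and s.startswith('t'))
l_no_paren = [s for s in l if s.isalpha() or 'aaa' in s and s.startswith('t')]
print(l_no_paren)
# ['oneXXXaaa', 'twoXXXbbb', 'three999aaa']

# Grouping the "or" first gives a different result.
l_paren = [s for s in l if (s.isalpha() or 'aaa' in s) and s.startswith('t')]
print(l_paren)
# ['twoXXXbbb', 'three999aaa']
```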

Regular expressions

Regular expressions allow for highly flexible processing.

re.match() returns a match object when the pattern matches at the beginning of the string, and a match object is always evaluated as true in a conditional expression. If the pattern does not match, re.match() returns None, which is evaluated as false. So, to extract only the elements that match a regular expression, use re.match() in the conditional expression part of the list comprehension as before.

import re

l = ['oneXXXaaa', 'twoXXXbbb', 'three999aaa', '000111222']

l_re_match = [s for s in l if re.match('.*XXX.*', s)]
print(l_re_match)
# ['oneXXXaaa', 'twoXXXbbb']
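Because re.match() only matches at the beginning of the string, the pattern above needs the leading .* to find XXX in the middle. As an alternative sketch, re.search() looks for the pattern anywhere in the string, and re.fullmatch() requires the whole string to match. The patterns below are chosen only for illustration.

```python
import re

l = ['oneXXXaaa', 'twoXXXbbb', 'three999aaa', '000111222']

# re.match() anchors at the start, so no element matches 'XXX' alone.
l_match = [s for s in l if re.match('XXX', s)]
print(l_match)
# []

# re.search() finds the pattern anywhere in the string.
l_search = [s for s in l if re.search('XXX', s)]
print(l_search)
# ['oneXXXaaa', 'twoXXXbbb']

# Elements that contain at least one digit.
l_search_digit = [s for s in l if re.search(r'\d+', s)]
print(l_search_digit)
# ['three999aaa', '000111222']

# re.fullmatch() requires the entire string to consist of digits.
l_fullmatch = [s for s in l if re.fullmatch(r'\d+', s)]
print(l_fullmatch)
# ['000111222']
```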

re.sub(), which replaces the part matched by a regular expression, is also useful. Elements that do not match are left unchanged. To extract and replace only the matched elements, add an if conditional expression.

l_re_sub_all = [re.sub('(.*)XXX(.*)', r'\2---\1', s) for s in l]
print(l_re_sub_all)
# ['aaa---one', 'bbb---two', 'three999aaa', '000111222']

l_re_sub = [re.sub('(.*)XXX(.*)', r'\2---\1', s) for s in l if re.match('.*XXX.*', s)]
print(l_re_sub)
# ['aaa---one', 'bbb---two']
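When the same pattern is used for both the condition and the replacement, one option is to compile it once with re.compile() and reuse the compiled object. This is only a sketch of that variation; the name pattern is introduced here for illustration.

```python
import re

l = ['oneXXXaaa', 'twoXXXbbb', 'three999aaa', '000111222']

# Compile once, then reuse for both matching and substitution.
pattern = re.compile(r'(.*)XXX(.*)')

l_re_compiled = [pattern.sub(r'\2---\1', s) for s in l if pattern.match(s)]
print(l_re_compiled)
# ['aaa---one', 'bbb---two']
```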