Remove and extract duplicate elements from a list (array) in Python

Money and Business

This section describes how to generate a new list in Python by removing or extracting duplicate elements from a list (array).

The following details are described here.

  • Remove duplicate elements and generate new listings
    • Do not preserve the order of the original listing:set()
    • Preserves the order of the original listing: dict.fromkeys(),sorted()
    • Two-dimensional array (list of lists)
  • Extract duplicate elements and generate a new list
    • Do not preserve the order of the original listing
    • Preserves the order of the original listing
    • Two-dimensional array (list of lists)

The same concept can be applied to tuples instead of lists.

See the following article for

  • If you want to determine if a list or tuple has duplicate elements
  • If you want to extract elements that are common or not common among multiple listings instead of a single listing

Note that lists can store different types of data and are strictly different from arrays. If you want to handle arrays in processes that require memory size and memory addresses or numerical processing of large data, use array (standard library) or NumPy.

Remove duplicate elements and generate new listings

Do not preserve the order of the original listing: set()

If there is no need to preserve the order of the original list, use set(), which generates a set type set.

The set type is a data type that has no duplicate elements. When a list or other data type is passed to set(), duplicate values are ignored and an object of type set is returned in which only unique values are elements.

If you want to make it a tuple, use tuple().

l = [3, 3, 2, 1, 5, 1, 4, 2, 3]

print(set(l))
# {1, 2, 3, 4, 5}

print(list(set(l)))
# [1, 2, 3, 4, 5]

Of course, it can also be left as set. See the following article for more information on the set type set.

Preserves the order of the original listing: dict.fromkeys(),sorted()

If you want to preserve the order of the original list, use the class method fromkeys() of the dictionary type or the built-in function sorted().

dict.fromkeys() creates a new dictionary object whose keys are lists, tuples, etc. specified in the arguments. If the second argument is omitted, the value is None.

Since dictionary keys do not have duplicate elements, duplicate values are ignored as in set(). In addition, a dictionary object can be passed as an argument to list() to obtain a list whose elements are dictionary keys.

print(dict.fromkeys(l))
# {3: None, 2: None, 1: None, 5: None, 4: None}

print(list(dict.fromkeys(l)))
# [3, 2, 1, 5, 4]

It has been guaranteed since Python 3.7 (CPython is 3.6) that dict.fromkeys() preserves the order of the argument sequence. Earlier versions use the built-in function sorted() as follows.

Specify the list tuple method index() for the argument key of sorted, which returns a sorted list of elements.

index() is a method that returns the index of the value (the number of the element in the list), which can be specified as the key of sorted() to sort the list based on the order of the original list. The argument key is specified as a callable (callable) object, so do not write ().

print(sorted(set(l), key=l.index))
# [3, 2, 1, 5, 4]

Two-dimensional array (list of lists)

For two-dimensional arrays (lists of lists), the method using set() or dict.fromkeys() results in a TypeError.

l_2d = [[1, 1], [0, 1], [0, 1], [0, 0], [1, 0], [1, 1], [1, 1]]

# l_2d_unique = list(set(l_2d))
# TypeError: unhashable type: 'list'

# l_2d_unique_order = dict.fromkeys(l_2d)
# TypeError: unhashable type: 'list'

This is because non-hashable objects such as lists cannot be elements of type set or keys of type dict.

Define the following functions The order of the original list is preserved and works for one-dimensional lists and tuples.

def get_unique_list(seq):
    seen = []
    return [x for x in seq if x not in seen and not seen.append(x)]

print(get_unique_list(l_2d))
# [[1, 1], [0, 1], [0, 0], [1, 0]]

print(get_unique_list(l))
# [3, 2, 1, 5, 4]

List comprehension notation is used.

Here, we use the following

  • If X in “X and Y” is false in the short-circuit evaluation of the and operator, then Y is not evaluated (not executed).
  • The append() method returns None.

If the elements of the original list seq do not exist in the seen, then and after are evaluated.
seen.append(x) is executed and the element is added to seen.
Because the append() method returns None and None is False, not seen.append(x) evaluates to True.
The conditional expression in the list comprehension notation becomes True and is added as an element of the final generated list.

If the elements of the original list seq are present in seen, then x not in seen is False, and the conditional expression for the list comprehension expression is False.
Therefore, they are not added as elements of the final generated list.

Another method is to set the argument axis in NumPy's function np.unique(), although the result will be sorted.

Extract duplicate elements and generate a new list

Do not preserve the order of the original listing

To extract only duplicate elements from the original list, use collections.Counter().
Returns a collections.Counter (a subclass of dictionary) with the elements as keys and the number of elements as values.

import collections

l = [3, 3, 2, 1, 5, 1, 4, 2, 3]

print(collections.Counter(l))
# Counter({3: 3, 2: 2, 1: 2, 5: 1, 4: 1})

Since it is a subclass of dictionary, items() can be used to retrieve keys and values. It is sufficient to extract keys whose number is two or more.

print([k for k, v in collections.Counter(l).items() if v > 1])
# [3, 2, 1]

Preserves the order of the original listing

As shown in the example above, since Python 3.7, the keys of collections.Counter retain the order of the original list and so on.

In earlier versions, sorting with sorted() is sufficient, as is deleting duplicate elements.

print(sorted([k for k, v in collections.Counter(l).items() if v > 1], key=l.index))
# [3, 2, 1]

If you wish to extract duplicates as they are, simply leave elements from the original list with a number of two or more. The order is also preserved.

cc = collections.Counter(l)
print([x for x in l if cc[x] > 1])
# [3, 3, 2, 1, 1, 2, 3]

Two-dimensional array (list of lists)

For two-dimensional arrays (lists of lists), the following functions are possible when the order of the original list is not retained and when it is retained, respectively. It also works for one-dimensional lists and tuples.

l_2d = [[1, 1], [0, 1], [0, 1], [0, 0], [1, 0], [1, 1], [1, 1]]
def get_duplicate_list(seq):
    seen = []
    return [x for x in seq if not seen.append(x) and seen.count(x) == 2]

def get_duplicate_list_order(seq):
    seen = []
    return [x for x in seq if seq.count(x) > 1 and not seen.append(x) and seen.count(x) == 1]

print(get_duplicate_list(l_2d))
# [[0, 1], [1, 1]]

print(get_duplicate_list_order(l_2d))
# [[1, 1], [0, 1]]

print(get_duplicate_list(l))
# [3, 1, 2]

print(get_duplicate_list_order(l))
# [3, 2, 1]

If you want to extract with duplicates, leave elements from the original list with a count of two or more.

print([x for x in l_2d if l_2d.count(x) > 1])
# [[1, 1], [0, 1], [0, 1], [1, 1], [1, 1]]

Note that since the computational complexity of count() is O(n), the function shown above that repeatedly executes count() is very inefficient. There may be a smarter way.

Counter is a subclass of dictionary, so if you pass a list or tuple whose elements are lists or other non-hashable objects to collections.Counter(), an error will occur and you will not be able to use it.

# print(collections.Counter(l_2d))
# TypeError: unhashable type: 'list'