unique columns sed

#!/bin/sed -f

# Removes duplicate fields in a | separated file
# e.g.    foo|bar|foo|quz|bar
# becomes foo|bar|quz
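#
# Usage sketch, assuming GNU sed; the file names unique-columns.sed and
# input.txt are hypothetical:
#   sed -f unique-columns.sed input.txt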

: restart

# The s instruction needs some explanation.
# The regular expression consists of the following parts
# \1: \(^\||\)
#     Beginning of line or the separator that ends the previous field
#     Note that we use | as field separator
# \2: \([^|]\+\)
#     A non-empty run of non-separator characters following \1
#     We can use the \+ extension because \1 and \5 need the \| extension anyway
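# \3: \(|.*\)\?
#     Everything between \2 and \4: either nothing or a separator followed by
#     further fields. Requiring the leading | makes \2 a complete field, so
#     foo is not treated as a duplicate inside foobar|foo
# \4: \(|\2\)
#     A field separator followed by a field identical to \2
# \5: \(|\|$\)
#     Field separator closing \4 or end of line
#
# The replacement \1\2\3\5 excludes \4, so the duplicated field is removed
s/\(^\||\)\([^|]\+\)\(|.*\)\?\(|\2\)\(|\|$\)/\1\2\3\5/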

# Repeat as long as the s instruction matched, until all duplicates are gone
# s///g does not work in this case because the matches may overlap
t restart
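
# Trace of the loop for the example above (worked by hand, not program output):
#   foo|bar|foo|quz|bar    input
#   foo|bar|quz|bar        pass 1: \2 = foo, the second foo is dropped
#   foo|bar|quz            pass 2: \2 = bar, the second bar is dropped
#                          pass 3: no match, so t does not branch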

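# Repeated empty fields have to be handled separately,
# because \2 above never matches an empty field
# The regex matches || or | followed by end of line
# The replacement is \1: a single | or, at end of line,
# the empty string matched by $
# e.g. a||b||c becomes a||b|c
#
# The suffix 2g is a GNU extension and replaces all but the first match,
# so the first empty field is kept
# Matches cannot overlap, so a run of adjacent empty fields such as |||
# is not fully collapsed
# For non-GNU sed, it may be replaced with another loop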
s/|\(|\|$\)/\1/2g
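
# Sketch of a loop that avoids the 2g suffix (an assumption, left commented
# out and not tested against non-GNU seds; the label name "empties" is made up)
# The numeric flag 2 without g is POSIX, so the second match can be removed
# repeatedly until only one match is left
# \| is itself a GNU extension, so this only removes the dependency on 2g
# The line is rescanned after every substitution, so the result can differ
# from 2g: a||b|||c ends as a||b|c here, while 2g gives a||b||c
#
# : empties
# s/|\(|\|$\)/\1/2
# t empties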