Using Command Line Tools to modify glossaries

WARNING: Use Terminal commands only when you know what you are doing on your Mac or Linux PC.

Command-line tools such as cut, tr and paste offer a quick way to manipulate your tab-delimited glossaries.

Cutting out the first column

You can cut out the first column of your CafeTran glossary, e.g. to spell check it. Here is an example with the InterActive Terminology for Europe, the EU's multilingual termbase:

cut -f1 iate.txt > german.txt

See also: Spell-checking a glossary

Cutting out the second column

cut -f2 iate.txt > dutch.txt
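
As a minimal sketch of what the two cut commands do, the following runs them on a tiny made-up glossary. The file name sample.txt and its entries are assumptions for illustration only; a real CafeTran glossary may have more columns.

```shell
# Work in a throwaway directory so no real glossary is overwritten.
cd "$(mktemp -d)"

# Hypothetical two-column glossary: German <TAB> Dutch.
printf 'Haus\tHuis\nBaum\tBoom\n' > sample.txt

# cut splits each line on tabs by default; -f selects the field number.
cut -f1 sample.txt > german.txt   # first column only
cut -f2 sample.txt > dutch.txt    # second column only

cat german.txt
cat dutch.txt
```

Note that > overwrites the target file without asking, so it is safer to write to a new file name than to the glossary itself.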

Changing a text to lowercase

Now we change the Dutch column to all lowercase:

tr '[:upper:]' '[:lower:]' < dutch.txt > lowercase.txt
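
A small sketch of the lowercase step on made-up data follows. The character classes are quoted so that the shell passes them to tr unchanged instead of trying to match them against file names. One caveat: GNU tr works on single bytes, so on Linux accented capitals in a UTF-8 file may be left as they are.

```shell
# Work in a throwaway directory so no real file is overwritten.
cd "$(mktemp -d)"

# Hypothetical Dutch column with mixed case.
printf 'Huis\nBOOM\n' > dutch.txt

# Map every uppercase letter to its lowercase counterpart.
tr '[:upper:]' '[:lower:]' < dutch.txt > lowercase.txt

cat lowercase.txt
```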

Joining columns

Now we recombine the two columns:

paste german.txt lowercase.txt > modified.txt
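
paste joins the files line by line, so the result only lines up if both files have the same number of lines. The sketch below, again on made-up data, checks the line counts before joining; the check is an added precaution, not part of the original recipe.

```shell
# Work in a throwaway directory so no real file is overwritten.
cd "$(mktemp -d)"

# Hypothetical columns produced by the earlier steps.
printf 'Haus\nBaum\n' > german.txt
printf 'huis\nboom\n' > lowercase.txt

# Count the lines; $(( )) strips the padding some wc versions add.
german_lines=$(( $(wc -l < german.txt) ))
dutch_lines=$(( $(wc -l < lowercase.txt) ))

if [ "$german_lines" -eq "$dutch_lines" ]; then
    # paste puts a tab between the corresponding lines of the two files.
    paste german.txt lowercase.txt > modified.txt
    cat modified.txt
else
    echo "Line counts differ; not joining." >&2
fi
```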

More info: c't 2014, issue 22, pages 174-177

Creating a word list from a source text

Use this command in the Terminal to create a word list from a source text, while excluding all words in a given list:

fmt -1 source.txt | tr -s '\t' ' ' | sed -e 's/^[ ]//' | tr -d '[:punct:]' | tr -d '[:digit:]' | grep -w -i -v -f exclude.txt | sort | uniq > wordlist.txt

Command                                   Explanation
fmt -1 source.txt                         Reformat (word-wrap) the source text so that each word is on its own line.
tr -s '\t' ' '                            Replace all tab characters with spaces and squeeze repeated spaces into one.
sed -e 's/^[ ]//'                         Remove the leading space.
tr -d '[:punct:]' | tr -d '[:digit:]'     Remove punctuation characters and digits.
grep -w -i -v -f exclude.txt              Drop (case-insensitively) all words that appear in the list 'exclude.txt'.
sort | uniq                               Sort the list and remove all duplicates.
> wordlist.txt                            Send the output to the file 'wordlist.txt'.
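
To see the whole pipeline at work, here is a sketch on a two-sentence sample. The contents of source.txt and exclude.txt are invented for illustration; in practice exclude.txt would be a stop word list with one word per line.

```shell
# Work in a throwaway directory so no real file is overwritten.
cd "$(mktemp -d)"

# Hypothetical source text and exclusion list.
printf 'The cat saw the dogs.\nThe dogs ran 2 miles.\n' > source.txt
printf 'the\n' > exclude.txt

# One word per line, strip punctuation and digits, drop excluded words,
# then sort and remove duplicates.
fmt -1 source.txt | tr -s '\t' ' ' | sed -e 's/^[ ]//' \
    | tr -d '[:punct:]' | tr -d '[:digit:]' \
    | grep -w -i -v -f exclude.txt | sort | uniq > wordlist.txt

cat wordlist.txt
```

A token that consists only of digits, such as the 2 above, becomes an empty line once the digits are deleted, so the resulting list can contain one blank line.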

Download a list with German stop words.
