Introduction to regular expressions

[2018 update]

TextWrangler has now been replaced by BBEdit, which offers a free trial and then switches to a "free mode". BBEdit in "free mode" is a better TextWrangler and also happens to be a 64-bit application compatible with High Sierra.
https://www.barebones.com/products/bbedit/comparison.html

I mention Aquamacs at the end of the article. Regular expressions for Aquamacs, and for Emacs in general, are documented in the Emacs manual, available here:
https://www.gnu.org/software/emacs/manual/html_node/emacs/Regexps.html

Check this article too:
https://www.johndcook.com/blog/2018/01/27/emacs-features-that-use-regular-expressions/

In OmegaT, you lose a lot if you're not familiar with regular expressions. OmegaT uses the Java flavor of regular expressions:
https://omegat.sourceforge.io/manual-standard/en/appendix.regexp.html

OmegaT allows you to create "tags" that can be protected and checked during translation by using regular expressions. In a recent project I created an 872-character regular expression that described 71 different tags.

As of February 2018, OmegaT also allows for replacements with capture groups as described below:
https://sourceforge.net/p/omegat/feature-requests/953/

And you can also use regular expressions to search for empty segments, as I document in this article:
https://mac4translators.blogspot.com/2018/03/searching-for-empty-translations-in.html




The technology that had the most impact on my workflow is definitely "regular expressions".

I discovered them at the end of the '90s when I was working on the conversion of a database output to a set of about 6000 static HTML pages. At the time, the editor of choice on the Mac was BBEdit from Barebones Software, but its free and "lite" version "BBEdit Lite" was also immensely popular. BBEdit Lite has now been replaced by TextWrangler and just like its predecessor, TextWrangler can be used without paying a user license fee*.



What are regular expressions?


Regular expressions are a "search" function on steroids. Regular expressions were created to find patterns in strings. They can find simple patterns like the word "pattern" in this text, or more complex patterns like "a string that starts with 'pa', followed by a letter that's repeated twice, followed by any three characters that are neither 'space' nor '@' or '^' and followed by a space".

This document uses its first two paragraphs (the paragraphs in italics, above) as a test ground. Paste those paragraphs into your favorite text editor that supports regular expressions (I use TextWrangler for all the descriptions, so you might want to use it too), then open the search window, check the "grep" box at the bottom and search for:

re[^ ]*

You should see colors appearing while you type the search terms.

What that expression means is:
r followed by e followed by a group of characters that are not a space, or by nothing.

Hit Next and see what you get, then hit Cmd+G and see what you get. If you start from the top of the paragraph, you should have 8 "matches".
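
The same search can also be tried outside a text editor. Here is a minimal sketch in Python, a language chosen only for illustration and not one this article relies on. The sample sentence is made up for the example, and Python's re module is a different engine from TextWrangler's, although a pattern this simple should behave the same way in both.

```python
import re

# Hypothetical sample text, made up for this illustration.
sample = "I prefer free software and regular expressions"

# r, then e, then zero or more characters that are not a space.
pattern = re.compile(r"re[^ ]*")

# findall returns every non-overlapping match, left to right,
# which is what hitting Next repeatedly does in the editor.
matches = pattern.findall(sample)
print(matches)  # ['refer', 'ree', 're', 'regular', 'ressions']
```

Note that the match does not have to start at the beginning of a word: "prefer" gives "refer" and "expressions" gives "ressions".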


Normal characters


Most characters represent themselves in regular expressions (regex), just as in a "normal" search.

r means r and e means e, " " means a space. In the same sequence. No magic here.


Special effects


Some characters have special effects:

→ [ starts a set of characters, any one of which may match
→ ] ends that set
→ ^ means "not" when it comes first inside the set
→ * means "zero or more of what just came"

So, our simple regular expression above, re[^ ]*, means:

"look for any string that has an r followed by an e followed by zero or more characters that are not a space."

Now, what if you need to find characters like ^, [, ] or *?


Cancelling special effects


When you want to find characters that have a special effect without "triggering" that special effect, you put a "\" in front of them:

→ \* means the character *
→ \[ means the character [


And since the character "\" has the special effect of removing the special effect of a character that has a special effect... then:

→ \\ means the character \

etc.

By the way, the character . has the special effect of matching "any one character" so if you're looking for a period, then you really want to look for the \. string...


Examples:


The regular expression ". " (. followed by space) will match any one character followed by a space. There are 78 strings that match this pattern in the paragraph.

The regular expression "\. " (\ followed by . followed by space) will match any period followed by a space. There are only 2 strings that match this pattern in the paragraph.

The regular expression "re*." (re followed by * followed by .) will match any string that is composed of an r followed by zero or more e's, followed by any one character. There are 22 matches in the paragraph. Verify that you understand them all.

The regular expression ".e\*\." (. followed by e followed by \ followed by * followed by \ followed by .) will match the 4-character string ee*. that you find at the end of the paragraph.
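
To see the difference between a special character and its escaped form on something shorter than the test paragraph, here is a small Python sketch. The sample sentence is made up for this illustration, so the counts below apply to it and not to the paragraph above.

```python
import re

# Hypothetical sample text, made up for this illustration.
sample = "No fee. A small fee*. The end."

# "." is special: any one character followed by a space.
any_then_space = re.findall(r". ", sample)

# "\." is a literal period: only a period followed by a space.
period_then_space = re.findall(r"\. ", sample)

# "\*" and "\." are literal: any one character, then "e*." exactly.
literal_star = re.findall(r".e\*\.", sample)

print(len(any_then_space))     # 6
print(len(period_then_space))  # 2
print(literal_star)            # ['ee*.']
```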


Triggering special effects


Some characters work the other way round: by themselves they do not have a special effect but if you stick the \ character before them, then their special effect is triggered.

t means t but \t means tabulation
r means r but \r means line break (specifically "carriage return")
s means s but \s means all sorts of white space, which includes spaces, tabulations, line breaks, etc.

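If the character does not have a special effect, then in TextWrangler using \ has no effect. Other flavors can be stricter: the Java flavor that OmegaT uses treats a \ before a letter with no special meaning as an error.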

i means i and \i too means i

Such sequences (\ followed by a character) are usually called "escape sequences".
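
Here is a short Python sketch of escape sequences at work. The tab-separated line is a made-up example. The last part shows that behavior around unknown escapes depends on the engine: Python's re module refuses \i instead of reading it as i.

```python
import re

# Hypothetical tab-separated line ending with a line break.
line = "source text\ttarget text\n"

# \t matches only the tabulation, so this splits the two columns.
columns = re.split(r"\t", line.rstrip("\n"))

# \s matches any white space: two spaces, one tab, one line break.
whitespace_count = len(re.findall(r"\s", line))

# Python's engine rejects a backslash before a letter
# that has no escape meaning, such as \i.
try:
    re.compile(r"\i")
    unknown_escape_accepted = True
except re.error:
    unknown_escape_accepted = False

print(columns)                  # ['source text', 'target text']
print(whitespace_count)         # 4
print(unknown_escape_accepted)  # False
```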


Remembering matches


If you want to "memorize" a match, for later use in the expression or in the "Replace:" field, then you put the corresponding expressions between parentheses:

→ (re)[^ ]+ will produce the same matches as above, but will memorize the re part and not the rest. (+ works like *, but means "one or more of what just came".)

→ re([^ ]+) will produce the same matches as above, but will not memorize the re part and instead will memorize the rest.

→ (re)([^ ]+) will produce the same matches as above and will memorize the 2 parts separately.
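
As an illustration, this Python sketch shows what each pair of parentheses memorizes when the pattern is applied to the word "regular". The variable names are mine; group(0) is Python's way of asking for the whole match.

```python
import re

# Two memorized parts: "re", then the rest up to the next space.
m = re.match(r"(re)([^ ]+)", "regular")

whole = m.group(0)   # the complete match
first = m.group(1)   # what the first pair of parentheses memorized
second = m.group(2)  # what the second pair of parentheses memorized

print(whole, first, second)  # regular re gular
```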


Using memorized matches


Now that the matches are remembered, you can use them. Use \1 to refer to the first memorized string, \2 to refer to the second memorized string, etc.

→ (e)\1\*\. will match the "ee*." string that you find at the end of the text.

→ search for (re)([^ ]+) and put \2\1 in the Replace: field:

(re) is the first group
([^ ]+) is the second group

\2\1 will thus put the second group before the first group.

The term "regular" matches the pattern: (re) matches re and ([^ ]+) matches gular. The replaced string will thus be "gularre".

→ search for (re)([^ ]+) and put \1\1_\[\2\] in the Replace: field:


(re) is the first group
([^ ]+) is the second group

\1\1_\[\2\] will put 2 instances of the first group, then an underscore, then [, then the second group, then ].

In the case of "regular", we'd have the following replacement string:
rere_[gular]
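
The same replacements can be sketched in Python, again only as an illustration. One difference to be aware of: in Python's replacement syntax the square brackets are written without backslashes, because there a backslash before [ would be kept in the output. Replacement syntax is one of the places where tools differ the most.

```python
import re

# Swap the two memorized parts.
swapped = re.sub(r"(re)([^ ]+)", r"\2\1", "regular")

# First part twice, an underscore, then the second part in brackets.
decorated = re.sub(r"(re)([^ ]+)", r"\1\1_[\2]", "regular")

# \1 inside the pattern itself: an e followed by the same e again,
# then a literal * and a literal period.
doubled = re.search(r"(e)\1\*\.", "a user license fee*.").group(0)

print(swapped)    # gularre
print(decorated)  # rere_[gular]
print(doubled)    # ee*.
```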


That's only the beginning...


What you need to check now is the special effects of some characters. If you use TextWrangler, it is all in the user manual, page 133, Chapter 8 (Searching with Grep)**, or you can call the Help with Cmd+? and you'll find a relevant link right away.

Textwrangler's regex is pretty standard so once you're used to it there, you can use it in other editors too. If what works in Textwrangler does not work there, check the idiosyncrasies of the editor you use.

Now, take a real-world document and try to transform it by using a few regular expressions. A typical use case for a translator would be to convert a TMX file into a 2-column tab-separated data set, or the opposite: to convert a 2-column tab-separated data set into a TMX file. If you manage to do that you've created your first alignment-based TMX converter!
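
To give an idea of the TMX direction, here is a rough Python sketch, not a finished converter. It rests on several assumptions: the TMX fragment is made up and heavily simplified (plain <tu> elements, exactly two languages, no inline tags inside the segments), and it uses two things not covered above, the "*?" form that stops at the nearest possible match and an option that lets "." cross line breaks. A real TMX file will need more care.

```python
import re

# Hypothetical, simplified TMX fragment made up for this sketch.
tmx = """<tu>
<tuv xml:lang="en"><seg>Hello</seg></tuv>
<tuv xml:lang="fr"><seg>Bonjour</seg></tuv>
</tu>
<tu>
<tuv xml:lang="en"><seg>Thank you</seg></tuv>
<tuv xml:lang="fr"><seg>Merci</seg></tuv>
</tu>"""

# One translation unit: the first <seg> is memorized as group 1,
# the second <seg> as group 2. re.DOTALL lets "." match line breaks.
unit = re.compile(
    r"<tu>.*?<seg>(.*?)</seg>.*?<seg>(.*?)</seg>.*?</tu>",
    re.DOTALL,
)

# Replace each unit with "source, tabulation, target".
tab_separated = unit.sub(r"\1\t\2", tmx)
print(tab_separated)
```

Each translation unit collapses to one line with the two segments separated by a tabulation; the line breaks between units are left untouched.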



* I try to use or discuss free software when possible because I think that is the way to go. People who want to use a free text editor on the Mac can use Aquamacs. It comes with all the goodness of Emacs (including the same regular expressions) and looks and feels a lot like a "normal" Mac text editor.
** [2018 Update] In the BBEdit manual, Chapter 8 is on page 165.
