Simple XML creation with AppleScriptObjectiveC

(*
This article is an attempt at putting together a practical introduction to using AppleScriptObjectiveC from all the information I gathered when creating a stand-alone application that required XML generation. Including the code, it is 3000 words. It's trivial for most of you, but it took me quite some time to put all the parts together and then to document them. Not everything is super clear yet, so go ahead and tear it apart. :) I first give the code without any comments, and then I give it again with block comments that explain what needs explaining.

The code generates an omegat.project file that is equivalent to what OmegaT generates when creating a new project. I'm using that generation code in a bigger script that creates full-fledged OmegaT projects without going through the OmegaT interface. The code can be applied to any kind of generic XML generation with a few tweaks. I use something similar in an Excel to TMX conversion script as well. I'll eventually publish both here.
*)

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

on CreateOmegaTProjectFile(ProjectSettings)
set project_tags to {"source_dir", "source_dir_excludes", "target_dir", "tm_dir", "glossary_dir", "glossary_file", "dictionary_dir", "source_lang", "target_lang", "source_tok", "target_tok", "sentence_seg", "support_default_translations", "remove_tags", "external_command"}
set masks to {"**/.svn/**", "**/CVS/**", "**/.cvs/**", "**/desktop.ini", "**/Thumbs.db", "**/.DS_Store"}
set valueindex to 0
set projectRoot to current application's NSXMLNode's elementWithName:"omegat"
set theProject to current application's NSXMLNode's documentWithRootElement:projectRoot
theProject's setCharacterEncoding:"UTF-8"
theProject's setStandalone:true
set project to current application's NSXMLNode's elementWithName:"project"
set projectVersion to current application's NSXMLNode's attributeWithName:"version" stringValue:"1.0"
project's addAttribute:projectVersion
projectRoot's addChild:project
repeat with child in project_tags
set valueindex to valueindex + 1
if contents of child is not "source_dir_excludes" then
set child to (current application's NSXMLNode's elementWithName:child stringValue:(item valueindex of ProjectSettings))
(project's addChild:child)
else
set child to (current application's NSXMLNode's elementWithName:"source_dir_excludes")
(project's addChild:child)
set source_dir_excludes to (project's elementsForName:"source_dir_excludes")'s firstObject()
repeat with mask in masks
set mask to (current application's NSXMLNode's elementWithName:"mask" stringValue:mask)
(source_dir_excludes's addChild:mask)
end repeat
end if
end repeat
set theData to theProject's XMLDataWithOptions:((get current application's NSXMLNodePrettyPrint) + (get current application's NSXMLDocumentTidyXML))
theData's writeToFile:(item 16 of ProjectSettings) options:(current application's NSDataWritingAtomic) |error|:(missing value)
end CreateOmegaTProjectFile


set ProjectSettings to {"PATH_1", "MASKS", "PATH_2", "PATH_3", "PATH_4", "FILE_5", "PATH_6", "LANG_7", "LANG_8", "TOKENIZER_9", "TOKENIZER_10", "SEGMENTATION_11", "DEFAULT_TRANSLATION_12", "REMOVE_TAGS_13", "EXTERNAL_COMMAND_14", ((POSIX path of (path to desktop folder)) & "omegat.project")}

my CreateOmegaTProjectFile(ProjectSettings)


(*
There are a number of solutions in AppleScript to create XML but nothing out of the box to generate generic data. System Events cannot create generic XML, it can only modify existing files. The only format it can natively produce is the "property list" format, used to store application preferences, etc. There are solutions that involve concatenating strings and doing a lot of checks on the data to make sure the output is valid (in XML a number of characters are forbidden, so all strings that are concatenated need to be thoroughly checked for those) but they are neither elegant nor robust. There are libraries available, but they do a lot more than what casual users need.
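To illustrate why the Foundation route is more robust than concatenating strings, here is a minimal sketch. The handler name "noteXML" and the <note> tag are made up for this example and are not part of the OmegaT code. It hands a string containing a forbidden character to NSXMLNode and lets Foundation do the escaping:

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical helper: wrap any text in a <note> element and return the serialized XML.
on noteXML(theText)
	set noteElement to current application's NSXMLNode's elementWithName:"note" stringValue:theText
	return (noteElement's XMLString()) as text
end noteXML

-- The ampersand is escaped for us when the element is serialized.
set escapedNote to my noteXML("Tom & Jerry")
```

With string concatenation, the same check would have to be written by hand for every value that goes into the file.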

I thought that offering a simple solution for a simple problem that would provide the reader with a step-by-step introduction to reading the Foundation documentation and understanding how to use it with vanilla AppleScript was a better approach, at least for me, since short of having scriptable applications that do what you want, the only way to do really complex things in AppleScript is to use Foundation.

My problem, which is common to all amateur AppleScript users, is a problem of discoverability and of fluency. AppleScript is not a trivial language, a systematic description of its features is non-existent, and when you want to go beyond the "tell Finder to sort the mess on my Desktop" you end up needing to check a lot of references that either consider that you know a lot, or that you don't know much. There is no middle ground and no easy way to find what you need without actually asking. In fact, I still wonder how I could have written this code without the help of people who just seem to "know". There is no clear discoverability path to the information I gathered here.
AppleScriptObjC is not described in the AppleScript Language Guide issued by Apple. The 3rd edition of "Learn AppleScript" by Hamish Sanderson and Hanaan Rosenthal from Apress (2010) discusses some aspects of AppleScriptObjC but the document that best describes it is "Everyday AppleScriptObjC" by Shane Stanley (2015). The Foundation framework, along with other frameworks that can be accessed by AppleScriptObjC are described in the developer documentation (either in Xcode or online) with Objective-C use in mind (syntax, etc.) so we'll use the 3 references to go through the code.

Becoming fluent in a language is a problem both for the linguist and for the programmer. Fluency only comes with regular practice based on sound references, and regular contact with "natives". As far as AppleScript (and AppleScriptObjC) is concerned, "natives" can be found in a number of very interesting places on the internet. The first is the official AppleScript User List (ASUL) hosted by Apple. Apple has been a bad host in recent years, so fear of losing this resource has motivated some of its users to create an independent list, hosted on groups.io. There is also the MacScripter web forum. I'm not super fond of web forums. Their user interface is clumsy more often than not and this one is no exception. But it's been around forever and a number of world-class experts are there to comment on code and generally manage the community. Then, there is the Script Debugger user forum. Conversations there generally revolve around higher-level topics, the web forum is a modern one, and the user experience is of extremely high quality.

Links:

Ok, so, you're all set, no more chatting, let's check the code.
*)

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

(*
As seen in the script headers above, we'll be using the "Foundation" framework which allows us to directly work on the XML tools that macOS offers. Without this declaration we can't access what we need.

This code will generate an XML file that will reproduce the following typical omegat.project file, with default values for the project folders, English as source language, Japanese as target language, sentence segmentation and default translations enabled, and no external command to process when creating the target files.

XML itself does preserve the order of elements, but OmegaT does not seem to depend on the order of these tags or on the white space used to present them, so the appearance of the data will change depending on how the file was produced. It may not be exactly like a "genuine" OmegaT file, but as far as the meaning of the data and OmegaT proper are concerned that is not relevant.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<omegat>
    <project version="1.0">
        <source_dir>__DEFAULT__</source_dir>
        <source_dir_excludes>
            <mask>**/.svn/**</mask>
            <mask>**/CVS/**</mask>
            <mask>**/.cvs/**</mask>
            <mask>**/desktop.ini</mask>
            <mask>**/Thumbs.db</mask>
            <mask>**/.DS_Store</mask>
        </source_dir_excludes>
        <target_dir>__DEFAULT__</target_dir>
        <tm_dir>__DEFAULT__</tm_dir>
        <glossary_dir>__DEFAULT__</glossary_dir>
        <glossary_file>glossary.txt</glossary_file>
        <dictionary_dir>__DEFAULT__</dictionary_dir>
        <source_lang>EN</source_lang>
        <target_lang>JA</target_lang>
        <source_tok>org.omegat.tokenizer.LuceneEnglishTokenizer</source_tok>
        <target_tok>org.omegat.tokenizer.LuceneJapaneseTokenizer</target_tok>
        <sentence_seg>true</sentence_seg>
        <support_default_translations>true</support_default_translations>
        <remove_tags>true</remove_tags>
        <external_command></external_command>
    </project>
</omegat>

The tokenizers are generally not set by the user because OmegaT would automatically select them based on the corresponding language codes.
Anything else is user settable, even the "excludes" files, even though that is an advanced setting that most users should not have to consider.

So, we're going to create a handler that takes as input the following values:

• source_dir_value: path to directory
• source_dir_excludes_masks: list of mask tags
• target_dir_value: path to directory
• tm_dir_value: path to directory
• glossary_dir_value: path to directory
• glossary_file_value:  path to text file
• dictionary_dir_value: path to directory
• source_lang_value: string language code (we do need to check the validity of the code based on the appropriate ISO standard, but we consider here that the check has been made upstream, if only to keep the XML generation code free of such external checks)
• target_lang_value: string language code (see above)
• source_tok_value: string automatically proposed by the script, based on language code (here again, the tokenizers corresponding to the source language should be a valid tokenizer but we let the upstream code select and validate it to stick to the core XML generation code)
• target_tok_value: string automatically proposed by the script, based on language code (see above)
• sentence_seg_value: boolean
• support_default_translations_value: boolean
• remove_tags_value: boolean
• external_command_value: string to be sent to exec when the target files are created, can be empty if no action is requested.

We'll make this code a handler so that we can call it without having to copy it every time we need it. This is better because it lets us logically separate portions of the code.
*)

on CreateOmegaTProjectFile(ProjectSettings)
(*
The first few lines below are necessary to initialize the XML tags. First comes "project_tags", a list of tags that are used in the file. There are 15 tags, of which 14 only take a simple value, either a path, or a text string, etc. As we can see above, "source_dir_excludes" takes a list of "mask" tags, which we'll create from "masks", the list of patterns that describe the type of files to be ignored by OmegaT. Last, we initialize "valueindex", an index that will be used to associate each tag with its value in the list that we feed the handler.
Then comes the beginning of the XML structure creation.
*)
set project_tags to {"source_dir", "source_dir_excludes", "target_dir", "tm_dir", "glossary_dir", "glossary_file", "dictionary_dir", "source_lang", "target_lang", "source_tok", "target_tok", "sentence_seg", "support_default_translations", "remove_tags", "external_command"}
set masks to {"**/.svn/**", "**/CVS/**", "**/.cvs/**", "**/desktop.ini", "**/Thumbs.db", "**/.DS_Store"}
set valueindex to 0
(*
We're first going to create an XML element that we'll then use as the XML document root. We'll create the other elements once that is done.
*)
set projectRoot to current application's NSXMLNode's elementWithName:"omegat"
(*
This is our first line of AppleScriptObjC. First question: how do we call the Foundation items that we need to work with?
"Everyday AppleScriptObjC":
"Class names are effectively properties of the current application (which is, in turn, the parent of AppleScript in your script)."

"Learn AppleScript":
"AppleScriptObjC presents Cocoa classes as class elements of the current application object."

So, everything we'll call from Cocoa will be called from "current application", hence the use of "current application's NSXMLNode".

In the Xcode documentation, NSXMLNode is described as, well, an XML node.
 
As the documentation says, an XML node can be anything from: an element, an attribute, text, a processing instruction, a namespace, or a comment.
 
Here we use the "elementWithName" method to create the element. Clicking on its description in the Xcode documentation browser shows that the method is a "type method" and returns "an NSXMLElement object with a given tag identifier, or name".
 
"Type methods" apply to classes, they are generally used to create objects which are specific instances of a given class. When we'll work on the objects themselves we'll need "instance methods". In the documentation, "type methods" have a "+" prefixed to their name and "instance methods" have a "-" prefixed instead.
 
The syntax for such method calls is methodName:parameter, so here we have: elementWithName:"omegat"  This syntax will be used for all the other elements that we create.
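As a small illustration of the difference between the two kinds of methods, here is a sketch that uses NSString rather than NSXMLNode (the variable names are only for this example):

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- "+ stringWithString:" is a type method: it is sent to the NSString class itself.
set cocoaString to current application's NSString's stringWithString:"omegat"

-- "- uppercaseString" is an instance method: it is sent to the object we just created.
set shouted to (cocoaString's uppercaseString()) as text
```

A method that takes no parameter is called with empty parentheses, as in uppercaseString() above.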
 
It is important to note that we are using a shortcut here. As Shane wrote on ASUL:
"(elementWithName is) what's known as a convenience method. You're actually making a particular subclass of NSXMLNode, an NSXMLElement.
You could have used:
set projectRoot to current application's NSXMLElement's alloc()'s initWithName:"omegat"
But convenience methods are common, if not following any particular logic of when and where they're provided. Something like "stringWithString:" instead of "alloc()'s initWithString:" is a common example."
 
To make sense of that we need to know how Objective-C/Cocoa works:
"Everyday AppleScriptObjC":
"The equivalent (of AppleScript's make) in Cocoa is actually a two-stage process: first the object is created by allocating memory for it, and then it is initialized.  The first stage is done using the +alloc method. You will not see it listed (in the documentation) because it is a method all classes inherit from the NSObject class."
If we check the NSObject description, we see "+ alloc" which is described as "Returns a new instance of the receiving class" and later "You must use an init... method to complete the initialization process." Then we see "- init", which is described as "Implemented by subclasses to initialize a new object (the receiver) immediately after memory for it has been allocated." In the method description we also see that it only exists for Objective-C and not for Swift which only has "init()" to cover allocation and initialization in one fell swoop.
So, instead of using that longer process, we prefer to use that "convenience method" and make the code slightly shorter.
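The two routes can be compared side by side. This is only a sketch with made-up variable names, checking that both give an NSXMLElement that serializes the same way:

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Convenience method, called on the NSXMLNode class.
set shortForm to current application's NSXMLNode's elementWithName:"omegat"

-- Explicit two-stage creation: allocate, then initialize.
set longForm to current application's NSXMLElement's alloc()'s initWithName:"omegat"

-- Both objects are NSXMLElement instances and serialize to the same text.
set sameClass to (shortForm's isKindOfClass:(current application's NSXMLElement)) as boolean
set sameOutput to ((shortForm's XMLString()) as text) is equal to ((longForm's XMLString()) as text)
```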
Foundation reference:
*)
set theProject to current application's NSXMLNode's documentWithRootElement:projectRoot
(*
Here again, we'll use a convenience method that does not require us to go through the alloc-init process. Now we have an XML document and its root element. We'll need a few more things to make our output look like what OmegaT needs.
Foundation reference:
*)
theProject's setCharacterEncoding:"UTF-8"
(*
"setCharacterEncoding" is not documented in the Foundation documentation. The closest thing related to an NSXMLDocument that we have is "characterEncoding" (notice the case for "Character": lower case in the documentation, upper case here), which is described as an "instance property", which seems to correspond since theProject is indeed an instance of NSXMLDocument.
The "set" prefix is explained in "Everyday AppleScriptObjC":
"To set a new value for a Cocoa property, assuming it is not read-only, you use the word set, followed by the property name with the  first letter in uppercase, followed by a colon or underscore and parentheses, depending on the syntax.  The single argument is then the proposed new value."
Now we understand why "set" is prefixed and why characterEncoding is changed to CharacterEncoding when prefixed with set. If we need to get the characterEncoding property of the document, let's not forget about the letter case...
Setting characterEncoding happens to be enough to create the <?xml version="1.0" encoding="UTF-8"?> line in our example and to actually encode the data in UTF-8.
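Here is a minimal sketch of the setter/getter pair, on a throwaway document (variable names are only for this example):

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

set rootElement to current application's NSXMLNode's elementWithName:"omegat"
set theDoc to current application's NSXMLNode's documentWithRootElement:rootElement

-- Setter: "set" + the property name with its first letter in upper case.
theDoc's setCharacterEncoding:"UTF-8"

-- Getter: the property name exactly as documented, with empty parentheses.
set currentEncoding to (theDoc's characterEncoding()) as text
```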
Foundation reference:
*)
theProject's setStandalone:true
(*
setStandalone works like setCharacterEncoding and adds the standalone="yes" part to our xml declaration.
I had the following line in a previous version of this code:
theProject's setDocumentContentKind:(current application's NSXMLDocumentXMLKind)

but we can dispense with it since XML is the default kind of XML document (other kinds include HTML, XHTML and text). Still there is something in this line that we ought to remember for other occasions. "documentContentKind" is an instance property, like "standalone". To set it we must thus use "setDocumentContentKind". The possible values for a documentContentKind are documented as "enumerations", of which NSXMLDocumentXMLKind is the default value in the case of an XML document. To use NSXMLDocumentXMLKind as a value, we must do as we've done for other Cocoa items: call them from the current application, hence the (current application's NSXMLDocumentXMLKind).

Foundation reference:
*)
set project to current application's NSXMLNode's elementWithName:"project"
set projectVersion to current application's NSXMLNode's attributeWithName:"version" stringValue:"1.0"
project's addAttribute:projectVersion
projectRoot's addChild:project
(*
We're still in NSXMLNode territory here. Now we're creating the "project" element with a "version" attribute that has a "1.0" value.
To do this, we first create the element and we then separately create an attribute node by using the "attributeWithName:stringValue:" method (see the Xcode description) that actually comes in two parts: the attributeWithName part and the stringValue part.

Once created, the 2 nodes have no relation to each other or to anything we've created so far. We need to "link" everything together now and we do that with the 2 lines that follow:
"addAttribute" is documented as an instance method of NSXMLElement, which is good because "project" is an instance of NSXMLElement (a subclass of NSXMLNode), and it "adds an attribute node to the receiver", which is exactly what we're trying to do. The parameter is "projectVersion", the attribute node that we created 1 line before that.

Now we have an element that looks like <project version="1.0"></project> and we need to add it as a child of the document root element. That's what "addChild" does: an instance method that "adds a child node after the last of the receiver’s existing children". The receiver is projectRoot and the child is project.
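To check that the linking worked, we can read the structure back. This is a sketch on a stand-alone copy of the same three nodes. The variable names differ from the handler's, and "attributeForName:" and "childCount" are NSXMLElement/NSXMLNode calls that the handler itself does not use:

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

set rootElement to current application's NSXMLNode's elementWithName:"omegat"
set projectElement to current application's NSXMLNode's elementWithName:"project"
set versionAttribute to current application's NSXMLNode's attributeWithName:"version" stringValue:"1.0"

-- Link the attribute to its element, then the element to the root.
projectElement's addAttribute:versionAttribute
rootElement's addChild:projectElement

-- Read everything back.
set versionValue to ((projectElement's attributeForName:"version")'s stringValue()) as text
set childTotal to (rootElement's childCount()) as integer
set serialized to (rootElement's XMLString()) as text
```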
Foundation reference:
*)
repeat with child in project_tags
set valueindex to valueindex + 1
if contents of child is not "source_dir_excludes" then
set child to (current application's NSXMLNode's elementWithName:child stringValue:(item valueindex of ProjectSettings))
(project's addChild:child)
else
set child to (current application's NSXMLNode's elementWithName:"source_dir_excludes")
(project's addChild:child)
set source_dir_excludes to (project's elementsForName:"source_dir_excludes")'s firstObject()
repeat with mask in masks
set mask to (current application's NSXMLNode's elementWithName:"mask" stringValue:mask)
(source_dir_excludes's addChild:mask)
end repeat
end if
end repeat
(*
Now that we've been through the basics of using Foundation to create a basic XML structure, the rest of the code is very straightforward. We're running a loop on the tag list that was created at the beginning of the script and for each "child" we'll set a "stringValue" that corresponds to the item at the same position in the ProjectSettings that has been sent to this handler.
Once the element and its string value are created, we add the whole as a child to the "project" element that we created just above.
The only exception to that process is when we bump into "source_dir_excludes". Here, what happens is that we do not use stringValue  to set the element value since it is going to be made of a list of <mask> tags. Instead of that, we first create the tags and add them one by one as children to "source_dir_excludes". When we're done with that special content, we resume the loop and deal with the other tags.
How we identify "source_dir_excludes" within the list of existing tags is interesting. We use elementsForName, an instance method for project that returns an array of all the elements with the name given as the parameter (here "source_dir_excludes"), and since we only have one here, we specify that we want to work with that array's "firstObject". That way we create a reference to the NSXMLElement "source_dir_excludes" that we can later use to add children to it. It took me a while to figure that out. Thank you Shane :)
The creation of the list of <mask> tags is straightforward and does not introduce any new concept.
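A possible simplification, given here only as a sketch: the variable that received the new element is already a reference to it, so children can be added to it directly, without looking it up by name. The snippet below uses two of the masks and made-up variable names, then checks that elementsForName returns that very same node:

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

set projectElement to current application's NSXMLNode's elementWithName:"project"
set excludesElement to current application's NSXMLNode's elementWithName:"source_dir_excludes"
projectElement's addChild:excludesElement

-- Add the <mask> children through the reference we already hold.
repeat with maskPattern in {"**/.svn/**", "**/.DS_Store"}
	set maskElement to (current application's NSXMLNode's elementWithName:"mask" stringValue:(contents of maskPattern))
	(excludesElement's addChild:maskElement)
end repeat

-- Looking the element up by name gives back the same node, children included.
set foundElement to (projectElement's elementsForName:"source_dir_excludes")'s firstObject()
set sameNode to (foundElement's isEqual:excludesElement) as boolean
set maskTotal to (foundElement's childCount()) as integer
```

The lookup by name remains useful when the element was created elsewhere and no variable points to it any more.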
Foundation reference:
*)
set theData to theProject's XMLDataWithOptions:((get current application's NSXMLNodePrettyPrint) + (get current application's NSXMLDocumentTidyXML))
theData's writeToFile:(item 16 of ProjectSettings) options:(current application's NSDataWritingAtomic) |error|:(missing value)
(*
We're reaching the end of this XML generation code. Now we need to output the data, in readable form.

"XMLDataWithOptions" is a method that returns an NSData object and the options are listed under "NSXMLNodeOptions". The description says, "One or more options (bit-OR'd if multiple) to affect the output of the document..." where "bit-OR'd if multiple" can be read here as "added": each option is a distinct bit flag, so adding two different options gives the same number as OR-ing them (adding the same option twice would not). Another thing to notice is that the second and following options (if any) seem to need a "get". Adding the "get" to the first option does not seem to be necessary.

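The options we use here are: "NSXMLNodePrettyPrint" and "NSXMLDocumentTidyXML". The first "prints this node with extra space for readability", the second "changes malformed XML into valid XML during processing of the document." As far as I can tell, the documentation groups the second one with the options used when reading a document, so it may well have no effect on the output here.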
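To see what the pretty print option changes, here is a sketch that serializes the same small tree with and without it. It uses "XMLString" and "XMLStringWithOptions:", the string counterparts of the data method used in the handler, and made-up variable names:

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

set rootElement to current application's NSXMLNode's elementWithName:"omegat"
set projectElement to current application's NSXMLNode's elementWithName:"project"
rootElement's addChild:projectElement

-- Default output: everything on one line.
set plainXML to (rootElement's XMLString()) as text

-- Pretty printed output: child elements on their own indented lines.
set prettyXML to (rootElement's XMLStringWithOptions:(current application's NSXMLNodePrettyPrint)) as text
```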
"writeToFile:options:error:" is a three part instance method of NSData, which theData is. writeToFile requires a Unix absolute path, options work the same way as defined the line before: "NSDataWritingAtomic" is "A hint to write data to an auxiliary file first and then exchange the files." And error gives us information about errors while using the method. To avoid naming conflicts with AppleScript's "error", it is put between vertical bars: |error|. The description of "error" says we can use a "NULL" value, for which the AppleScript equivalent is missing value (Everyday AppleScriptObjC, p. 52 "The terms nil and its close relative NULL are used commonly in Cocoa.  They are essentially the equivalent of AppleScript’s missing value, and the scripting bridge converts between them.")
Foundation reference:
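Passing missing value means that a failed write goes unnoticed. A possible refinement, sketched here with a hypothetical "writeData" handler and a test file in the temporary items folder, is to pass "reference" for the error parameter. The call then returns a list made of the method's result and the error object (or missing value):

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical helper: write an NSData object to a POSIX path, raise an AppleScript error on failure.
on writeData(theData, posixPath)
	set {didWrite, writeError} to theData's writeToFile:posixPath options:(current application's NSDataWritingAtomic) |error|:(reference)
	if not (didWrite as boolean) then error ((writeError's localizedDescription()) as text)
	return posixPath
end writeData

-- Build a tiny document and write it to the temporary items folder.
set rootElement to current application's NSXMLNode's elementWithName:"omegat"
set theDoc to current application's NSXMLNode's documentWithRootElement:rootElement
set xmlData to theDoc's XMLDataWithOptions:(current application's NSXMLNodePrettyPrint)
set outputPath to (POSIX path of (path to temporary items)) & "omegat-example.project"
my writeData(xmlData, outputPath)
```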
*)
end CreateOmegaTProjectFile

(*
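Let's now test the code with the following placeholder values. The second item ("MASKS") is only there to keep the positions aligned, since the handler builds the masks from its own list, and the 16th item is the POSIX path of the file to write.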
*)

set ProjectSettings to {"PATH_1", "MASKS", "PATH_2", "PATH_3", "PATH_4", "FILE_5", "PATH_6", "LANG_7", "LANG_8", "TOKENIZER_9", "TOKENIZER_10", "SEGMENTATION_11", "DEFAULT_TRANSLATION_12", "REMOVE_TAGS_13", "EXTERNAL_COMMAND_14", ((POSIX path of (path to desktop folder)) & "omegat.project")}


my CreateOmegaTProjectFile(ProjectSettings)