IDA Markup

Concept

Each article element is structured this way:

name (parameters) "Content";

Every article consists of at least one element. With no input, the parser output is one anonymous element with no parameters and no content.

Structure

The first element decoded is considered top-level, a distinction which has some effect on compatibility features. From there, various choices are available for building structure:

Each parameter is also an element, potentially having parameters of its own.
The content may be a list which contains other elements, potentially also having list content.
The article may contain additional adjacent elements.

Elements adjacent to a top-level element are also top-level. Elements within a set of parameters or within list content are no longer top-level.

For example:

interval (start "2022-01-01",
            end "2023-12-31");
eclipse
{
	solar (type partial) "2022-10-25"
	lunar (type total) "2022-11-07"
	solar (type total) "2023-04-20"
	lunar (type penumbral) "2023-05-06"
	solar (type annular) "2023-10-14"
	lunar (type partial) "2023-10-28"
}

The elements named interval and eclipse are top-level and adjacent. The interval has no content, but has parameters start and end each with string content. The eclipse has no parameters, and has list content with six items, each element having its own name, parameter and string content.

Rich data structures can be created by using the self-similarity of article elements with their parameters and list items.
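
To make the self-similarity concrete, the example article can be modelled in memory. The sketch below is Python, not part of any IDA library: the Element class, its field names and the eclipse_item helper are hypothetical, and it assumes that a parameter such as type partial is an element named type with string content partial.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class Element:
    """Hypothetical model of one article element (not the IDA API)."""
    name: Optional[str] = None  # None models an anonymous element
    parameters: list[Element] = field(default_factory=list)
    content: Union[None, str, list[Element]] = None


def eclipse_item(kind: str, eclipse_type: str, date: str) -> Element:
    # One list item: a name, a single "type" parameter and string content.
    return Element(kind, [Element("type", content=eclipse_type)], date)


# interval (start "2022-01-01", end "2023-12-31");
interval = Element("interval", parameters=[
    Element("start", content="2022-01-01"),
    Element("end", content="2023-12-31"),
])

# eclipse { ... } with six list items
eclipse = Element("eclipse", content=[
    eclipse_item("solar", "partial", "2022-10-25"),
    eclipse_item("lunar", "total", "2022-11-07"),
    eclipse_item("solar", "total", "2023-04-20"),
    eclipse_item("lunar", "penumbral", "2023-05-06"),
    eclipse_item("solar", "annular", "2023-10-14"),
    eclipse_item("lunar", "partial", "2023-10-28"),
])

# The two adjacent top-level elements of the article.
article = [interval, eclipse]
```

Parameters and list items reuse the same class as the top-level elements, which is the property the paragraph above describes.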

Terminology

This document is structured such that most specific terms are introduced naturally prior to appearing within a description of something else. For example, this section has supplied the specific terms element, name, parameters, content, anonymous, top-level and adjacent.

A few supplementary terms will be needed, all of which describe a type of range:

Below `X`	Strictly less than `X`
Above `X`	Strictly greater than `X`
Up to `X`	Less than or equal to `X`
Beyond `X`	Greater than or equal to `X`
`X` until `Y`	Beyond `X` and below `Y`
`X` through `Y`	Beyond `X` and up to `Y`
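
The six range terms map directly onto comparison operators. A minimal Python sketch, with function names chosen here purely for illustration:

```python
def below(x):
    return lambda v: v < x        # strictly less than x


def above(x):
    return lambda v: v > x        # strictly greater than x


def up_to(x):
    return lambda v: v <= x       # less than or equal to x


def beyond(x):
    return lambda v: v >= x       # greater than or equal to x


def until(x, y):
    return lambda v: x <= v < y   # beyond x and below y


def through(x, y):
    return lambda v: x <= v <= y  # beyond x and up to y
```

Note that until excludes its upper bound while through includes it, so 0xD800 until 0xE000 stops at 0xDFFF.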

Syntax

Much of the IDA syntax is optional. The parser advances the flow to match whatever it finds, creating the next element where necessary. Whether to omit optional syntax is therefore a matter of readability.

Comments
#	Line comment
//	Line comment
/* */	Block comment
<? ?>	Block comment (XML compatibility)
<!-- -->	Block comment (XML compatibility)
Flow
;	Object-style element delimiter
,	Array-style element delimiter
, (at top-level)	Element delimiter (CSV compatibility)
=	Name/content separator
: (after name)	Name/content separator (JSON compatibility)
< />	Null-content element (XML compatibility)
< >	Begin list-content element (XML compatibility)
</ >	End list-content element (XML compatibility)
( )	Parameters
* (before element)	Pointer identifier
Content
{ }	Object-style list
[ ]	Array-style list
" "	String (may contain ' and `)
' '	String (may contain ` and ")
` `	String (may contain " and ')
&	Pointer reference
\	Escape sequence
+	True
-	False
?	Null
Abstraction
::	Class definition
:	Class instance
!	Template definition
@	Template instance
@@	Include template file (required)
@?	Include template file (optional)
Other
^	As-above (name or content)
(^)	As-above (parameters)
^^	As-above (name/parameters or parameters/content)
$	Definition pending (configuration file compatibility)
~	Parser instruction

Spacing

Spaces (20), ASCII tabs (09) and new lines (0A, 0D) are accepted as white-space. At top-level and with CSV compatibility enabled, the line-feed 0A also delimits the lists which represent each row.

Character sequences

IDA provides two means of writing arbitrary characters, both of which can be used for element names and for string content. In this case, a character is one Unicode code point regardless of the markup used to convey it.

Each character must represent a valid Unicode code point. Invalid code points are:

U+D800 until U+E000 (the UTF-16 surrogates).
Beyond U+110000 (past the range that a UTF-16 surrogate pair can reach).

If content is to be interpreted as raw bytes then it must not contain characters beyond U+0100. Any characters can be used in element names and when interpreting content as a string.
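
The two rules above can be expressed as a pair of checks. This is a Python sketch with hypothetical function names, not the parser's own code:

```python
def is_valid_code_point(cp: int) -> bool:
    """Apply the two exclusions listed above to a candidate code point."""
    if cp < 0:
        return False
    if 0xD800 <= cp < 0xE000:  # U+D800 until U+E000
        return False
    if cp >= 0x110000:         # beyond U+110000
        return False
    return True


def usable_as_raw_bytes(text: str) -> bool:
    """True if every character is below U+0100, so each maps to one byte."""
    return all(ord(ch) < 0x100 for ch in text)
```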

Escapes

The \ syntax will start an escape sequence, the length of which depends on the escape code used. The escape codes are:

\ (followed by a space)	`20` Space literal
\0	`00` Zero-byte
\n	`0A` New line (LF)
\r	`0D` New line (CR)
\t	`09` Tab
\x	Byte, or 1-byte Unicode character
\u	2-byte Unicode character
\U	4-byte Unicode character
\" \' \`	Quotation mark literal

\x, \u and \U are each followed by case-insensitive hexadecimal digits, two per byte. For example:

\x20, \u0020 and \U00000020 each refer to the byte 20, or a literal space character.
\xe4 is the byte E4 if read as binary data, or Unicode U+00E4 ä if read as a string.
\u266c is Unicode U+266C ♬.
\U0001f3b2 is Unicode U+1F3B2 🎲.

Using \u or \U to name an invalid code point will cause a parse error.
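
As an illustration, the escape codes in the table can be decoded with a short loop. The Python sketch below is not the IDA parser: the names decode_escapes and EscapeError are hypothetical, it always produces a string rather than raw bytes, and it assumes that any code not listed in the table is an error.

```python
import string

# Fixed escapes from the table: code character -> decoded character.
SIMPLE_ESCAPES = {
    " ": " ", "0": "\x00", "n": "\n", "r": "\r", "t": "\t",
    '"': '"', "'": "'", "`": "`",
}
# Number of hexadecimal digits expected after \x, \u and \U.
HEX_DIGITS = {"x": 2, "u": 4, "U": 8}


class EscapeError(ValueError):
    """Hypothetical stand-in for an IDA parse error."""


def decode_escapes(text: str) -> str:
    out = []
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out.append(text[i])
            i += 1
            continue
        if i + 1 >= len(text):
            raise EscapeError("escape sequence is cut short")
        code = text[i + 1]
        i += 2
        if code in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[code])
        elif code in HEX_DIGITS:
            width = HEX_DIGITS[code]
            digits = text[i:i + width]
            if len(digits) != width or any(d not in string.hexdigits for d in digits):
                raise EscapeError(f"\\{code} needs {width} hexadecimal digits")
            cp = int(digits, 16)
            # Surrogates and values past the Unicode range are invalid.
            if 0xD800 <= cp < 0xE000 or cp >= 0x110000:
                raise EscapeError(f"invalid code point U+{cp:04X}")
            out.append(chr(cp))
            i += width
        else:
            # Assumption: codes outside the table are rejected.
            raise EscapeError(f"unknown escape code {code!r}")
    return "".join(out)
```

For example, decode_escapes(r"\u266c") yields the single character U+266C, while decode_escapes(r"\ud800") raises EscapeError because it names a surrogate.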

UTF-8

Articles may contain valid UTF-8 sequences directly for use as text characters. A UTF-8 sequence is a leading byte of C2 through F4 followed by one to three continuation bytes of 80 through BF.

In accordance with the UTF-8 specification, any invalid UTF-8 will cause a parse error. Each UTF-8 sequence must be a leading byte with the correct number of continuation bytes, must not be an overlong encoding, and must represent a valid code point.

Contrary to a Unicode recommendation, errors will not be replaced with U+FFFD, the replacement character �. Editing an article that way is an explicit non-goal of the parser, and making a genuine U+FFFD character indistinguishable from a UTF-8 error is unacceptable in a data markup scenario.
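
The same strictness can be observed with Python's built-in UTF-8 codec, which in strict mode also rejects truncated sequences, overlong encodings and invalid code points. This is only an analogy for the behaviour described above, not the IDA parser itself, and the helper name is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes under Python's strict UTF-8 rules."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# A lenient decoder substitutes U+FFFD, after which the result cannot be
# told apart from input that genuinely contained U+FFFD.
lenient = b"\xc0\xa0".decode("utf-8", errors="replace")
genuine = "\ufffd".encode("utf-8").decode("utf-8")
```

Here is_valid_utf8 rejects C0 A0 (overlong), ED A0 80 (a surrogate), F4 90 80 80 (past the Unicode range) and E2 99 (a missing continuation byte), while both lenient and genuine end up containing U+FFFD.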

Note that UTF-8 output is an option which is disabled by default as a precaution. If enabled, all displayable \x, \u and \U escape sequences will be replaced with UTF-8. This should be mostly safe, given that most ASCII-only transfers which may need to carry non-ASCII data go via Base64 by now. ♬🎲.

Unicode is non-displayable for characters U+0080 until U+00A0, where it converges with extended ASCII A0. For example, extended ASCII 85 is the horizontal ellipsis …, but Unicode U+0085 is the Next Line control code. IDA uses Unicode, where the actual ellipsis character U+2026 can be written with the escape sequence \u2026 or with the UTF-8 sequence E2 80 A6. The escape sequence \x85 or the UTF-8 sequence C2 85 will produce Next Line for a string or 85 for content interpreted as raw bytes.
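
The byte values in this paragraph can be checked with Python's standard codecs. The sketch assumes that "extended ASCII" refers to Windows-1252 (cp1252), and it uses Latin-1 encoding to model content interpreted as raw bytes.

```python
import unicodedata

# The real ellipsis, U+2026, as UTF-8 bytes.
ellipsis_utf8 = "\u2026".encode("utf-8")

# The UTF-8 sequence C2 85 decodes to U+0085, a control character.
next_line = bytes.fromhex("C285").decode("utf-8")
next_line_category = unicodedata.category(next_line)  # "Cc" means control

# Byte 85 read as Windows-1252 is the ellipsis (assumed legacy encoding).
legacy_ellipsis = bytes.fromhex("85").decode("cp1252")

# U+0085 read back as a raw byte gives 85 again.
raw_byte = next_line.encode("latin-1")
```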

Strings

Outside of the three explicit string delimiters "'`, various bytes may provide the parser with a string implicitly. Whether this is permissible depends on the byte status, as follows:

First-class

The first-class bytes are those which never constitute article syntax alone. They may appear anywhere throughout an implicit string, both for element names and for string content.

First-class bytes are:

A through Z
a through z
_
/ , except as // (line comment)
Escape sequences
UTF-8 sequences

Second-class

A second-class byte has some non-string meaning if encountered as a prefix (or alone) and therefore cannot begin an implicit string, but may appear throughout a string which already has a leading first-class byte.

Second-class bytes are:

0 through 9
. , ! , @ , ^ , & and ?

Third-class

A third-class byte has additional non-string meaning if encountered as a suffix, so may neither lead nor trail an implicit string, but otherwise may appear within a string.

The only third-class byte is -.
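
The three classes suggest a simple test for whether a piece of text could be written without quotes. The Python sketch below is a simplification under stated assumptions: it ignores escape sequences, treats every non-ASCII character as a first-class UTF-8 sequence, and rejects // anywhere in the text. The function names are hypothetical.

```python
import string

FIRST_CLASS = set(string.ascii_letters) | {"_", "/"}
SECOND_CLASS = set(string.digits) | set(".!@^&?")
THIRD_CLASS = {"-"}


def byte_class(ch: str) -> int:
    """Return 1, 2 or 3 for the classes above, or 0 if not allowed."""
    if ch in FIRST_CLASS or ord(ch) >= 0x80:
        return 1
    if ch in SECOND_CLASS:
        return 2
    if ch in THIRD_CLASS:
        return 3
    return 0


def can_be_implicit(text: str) -> bool:
    """Could text be written as an implicit (unquoted) string?"""
    if not text or "//" in text:
        return False
    classes = [byte_class(ch) for ch in text]
    if 0 in classes:
        return False   # contains a byte outside the three classes
    if classes[0] != 1:
        return False   # only a first-class byte may lead
    if classes[-1] == 3:
        return False   # a third-class byte may not trail
    return True
```

Under these rules partial and well-known qualify, whereas 2022-10-25 does not because it begins with a digit, which is consistent with the dates being quoted in the earlier example.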

Numbers

To-do. Briefly:

Precision is preserved and is considered important. For example, the numbers 1.34 and 1.340 will convert to the same floating-point data, but their precisions are 2 and 3 (more directly, their resolutions are −2 and −3). The programmer has access to that information and may write branching logic for different precisions/resolutions, or may adjust the weight of a data point if it was known more precisely. The numbers may be the same, but the information is different.
Quotients/ratios are accepted markup. For example, if a number is genuinely five-thirds, just write 5/3 without concern for whether it should be 1.666666 to convey the recurrence or 1.666667 to be technically closer. The numerator and denominator can each use any other allowed form of number.
If a number has a unit which should be communicated, append it. This has no effect on the actual number, but the programmer will be able to read the unit string. For example, a program which operates in radians could support the unit string deg to enable an angle to be written 90deg rather than 1.5707963.
Prefix with 0x for a hexadecimal integer, or with 0b for a binary integer.
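
These points can be illustrated with Python's standard library. The sketch below is an analogy rather than the IDA number API: resolution and split_unit are hypothetical helpers, and split_unit handles only plain decimal literals followed by letters.

```python
import math
import re
from decimal import Decimal
from fractions import Fraction


def resolution(literal: str) -> int:
    """Decimal exponent of the last written digit, e.g. -2 for 1.34."""
    return Decimal(literal).as_tuple().exponent


def split_unit(literal: str):
    """Split a literal such as 90deg into (Decimal('90'), 'deg')."""
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)([A-Za-z]*)", literal)
    if match is None:
        raise ValueError(f"not a simple number with unit: {literal!r}")
    return Decimal(match.group(1)), match.group(2)


# Same floating-point value, different written precision.
same_value = float("1.34") == float("1.340")
resolutions = (resolution("1.34"), resolution("1.340"))

# A ratio keeps five-thirds exact instead of rounding it.
five_thirds = Fraction(5, 3)

# Hexadecimal and binary prefixes (base 0 lets int() read the prefix).
hex_value = int("0x1F", 0)
bin_value = int("0b101", 0)

# A program working in radians could convert when it sees the unit deg.
value, unit = split_unit("90deg")
radians = math.radians(float(value)) if unit == "deg" else float(value)
```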

Invalid input

At present, the bytes % and | are only valid within an explicit string. % is very likely to be released as a first-class string byte, in particular due to its potential value as a unit string for numbers which represent a percentage.

The following bytes are reserved and should not appear in a normal article. Most will cause a parse error.

00 through 1F (except white-space), non-displayable ASCII.
- 04 EOT will end parsing successfully. This is intended only for command-line testing, and programs should not rely on this for correctness. If you need to decode from a truncated stream, wrap or buffer it instead.
- 03 ETX is under consideration for cancelling the parse process without causing a formal error, discarding any created elements and leaving the root element unchanged as if nothing happened (other than having consumed bytes from an input stream).
7F, ASCII Del.
C0 and C1, which imply overlong UTF-8 for a normal ASCII character (below U+0080).
F5 through FD, which imply UTF-8 exceeding the valid Unicode range (below U+110000).
FE and FF, unclaimed by UTF-8.
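
The reserved set can be written as a single predicate. This Python sketch covers only the list above: it does not model the special handling of 04 EOT or 03 ETX, nor the separate restriction on % and |, and the function name is hypothetical.

```python
# White-space bytes accepted by the parser: tab, LF, CR and space.
WHITE_SPACE = {0x09, 0x0A, 0x0D, 0x20}


def is_reserved(byte: int) -> bool:
    """True if byte is in the reserved list above."""
    if byte <= 0x1F and byte not in WHITE_SPACE:
        return True   # non-displayable ASCII
    if byte == 0x7F:
        return True   # ASCII Del
    if byte in (0xC0, 0xC1):
        return True   # would imply overlong UTF-8
    if byte >= 0xF5:
        return True   # F5 through FD, plus FE and FF
    return False


reserved_count = sum(is_reserved(b) for b in range(256))
```

Counting over all 256 byte values gives 43 reserved bytes: 29 control codes, 7F, C0, C1 and the eleven bytes from F5 through FF.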