Integrated Data Article

Concept

Each article element is structured this way:

name (parameters) "Content";

Every article consists of at least one element. With no input, the parser output is one anonymous element with no parameters and no content.

Structure

The first element decoded is considered top-level, a distinction which has some effect on compatibility features. From there, various choices are available for building structure:

Elements adjacent to a top-level element are also top-level. Elements within a set of parameters or within list content are no longer top-level.

For example:

interval (start "2022-01-01",
            end "2023-12-31");
eclipse
{
	solar (type partial) "2022-10-25"
	lunar (type total) "2022-11-07"
	solar (type total) "2023-04-20"
	lunar (type penumbral) "2023-05-06"
	solar (type annular) "2023-10-14"
	lunar (type partial) "2023-10-28"
}

The elements named interval and eclipse are top-level and adjacent. The interval has no content, but has parameters start and end each with string content. The eclipse has no parameters, and has list content with six items, each element having its own name, parameter and string content.

Rich data structures can be created by using the self-similarity of article elements with their parameters and list items.
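The self-similar shape described above can be pictured with a small in-memory model. This is only a sketch: the `Element` class, its field names and the `param` helper are hypothetical and are not part of any IDA parser API. It rebuilds the interval/eclipse example by hand rather than by parsing.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Element:
    """Hypothetical model of one article element (not a real IDA API)."""
    name: Optional[str] = None  # None models an anonymous element
    parameters: List["Element"] = field(default_factory=list)
    content: Union[None, str, List["Element"]] = None  # string or list content


def param(name: str, value: str) -> Element:
    """A parameter is itself an element: a name with string content."""
    return Element(name=name, content=value)


# interval (start "2022-01-01", end "2023-12-31");
interval = Element(
    "interval",
    parameters=[param("start", "2022-01-01"), param("end", "2023-12-31")],
)

# eclipse { solar (type partial) "2022-10-25" ... }
eclipse = Element("eclipse", content=[
    Element("solar", [param("type", "partial")], "2022-10-25"),
    Element("lunar", [param("type", "total")], "2022-11-07"),
    Element("solar", [param("type", "total")], "2023-04-20"),
    Element("lunar", [param("type", "penumbral")], "2023-05-06"),
    Element("solar", [param("type", "annular")], "2023-10-14"),
    Element("lunar", [param("type", "partial")], "2023-10-28"),
])

# Two adjacent top-level elements.
article = [interval, eclipse]

# With no input, the output is one anonymous element with nothing in it.
empty_article = [Element()]
```

Parameters and list items reuse the same class, which is what allows nesting to any depth.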

Terminology

This document is structured such that most specific terms are introduced naturally prior to appearing within a description of something else. For example, this section has supplied the specific terms element, name, parameters, content, anonymous, top-level and adjacent.

A few supplementary terms will be needed, all of which describe a type of range:

Below X        Strictly less than X
Above X        Strictly greater than X
Up to X        Less than or equal to X
Beyond X       Greater than or equal to X
X until Y      Beyond X and below Y
X through Y    Beyond X and up to Y
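As a quick illustration, the range terms can be written as integer predicates. The function names below are hypothetical and exist only for this sketch.

```python
from typing import Callable

Predicate = Callable[[int], bool]


def below(x: int) -> Predicate:
    return lambda v: v < x        # strictly less than x


def above(x: int) -> Predicate:
    return lambda v: v > x        # strictly greater than x


def up_to(x: int) -> Predicate:
    return lambda v: v <= x       # less than or equal to x


def beyond(x: int) -> Predicate:
    return lambda v: v >= x       # greater than or equal to x


def until(x: int, y: int) -> Predicate:
    return lambda v: beyond(x)(v) and below(y)(v)    # half-open range


def through(x: int, y: int) -> Predicate:
    return lambda v: beyond(x)(v) and up_to(y)(v)    # closed range
```

In this reading, "U+0080 until U+00A0" excludes U+00A0 itself, while "C2 through F4" includes both end points.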

Syntax

Much of the IDA syntax can be omitted. The parser will advance the flow to match what it finds, potentially creating the next element. Whether to omit syntax is a matter of which form reads better.

Comments

#           Line comment
//          Line comment
/* */       Block comment
<? ?>       Block comment (XML compatibility)
<!-- -->    Block comment (XML compatibility)

Flow

;                     Object-style element delimiter
,                     Array-style element delimiter
, (at top-level)      Element delimiter (CSV compatibility)
=                     Name/content separator
: (after name)        Name/content separator (JSON compatibility)
< />                  Null-content element (XML compatibility)
< >                   Begin list-content element (XML compatibility)
</ >                  End list-content element (XML compatibility)
( )                   Parameters
* (before element)    Pointer identifier

Content

{ }    Object-style list
[ ]    Array-style list
" "    String (may contain ' and ` )
' '    String (may contain ` and " )
` `    String (may contain " and ' )
&      Pointer reference
\      Escape sequence
+      True
-      False
?      Null

Abstraction

::    Class definition
:     Class instance
!     Template definition
@     Template instance
@@    Include template file (required)
@?    Include template file (optional)

Other

^      As-above (name or content)
(^)    As-above (parameters)
^^     As-above (name/parameters or parameters/content)
$      Definition pending (configuration file compatibility)
~      Parser instruction

Spacing

Spaces (20), ASCII tabs and new lines (09, 0A, 0D) are accepted as white-space. At top-level and with CSV compatibility enabled, the line-feed 0A also delimits the lists which represent each row.

Character sequences

IDA provides two means of writing arbitrary characters, both of which can be used for element names and for string content. In this case, a character is one Unicode code point regardless of the markup used to convey it.

Each character must represent a valid Unicode code point. Invalid code points are:

If content is to be interpreted as raw bytes then it must not contain characters beyond U+0100. Any characters can be used in element names and when interpreting content as a string.
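A minimal sketch of the raw-byte rule, assuming decoded content is held as a Python string with one character per code point. The function name `to_raw_bytes` is hypothetical.

```python
def to_raw_bytes(content: str) -> bytes:
    """Interpret decoded content as raw bytes, one byte per code point."""
    for ch in content:
        if ord(ch) >= 0x100:  # "beyond U+0100" means U+0100 or greater
            raise ValueError(f"U+{ord(ch):04X} cannot be interpreted as a raw byte")
    # Latin-1 maps U+0000 through U+00FF to bytes 00 through FF one-to-one.
    return content.encode("latin-1")
```

For instance, `to_raw_bytes("\x85")` yields the single byte 85, whereas a string containing U+2026 is rejected.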

Escapes

The \ syntax will start an escape sequence, the length of which depends on the escape code used. The escape codes are:

\ (space)    20    Space literal
\0           00    Zero-byte
\n           0A    New line (LF)
\r           0D    New line (CR)
\t           09    Tab
\x                 Byte, or 1-byte Unicode character
\u                 2-byte Unicode character
\U                 4-byte Unicode character
\" \' \`           Quotation mark literal

\x, \u and \U are all given in case-insensitive hexadecimal, with two hexadecimal digits per byte. For example:

Using \u or \U to name an invalid code point will cause a parse error.
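The escape rules can be sketched as a small decoder. Several points here are assumptions rather than documented behaviour: the space literal is read as a backslash followed by a space; "invalid code point" is taken to mean a surrogate (U+D800 through U+DFFF) or anything above U+10FFFF; and unknown or truncated escapes are rejected. `ParseError` and `decode_escapes` are hypothetical names.

```python
import string

# Single-character escape codes and the character each one produces.
SIMPLE = {" ": " ", "0": "\x00", "n": "\n", "r": "\r", "t": "\t",
          '"': '"', "'": "'", "`": "`"}
# Two hexadecimal digits per byte: \x = 1 byte, \u = 2 bytes, \U = 4 bytes.
HEX_DIGITS = {"x": 2, "u": 4, "U": 8}


class ParseError(ValueError):
    """Hypothetical error type for this sketch."""


def decode_escapes(text: str) -> str:
    out = []
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out.append(text[i])
            i += 1
            continue
        if i + 1 >= len(text):
            raise ParseError("escape sequence cut short")
        code = text[i + 1]
        if code in SIMPLE:
            out.append(SIMPLE[code])
            i += 2
        elif code in HEX_DIGITS:
            n = HEX_DIGITS[code]
            digits = text[i + 2:i + 2 + n]
            if len(digits) != n or any(d not in string.hexdigits for d in digits):
                raise ParseError(f"\\{code} needs {n} hexadecimal digits")
            cp = int(digits, 16)  # int() accepts upper and lower case
            # Assumption: surrogates and values above U+10FFFF are invalid.
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ParseError(f"invalid code point U+{cp:04X}")
            out.append(chr(cp))
            i += 2 + n
        else:
            raise ParseError(f"unknown escape code \\{code}")
    return "".join(out)
```

Under these assumptions, `decode_escapes(r"\u2026")` and `decode_escapes(r"\U00002026")` both give the single character U+2026, while `decode_escapes(r"\uD800")` raises `ParseError`.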

UTF-8

Articles may contain valid UTF-8 sequences directly for use as text characters. A UTF-8 sequence is a leading byte of C2 through F4 followed by continuation bytes of 80 through BF.

In accordance with the UTF-8 specification, any invalid UTF-8 will cause a parse error. Each UTF-8 sequence must be a leading byte with the correct number of continuation bytes, must not be an overlong encoding, and must represent a valid code point.

Contrary to a Unicode recommendation, errors will not be replaced with U+FFFD, the replacement character. Editing an article that way is an explicit non-goal of the parser, and making a genuine U+FFFD character indistinguishable from a UTF-8 error is unacceptable in a data markup scenario.
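The strict behaviour described above resembles Python's built-in UTF-8 codec in strict mode, which also rejects truncated sequences, overlong encodings and invalid code points instead of substituting U+FFFD. The sketch below is an analogy only, not the IDA parser; `decode_article_bytes` is a hypothetical name.

```python
def decode_article_bytes(data: bytes) -> str:
    """Decode UTF-8 strictly: any malformed sequence is an error."""
    try:
        # errors="strict" never inserts U+FFFD; it raises instead.
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8 at byte offset {exc.start}") from exc
```

Because nothing is ever substituted, a U+FFFD in the decoded text can only have come from a genuine EF BF BD sequence in the input.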

Note that UTF-8 output is an option which is disabled by default as a precaution. If enabled, all displayable \x, \u and \U escape sequences will be replaced with UTF-8. This should be mostly safe, given that most ASCII-only transfers which may need to carry non-ASCII data go via Base64 by now. ♬🎲.

Unicode is non-displayable for characters U+0080 until U+00A0, at which point it converges with extended ASCII (A0). For example, extended ASCII 85 is the horizontal ellipsis, but Unicode U+0085 is the Next Line control code. IDA uses Unicode, where the actual ellipsis character U+2026 can be written with the escape sequence \u2026 or with the UTF-8 sequence E2 80 A6. The escape sequence \x85 or the UTF-8 sequence C2 85 will produce Next Line for a string or 85 for content interpreted as raw bytes.
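The byte values quoted above can be checked with Python's standard library. This sketch assumes that "extended ASCII" refers to Windows-1252, which is the code page where byte 85 is the ellipsis; `describe` is a hypothetical helper.

```python
import unicodedata

ELLIPSIS = "\u2026"   # the actual ellipsis character
NEXT_LINE = "\x85"    # U+0085, a control code


def describe(ch: str):
    """Return (Unicode name, general category, UTF-8 bytes in hex)."""
    return (
        unicodedata.name(ch, "<unnamed>"),       # control codes have no name
        unicodedata.category(ch),                # "Cc" marks a control code
        ch.encode("utf-8").hex(" ").upper(),
    )


print(describe(ELLIPSIS))    # ('HORIZONTAL ELLIPSIS', 'Po', 'E2 80 A6')
print(describe(NEXT_LINE))   # ('<unnamed>', 'Cc', 'C2 85')

# Assumption: "extended ASCII" means Windows-1252 in this comparison.
print(bytes([0x85]).decode("cp1252") == ELLIPSIS)   # True

# Interpreted as raw bytes, U+0085 is the single byte 85.
print(NEXT_LINE.encode("latin-1").hex().upper())    # 85
```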

Strings

Outside of the three explicit string delimiters "'`, various bytes may provide the parser with a string implicitly. Whether this is permissible depends on the byte status, as follows:

First-class

The first-class bytes are those which never constitute article syntax alone. They may appear anywhere throughout an implicit string, both for element names and for string content.

First-class bytes are:

Second-class

A second-class byte has some non-string meaning if encountered as a prefix (or alone) and therefore cannot begin an implicit string, but may appear throughout a string which already has a leading first-class byte.

Second-class bytes are:

Third-class

A third-class byte has additional non-string meaning if encountered as a suffix, so may neither lead nor trail an implicit string, but otherwise may appear within a string.

The only third-class byte is -.

Numbers

To-do. Briefly:

Invalid input

At present, the bytes % and | are only valid within an explicit string. % is very likely to be released as a first-class string byte, in particular due to its potential value as a unit string for numbers which represent a percentage.

The following bytes are reserved and should not appear in a normal article. Most will cause a parse error.