Data representations and data formats

This document provides help and links about various data representation formats.

There are roughly two kinds of data representations that we will consider: (1) Those used to write documentation and to format readable content, for example, Asciidoc/Asciidoctor, reST/Docutils, and Markdown. (2) Those used to encode data, usually intended for machine processing, for example, XML, Yaml, JSON, and CSV. We’ll discuss each of these below. We’ll also discuss several binary storage formats, in particular, HDF5 and Sqlite3.

1. Asciidoc / Asciidoctor

From the Asciidoctor Web site — "Asciidoctor is a fast, open source text processor and publishing toolchain for converting AsciiDoc content to HTML5, DocBook, PDF, and other formats."

More information: https://asciidoctor.org/

Installing Asciidoctor — There are instructions here https://asciidoctor.org/docs/install-toolchain/. Or, do the following:

First, make sure that you have a reasonably up-to-date version of Ruby installed on your machine. E.g. do: $ ruby --version.

Then install with gem:

$ gem install asciidoctor

Or, on Linux, install with your favorite package manager. For example:

$ apt-get install asciidoctor    # or, alternatively ...
$ aptitude install asciidoctor

Then, read the instructions here https://asciidoctor.org/docs/#get-started-with-asciidoctor.

Interesting features provided by Asciidoctor:

There are converters that produce each of PDF, EPUB3, and LaTeX. See Supplemental Converters.
There is a built-in backend that produces Docbook. Use $ asciidoctor -b docbook5 mydoc.txt.

Good to know:

Built-in attributes help you customize the look of the document you generate. See Built-in Attributes.
Some built-in attributes can be set either (1) in the document header or (2) from the command line. See the link above to determine which built-in attributes can only be set in the document header. To set attributes on the command line, use the -a command line option, which can be repeated. These examples customize the TOC (table of contents):
```
$ asciidoctor -a toc mydoc.txt
$ asciidoctor -a toc=left -a toclevels=4 -a sectnums mydoc.txt
$ asciidoctor -a toc=left -a toclevels=4 -a sectnums -a toc-title="The TOC" mydoc.txt
```
For help with displaying and customizing a table of contents, see Table of Contents.
For help with writing Asciidoctor content see Quick Reference and Writer’s Guide and other guides and manuals at Asciidoctor Docs.

1.1. Transforming `asciidoc` to `reST`

We can convert an Asciidoc/Asciidoctor document to reST (reStructuredText) with help from pandoc. We use asciidoctor to produce Docbook content, then use pandoc to convert the Docbook content to reST.

Notes:

On Linux, pandoc can be installed with the aptitude or the apt-get package manager.
pandoc is also available in the Anaconda distribution of Python.

The following shell scripts show how to convert an Asciidoctor document to reST:

#!/bin/sh
asciidoctor -b docbook5 -o - "$1" | pandoc -f docbook -t rst -o "$2"

Or, to write to stdout:

#!/bin/sh
asciidoctor -b docbook5 -o - "$1" | pandoc -f docbook -t rst
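The same two-stage pipeline can also be driven from Python. What follows is a minimal sketch, assuming that both asciidoctor and pandoc are on the PATH. The function names `build_commands` and `adoc_to_rst` are made up for illustration and are not part of either tool.

```python
import shutil
import subprocess


def build_commands(src_path):
    """Return the argument lists for the two pipeline stages."""
    asciidoctor_cmd = ["asciidoctor", "-b", "docbook5", "-o", "-", src_path]
    pandoc_cmd = ["pandoc", "-f", "docbook", "-t", "rst"]
    return asciidoctor_cmd, pandoc_cmd


def adoc_to_rst(src_path):
    """Convert an Asciidoctor file to reST and return the reST text."""
    for tool in ("asciidoctor", "pandoc"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    asciidoctor_cmd, pandoc_cmd = build_commands(src_path)
    # Stage 1: Asciidoctor writes DocBook 5 to stdout.
    docbook = subprocess.run(
        asciidoctor_cmd, check=True, capture_output=True
    ).stdout
    # Stage 2: pandoc reads DocBook on stdin and writes reST to stdout.
    result = subprocess.run(
        pandoc_cmd, input=docbook, check=True, capture_output=True
    )
    return result.stdout.decode("utf-8")
```

Passing each command as a list of arguments, rather than as one shell string, means that file names containing spaces need no extra quoting.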

2. reStructuredText — Docutils

Information is here: https://docutils.sourceforge.io/.

reStructuredText or reST is, like Asciidoc, a lightweight markup language.

reStructuredText and Docutils are analogous to Asciidoc and Asciidoctor. There are significant similarities and differences in the text source format. And, the tool chains are different; they are two separate implementations. You can compare the source text formats here: reStructuredText and Asciidoc.

Installation:

It’s available at the Python Package Index: https://pypi.org/project/docutils/#description. Install it with pip: $ pip install docutils.
Or, install from source — Get the source here: Docutils source code. Then, follow the instructions in the README file.

There is plenty of documentation to help you get started writing reStructuredText and using Docutils at the Documentation Overview.

3. Markdown

For information, see:

There are various implementations. See https://www.w3.org/community/markdown/wiki/MarkdownImplementations.

4. HTML

You can also write documentation directly in HTML. Of course, other tools, several of which are discussed in this document, can make this task less excruciating. Another strategy is to use a WYSIWYG editor that enables you to generate HTML, e.g. LibreOffice Writer.

5. XML

5.1. Python

Python has several tools for processing XML, for example:

ElementTree — ElementTree is in the standard Python Library. See https://docs.python.org/3/library/xml.etree.elementtree.html. Here is a sample code snippet:

[ins] In [12]: import xml.etree.ElementTree as etree
[ins] In [13]: doc = etree.parse('data_representation.xml')
[ins] In [14]: rootnode = doc.getroot()
[ins] In [15]: print('root tag:', rootnode.tag)
root tag: {http://docbook.org/ns/docbook}article
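The root tag above is namespace-qualified, so searches with ElementTree need a prefix-to-namespace mapping. Here is a minimal sketch that uses only the standard library. The inline sample document and the `db` prefix are assumptions chosen for illustration; they are not taken from data_representation.xml.

```python
import xml.etree.ElementTree as etree

# A small DocBook-like document, used only for illustration.
SAMPLE = """\
<article xmlns="http://docbook.org/ns/docbook">
  <section xml:id="_first"><title>First</title></section>
  <section xml:id="_second"><title>Second</title></section>
</article>
"""

DOCBOOK_NS = "http://docbook.org/ns/docbook"
XML_NS = "http://www.w3.org/XML/1998/namespace"

root = etree.fromstring(SAMPLE)
print("root tag:", root.tag)

# Map a prefix of our own choosing to the document's default namespace.
nsmap = {"db": DOCBOOK_NS}

# Find every section element at any depth below the root.
sections = root.findall(".//db:section", nsmap)

# Attribute names are also stored in {namespace}name form.
ids = [node.get(f"{{{XML_NS}}}id") for node in sections]
titles = [node.findtext("db:title", namespaces=nsmap) for node in sections]
print(ids, titles)
```

ElementTree supports only a subset of XPath, which is one reason to reach for Lxml when searches get more complex.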

Lxml — Lxml attempts to implement the ElementTree API. It also has extensions to that API (each element object in the element tree has added methods) and additional capabilities, for example, xpath searches and XSLT (eXtensible Stylesheet Language Transforms). See https://lxml.de/index.html. Here is a sample code snippet:

[ins] In [23]: from lxml import etree
[ins] In [24]: doc = etree.parse('data_representation.xml')
[ins] In [25]: root = doc.getroot()
[ins] In [26]: tag = root.tag
[ins] In [27]: print('root tag:', tag)
root tag: {http://docbook.org/ns/docbook}article
[ins] In [28]:
[ins] In [28]: nsmap = {'db': root.nsmap[None]}
[ins] In [29]: nodes = root.xpath('//db:section', namespaces=nsmap)
[ins] In [30]: nodes
Out[30]:
[<Element {http://docbook.org/ns/docbook}section at 0x7f0864116230>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f085f259f00>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f085f1b5b90>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f085f1b5320>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f085f1b5140>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f085f1b5af0>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f085f1b58c0>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f085f1b5280>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f085f1b5a50>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f085f30ac30>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f08640b4a00>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f08640b4870>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f08640b4b90>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f08640b41e0>,
 <Element {http://docbook.org/ns/docbook}section at 0x7f08640b44b0>]
[ins] In [31]: node1 = nodes[0]
[ins] In [32]: node1.attrib
Out[32]: {'{http://www.w3.org/XML/1998/namespace}id': '_asciidoc_asciidoctor'}

5.2. Elixir

Consider using SweetXml. It’s built on top of the Xmerl XML implementation in Erlang.

I’ve written several Blog posts on processing XML with Elixir that you might find helpful:

6. Yaml

6.1. Python

Information is here:

https://github.com/yaml/pyyaml
https://pyyaml.org/wiki/PyYAMLDocumentation
See this about the deprecation of the yaml.load function: https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation

Here is an example that loads Python data objects from a Yaml file:

[ins] In [18]: import pprint
[ins] In [19]: import yaml
[ins] In [20]: with open('Data/data06.yaml', 'r') as infile:
          ...:     data = yaml.full_load(infile)
          ...:
[ins] In [21]: pprint.pprint(data)
['test 1',
 {'name1': 'value1', 'name2': 'value2', 'name3': 'value3'},
 {'name4': 'value4', 'name5': 'value5', 'name6': 'value6'},
 [11, 22, 33],
 [44, 55, {'name7': 'value7', 'name8': 'value8'}, 66]]

And, here is an example that writes (dumps) Python data structures to a Yaml file:

[ins] In [25]: with open("junk01.yaml", 'w') as outfile:
          ...:     yaml.dump(data, outfile)
          ...:

6.2. Elixir

I’ve found several Yaml implementations for Elixir.

6.2.1. `fast_yaml`

Learn about it here: https://github.com/processone/fast_yaml

Add the following to your mix.exs:

defp deps do
  [
    {:fast_yaml, "~> 1.0.24"},
  ]
end

These examples decode/parse data from (1) a file and (2) a string:

iex> {:ok, data1} = :fast_yaml.decode_from_file("Data/data06.yaml")
{:ok,
 [
   [
     "test 1",
     [{"name1", "value1"}, {"name2", "value2"}, {"name3", "value3"}],
     [{"name4", "value4"}, {"name5", "value5"}, {"name6", "value6"}],
     [11, 22, 33],
     [44, 55, [{"name7", "value7"}, {"name8", "value8"}], 66]
   ]
 ]}
iex> {:ok, data2} = :fast_yaml.decode(content)
{:ok,
 [
   [
     "test 1",
     [{"name1", "value1"}, {"name2", "value2"}, {"name3", "value3"}],
     [{"name4", "value4"}, {"name5", "value5"}, {"name6", "value6"}],
     [11, 22, 33],
     [44, 55, [{"name7", "value7"}, {"name8", "value8"}], 66]
   ]
 ]}

Encoding an Elixir data structure to a string — Since fast_yaml is an Erlang library, it produces char lists. But, you can convert those to Elixir strings. Example:

iex> data3encoded = :fast_yaml.encode(data3)
iex> data3encoded_as_strings = IO.chardata_to_string(data3encoded)

To write the encoded string to a file, try something like this:

iex> data3encoded = :fast_yaml.encode(data3)
iex> File.write!("junk.yaml", data3encoded)

6.2.2. `yaml-elixir`

Learn about it here: https://github.com/KamilLelonek/yaml-elixir

Add the following to your mix.exs:

defp deps do
  [
    {:yaml_elixir, "~> 2.4.0"},
  ]
end

Parsing Yaml data from a file — Use either of the following:

iex> data = YamlElixir.read_from_file!("Data/data06.yaml")
[
  "test 1",
  %{"name1" => "value1", "name2" => "value2", "name3" => "value3"},
  %{"name4" => "value4", "name5" => "value5", "name6" => "value6"},
  [11, 22, 33],
  [44, 55, %{"name7" => "value7", "name8" => "value8"}, 66]
]
iex> {:ok, data} = YamlElixir.read_from_file("Data/data06.yaml")
{:ok,
 [
   "test 1",
   %{"name1" => "value1", "name2" => "value2", "name3" => "value3"},
   %{"name4" => "value4", "name5" => "value5", "name6" => "value6"},
   [11, 22, 33],
   [44, 55, %{"name7" => "value7", "name8" => "value8"}, 66]
]}

You can also read and parse data from a string. An example:

iex> IO.puts(content)
- "test 1"
-
  "name1": "value1"
  "name2": "value2"
  "name3": "value3"
-
  "name4": "value4"
  "name5": "value5"
  "name6": "value6"
-
  - 11
  - 22
  - 33
-
  - 44
  - 55
  -
    "name7": "value7"
    "name8": "value8"
  - 66
:ok
iex> data = YamlElixir.read_from_string!(content)
[
  "test 1",
  %{"name1" => "value1", "name2" => "value2", "name3" => "value3"},
  %{"name4" => "value4", "name5" => "value5", "name6" => "value6"},
  [11, 22, 33],
  [44, 55, %{"name7" => "value7", "name8" => "value8"}, 66]
]

Apparently, yaml-elixir does not provide the ability to convert an Elixir data structure to a string.

7. JSON

7.1. Python

JSON support is in the Python standard library. See json — JSON encoder and decoder.

Here is an example of its use:

[ins] In [6]: import json
[ins] In [7]: mymap = {"name": "lemon", "color": "yellow", "sizes": [12, 14, 16]}
[ins] In [8]: data = json.dumps(mymap)
[ins] In [9]: data
Out[9]: '{"name": "lemon", "color": "yellow", "sizes": [12, 14, 16]}'
[ins] In [10]: json.loads(data)
Out[10]: {'name': 'lemon', 'color': 'yellow', 'sizes': [12, 14, 16]}
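The same module can also read and write files directly. This minimal sketch round-trips the map above through a temporary file; the file name fruit.json is an arbitrary choice for illustration.

```python
import json
import os
import tempfile

mymap = {"name": "lemon", "color": "yellow", "sizes": [12, 14, 16]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "fruit.json")
    # json.dump writes to an open file; json.dumps returns a string.
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(mymap, outfile, indent=2, sort_keys=True)
    # json.load reads from an open file; json.loads parses a string.
    with open(path, "r", encoding="utf-8") as infile:
        restored = json.load(infile)

# indent and sort_keys give stable, human-readable output.
pretty = json.dumps(mymap, indent=2, sort_keys=True)
print(pretty)
```

Sorting the keys makes the output deterministic, which helps when the generated files are kept under version control.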

7.2. Elixir

There are several Elixir modules for JSON.

7.2.1. `Jason`

You can add it to your Mix dependencies in mix.exs. For example:

defp deps do
  [
    {:jason, ">0.0.0"}
  ]
end

Here is an example of its use:

iex> mymap = %{name: "lemon", color: "yellow", sizes: [12, 14, 16]}
%{color: "yellow", name: "lemon", sizes: [12, 14, 16]}
iex> {:ok, data} = Jason.encode(mymap)
{:ok, "{\"color\":\"yellow\",\"name\":\"lemon\",\"sizes\":[12,14,16]}"}
iex> Jason.decode(data)
{:ok, %{"color" => "yellow", "name" => "lemon", "sizes" => [12, 14, 16]}}

7.2.2. `JSON`

Information is here:

Add the following to your mix.exs:

defp deps do
[
  {:json, "~> 1.3"},
]
end

Here is an example of its use:

iex> mymap = %{name: "lemon", color: "yellow", sizes: [12, 14, 16]}
%{color: "yellow", name: "lemon", sizes: [12, 14, 16]}
iex> {:ok, data} = JSON.encode(mymap)
{:ok, "{\"color\":\"yellow\",\"name\":\"lemon\",\"sizes\":[12,14,16]}"}
iex> JSON.decode(data)
{:ok, %{"color" => "yellow", "name" => "lemon", "sizes" => [12, 14, 16]}}

8. CSV

8.1. Python

More information:

https://docs.python.org/3/library/csv.html

CSV support is in the Python standard library.

An example:

[ins] In [1]: import csv
[ins] In [2]: infile = open('tmp10.csv', 'r', newline='')
[ins] In [3]: reader = csv.reader(infile)
[ins] In [4]: rows = list(reader)
[ins] In [5]: rows
Out[5]:
[['Device_Software_Image_Version', '11.12.20'],
 ['Device_Product', 'Catalyst'],
 ['Product_B_version', '1.1.2.1'],
 ...
 ['Device_hw_type', 'virtual appliance']]
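Writing works the same way as reading. The following minimal sketch writes a few rows and reads them back, using an in-memory buffer in place of a file so that it is self-contained. The sample rows are invented for illustration.

```python
import csv
import io

rows = [
    ["name", "amount"],
    ["carrot", "25"],
    ["tomato, large", "35"],  # the embedded comma forces quoting
]

# newline="" lets the csv module control line endings itself.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerows(rows)
text = buffer.getvalue()

# csv.reader yields each row as a list of strings.
restored = list(csv.reader(io.StringIO(text, newline="")))

# csv.DictReader uses the first row as the field names.
dict_rows = list(csv.DictReader(io.StringIO(text, newline="")))
print(restored)
print(dict_rows)
```

Note that every field comes back as a string; converting numeric columns is left to the caller.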

8.2. Elixir

Information is here:

Insert this in your project’s mix.exs file:

defp deps do
[
  {:csv, "~> 2.3.1"},
]
end

And, here are two example functions that use this CSV module. One writes a few lines to a CSV file. The other reads lines from a CSV file, splits each line into fields, and displays them.

defmodule Test24.CSV do

  @csvdata1 [
    ~w(dog cat bird),
    ~w(tomato radish squash),
    ~w(lemon orange tangerine),
    ~w(poppy sunflower phacelia),
  ]

  @doc """
  Write a few lines to a CSV file.

  Options:

  * :force -- If true, overwrite existing file.

  ## Examples

      iex> Test24.CSV.write_csv_file("data01.csv", force: true)
      :ok

      iex> Test24.CSV.write_csv_file("data01.csv")
      :ok

      iex> Test24.CSV.write_csv_file("data01.csv")
      {:error, "file data01.csv exists"}

  """
  @spec write_csv_file(String.t(), Keyword.t()) :: :ok | {:error, String.t()}
  def write_csv_file(out_file_path, opts \\ []) do
    if Keyword.get(opts, :force) != true and File.exists?(out_file_path) do
      {:error, "file #{out_file_path} exists"}
    else
      data1 = @csvdata1
      out_file = File.open!(out_file_path, [:write])
      data1
      |> CSV.encode
      |> Enum.each(fn line ->
        IO.write(out_file, line)
        #IO.puts("line: #{line}")
      end)
      File.close(out_file)
      :ok
    end
  end

  @doc """
  Read CSV lines from file.  Parse into fields.  Display them.

  ## Examples

      iex> Test24.CSV.read_csv_file("data01.csv")
      dog|cat|bird
      tomato|radish|squash
      lemon|orange|tangerine
      poppy|sunflower|phacelia
      :ok

      iex> Test24.CSV.read_csv_file("missing_file.csv")
      {:error, "file missing_file.csv not found"}

  """
  @spec read_csv_file(String.t()) :: :ok | {:error, String.t()}
  def read_csv_file(in_file_path) do
    if not File.exists?(in_file_path) do
      {:error, "file #{in_file_path} not found"}
    else
      in_file_path
      #|> Path.expand(__DIR__)
      |> File.stream!
      |> CSV.decode!
      |> Enum.each(fn line ->
        #IO.puts(line)
        #IO.inspect(line, label: "line")
        IO.puts(Enum.join(line, "|"))
      end)
      :ok
    end
  end

end

9. HDF5

9.1. Python

More information:

https://portal.hdfgroup.org/

Python has several packages that support HDF5 data files. Two of them are h5py and pytables.

9.1.1. `h5py`

More information:

An example:

import h5py

def test():
    # Open read-only; the context manager closes the file on exit.
    with h5py.File('testdata05.hdf5', 'r') as infile:
        print('infile keys:', list(infile.keys()))
        subgroup = infile['subgroup03']
        print('subgroup keys:', list(subgroup.keys()))
        dataset1 = subgroup['dataset3-3']
        print('dataset1:', dataset1)
        # Iterating over a dataset yields one row at a time.
        for row in dataset1:
            print(row)
        # Indexing with an empty tuple reads the whole dataset.
        print('dataset1 contents:\n', dataset1[()])

test()

9.1.2. PyTables

More information:

https://www.pytables.org/

An example:

[to be added]

9.2. Elixir

Erlhdf5 — See https://github.com/RomanShestakov/erlhdf5.

10. Sqlite

10.1. Python

For relational tables stored in a file, Python has the sqlite3 module. It’s in the standard library. See https://docs.python.org/3/library/sqlite3.html.

sqlite3 implements the Python DB-API 2.0 specification. See https://www.python.org/dev/peps/pep-0249/.

Here is a small amount of sample code:

[ins] In [10]: import sqlite3
[ins] In [11]: con = sqlite3.connect('tmp01.db')
[ins] In [12]:
[ins] In [12]: cursor = con.execute('select * from samples')
[ins] In [13]: for row in cursor:
          ...:     print(row)
          ...:
('carrot', 25)
('tomato', 35)
('radish', 15)
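The sample above assumes that tmp01.db already contains a samples table. Here is a minimal sketch that creates such a table in an in-memory database, fills it, and queries it. The schema (name text, amount integer) is an assumption inferred from the rows shown above.

```python
import sqlite3

# ":memory:" creates a throwaway database that needs no file.
con = sqlite3.connect(":memory:")
con.execute("create table samples (name text, amount integer)")

# Use ? placeholders instead of string formatting, so that values
# are passed to Sqlite safely.
con.executemany(
    "insert into samples values (?, ?)",
    [("carrot", 25), ("tomato", 35), ("radish", 15)],
)
con.commit()

cursor = con.execute(
    "select name, amount from samples where amount >= ? order by amount",
    (20,),
)
rows = cursor.fetchall()
con.close()

for row in rows:
    print(row)
```

Placeholders apply to values only; table and column names cannot be passed this way.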

10.2. Elixir

Elixir has a module giving support for Sqlite.

There is documentation here:

Put this in your mix.exs file:

defp deps do
  [
    {:sqlite, "1.1.0"},
  ]
end

Examples:

Here is a sample function that uses Sqlite to show the rows in a table in an Sqlite file/database:

@doc """
Open connection to DB file and show rows in table.

## Examples

    iex> Test24.show_rows("test01.db", "samples")
    Rows from Sqlite DB test01.db:
    ---------------------------------------------
    name: carrot  amount: 25
    name: tomato  amount: 35
    name: radish  amount: 15
    :ok

    iex> Test24.show_rows("test01.db", "samples", "size")
    Rows from Sqlite DB test01.db:
    ---------------------------------------------
    name: radish  amount: 15
    name: pepper  amount: 20
    name: carrot  amount: 25
    name: tomato  amount: 35
    name: arugula  amount: 40
    :ok

"""
@spec show_rows(String.t(), String.t(), String.t()) :: :ok
def show_rows(db_file_name, table_name, order \\ "") do
  order_by = if order == "" do
    ""
  else
    " order by #{order}"
  end
  {:ok, connection} = Sqlite.open(db_file_name)
  IO.puts("Rows from Sqlite DB #{db_file_name}:")
  IO.puts("---------------------------------------------")
  query = "select * from #{table_name}#{order_by}"
  Sqlite.q(query, connection)
  |> Enum.each(fn {n, amt} ->
    IO.puts("name: #{n}  amount: #{amt}")
  end)
  Sqlite.close(connection)
  :ok
end

Here is a function that can add a row to a table in an Sqlite database:

@doc """
Add a row to a table in a database file.

## Examples

    iex> Test24.add_row("test01.db", "samples", "\"parsley\", \"60\"")
    :ok

"""
@spec add_row(String.t(), String.t(), String.t()) :: :ok
def add_row(db_file_name, table_name, columns) do
  {:ok, connection} = Sqlite.open(db_file_name)
  query = "insert into #{table_name} values (#{columns})"
  Sqlite.q(query, connection)
  #|> IO.inspect(label: "result")
  Sqlite.close(connection)
  :ok
end

Appendix A: Copyright and License

Appendix B: Document source

This document is written in the Asciidoc lightweight markup language format. It has been processed with Asciidoctor. The document source is here: source.
