= Data representations and data formats -- usage and using
:toc: left
:toc-title: Contents
:toclevels: 2
:sectnums:
//:stylesdir: css/
//:stylesheet: adoc-foundation-potion.css
//:stylesheet: adoc-readthedocs.css
//:stylesheet: asciidoctor.css
:stylesheet: dave01.css

This document provides help and links about various data representation formats.

There are roughly two kinds of data representations that we will consider:

(1) Those used to write documentation and to format readable content, for example, Asciidoc/Asciidoctor, reST/Docutils, and Markdown.

(2) Those used to encode data, usually intended for machine processing, for example, XML, Yaml, JSON, and CSV.

We'll discuss each of these below.

We'll also discuss several binary storage media, in particular, HDF5 and Sqlite3.

== Asciidoc / Asciidoctor

From the Asciidoctor Web site -- "Asciidoctor is a fast, open source text processor and publishing toolchain for converting AsciiDoc content to HTML5, DocBook, PDF, and other formats."

More information: https://asciidoctor.org/

Installing Asciidoctor -- There are instructions here https://asciidoctor.org/docs/install-toolchain/. Or, do the following:

1. First, make sure that you have a reasonably up-to-date version of Ruby installed on your machine. E.g. do: `$ ruby --version`.
2. Then install with `gem`:
+
----
$ gem install asciidoctor
----
+
Or, on Linux, install with your favorite package manager. For example:
+
----
$ apt-get install asciidoctor
# or, alternatively ...
$ aptitude install asciidoctor
----

Then, read the instructions here https://asciidoctor.org/docs/#get-started-with-asciidoctor.

Interesting features provided by Asciidoctor:

- There are converters that produce each of PDF, EPUB3, and LaTeX. See https://asciidoctor.org/docs/#supplemental-converters[Supplemental Converters].
- There is a built-in backend that produces Docbook. Use `$ asciidoctor -b docbook5 mydoc.txt`.


Good to know:

- Built-in attributes help you customize the look of the document you generate. See https://asciidoctor.org/docs/user-manual/#builtin-attributes[Built-in Attributes].
- Some built-in attributes can be set either (1) in the document header or (2) from the command line. See the link above to determine which built-in attributes can only be set in the document header. To set attributes on the command line, use the `-a` command line option, which can be repeated. These examples customize the TOC (table of contents):
+
----
$ asciidoctor -a toc mydoc.txt
$ asciidoctor -a toc=left -a toclevels=4 -a sectnums mydoc.txt
$ asciidoctor -a toc=left -a toclevels=4 -a sectnums -a toc-title="The TOC" mydoc.txt
----
- For help with displaying and customizing a table of contents, see https://asciidoctor.org/docs/user-manual/#user-toc[Table of Contents].
- For help with writing Asciidoctor content, see the https://asciidoctor.org/docs/asciidoc-syntax-quick-reference/[Quick Reference], the https://asciidoctor.org/docs/asciidoc-writers-guide/[Writer's Guide], and the other guides and manuals at https://asciidoctor.org/docs/[Asciidoctor Docs].

=== Transforming `asciidoc` to `reST`

We can convert an Asciidoc/Asciidoctor document to `reST` (reStructuredText) with help from `pandoc`. We use `asciidoctor` to produce Docbook content, then use `pandoc` to convert that Docbook content to `reST`.

Notes:

- On Linux, `pandoc` can be installed with the `aptitude` or the `apt-get` package manager.
- `pandoc` is also available in the Anaconda distribution of Python.

The following shell scripts show how to convert an Asciidoctor document to `reST`:

----------
#!/bin/sh
# $1 is the Asciidoctor input file; $2 is the reST output file.
asciidoctor -b docbook5 -o - "$1" | pandoc -f docbook -t rst -o "$2"
----------

Or, to write to `stdout`:

----------
#!/bin/sh
# $1 is the Asciidoctor input file.
asciidoctor -b docbook5 -o - "$1" | pandoc -f docbook -t rst
----------

== reStructuredText -- Docutils

Information is here: https://docutils.sourceforge.io/.

`reStructuredText` or `reST` is, like Asciidoc, a lightweight markup language.

`reStructuredText` and Docutils are analogous to Asciidoc and Asciidoctor. The two source text formats have significant similarities and differences, and the tool chains are two separate implementations. You can compare the source text formats here: https://docutils.sourceforge.io/docs/user/rst/quickstart.html[reStructuredText] and https://asciidoctor.org/docs/asciidoc-syntax-quick-reference/[Asciidoc].

Installation:

- It's available at the Python Package Index: https://pypi.org/project/docutils/#description. Install it with `pip`: `$ pip install docutils`.
- Or, install from source -- Get the source here: https://sourceforge.net/p/docutils/code/HEAD/tree/trunk/docutils/[Docutils source code]. Then, follow the instructions in the `README` file.

There is plenty of documentation to help you get started writing reStructuredText and using Docutils at the https://docutils.sourceforge.io/docs/index.html[Documentation Overview].

== Markdown

For information, see:

- https://guides.github.com/features/mastering-markdown/
- http://daringfireball.net/projects/markdown/

There are various implementations. See https://www.w3.org/community/markdown/wiki/MarkdownImplementations.

== HTML

You can also write documentation directly in HTML. Of course other tools, several of which are discussed in this document, can make this task less excruciating. Another strategy is to use a WYSIWYG editor that enables you to generate HTML, e.g. LibreOffice Writer.

== XML

=== Python

Python has several tools for processing XML, for example:

- `ElementTree` -- `ElementTree` is in the standard Python Library. See https://docs.python.org/3/library/xml.etree.elementtree.html.
Here is a sample code snippet:
+
----
[ins] In [12]: import xml.etree.ElementTree as etree
[ins] In [13]: doc = etree.parse('data_representation.xml')
[ins] In [14]: rootnode = doc.getroot()
[ins] In [15]: print('root tag:', rootnode.tag)
root tag: {http://docbook.org/ns/docbook}article
----

- `Lxml` -- `Lxml` attempts to implement the `ElementTree` API. It also has extensions to that API (each element object in the element tree has added methods) and additional capabilities, for example, `xpath` searches and XSLT (eXtensible Stylesheet Language Transforms). See https://lxml.de/index.html. Here is a sample code snippet:
+
----
[ins] In [23]: from lxml import etree
[ins] In [24]: doc = etree.parse('data_representation.xml')
[ins] In [25]: root = doc.getroot()
[ins] In [26]: tag = root.tag
[ins] In [27]: print('root tag:', tag)
root tag: {http://docbook.org/ns/docbook}article

[ins] In [28]: nsmap = {'db': root.nsmap[None]}
[ins] In [29]: nodes = root.xpath('//db:section', namespaces=nsmap)
[ins] In [30]: nodes
Out[30]: [, , , , , , , , , , , , , , ]
[ins] In [31]: node1 = nodes[0]
[ins] In [32]: node1.attrib
Out[32]: {'{http://www.w3.org/XML/1998/namespace}id': '_asciidoc_asciidoctor'}
----

=== Elixir

Consider using https://hexdocs.pm/sweet_xml/SweetXml.html[SweetXml]. It's built on top of the `Xmerl` XML implementation in Erlang.

I've written several Blog posts on processing XML with Elixir that you might find helpful:

- http://davekuhlman.org/elixir-sweetxml-records.html[XML, SweetXml, Xmerl, and Erlang records from Elixir].
- http://davekuhlman.org/xml-elixir-structs.html[Elixir Structs for processing XML]
- http://davekuhlman.org/xml-elixir-xmerl-functions.html[Functions that provide access to the fields in Xmerl XML records]

== Yaml

=== Python

Information is here:

- https://github.com/yaml/pyyaml
- https://pyyaml.org/wiki/PyYAMLDocumentation
- See this about the deprecation of the `yaml.load` function: https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation

Here is an example that loads Python data objects from a Yaml file:

----
[ins] In [18]: import pprint
[ins] In [19]: import yaml
[ins] In [20]: with open('Data/data06.yaml', 'r') as infile:
          ...:     data = yaml.full_load(infile)
          ...:
[ins] In [21]: pprint.pprint(data)
['test 1',
 {'name1': 'value1', 'name2': 'value2', 'name3': 'value3'},
 {'name4': 'value4', 'name5': 'value5', 'name6': 'value6'},
 [11, 22, 33],
 [44, 55, {'name7': 'value7', 'name8': 'value8'}, 66]]
----

And, here is an example that writes (dumps) Python data structures to a Yaml file:

----
[ins] In [25]: with open("junk01.yaml", 'w') as outfile:
          ...:     yaml.dump(data, outfile)
          ...:
----

=== Elixir

I've found several Yaml implementations for Elixir.


==== `fast_yaml`

Learn about it here: https://github.com/processone/fast_yaml

Add the following to your `mix.exs`:

----
defp deps do
  [
    {:fast_yaml, "~> 1.0.24"}
  ]
end
----

These examples decode/parse data from (1) a file and (2) a string:

----
iex> {:ok, data1} = :fast_yaml.decode_from_file("Data/data06.yaml")
{:ok,
 [
   [
     "test 1",
     [{"name1", "value1"}, {"name2", "value2"}, {"name3", "value3"}],
     [{"name4", "value4"}, {"name5", "value5"}, {"name6", "value6"}],
     [11, 22, 33],
     [44, 55, [{"name7", "value7"}, {"name8", "value8"}], 66]
   ]
 ]}
iex> {:ok, data2} = :fast_yaml.decode(content)
{:ok,
 [
   [
     "test 1",
     [{"name1", "value1"}, {"name2", "value2"}, {"name3", "value3"}],
     [{"name4", "value4"}, {"name5", "value5"}, {"name6", "value6"}],
     [11, 22, 33],
     [44, 55, [{"name7", "value7"}, {"name8", "value8"}], 66]
   ]
 ]}
----

Encoding an Elixir data structure to a string -- Since `fast_yaml` is an Erlang library, it produces char lists. But, you can convert those to Elixir strings. Example:

----
iex> data3encoded = :fast_yaml.encode(data3)
iex> data3encoded_as_strings = IO.chardata_to_string(data3encoded)
----

To write the encoded string to a file, try something like this:

----
iex> data3encoded = :fast_yaml.encode(data3)
iex> File.write!("junk.yaml", data3encoded)
----

==== `yaml-elixir`

Learn about it here: https://github.com/KamilLelonek/yaml-elixir

Add the following to your `mix.exs`:

----
defp deps do
  [
    {:yaml_elixir, "~> 2.4.0"}
  ]
end
----

Parsing Yaml data from a file -- Use either of the following:

----
iex> data = YamlElixir.read_from_file!("Data/data06.yaml")
[
  "test 1",
  %{"name1" => "value1", "name2" => "value2", "name3" => "value3"},
  %{"name4" => "value4", "name5" => "value5", "name6" => "value6"},
  [11, 22, 33],
  [44, 55, %{"name7" => "value7", "name8" => "value8"}, 66]
]
iex> {:ok, data} = YamlElixir.read_from_file("Data/data06.yaml")
{:ok,
 [
   "test 1",
   %{"name1" => "value1", "name2" => "value2", "name3" => "value3"},
   %{"name4" => "value4", "name5" => "value5", "name6" =>
"value6"}, [11, 22, 33], [44, 55, %{"name7" => "value7", "name8" => "value8"}, 66] ]} ---- You can also read and parse data from a string. An example: ---- iex> IO.puts(content) - "test 1" - "name1": "value1" "name2": "value2" "name3": "value3" - "name4": "value4" "name5": "value5" "name6": "value6" - - 11 - 22 - 33 - - 44 - 55 - "name7": "value7" "name8": "value8" - 66 :ok iex> data = YamlElixir.read_from_string!(content) [ "test 1", %{"name1" => "value1", "name2" => "value2", "name3" => "value3"}, %{"name4" => "value4", "name5" => "value5", "name6" => "value6"}, [11, 22, 33], [44, 55, %{"name7" => "value7", "name8" => "value8"}, 66] ] ---- Apparently, `yaml-elixir` does not provide the ability to convert an Elixir data structure to a string. == JSON === Python JSON support is in the Python standard library. See https://docs.python.org/3/library/json.html[json — JSON encoder and decoder]. Here is an example of its use: ---- [ins] In [6]: import json [ins] In [7]: mymap = {"name": "lemon", "color": "yellow", "sizes": [12, 14, 16]} [ins] In [8]: data = json.dumps(mymap) [ins] In [9]: data Out[9]: '{"name": "lemon", "color": "yellow", "sizes": [12, 14, 16]}' [ins] In [10]: json.loads(data) Out[10]: {'name': 'lemon', 'color': 'yellow', 'sizes': [12, 14, 16]} ---- === Elixir There are several Elixir modules for JSON. ==== `Jason` You can add it to your `Mix` dependencies in `mix.exs`. 
For example:

----
defp deps do
  [
    {:jason, ">0.0.0"}
  ]
end
----

Here is an example of its use:

----
iex> mymap = %{name: "lemon", color: "yellow", sizes: [12, 14, 16]}
%{color: "yellow", name: "lemon", sizes: [12, 14, 16]}
iex> {:ok, data} = Jason.encode(mymap)
{:ok, "{\"color\":\"yellow\",\"name\":\"lemon\",\"sizes\":[12,14,16]}"}
iex> Jason.decode(data)
{:ok, %{"color" => "yellow", "name" => "lemon", "sizes" => [12, 14, 16]}}
----

==== `JSON`

Information is here:

- https://hex.pm/packages/json
- https://hexdocs.pm/json/readme.html

Add the following to your `mix.exs`:

----
defp deps do
  [
    {:json, "~> 1.3"}
  ]
end
----

Here is an example of its use:

----
iex> mymap = %{name: "lemon", color: "yellow", sizes: [12, 14, 16]}
%{color: "yellow", name: "lemon", sizes: [12, 14, 16]}
iex> {:ok, data} = JSON.encode(mymap)
{:ok, "{\"color\":\"yellow\",\"name\":\"lemon\",\"sizes\":[12,14,16]}"}
iex> JSON.decode(data)
{:ok, %{"color" => "yellow", "name" => "lemon", "sizes" => [12, 14, 16]}}
----

== CSV

=== Python

More information:

- https://docs.python.org/3/library/csv.html

CSV support is in the Python standard library. An example:

----
[ins] In [1]: import csv
[ins] In [2]: infile = open('tmp10.csv', 'r')
[ins] In [3]: reader = csv.reader(infile)
[ins] In [4]: rows = list(reader)
[ins] In [5]: rows
Out[5]:
[['Device_Software_Image_Version', '11.12.20'],
 ['Device_Product', 'Catalyst'],
 ['Product_B_version', '1.1.2.1'],
 ...
 ['Device_hw_type', 'virtual appliance']]
----

=== Elixir

Information is here:

- https://hexdocs.pm/csv/CSV.html
- https://github.com/beatrichartz/csv

Insert this in your project's `mix.exs` file:

----
defp deps do
  [
    {:csv, "~> 2.3.1"}
  ]
end
----

And, here are two example functions that use this CSV module -- One writes out a few lines to a CSV file. The other reads lines in from a CSV file, splits each line into fields, and displays them.

----
defmodule Test24.CSV do

  @csvdata1 [
    ~w(dog cat bird),
    ~w(tomato radish squash),
    ~w(lemon orange tangerine),
    ~w(poppy sunflower phacelia)
  ]

  @doc """
  Write a few lines to a CSV file.

  Options:

  * :force -- If true, overwrite existing file.

  ## Examples

      iex> Test24.CSV.write_csv_file("data01.csv", force: true)
      :ok
      iex> Test24.CSV.write_csv_file("data01.csv")
      :ok
      iex> Test24.CSV.write_csv_file("data01.csv")
      {:error, "file data01.csv exists"}

  """
  @spec write_csv_file(String.t(), Keyword.t()) :: :ok | {:error, String.t()}
  def write_csv_file(out_file_path, opts \\ []) do
    if Keyword.get(opts, :force) != true and File.exists?(out_file_path) do
      {:error, "file #{out_file_path} exists"}
    else
      data1 = @csvdata1
      out_file = File.open!(out_file_path, [:write])
      data1
      |> CSV.encode
      |> Enum.each(fn line ->
        IO.write(out_file, line)
        #IO.puts("line: #{line}")
      end)
      File.close(out_file)
      :ok
    end
  end

  @doc """
  Read CSV lines from file. Parse into fields. Display them.

  ## Examples

      iex> Test24.CSV.read_csv_file("data01.csv")
      dog|cat|bird
      tomato|radish|squash
      lemon|orange|tangerine
      poppy|sunflower|phacelia
      :ok
      iex> Test24.CSV.read_csv_file("missing_file.csv")
      {:error, "file missing_file.csv not found"}

  """
  @spec read_csv_file(String.t()) :: :ok | {:error, String.t()}
  def read_csv_file(in_file_path) do
    if not File.exists?(in_file_path) do
      {:error, "file #{in_file_path} not found"}
    else
      in_file_path
      #|> Path.expand(__DIR__)
      |> File.stream!
      |> CSV.decode!
      |> Enum.each(fn line ->
        #IO.puts(line)
        #IO.inspect(line, label: "line")
        IO.puts(Enum.join(line, "|"))
      end)
      :ok
    end
  end

end
----

== HDF5

=== Python

More information:

- https://portal.hdfgroup.org/

Python has several packages that support HDF5 data files. Two of them are `h5py` and `pytables`.

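To experiment with the reading example in the `h5py` section, you first need an HDF5 file. The following is a minimal sketch, assuming `h5py` is installed (for example with `pip install h5py`). It writes a small file and reads it back. The file, group, and dataset names are taken from that reading example; the data values and the function names `make_sample_file` and `read_sample_file` are made up for illustration.

----
import os
import tempfile

import h5py


def make_sample_file(path):
    """Create a small HDF5 file: one group holding one 2-D dataset."""
    # Mode "w" creates the file, truncating any existing file at that path.
    with h5py.File(path, "w") as outfile:
        subgroup = outfile.create_group("subgroup03")
        # The values below are invented sample data (an assumption).
        subgroup.create_dataset("dataset3-3", data=[[1, 2, 3], [4, 5, 6]])


def read_sample_file(path):
    """Return the contents of the dataset as nested Python lists."""
    with h5py.File(path, "r") as infile:
        dataset = infile["subgroup03/dataset3-3"]
        # Indexing with [()] reads the whole dataset into a NumPy array.
        return dataset[()].tolist()


if __name__ == "__main__":
    sample_path = os.path.join(tempfile.mkdtemp(), "testdata05.hdf5")
    make_sample_file(sample_path)
    print(read_sample_file(sample_path))
----

Writing to a temporary directory keeps the sketch from overwriting an existing `testdata05.hdf5`. Pass a path of your own to `make_sample_file` if you want to keep the file.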

==== `h5py`

More information:

- https://www.h5py.org/
- http://docs.h5py.org/en/stable/index.html

An example:

----
import h5py

def test():
    # Open the file read-only; the "with" block closes it on exit.
    with h5py.File('testdata05.hdf5', 'r') as infile:
        print('infile keys:', infile.keys())
        subgroup = infile['subgroup03']
        print('subgroup keys:', subgroup.keys())
        dataset1 = subgroup['dataset3-3']
        print('dataset1:', dataset1)
        # Iterate over the dataset one row at a time.
        for row in dataset1:
            print(row)
        # Indexing with [()] reads the entire dataset at once.
        print('dataset1 contents:\n', dataset1[()])

test()
----

==== PyTables

More information:

- https://www.pytables.org/

An example:

----
[to be added]
----

=== Elixir

`Erlhdf5` -- See https://github.com/RomanShestakov/erlhdf5.

== Sqlite

=== Python

For relational tables stored in a file, Python has the `sqlite3` module. It's in the standard library. See https://docs.python.org/3/library/sqlite3.html.

`sqlite3` implements the Python DB-API 2.0 specification. See https://www.python.org/dev/peps/pep-0249/.

Here is a small amount of sample code:

----
[ins] In [10]: import sqlite3
[ins] In [11]: con = sqlite3.connect('tmp01.db')
[ins] In [12]: cursor = con.execute('select * from samples')
[ins] In [13]: for row in cursor:
          ...:     print(row)
          ...:
('carrot', 25)
('tomato', 35)
('radish', 15)
----

=== Elixir

Elixir has a module giving support for Sqlite. There is documentation here:

- https://hex.pm/packages/sqlite
- https://hexdocs.pm/sqlite/api-reference.html

Put this in your `mix.exs` file:

----
defp deps do
  [
    {:sqlite, "1.1.0"}
  ]
end
----

Examples:

1. Here is a sample function that uses Sqlite to show the rows in a table in an Sqlite file/database:
+
----
@doc """
Open connection to DB file and show rows in table.

## Examples

    iex> Test24.show_rows("test01.db", "samples")
    Rows from Sqlite DB test01.db:
    ---------------------------------------------
    name: carrot amount: 25
    name: tomato amount: 35
    name: radish amount: 15
    :ok
    iex> Test24.show_rows("test01.db", "samples", "size")
    Rows from Sqlite DB test01.db:
    ---------------------------------------------
    name: radish amount: 15
    name: pepper amount: 20
    name: carrot amount: 25
    name: tomato amount: 35
    name: arugula amount: 40
    :ok

"""
@spec show_rows(String.t(), String.t(), String.t()) :: :ok
def show_rows(db_file_name, table_name, order \\ "") do
  order_by = if order == "" do
    ""
  else
    " order by #{order}"
  end
  {:ok, connection} = Sqlite.open(db_file_name)
  IO.puts("Rows from Sqlite DB #{db_file_name}:")
  IO.puts("---------------------------------------------")
  query = "select * from #{table_name}#{order_by}"
  Sqlite.q(query, connection)
  |> Enum.each(fn {n, amt} -> IO.puts("name: #{n} amount: #{amt}") end)
  Sqlite.close(connection)
  :ok
end
----

2. Here is a function that can add a row to a table in an Sqlite database:
+
----
@doc """
Add a row to a table in a database file.

## Examples

    iex> Test24.add_row("test01.db", "samples", "\"parsley\", \"60\"")
    :ok

"""
@spec add_row(String.t(), String.t(), String.t()) :: :ok
def add_row(db_file_name, table_name, columns) do
  {:ok, connection} = Sqlite.open(db_file_name)
  query = "insert into #{table_name} values (#{columns})"
  Sqlite.q(query, connection)
  #|> IO.inspect(label: "result")
  Sqlite.close(connection)
  :ok
end
----

[appendix]
== Copyright and License

Copyright © 2020 Dave Kuhlman. Free use of this documentation is granted under the terms of the MIT License. See https://opensource.org/licenses/MIT.

[appendix]
== Document source

This document is written in the Asciidoc light-weight markup language format. It has been processed with Asciidoctor. The document source is here: link:data_representation.txt[source].

// vim:ft=asciidoc:textwidth=0: