How can I parse a YAML file from a Linux shell script?

My use case may or may not be quite the same as what this original post was asking, but it's definitely similar.

I need to pull in some YAML as bash variables. The YAML will never be more than one level deep.

YAML looks like so:

KEY:                value
ANOTHER_KEY:        another_value
OH_MY_SO_MANY_KEYS: yet_another_value
LAST_KEY:           last_value

Output like-a dis:

KEY="value"
ANOTHER_KEY="another_value"
OH_MY_SO_MANY_KEYS="yet_another_value"
LAST_KEY="last_value"

I achieved the output with this line:

sed -e 's/:[^:\/\/]/="/g;s/$/"/g;s/ *=/=/g' file.yaml > file.sh
  • s/:[^:\/\/]/="/g finds : and replaces it with =", while ignoring :// (for URLs)
  • s/$/"/g appends " to the end of each line
  • s/ *=/=/g removes all spaces before =
Stefan Farestam

Here is a bash-only parser that leverages sed and awk to parse simple yaml files:

function parse_yaml {
   local prefix=$2
   local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
   sed -ne "s|^\($s\):|\1|" \
        -e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
        -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p"  $1 |
   awk -F$fs '{
      indent = length($1)/2;
      vname[indent] = $2;
      for (i in vname) {if (i > indent) {delete vname[i]}}
      if (length($3) > 0) {
         vn=""; for (i=0; i<indent; i++) {vn=(vn)(vname[i])("_")}
         printf("%s%s%s=\"%s\"\n", "'$prefix'",vn, $2, $3);
      }
   }'
}

It understands files such as:

## global definitions
global:
  debug: yes
  verbose: no
  debugging:
    detailed: no
    header: "debugging started"

## output
output:
   file: "yes"

Which, when parsed using:

parse_yaml sample.yml

will output:

global_debug="yes"
global_verbose="no"
global_debugging_detailed="no"
global_debugging_header="debugging started"
output_file="yes"

it also understands yaml files, generated by ruby which may include ruby symbols, like:

---
:global:
  :debug: 'yes'
  :verbose: 'no'
  :debugging:
    :detailed: 'no'
    :header: debugging started
  :output: 'yes'

and will output the same as in the previous example.

typical use within a script is:

eval $(parse_yaml sample.yml)

parse_yaml accepts a prefix argument so that imported settings all have a common prefix (which will reduce the risk of namespace collisions).

parse_yaml sample.yml "CONF_"

yields:

CONF_global_debug="yes"
CONF_global_verbose="no"
CONF_global_debugging_detailed="no"
CONF_global_debugging_header="debugging started"
CONF_output_file="yes"

Note that previous settings in a file can be referred to by later settings:

## global definitions
global:
  debug: yes
  verbose: no
  debugging:
    detailed: no
    header: "debugging started"

## output
output:
   debug: $global_debug

Another nice usage is to first parse a defaults file and then the user settings, which works since the latter settings overrides the first ones:

eval $(parse_yaml defaults.yml)
eval $(parse_yaml project.yml)

I've written shyaml in python for YAML query needs from the shell command line.

Overview:

$ pip install shyaml      ## installation

Example's YAML file (with complex features):

$ cat <<EOF > test.yaml
name: "MyName !!"
subvalue:
    how-much: 1.1
    things:
        - first
        - second
        - third
    other-things: [a, b, c]
    maintainer: "Valentin Lab"
    description: |
        Multiline description:
        Line 1
        Line 2
EOF

Basic query:

$ cat test.yaml | shyaml get-value subvalue.maintainer
Valentin Lab

More complex looping query on complex values:

$ cat test.yaml | shyaml values-0 | \
  while read -r -d $'\0' value; do
      echo "RECEIVED: '$value'"
  done
RECEIVED: '1.1'
RECEIVED: '- first
- second
- third'
RECEIVED: '2'
RECEIVED: 'Valentin Lab'
RECEIVED: 'Multiline description:
Line 1
Line 2'

A few key points:

  • all YAML types and syntax oddities are correctly handled, as multiline, quoted strings, inline sequences...
  • \0 padded output is available for solid multiline entry manipulation.
  • simple dotted notation to select sub-values (ie: subvalue.maintainer is a valid key).
  • access by index is provided to sequences (ie: subvalue.things.-1 is the last element of the subvalue.things sequence.)
  • access to all sequence/structs elements in one go for use in bash loops.
  • you can output whole subpart of a YAML file as ... YAML, which blend well for further manipulations with shyaml.

More sample and documentation are available on the shyaml github page or the shyaml PyPI page.

Rafael

It's possible to pass a small script to some interpreters, like Python. An easy way to do so using Ruby and its YAML library is the following:

$ RUBY_SCRIPT="data = YAML::load(STDIN.read); puts data['a']; puts data['b']"
$ echo -e '---\na: 1234\nb: 4321' | ruby -ryaml -e "$RUBY_SCRIPT"
1234
4321

, wheredata is a hash (or array) with the values from yaml.

As a bonus, it'll parse Jekyll's front matter just fine.

ruby -ryaml -e "puts YAML::load(open(ARGV.first).read)['tags']" example.md

Given that Python3 and PyYAML are quite easy dependencies to meet nowadays, the following may help:

yaml() {
    python3 -c "import yaml;print(yaml.load(open('$1'))$2)"
}

VALUE=$(yaml ~/my_yaml_file.yaml "['a_key']")

yq is a lightweight and portable command-line YAML processor

The aim of the project is to be the jq or sed of yaml files.

(http://mikefarah.github.io/yq/)

As an example (stolen straight from the documentation), given a sample.yaml file of:

---
bob:
  item1:
    cats: bananas
  item2:
    cats: apples

then

yq r sample.yaml bob.*.cats

will output

- bananas
- apples

Hard to say because it depends on what you want the parser to extract from your YAML document. For simple cases, you might be able to use grep, cut, awk etc. For more complex parsing you would need to use a full-blown parsing library such as Python's PyYAML or YAML::Perl.

starfry

I just wrote a parser that I called Yay! (Yaml ain't Yamlesque!) which parses Yamlesque, a small subset of YAML. So, if you're looking for a 100% compliant YAML parser for Bash then this isn't it. However, to quote the OP, if you want a structured configuration file which is as easy as possible for a non-technical user to edit that is YAML-like, this may be of interest.

It's inspred by the earlier answer but writes associative arrays (yes, it requires Bash 4.x) instead of basic variables. It does so in a way that allows the data to be parsed without prior knowledge of the keys so that data-driven code can be written.

As well as the key/value array elements, each array has a keys array containing a list of key names, a children array containing names of child arrays and a parent key that refers to its parent.

This is an example of Yamlesque:

root_key1: this is value one
root_key2: "this is value two"

drink:
  state: liquid
  coffee:
    best_served: hot
    colour: brown
  orange_juice:
    best_served: cold
    colour: orange

food:
  state: solid
  apple_pie:
    best_served: warm

root_key_3: this is value three

Here is an example showing how to use it:

#!/bin/bash
# An example showing how to use Yay

. /usr/lib/yay

# helper to get array value at key
value() { eval echo \${$1[$2]}; }

# print a data collection
print_collection() {
  for k in $(value $1 keys)
  do
    echo "$2$k = $(value $1 $k)"
  done

  for c in $(value $1 children)
  do
    echo -e "$2$c\n$2{"
    print_collection $c "  $2"
    echo "$2}"
  done
}

yay example
print_collection example

which outputs:

root_key1 = this is value one
root_key2 = this is value two
root_key_3 = this is value three
example_drink
{
  state = liquid
  example_coffee
  {
    best_served = hot
    colour = brown
  }
  example_orange_juice
  {
    best_served = cold
    colour = orange
  }
}
example_food
{
  state = solid
  example_apple_pie
  {
    best_served = warm
  }
}

And here is the parser:

yay_parse() {

   # find input file
   for f in "$1" "$1.yay" "$1.yml"
   do
     [[ -f "$f" ]] && input="$f" && break
   done
   [[ -z "$input" ]] && exit 1

   # use given dataset prefix or imply from file name
   [[ -n "$2" ]] && local prefix="$2" || {
     local prefix=$(basename "$input"); prefix=${prefix%.*}
   }

   echo "declare -g -A $prefix;"

   local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
   sed -n -e "s|^\($s\)\($w\)$s:$s\"\(.*\)\"$s\$|\1$fs\2$fs\3|p" \
          -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" "$input" |
   awk -F$fs '{
      indent       = length($1)/2;
      key          = $2;
      value        = $3;

      # No prefix or parent for the top level (indent zero)
      root_prefix  = "'$prefix'_";
      if (indent ==0 ) {
        prefix = "";          parent_key = "'$prefix'";
      } else {
        prefix = root_prefix; parent_key = keys[indent-1];
      }

      keys[indent] = key;

      # remove keys left behind if prior row was indented more than this row
      for (i in keys) {if (i > indent) {delete keys[i]}}

      if (length(value) > 0) {
         # value
         printf("%s%s[%s]=\"%s\";\n", prefix, parent_key , key, value);
         printf("%s%s[keys]+=\" %s\";\n", prefix, parent_key , key);
      } else {
         # collection
         printf("%s%s[children]+=\" %s%s\";\n", prefix, parent_key , root_prefix, key);
         printf("declare -g -A %s%s;\n", root_prefix, key);
         printf("%s%s[parent]=\"%s%s\";\n", root_prefix, key, prefix, parent_key);
      }
   }'
}

# helper to load yay data file
yay() { eval $(yay_parse "[email protected]"); }

There is some documentation in the linked source file and below is a short explanation of what the code does.

The yay_parse function first locates the input file or exits with an exit status of 1. Next, it determines the dataset prefix, either explicitly specified or derived from the file name.

It writes valid bash commands to its standard output that, if executed, define arrays representing the contents of the input data file. The first of these defines the top-level array:

echo "declare -g -A $prefix;"

Note that array declarations are associative (-A) which is a feature of Bash version 4. Declarations are also global (-g) so they can be executed in a function but be available to the global scope like the yay helper:

yay() { eval $(yay_parse "[email protected]"); }

The input data is initially processed with sed. It drops lines that don't match the Yamlesque format specification before delimiting the valid Yamlesque fields with an ASCII File Separator character and removing any double-quotes surrounding the value field.

 local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
 sed -n -e "s|^\($s\)\($w\)$s:$s\"\(.*\)\"$s\$|\1$fs\2$fs\3|p" \
        -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" "$input" |

The two expressions are similar; they differ only because the first one picks out quoted values where as the second one picks out unquoted ones.

The File Separator (28/hex 12/octal 034) is used because, as a non-printable character, it is unlikely to be in the input data.

The result is piped into awk which processes its input one line at a time. It uses the FS character to assign each field to a variable:

indent       = length($1)/2;
key          = $2;
value        = $3;

All lines have an indent (possibly zero) and a key but they don't all have a value. It computes an indent level for the line dividing the length of the first field, which contains the leading whitespace, by two. The top level items without any indent are at indent level zero.

Next, it works out what prefix to use for the current item. This is what gets added to a key name to make an array name. There's a root_prefix for the top-level array which is defined as the data set name and an underscore:

root_prefix  = "'$prefix'_";
if (indent ==0 ) {
  prefix = "";          parent_key = "'$prefix'";
} else {
  prefix = root_prefix; parent_key = keys[indent-1];
}

The parent_key is the key at the indent level above the current line's indent level and represents the collection that the current line is part of. The collection's key/value pairs will be stored in an array with its name defined as the concatenation of the prefix and parent_key.

For the top level (indent level zero) the data set prefix is used as the parent key so it has no prefix (it's set to ""). All other arrays are prefixed with the root prefix.

Next, the current key is inserted into an (awk-internal) array containing the keys. This array persists throughout the whole awk session and therefore contains keys inserted by prior lines. The key is inserted into the array using its indent as the array index.

keys[indent] = key;

Because this array contains keys from previous lines, any keys with an indent level grater than the current line's indent level are removed:

 for (i in keys) {if (i > indent) {delete keys[i]}}

This leaves the keys array containing the key-chain from the root at indent level 0 to the current line. It removes stale keys that remain when the prior line was indented deeper than the current line.

The final section outputs the bash commands: an input line without a value starts a new indent level (a collection in YAML parlance) and an input line with a value adds a key to the current collection.

The collection's name is the concatenation of the current line's prefix and parent_key.

When a key has a value, a key with that value is assigned to the current collection like this:

printf("%s%s[%s]=\"%s\";\n", prefix, parent_key , key, value);
printf("%s%s[keys]+=\" %s\";\n", prefix, parent_key , key);

The first statement outputs the command to assign the value to an associative array element named after the key and the second one outputs the command to add the key to the collection's space-delimited keys list:

<current_collection>[<key>]="<value>";
<current_collection>[keys]+=" <key>";

When a key doesn't have a value, a new collection is started like this:

printf("%s%s[children]+=\" %s%s\";\n", prefix, parent_key , root_prefix, key);
printf("declare -g -A %s%s;\n", root_prefix, key);

The first statement outputs the command to add the new collection to the current's collection's space-delimited children list and the second one outputs the command to declare a new associative array for the new collection:

<current_collection>[children]+=" <new_collection>"
declare -g -A <new_collection>;

All of the output from yay_parse can be parsed as bash commands by the bash eval or source built-in commands.

here an extended version of the Stefan Farestam's answer:

function parse_yaml {
   local prefix=$2
   local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
   sed -ne "s|,$s\]$s\$|]|" \
        -e ":1;s|^\($s\)\($w\)$s:$s\[$s\(.*\)$s,$s\(.*\)$s\]|\1\2: [\3]\n\1  - \4|;t1" \
        -e "s|^\($s\)\($w\)$s:$s\[$s\(.*\)$s\]|\1\2:\n\1  - \3|;p" $1 | \
   sed -ne "s|,$s}$s\$|}|" \
        -e ":1;s|^\($s\)-$s{$s\(.*\)$s,$s\($w\)$s:$s\(.*\)$s}|\1- {\2}\n\1  \3: \4|;t1" \
        -e    "s|^\($s\)-$s{$s\(.*\)$s}|\1-\n\1  \2|;p" | \
   sed -ne "s|^\($s\):|\1|" \
        -e "s|^\($s\)-$s[\"']\(.*\)[\"']$s\$|\1$fs$fs\2|p" \
        -e "s|^\($s\)-$s\(.*\)$s\$|\1$fs$fs\2|p" \
        -e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
        -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" | \
   awk -F$fs '{
      indent = length($1)/2;
      vname[indent] = $2;
      for (i in vname) {if (i > indent) {delete vname[i]; idx[i]=0}}
      if(length($2)== 0){  vname[indent]= ++idx[indent] };
      if (length($3) > 0) {
         vn=""; for (i=0; i<indent; i++) { vn=(vn)(vname[i])("_")}
         printf("%s%s%s=\"%s\"\n", "'$prefix'",vn, vname[indent], $3);
      }
   }'
}

This version supports the - notation and the short notation for dictionaries and lists. The following input:

global:
  input:
    - "main.c"
    - "main.h"
  flags: [ "-O3", "-fpic" ]
  sample_input:
    -  { property1: value, property2: "value2" }
    -  { property1: "value3", property2: 'value 4' }

produces this output:

global_input_1="main.c"
global_input_2="main.h"
global_flags_1="-O3"
global_flags_2="-fpic"
global_sample_input_1_property1="value"
global_sample_input_1_property2="value2"
global_sample_input_2_property1="value3"
global_sample_input_2_property2="value 4"

as you can see the - items automatically get numbered in order to obtain different variable names for each item. In bash there are no multidimensional arrays, so this is one way to work around. Multiple levels are supported. To work around the problem with trailing white spaces mentioned by @briceburg one should enclose the values in single or double quotes. However, there are still some limitations: Expansion of the dictionaries and lists can produce wrong results when values contain commas. Also, more complex structures like values spanning multiple lines (like ssh-keys) are not (yet) supported.

A few words about the code: The first sed command expands the short form of dictionaries { key: value, ...} to regular and converts them to more simple yaml style. The second sed call does the same for the short notation of lists and converts [ entry, ... ] to an itemized list with the - notation. The third sed call is the original one that handled normal dictionaries, now with the addition to handle lists with - and indentations. The awk part introduces an index for each indentation level and increases it when the variable name is empty (i.e. when processing a list). The current value of the counters are used instead of the empty vname. When going up one level, the counters are zeroed.

Edit: I have created a github repository for this.

jonseymour

Another option is to convert the YAML to JSON, then use jq to interact with the JSON representation either to extract information from it or edit it.

I wrote a simple bash script that contains this glue - see Y2J project on GitHub

mushuweasel
perl -ne 'chomp; printf qq/%s="%s"\n/, split(/\s*:\s*/,$_,2)' file.yml > file.sh

I know this is very specific, but I think my answer could be helpful for certain users.
If you have node and npm installed on your machine, you can use js-yaml.
First install :

npm i -g js-yaml
# or locally
npm i js-yaml

then in your bash script

#!/bin/bash
js-yaml your-yaml-file.yml

Also if you are using jq you can do something like that

#!/bin/bash
json="$(js-yaml your-yaml-file.yml)"
aproperty="$(jq '.apropery' <<< "$json")"
echo "$aproperty"

Because js-yaml converts a yaml file to a json string literal. You can then use the string with any json parser in your unix system.

If you have python 2 and PyYAML, you can use this parser I wrote called parse_yaml.py. Some of the neater things it does is let you choose a prefix (in case you have more than one file with similar variables) and to pick a single value from a yaml file.

For example if you have these yaml files:

staging.yaml:

db:
    type: sqllite
    host: 127.0.0.1
    user: dev
    password: password123

prod.yaml:

db:
    type: postgres
    host: 10.0.50.100
    user: postgres
    password: password123

You can load both without conflict.

$ eval $(python parse_yaml.py prod.yaml --prefix prod --cap)
$ eval $(python parse_yaml.py staging.yaml --prefix stg --cap)
$ echo $PROD_DB_HOST
10.0.50.100
$ echo $STG_DB_HOST
127.0.0.1

And even cherry pick the values you want.

$ prod_user=$(python parse_yaml.py prod.yaml --get db_user)
$ prod_port=$(python parse_yaml.py prod.yaml --get db_port --default 5432)
$ echo prod_user
postgres
$ echo prod_port
5432

You could use an equivalent of yq that is written in golang:

./go-yg -yamlFile /home/user/dev/ansible-firefox/defaults/main.yml -key
firefox_version

returns:

62.0.3

You can also consider using Grunt (The JavaScript Task Runner). Can be easily integrated with shell. It supports reading YAML (grunt.file.readYAML) and JSON (grunt.file.readJSON) files.

This can be achieved by creating a task in Gruntfile.js (or Gruntfile.coffee), e.g.:

module.exports = function (grunt) {

    grunt.registerTask('foo', ['load_yml']);

    grunt.registerTask('load_yml', function () {
        var data = grunt.file.readYAML('foo.yml');
        Object.keys(data).forEach(function (g) {
          // ... switch (g) { case 'my_key':
        });
    });

};

then from shell just simply run grunt foo (check grunt --help for available tasks).

Further more you can implement exec:foo tasks (grunt-exec) with input variables passed from your task (foo: { cmd: 'echo bar <%= foo %>' }) in order to print the output in whatever format you want, then pipe it into another command.


There is also similar tool to Grunt, it's called gulp with additional plugin gulp-yaml.

Install via: npm install --save-dev gulp-yaml

Sample usage:

var yaml = require('gulp-yaml');

gulp.src('./src/*.yml')
  .pipe(yaml())
  .pipe(gulp.dest('./dist/'))

gulp.src('./src/*.yml')
  .pipe(yaml({ space: 2 }))
  .pipe(gulp.dest('./dist/'))

gulp.src('./src/*.yml')
  .pipe(yaml({ safe: true }))
  .pipe(gulp.dest('./dist/'))

To more options to deal with YAML format, check YAML site for available projects, libraries and other resources which can help you to parse that format.


Other tools:

  • Jshon

    parses, reads and creates JSON

If you need a single value you could a tool which converts your YAML document to JSON and feed to jq, for example yq.

Content of sample.yaml:

---
bob:
  item1:
    cats: bananas
  item2:
    cats: apples
  thing:
    cats: oranges

Example:

$ yq -r '.bob["thing"]["cats"]' sample.yaml 
oranges