Parsing YAML out of a Markdown file

Question

I am working with some legacy code that I have inherited (i.e., many of these design decisions were not mine).

The code takes a directory organized into subdirectories with markdown files, and compiles them into one large markdown file (using Markdown-PP: https://github.com/jreese/markdown-pp). Then it converts this file into HTML (using pandoc: https://pandoc.org/), and finally into a PDF (using wkhtmltopdf: https://wkhtmltopdf.org/).

The problem that I am running into is that many of the original markdown files have YAML metadata headers. When stitched together by Markdown-PP, the large markdown ends up with numerous YAML metadata blocks interspersed throughout. Most of this metadata is lost when converting into HTML because of the way pandoc processes YAML (many of the headers use the same key names, and pandoc combines the separate YAML headers and only preserves the first value of the corresponding key).

I originally had no YAML appearing in the HTML, but was able to change this by modifying the HTML template for pandoc. Even so, I only get the first value for each key. It was not clear whether there was a way around this in pandoc, so I instead looked into processing the YAML into HTML before the pandoc step. I have tried parsing the YAML in the combined markdown using PyYAML (yaml.load_all()), but only the first YAML block appears in the output.

An example of a YAML block:

---
author: foo
size_minimum: 100
time_req_minutes: 120
# and so on
---

The issue is that each of the 20+ modules in the final document has its own associated metadata.

To try to parse the YAML, I was using code borrowed from this post: Is it possible to use PyYAML to read a text file written with a "YAML front matter" block inside?

with a few modifications.

import yaml
import sys

def get_yaml(f):
  pointer = f.tell()
  if f.readline() != '---\n':
    f.seek(pointer)
    return ''
  readline = iter(f.readline, '')
  readline = iter(readline.__next__, '---\n') #underscores needed for Python3?
  return ''.join(readline)

# Removed sys.argv, not sure what it was doing
with open(filepath, encoding='UTF-8') as f:
    # load_all to get all the YAML documents; the Loader option is required
    # for the most recent PyYAML; list() because load_all returns a generator
    config = list(yaml.load_all(get_yaml(f), Loader=yaml.SafeLoader))
    text = f.read()
    print("TEXT from", f)
    #print(text)
    print("CONFIG from", f)
    print(config)

But even this only resulted in the first YAML block being read and output.
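One likely reason: get_yaml() returns only the text between the opening '---' and the first closing '---', so yaml.load_all() is handed a single document and every later block stays in the unread markdown. A minimal sketch of collecting every block instead is below. The names FRONT_MATTER and split_yaml_blocks are made up for illustration, and the regex assumes each metadata block opens and closes with a line containing only '---' (so a markdown horizontal rule written as '---' would confuse it).

```python
import re

# Assumption: a metadata block starts and ends with a line that is exactly
# '---' (trailing spaces or tabs allowed). Pandoc's '...' closer is not handled.
FRONT_MATTER = re.compile(
    r"^---[ \t]*\n(.*?)^---[ \t]*$\n?",
    re.MULTILINE | re.DOTALL,
)


def split_yaml_blocks(text):
    """Return (raw_yaml, markdown_body) pairs in document order.

    Any text before the first metadata block is ignored by this sketch.
    """
    matches = list(FRONT_MATTER.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # The body runs from the end of this block to the start of the next.
        if i + 1 < len(matches):
            body_end = matches[i + 1].start()
        else:
            body_end = len(text)
        sections.append((match.group(1), text[match.end():body_end]))
    return sections


if __name__ == "__main__":
    combined = (
        "---\nauthor: foo\nsize_minimum: 100\n---\n# Module 1\n\nBody one.\n"
        "---\nauthor: bar\n---\n# Module 2\n"
    )
    for raw_yaml, body in split_yaml_blocks(combined):
        print(repr(raw_yaml), "->", repr(body))
```

Each raw_yaml string can then be passed to yaml.safe_load() on its own, which keeps repeated keys such as author separate per module instead of merging them.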

I would like to be able to parse the YAML from the large markdown file and replace each block, in place, with the corresponding HTML. I am just not sure whether these (or any) packages are capable of doing so. It may be that I just need to manually change the YAML to HTML in the original Markdown files (time intensive, but I could probably already be done with it if I had started that way).
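For the in-place replacement, one possible approach is a regex substitution with a callback, sketched below under some loud assumptions. The helper names (parse_flat, render_meta, yaml_blocks_to_html), the definition-list markup, and the module-meta class are all invented for illustration. parse_flat is a naive stand-in that only understands flat "key: value" lines like the example block; a real YAML parser can be swapped in through the parse argument.

```python
import html
import re

# Assumption: a metadata block starts and ends with a line that is exactly '---'.
FRONT_MATTER = re.compile(
    r"^---[ \t]*\n(.*?)^---[ \t]*$\n?",
    re.MULTILINE | re.DOTALL,
)


def parse_flat(raw_yaml):
    """Naive stand-in parser: flat 'key: value' lines only, comments skipped."""
    meta = {}
    for line in raw_yaml.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta


def render_meta(meta):
    """Render a metadata dict as an HTML definition list (hypothetical markup)."""
    rows = "".join(
        f"  <dt>{html.escape(str(key))}</dt><dd>{html.escape(str(value))}</dd>\n"
        for key, value in meta.items()
    )
    return f'<dl class="module-meta">\n{rows}</dl>\n\n'


def yaml_blocks_to_html(text, parse=parse_flat):
    """Replace every metadata block in text with its rendered HTML."""
    # 'or {}' covers parsers that return None for an empty block.
    return FRONT_MATTER.sub(lambda m: render_meta(parse(m.group(1)) or {}), text)


if __name__ == "__main__":
    combined = (
        "---\nauthor: foo\nsize_minimum: 100\n# and so on\n---\n# Module 1\n"
        "---\nauthor: bar\n---\n# Module 2\n"
    )
    print(yaml_blocks_to_html(combined))
```

With PyYAML installed, yaml_blocks_to_html(text, parse=yaml.safe_load) should handle nested or quoted values that the naive parser cannot. Because the YAML is gone before pandoc sees the file, repeated keys no longer collide.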

riot_starter · Accepted Answer

What about this library: https://github.com/eyeseast/python-frontmatter

It parses both the front-matter and the Markdown in the file, placing the Markdown part in the content attribute of the resulting object.

Works with both front-matter containing and front-matterless (is there such a word?) files.
