Is there any tool able to create an AVRO schema from a 'typical' JSON document.
For example:
{
"records":[{"name":"X1","age":2},{"name":"X2","age":4}]
}
I found http://jsonschema.net/reboot/#/ which generates a 'json-schema'
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "id": "http://jsonschema.net#",
  "type": "object",
  "required": false,
  "properties": {
    "records": {
      "id": "#records",
      "type": "array",
      "required": false,
      "items": {
        "id": "#1",
        "type": "object",
        "required": false,
        "properties": {
          "name": {
            "id": "#name",
            "type": "string",
            "required": false
          },
          "age": {
            "id": "#age",
            "type": "integer",
            "required": false
          }
        }
      }
    }
  }
}
but I'd like an AVRO version.
An Avro schema is created using JSON format. JSON is short for JavaScript Object Notation, and it is a lightweight, text-based data interchange format that is intended to be easy for humans to read and write. JSON is described in a great many places, both on the web and in after-market documentation.
For those using the C# Avro Apache library, the utility function DataFileReader<GenericRecord>. OpenReader(filename); can be used to instantiate the dataFileReader . Once instantiated, it the dataFileReader is used just like in Java.
Apache Avro ships with some very advanced and efficient tools for reading and writing binary Avro but their support for JSON to Avro conversion is unfortunately limited and requires wrapping fields with type declarations if you have some optional fields in your schema.
You can achieve that easily using Apache Spark and python. First download the spark distribution from http://spark.apache.org/downloads.html, then install avro package for python using pip. Then run pyspark with avro package:
./bin/pyspark --packages com.databricks:spark-avro_2.11:3.1.0
and use the following code (assuming the input.json files contains one or more json documents, each in separate line):
import os, avro.datafile
spark.read.json('input.json').coalesce(1).write.format("com.databricks.spark.avro").save("output.avro")
avrofile = filter(lambda file: file.startswith('part-r-00000'), os.listdir('output.avro'))[0]
with open('output.avro/' + avrofile) as avrofile:
    reader = avro.datafile.DataFileReader(avrofile, avro.io.DatumReader())
    print(reader.datum_reader.writers_schema)
For example: for input file with content:
{'string': 'somestring', 'number': 3.14, 'structure': {'integer': 13}}
{'string': 'somestring2', 'structure': {'integer': 14}}
The script will result in:
{"fields": [{"type": ["double", "null"], "name": "number"}, {"type": ["string", "null"], "name": "string"}, {"type": [{"type": "record", "namespace": "", "name": "structure", "fields": [{"type": ["long", "null"], "name": "integer"}]}, "null"], "name": "structure"}], "type": "record", "name": "topLevelRecord"}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With