License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/OASIcs.SLATE.2022.5
URN: urn:nbn:de:0030-drops-167515

Cardoso, Hugo André Coelho ; Ramalho, José Carlos

Synthetic Data Generation from JSON Schemas



This document describes the development of DataGen From Schemas, a new version of DataGen that automatically generates representative synthetic datasets from JSON and XML schemas. Such datasets facilitate tasks such as the thorough testing of software applications and scientific work in relevant areas, namely Data Science. This paper focuses solely on the JSON Schema component of the application.
The prior version of DataGen is an online open-source application for quickly prototyping datasets through its own Domain Specific Language (DSL) for specifying data models. DataGen parses these models and generates synthetic datasets according to the stipulated structural and semantic restrictions, automating the whole data generation process with values created at runtime and/or drawn from a library of support datasets.
The objective of this new product, DataGen From Schemas, is to expand DataGen’s use cases and raise the abstraction level of dataset specification, making it possible to generate synthetic datasets directly from schemas. The new platform builds upon the prior version and complements it: the two operate jointly and share the same data layer, which ensures their compatibility and the portability of the created DSL models between them. It parses schema files and generates corresponding DSL models, effectively translating the JSON specification into a DataGen model, and then uses the original application as middleware to generate the final datasets.
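To make the idea of schema-driven generation concrete, the following is a minimal sketch in Python. It is not DataGen's implementation: it skips the intermediate DSL model entirely and produces a random instance directly from a small subset of JSON Schema keywords. The function name `generate`, the example schema, and the fallback bounds used when a keyword is absent are all assumptions made for illustration.

```python
import random
import string


def generate(schema, rng):
    """Return one random value for a small subset of JSON Schema (hypothetical helper)."""
    if "enum" in schema:
        return rng.choice(schema["enum"])
    kind = schema.get("type")
    if kind == "object":
        # Every declared property is generated; "required" is ignored in this sketch.
        return {name: generate(sub, rng)
                for name, sub in schema.get("properties", {}).items()}
    if kind == "array":
        low = schema.get("minItems", 0)
        high = schema.get("maxItems", low + 3)  # assumed default upper bound
        return [generate(schema.get("items", {}), rng)
                for _ in range(rng.randint(low, high))]
    if kind == "integer":
        low = schema.get("minimum", 0)
        high = schema.get("maximum", low + 100)  # assumed default upper bound
        return rng.randint(low, high)
    if kind == "number":
        return rng.uniform(schema.get("minimum", 0.0), schema.get("maximum", 1.0))
    if kind == "string":
        low = schema.get("minLength", 1)
        high = schema.get("maxLength", low + 8)  # assumed default upper bound
        return "".join(rng.choice(string.ascii_lowercase)
                       for _ in range(rng.randint(low, high)))
    if kind == "boolean":
        return rng.random() < 0.5
    return None  # unsupported or missing type


# Illustrative schema, not taken from the paper.
example_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer", "minimum": 1, "maximum": 10},
        "name": {"type": "string", "minLength": 3, "maxLength": 6},
        "tags": {"type": "array", "minItems": 1, "maxItems": 2,
                 "items": {"enum": ["a", "b", "c"]}},
        "active": {"type": "boolean"},
    },
}

if __name__ == "__main__":
    print(generate(example_schema, random.Random(42)))
```

Passing a seeded `random.Random` keeps runs reproducible. A translator of the kind the abstract describes would instead emit a DSL model at each of these branches and leave value generation to the original application.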

BibTeX - Entry

  author =	{Cardoso, Hugo Andr\'{e} Coelho and Ramalho, Jos\'{e} Carlos},
  title =	{{Synthetic Data Generation from JSON Schemas}},
  booktitle =	{11th Symposium on Languages, Applications and Technologies (SLATE 2022)},
  pages =	{5:1--5:16},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-245-7},
  ISSN =	{2190-6807},
  year =	{2022},
  volume =	{104},
  editor =	{Cordeiro, Jo\~{a}o and Pereira, Maria Jo\~{a}o and Rodrigues, Nuno F. and Pais, Sebasti\~{a}o},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{},
  URN =		{urn:nbn:de:0030-drops-167515},
  doi =		{10.4230/OASIcs.SLATE.2022.5},
  annote =	{Keywords: Schemas, JSON, Data Generation, Synthetic Data, DataGen, DSL, Dataset, Grammar, Randomization, Open Source, Data Science, REST API, PEG.js}

Keywords: Schemas, JSON, Data Generation, Synthetic Data, DataGen, DSL, Dataset, Grammar, Randomization, Open Source, Data Science, REST API, PEG.js
Collection: 11th Symposium on Languages, Applications and Technologies (SLATE 2022)
Issue Date: 2022
Date of publication: 27.07.2022
