User Guide¶
Configuration¶
SDVG uses two configuration files: the SDVG instance configuration file and the data generation configuration file.
SDVG instance configuration¶
Description of SDVG instance configuration fields¶
The SDVG instance configuration includes the following fields:
- `log_format`: Log format. Supported values: `text`, `json`. Default is `text`.
- `http`: HTTP server configuration described by the `HTTPConfig` structure.
- `open_ai`: OpenAI configuration described by the `OpenAI` structure.
The http structure describes the HTTP server configuration used for
interacting with SDVG and contains the following fields:
- `listen_address`: Address for the HTTP server to listen on. Default is `:8080`.
- `read_timeout`: Data read timeout. Default is `1m` (1 minute).
- `write_timeout`: Data write timeout. Default is `1m` (1 minute).
- `idle_timeout`: Idle timeout. Default is `1m` (1 minute).
The open_ai structure describes the OpenAI configuration and includes the following fields:
- `api_key`: API key for accessing OpenAI.
- `base_url`: Base URL for the OpenAI API.
- `model`: OpenAI model.
Examples of SDVG instance configuration¶
Example configuration for HTTP server:
Example configuration for OpenAI:
    open_ai:
      api_key: "sk-123"
      base_url: "http://10.0.1.100:11434/v1"
      model: "deepseek-r1:70b-llama-distill-q8_0"
Data generation configuration¶
This configuration is directly used for data generation after launching SDVG.
Description of data generation configuration fields¶
The data generation configuration includes the following fields:
- `random_seed`: Seed for random number generation. If omitted or set to `0`, a random value is used.
- `workers_count`: Number of threads for data generation. Defaults to the CPU count multiplied by 4.
- `batch_size`: Batch size for data generation and output. Default is `1000`.
- `models`: Map of data models, with the model name as the key and a `models[*]` structure as the value.
- `models_to_ignore`: List of models to exclude from this SDVG data generation run (foreign keys referencing these models will still work if `random_seed` and `rows_count` remain unchanged).
- `output`: Output configuration for generated data, described by the `output` structure.
The output structure describes the generated data output configuration:
- `type`: Output format type. Supported values: `devnull`, `csv`, `parquet`, `http`, `tcs`. Default is `csv`.
- `dir`: Directory for storing generated data. Default is `./output`.
- `create_model_dir`: Specifies whether a separate directory is created for each model. Default is `false`.
- `params`: Parameters for the chosen output format type, described by the `output.params` structure.
- `checkpoint_interval`: How often the progress checkpoint file is updated. Default is `5s` (5 seconds).
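The defaults listed above can be summarised in a short Python sketch. It is illustrative only and is not SDVG's own code: the helper name `apply_generation_defaults` is hypothetical, and it assumes that "CPU count" means the number of logical CPUs reported by the operating system.

```python
import os


def apply_generation_defaults(config: dict) -> dict:
    """Return a copy of a generation config with the documented defaults filled in."""
    result = dict(config)
    # Assumption: "CPU count" is the logical CPU count reported by the OS.
    result.setdefault("workers_count", (os.cpu_count() or 1) * 4)
    result.setdefault("batch_size", 1000)
    result.setdefault("random_seed", 0)  # 0 means "use a random seed"

    output = dict(result.get("output") or {})
    output.setdefault("type", "csv")
    output.setdefault("dir", "./output")
    output.setdefault("create_model_dir", False)
    output.setdefault("checkpoint_interval", "5s")
    result["output"] = output
    return result


cfg = apply_generation_defaults({"models": {}, "output": {"type": "parquet"}})
print(cfg["batch_size"], cfg["output"]["type"], cfg["output"]["dir"])
```

Explicitly set values are kept, so only the omitted fields fall back to their defaults.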
The models[*] structure describes a data generation model and includes:
- `rows_count`: Number of rows to generate. Required field.
- `rows_per_file`: Number of rows per file, supported by `csv` and `parquet`. Default is `rows_count`.
- `generate_from`: Starting row number for generation. Default is `0`.
- `generate_to`: Ending row number for generation. Default is `rows_count`.
- `model_dir`: Directory to store data for this model, relative to `output_dir`. Defaults to the model name.
- `columns`: List of columns described by the `models[*].columns` structure.
- `partition_columns`: Columns used for data partitioning. Supported for `parquet` and `csv`.
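To make the interaction between `rows_count`, `rows_per_file`, `generate_from` and `generate_to` concrete, the following Python sketch splits a row interval into per-file chunks. It is an illustration under assumptions, not SDVG's implementation: the function `plan_files` is hypothetical, `generate_to` is treated as exclusive, and files are assumed to be filled sequentially.

```python
def plan_files(rows_count, rows_per_file=None, generate_from=0, generate_to=None):
    """Split the generated row interval into (start, end) chunks, one per file."""
    if rows_count <= 0:
        raise ValueError("rows_count is required and must be positive")
    # rows_per_file defaults to rows_count, i.e. a single file per model.
    step = rows_count if rows_per_file is None else rows_per_file
    if step <= 0:
        raise ValueError("rows_per_file must be positive")
    # generate_to defaults to rows_count (assumed exclusive here).
    end = rows_count if generate_to is None else generate_to
    return [(start, min(start + step, end)) for start in range(generate_from, end, step)]


print(plan_files(500000, rows_per_file=250000))
print(plan_files(10, rows_per_file=4))
```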
The models[*].partition_columns structure specifies data partitioning columns:
- `name`: Column name from the schema `models[*].columns`. Required field.
- `write_to_output`: Flag indicating whether the partition column is included in the final data files.
The models[*].columns structure describes a column in a data model:
- `name`: Column name. Required field.
- `type`: Column data type. Supported values: `integer`, `float`, `string`, `datetime`, `uuid`.
- `type_params`: Parameters for the chosen data type (`models[*].columns[*].type_params` structure).
- `values`: Enumeration of possible column values. Cannot coexist with the distinct parameters (`distinct_percentage`, `distinct_count`).
- `ordered`: Indicates whether column values should be ordered (similar to a sequence).
- `distinct_percentage`: Share of unique values, given as a fraction between `0` and `1`. Cannot coexist with `distinct_count`.
- `distinct_count`: Number of unique values. Must be greater than `0`. Cannot coexist with `distinct_percentage`.
- `null_percentage`: Share of null values, given as a fraction between `0` and `1`.
- `ranges`: A set of parameter ranges for a column that lets you specify several configurations with their percentage distribution (`range_percentage`). Each range (`ranges[*]`) can contain:
  - `type_params`: Parameters for the selected data type.
  - `values`: Enumeration of possible values in the range.
  - `ordered`: Flag for ordered values.
  - `distinct_percentage`: Percentage of unique values.
  - `distinct_count`: Number of unique values.
  - `null_percentage`: Percentage of null values.
  - `range_percentage`: Percentage of this range relative to the total data.
- `parquet_params`: Parameters for formatting values in `parquet` output.
- `foreign_key`: Foreign key reference in the format `model_name.column_name`. Values are sourced from this column. Cannot coexist with other column parameters.
- `foreign_key_order`: Indicates whether the foreign key order should be preserved. Useful for maintaining value correspondence with external tables.
Attention: The `ranges` parameter and direct specification of parameters at the column level (`values`, `type_params`, `distinct_percentage`, `distinct_count`, `null_percentage`, `ordered`) are mutually exclusive. They cannot be used simultaneously.
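The constraints described for a column can be expressed as a small validator. This is a minimal Python sketch written from the rules above, not SDVG's actual validation code; the function name `validate_column` and the error messages are hypothetical, and it assumes that `foreign_key_order` is the only parameter allowed next to `foreign_key`.

```python
# Column-level parameters that cannot be combined with `ranges`.
DIRECT_KEYS = {"values", "type_params", "distinct_percentage",
               "distinct_count", "null_percentage", "ordered"}


def validate_column(column: dict) -> list:
    """Return a list of human-readable problems found in a column definition."""
    errors = []
    if "name" not in column:
        errors.append("name is required")
    if "ranges" in column:
        clash = sorted(DIRECT_KEYS.intersection(column))
        if clash:
            errors.append(f"ranges cannot be combined with column-level parameters: {clash}")
    if "distinct_percentage" in column and "distinct_count" in column:
        errors.append("distinct_percentage and distinct_count are mutually exclusive")
    if "values" in column and ("distinct_percentage" in column or "distinct_count" in column):
        errors.append("values cannot coexist with distinct parameters")
    for key in ("distinct_percentage", "null_percentage"):
        if key in column and not 0 <= column[key] <= 1:
            errors.append(f"{key} must be between 0 and 1")
    if "distinct_count" in column and column["distinct_count"] <= 0:
        errors.append("distinct_count must be greater than 0")
    if "foreign_key" in column:
        # Assumption: only name and foreign_key_order may accompany foreign_key.
        extra = sorted(set(column) - {"name", "foreign_key", "foreign_key_order"})
        if extra:
            errors.append(f"foreign_key cannot coexist with: {extra}")
    return errors


print(validate_column({"name": "user_id", "foreign_key": "user.id"}))
print(validate_column({"name": "x", "ranges": [{}], "ordered": True}))
```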
Structure models[*].columns[*].parquet_params:
- `encoding`: Encoding for the column. Supported values: `PLAIN`, `RLE_DICTIONARY`, `DELTA_BINARY_PACKED`, `DELTA_BYTE_ARRAY`, `DELTA_LENGTH_BYTE_ARRAY`. Default is `PLAIN`.
Structure models[*].columns[*].type_params for data type integer:
- `bit_width`: Bit width for integer. Supported values: `8`, `16`, `32`, `64`. Default is `32`.
- `from`: Minimum value for integer. Defaults to the minimum possible value for the selected bit width.
- `to`: Maximum value for integer. Defaults to the maximum possible value for the selected bit width.
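The default `from` and `to` values depend on `bit_width`. The Python sketch below computes them under the assumption that integers are signed two's-complement values, which the text above does not state explicitly; the function name `integer_bounds` is hypothetical.

```python
def integer_bounds(bit_width=32):
    """Return the (minimum, maximum) pair for a signed integer of the given width."""
    if bit_width not in (8, 16, 32, 64):
        raise ValueError("bit_width must be one of 8, 16, 32, 64")
    limit = 1 << (bit_width - 1)
    return -limit, limit - 1


for width in (8, 16, 32, 64):
    print(width, integer_bounds(width))
```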
Structure models[*].columns[*].type_params for data type float:
- `bit_width`: Bit width for float. Supported values: `32`, `64`. Default is `32`.
- `from`: Minimum value for float. Defaults to the minimum possible value for the selected bit width.
- `to`: Maximum value for float. Defaults to the maximum possible value for the selected bit width.
Structure models[*].columns[*].type_params for data type string:
- `min_length`: Minimum string length. Default is `1`.
- `max_length`: Maximum string length. Default is `32`.
- `logical_type`: Logical type of string. Supported values: `first_name`, `last_name`, `phone`, `text`.
- `template`: Template for string generation. The symbol `A` stands for any uppercase letter, `a` for any lowercase letter, `0` for any digit, and `#` for any character. Other characters remain as-is.
- `locale`: Locale for generated strings. Supported values: `ru`, `en`. Default is `en`.
- `without_large_letters`: Flag indicating whether uppercase letters should be excluded from the string.
- `without_small_letters`: Flag indicating whether lowercase letters should be excluded from the string.
- `without_numbers`: Flag indicating whether numbers should be excluded from the string.
- `without_special_chars`: Flag indicating whether special characters should be excluded from the string.
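The `template` rules can be illustrated with a short Python sketch. It only mimics the placeholder semantics described above and is not SDVG's generator: the function `render_template` is hypothetical, and "any character" for `#` is assumed here to mean ASCII letters and digits.

```python
import random
import string


def render_template(template, rng):
    """Expand A/a/0/# placeholders; every other character is copied as-is."""
    # Assumption: '#' draws from ASCII letters and digits.
    any_char = string.ascii_letters + string.digits
    alphabets = {
        "A": string.ascii_uppercase,
        "a": string.ascii_lowercase,
        "0": string.digits,
        "#": any_char,
    }
    return "".join(rng.choice(alphabets[ch]) if ch in alphabets else ch for ch in template)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
sample = render_template("AA 00 000 000", rng)
print(sample)
```

With the template `AA 00 000 000`, the result always has two uppercase letters followed by three space-separated digit groups.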
Structure models[*].columns[*].type_params for data type datetime:
- `from`: Minimum date-time value. Default is `01.01.1900`.
- `to`: Maximum date-time value. Default is `01.01.2025`.
Structure output.params for format csv:
- `float_precision`: Floating-point number precision. Default is `2`.
- `datetime_format`: Date-time format. Default is `2006-01-02T15:04:05Z07:00`.
- `without_headers`: Flag indicating whether CSV headers should be excluded from data files.
- `delimiter`: Single-character CSV delimiter. Default is `,`.
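The effect of these CSV parameters can be sketched in Python. This is an approximation for illustration, not SDVG's writer: the helpers `format_value` and `to_csv` are hypothetical, the date-time layout is passed as a Python `strftime` pattern rather than the Go-style layout used in the configuration, and writing nulls as empty fields is an assumption.

```python
import csv
import io
from datetime import datetime, timezone


def format_value(value, float_precision=2, strftime_format="%Y-%m-%dT%H:%M:%S%z"):
    """Format one cell roughly the way the csv output parameters describe."""
    if value is None:
        return ""  # assumption: nulls become empty fields
    if isinstance(value, float):
        return f"{value:.{float_precision}f}"
    if isinstance(value, datetime):
        return value.strftime(strftime_format)
    return str(value)


def to_csv(rows, columns, delimiter=",", without_headers=False, **fmt):
    """Serialise rows (dicts keyed by column name) into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    if not without_headers:
        writer.writerow(columns)
    for row in rows:
        writer.writerow([format_value(row.get(col), **fmt) for col in columns])
    return buffer.getvalue()


rows = [{"id": 1, "rating": 4.26, "created": datetime(2020, 1, 1, tzinfo=timezone.utc)}]
text = to_csv(rows, ["id", "rating", "created"], delimiter=";",
              float_precision=1, strftime_format="%Y-%m-%d")
print(text)
```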
Structure output.params for format parquet:
- `compression_codec`: Compression codec. Supported values: `UNCOMPRESSED`, `SNAPPY`, `GZIP`, `LZ4`, `ZSTD`. Default is `UNCOMPRESSED`.
- `float_precision`: Floating-point number precision. Default is `2`.
- `datetime_format`: Date-time format. Supported values: `millis`, `micros`. Default is `millis`.
Structure output.params for format http:
- `endpoint`: Endpoint for sending data.
- `timeout`: Timeout for sending data, specified as a string combining `h`, `m`, `s` without spaces, e.g., `1h`, `5m30s`, `2h5s`. Default is `1m` (1 minute).
- `batch_size`: Number of data records sent in one request. Default is `1000`.
- `workers_count`: Number of threads for writing data. Default is `1`. Experimental field.
- `headers`: HTTP request headers specified as a dictionary. Default is none.
- `format_template`: Template-based format for sending data, configured using Golang templates. Available for use in `format_template`:
  - fields:
    - `ModelName`: name of the model.
    - `Rows`: array of records, where each element is a dictionary representing a data row. Dictionary keys correspond to column names, and values correspond to data in those columns.
  - functions:
    - `len`: returns the length of the given element.
    - `json`: converts the given element to a JSON string.
Example value for the format_template field:
    format_template: |
      {
        "table_name": "{{ .ModelName }}",
        "meta": {
          "rows_count": {{ len .Rows }}
        },
        "rows": [
          {{- range $i, $row := .Rows }}
          {{- if $i}},{{ end }}
          {
            "id": {{ index $row "id" }},
            "username": "{{ index $row "name" }}"
          }
          {{- end }}
        ]
      }
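To show what a request body produced by such a template might look like, the Python sketch below builds an equivalent JSON document for two sample rows. It is not a Go template renderer: the function `render_payload` and the sample rows are hypothetical, and unlike the raw template it escapes string values through `json.dumps`.

```python
import json


def render_payload(model_name, rows):
    """Build the JSON body that the example template describes for one batch."""
    payload = {
        "table_name": model_name,           # {{ .ModelName }}
        "meta": {"rows_count": len(rows)},  # {{ len .Rows }}
        "rows": [
            # Each row exposes its columns by name, as in {{ index $row "id" }}.
            {"id": row["id"], "username": row["name"]}
            for row in rows
        ],
    }
    return json.dumps(payload, indent=2)


body = render_payload("user", [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}])
print(body)
```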
Default value for the format_template field:
Structure of output.params for tcs format:
The structure is the same as for the `http` format, except that the `format_template` field cannot be changed and is always set to its default value.
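The `timeout` strings accepted by the `http` and `tcs` outputs combine `h`, `m` and `s` components. The Python sketch below converts such a string to seconds. It is an illustration only: the function `parse_timeout` is hypothetical, and it assumes whole-number components in the forms shown in this guide (`1h`, `5m30s`, `2h5s`), which may be narrower than what SDVG actually accepts.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")


def parse_timeout(text):
    """Convert a string such as '5m30s' into a number of seconds."""
    total = 0
    position = 0
    for match in _TOKEN.finditer(text):
        if match.start() != position:
            # Something other than <number><unit> sits between components.
            raise ValueError(f"invalid timeout: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0 or position != len(text):
        raise ValueError(f"invalid timeout: {text!r}")
    return total


for value in ("1h", "5m30s", "2h5s", "1m"):
    print(value, parse_timeout(value))
```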
Examples of data generation configuration¶
Example data model configuration:
    workers_count: 32
    batch_size: 1000
    random_seed: 0
    output:
      type: "devnull"
      dir: output-dir
    models:
      token:
        rows_count: 500000
        model_dir: token_model
        columns:
          - name: id
            type: uuid
          - name: user_id
            foreign_key: user.id
          - name: session_id
            type: string
            type_params:
              min_length: 16
              max_length: 32
            distinct_percentage: 1
          - name: token_type
            type: string
            values:
              - "access"
              - "refresh"
      user:
        rows_count: 10000
        columns:
          - name: id
            type: integer
            type_params:
              from: 1
              to: 500000
            ordered: true
          - name: str_id
            type: string
            ordered: true
          - name: ru_phone
            type: string
            type_params:
              logical_type: phone
              locale: ru
          - name: first_name_ru
            type: string
            type_params:
              logical_type: first_name
              locale: ru
          - name: last_name_ru
            type: string
            type_params:
              logical_type: last_name
              locale: ru
          - name: first_name_en
            type: string
            type_params:
              logical_type: first_name
          - name: passport
            type: string
            type_params:
              template: AA 00 000 000
            distinct_percentage: 1
            ordered: true
          - name: rating
            type: float
            type_params:
              from: 0.0
              to: 5.0
          - name: created
            type: datetime
            type_params:
              from: 2020-01-01T00:00:00Z
            ordered: true
          - name: birthday
            type: datetime
            ranges:
              - type_params:
                  from: 1900-01-01T00:00:00Z
              - values: [null]
                range_percentage: 0.1
              - values:
                  - 2005-03-09T04:44:00Z
Example configuration for generating CSV files:
    output:
      type: csv
      params:
        float_precision: 1
        datetime_format: 2006-01-02
    models:
      user:
        rows_count: 10000
        columns:
          - name: id
            type: uuid
          - name: session_id
            type: string
          - name: last_seen_at
            type: datetime
        partition_columns:
          - name: id
            write_to_output: false
          - name: session_id
            write_to_output: false
Example configuration for generating Parquet files:
    output:
      type: parquet
      params:
        float_precision: 1
        datetime_format: millis
        compression_codec: UNCOMPRESSED
    models:
      token:
        rows_count: 500000
        rows_per_file: 250000
        columns:
          - name: id
            type: uuid
          - name: session_id
            type: string
            parquet:
              encoding: RLE_DICTIONARY
            distinct_percentage: 1
        partition_columns:
          - name: id
            write_to_output: true
Example configuration for sending generated data via HTTP:
    output:
      type: http
      params:
        endpoint: "http://127.0.0.1:8080/insert"
        timeout: 30s
        headers:
          Authorization: "Bearer <token>"
        format_template: |
          {
            "table_name": "{{ .ModelName }}",
            "meta": {
              "rows_count": {{ len .Rows }}
            },
            "rows": {{ json .Rows }}
          }
    models:
      user:
        rows_count: 10000
        columns:
          - name: id
            type: uuid
          - name: session_id
            type: string
Example configuration for sending generated data to TCS:
    output:
      type: tcs
      params:
        endpoint: "http://127.0.0.1:7101/insert"
        timeout: 30s
    models:
      user:
        rows_count: 10000
        columns:
          - name: id
            type: uuid
          - name: session_id
            type: string
Launch¶
To start in interactive mode, simply run the SDVG binary:
To get information about available commands and their arguments:
Data generation¶
Before starting data generation, SDVG checks the output directory for conflicting files. If conflicts are found, they will be displayed as a list of errors upon startup. This helps avoid overwriting or corrupting existing data.
To start data generation with a specified configuration file:
Ignoring conflicts¶
If you want to automatically remove conflicting files from the output directory
and continue generation without additional prompts, use the -F or --force flag:
Continuing generation¶
To continue generation from the last recorded row:
Important: To correctly continue generation, you must not change the generation configuration or already generated data.