A comprehensive Python library for generating synthetic data with various distributions, correlations, and constraints for machine learning and data science applications.
- Synthetic Generator
- Table of Contents
- Features
- Why Synthetic Generator?
- Quick Start
- Detailed Documentation
- Use Cases
- Advanced Features
- Available Templates
- Package Information
- Development
- Contributing
- License
- Getting Started
- Contact
- Acknowledgments
- Multiple Distributions: Normal, Uniform, Exponential, Gamma, Beta, Weibull, Poisson, Binomial, Geometric, Categorical
- Data Types: Integer, Float, String, Boolean, Date, DateTime, Email, Phone, Address, Name
- Correlations: Define relationships between variables with correlation matrices
- Constraints: Value ranges, uniqueness, null probabilities, pattern matching
- Dependencies: Generate data based on other columns with conditional rules
- Schema Inference: Automatically detect data types and constraints from existing data (no distribution inference)
- Templates: Pre-built schemas for common use cases (customer data, medical data, e-commerce, financial)
- Privacy: Basic anonymization support
- Validation: Comprehensive data validation against schemas (data types and constraints only)
- Export: Multiple format support (CSV, JSON, Parquet, Excel)
- Easy-to-Use API: Simple, intuitive interface for data generation
- Web Interface: Modern, responsive web UI for interactive data generation
- Flexible Configuration: Support for both programmatic and configuration-based setup
- Reproducibility: Seed-based random generation for consistent results
- Performance: Optimized for large-scale data generation
Synthetic Generator is designed to make synthetic data generation simple and flexible. Typical uses include:
- Testing applications with realistic data
- Training machine learning models with diverse datasets
- Prototyping without sensitive information
- Augmenting data for research purposes
The library provides the tools to create synthetic data that follows the data types and constraints of your original data, so you can work without handling the sensitive records themselves.
# Install from PyPI (Recommended)
pip install synthetic-generator
# Install from GitHub (Development)
git clone https://github.com/nhatkhangcs/synthetic_generator.git
cd synthetic_generator
pip install -e .

# From a built-in template
synthetic-generator generate --template customer_data --rows 10000 --out customers.parquet
# From your real data (fit then sample)
synthetic-generator generate --in real.csv --rows 5000 --out synthetic.csv

from synthetic_generator.quick import dataset, fit
import pandas as pd
# 1) From a template
df = dataset(template="customer_data", rows=1000, seed=42)
# 2) From your data (fit then sample)
# Create sample data or load from file
sample_data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000]
})
model = fit(sample_data)
df2 = model.sample(500, seed=123)

from synthetic_generator import load_template, generate_data
# Load a pre-built template
schema = load_template("customer_data")
# Generate data
data = generate_data(schema, n_samples=500, seed=123)
print(data.head())

import pandas as pd
from synthetic_generator import infer_schema, generate_data
# Create sample data (or load from file)
existing_data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000],
    'department': ['IT', 'HR', 'Sales', 'IT', 'HR']
})
# Infer schema
schema = infer_schema(existing_data)
# Generate new data based on inferred schema
new_data = generate_data(schema, n_samples=1000, seed=456)

Synthetic Generator supports various data types:
- Numeric: INTEGER, FLOAT
- Text: STRING, EMAIL, PHONE, ADDRESS, NAME
- Categorical: CATEGORICAL, BOOLEAN
- Temporal: DATE, DATETIME
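To make the type families above more concrete, here is a minimal sketch of how the columns of a pandas DataFrame could be mapped onto some of those type names. It uses only pandas; the helper `guess_type_name` is hypothetical and is not the library's `infer_schema`, which may apply different rules.

```python
import pandas as pd
from pandas.api import types as ptypes


def guess_type_name(series: pd.Series) -> str:
    """Map a pandas Series onto one of the type names listed above (illustrative only)."""
    # Check bool first so boolean columns are never reported as numeric.
    if ptypes.is_bool_dtype(series):
        return "BOOLEAN"
    if ptypes.is_integer_dtype(series):
        return "INTEGER"
    if ptypes.is_float_dtype(series):
        return "FLOAT"
    if ptypes.is_datetime64_any_dtype(series):
        return "DATETIME"
    # Everything else (typically text columns) falls back to STRING.
    return "STRING"


frame = pd.DataFrame({
    "age": [25, 30, 35],
    "salary": [50000.0, 60000.5, 70000.0],
    "active": [True, False, True],
    "department": ["IT", "HR", "Sales"],
    "hired": pd.to_datetime(["2020-01-01", "2021-06-15", "2022-03-30"]),
})

guessed = {column: guess_type_name(frame[column]) for column in frame.columns}
print(guessed)
```

Richer types such as EMAIL or PHONE would need pattern checks on the string values, which this sketch leaves out.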
Available statistical distributions:
- Continuous: NORMAL, UNIFORM, EXPONENTIAL, GAMMA, BETA, WEIBULL
- Discrete: POISSON, BINOMIAL, GEOMETRIC
- Categorical: CATEGORICAL, CONSTANT
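As a rough picture of what seeded sampling from these distributions looks like, the sketch below draws a few columns with NumPy's `Generator` API. It is an illustration under stated assumptions, not the library's internals: the function `sample_columns` and all parameter values are hypothetical.

```python
import numpy as np


def sample_columns(n_rows: int, seed: int) -> dict:
    """Draw a few columns from named distributions using one seeded generator."""
    rng = np.random.default_rng(seed)
    return {
        "normal": rng.normal(loc=50000, scale=15000, size=n_rows),
        "uniform": rng.uniform(low=0, high=10000, size=n_rows),
        "exponential": rng.exponential(scale=2.0, size=n_rows),
        "poisson": rng.poisson(lam=3.0, size=n_rows),
        "binomial": rng.binomial(n=10, p=0.3, size=n_rows),
    }


first = sample_columns(1000, seed=42)
second = sample_columns(1000, seed=42)

# Same seed, same draws: this is what makes a run reproducible.
same = all(np.array_equal(first[name], second[name]) for name in first)
print("identical across runs:", same)
```

Creating the generator inside the function, rather than relying on global random state, keeps each call reproducible regardless of what else has consumed random numbers.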
Define relationships between variables:
schema = DataSchema(
    columns=[...],
    correlations={
        "height": {"weight": 0.7},  # Height and weight correlation
        "age": {"income": 0.4}      # Age and income correlation
    }
)

Apply various constraints to your data:
ColumnSchema(
    name="salary",
    data_type=DataType.FLOAT,
    distribution=DistributionType.NORMAL,
    parameters={"mean": 50000, "std": 15000},
    min_value=30000,       # Minimum value
    max_value=100000,      # Maximum value
    unique=True,           # Unique values
    nullable=True,         # Allow null values
    null_probability=0.05  # 5% null probability
)

Generate data based on other columns:
ColumnSchema(
    name="bonus",
    data_type=DataType.FLOAT,
    distribution=DistributionType.UNIFORM,
    parameters={"low": 0, "high": 10000},
    depends_on=["salary"],
    conditional_rules={
        "rules": [
            {
                "condition": {"salary": {"operator": ">", "value": 70000}},
                "value": 5000
            }
        ],
        "default": 1000
    }
)

Generate realistic customer profiles with demographics, contact information, and preferences.
Create synthetic patient data with health metrics, demographics, and medical conditions.
Generate transaction data with realistic amounts, categories, and temporal patterns.
Create order and product data with realistic relationships and business rules.
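The "realistic relationships" in these use cases largely come down to columns that move together, such as the height and weight pair with a correlation of 0.7 in the schema example earlier. A minimal sketch of producing such a pair with NumPy follows. It assumes normally distributed columns and draws from a bivariate normal; this is a generic technique, and the library may implement correlations differently. The means and standard deviations are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
target = 0.7  # desired correlation between the two columns

# Covariance of two standard-normal variables with the target correlation.
cov = np.array([[1.0, target], [target, 1.0]])
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)

# Shifting and scaling each column does not change their correlation.
height = 170 + 10 * z[:, 0]  # hypothetical mean and spread, in cm
weight = 70 + 12 * z[:, 1]   # hypothetical mean and spread, in kg

observed = np.corrcoef(height, weight)[0, 1]
print(f"observed correlation: {observed:.3f}")
```

The observed value will be close to the target but not identical, because it is estimated from a finite sample.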
You can install and run the web UI if needed:
pip install synthetic-generator[web]
synthetic-generator web # http://localhost:8000

Web UI tips (v0.0.7+):
- Templates: clicking "Use Template" navigates to the Generator and auto-populates columns and parameters.
- Export: after generating data, export directly from the Generator page via the built-in Export panel (CSV, JSON, Excel, Parquet). There is no separate Export page.
- Schema Inference: Only infers data types and constraints, not distributions. Users can manually specify distributions in the Generator.
- Null Probability: Fixed issue where 100% null probability wasn't being applied correctly.
- JSON Serialization: Fixed NaN values in generated data to properly serialize as null in JSON.
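The last two tips concern edge cases that are easy to reproduce by hand. The sketch below shows one way a null probability could be applied to a column and how pandas serializes the resulting NaN values as JSON `null`. The helper `apply_null_probability` is hypothetical and is not the library's code.

```python
import json

import numpy as np
import pandas as pd


def apply_null_probability(values: pd.Series, probability: float, seed: int) -> pd.Series:
    """Replace each value with NaN independently with the given probability."""
    rng = np.random.default_rng(seed)
    # rng.random() returns floats in [0, 1), so probability=1.0 masks every row.
    mask = rng.random(len(values)) < probability
    return values.astype(float).mask(mask)


salary = pd.Series([50000, 60000, 70000, 80000], name="salary")
all_null = apply_null_probability(salary, probability=1.0, seed=1)

# pandas writes NaN as JSON null, so the output stays valid JSON.
payload = all_null.to_frame().to_json(orient="records")
records = json.loads(payload)
print(records)
```

Using a strict `<` comparison against values drawn from `[0, 1)` means a probability of 0 never nulls a row and a probability of 1 always does.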
# Generate data with custom parameters
from synthetic_generator import load_template, generate_data
schema = load_template("customer_data")
data = generate_data(schema, n_samples=1000, seed=42)

from synthetic_generator import validate_data
# Validate generated data
results = validate_data(data, schema)
print(f"Valid: {results['valid']}")
print(f"Errors: {results['errors']}")
print(f"Warnings: {results['warnings']}")

from synthetic_generator.export import export_data
# Export to various formats
export_data(data, 'csv', filepath='data.csv')
export_data(data, 'json', filepath='data.json')
export_data(data, 'excel', filepath='data.xlsx')
export_data(data, 'parquet', filepath='data.parquet')

- customer_data: Customer information with demographics
- ecommerce_data: E-commerce transaction data
- medical_data: Medical patient data with health metrics
- financial_data: Financial transaction data
- PyPI: https://pypi.org/project/synthetic-generator/
- Version: 0.0.8
- Python: 3.8+
- Dependencies: pandas, pydantic, numpy, scipy
git clone https://github.com/nhatkhangcs/synthetic_generator.git
cd synthetic_generator
make install_dev

# Run the test suite
make test

# Run the basic usage example
python examples/basic_usage.py

We welcome contributions! Please see our Contributing Guidelines for details.
git clone https://github.com/nhatkhangcs/synthetic_generator.git
cd synthetic_generator
make install_dev

Synthetic Generator is released under the MIT License. See LICENSE.txt for details.
For a quick start guide, see QUICKSTART.md.
For detailed examples, check the examples/ directory.
Vo Hoang Nhat Khang
Maintainer & Developer
Synthetic Generator - Python Package
Contact via:
- Email: [email protected]
- GitHub: nhatkhangcs
- PyPI: synthetic-generator
Thanks to all contributors and the open-source community for making this project possible.
Happy coding with Synthetic Generator!


