IBM Synthetic Data Sets

Redguide

Last updated on 28 January 2025

PDF (2.8 MB)

IBM Form #: REDP-5748-00


Authors: Erik Altman, Dipali Aphale, Joy Deng, Yadu Nandan B, Saurabh Srivastava and Kelly Xiang

    Abstract

    IBM® Synthetic Data Sets is a family of artificially generated, enterprise-grade data sets designed to enhance the training of predictive artificial intelligence (AI) models and large language models (LLMs) for IBM Z® and IBM LinuxONE clients, ecosystem partners, and independent software vendors. These pre-built data sets are downloadable and packaged as comma-separated values (CSV) and data definition language (DDL) files, which makes them familiar to use and compatible with everything from databases and spreadsheets to hardware platforms and standard AI tools. The data sets draw on IBM industry expertise and domain knowledge of the financial services sector without using any real client seed data, which alleviates security concerns about personally identifiable information (PII). Real data at client sites is often limited in scope to the organization's own transactions, and clients do not always know which transactions are fraudulent. To address this, IBM Synthetic Data Sets were curated for fraud detection use cases, so that clients can download them to develop predictive AI models and LLMs for financial services, or to optimize existing models for improved accuracy and risk mitigation.

    The IBM Synthetic Data Sets family contains:

    * IBM Synthetic Data Sets for Payment Cards

    * IBM Synthetic Data Sets for Core Banking and Money Laundering

    * IBM Synthetic Data Sets for Homeowners Insurance

    This IBM Redbooks® publication introduces IBM Synthetic Data Sets and explains how they can enhance and optimize the training of your predictive AI models and LLMs.

    Table of Contents

    Executive Overview

    Introduction to IBM Synthetic Data Sets

    Data set deep dive

    Available editions

    Previewing data schemas

    Using real data vs synthetic data

    Data generation methodology

    AI ethics

    Legal usage terms

    Getting started

    FAQs

    Additional resources

    Appendix: Data schemas for each of the IBM Synthetic Data Sets

     

    Special Notices

    The material included in this document is in DRAFT form and is provided 'as is' without warranty of any kind. IBM is not responsible for the accuracy or completeness of the material, and may update the document at any time. The final, published document may not include any, or all, of the material included herein. Client assumes all risks associated with Client's use of this document.