When you first start learning data science, your biggest concern is often getting access to raw data. Although many great real-life datasets are available online for machine learning, I have found that the same is not true when it comes to learning SQL.
A basic knowledge of SQL is essential for data science. However, it is much easier to find toy datasets on Kaggle than to access a large database with real data (such as name, address, credit card, Social Security number, birthday, and so on) that has been specifically created or curated for machine learning tasks.
It would be wonderful to have a tool or library that could create large databases with multiple tables and data of your choice.
Beyond newcomers to data science, even seasoned software testers may find it helpful to have a simple tool that generates large datasets with random (fake) entries.
This is why I’m happy to present a lightweight Python library called pydbgen. This article briefly describes the package. You can also read the docs for more information.
What is pydbgen exactly?
Pydbgen is a lightweight, pure-Python library that generates random useful entries (e.g., name, address, credit card number, date, time, company name, job title, license plate number). You can save them in a Pandas dataframe object, as a SQLite table in a database file, or in a Microsoft Excel file.
How to install pydbgen
The current version (1.0.5) is available on PyPI (the Python Package Index repository). You must have Faker installed for it to work. To install pydbgen, enter pip install pydbgen at the command line.
It has been tested with Python 3.6 and won’t work with Python 2 installations.
How to use it
To start using Pydbgen, initiate a pydb object.
You can then call the various generator functions that the pydb object exposes.
Some generators come in two variants: if you call the plain variant rather than the one marked real, it returns fictitious names instead of real ones.
Create a Pandas dataframe using random entries
You can choose how many entries and which data types are generated. Note that everything is returned as strings/text.
The resultant dataframe looks something like the image below.
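The idea of picking a row count and a list of field types can be sketched with the standard library alone. This is only an illustration and not pydbgen's implementation: the function name gen_records, the field names, and the small value pools are all assumptions made up for this sketch. As in the description above, every value comes back as a string.

```python
import random
from datetime import date, timedelta

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical field generators (not pydbgen's); each one returns a string.
GENERATORS = {
    "name": lambda: rng.choice(["Ada Lovelace", "Grace Hopper", "Alan Turing"]),
    "city": lambda: rng.choice(["Springfield", "Riverton", "Lakeside"]),
    "date": lambda: (date(2000, 1, 1) + timedelta(days=rng.randint(0, 7000))).isoformat(),
}


def gen_records(num, fields):
    """Return `num` rows, each a dict with one string value per requested field."""
    unknown = [f for f in fields if f not in GENERATORS]
    if unknown:
        raise ValueError(f"unsupported fields: {unknown}")
    return [{f: GENERATORS[f]() for f in fields} for _ in range(num)]


records = gen_records(5, ["name", "city", "date"])
for row in records:
    print(row)
```

If Pandas is installed, passing a list of dicts like this to pandas.DataFrame produces a dataframe with one column per field.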
Create a database table
You can choose how many entries and which data types are generated. All data is returned in text/VARCHAR format. You can specify both the database filename and the table name.
This creates a .db file that can be used with MySQL or SQLite database servers. The image shows the resulting SQLite table opened in DB Browser.
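To show what such a table looks like at the SQL level, here is a minimal sketch that uses Python's built-in sqlite3 module rather than pydbgen. The helper name make_people_table, the table name People, and the value pools are assumptions chosen for illustration; pydbgen itself relies on Faker for its values.

```python
import random
import sqlite3

# Hypothetical value pools, made up for this sketch.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Guido"]
LAST_NAMES = ["Lovelace", "Hopper", "Torvalds", "Rossum"]
CITIES = ["Springfield", "Riverton", "Lakeside", "Fairview"]


def make_people_table(conn, table_name="People", num=10, seed=None):
    """Create a table of random name/city rows, with every column declared VARCHAR."""
    if not table_name.isidentifier():
        raise ValueError("table_name must be a simple identifier")
    rng = random.Random(seed)
    conn.execute(f"CREATE TABLE {table_name} (name VARCHAR, city VARCHAR)")
    rows = [
        (f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}", rng.choice(CITIES))
        for _ in range(num)
    ]
    # Parameterized inserts keep the generated values out of the SQL text.
    conn.executemany(f"INSERT INTO {table_name} VALUES (?, ?)", rows)
    conn.commit()
    return rows


# ":memory:" keeps the sketch self-contained; pass a filename to write a .db file.
conn = sqlite3.connect(":memory:")
inserted = make_people_table(conn, num=5, seed=42)
stored = conn.execute("SELECT name, city FROM People").fetchall()
print(stored)
```

Replacing ":memory:" with a path such as a .db filename writes the table to disk, where a tool like DB Browser can open it.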
Create an Excel file
An Excel file with random data can be generated in the same way as the examples above. Setting phone_simple to False makes the generator create long-form, complex phone numbers. This is useful if you need to test more complicated data-extraction code.
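Separately from the Excel export, the difference between simple and long-form phone numbers, and why the latter are handy for testing extraction code, can be sketched with the standard library. The formats below are assumptions made up for this illustration, not the formats pydbgen actually emits, and random_phone and extract_digits are hypothetical helper names.

```python
import random
import re


def random_phone(rng, simple=True):
    """Return a random US-style phone number as a string."""
    area = rng.randint(200, 999)
    exchange = rng.randint(200, 999)
    line = f"{rng.randint(0, 9999):04d}"
    if simple:
        return f"{area}-{exchange}-{line}"
    # Long-form variants: country code, parentheses, dots, optional extension.
    template = rng.choice([
        "+1 ({a}) {e}-{l}",
        "({a}){e}-{l}",
        "{a}.{e}.{l}",
        "1-{a}-{e}-{l} x{x}",
    ])
    return template.format(a=area, e=exchange, l=line, x=rng.randint(10, 9999))


def extract_digits(phone):
    """Pull the 10-digit national number out of any of the formats above."""
    main = re.split(r"\s*x", phone)[0]  # drop an extension, if present
    digits = re.sub(r"\D", "", main)    # keep digits only
    return digits[-10:]                 # drop a leading country code


rng = random.Random(0)
simple_numbers = [random_phone(rng, simple=True) for _ in range(3)]
complex_numbers = [random_phone(rng, simple=False) for _ in range(3)]
for p in simple_numbers + complex_numbers:
    print(p, "->", extract_digits(p))
```

An extraction routine that only handles the simple dash-separated form would fail on the long-form variants, which is the kind of gap that messy test data exposes.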
Generate random email IDs for scrap use
realistic_email is a built-in method of pydbgen that generates random email IDs based on a seed name. This is useful if you don’t want to expose your actual email address on the internet but want something that looks similar to it.
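One way to derive email IDs from a seed name is sketched below. It is an assumption about how such a generator could work, not the logic of pydbgen's realistic_email; the function email_from_name and the domain pool are hypothetical.

```python
import random

# Hypothetical domain pool, chosen for illustration only.
DOMAINS = ["example.com", "example.org", "example.net"]


def email_from_name(name, rng):
    """Build a plausible email ID from a seed name such as 'Jane Doe'."""
    parts = [p.lower() for p in name.split() if p.isalpha()]
    if not parts:
        raise ValueError("name must contain at least one alphabetic word")
    first, last = parts[0], parts[-1]
    candidates = [
        f"{first}.{last}",
        f"{first[0]}{last}",
        f"{first}_{last}",
        f"{last}{first[0]}",
    ]
    local = rng.choice(candidates)
    # Roughly half the time, append a short number as many real addresses do.
    if rng.random() < 0.5:
        local += str(rng.randint(1, 99))
    return f"{local}@{rng.choice(DOMAINS)}"


rng = random.Random(7)
emails = [email_from_name("Jane Doe", rng) for _ in range(5)]
for e in emails:
    print(e)
```

Each result keeps a recognizable trace of the seed name while varying the separator, the numeric suffix, and the domain.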
Future improvements and user contributions
The current version may have many bugs. If you notice any and your program crashes during execution (other than a crash caused by your own incorrect input), please let me know. If you have a great idea and want to contribute to the source code, please visit the GitHub repo. A few questions readily come to mind:
- Is it possible to combine some statistical modeling/machine learning with the random data generator?
- Is it possible to add a visualization function to the generator?