Dealing with velocity, volume, and variety
Almost any tutorial on how to process data begins with a quick introduction to the three Vs (velocity, volume, and variety). These are the three dimensions along which the complexity of data scales. Each of them presents its own distinct problem, and much of the data you will have to deal with is a combination of all three. Velocity is the speed at which data arrives over a period of time, volume is the amount of data, and variety is the diversity of the data being presented.
So, this section is divided according to the three Vs, and each subsection presents a solution to a common problem that arises with that V. This way, you will get to see how Python can help in dealing with such massive amounts of data. Let’s start with volume, as it is the simplest and probably the first thing that comes to mind when people think of big data.
Volume
The volume of data is a pretty simple concept: it is the sheer quantity of data, most, if not all, of which will be of the same type. Dealing with a large volume of data requires understanding how time-sensitive the data is as well as the resources we have on hand. How the data is processed also depends on whether the dataset is massive in width or in length (i.e., whether each row has a large number of fields or there is a massive number of rows). These two cases require different solutions, sometimes even specialized databases. There is also the possibility of datasets not being numbers and letters at all but instead being audio or video files. In this section, we will work through an example that is very useful when we have a database or data file that contains a large number of fields/columns.
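As a quick illustration of the wide-data case, here is a minimal sketch of reading only the columns you actually need with csv.DictReader, so the rest of a very wide row is never carried around. The file name and column names here are hypothetical placeholders, not part of the example we build later:

import csv

# Hypothetical example: keep only a handful of the many columns in a wide CSV
WANTED = {"id", "email", "country"}

def read_selected_columns(file_path, wanted=WANTED):
    with open(file_path, "r", newline="") as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            # Build a slimmed-down dict containing only the wanted fields
            yield {key: value for key, value in row.items() if key in wanted}

for slim_row in read_selected_columns("wide_data.csv"):
    print(slim_row)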
To start, we will need a high-volume dataset, so we will use an app called Mockaroo, which allows you to generate data fields and sample data using generative AI (very fitting in this chapter). Let’s go to the Mockaroo site and generate a few fields for our sample data:

Figure 11.2 – Mockaroo schema
The dataset we produced with Mockaroo looks like the following:

Figure 11.3 – Sample CSV created by Mockaroo
The preceding figure shows just a small piece of the file; in full, it has 20 very large fields for 1,000 rows. Let’s write a script to parse through it:
import csv

def read_large_csv(file_path):
    with open(file_path, 'r') as csv_file:
        csv_reader = csv.reader(csv_file)
        # Skip the header row
        next(csv_reader, None)
        # Yield one row at a time instead of loading the whole file
        for row in csv_reader:
            yield row

csv_file_path = 'MOCK_DATA.csv'
for row in read_large_csv(csv_file_path):
    print(row)
The script may seem a little redundant in terms of reading the CSV file, but it is written this way so that all of the rows in the CSV are not loaded into memory at the same time. Because read_large_csv is a generator, it reads one row, yields it, and lets that row be released from memory before the next one is read. This keeps the memory footprint low and is a great way to read large amounts of data on a system whose memory cannot hold the whole dataset, which in turn makes the reading a lot faster and smoother, as demonstrated in this diagram:

Figure 11.4 – Workflow behind a generator
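To see the benefit in practice, here is a small sketch of consuming the same generator lazily in fixed-size batches, so memory usage stays flat no matter how long the file is. It reuses read_large_csv from the script above; the batch size and the placeholder processing step are assumptions for illustration only:

from itertools import islice

def process_in_batches(file_path, batch_size=100):
    rows = read_large_csv(file_path)  # generator from the script above
    while True:
        # islice pulls at most batch_size rows from the generator;
        # nothing beyond the current batch is held in memory
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        # Placeholder processing step: here we just report the batch size
        print(f"Processed a batch of {len(batch)} rows")

process_in_batches('MOCK_DATA.csv')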
Now, that was simple enough, but what happens when the data arrives one row at a time, constantly, as with streaming data? All of it needs to be processed live as it comes in. How would we achieve this? Let’s find out.