In the second section of this module, we look at the quotas to consider when running Dataflow. Let’s get started!

One of the quotas that Dataflow consumes is CPU. CPU quota is the total number of virtual CPUs across all of your VM instances in a region or zone. Any Google Cloud product that creates a Compute Engine VM, such as Dataproc, GKE, or AI Notebooks, consumes this quota. CPU quota can be viewed in the UI on the IAM Quota page. For example, right now, I am consuming 219 CPUs in the northamerica-northeast1 region. Say you want to start a Dataflow job with 100 workers. If the VM size selected is n1-standard-1, meaning 1 CPU core per VM, the CPU usage will be 100. If the VM size selected is n1-standard-8, that would mean 800 CPUs are needed. If the limit is 600, the job will display an error because the CPU limit has been exceeded.

Another quota to consider is the number of in-use IP addresses in each region. The in-use IP address quota limits the number of VMs that can be launched with an external IP address in each region of your project. Like the CPU quota, this quota is shared across all Google Cloud products that create VMs with an external IP address. When you launch a Dataflow job, the default setting is for the VMs to launch with an external IP address. Jobs that access APIs and services outside Google Cloud require internet access. However, if your job does not need to access any external APIs or services, you can launch the Dataflow job using internal IPs only, which saves money and conserves the in-use IP address quota. In our next module, we will show you how to launch VMs with internal IPs only. Unlike the CPU quota, the in-use IP address quota is independent of the machine type; there is no difference between launching 150 n1-standard-1s vs. 150 n1-standard-8s. In the slide image here, the in-use IP address limit for a few regions is 575. In the previous slide for CPU quota, the maximum number of CPUs per region was 600. When you launch a Dataflow job, the more restrictive quota takes precedence.

Let us look at quotas for persistent disks. You can choose between two different types of persistent disks when running Dataflow jobs: you can launch jobs with either legacy Hard Disk Drives or modern Solid State Drives. Each disk type has a per-region limit. For example, in the image shown here, Google Cloud products in my project that use HDDs in northamerica-northeast1 are consuming 23.5 TB of disk space out of the available 102.4 TB. To specify the disk type, set the worker_disk_type flag to the prefix shown in the image and end it with either pd-ssd or pd-standard. Use pd-standard for HDD and pd-ssd for SSD. In the slide example, we set the disk type to SSD using both Python and Java.

When you launch a batch pipeline, the ratio of VMs to PDs is 1:1; for each VM, only one PD is attached. For jobs running Shuffle on the worker VMs, the default size of each persistent disk is 250 GB. If the batch job is running using the Shuffle Service, the default PD size is 25 GB. Recall that Dataflow Shuffle moves the shuffle operation out of the worker VMs and into the Dataflow service backend, which is why the default persistent disk attached to each VM is smaller. Note that you can use the disk_size_gb flag to override the default for batch pipelines, whether they run shuffle on the worker VMs or use Dataflow Shuffle.
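To make these flags concrete, here is a minimal sketch using the Apache Beam Python SDK of launching a batch job with internal IPs only, SSD persistent disks, and an overridden disk size. The project ID, region, zone, and bucket are placeholder values, and the worker_disk_type path follows the prefix format described above; adapt all of them for your own project.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, and bucket values -- substitute your own.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                # hypothetical project ID
    "--region=northamerica-northeast1",
    "--temp_location=gs://my-bucket/tmp",  # hypothetical bucket
    "--no_use_public_ips",                 # internal IPs only: conserves in-use IP quota
    # Disk type: the prefix shown on the slide, ending in pd-ssd (SSD) or pd-standard (HDD).
    "--worker_disk_type=compute.googleapis.com/projects/my-project/"
    "zones/northamerica-northeast1-a/diskTypes/pd-ssd",
    "--disk_size_gb=50",                   # override the default persistent disk size
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | beam.Create(["hello", "dataflow"])
     | beam.Map(print))
```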
Streaming pipelines, however, are deployed with a fixed pool of persistent disks. Each worker must have at least 1 persistent disk attached to it, while the maximum is 15 persistent disks per worker instance. As with batch jobs, streaming jobs can run either on the worker VMs or on the Dataflow service backend. When you run a job on the service backend, the feature used is Dataflow's Streaming Engine. Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend. For jobs launched to execute on the worker VMs, the default persistent disk size is 400 GB. Jobs launched using Streaming Engine have a persistent disk size of 30 GB. Just like with batch pipelines, these defaults can be overridden using the disk_size_gb flag. It is important to note that the amount of disk allocated to a streaming pipeline is equal to the value of the --max_num_workers flag. For example, if you launch a job with 3 workers initially and set the maximum number of workers to 25, then 25 disks will count against your quota, not 3. To set the maximum number of workers that a pipeline can use, use the --max_num_workers flag; this value cannot exceed 1,000. When you launch a streaming job that does not use Streaming Engine, the --max_num_workers flag is required. For streaming jobs that do use Streaming Engine, the --max_num_workers flag is optional; the default is 100.
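As a sketch of the streaming case, again assuming the Apache Beam Python SDK with placeholder project, region, and bucket names, the options below launch a streaming job on Streaming Engine with 3 initial workers and a cap of 25, so 25 persistent disks count against your quota:

```python
from apache_beam.options.pipeline_options import PipelineOptions

streaming_options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                # hypothetical project ID
    "--region=northamerica-northeast1",
    "--temp_location=gs://my-bucket/tmp",  # hypothetical bucket
    "--streaming",                         # run as a streaming pipeline
    "--enable_streaming_engine",           # execution moves to the service backend; 30 GB PDs
    "--num_workers=3",                     # initial worker count
    "--max_num_workers=25",                # 25 disks count against quota, not 3
])
```

Note that if you omit --enable_streaming_engine here, --max_num_workers becomes required rather than optional.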