It was an interesting week for me, dealing with 9 terabytes of VHDs to upload to Azure. To be honest, I was surprised by how long it took, because all the calculations we had made to predict the total upload time turned out to be wrong. How and why?
To upload VHDs to Azure, I used the Azure PowerShell cmdlet Add-AzureVHD. The cmdlet ships with the Azure PowerShell module: Download HERE. Install the module on the machine from which you will start the upload.
The aim of this post is not to show how to use the Add-AzureVHD command, but to give you hints to get the best out of it.
The upload process
When you upload a VHD to Azure using Add-AzureVHD, the following steps are performed:
Step 1: Hash calculation: An MD5 hash is calculated over the VHD. This step can't be skipped or avoided. Its purpose is to make it possible to check the VHD integrity after the upload to Azure.
Step 2: Blob creation in Azure: A page blob of the same size as the VHD is created in Azure.
Step 3: Empty data blocks detection: (fixed-size VHDs only) The process looks for empty data blocks so that blank blocks are not copied to Azure.
Step 4: Upload: The data is uploaded to Azure.
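To see why Step 1 costs time, here is a minimal Python sketch of a streamed MD5 calculation. It is only an illustration and not the code Add-AzureVHD runs internally; the function name md5_of_file and the 4 MB chunk size are my own assumptions.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=4 * 1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Every byte of the file has to pass through the hash, so the run
        # time grows with the file size and depends on disk read throughput.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Small stand-in file; a real VHD would be far larger.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * 1024 + b"data")
        sample_path = tmp.name
    print(md5_of_file(sample_path))
    os.remove(sample_path)
```

The whole file is read once from start to end, which is why the file size and the disk speed matter so much in the next section.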
How to optimize
Step 1: Hash calculation
The hash calculation time depends on three factors: disk speed, VHD size and processor speed. Let's optimize each factor:
- Disk speed: The higher the read throughput, the faster the hash calculation. If your VHD sits on a SATA disk with 60 MB/s read throughput, the hash calculation will run at about 480 Mbit/s. For a 500 GB VHD, the hash calculation will therefore need more than two hours. Place your VHD on fast disks to save a significant amount of time.
- VHD size: The larger your VHD, the longer the hash calculation takes. The question is whether we can optimize this, and the answer lies in using dynamic VHDs. A dynamic VHD file is only about as large as the data it contains. Imagine a 500 GB fixed VHD containing just 100 GB of data: the hash calculation wastes time going through 400 GB of blank blocks. In addition, you can compact your dynamic VHDs before uploading them to Azure, which can reduce the VHD file size further. Note that the blob created in Azure during the upload will be as large as the VHD for fixed-size VHDs, and as large as the maximum size for dynamic VHDs. You need not worry about this when compacting your dynamic VHD, because you can expand the VHD later in Azure if you want a larger size.
- Processor speed: The hash calculation is a mathematical operation, so the faster the processor, the faster the calculation. However, today's processors are fast enough to handle such operations, and the bottleneck is the disk read throughput, unless you are using an old 1 GHz dual-core processor to hash a VHD located on RAID 10 SSD drives behind a 10 Gbit/s FC SAN. You can watch the processor usage in Task Manager during a hash calculation.
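The arithmetic behind these figures can be checked with a short Python sketch. It assumes decimal units (1 GB = 1000 MB) and that the disk, not the processor, is the bottleneck; the function names are mine.

```python
def mb_per_s_to_mbit_per_s(mb_per_s):
    """Convert megabytes per second to megabits per second."""
    return mb_per_s * 8


def hash_time_hours(vhd_size_gb, read_mb_per_s):
    """Estimate the hours needed to read (and therefore hash) a VHD once.

    Assumes 1 GB = 1000 MB and that disk reads are the limiting factor.
    """
    seconds = (vhd_size_gb * 1000) / read_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    print(mb_per_s_to_mbit_per_s(60))          # 480
    print(round(hash_time_hours(500, 60), 2))  # 2.31 (500 GB fixed VHD)
    print(round(hash_time_hours(100, 60), 2))  # 0.46 (100 GB dynamic VHD)
```

With the same 60 MB/s disk, hashing only the 100 GB of real data takes under half an hour instead of more than two hours.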
Step 2: Blob creation in Azure
In this step, a blob of the same size as the VHD is allocated in Azure. There is nothing to optimize here.
Step 3: Empty data blocks detection
This step is only performed if the VHD to be uploaded is a fixed-size VHD. The Azure command scans the VHD to look for empty data blocks. I really like this step because it can save an enormous amount of upload time. Imagine that you want to upload a 500 GB fixed-size VHD of which only 100 GB are really used: empty data blocks detection cuts the amount of data to upload to about one fifth. On the other hand, this step is time consuming, because the whole VHD is processed to look for empty data. For example, processing a 500 GB VHD will take more than one hour. This is why, again, uploading a dynamic VHD is more advantageous: there is no empty data to scan.
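The idea of the scan can be illustrated with a small Python sketch. This is not the algorithm Add-AzureVHD actually uses; the function name count_blocks and the 2 MB block size are assumptions for the example.

```python
import os
import tempfile


def count_blocks(path, block_size=2 * 1024 * 1024):
    """Return (total_blocks, non_empty_blocks) for a file read block by block."""
    total = 0
    non_empty = 0
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            total += 1
            # A block is empty when every byte in it is zero.
            if block.count(b"\0") != len(block):
                non_empty += 1
    return total, non_empty


if __name__ == "__main__":
    block = 1024
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        # Five blocks, only the second one holds data.
        tmp.write(b"\0" * block + b"x" * block + b"\0" * (3 * block))
        sample_path = tmp.name
    total, used = count_blocks(sample_path, block_size=block)
    print(f"{used} of {total} blocks need uploading")
    os.remove(sample_path)
```

Note that the scan still reads every block of the file, empty or not, which is where the extra processing time comes from.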
Step 4: Upload
This is the final step: the data is uploaded to Azure. The only optimization is to have a fast internet connection (a fast upload link).
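To get a feel for the impact of the link speed, here is a rough Python estimate. The 100 Mbit/s link and the efficiency factor are hypothetical values chosen for the example, not measurements from my upload, and the sketch assumes 1 GB = 1000 MB.

```python
def upload_time_hours(data_gb, link_mbit_per_s, efficiency=1.0):
    """Estimate the hours needed to push data_gb over an upload link.

    efficiency is a hypothetical factor between 0 and 1 for protocol
    overhead and contention; 1.0 means the full nominal rate is reached.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be greater than 0 and at most 1")
    megabits = data_gb * 1000 * 8
    seconds = megabits / (link_mbit_per_s * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # 100 GB of real data over an assumed 100 Mbit/s upload link.
    print(round(upload_time_hours(100, 100), 2))   # 2.22
    # 9 TB over the same assumed link, if every byte had to be sent.
    print(round(upload_time_hours(9000, 100), 2))  # 200.0
```

Real links rarely sustain their nominal rate, so treat these numbers as a lower bound.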
Lessons from experience:
- The VHDs to be uploaded should be dynamically expanding VHDs.
- If your VHDs are fixed size, convert them before uploading: you will save significant time on the hash calculation and the empty data blocks detection.
- If your VHDs are already dynamic, try to compact them to the minimal size (this shortens the hash calculation). You can then expand them to the desired size in Azure.