Microservices in the Cloud: Azure Service Fabric Mesh: The What and Why?

A few months ago, Microsoft announced Azure Service Fabric Mesh (SFM), a new, easy and scalable managed service for deploying your container-based microservices applications.

Many of you are skeptical about Microsoft, a feeling I have seen in many threads and discussions on the net. In this post, I will try to share my experience with this new service (even if it's in an early phase, public preview) in order to explain the What and the Why.

  • What is Service Fabric Mesh?
  • Why Service Fabric Mesh?

1- What is Service Fabric Mesh?

For people who are familiar with Service Fabric (or Azure Service Fabric), Service Fabric Mesh is a managed Service Fabric offer. It is supposed to deliver the same Azure Service Fabric features, but in a managed way: you don't create, manage or operate the underlying infrastructure; you only deploy your services directly to Service Fabric. Note that SFM does not always provide exactly the same features, but equivalent ones, and sometimes new ones (the feature timeline between the two offers is not always the same); the overall concept, however, is the same.

For people new to this area, SFM is a managed service (meaning the infrastructure is hidden from you and operated by Azure) where you deploy your services (to be more precise, your containers) in a very simple, beautiful and powerful way.

Here is the Microsoft overview article on Service Fabric Mesh, where more "commercial" wording is used; but to be honest, I believe in this service.

Take my previous blog post, where I explained the microservices architecture in simple words, using a well-known example: the HRA application.

HRA Application services

The application is composed of 4 services. If we map each service to a container, we obtain 4 containers. You will ask your developers to create 4 Docker images, one image per service, and to upload/push these images to a registry, let's say Azure Container Registry (an equivalent of Docker Hub).

4 Docker images pushed to Azure Container Registry

Then comes Azure Service Fabric Mesh: you just have to prepare an ARM JSON template (see Azure ARM templates) where you describe your application architecture and configuration: I want to deploy 4 services; each service uses image X; each service should be highly available with at least Y instances; each instance should have N CPU cores and M GB of RAM; each service listens internally on port A, B, C…; and only the Portal service is published to the internet on port 443.

–> Isn't this magic and beautiful? Believe me, the structure of the ARM template is very easy to understand, making DevOps very affordable.
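To give you an idea, here is a trimmed, hypothetical sketch of such a template for the Portal service (the exact property names of the Mesh preview schema may differ, and the registry, image and network names below are made up):

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "resources": [
    {
      "type": "Microsoft.ServiceFabricMesh/applications",
      "apiVersion": "2018-07-01-preview",
      "name": "hraApp",
      "location": "[resourceGroup().location]",
      "properties": {
        "services": [
          {
            "name": "portal",
            "properties": {
              "osType": "linux",
              "replicaCount": 2,
              "codePackages": [
                {
                  "name": "portal",
                  "image": "myregistry.azurecr.io/hra/portal:latest",
                  "resources": { "requests": { "cpu": 1, "memoryInGB": 2 } },
                  "endpoints": [ { "name": "portalListener", "port": 443 } ]
                }
              ],
              "networkRefs": [
                { "name": "[resourceId('Microsoft.ServiceFabricMesh/networks', 'hraNetwork')]" }
              ]
            }
          }
        ]
      }
    }
  ]
}
```

The other 3 services would simply be additional entries in the `services` array, with their own images, resources and internal ports.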

Once you deploy this ARM template, within less than 5 minutes, your application is ready to be used.

To conclude: Azure Service Fabric Mesh is the Azure service that combines:

  • The power of Azure Service Fabric
  • A managed service experience (PaaS)

2- Why Service Fabric Mesh?

Many of you will ask: why should I use SFM rather than another service like:

  • Kubernetes : Azure Kubernetes Service, which provides the power of the Cloud (Azure) and Kubernetes
  • Azure Container Service
  • Amazon ECS
  • Amazon EKS
  • Or create your own cluster on top of IaaS (like a swarm cluster)

Even though this is a matter of preference, here are my arguments for why I love SFM and why it's my preferred cloud container service (for microservices):

2.1- Fully managed

SFM is fully managed, which means you define only what matters to you: the containers. SFM manages all the underlying compute, storage and networking infrastructure. High availability is also provided by simply defining two or more instances for each container.

2.2- ARM resources

What I like in SFM is that its components are first-class ARM resources, which means you can manage, deploy and configure them using ARM calls (ARM templates, the API, Terraform in the future…), and benefit from RBAC. The following are examples of ARM resources:

  • Application : Your SFM application is an ARM resource
  • Service : Each “micro service” is an ARM resource
  • Network : The network in which your application is deployed is an ARM resource

2.3- Enterprise features

SFM is currently in preview, but according to many discussions and the roadmap, many enterprise features are being/will be added that make integrating our applications easier and more powerful:

  • HA volume: You will be able to attach a highly available, fast (SSD) volume to a set of containers. This allows simultaneous, concurrent access to a local and fast storage location for your services. You can share anything between your services thanks to this shared volume
  • Reliable Collections/Dictionaries: If you don't know what we mean by "Reliable", here is a short description: your services get access to a highly available, near-real-time synchronized collection/dictionary that your application and services can use to store and exchange values. This is very useful for stateful services, where each instance of your service should access an accurate and synced state.
    • Example: my service, which runs as 2 container instances (c1 and c2), must store the status of the user session during its lifetime (a token, for example). With a Reliable Dictionary, you can store the token in a key/value pair that your service can get/set anytime. See here for a more detailed video
  • Service communication and routing: It's already very easy for Service1 to communicate with Service2. No need to use an IP or a hostname: Service1 can reach Service2 just by sending requests to http://Service2:port/abcd (using the service name). In the future, SFM will support Envoy, which is a sort of reverse proxy for microservices, allowing your services to communicate only with it while it orchestrates the communications. It supports many features like routing, circuit breaking, authentication, transformation… This native support is a key reason why I like SFM
  • Rolling updates: When you have a new version of Service1, you can deploy it safely. SFM will replace your containers one by one to ensure your service is not affected
  • Auto-scale rules: In the future, SFM will support auto-scale rules in order to scale container instances automatically.
  • Scaling performance: I have seen SFM scale from 1 to 500 containers in less than 30s
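Reliable Collections are exposed to your services through a programming API (.NET today). Purely to illustrate the get/set session-token pattern from the example above, here is a local, in-memory stand-in; the `SessionStore` class and its method names are mine, not the real SDK's, and the real dictionary is replicated and highly available across instances, which a local dict is not:

```python
import threading

class SessionStore:
    """Local, thread-safe stand-in for the get/set pattern that a
    Reliable Dictionary offers (the real one is replicated across
    all instances of the service)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def set_token(self, session_id, token):
        # Any instance (c1 or c2) can write the session token...
        with self._lock:
            self._data[session_id] = token

    def get_token(self, session_id):
        # ...and any other instance reads the same, synced value.
        with self._lock:
            return self._data.get(session_id)

store = SessionStore()
store.set_token("user-42", "eyJhbGciOi...")
print(store.get_token("user-42"))
```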

And more is coming…

2.4- Pricing

NB : This depends on the pricing units, which have not yet been disclosed

The pricing of SFM is very simple: you are charged per CPU core and per fraction of RAM that you allocate to your containers.

For my HRA app, I can use the following configuration:

  • Portal : 1 core, 2 GB RAM
  • Service 1 : 0.5 cores, 1 GB RAM
  • Service 2 : 0.5 cores, 0.5 GB RAM
  • Service 3 : 2 cores, 4 GB RAM

I can then deploy X instances per service, and apply a scaling plan to scale in or out accordingly –> I pay just for what I consume, reducing the entry price.
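Since billing is per allocated core and GB, you can estimate your allocation footprint yourself. A small sketch, using the HRA resources above; the instance counts are hypothetical, and no real unit prices are applied since they weren't disclosed yet:

```python
# (service, cores, ram_gb, instances) — resources from the HRA
# configuration above; the instance counts are made up.
services = [
    ("Portal",    1.0, 2.0, 2),
    ("Service 1", 0.5, 1.0, 2),
    ("Service 2", 0.5, 0.5, 2),
    ("Service 3", 2.0, 4.0, 2),
]

# Total allocation = per-instance resources times instance count.
total_cores = sum(cores * n for _, cores, _, n in services)
total_ram   = sum(ram * n for _, _, ram, n in services)

print(f"Allocated: {total_cores} cores, {total_ram} GB RAM")
```

Multiply those totals by the (still undisclosed) per-core and per-GB rates to get the bill.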

3- Conclusion

SFM is a new player (still a baby) in the Container-as-a-Service market, but what differentiates it from other solutions is that it is fully managed, and that the management experience is unique (ARM). I have used AKS, and I really like it and would recommend it alongside SFM (even though AKS still lacks some management features and is not a fully managed service).


The oldest “bug” for Windows domain joined computers


When you have a computer joined to an Active Directory domain and you decide to leave the domain for a workgroup, you are "surprisingly" asked to enter credentials for an account with permissions to remove this computer from the domain. I know this is not always a bug, but in all my cases, I just wanted to leave the domain.

So what credentials do you need to provide? Anything will work: just type something and click OK.



Nice Saturday!

Azure Service Fabric news from //BUILD

Hi all,

Azure Service Fabric is one of my preferred Azure services today, as it embraces the microservices era. At //BUILD, a new SF offer was introduced: Azure Service Fabric Mesh, a fully managed and serverless SF offering. The following points summarize the session, which you can find here: https://www.youtube.com/watch?v=0ab2wIGMbpY

  • Deploying Service Fabric Clusters locally (Standalone)
    • Now you can use a JSON manifest file to describe the nodes to deploy to, the certificates to install, the configuration to set… This file is uploaded to Azure, and Azure provides the required packages to perform the installation
  • Single Portal experience
    • View and manage the Azure SF clusters and the Local Clusters via the Azure portal
  • Azure Service Fabric Mesh
    • A fully managed SF running on Azure. You no longer need to manage the SF infrastructure such as VMSS, LBs or anything else.
    • Supports containers
    • Creating Applications and Services can now be done via ARM templates, as Applications and Services are first-class ARM resources!! This is just amazing.
    • All the .yaml files that are used to define the Service Fabric resources are regrouped into a single JSON file
  • Secrets Store (Secrets Resource)
    • Built within SF, Applications and Services have a managed identity with AAD, and can access Azure Key Vault to get secrets and certificates. Secret and certificate rollover is supported too.
  • Volume Resource
    • A volume resource presented to the containers, two types :
      • Backed by Azure File Storage
      • Backed by SF Volume Disk (replicated local disks)
  • Diagnostic and monitoring
    • Containers write stdout/stderr to a volume, and Application Insights analyzes this data from that volume.
    • Azure Monitor for SF metrics
  • Reliable Collections
    • Enhancements and new features, the application and the reliable collection are now separated
  • Intelligent traffic routing
    • Great news on this front, as new routing features are added
    • Introduction of Envoy. This will simplify service-to-service communication

Service Fabric Mesh Pricing

This was not disclosed, waiting for it.

Understanding Azure CosmosDB Partitioning and Partition Key


Many of you have asked me about the real meaning of Cosmos DB partitions, the partition key, and how to choose a partition key if needed. This post is all about that.

1- Cosmos DB partitions: what are they?

The official Microsoft article explains partitions in CosmosDB well, but to simplify the picture:

  • When you create a container in a CosmosDB database (a Collection in the case of the SQL API), CosmosDB provisions capacity for that container
  • If the container capacity is more than 10GB, then CosmosDB requires an additional piece of information to create it: WHY?

When CosmosDB provisions a container, it reserves capacity over its compute and storage resources. These storage and compute resources are called Physical Partitions.

Within the physical partitions, CosmosDB uses Logical Partitions; the maximum size of a Logical Partition is 10GB.

You get it now: when the size of a container exceeds (or can exceed) 10GB, CosmosDB needs to spread the data over multiple Logical Partitions.

The following picture shows 2 collections (containers):

  • Collection 1 : The size is 10 GB, so CosmosDB can place all the documents within the same Logical Partition (Logical Partition 1)
  • Collection 2 : The size is unlimited (greater than 10 GB), so CosmosDB has to spread the documents across multiple Logical Partitions


Spreading the documents across multiple Logical Partitions is what we call partitioning.

NB1: CosmosDB may distribute documents across Logical Partitions that live within different physical partitions. Logical Partitions of the same container do not necessarily belong to the same physical partition, but this is managed by CosmosDB

NB2: Partitioning is mandatory if you select Unlimited storage for your container, and supported if you choose 1000 RU/s or more

NB3: Switching between partitioned and un-partitioned containers is not supported. You need to migrate your data

2- Why partitioning matters ?

This is the first question I asked myself: why do I need to know about this partitioning stuff? The service is managed, so why should I care about this if CosmosDB automatically distributes my documents across partitions?

The answer is :

  • CosmosDB does not choose the logical partitioning for you
  • Partitioning has impacts on performance, and is tied to the partition size limit

2.1- Performance

When you query a container, CosmosDB looks into the documents to get the required results (simplified: the real mechanism is more elaborate). When a request spans multiple Logical Partitions, it consumes more Request Units, so the Request Charge per query will be greater.
–> HINT: Given this constraint, it's better that queries don't span multiple Logical Partitions; in other words, documents related to the same query should stay within the same Logical Partition.

2.2- Size

When you request the creation of a new document, CosmosDB places it within a Logical Partition. The question is how CosmosDB distributes the documents between the Logical Partitions. A Logical Partition can't exceed 10GB, so CosmosDB must distribute documents intelligently between the Logical Partitions. This looks easy: a mechanism like round robin could be enough. But that's not true! With round robin, your documents would be spread across N Logical Partitions, and we have seen that queries spanning multiple Logical Partitions consume a lot of RUs, so this is not optimal.

We now have the Performance-Size dilemma: how can CosmosDB deal with these two factors? How can we find the best configuration to:

  1. Keep ‘documents of the same query’ under the same Logical Partition
  2. Not reach the 10GB limit too easily

–> The answer is: CosmosDB can't deal with this for you; you have to deal with it by choosing a Partition Key.

3- The Partition Key

My definition: the Partition Key is a HINT that tells CosmosDB where to place a document, and whether two documents should be stored within the same Logical Partition. The partition key is a value within the JSON document.

NB : The Partition Key must be submitted with each query to CosmosDB

Let me explain this by an example:

Suppose we have a multi-tenant hotel application that manages hotel rooms (reservations, for example). Each room is identified by a document where all the room's information is located. The hotel is identified by a hotelid and the room by an id.

The document structure is like the following:

{
  "hotelid" : "",
  "name" : "",
  "room" : {
    "id" : "",
    "info" : {
      "info1" : "",
      "info2" : ""
    }
  }
}
The following is the graphical view of the JSON document:


Suppose we have documents of 6 rooms:







3.1- No Partition Key

If you create a container with a size of 10GB, the container will not be partitioned, and all the documents will be created within the same Logical Partition. So all your documents together must not exceed 10GB.

3.2- Partition Key : Case 1

Partition Key = /hotelid

In this case, when CosmosDB creates the 6 documents based on /hotelid, it will spread them over 3 Logical Partitions, because there are 3 distinct /hotelid values.

  • Logical Partition 1 : /hotelid = 2222
    • 3 documents
  • Logical Partition 2 : /hotelid = 3333
    • 2 documents
  • Logical Partition 3 : /hotelid = 4444
    • 1 document
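This distribution is easy to sketch. Assuming the 6 room documents carry the hotelids shown above, grouping by the partition key value reproduces the 3 Logical Partitions (a purely illustrative sketch, not CosmosDB's real placement code):

```python
from collections import defaultdict

# The 6 room documents from the example, keyed by hotelid.
documents = [
    {"hotelid": "2222", "room": {"id": "1"}},
    {"hotelid": "2222", "room": {"id": "2"}},
    {"hotelid": "2222", "room": {"id": "3"}},
    {"hotelid": "3333", "room": {"id": "1"}},
    {"hotelid": "3333", "room": {"id": "2"}},
    {"hotelid": "4444", "room": {"id": "1"}},
]

# One logical partition per distinct partition key value.
partitions = defaultdict(list)
for doc in documents:
    partitions[doc["hotelid"]].append(doc)

print({k: len(v) for k, v in partitions.items()})
```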

What are the pros and limits of this partitioning scheme?

  • Pro
    • Documents from the same hotel will be placed on a distinct Logical Partition
    • Each Hotel can have documents up to 10GB
    • Queries across the same hotel will perform well since they will not span multiple Logical Partitions
  • Limits
    • All rooms of the same hotel will be placed within the same Logical Partition
    • The 10GB limit may be reached when the rooms count grows

3.3- Partition Key : Case 2
NB : CosmosDB supports only one JSON property for the Partition Key, so in my case I will create a new property called PartitionKey

Suppose that after making some calculations, we figure out that each hotel will generate 16GB of documents. This means I need the documents to be spread over two Logical Partitions. How can I achieve this?

  • The /hotelid has only 1 distinct value per hotel, so it's not a good partition key
  • I need to find a value that has at least 2 distinct values per hotel
  • I know that each hotel has multiple rooms, so there are multiple room ids to work with

The idea is to create a new JSON property called PartitionKey; the PartitionKey can have two values:

  • hotelid-1 if the roomid is odd
  • hotelid-2 if the roomid is even

This way:

  • When you create a new document (which contains a room), you have to look at the roomid: if it's even then PartitionKey = hotelid-2, if it's odd then PartitionKey = hotelid-1
  • This way, Cosmos will place even rooms within one Logical Partition, and odd rooms within another Logical Partition

–> Result : The hotel's documents will span two Logical Partitions, so 20 GB of storage
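The rule above is trivial to implement when writing documents; a minimal sketch (the function name is mine):

```python
def partition_key(hotelid: str, roomid: int) -> str:
    """Derive the synthetic PartitionKey from the room id parity:
    odd rooms -> '<hotelid>-1', even rooms -> '<hotelid>-2'."""
    suffix = 1 if roomid % 2 else 2
    return f"{hotelid}-{suffix}"

print(partition_key("2222", 7))   # odd room  -> 2222-1
print(partition_key("2222", 10))  # even room -> 2222-2
```

You would compute this value in your application and store it in the document before inserting it.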

What are the pros and limits of this partitioning scheme?

  • Pro
    • Documents related to the same hotel will be placed on 2 Logical Partitions
  • Limits
    • Queries related to the same hotel will be spread across 2 Logical Partitions, which results in an additional request charge

4- How to choose the Partition Key?

This is the most difficult exercise when designing your future CosmosDB data structure; here are some recommendations to guide you through it:

  • What is the expected size per document? This gives you an idea of the level of partitioning you need. Think about the examples above: if each document is 100KB max, then you can have up to ~105k documents per Logical Partition, which means ~105k rooms per hotel (more than enough), so /hotelid is a good partition key with respect to the Size constraint
  • If you are faced with several candidate partition keys and are unable to decide, do the following:
    • Do not use the partition key that will hit the Size constraint quickly: reaching the Size limit makes the application unusable
    • Choose the Partition Key that will consume the least Request Charge. But how to predict that? Determine the most-used queries across your application, and choose the best Partition Key according to them.
  • Add new properties to your JSON document (PartitionKey), even if they are not otherwise useful, just to achieve good partitioning
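The ~105k figure above follows directly from the size limit; as a quick check:

```python
# 10 GB logical-partition limit expressed in KB, divided by a
# 100 KB maximum document size.
limit_kb = 10 * 1024 * 1024   # 10 GB in KB
max_doc_kb = 100

docs_per_partition = limit_kb // max_doc_kb
print(docs_per_partition)  # ~105k documents (rooms) per partition
```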

5- I have determined a good Partition Key, but I'm afraid of hitting the 10 GB limit per Logical Partition

This is the most-asked question after choosing the Partition Key: what if all the documents with the same Partition Key value hit the 10GB limit?!

Like the example above, try to find a mandatory value that gives you the X factor you want. The idea is to ask: can I find an additional property that I can use in my partition key?

NB : The Request Charge will be multiplied by X, but at least I can predict it

This was simple in my case, but if you don't have a natural X factor, you can use a bucket calculator function: you just provide how many Logical Partitions you want to spread your documents across. There is a good blog post about the subject here.
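A minimal sketch of such a bucket function (the function name is mine; a real implementation would typically hash a stable property of the document):

```python
def bucketed_partition_key(hotelid: str, roomid: int, buckets: int) -> str:
    """Spread one hotel's documents over `buckets` logical partitions
    by appending a deterministic bucket number to the key."""
    return f"{hotelid}-{roomid % buckets}"

# Spread hotel 2222 over 4 logical partitions:
keys = {bucketed_partition_key("2222", rid, 4) for rid in range(100)}
print(sorted(keys))
```

Queries for one hotel then fan out over exactly `buckets` partitions, so the extra Request Charge stays predictable.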

Hope that this article helps.


Azure Virtual Machine Serial Access : Finally available

Hi all,

5 years after the first feedback request, Azure has finally added the Console Access feature to its Virtual Machine service.

A bit of history

In the past, accessing a virtual machine was only possible via the network. Anything preventing you from accessing it (managing it) through the network path (ssh, rdp, remote powershell…) had a dramatically bad impact –> Redeploy

For example :

  • You have enabled the firewall on the Virtual Machine –> Redeploy
  • You have a blue screen (that you could fix by changing a setting) –> Redeploy
  • Your screen is stuck on 'Please hit a key to continue' –> Redeploy

Today, Azure has added the Serial console access feature, which means you can access the Virtual Machine just as if you were accessing it via the console port –> no need for network connectivity to the OS.

This is a long-awaited and much-wanted feature that is currently in Public Preview; check it here: https://azure.microsoft.com/en-us/blog/virtual-machine-serial-console-access/

Future improvements

  • Adding F8 keyboard key support to access early-stage boot screens
  • Adding RDP support for Windows, because only cmd or powershell administration is provided today

Enjoy !!


Azure Networking Cross connectivity : The Options

Hi all,

I continually work on designing Cloud solutions, specifically Azure-based Cloud solutions. One of the building blocks when you start dealing with Azure is the networking infrastructure that you need to build.

One of the challenges that we may (and certainly will) encounter is how to design the cross-connectivity model between the different networks. Cross connectivity can involve the following networks:

    • On-premises DC to Azure VNET
    • Azure VNET to Azure VNET
    • Azure VNET to Azure VNET on different region
    • ROBO to Azure VNET

1- The options

The following table shows the different options that you can use to cross connect different networks to Azure VNET :

  • On-premises DC : ExpressRoute, Site-to-Site (S2S) VPN, 3rd-party S2S VPN
  • Azure VNET (same region) : VNET Peering, ExpressRoute, 3rd-party S2S VPN
  • Azure VNET (different region) : VNET-to-VNET (VPN), VNET Peering (cross-region), ExpressRoute, 3rd-party S2S VPN
  • ROBO : Site-to-Site (S2S) VPN, 3rd-party S2S VPN

2- Understanding the options

2.1- Site to Site VPN

The Site-to-Site VPN is a connectivity option that can be used to connect an Azure VNET to any network over the internet, using VPN technology. The S2S VPN is the fastest way to establish a trusted private connection between your network and an Azure VNET.

The S2S VPN requires that you deploy an Azure VPN Gateway on the Azure VNET (an Azure-managed gateway), and establish a VPN connection with a compatible VPN device on your side. The Azure VPN Gateway provides a 99.9% SLA (under the hood, 2 VPN gateway instances in Active/Passive mode). You can additionally achieve a more resilient setup with the Active/Active configuration (no published SLA).

a- Requirements and prerequisites

  • A VPN Gateway deployed on each VNET (2 VNETs can’t use the same VPN gateway)
  • A compatible VPN device on your side (Note that even if your device is not listed, it can be used as long as it supports the VPN configuration required by Azure)

b- Pro and Cons

Pro :

  • The fastest way to establish cross connectivity
  • No special configuration (VPN over the internet)
  • A good solution for ROBO

Cons :

  • No quality SLA (internet : latency, jitter…)
  • A maximum of 1.25 Gbps

c- Pricing

You will pay for :

  • The deployed VPN Gateway (per hour)
  • The outbound data leaving Azure

2.2- Express Route

ExpressRoute is the Microsoft offer that enables customers to establish a low-latency, private and high-bandwidth network connection to the Azure data-centers. Without entering into the technical details, ER is a Layer 3 private connection to Azure networks; it travels through a dedicated circuit from your data-center to the Azure networks, without going over the internet.

ExpressRoute connectivity (Microsoft Credit picture)

ER is an Enterprise offer for customers that require a high-bandwidth, low-latency connection to their Azure workloads. It provides different bandwidth options, from 50Mbps to 10Gbps, and a 99.9% SLA.

a- Requirements and prerequisites

  • An ER Gateway deployed to your VNET (An ER circuit can be shared between different VNETs)
  • An Exchange/Network provider that can provide the connection to Azure ER

b- Pro and Cons

Pro :

  • High bandwidth, low latency, private connection to Azure
  • An ER circuit can be shared between different VNETs, providing both a full-mesh connection between the VNETs and connectivity to on-premises for all VNETs
  • Can be extended to connect VNETs in other regions
  • The possibility to use Azure Microsoft Peering (and Public Peering) to reach other Microsoft services directly without going through the internet: services like Azure PaaS services (Web Apps, Azure SQL…) and Office 365 services…

Cons :

  • Can take time to prepare and establish the circuit (weeks to months: contract with a network/exchange provider)
  • The cost can be significant for high-bandwidth/unlimited tiers
  • Cannot easily be used for ROBO

c- Pricing

You will pay for :

  • The deployed ER Gateway (Per hour)
  • The outbound data leaving Azure in case you subscribed to the metered plan
  • The ER Premium add-on, in case you wish to enable premium features like sharing the ER circuit with VNETs outside the geopolitical region where the ER circuit is established
  • The ER circuit

2.3- VNET Peering

VNET Peering is an Azure technology that allows you to link/peer/connect 2 or more Azure Virtual Networks, with a few clicks and without deploying any additional resource. 2 peered VNETs behave like one bigger VNET, that's it. Imagine putting a wire between two networks and starting to exchange traffic between them: this is VNET peering.

VNET peering establishes a private, LAN-like connectivity between 2 or more virtual networks. Resources within the virtual networks will see each other just like they were on the same one.

a- Requirements and prerequisites

  • 2 or more Virtual Networks (note that peering VNETs in different regions is currently in preview, and not available in all regions)

b- Pro and Cons

Pro :

  • Easy to configure (a few clicks)
  • LAN-like performance

Cons :

  • A non-negligible cost (the pricing model is per volume, so not very predictable)

c- Pricing

You will pay for :

  • The data In and Out the VNET. For example, if you send 1 GB from VNET 1 to VNET 2, you will pay 1 GB leaving VNET 1 and 1 GB entering VNET 2
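The double-sided billing above is easy to model; a tiny sketch (this only counts billable gigabytes, since the real per-GB rates vary by region):

```python
def peering_billable_gb(gb_sent: float) -> float:
    """Data crossing a peering link is billed twice: once as
    egress from the source VNET, once as ingress to the
    destination VNET."""
    egress = gb_sent   # leaving VNET 1
    ingress = gb_sent  # entering VNET 2
    return egress + ingress

print(peering_billable_gb(1.0))  # 1 GB sent -> 2 GB billable
```

This is why a chatty workload spread across peered VNETs can generate a surprisingly large bill.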

2.4- VNET to VNET

The VNET-to-VNET connectivity is just an S2S VPN between two VNETs, provided by the Azure VPN Gateways. It's an additional, cost-effective cross-connectivity option.

a- Requirements and prerequisites

  • A VPN Gateway on each VNET

b- Pro and Cons

Pro :

  • Simple to configure
  • Cost-effective : traffic between 2 VNETs within the same region is free; you pay only for the VPN Gateways (per hour)

c- Pricing

You will pay for :

  • The VPN Gateways (per hour)

2.5- 3rd party S2S VPN

You can opt to establish network cross connectivity using your own technology, with a virtualized VPN device (or a device that uses any tunneling protocol). By deploying a Virtual Machine running your software (it must be supported by Azure, like a Linux-based virtual appliance), you can establish a connection to your other networks and then route traffic to your VNET using Route Tables (UDRs).

a- Requirements and prerequisites

  • A Virtual Network Appliance supported by Azure

b- Pro and Cons

Pro :

  • Keep your Enterprise technology

Cons :

  • High Availability : most HA protocols, like VRRP, are not supported on Azure
  • Cost : that depends on the appliance
  • Bandwidth/Latency : traffic travels over the internet
  • Additional management (not a managed service)

c- Pricing

You will pay for :

  • The Virtual Machines you deploy
  • The outbound data leaving Azure

3- How to choose between the solutions

In a coming post, I will share a design I recommended to one of my customers, showing one of the architectures we can build using ExpressRoute and VNET peering. But that was for the context of that particular customer. Choosing which technology to use depends on many factors, including:

  • Budget : We saw that ER and Peering are relatively expensive compared to the S2S VPN and VNET-to-VNET options
  • Needs : I don't need an ER between my ROBO and Azure if I have little data to exchange. But I need VNET peering if low latency is mandatory between my workloads spread across VNETs
  • Time To Market : Establishing an S2S VPN is way quicker than an ER circuit, so an emergency may leave you with no choice, at least for the short term

My recommendations

I can't recommend something without knowing the context and the needs, but in general, I see the picture as follows:

  • If you are in a hybrid configuration for the mid/long term (more than 1 year), then providing an Enterprise connection between your datacenter and Azure is crucial. East-West traffic requires high-bandwidth / low-latency connections, so ExpressRoute is the only good choice.
  • If your workloads are spread between VNETs, and latency/bandwidth matters, VNET peering is the best choice. If connection quality is not mandatory, you can opt for VNET-to-VNET connectivity, or you can share your ExpressRoute if it already exists.
  • Small and medium ROBOs can use S2S VPNs to connect to Azure. If performance matters, an alternative architecture may be needed: establishing ER for a ROBO is neither practical nor cost-effective, so you can opt for a hybrid architecture where all your offices are connected to a POP with high-bandwidth / low-latency links, and an ER links the POP to Azure.

Setup Highly Available Network Virtual Appliances in Active/Passive mode on Microsoft Azure (Using Zookeeper)


This guide is a replacement for, or an alternative to, the article published on GitHub (https://github.com/mspnp/ha-nva), which I consider not very clear, and which I'm certain has discouraged many of you from implementing it in production.

1- Architecture

1.1- Solution components

The following picture shows an implementation of a 2-node Active/Passive NVA distribution.

Image credit : Microsoft

The architecture includes the following components :

Network Virtual Appliances

2 nodes are supported. You should create two instances of your NVA, which means 2 Virtual Machines. The virtual machines have to be deployed within an Availability Set to ensure better availability and for the group to achieve a 99.95% SLA. With the announcement of the Availability Zones preview, you can plan to deploy each NVA in a zone, in order to ensure zone-failure resiliency within the same region and achieve a 99.99% SLA. The configuration is Active/Passive, which means that even if both nodes are up and can process traffic, only one node will receive traffic at a time.

NB : If you need an Active/Active configuration, you can use HA Ports, a feature of the Azure Load Balancer Standard. HA Ports allow you to tell the load balancer to load-balance any traffic to the backend pool members. More information about HA Ports: https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-ha-ports-overview

Zookeeper Nodes

Zookeeper is a centralized service that will automatically detect the failure of the active node and switch the traffic to the passive node (by applying some configuration to your Azure resources: the Public IP address, UDRs). The passive node becomes active in this case. At least 3 Zookeeper nodes have to be deployed. Three is the minimum required to achieve 'quorum', which means a decision made by the nodes has to be approved by at least 2 of them. To keep it simple: 2 nodes are needed for high availability, but with only 2 nodes there are scenarios where they are unable to communicate with each other (whatever the reason) and each one needs to make a decision (Node 1 wants to switch the Active/Passive roles, Node 2 wants to keep the current configuration). In this case we need a 'judge'; this is why we add a 3rd node that will back either decision 1 or decision 2. This article about the Windows Server Failover Cluster quorum can help you understand the principle (https://technet.microsoft.com/en-us/library/cc731739(v=ws.11).aspx). Zookeeper is a lightweight service, so a minimal VM size like A1 is enough.
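The quorum rule itself is just a strict majority count; a one-line sketch makes the 2-node vs 3-node difference obvious:

```python
def votes_needed(nodes: int) -> int:
    """Strict majority required for a quorum decision."""
    return nodes // 2 + 1

for n in (2, 3, 5):
    print(n, "nodes ->", votes_needed(n), "votes needed")
```

With 2 nodes, a partitioned node can never gather the 2 votes it needs alone, so no decision is possible; with 3 nodes, any 2 that can still talk to each other form a majority.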

Public IP Address

A static Public IP address resource should be attached to the active NVA node, to be used for outbound internet traffic and inbound traffic coming from the internet. This is the edge of your network. If an active-node failure is detected, zookeeper will move the Public IP address from the active node to the passive node.

User Defined Routes (UDRs)

This is the heart of the solution. UDRs are collections of routes that you can apply to subnets. They are used to force the VMs within those subnets to send traffic to selected targets instead of following the default routes. You can have many routes within a UDR.

1.2- Network Virtual Appliance interface counts

This is a bigger topic than can be covered in a few lines, but I need to mention it so that the next sections are understandable. The picture above shows 2 NICs per NVA. The first NIC will be used for external traffic (inbound and outbound), and the Public IP address will be attached to it. The second NIC will be used for internal traffic: all internal subnets will send traffic to that NIC. You can have designs with more NICs per NVA, for example if you want to create an internal zone (communication between subnets within the VNET), a cross-premises connectivity zone (communication to and from on-premises), a DMZ zone and an external zone. The NIC count affects the VM size (each VM size has a maximum supported NIC count, so keep that in mind when choosing the VM size).

1.3- Zookeeper nodes network configuration

The zookeeper nodes must be placed within the same subnet as one of the NVA's NICs. Do not place them in a subnet without a direct path to the NVA (one where a UDR is applied to reach the NVA, or one whose default route points to the NVA's subnet). The zookeeper nodes need to continually probe one of the NVA's NICs. For example, in the picture above, the zookeeper nodes are placed in the DMZ internal subnet, and they continually probe the NVA's NIC that is in the DMZ internal subnet. Note that zookeeper will initiate a failover when the probed NIC stops responding, even if the other NICs are still alive. For example, if a firewall rule were somehow added that prevents zookeeper from probing the NIC, zookeeper would initiate a failover even though the NVA is alive, so be careful. Conversely, if all the other NICs are dead but the probed NIC is alive, zookeeper will not initiate a failover. So choose a port that is served by the NVA's core service and that represents the NVA's health state.

The zookeeper nodes need to continually exchange heartbeat and metadata information. Each zookeeper node will listen on a port (that you can define during the configuration), and the other nodes will communicate with it on that port. The ports should be different between the nodes (in the example below, the ports are 2181, 2182 and 2183 for node 1, node 2 and node 3 respectively). If you enable the zookeeper nodes' OS firewall, do not forget to permit communication over the chosen ports.

1.4- User Defined Routes configuration

The UDRs should follow a very simple rule to be compatible with the zookeeper-initiated failover: each UDR must contain routes that send traffic to only one interface. This allows zookeeper to deterministically set the next hop during a failover. You can create multiple UDRs, each pointing to one interface, and apply them to the subnets.

The picture on the left shows a bad UDR configuration (UDR-Bad), because the routes do not have the same next hop. The picture on the right shows a good configuration (UDR-Good).
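The rule can be checked mechanically. Here is a hypothetical validator sketch (the route dictionaries and the `nextHopIp` key are illustrative, not the actual Azure resource schema):

```python
def udr_is_valid(routes):
    """A route table is zookeeper-compatible only if every route shares the
    same next-hop IP, so failover can swap that single IP atomically."""
    next_hops = {r["nextHopIp"] for r in routes}
    return len(next_hops) <= 1

udr_good = [
    {"prefix": "10.0.1.0/24", "nextHopIp": "10.0.0.4"},
    {"prefix": "10.0.2.0/24", "nextHopIp": "10.0.0.4"},
]
udr_bad = [
    {"prefix": "10.0.1.0/24", "nextHopIp": "10.0.0.4"},
    {"prefix": "10.0.2.0/24", "nextHopIp": "10.0.0.5"},  # different next hop: invalid
]
print(udr_is_valid(udr_good), udr_is_valid(udr_bad))  # True False
```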

2- How it works?

This zookeeper solution implementation has a very simple concept. Let's first see what needs to be done when the active node (Node 1) fails:

  1. Reconfigure all the UDRs, or more precisely the routes inside the route tables, to stop routing traffic to the Node 1 NICs and send it to Node 2 instead. This is done by changing the next hop of each route to the corresponding Node 2 IP
  2. Attach the Public IP address to the Node 2 'external' NIC. This is done by un-assigning the Public IP address from the Node 1 'external' NIC and assigning it to the Node 2 'external' NIC

This is what zookeeper does:

  1. Continually probe the Node 1 NIC to see if it's alive
  2. If the probe fails, initiate the steps mentioned above, based on a configuration file
  3. Continually probe the Node 2 NIC to see if it's alive
  4. If the probe fails, initiate the same steps, but this time failing back to Node 1 if it's alive
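The probe-and-failover loop above can be sketched as follows. This is an illustrative simulation, not the real daemon: it replays a sequence of probe results and reports when a failover would fire (the threshold of 3 consecutive failures mirrors the example configuration later in this post).

```python
def run_probes(probe_results, max_failures=3):
    """Replay a sequence of probe outcomes (True = NIC alive) and return the
    list of (cycle, new_active_node) tuples at which a failover fires."""
    active, standby = "node1", "node2"
    failures, failovers = 0, []
    for cycle, alive in enumerate(probe_results):
        if alive:
            failures = 0          # a successful probe resets the counter
            continue
        failures += 1
        if failures >= max_failures:
            # Here the real daemon would rewrite the UDR next hops and
            # move the Public IP, as described in the two steps above.
            failovers.append((cycle, standby))
            active, standby = standby, active
            failures = 0
    return failovers

print(run_probes([True, False, False, False]))  # [(3, 'node2')]
```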

3- Implementation

This section will show you how to implement zookeeper into your infrastructure.

3.1- Prerequisites and requirements

In order to successfully implement zookeeper, you will need to validate all the following points :

  • Create an Azure AD application to act as the identity used by zookeeper to make changes to your Azure resources during the failover. You will need to generate a certificate (pfx) during the application creation. This pfx will later be converted to the 'jks' format to be used by zookeeper. The Azure AD app will be assigned permissions on the resources that zookeeper will modify during the failover
  • The nvadaemon-remote.json file
  • The log4j.properties file
  • The configurezook.sh file
  • An Azure Template to deploy the zookeeper VMs / configurations: 2 files: template.json and param.json
  • Optional : A Deploy.ps1 that contains the powershell code to deploy the template


A- Create the Azure AD SPN

Follow this post to create an Azure AD SPN. Keep the pfx file and the password to be used later : https://buildwindows.wordpress.com/2017/12/03/create-an-azure-ad-service-principal-based-on-azure-ad-app-secured-by-a-certificate/

B- Assign the Azure AD App permissions on the Azure Resources

Zookeeper will use the Azure AD App identity to make changes to some Azure resources in order to perform the failover (the changes were discussed earlier in this post).

Give the Azure AD app the Contributor role on the following resources :

– The network Interface that Zookeeper will attach and detach the Public IP Address from

– The Route Tables that zookeeper will modify during the failover

Give the Azure AD app the Reader role on the following resources :

– The Public IP Address resource

The link in "A- Create the Azure AD SPN" shows how to make the role assignments.

C- Prepare the nvadaemon-remote.json

I highly recommend you download and install the following tool to view and edit JSON files : http://tomeko.net/software/JSONedit/

The following picture shows a view of the nvadaemon-remote.json file

There are 2 main sections: zookeeper and daemon

C1- General Settings

Fill in the required information as described below; the red parameters must reflect your environment :

  • Zookeeper connection string : a comma-separated string with the format zookeepernode1:port1,zookeepernode2:port2,zookeepernode3:port3. Fill in your zookeeper node names and the ports to be used (you can keep the default ports).
  • retrySleepTime : the time interval between 2 successive probes (in milliseconds).
  • numberOfRetries : how many retries before a zookeeper node considers another node dead. A zookeeper node will consider the other zookeeper node dead after retrySleepTime x numberOfRetries (5 seconds in my example).
  • Subscription ID : the Azure subscription ID.
  • Client ID : the Azure AD App client ID. This value can be copied from the Azure AD application that you created previously.
  • Tenant ID : the Azure AD tenant ID. You can get this value from the Azure Portal > Azure Active Directory > Properties (Directory ID).
  • Key store path : the store where the key will be stored on the zookeeper container. Keep the default : /nvabin/nva.jks.
  • Key store password : a password to protect access to the key store within the zookeeper containers.
  • Certificate password : the PFX certificate password defined earlier.
  • Probe failure count : the number of probe failures after which zookeeper will consider the active node dead and initiate a failover.
  • Maximum probe interval : the maximum time interval between 2 successive probes (in milliseconds).

In my example, zookeeper will initiate a failover after : 3 x 3000 ms = 9 seconds
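The detection delay is simply the product of those two settings. A quick sketch (the variable names below are illustrative, not the exact JSON keys):

```python
retry_sleep_ms = 3000       # interval between 2 successive probes (ms)
max_probe_failures = 3      # consecutive failed probes before failover

detection_time_s = retry_sleep_ms * max_probe_failures / 1000
print(detection_time_s)     # 9.0 seconds, matching the example above
```

Tune these two values together: a shorter interval detects failures faster but makes the failover more sensitive to transient probe losses.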


C2- Routing Settings

Route Tables

The routeTables section is an array. Each line is the resource ID of a route table resource. As discussed earlier, you should have a route table for each interface, and a route table should not route to different next hops.

Public IP address

This is also an array with one entry. In the name, put anything that identifies the Public IP address (you can keep the default). In the id, paste the Public IP address resource ID.


In this section, we will add the NVAs' NIC resources. Because we have 2 NVAs, the array will contain 2 objects (0 and 1) : 0 is the first NVA, 1 is the second NVA.

  • For each NVA, add all the interfaces that will be addressed by the route tables configured earlier. If you have configured 2 route tables, then you only need to add 2 NICs.
  • Each interface has a name and an id.
  • For the NIC where the Public IP address will be attached, use the same name that was used for the Public IP address (in my case it's pip-fw-test). This allows zookeeper to know which NIC will be assigned the Public IP address during the failover. In the id, type the resource id of the NIC. This NIC's private IP will be used by zookeeper as the next hop for the first route table (route table 0 → NIC 0).
  • The next route table will have the next NIC as its next hop (in my case nic2) (route table 1 → NIC 1). The NIC name can be changed, but it must be the same on both NVAs.
  • Note that in my case, nic3 is not needed, as it will not be addressed by any route table. I should have removed it.
  • In probeNetworkIntrface, type the id of the NVA's NIC that zookeeper will probe to get the health state of the NVA.
  • In probePort, type the port that zookeeper will probe.
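As a rough illustration of the pairing rule described above, here is the shape of the interfaces section modelled as Python data. This is a sketch only: the exact schema comes from your nvadaemon-remote.json, and every resource ID below is a placeholder.

```python
# Shape illustration only -- the real keys live in nvadaemon-remote.json,
# and every resource ID below is a placeholder.
interfaces = [
    {   # NVA 0
        "pip-fw-test": "<resource id of the NVA 0 NIC carrying the Public IP>",
        "nic2": "<resource id of the NVA 0 internal NIC>",
    },
    {   # NVA 1 -- the NIC names must match NVA 0's names
        "pip-fw-test": "<resource id of the NVA 1 NIC carrying the Public IP>",
        "nic2": "<resource id of the NVA 1 internal NIC>",
    },
]

# The names must be identical across the 2 NVAs so zookeeper can pair the
# NICs during failover (route table 0 -> NIC 0, route table 1 -> NIC 1).
assert set(interfaces[0]) == set(interfaces[1])
```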

D- Copy the files to the Cloud Storage


All the files used during the zookeeper installation must be available to the zookeeper VMs during the deployment. The recommended way to do this is to copy the files to a new or existing Azure Storage Account, so the prerequisite is to create a new storage account or use an existing one. Note that the files can be removed after the zookeeper deployment.

The files to be copied are :

  • Certificate pfx file
  • nvadaemon-remote.json
  • log4j.properties
  • configurezook.sh



Go to the Azure Portal and select your Storage Account.

Go to Blobs

Create a new Container :

Name : installzook

Public access level : Container


Go to the container and upload the file : configurezook.sh

Go to File

Create a new File Share named : zookeepermonitor

The quota can be set to 1

Upload the files :

  • Certificate pfx file
  • nvadaemon-remote.json
  • log4j.properties



E- Prepare the param.json file

The param file contains the parameters to deploy your zookeeper platform. Fill in the following parameters :

  • Location : The region where to deploy the resources
  • adminUsername : the username for the Ubuntu zookeeper nodes admin user
  • adminPassword : The password
  • VmSize : the VM size (Standard_A1_v2 is enough)
  • vnetName : The VNET where the zookeeper will be deployed (Must be the same VNET where the NVAs are deployed)
  • vnetResourceGroup : The VNET RG
  • AvailabilitySetName : A name for the availability Set
  • VmnamePrefix : The zookeeper nodes name prefix. The example shows zook0, so the VM names will be zook01, zook02 and zook03
  • PrivateIPs : The Private IPs of each zookeeper node
  • Subnet : The subnet for the zookeeper nodes. It must be one of the NVA's subnets
  • CERTPASSWORD : The password of the pfx file created earlier
  • CERTSTOREPASSWORD : the password to protect the certificate on the zookeeper nodes cert store
  • CERTNAME : the name of the pfx certificate file
  • SANAME : The storage account URL (where the files are stored)
  • SAKEY : the Storage account key
  • Customscripturl : the custom script URL
  • customScriptCommandToExecute : the command to execute (leave it unchanged, unless you have changed the script file name)


3.2- Deploy

Using Powershell or the CLI, you can now deploy the 3 zookeeper nodes:

New-AzureRmResourceGroupDeployment -Name DeploymentName -ResourceGroupName ExampleResourceGroup `
-TemplateFile "template file path" -TemplateParameterFile "template parameter file path"



4- Download the files

You can download the files from this link : https://www.dropbox.com/sh/hvxpzcgz6z1dg8f/AABL0ntxPT57lBaUluvlr1MUa?dl=0


Get Azure Datacenter IP ranges via API V2


In my previous post, I showed how to create a lightweight Azure function that allows you to request the Azure Datacenter IP ranges via an API. You can rapidly test it by following the instructions in section "3- Try it before deploying" here : https://buildwindows.wordpress.com/2017/11/19/get-azure-datacenter-ip-ranges-via-api/

The feedback was positive, but many of you asked for a way to see whether there are updates compared to the last version, and what the updates are, if any.

In this post, I will publish the second version of the API (with the how-to), that allows you to :

  • Get the current Azure Datacenter IP ranges
  • Get the current Azure Datacenter IP ranges for a specific region
  • Get the region names (since, unfortunately, the region names published by Microsoft are not exactly the same as those used by the Microsoft Azure services)
  • New : Get the current version ID of the Azure Datacenter IP ranges
  • New : Get the previous version ID of the Azure Datacenter IP ranges
  • New : Get the difference between the current and the previous version for all regions
  • New : Get the difference between the current and the previous version for a specific region

The new features allow easy integration with your environment and simplify the update of the firewall rules within your infrastructure.

1- How to request the API ?

The API supports only POST requests. You can make the following API requests using the body constructions below.

Here are examples using Powershell, but you can use any tool to request the API with the same body content:


#Get the current Azure IP address ranges of all regions

$body = @{"region"="all";"request"="dcip"} | ConvertTo-Json

#Get the current Azure IP address ranges of a specific region, for example europewest

$body = @{"region"="europewest";"request"="dcip"} | ConvertTo-Json

#Get the Azure region names that you can request IPs for

$body = @{"request"="dcnames"} | ConvertTo-Json

#Post the request

$webrequest = Invoke-WebRequest -Method "POST" -uri https://azuredcip.azurewebsites.net/getazuredcipranges -Body $body

ConvertFrom-Json -InputObject $webrequest.Content

#New in V2

#Get the (added and/or removed) IP address ranges updates of a specific region

$body = @{"request"="getupdates";"region"="asiaeast"} | ConvertTo-Json

#Get the (added and/or removed) IP address ranges updates of all regions

$body = @{"request"="getupdates";"region"="all"} | ConvertTo-Json

#Get the current Azure DC IP ranges version ID

$body = @{"request"="currentversion"} | ConvertTo-Json

#Get the previous Azure DC IP ranges version ID

$body = @{"request"="previousversion"} | ConvertTo-Json

#Post the request

$webrequest = Invoke-WebRequest -Method "POST" -uri https://azuredcip.azurewebsites.net/getazuredcipupdates -Body $body

ConvertFrom-Json -InputObject $webrequest.Content
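The same bodies can be built from any language. Here is a minimal Python sketch; the endpoint URLs are the ones listed above, and the commented-out call requires network access:

```python
import json
import urllib.request

def build_body(request, region=None):
    """Build the JSON body expected by the API (POST requests only)."""
    payload = {"request": request}
    if region is not None:
        payload["region"] = region
    return json.dumps(payload)

body = build_body("dcip", "europewest")
print(body)  # {"request": "dcip", "region": "europewest"}

# Sending it (requires network access):
# req = urllib.request.Request(
#     "https://azuredcip.azurewebsites.net/getazuredcipranges",
#     data=body.encode(), method="POST")
# print(urllib.request.urlopen(req).read().decode())
```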

2- How to build the solution ?

2.1- Solutions components

The V2 version still uses only Azure Functions, but unlike V1, it uses multiple functions within the Function App:

  • 1 Function App
    • Function 1
    • Function 2
    • Function 3
    • Proxy 1
    • Proxy 2
  • 1 Storage Account

The following table details each component configuration. If you want to create the solution within your environment, create the same components using the given configuration:

Function App

  • Name : azuredcip
  • App Service Plan : Shared or greater. Use at least a Basic tier to benefit from SSL, custom domain names and backup.
  • This Function App will host the entire solution. It will include 3 functions, 1 Storage Account and 2 proxies.

Function 1 : azuredcipranges

  • This function returns the V1 information.
  • Type : HttpTrigger – Powershell
  • Allowed HTTP methods : POST
  • Authorization level : Anonymous
  • You can add function keys if you want to secure the API access. In my case, the API remains public (anonymous) to keep supporting V1.







Function 2 : azuredciprangesupdater

  • This function will do the following :
    – Get the current Azure DC IP ranges version and store it in the storage account
    – Always keep the previous Azure DC IP ranges version file in the storage account
    – Create a file containing the difference between the current and previous versions, and store it in the storage account
    – Return the mentioned information based on the API request body
  • Type : HttpTrigger – Powershell
  • Inputs (Type : Azure Blob Storage) :
    – vnowinblob : azuredcipfiles/vnow.json (SA connection : AzureWebJobDashboard)
    – vpreviousinblob : azuredcipfiles/vprevious.json (SA connection : AzureWebJobDashboard)
    – vcompareinblob : azuredcipfiles/vcompare.json (SA connection : AzureWebJobDashboard)
  • Outputs (Type : Azure Blob Storage) :
    – vnowoutblob : azuredcipfiles/vnow.json (SA connection : AzureWebJobDashboard)
    – vpreviousoutblob : azuredcipfiles/vprevious.json (SA connection : AzureWebJobDashboard)
    – vcompareoutblob : azuredcipfiles/vcompare.json (SA connection : AzureWebJobDashboard)
  • Keep the default http output
  • Allowed HTTP methods : POST
  • Authorization level : Function
  • You can use the default function key or generate a new key. This API will not be directly exposed, so you can protect it with a key.
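The heart of the updater is a set difference between the current (vnow) and previous (vprevious) range lists. A hedged sketch of that idea in Python (the function name and the sample CIDRs are illustrative, not the actual function code, which is written in Powershell):

```python
def diff_ranges(current, previous):
    """Compare two lists of CIDR strings and report what changed.
    This mirrors the idea behind the vcompare.json file."""
    cur, prev = set(current), set(previous)
    return {"added": sorted(cur - prev), "removed": sorted(prev - cur)}

vnow = ["13.64.0.0/16", "40.112.0.0/13"]        # illustrative sample data
vprevious = ["13.64.0.0/16", "23.96.0.0/13"]
print(diff_ranges(vnow, vprevious))
# {'added': ['40.112.0.0/13'], 'removed': ['23.96.0.0/13']}
```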





Function 3 : triggerazdciprangesupdate

  • This function triggers azuredciprangesupdater weekly to update the current and previous versions.
  • Type : TimerTrigger – Powershell
  • Schedule : 0 0 0 * * 3 (each Wednesday, but you can choose any day of the week, as Microsoft will not apply the updates before one week after their publication)

Proxy 1 : getazuredcipranges

  • This proxy relays requests to azuredcipranges
  • Route template : /getazuredcipranges
  • Allowed HTTP methods : POST
  • Backend URL : https://azuredcip.azurewebsites.net/api/azuredcipranges (add the key if you have secured it)

Proxy 2 : getazuredcipupdates

  • This proxy relays requests to azuredciprangesupdater
  • Route template : /getazuredcipupdates
  • Allowed HTTP methods : POST
  • Backend URL : https://azuredcip.azurewebsites.net/api/azuredciprangesupdater?code=typethekeyhere

Storage Account

  • This is the storage account automatically created with the Function App (it can have any other name).
  • Container : azuredcipfiles
  • Upload the following files : vnow.json and vprevious.json

NB : These files are placeholders. During the first API update request, the vnow content will be copied to the vprevious file, and the vnow content will then be replaced by the real current version. At that moment, you can only request the current version. After one week, a new version of the file will be published by Microsoft, so another cycle will set vnow and vprevious to 2 real consecutive versions, and you will benefit from the update and comparison features.

2.2- Download the files

You can download the needed files here; you will find the Powershell code for each function (function1, function2, function3) and the two JSON files (vnow.json and vprevious.json).

NB : As mentioned before, after the first request to the API (getupdates), you will have a valid vnow version, but the previous version will still be the placeholder uploaded now. You need to wait at least 1 week to have a valid previous version.


Create an Azure AD service principal (Based on Azure AD App) secured by a Certificate

When working with Azure, you may need to create an Azure AD application to act as a service principal and use it to run operations on Azure resources. This post will show you how to register an Azure AD application secured by a self-signed certificate, all via Powershell. You can modify the third script if you want to create the application using an existing certificate. The scripts used can be downloaded from here

1- Create a pfx certificate

In order for the Azure AD App to be secured, a certificate needs to be created. Prepare the following information to create your certificate :

  • common name (cn)
  • A password to protect the private key

The files Create-SSCv1.ps1 (for Windows 2008 R2/7) and Create-SSCv2.ps1 (for Windows 2016 /10) are powershell scripts that allow you to create a self-signed certificate.

Example using Create-SSCv1.ps1 (the DNS name replaces the common name)

.\Create-SSCv1.ps1 -DNSName zookeeperazure -Password P@ssw0rdzook -PFXPath c:\ -PFXName 

Example using Create-SSCv2.ps1 (more control over some options)

.\Create-SSCv2.ps1 -SubjectName zookeeper -Password P@ssw0rd -PFXPath C:\temp -PFXName zookeeper -MonthsValidity 24 -FriendlyName zookeepernva

2- Import the certificate into the Windows certificate store

The file Import-CertToStore.ps1 will import the certificate into the personal store, so it can be used to create the Azure AD App later. Provide the password used in the previous step:

.\Import-CertToStore.ps1 -Path C:\temp\zookeeper.pfx -Password P@ssw0rd

3- Create an Azure AD application to act as a Service Principal Name

Use the script file Create-azureadapp.ps1 to create the Azure AD application. The Azure AD application should have the same name as the certificate CN so that the script can work. You will be prompted to log in to Azure.

.\Create-azureadapp.ps1 -ApplicationName zookeeper

You can now see that a new application has been added to your Azure AD registered applications: Azure Portal > Azure Active Directory > App registrations

4- Add the application to an Azure Role

Now that your application has been created, you can assign it any Azure RBAC role. For example, I assigned the created application (zookeeper) the Reader role on the resource group RG-Azure.