Why Cloud Engineers Need to Understand Networking (Even If It’s Not in the Job Description)

Explore the role of networking in cloud engineering and learn how you can expand your skills beyond the job description for better problem-solving.

So, you work in IT? That means your job stops where the job description does, right? Not really. That isn’t true of any job, because learning should never be limited by a job description. Otherwise, we would never be challenged beyond what we already know!

Now, let’s narrow our scope to IT engineers, specifically those who are cloud-focused. Being cloud-focused means we’re building new and exciting stuff all day, it’s all cloudy, and AWS handles all of the networking. That sounds like a dream, but then one day you realize AWS handles 90% of the workload while your networking team handles the other 10%.

While the idea of never touching the network sounds great in theory, it is essential that, as engineers, we strive to be well-rounded and keep pushing ourselves outside of our comfort zone. In this article, I’ll walk through a real-life example of why this matters, and then I’ll share some practical tools to help you get there.

The Reality of Software Engineering: We Need Networking

We all have a part in the grand scheme of our infrastructure, from app code to security to networking and everything in between. In the past, when the app server couldn’t connect to the file server, we would throw our hands up in the air, say it was the networking team’s problem, and declare ourselves roadblocked. It is easy to pass on the blame without taking responsibility, but the reality is that being willing to admit to your mistakes is an important part of the job.

I write code; why do I need to know networking?

I hear this question a lot, and to answer, I have an example of a situation that recently happened at StratusGrid. 

One of our lead software engineers recently ran into an issue with a VPC-connected Lambda that could not reach the Internet through a NAT Gateway. He is well-versed in networking and ran through every test he could to pin down the cause. When that didn’t resolve it, the issue was handed off, and after a couple of hours of troubleshooting, we determined it was caused by network ACLs (NACLs) on the public subnet where the NAT Gateway was located.

When we looked back on the issue, we found out that the lead had spotted the NACLs but didn’t know enough about them. He is a full-stack developer who had already gone above and beyond on networking knowledge. This is the type of networking knowledge that can empower you as a software engineer.
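To make the NACL problem concrete, here is a minimal, standalone Terraform sketch of NACL rules for a public subnet that hosts a NAT Gateway. It is not the configuration from the incident above. The variable names are hypothetical, and it assumes only outbound HTTPS is needed. Because NACLs are stateless, the return traffic on ephemeral ports (1024-65535) has to be allowed explicitly in both directions.

```hcl
# Hypothetical inputs: the VPC, the public subnet holding the NAT Gateway,
# and the CIDR range of the private subnets that route through it.
variable "vpc_id" { type = string }
variable "public_subnet_id" { type = string }
variable "private_cidr" { type = string } # e.g. "10.0.0.0/16"

resource "aws_network_acl" "public" {
  vpc_id     = var.vpc_id
  subnet_ids = [var.public_subnet_id]

  # Inbound: HTTPS requests arriving from the private subnets.
  ingress {
    rule_no    = 100
    protocol   = "tcp"
    action     = "allow"
    cidr_block = var.private_cidr
    from_port  = 443
    to_port    = 443
  }

  # Inbound: replies from the Internet land on the NAT Gateway's ephemeral ports.
  ingress {
    rule_no    = 110
    protocol   = "tcp"
    action     = "allow"
    cidr_block = "0.0.0.0/0"
    from_port  = 1024
    to_port    = 65535
  }

  # Outbound: the NAT Gateway forwards the request to the Internet.
  egress {
    rule_no    = 100
    protocol   = "tcp"
    action     = "allow"
    cidr_block = "0.0.0.0/0"
    from_port  = 443
    to_port    = 443
  }

  # Outbound: replies go back to the clients' ephemeral ports in the private subnets.
  egress {
    rule_no    = 110
    protocol   = "tcp"
    action     = "allow"
    cidr_block = var.private_cidr
    from_port  = 1024
    to_port    = 65535
  }
}
```

Leaving out either of the ephemeral-port rules produces exactly the kind of failure described above: the request leaves, but the reply is silently dropped at the subnet boundary.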

Our Tools & Knowledge for Networks

As engineers, we all have our favorite tools that just work, tools that cause us pain (though they get the job done), and tools we’re encouraged to use. We’re going to talk about some of these tools, how StratusGrid uses them, and a few of their pros and cons.

  • The AWS VPC Module

At StratusGrid, most of our engineers work in Terraform (HCL) or CDK (TypeScript) for their day-to-day jobs. The wonderful thing about Terraform is the amazing community-supported modules, especially the AWS-specific ones. There are very few deployments where StratusGrid doesn’t use the community AWS VPC Module. VPCs are simple on the surface (though they can get complex very quickly), and by utilizing this module and its features, we’re able to quickly deploy VPCs and changes.

With this being said, the VPC module isn’t perfect and has some quirks, though it allows our engineers to quickly deploy with standardization in a programmatic manner for a variety of networks.

One of the things that StratusGrid engineers really like about the VPC module is that it allows you to easily define Public/Private/Database subnets to create a proper three-tier application architecture. The module’s logic lets you define whether the database subnets should have routes to the Internet, and it lets you choose your NAT Gateway layout, from a single shared gateway to one per availability zone (AZ). These are just a few of the many features it offers.
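As a rough illustration of those options, here is a minimal sketch that assumes the community module in question is `terraform-aws-modules/vpc/aws` from the Terraform Registry. The name, CIDR ranges, and availability zones are hypothetical placeholders, and no module version is pinned here.

```hcl
module "vpc" {
  source = "terraform-aws-modules/vpc/aws" # assumed registry module

  name = "example-vpc" # hypothetical
  cidr = "10.0.0.0/16" # hypothetical

  # Three tiers spread across three AZs.
  azs              = ["us-east-1a", "us-east-1b", "us-east-1c"]
  public_subnets   = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
  private_subnets  = ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]
  database_subnets = ["10.0.20.0/24", "10.0.21.0/24", "10.0.22.0/24"]

  # NAT Gateway logic: a single shared gateway here.
  # For one per AZ, set single_nat_gateway = false and one_nat_gateway_per_az = true.
  enable_nat_gateway     = true
  single_nat_gateway     = true
  one_nat_gateway_per_az = false

  # Give the database tier its own route table with no path to the Internet.
  create_database_subnet_route_table     = true
  create_database_internet_gateway_route = false
  create_database_nat_gateway_route      = false
}
```

A single NAT Gateway is cheaper but makes one AZ a dependency for all outbound traffic, so which layout fits is a cost versus availability decision.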

  • VPC Flow Logs

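One of my favorite things about the VPC module is how easy it is to log every flow on your network. (Flow logs record metadata about each flow, such as addresses, ports, and whether it was accepted or rejected, rather than the packet contents.) StratusGrid has repeatedly used this feature to diagnose what is happening and where a packet is getting dropped.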

In a traditional network, you might start with a Wireshark packet capture on the source and destination boxes and hope the traffic isn’t being dropped somewhere else along the path, and then you would need your network team to help diagnose where it’s being dropped. None of that needs to happen with VPC Flow Logs; with the example code shown below combined with the AWS VPC Module discussed above, you can have every ENI in the VPC logging within about five minutes:

# Variables
# Note: var.region, var.name_prefix, and local.name_suffix are referenced below
# and are expected to be defined elsewhere in the configuration.

variable "cloud_watch_retention" {
  description = "Global Repo CloudWatch Log Retention in Days"
  type        = number

  validation {
    condition     = contains([0, 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653], var.cloud_watch_retention)
    error_message = "Not a valid retention day option, see https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_log_group ."
  }
}

variable "env_name" {
  description = "Environment name string to be used for decisions and name generation. Appended to name_suffix to create full_suffix"
  type        = string
}

# KMS for CloudWatch Logs - https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/encrypt-log-data-kms.html

# Looks up the current account ID used in the key policy below
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "cloudwatch_kms" {
  # Lets IAM policies in this account grant access to the key
  statement {
    sid       = "Enable IAM User Permissions"
    actions   = ["kms:*"]
    resources = ["*"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"]
    }
  }

  # Lets CloudWatch Logs use the key, but only for log groups in this account
  statement {
    sid = "Allow cloudwatch to encrypt logs"
    actions = [
      "kms:Encrypt*",
      "kms:Decrypt*",
      "kms:ReEncrypt*",
      "kms:GenerateDataKey*",
      "kms:Describe*",
    ]
    resources = ["*"]

    principals {
      type        = "Service"
      identifiers = ["logs.${var.region}.amazonaws.com"]
    }

    condition {
      test     = "ArnEquals"
      variable = "kms:EncryptionContext:aws:logs:arn"
      values   = ["arn:aws:logs:*:${data.aws_caller_identity.current.account_id}:log-group:*"]
    }
  }
}

# KMS
resource "aws_kms_key" "cloudwatch" {
  description         = "Default Key for CloudWatch Log Groups"
  enable_key_rotation = true
  policy              = data.aws_iam_policy_document.cloudwatch_kms.json
}

# CloudWatch KMS Key Alias
resource "aws_kms_alias" "cloudwatch" {
  name          = "alias/cloudwatch-default-key" # alias names must begin with "alias/"
  target_key_id = aws_kms_key.cloudwatch.key_id
}

# CloudWatch Log Group for VPC Flow Logs
resource "aws_cloudwatch_log_group" "vpc_flow_logs" {
  count = var.env_name == "dev" ? 0 : 1 # If dev no flow logs, otherwise enable them

  name              = "${var.name_prefix}-vpc-flow-logs${local.name_suffix}"
  retention_in_days = var.cloud_watch_retention
  kms_key_id        = aws_kms_key.cloudwatch.arn # CloudWatch Logs expects the key ARN, not the alias ARN
}

# Module additional code
module "vpc" {
  # ... existing module arguments ...

  # VPC Flow Logs
  enable_flow_log                                 = var.env_name != "dev" # If dev no flow logs, otherwise enable them
  create_flow_log_cloudwatch_iam_role             = true
  flow_log_destination_type                       = "cloud-watch-logs"
  flow_log_file_format                            = "plain-text"
  flow_log_traffic_type                           = "ALL"
  flow_log_cloudwatch_log_group_retention_in_days = var.cloud_watch_retention
  # try() avoids an invalid index error in dev, where the log group has count = 0
  flow_log_destination_arn                        = try(aws_cloudwatch_log_group.vpc_flow_logs[0].arn, "")
}
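Once the logs are flowing, the next step is usually finding the dropped traffic. One way to keep that handy is a saved CloudWatch Logs Insights query, sketched below. It assumes the flow logs use the default record format (which is where field names such as `srcAddr` and `action` come from), and `flow_log_group_name` is a hypothetical input rather than something defined in the code above.

```hcl
# Hypothetical input: the name of the log group receiving the VPC Flow Logs.
variable "flow_log_group_name" {
  description = "Name of the CloudWatch log group that receives the VPC Flow Logs"
  type        = string
}

# Saved Logs Insights query that lists the most recent rejected flows.
resource "aws_cloudwatch_query_definition" "rejected_flows" {
  name            = "vpc-flow-logs/rejected-traffic"
  log_group_names = [var.flow_log_group_name]

  query_string = <<-EOT
    fields @timestamp, interfaceId, srcAddr, srcPort, dstAddr, dstPort, protocol, action
    | filter action = "REJECT"
    | sort @timestamp desc
    | limit 50
  EOT
}
```

A REJECT record points at a security group or NACL. If one direction of a conversation shows ACCEPT and the reply shows REJECT, a stateless NACL is a likely suspect, since security groups allow return traffic automatically.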

Our Networking Knowledge

You may have heard the phrase, “Forget everything you know about networking in the cloud”, but that phrase is a bit broad. It’s true in essence, but in practice it’s mostly wrong, and in my experience it depends on the day (just like everything else in IT). With cloud networking, specifically in AWS, the VPC is your network, unlike EC2-Classic, which won’t exist past this year.

AWS controls all aspects of your backend network, like the routers and switches, but you can always add your own router and replace some of their routers’ functions. ARP still exists in subnets in the VPC, route tables are easily customizable, you can use NACLs for stateless rules, and you can use security group rules for stateful segmentation.

As a matter of fact, you can even change your default gateway to route over a Site-to-Site (S2S) VPN, accidentally or on purpose. You still have routing protocols such as BGP in your VPC, so you can communicate with on-premises infrastructure or cloud-to-cloud. All of the basic networking concepts from the OSI Model still apply (excluding physical cables most of the time): packets are still subject to MTUs, and your machines must still be allowed to communicate with each other through whatever micro-segmentation is in place.
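To show how familiar these pieces look in Terraform, here is a minimal, standalone sketch of a Site-to-Site VPN that uses BGP and propagates the learned routes into a route table. All variable names and values are hypothetical, and the tunnel options and on-premises side are left out.

```hcl
# Hypothetical inputs describing the VPC and the on-premises router.
variable "vpc_id" { type = string }
variable "private_route_table_id" { type = string }
variable "on_prem_public_ip" { type = string }
variable "on_prem_bgp_asn" { type = number } # e.g. 65000

# AWS side of the VPN, attached to the VPC.
resource "aws_vpn_gateway" "this" {
  vpc_id = var.vpc_id
}

# Representation of the on-premises router.
resource "aws_customer_gateway" "on_prem" {
  bgp_asn    = var.on_prem_bgp_asn
  ip_address = var.on_prem_public_ip
  type       = "ipsec.1"
}

resource "aws_vpn_connection" "on_prem" {
  vpn_gateway_id      = aws_vpn_gateway.this.id
  customer_gateway_id = aws_customer_gateway.on_prem.id
  type                = "ipsec.1"
  static_routes_only  = false # false means routes are exchanged dynamically over BGP
}

# Routes learned over BGP are propagated into this route table automatically.
resource "aws_vpn_gateway_route_propagation" "private" {
  vpn_gateway_id = aws_vpn_gateway.this.id
  route_table_id = var.private_route_table_id
}
```

Route propagation is also how a default route can end up pointing at the VPN without anyone writing it by hand: if the on-premises router advertises 0.0.0.0/0, it shows up in the propagated route table.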

Common Things to Check When You’re Having Network Issues

When you’re doing networking in the cloud, here are a few common things to check that often turn out to be the root cause of the issue:

  • Security Groups - Unlike Azure and traditional networking, AWS doesn’t primarily operate with subnet-level firewall rules. Everything is tied to a security group, and you allow other security groups or CIDR blocks to communicate with the specific instance/ENI. Always make sure to double-check your security groups and remember they’re stateful.
  • NACLs - AWS offers network ACLs (NACLs), and the certifications advertise them heavily. StratusGrid doesn’t recommend NACLs in any environment unless high compliance is required. Because NACLs are stateless, return traffic on ephemeral ports has to be allowed explicitly, so they don’t play well with any subnet where internet access is required, and they wreak havoc on all sorts of small things.
  • The Subnet Type - Be sure to verify the type of subnet you’re putting an EC2 instance or an ENI in. Unlike NAT on an on-prem network, the modern managed NAT Gateway can’t forward inbound ports. So a server in a private subnet can’t have RDP or SSH exposed to the internet through the NAT Gateway unless the route table is modified. While modifying it is technically acceptable, it doesn’t meet many compliance requirements or AWS Service Validations.
  • Routing - When StratusGrid configures routing between other clouds/datacenters/AWS, we will almost always recommend a dynamic routing protocol such as BGP. This avoids the need for static routes, along with the more complicated networking and documentation that come with them.
  • Final Destination Local Firewall - Everything can be set up correctly, but a common mistake is not adjusting the firewall rules on the final destination. This can cause health checks to fail and the server to not work.
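For the security group item in the list above, here is a minimal, standalone sketch of one security group allowing traffic from another instead of from a CIDR block. The group names, the variable, and the PostgreSQL port are hypothetical choices for illustration.

```hcl
variable "vpc_id" { type = string } # hypothetical input

resource "aws_security_group" "app" {
  name_prefix = "app-"
  vpc_id      = var.vpc_id
}

resource "aws_security_group" "db" {
  name_prefix = "db-"
  vpc_id      = var.vpc_id
}

# Inbound on the database: only members of the app security group, on port 5432.
resource "aws_security_group_rule" "db_from_app" {
  type                     = "ingress"
  security_group_id        = aws_security_group.db.id
  source_security_group_id = aws_security_group.app.id
  protocol                 = "tcp"
  from_port                = 5432
  to_port                  = 5432
}

# Outbound on the app tier: Terraform-managed groups start with no egress rules,
# so the app needs an explicit rule to reach the database.
resource "aws_security_group_rule" "app_to_db" {
  type                     = "egress"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.db.id
  protocol                 = "tcp"
  from_port                = 5432
  to_port                  = 5432
}
```

Because security groups are stateful, no extra rule is needed for the database’s replies, which is the main practical difference from the NACL behavior described in the list.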

All of us have job descriptions. They are one of the most important pieces of information we use during the hiring process and annual reviews, and they serve as a compass to guide our day-to-day focus. However, it can become all too easy to limit yourself by using the description as a way to say, “Sorry, but that’s not my job.” Yes, our job is to stay focused on the assigned tasks at hand, but our job is also to innovate, support other teams, and challenge ourselves to grow beyond our current capacity.

Ready to Elevate Your Cloud and Network Solutions? Contact StratusGrid Today!

Are you facing challenges with network or cloud issues? StratusGrid is here to help. Our team of experts specializes in providing innovative solutions to complex cloud and networking problems. Whether you're looking to enhance your cloud infrastructure, troubleshoot network issues, or simply want to learn more about how networking can empower your cloud strategies, we've got you covered.

Contact us today and let our engineers guide you through your cloud and networking journey.
