
Common AWS Internet Gateway misconfigurations

Introduction

An Amazon Web Services (AWS) Internet Gateway (IGW) allows external users and endpoints to initiate communication with your AWS resources, such as Elastic Compute Cloud (EC2) instances and containers. This post talks about the common configuration mistakes we’ve seen users make when setting up an IGW and how you can use Batfish Enterprise to easily troubleshoot these issues.

Two common mistakes

We’ll take a look at two common IGW configuration errors that can cause EC2 instances to lose internet connectivity:

  • You forgot to add the default route in the subnet table

This is the most common mistake that users make while setting up an IGW. If the subnet route table for the EC2 instance doesn’t include a default route to the IGW, traffic from the instance has no path to the internet.

  • You forgot to assign a public IP to the EC2 instance

If you fail to attach a public IP address to the EC2 instance, the instance will not be internet-addressable.
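Both conditions can also be checked directly against configuration data. The following is a minimal Python sketch, not part of Batfish Enterprise, which assumes the route table and the instance are dictionaries shaped like the EC2 DescribeRouteTables and DescribeInstances responses. The IDs and addresses in the sample data are hypothetical.

```python
def has_default_route_to_igw(route_table):
    """Return True if the table has an active 0.0.0.0/0 route targeting an IGW."""
    return any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and route.get("GatewayId", "").startswith("igw-")
        and route.get("State") == "active"
        for route in route_table.get("Routes", [])
    )


def has_public_ip(instance):
    """Return True if the instance record carries a public IPv4 address."""
    return bool(instance.get("PublicIpAddress"))


# Hypothetical sample data: a route table with only the local route,
# and an instance that has no public IP address.
sample_route_table = {
    "RouteTableId": "rtb-example",
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    ],
}
sample_instance = {"InstanceId": "i-example", "PrivateIpAddress": "10.0.2.10"}

if not has_default_route_to_igw(sample_route_table):
    print("Mistake 1: no default route to an IGW in", sample_route_table["RouteTableId"])
if not has_public_ip(sample_instance):
    print("Mistake 2: no public IP on", sample_instance["InstanceId"])
```

A check like this only tells you that a setting is missing. It does not show where along the path a packet is actually dropped.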

For the purposes of this post, assume that you’ve hit an internet connectivity issue after making a configuration change, and you’re not sure what happened. Using the steps described here, we’ll see how Batfish Enterprise can help you detect each of these errors as a root cause, in ways that are faster and more meaningful than simply running multiple CLI commands.

Common troubleshooting challenges

Before we dive into the Batfish way of troubleshooting your AWS configuration, let’s briefly identify some of the most common pitfalls with traditional troubleshooting methods.

  • Juggling multiple commands. When a connectivity issue occurs, you typically have to run a sequence of time-consuming individual checks, such as pings, traceroutes, and separate checks for NACL, SG, and routing configuration. By running multiple checks and getting only partial details from each one, you’re left stitching all of the information together manually in order to derive your conclusions.
  • Incomplete information. Many traditional checks return only partial information. For example, a ping or traceroute can give you status information, such as whether the host was reachable—but if it’s unreachable, you still won’t know the reason why.
  • Lack of directional visibility. When you use ping or traceroute to check for host reachability, any issue it reports could be on either the forward path or return path.

How Batfish Enterprise is different

With Batfish Enterprise, you can quickly and accurately identify the root cause, all from a single source. (To see Batfish Enterprise Virtual Traceroute in action, check out this video.)

Batfish Enterprise provides:

  • A single pane of glass for running necessary checks
  • Detailed hop-level insights into how, where, and why a packet is being processed, including:
    • Routing
    • Security rules (Security Groups, Network ACL)
    • NAT
  • Bi-directional visibility for identifying whether an issue is on the forward or return path.
  • Traceroute capabilities for all traffic, including TCP, UDP, ICMP, and so on.

Our AWS setup

Figure 1 shows the AWS setup we’ll use for our two Batfish troubleshooting scenarios. We have a single Virtual Private Cloud (VPC), prod, with two instances, web1 and web2, each in its own subnet. The VPC also has a dedicated jump server, jump1, in its own management subnet.

Note: If you want to replicate this setup for yourself, the Terraform files & instructions that we have used are available here. Terraform is a popular infrastructure-as-code (IaC) tool that you can optionally deploy to stand up your AWS infrastructure. Alternatively, you can build your own setup using a different tool, such as the AWS console, AWS CLI, AWS CDK, Ansible, or Pulumi.

 

Figure 1: AWS infrastructure topology rendered by Batfish Enterprise

Troubleshooting with Batfish Enterprise

Okay, enough information! Let’s get started on our two scenarios and see how Batfish Enterprise can help us quickly identify the root cause in each one. As we stated earlier, both scenarios share a common problem: the EC2 instance is unable to reach the internet.

Scenario 1: Default route to IGW not present in subnet routing table

Let’s start troubleshooting our first problem with instance prod-web1 using Batfish Enterprise Traceroute. The instance is unable to reach the internet, so we’ll use Google DNS (8.8.8.8) as the destination in the traceroute.

Traceroute input is very simple. You just need to specify which instance you are initiating the traceroute From, the Destination of interest, and the Application, as shown in Figure 2.

Figure 2: Traceroute Input

 

Batfish Enterprise will run a Virtual Traceroute using these inputs and show us the result, as shown in Figure 3.

Figure 3: Traceroute result

 

We can see that the packet goes from prod-web1 to its corresponding subnet, prod-pub1-sub, and gets dropped there.

We now understand that something is wrong at the subnet router. In the left sidebar, we can see an interesting message under the Hop that says No Route, indicating that there is no route available on the prod-pub1-sub route table to reach the internet. This tells us that we need to look at the subnet routing table, which we can do by clicking the View all routes option.

 

Figure 4: Subnet route table

 

Figure 4 shows the route table of prod-pub1-sub. There is no default route to the IGW, which is the root cause of our problem.
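To see why a missing default route produces a No Route result, it helps to recall that a route table forwards a packet using the most specific prefix that covers the destination. The sketch below is a simplified illustration in Python, not Batfish Enterprise code. The VPC CIDR (10.0.0.0/16) and the IGW identifier are assumptions made for the example and are not taken from the setup in Figure 1.

```python
import ipaddress


def lookup_route(route_table, destination):
    """Return the most specific route covering destination, or None if no route matches."""
    dst = ipaddress.ip_address(destination)
    matches = [
        route for route in route_table
        if dst in ipaddress.ip_network(route["cidr"])
    ]
    if not matches:
        return None  # corresponds to the "No Route" disposition
    # Longest-prefix match: the route with the largest prefix length wins.
    return max(matches, key=lambda route: ipaddress.ip_network(route["cidr"]).prefixlen)


# Assumed tables: the broken one has only the VPC-local route.
broken_table = [{"cidr": "10.0.0.0/16", "target": "local"}]
fixed_table = broken_table + [{"cidr": "0.0.0.0/0", "target": "igw-example"}]

print(lookup_route(broken_table, "8.8.8.8"))   # no matching route
print(lookup_route(fixed_table, "8.8.8.8"))    # default route to the IGW
print(lookup_route(fixed_table, "10.0.1.5"))   # local route still wins inside the VPC
```

With only the local route present, 8.8.8.8 matches nothing and the packet is dropped at the subnet router. Adding 0.0.0.0/0 gives every non-local destination a next hop without changing how intra-VPC traffic is routed.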

The fix is easy—just go to your AWS console and update the subnet routing table, or if you are using Terraform, update the definition file so that the subnet routing table has a default route to the IGW when it is created.

Note: If you are replicating our setup using the Terraform files available here, you’ll see that we have included the solution as a comment within the Terraform code. To fix the problem, uncomment this solution.

Scenario 2: EC2 instance does not have a public IP

To take a look at our second problem, let’s run Batfish Enterprise Traceroute from prod-web2, as shown in Figure 5.

Figure 5: Traceroute output

As we can see, the packet goes from prod-web2 all the way to prod-igw and gets dropped there. In the left sidebar, we can see that the packet is being Denied in. To find out why, let’s click on View testFilters results, which takes us to the testFilters page shown in Figure 6.

The testFilters results provide a detailed view of the behavior of the packet filter (security group, network ACL, and so on) and show us why a packet is being denied or allowed.

Figure 6: testFilters output

Here’s the most interesting part of this problem: the message Denied private instance IPs NOT associated with a public IP under Trace. This message indicates that there is no public IP associated with prod-web2. That’s the root cause of our problem in this scenario.

To fix this problem, we need to assign a public IP address to the prod-web2 instance.
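The reason the IGW drops this packet is that an internet gateway performs one-to-one address translation between an instance’s private IPv4 address and its public one. If no public address is associated, there is nothing to translate to. The following Python sketch models that behavior in a simplified way. The function name, the packet fields, and all addresses are hypothetical and chosen only for illustration.

```python
def igw_outbound(packet, public_ip_by_private_ip):
    """Model an IGW handling an outbound packet with one-to-one source NAT."""
    private_ip = packet["src_ip"]
    public_ip = public_ip_by_private_ip.get(private_ip)
    if public_ip is None:
        # No public IP is associated with the source, so the packet is denied.
        return {"action": "DENIED",
                "reason": f"{private_ip} is not associated with a public IP"}
    # Rewrite the source address to the associated public IP and forward.
    return {"action": "FORWARDED", "packet": {**packet, "src_ip": public_ip}}


# Hypothetical associations: only one instance has a public IP.
associations = {"10.0.1.10": "203.0.113.10"}

with_public_ip = {"src_ip": "10.0.1.10", "dst_ip": "8.8.8.8"}
without_public_ip = {"src_ip": "10.0.2.10", "dst_ip": "8.8.8.8"}

print(igw_outbound(with_public_ip, associations))
print(igw_outbound(without_public_ip, associations))
```

In this model, assigning a public IP amounts to adding an entry to the association map, after which the same packet is translated and forwarded.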

Note: If you are replicating our setup using the Terraform files available here, you’ll see that we have included the solution as a comment within the Terraform code. To fix the problem, uncomment this solution.

What success looks like

So far, we have seen examples that failed. Now you might be interested to see what a successful end-to-end traceroute looks like in Batfish Enterprise and contrast that with what a regular traceroute from a server console looks like.

First, let’s log into prod-jump1 and run a traceroute to Google DNS (8.8.8.8), as shown in Figure 7.

Figure 7: Traceroute from CLI

Now let’s compare that with Virtual Traceroute using Batfish Enterprise as shown in Figure 8.

Figure 8: Traceroute output (Forward path)

The basic view shows the forward path from prod-jump1 to Google DNS. But what if something was wrong in the return path? How would we see what is happening there?

Just click on Bidirectional in the Traceroute view and see the full power of the Batfish Enterprise traceroute.

 

Figure 9: Traceroute output (bi-directional)

As we can see in Figure 9, we have path visibility in both directions now, and we can see that both the forward and return flow would be delivered. This type of visibility is very important when troubleshooting access to a particular service, where the problem could be in the return path.
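Conceptually, the return path is evaluated for the mirror image of the forward flow: source and destination addresses and ports are swapped, and that reversed flow must pass routing and filtering on its own. The Python sketch below illustrates only the swap. The Flow class and its field names are hypothetical, the addresses and ports are made up, and address translation at the IGW is deliberately left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    src_ip: str
    dst_ip: str
    protocol: str
    src_port: int
    dst_port: int


def return_flow(flow):
    """Build the flow a reply would use by swapping the endpoints of the forward flow."""
    return Flow(
        src_ip=flow.dst_ip,
        dst_ip=flow.src_ip,
        protocol=flow.protocol,
        src_port=flow.dst_port,
        dst_port=flow.src_port,
    )


# Hypothetical forward flow: a DNS query from a private address to 8.8.8.8.
forward = Flow("10.0.0.10", "8.8.8.8", "UDP", 49152, 53)
print(forward)
print(return_flow(forward))
```

A stateless filter such as a network ACL has to permit the reversed flow explicitly, which is one reason a forward-only check can miss the real problem.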

Another great option with Batfish Enterprise Traceroute is the topological view. Many users prefer to see the path laid out on the topology rather than as a linear list of hops. Click Show topology to see the topological view, as shown in Figure 10.

 

Figure 10: Traceroute Topological View (Bi-Directional)

Summary

Traditional tools used to debug connectivity issues are not well suited for the cloud. Even a simple routing table misconfiguration requires a user to run multiple commands and then piece all of the associated information together in order to identify the root cause. Batfish Enterprise greatly simplifies this process by providing an easy-to-use, intuitive single pane of glass, while adding visibility through bi-directional and topological traceroute views.

Learn more

Write us today to learn more about Batfish Enterprise and what we can do for your company!
