Listen to the interview with our engineer: {{< audio src="https://s3.chaops.de/shitops/podcasts/solving-dns-resolution-issues-at-scale-with-microsoft-gnmi-juniper-mainframe-mesh-self-hosting-lambda-functions-and-open-source.mp3" class="audio" >}}
Introduction
DNS resolution is a critical part of the network infrastructure of any tech company. It resolves human-readable domain names into IP addresses (and vice versa), but at the cost of adding latency to network requests, which can in turn hurt the performance of the applications that depend on it.
Recently, our tech company ShitOps faced DNS resolution issues at scale, due to the growing number of services added to our network infrastructure. We realized that the traditional approach of using a single central DNS server was no longer sufficient to handle this scale.
In this blog post, I will describe how we solved this problem by designing a new architecture that combines Microsoft, GNMI, Juniper, Mainframe, Mesh, Self Hosting, Lambda Functions, and Open Source tools. For ease of understanding, I will break down the solution into five different stages:
- Collecting data from all DNS resolution sources in the network.
- Storing the collected data in a centralized database.
- Configuring Juniper switches based on the stored data.
- Implementing self-hosted mesh networks to optimize routing.
- Dynamically deploying and managing the solution using open-source tools.
Let’s dive deep into each stage and understand the technical implementation of the solution.
Stage 1: Collecting data from all DNS resolution sources in the network
In order to handle the DNS resolution issues at scale, we realized that it was essential to monitor all the DNS resolution sources in our network. These sources included:
- Legacy on-premise mainframes running proprietary DNS resolution systems.
- Legacy distributed DNS servers deployed across various data centers.
- Cloud-based DNS servers deployed on multiple cloud platforms.
We chose GNMI (gRPC Network Management Interface) to collect data from all of these sources. GNMI is an interface that provides read and write access to configuration and state data on network devices using gRPC (a remote procedure call framework running over HTTP/2). It is an open specification, scales easily, and has client libraries for a wide range of programming languages, including Python, Java, and Go.
We built a custom Python script that used the GNMI interface to collect real-time DNS resolution information from all of these sources. The collected data was then sent to a centralized database for further analysis.
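To give a flavor of what that collector does, here is a minimal sketch of the flattening step. The payload shape, source names, and paths below are illustrative assumptions, not our production schema; real gNMI clients return richer structures, but they need the same kind of flattening before the data can be stored.

```python
# Flatten a gNMI GetResponse-style payload into flat (source, path, value)
# records ready for insertion into the central database.
# NOTE: this payload layout is a simplified assumption for illustration.

def flatten_gnmi_updates(payload):
    """Return a list of (source, path, value) tuples from a gNMI-style payload."""
    records = []
    for notification in payload.get("notification", []):
        source = notification.get("source", "unknown")
        for update in notification.get("update", []):
            records.append((source, update["path"], update["val"]))
    return records

# Hypothetical payload from a legacy mainframe DNS resolver:
sample = {
    "notification": [
        {
            "source": "dns-mainframe-01",
            "update": [
                {"path": "dns/cache/app.example.com", "val": "10.1.2.3"},
                {"path": "dns/cache/db.example.com", "val": "10.1.2.4"},
            ],
        }
    ]
}

records = flatten_gnmi_updates(sample)
```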
Stage 2: Storing the collected data in a centralized database
After collecting real-time DNS resolution information from all sources, the next step was to analyze and store it in a centralized database where it could be accessed by other components of the system.
We used Microsoft SQL Server as our centralized database due to its ability to handle large data volumes, high availability, and support for in-memory database structures.
We developed a custom Python script that read the GNMI output and stored it in the SQL Server database for further processing. The stored data included information such as domain names, IP addresses, TTL values, and source servers.
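The storage step looks roughly like the sketch below. For portability this example uses SQLite; our production script targets Microsoft SQL Server (via a driver such as pyodbc), but the schema and bulk-insert logic have the same shape. Table and column names are illustrative, not our actual schema.

```python
# Sketch of the storage step, using SQLite as a portable stand-in for
# Microsoft SQL Server. Schema and names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE dns_records (
           domain     TEXT    NOT NULL,
           ip_address TEXT    NOT NULL,
           ttl        INTEGER NOT NULL,
           source     TEXT    NOT NULL
       )"""
)

def store_records(conn, records):
    """Bulk-insert (domain, ip, ttl, source) tuples collected via GNMI."""
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO dns_records (domain, ip_address, ttl, source) "
            "VALUES (?, ?, ?, ?)",
            records,
        )

store_records(conn, [
    ("app.example.com", "10.1.2.3", 300, "dns-mainframe-01"),
    ("db.example.com", "10.1.2.4", 60, "cloud-dns-eu-1"),
])
count = conn.execute("SELECT COUNT(*) FROM dns_records").fetchone()[0]
```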
Stage 3: Configuring Juniper switches based on the stored data
Juniper switches are widely used in tech companies due to their reliability, scalability, and security features. In this stage, we wrote a custom Python script that automated the Juniper switch configuration process based on the stored DNS resolution data to optimize the network routing.
The script read data from the Microsoft SQL Server database and configured Juniper switches using the Junos API. It optimized network routing by selecting the best route based on real-time traffic load, and it also ensured redundant paths were available in case of any network failures.
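As a rough illustration of the configuration step, the sketch below turns stored DNS records into Junos-style static-route `set` commands. This is a simplification: the real script pushed structured configuration through the Junos PyEZ API rather than emitting flat CLI commands, and the record fields and next-hop addresses here are hypothetical.

```python
# Sketch: generate Junos-style "set" commands from stored DNS records.
# Illustrative only; the production script used the Junos PyEZ API.

def build_route_commands(records):
    """records: iterable of (domain, ip, next_hop) tuples pulled from the DB.

    Returns one static-route set command per resolved address.
    """
    return [
        f"set routing-options static route {ip}/32 next-hop {next_hop}"
        for _domain, ip, next_hop in records
    ]

cmds = build_route_commands([
    ("app.example.com", "10.1.2.3", "192.0.2.1"),
    ("db.example.com", "10.1.2.4", "192.0.2.2"),
])
```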
Stage 4: Implementing self-hosted mesh networks to optimize routing
A mesh network is a decentralized network topology that connects devices dynamically, without a central controlling authority. We realized that implementing self-hosted mesh networks could further optimize the routing process by selecting the best available route based on real-time traffic load.
We used open-source tools such as Envoy, Istio, and Kubernetes to implement a self-hosted mesh network infrastructure across our data centers. The mesh network ensured that available bandwidth was fully utilized, latency was minimized, and overall application performance improved.
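The routing idea behind the mesh can be sketched as a shortest-path problem: model the data centers as a graph whose edge weights are current traffic load, and pick the least-loaded path. The topology and load numbers below are made up for illustration; in production this decision happens inside the Envoy/Istio data plane, not in a Python script.

```python
# Sketch of load-aware route selection: Dijkstra over a graph whose edge
# weights represent current traffic load. Topology and loads are invented.
import heapq

def least_loaded_path(mesh, src, dst):
    """mesh: dict-of-dicts, mesh[a][b] = current load on link a->b.

    Returns (total_load, path) for the least-loaded path from src to dst.
    """
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, load in mesh.get(node, {}).items():
            if neighbor not in seen:
                heapq.heappush(queue, (cost + load, neighbor, path + [neighbor]))
    return float("inf"), []

# Hypothetical four-data-center mesh with per-link load figures:
mesh = {
    "dc-1": {"dc-2": 5, "dc-3": 2},
    "dc-2": {"dc-4": 1},
    "dc-3": {"dc-2": 1, "dc-4": 7},
}
cost, path = least_loaded_path(mesh, "dc-1", "dc-4")
```

Note that the direct-looking hop dc-3 → dc-4 loses to the longer but less loaded detour through dc-2, which is exactly the behavior we wanted from the mesh.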
Stage 5: Dynamically deploying and managing the solution using open-source tools
As a tech company, we always strive to use the latest and most innovative open-source tools in our work. For dynamic deployment and management of our DNS resolution system, we used a combination of Jenkins, Ansible, and GitLab.
We built a custom Jenkins pipeline that used Ansible to deploy the solution to multiple data centers in parallel. The pipeline code was stored in GitLab and ran automatically whenever a new change was pushed to the repository.
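The shape of that pipeline is roughly the declarative Jenkinsfile fragment below. Stage names, the inventory path, and the playbook name are illustrative placeholders, not our actual repository layout.

```groovy
// Illustrative Jenkinsfile sketch; names and paths are hypothetical.
pipeline {
    agent any
    stages {
        stage('Checkout') {
            // Triggered automatically by a GitLab push webhook
            steps { checkout scm }
        }
        stage('Deploy to data centers') {
            steps {
                // Ansible fans the rollout out across all data centers
                sh 'ansible-playbook -i inventories/datacenters deploy-dns.yml'
            }
        }
    }
}
```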
Conclusion
In conclusion, we solved our DNS resolution issue at scale by building a complex architecture that combined Microsoft, GNMI, Juniper, Mainframe, Mesh, Self Hosting, Lambda Functions, and Open Source tools. We broke down the solution into five different stages and described the technical implementation of each stage.
Although this solution may seem over-engineered with a high level of complexity for some, we are confident that it is the optimal way to handle our network infrastructure's scaling issues, and we are proud of our innovation in addressing the problem.
We hope you have enjoyed reading this blog post and learned something new about how we solve problems at ShitOps. Stay tuned for more exciting updates from us!
Comments
TechieTom commented:
Wow, this sounds like a really comprehensive solution to a problem I didn’t even know could be so complex! Kudos to the ShitOps team for tackling such an intricate issue.
NetworkNerd replied:
I agree, Tom! The integration of so many different technologies is impressive. I'm curious about the performance impact of using such a diverse set of tools.
CloudyClaire commented:
This is truly a next-level solution leveraging so many technologies at once! I’m especially interested in how the self-hosted mesh networks help optimize the routing. Are there any measurable improvements in latency or bandwidth utilization after implementing it?
Bob the Great Engineer (Author) replied:
Hi Claire! Yes, we've seen significant reductions in latency, with improvements of up to 20% in some cases. Bandwidth utilization is also more balanced across our network thanks to the mesh setup.
SysAdminSteve replied:
Thanks for the insight, Bob. I'm considering implementing a similar setup. Did you face any major challenges in setting up the mesh network?
DevOpsDan commented:
The combination of GNMI and Python for data collection is clever. I’ve mostly seen GNMI used in more niche implementations. How difficult was it to integrate this data into Microsoft SQL Server?
CodeLover replied:
I’m curious too, Dan. It sounds like a pretty seamless setup but I wonder about the automation scripts they used.
Bob the Great Engineer (Author) replied:
Great question! We had to write custom scripts to parse and transform the GNMI data for SQL, but once that was set up, the integration was quite smooth.
OldSchoolPeter commented:
I find it fascinating that you are still using mainframes in this context. In many cases, they seem passé. How do they fit into the modern infrastructure at such a scale?
ServerSally replied:
I had the same thought, Peter! Maybe legacy systems still hold some irreplaceable value?
AutomationAmy commented:
It's great to see such a detailed breakdown of deploying with open-source tools. How do Jenkins, Ansible, and GitLab make the deployment more efficient compared to other methods?
Bob the Great Engineer (Author) replied:
Hey Amy! Using Jenkins for automation together with Ansible for configuration management allows us to reduce manual errors and speeds up the deployment process considerably. GitLab ensures we keep all our changes version-controlled.