OpenShift/Kubernetes Logging Overview

During a meeting, it was brought up to me that the OpenShift/Kubernetes logging strategy isn’t very concise. While looking into this, I wanted to put some context around the technology. “How does OpenShift capture logs?” “What is captured and logged?” “What are my recommendations for using the logging system?”

EFK Stack

EFK stands for Elasticsearch (E), Fluentd (F), and Kibana (K). This is a modification of the traditional ELK stack (which uses Logstash in place of Fluentd) that has become popular in recent years for log aggregation, collection and sorting. Kibana acts as the user interface for the collected logs. Elasticsearch is the search and analytics engine. Fluentd is a unified logging system with hundreds (500+ as of the time of this writing [1]) of plugins.

What is captured

Looking through the Kubernetes documentation, it becomes a bit clearer what is captured where, and how application logs are managed at the container level. The section titled ‘Logging at the node level’ [2] explains that “Everything a containerized application writes to stdout and stderr is handled and redirected somewhere by a container engine. For example, the Docker container engine redirects those two streams to a logging driver, which is configured in Kubernetes to write to a file in json format.” The OpenShift documentation states: “Fluentd reads from /var/log/messages and /var/log/containers/.log for system logs and container logs, respectively. You can instead use the systemd journal as the log source. There are three deployer configuration parameters available in the deployer ConfigMap.” [3]. For additional information and resources on Fluentd, I strongly recommend watching the ‘OpenShift Commons’ videos from May 17, 2017 [4].
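To make the "file in json format" part concrete, here is a minimal sketch of reading such a file. It assumes the line shape written by Docker's json-file logging driver (one JSON object per line with "log", "stream" and "time" keys). The sample lines and the function names are made up for illustration and are not part of OpenShift or Fluentd.

```python
import json

# Made-up sample lines in the shape Docker's json-file driver writes:
# one JSON object per line with "log", "stream" and "time" keys.
SAMPLE_LINES = [
    '{"log":"server started\\n","stream":"stdout","time":"2017-05-17T12:00:00.000000000Z"}',
    '{"log":"connection refused\\n","stream":"stderr","time":"2017-05-17T12:00:01.000000000Z"}',
]


def parse_container_log(lines):
    """Yield (time, stream, message) tuples from json-file log lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # The driver keeps the trailing newline the application wrote.
        yield record["time"], record["stream"], record["log"].rstrip("\n")


def split_by_stream(lines):
    """Group messages by the stream (stdout or stderr) they came from."""
    grouped = {"stdout": [], "stderr": []}
    for _, stream, message in parse_container_log(lines):
        grouped.setdefault(stream, []).append(message)
    return grouped


if __name__ == "__main__":
    for time, stream, message in parse_container_log(SAMPLE_LINES):
        print(f"{time} [{stream}] {message}")
```

This is roughly the raw material a collector such as Fluentd picks up from the node before enriching it and shipping it to Elasticsearch.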

Cluster-wide vs. Project Logging

This is not a simple question to answer. I’m drawing upon my experience working with other clustered technologies and customers that have implemented OpenShift. My recommendation is to do what’s right for your environment. I know… not very useful. Hopefully my heuristics will lead you to your answer.

Business Reasons

More often than not, your company, organization, group and team have their own structure. I’ve worked with companies and agencies that have had every sort of organically grown business structure. Some are extremely independent, some centralized, and some ignorant of the structure entirely. We have to consider how you do business today, what will actually work, and how we can fit into that system. What are the security requirements? What are the data retention rates? What is the disaster recovery strategy?

Technical Reasons

Suppose I were to propose that every project have its own EFK stack to manage only its logs. A customer running 100+ projects would have a LOT of redundancy, and the overhead for a security team to manage and track logs could be prohibitively expensive and complicated. How does a security team monitor the creation of new projects, validate their access, and ultimately ensure the security and compliance of the systems?

If I proposed one giant company-wide EFK stack, it would lighten the burden for some but could cause data management and growth complications. The security team is happy, because they have one login on one server to see all the system and application logs being generated by the containers and applications. Let me assume for a minute a less common use case for OpenShift: batch processing. I want to use this platform to run an ETL function on a file I have stored in S3. A project or job that lives and dies on my whim might introduce long-lived stale data into my logging system and tool chain. The point of a job is to run and be gone, so I might not care about the details.

While I was working as a US Army consultant in 2011, we were implementing Splunk. Working through the data ingest rates and figuring out what was good data and what was stale was complicated, and we had fairly static workloads. Working through all the requirements will likely guide you in the right direction. I suggest pruning down to what is important and measuring it often, aiming for a high signal-to-noise ratio. This typically means smaller units, or project-based logging. It becomes quite daunting to measure every job, application and container in your environment on an ongoing basis. Offload that responsibility to the application and project owners.
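One way to start measuring is to see how much log data each project produces on a node. The sketch below is only an illustration under a stated assumption: that files under /var/log/containers are named `<pod>_<namespace>_<container>-<64-hex container id>.log`. Check that against your own nodes before relying on it. The sample file names, sizes and the function name are hypothetical.

```python
import re
from collections import Counter

# Assumed naming convention for files under /var/log/containers:
#   <pod>_<namespace>_<container>-<64-hex container id>.log
LOG_NAME = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)"
    r"-(?P<id>[0-9a-f]{64})\.log$"
)

# Made-up (file name, size in bytes) pairs for illustration.
SAMPLE = [
    ("web-1-abcde_storefront_nginx-" + "a" * 64 + ".log", 4096),
    ("etl-job-x7k2p_batch_loader-" + "b" * 64 + ".log", 1048576),
    ("api-2-fghij_storefront_api-" + "c" * 64 + ".log", 2048),
]


def bytes_per_project(files):
    """Sum log sizes by namespace (project) from (name, size) pairs."""
    totals = Counter()
    for name, size in files:
        match = LOG_NAME.match(name)
        if match is None:
            # Keep unparsable names visible instead of silently dropping them.
            totals["<unparsed>"] += size
            continue
        totals[match.group("namespace")] += size
    return totals


if __name__ == "__main__":
    for project, size in bytes_per_project(SAMPLE).most_common():
        print(f"{project}: {size} bytes")
```

In this made-up sample, a single batch job outweighs the long-running storefront project, which is the kind of signal that helps decide who should own which logs and for how long.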

Since I mentioned Splunk, I think it is important to point out the following section of the OpenShift documentation as well: ‘Configuring Fluentd to Send Logs to an External Log Aggregator’. You can configure Fluentd to send a copy of its logs to an external log aggregator, and not the default Elasticsearch, using the secure-forward plug-in. From there, you can further process log records after the locally hosted Fluentd has processed them [3].

REFERENCES

Fluentd plugins list
[1] https://www.fluentd.org/plugins/all

Logging at the node level
[2] https://kubernetes.io/docs/concepts/cluster-administration/logging/#logging-at-the-node-level

Aggregate Logging – OpenShift Docs
[3] https://docs.openshift.com/container-platform/3.3/install_config/aggregate_logging.html

Fluentd – OpenShift Commons Briefing
[4] https://blog.openshift.com/openshift-commons-briefing-72-cloud-native-logging-fluentd/

Intro to CloudForms Tags

This blog post was originally posted by me on Blogger (8/26/2016).

CloudForms Intro

Red Hat CloudForms offers unified management for hybrid environments, providing a consistent experience and functionality across container-based infrastructures, virtualization, private and public cloud platforms. Red Hat CloudForms enables enterprises to accelerate service delivery by providing self service, including complete operational and lifecycle management of the deployed services. It provides greater operational visibility through continuous discovery, monitoring and deep inspection of managed resources. And it ensures compliance and governance via automated policy enforcement and remediation. All the while, CloudForms is reducing operational costs, reducing or eliminating the manual processes that burden IT staff.

For more information visit http://redhat.com/cloudforms

Tags

I think tags are one of the most important features of Red Hat CloudForms. CloudForms’ ability to tag resources for later use in reporting, chargeback/showback and automation is critical for getting more in-depth knowledge and generating laser-focused reports that provide value.

In this article I am going to touch on general guidelines I use when building a tag schema. I believe there are two rules when talking about CloudForms tags. First, it’s better to over-tag your resources than to under-tag them. Second, if you can measure it, you can manage and monitor it. Just like any data structure, a well thought out schema will save you a lot of work.

Tag Schema Recommendations


Business Tags:

The most important thing about your business tag schema is that it makes sense to you and your company. The examples I list below are a very rough approximation of what your business might look like and how it might operate. Think about logically grouped business resources and come up with a tag for them.

Business Unit
– Sales (North America)
– Engineering
– QA
– Marketing
– IT Development
– IT Operations
Business Project
– oVirt
– OpenStack
– Project Phoenix
Business Owner
– VP – Linus Torvalds
– Project Owner – Richard Stallman
– Manager of Marketing (East) – Doris Hopper

IT Operations Tags

If you’re reading this blog post, these tags probably matter most to you. Remember, measure what matters most to you. Help the business understand your value and what you do. I know it, you hopefully know it, let them know it too.

Infrastructure
– VMware
  – Production Systems
  – Development Lab
– Solid State Drive (SSD)
– Dell Hardware
Software
– Database
  – PostgreSQL 5
  – Oracle DB 11G
– Web Server
– CRM
SLA
– Diamond
– Gold
– Silver
– Bronze

Site (Geographic)
– New York Datacenter
– Hong Kong Datacenter
– Zone A – Virginia DC
– Zone B – Virginia DC

Change Tags

Imagine if IT and the business came to an agreement on service windows based on what worked for each business unit. This can happen. Maybe you want to have test deployments on production resources. Tagging change windows into your resources will help with reporting and also automation.

Patch Window

– Patch Window A (Second Tuesday of the month)
– Patch Window B (Last Sunday of the month)
– Canary Deployment Environment

SLA

A service level agreement (SLA) is a contract between a service provider (either internal or external) and the end user that defines the level of service expected from the service provider.

– Diamond
– Gold
– Silver
– Bronze

Security Tags

Security tagging is something I’m still working through. I know there is value in creating a Security Role, Group and Users for Dashboard and Reporting. I’m looking at linking this into Policies and exporting log events to a SIM (Security Information Management).

– Information Assurance Security Team Check
  – IA Validated
  – IA Not Checked
– Service Catalog Provisioned (Provisioned Machines Certified Gold Master)
– Satellite Verified
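
Pulling a few of the categories from this article together, a tag schema can be sketched as a plain data structure with a small check for under-tagged resources. This is only an illustration in Python, not the CloudForms API: the category keys, the choice of required categories and the audit_tags function are all hypothetical.

```python
# Hypothetical schema: category -> allowed values, using examples from this article.
TAG_SCHEMA = {
    "business_unit": {
        "Sales (North America)", "Engineering", "QA",
        "Marketing", "IT Development", "IT Operations",
    },
    "sla": {"Diamond", "Gold", "Silver", "Bronze"},
    "site": {"New York Datacenter", "Hong Kong Datacenter"},
    "patch_window": {
        "Patch Window A", "Patch Window B", "Canary Deployment Environment",
    },
}

# Assumption for this sketch: every resource must carry at least these categories.
REQUIRED = ("business_unit", "sla", "site")


def audit_tags(resource_tags):
    """Return a list of problems found in one resource's tags.

    `resource_tags` maps category -> value. An empty list means the
    resource has every required category and only known values.
    """
    problems = []
    for category in REQUIRED:
        if category not in resource_tags:
            problems.append(f"missing required tag: {category}")
    for category, value in resource_tags.items():
        allowed = TAG_SCHEMA.get(category)
        if allowed is None:
            problems.append(f"unknown category: {category}")
        elif value not in allowed:
            problems.append(f"unknown value for {category}: {value}")
    return problems


if __name__ == "__main__":
    vm = {"business_unit": "QA", "sla": "Platinum"}
    for problem in audit_tags(vm):
        print(problem)
```

Keeping the schema in one place like this makes it easier to spot under-tagged resources before they show up as gaps in a report.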

Call to Action
I would love to find out what you are doing with your tagging schema. Contact me so we can discuss.