CloudNativeCon: Day 1 Sessions

I tried to attend mostly beginner sessions, though it was a little difficult to tell which ones were intended as such. I ended up in two ‘how we started using Kubernetes’ talks back to back, from Box and Buffer. Then came a session on Helm for managing charts (package descriptions), one on logging with Fluentd, and one from Cisco on monitoring application performance.

Box created their own scripted deployment system that seems very similar to what Helm does (because they started before Helm existed). They serve their images from Artifactory instead of Docker Hub because it gives them a private registry. They use Jsonnet templates with a Makefile that iterates over each JSON file and fills in the specifics for each environment. They then use Applier to synchronize the generated JSON files in GitHub with the desired pod configuration in Kubernetes, so that they don’t end up changing the clusters when they don’t need to. Flannel provides the networking layer that bridges one-IP-per-server hosts and the one-IP-per-container model Kubernetes expects, though they’re interested in looking at Project Calico instead. They use Linkerd for service discovery backed by Kubernetes, and namespaces to run multiple virtual clusters in one physical cluster.

Buffer had a bunch of great wisdom regarding the different logging solutions they’d tried with Kubernetes. First they tried Elasticsearch with Fluentd, but they ended up filling their logging storage with all of the events produced; rather than digging into making the events more granular and rewriting their application to do stateful logging better (they were still in the prototype phase), they chose to try a different solution. Second they tried sidecar logging containers feeding an AWS Kinesis Firehose, but this led to really complex pod configuration, difficulty managing secrets, and a lot of network overhead from the extra logging containers. They finally settled on a Kubernetes DaemonSet log collector that tails the container logs and sends the output over TCP to a centralized resource using Fluentd and AWS Kinesis. They use Prometheus for monitoring and Helm for deployment. One of the best improvements to developer productivity was having standard naming conventions for Kubernetes labels, secrets, logs, and services. They’re investigating Linkerd and adding TLS to make their containers more secure.
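As a rough sketch (not from the talk), a DaemonSet-style Fluentd collector typically tails the container log files present on each node and forwards them to a central aggregator. The paths, tags, and hostname below are illustrative, not Buffer’s actual config:

```
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  format json
</source>

<match kubernetes.**>
  # forward to a central collector; a Kinesis output plugin could go here instead
  @type forward
  <server>
    host log-aggregator.example.com
    port 24224
  </server>
</match>
```

Because one such collector runs per node, the application pods themselves need no logging containers, which is what removes the sidecar overhead described above.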

Helm seems like a pretty straightforward package deployment manager. It has a templating engine and a CLI that lets you easily list, install, and upgrade the packages you want to run in your cluster. It deploys Tiller onto your cluster, and Tiller handles the particulars of the package you want to install. If you do an upgrade, it’ll transfer the data from the container running the old version to the container running the new version (keeps the old database container?). Helm chart packages are a good way to learn Kubernetes best practices, and they’re looking for more people to contribute different packages.
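For context (my sketch, not from the talk): a Helm chart is essentially a directory of templated Kubernetes manifests plus metadata, roughly laid out like this, with all names below being made up:

```
mychart/
  Chart.yaml        # chart metadata: name, version, description
  values.yaml       # default values substituted into the templates
  templates/        # Kubernetes manifests with templating placeholders
    deployment.yaml
    service.yaml
```

The templating engine fills `values.yaml` settings into the files under `templates/`, which is what makes the same chart reusable across environments.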

I found the Fluentd and Cisco talks to be a little opaque. The Fluentd talk had some good initial information about logging patterns, but quickly veered into logging on embedded devices, Go examples, and requests for contributors. The Cisco talk didn’t seem to use any of the other Kubernetes ecosystem resources; it was about their own (open-source) measurement bus and setting up events to be put onto that bus for application alerts. These were probably talks meant for a more intermediate audience, but they had some interesting thoughts for a newbie like myself.

Overall a really interesting set of breakout sessions, and I’ve seen mention of several others that were really good. It’s always a great conference when there are multiple tracks you wish you could see at the same time. 🙂

CloudNativeCon: Keynotes

Today I’m attending CloudNativeCon, co-located with KubeCon for all things Kubernetes. I was given a diversity ticket to attend by Women Who Code and I’m excited to be here! I know only the basics of Kubernetes, having mostly used the Hadoop ecosystem instead.

So my goal here is to figure out how I could do the same things I do with Hadoop (primarily Spark and Impala) in a Kubernetes ecosystem that would also let me run jobs more easily (Hive seems impenetrable, and I’m not sure it’s even the correct Hadoop ecosystem tool for running jobs).

The keynotes were an interesting look at the history of Kubernetes. I was at a Scalability meetup over a year ago that presented on Kubernetes, and I thought it was neat then. It seems there’s been a lot of activity since. The uniqueness of Kubernetes seems to be that each worker machine reaches out to the master machine to get the work it can do, rather than the master delegating jobs to the workers.

The Cloud Native Computing Foundation (CNCF) seems to have a lot of plans to improve training, grow the community, and absorb new projects as they prove really useful to the ecosystem. The plans with the Linux Foundation to offer training on edX sound super interesting for someone with very little experience in Kubernetes.

Box Co-founder Sam Ghods gave a keynote defining platforms and why cloud providers aren’t platforms (they don’t abstract enough of the messy stuff away) and why Kubernetes is a platform (it allows you to deploy on all different kinds of cloud providers and bare metal, it has load balancing, scaling, deployment, and remote storage). Optimized for all applications and all infrastructures with a stable API and a great community!

Googler Chen Goldberg gave a great keynote about the details of the Kubernetes project, including how the community has changed over time. They’re starting a Kubernetes Developer Onboarding Program to continue to develop the community – goo.gl/ebtSgJ It was great to see a woman on stage! I liked that their Special Interest Group model is designed to “Grow Leadership”.

So far it’s interesting. Looking forward to the beginner talks about how to start with Kubernetes.

Picard UpdateVcfSequenceDictionary and SortVcf

So if you have a VCF that isn’t sorted according to chromosome, you may want to fix that.  One would think that you could just do that by using SortVcf in Picard with a sequence dictionary.  But, no.  Instead you have to do two steps:

  1. java -jar picard.jar UpdateVcfSequenceDictionary I=<in.vcf> O=<updated.vcf> SD=<dict>
  2. java -jar picard.jar SortVcf I=<updated.vcf> O=<out.vcf> SD=<dict>

You can’t pipe them either, so you’ll just have to delete updated.vcf after you’ve finished generating out.vcf.

Python Gene List Subset

Here’s a Python snippet of how to subset gene lists.  If you have two lists of gene names (NCBI EntrezGene), then you can do an intersection of those genes to use as a selector to find the expression value for the genes in a Pandas DataFrame.


import pandas as pd

g1 = ['SLC1', 'PSEM', 'KLK1', 'IL']
g2 = ['SLC1', 'KLK3', 'IL', 'OR3', 'TTN']
gr = set(g1).intersection(g2)  # genes common to both lists
expr_vals = {'SLC1': [0.1, 0.2, 0.3, 0.4], 'PSEM': [0.5, 0.6, 0.7, 0.8],
             'KLK3': [0.9, 1.0, 1.1, 1.2], 'IL': [1.3, 1.4, 1.5, 1.6]}
edf = pd.DataFrame(data=expr_vals).transpose()  # genes become the row index
edf[edf.index.isin(gr)]  # keep only rows for the genes of interest


Now you have a gene list (gr) of the genes of interest and a data frame (edf) of the gene expression data for those genes.
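The same set machinery handles the reverse question — which genes appear in only one of the lists. A small sketch using the same made-up gene names as above:

```python
g1 = ['SLC1', 'PSEM', 'KLK1', 'IL']
g2 = ['SLC1', 'KLK3', 'IL', 'OR3', 'TTN']

# Set difference: genes unique to each list.
only_g1 = sorted(set(g1) - set(g2))
only_g2 = sorted(set(g2) - set(g1))
print(only_g1)  # ['KLK1', 'PSEM']
print(only_g2)  # ['KLK3', 'OR3', 'TTN']
```

These difference sets can be used as an `isin` selector on the same DataFrame index, just like the intersection above.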