Better netflow visualization with code_swarm coolness!

Howdy all,

In my last post, I may have mentioned codeswarm, a nifty tool for visualizing how frequently a software project gets updated over time. Since it’s an open-source project, I figured it was worth having a look at the code and seeing whether it could be put to other uses.

If you check out the Google Code page, you’ll notice that the project isn’t terribly active – the last upload dates back to May 2009. But hey, it does what it’s supposed to do and it’s pretty straightforward.

In fact, reading through the source files shows that using the tool is super simple: you set up an XML file that contains the data to be used, you run Ant, and you let the program do the rest. The format of the sample data is very simple: a file name, a date, and an author.
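For reference, here’s roughly what that XML looks like, based on the sample data that ships with the project (dates, if memory serves, are Unix timestamps in milliseconds):

<?xml version="1.0"?>
<file_events>
 <event filename="/src/main.c" date="1243814400000" author="alice" />
 <event filename="/src/util.c" date="1243900800000" author="bob" />
</file_events>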

So let’s see what other uses we could come up with. Here are a few ideas I thought might be cool:

  • What about adapting it to track your social media messages? First, if you’re following a lot of people, it would look wicked cool. Second, if you’re trying to prune your Follow list, that could be really practical for figuring out who’s the noisiest out there.
  • Sometimes when you’re trying to figure out bottlenecks in your traffic, it’s useful to have a decent visualization tool. Maybe this could be helpful!
  • Finally, you sometimes need a good way to track employee activities. Would this not be a kickass way to see who’s active on your network?

I decided to work on the second idea. I’m not looking to rework the code at this point, just to reuse it with a different purpose.

Prerequisites

To pull this off, you’re going to need the following:

  • The codeswarm source code and Java, so that you can run the code on your system
  • Some netflow log files to test out
  • flow-tools, so that you can process said netflow log files
  • A scripting language so that you can process and parse the netflow traffic into XML. My language of choice was Ruby, but it could be as simple as bash.

The netflow filter

Before we can parse the netflow statistics into the appropriate format, we need to decide what data we’ll use and how to extract it. Here’s the mapping I settled on: each IP endpoint in a flow gets its own line; the IP address maps to the “author” field (because that’s what’s visible on screen), the protocol and port map to the “filename” field, and the octets in the flow map to the “weight” field.
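To make that concrete, a single flow record becomes two events, one per endpoint. With made-up values, a TCP flow like

10.0.0.5:34512 -> 192.168.1.20:80, protocol 6, 123000 octets

would come out as:

<event filename="6_34512" date="..." author="10.0.0.5" weight="123"/>
<event filename="6_80" date="..." author="192.168.1.20" weight="123"/>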

The following is the netflow report config file. You should save this in the codeswarm directory as netflow_report.config:

stat-report t1
 type ip-source/destination-address/ip-protocol/ip-tos/ip-source/destination-port
 scale 100
 output
  format ascii
  fields +first
stat-definition swarm
 report t1
 time-series 600

If you save some netflow data in data/input, you can test out your report by running this line:

flow-merge data/input/* | flow-report -s netflow_report.config -S swarm
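If all goes well, flow-report prints comma-separated records, one per flow, preceded by comment lines starting with # (including a “# recn:” header that names the columns). Given the report type above, a data row should look roughly like this (values made up):

967458490,10.0.0.5,192.168.1.20,34512,80,6,0,1,123000,95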

Parsing the netflow

If the report worked out correctly for you, the next logical step is to write the code that creates the XML file codeswarm will parse. You’ll want to set your input directory (which we said would be data/input) and your output file (for instance, data/output.xml).

Here’s the source code for my grabData.rb file:

#!/usr/bin/env ruby
# Prepare netflow data for codeswarm.

$outputFilePath = "data/output.xml"
$outputFile = File.new($outputFilePath, "w")
$outputFile << "<?xml version=\"1.0\"?>\n"
$outputFile << "<file_events>\n"

# Grab the netflow information using flow-tools.
$inputDirectory = "data/input"
$input = `flow-merge #{$inputDirectory}/* | flow-report -s netflow_report.config -S swarm`

# This is the part that gets a bit dicey. To properly visualize the traffic,
# we add an entry for each party of the flow: the "author" is the IP address,
# the "filename" is the protocol and port, and the weight is the octets.
$input_array = $input.split("\n")

# Drop the comment/header lines that flow-report prints before the data.
$input_array.reject! { |line| line.start_with?("#") }

$input_array.each do |line|
 fields = line.split(",")
 next if fields.size < 9  # skip blank or malformed lines

 first   = fields[0]  # flow timestamp (the +first field from the report)
 source  = fields[1]
 dest    = fields[2]
 srcport = fields[3]
 dstport = fields[4]
 proto   = fields[5]
 octets  = fields[8].to_i / 1000  # scale the weight down to something manageable

 # One event per endpoint of the flow.
 $outputFile << " <event filename=\"#{proto}_#{srcport}\" date=\"#{first}\" author=\"#{source}\" weight=\"#{octets}\"/>\n"
 $outputFile << " <event filename=\"#{proto}_#{dstport}\" date=\"#{first}\" author=\"#{dest}\" weight=\"#{octets}\"/>\n"
end

$outputFile << "</file_events>\n"
$outputFile.flush
$outputFile.close

And we’re done! This should generate a file called data/output.xml, which you can then feed to codeswarm. Either edit your data/sample.config file or copy it to a new one, point it at the output file, and run ./run.sh.
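If you go the copy route, the one line that really has to change is the input pointer; in my copy of the config it looked like this (key name from memory, so double-check against your sample.config):

InputFile=data/output.xml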

Reality Check

I was really excited running my first doctored code swarm; unfortunately, though the code worked as expected, the performance was terrible. This was because the sample file I used was rather large (over 10K entries), probably considerably more than what the authors had expected for code repository check-ins. I also suspect that my somewhat flimsy graphics card couldn’t handle real-time rendering of the animation, so I set up the config file to save each frame to an image so I could reassemble the animation later. The ffmpeg syntax for that is:

ffmpeg -r 10 -b 1800 -i %03d.jpg test1800.mp4

Moreover, I believe my scale was off; I changed the number of milliseconds per frame to 1000 (one frame per second).
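In config terms, the snapshot-plus-slower-clock combination amounted to something like the following (again from memory; key names may differ slightly in your codeswarm version):

MillisecondsPerFrame=1000
TakeSnapshots=true
SnapshotLocation=frames/code_swarm-#####.png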

The second rendering was much more interesting, but it did yield a heck of a lot of noise; let’s not forget that we’re working with hundreds, if not thousands, of IP addresses. However, if we do a little filtering we can probably make the animation significantly more readable.
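One cheap way to do that filtering is in grabData.rb itself, before the events ever reach the XML. Here’s a minimal sketch; the helper and its 100,000-octet threshold are my own invention, so tune to taste:

# Drop flows below a minimum byte count so only the heavy talkers get animated.
MIN_OCTETS = 100_000

def heavy_enough?(fields)
 fields[8].to_i >= MIN_OCTETS
end

# Then, inside the main loop, right after splitting the fields:
#  next unless heavy_enough?(fields)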

All in all, this was a rather fun experience but a bit of a letdown. Codeswarm wasn’t meant to handle this volume of data, which makes things tricky and less readable than I expected. If you play with your filters you’ll definitely be able to see some interesting things, but if you’re looking for a way to visually suss out what’s happening on your entire network, you’re bound to be disappointed. Next time, I hope to talk a bit about more appropriate real-time visualization tools for netflow and pcap files, and maybe even cut some code of my own.

Times (and blogs), they are a-changing…

Hello, one and all!

Welcome to the new FAIL tales! I hope you like our new digs. Simpler yet more sophisticated, methinks. I’ve really grooved on Blogger these past years; it’s easy to post, easy to customize, and, since it’s maintained by someone else, it’s a cinch to administer. But it’s once again time for a change, time to move on to bigger and better things.

I was once told by an exceptionally insightful professor that, as computer scientists, we were going to be learning throughout our entire careers. “Computer Science is all about change,” he would say, “get used to the idea that you’ll constantly have to re-learn everything you think you know.”

At least, he said something like that (memories fade, and this was over a decade ago). The scary thing is, I think he’s right. I’ve been into computers since I was nine years old, and I haven’t seen the industry slow down by a single MHz. Technology is getting cheaper and more accessible by the minute, and innovation is happening faster and faster; slow down in a field like this one and you might as well retire.

Welcome to my new blog – stay a while. I hope you like it here 🙂