10/15/2018
Indices are Important
I recently joined the biggest data project I’ve ever been a part of at work. It’s not necessarily huge by modern big data standards, but the main element I’m working on has over 130 million records and takes up more than 5 GB on HDFS. The full dataset for the project is over 160 GB.
This post is an attempt to document and motivate the practice of conscientiously indexing your datasets to improve preprocessing performance.
My main dataset is the netflow data published by the Los Alamos National Lab, and the task I’m going to focus on is cross-referencing that data with the red team data in order to tag netflow events.
For simplicity’s sake, let’s assume a netflow event is “bad” if a member of the red team logged into its source in the preceding hour. With this simple heuristic, the task of tagging an individual netflow event becomes a matter of querying the red team data for auth events within a set time window, with a set destination (the auth destination matches the netflow source).
I’ve been working with pandas, so here’s the above expressed with pandas DataFrames:
def netflow_is_bad(event, window='1 hour'):
    window = pd.Timedelta(window)
    redteam_auths = DF_REDTEAM[ (DF_REDTEAM.time <= event.time)
                              & (DF_REDTEAM.time >= (event.time - window))
                              & (DF_REDTEAM.dest == event.src) ]
    return not redteam_auths.empty
Here, we query the red team authentication events for records whose timestamp is no earlier than the start of the relevant window and no later than the time of the netflow event, and whose destination is the source of the netflow. We accomplish this with one big boolean mask.
This is expensive.
The mask for each of those three conditions costs O(R) to construct (where R is the size of the red team dataset), another O(R) to & together, and another O(R) to apply to the data frame (the step that actually extracts the relevant rows). This approach takes 3 minutes 45 seconds to tag 100,000 netflow events.
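For reference, the timings in this post are for tagging 100,000 netflow events; a rough sketch of how such a measurement might look (not my exact harness):

import time

start = time.perf_counter()
tags = [netflow_is_bad(event)
        for event in DF_NETFLOW.head(100_000).itertuples()]
print(f'tagged {len(tags):,} events in {time.perf_counter() - start:.1f}s')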

Raymond has the right of it: there must be a better way.
Basic Indexing
Fundamentally, both the netflow data and the red team data are time series datasets; our main query of the red team set is slicing by time. Fortunately, pandas, like many other data-management tools (SQL databases, for instance), has baked-in support for these semantics. In our case, that support is pandas’ DatetimeIndex.
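For a taste of what that buys us, here’s a toy series with a DatetimeIndex (the values are made up); time windows become plain slices:

import pandas as pd

s = pd.Series(
    range(4),
    index=pd.date_range('2018-10-15 09:00', periods=4, freq='15min'),
)
# grab everything between 09:15 and 09:45, endpoints included
s['2018-10-15 09:15':'2018-10-15 09:45']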
An index will help our program by making some of those linear-time operations essentially constant time. (More precisely, indexes are implemented as trees or hashtables depending on the library, so a lookup costs O(log R) or amortized O(1); since log R is vanishingly small next to R, we’ll hand-wave and call it O(1).)
Our red team data is already timestamped, so we can tell pandas to use that column as the index, rather than the default behavior, which is to create a RangeIndex that just numbers each row sequentially and adds no particularly useful information.
DF_REDTEAM = pd.read_csv(PATH_REDTEAM, index_col='time', **OTHER_PARAMS)
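One caveat: those time-slice queries want the index to hold real timestamps and to be sorted. I’m assuming OTHER_PARAMS takes care of the parsing; spelled out, the load might look something like this (parse_dates and the explicit sort are the additions):

DF_REDTEAM = pd.read_csv(
    PATH_REDTEAM,
    index_col='time',
    parse_dates=['time'],   # a DatetimeIndex, not plain integers
    **OTHER_PARAMS
).sort_index()              # .loc slicing wants a monotonic index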
Now, instead of masking to select the red team auth events in our window, we can just grab a slice:
def netflow_is_bad(event, window='1 hour'):
    window = pd.Timedelta(window)
    redteam_auths = DF_REDTEAM.loc[(event.time - window):event.time]
    return not redteam_auths[redteam_auths.dest == event.src].empty
This is not only a simpler, more readable implementation, but a much faster one: only 1 minute 19 seconds to tag 100,000 netflow events (nearly a 3x improvement). However, there’s still one more linear operation that we have the chance to make constant time: the source/destination comparison.
Multi-index
Enter the MultiIndex. A dataset with a multi-index can be queried in tiers, so to speak. The documentation I linked to uses multi-indexing and hierarchical indexing more or less interchangeably. Under this structure, to select a given data element, you query first by one key, then the next, and the next, and so on, for however many keys you set in the index.
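To make that concrete, here’s a toy frame shaped like our red team data (the rows are invented), indexed by (dest, time) and queried in tiers:

import pandas as pd

auths = pd.DataFrame(
    {'user': ['U1', 'U2', 'U3']},
    index=pd.MultiIndex.from_tuples(
        [('C586', pd.Timestamp('2018-10-15 09:10')),
         ('C586', pd.Timestamp('2018-10-15 09:40')),
         ('C17',  pd.Timestamp('2018-10-15 09:30'))],
        names=['dest', 'time'],
    ),
).sort_index()                 # sorted indexes keep slicing legal and fast

auths.loc['C586']              # tier one: every auth destined for C586
auths.loc['C586'].loc['2018-10-15 09:00':'2018-10-15 09:30']  # tier two: time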
In our case, we want to add the red team auth event’s destination to the index, so we can query by that in constant time. In pandas, it couldn’t be easier:
DF_REDTEAM = pd.read_csv(
    PATH_REDTEAM, index_col=['dest', 'time'], **OTHER_PARAMS)
Note how the order of index_col above corresponds to the order of our calls to .loc below. Now we can select by red team destination, and then by time:
def netflow_is_bad(event, window='1 hour'):
    window = pd.Timedelta(window)
    if event.src not in DF_REDTEAM.index:  # .loc raises KeyError on a miss
        return False
    return not DF_REDTEAM.loc[event.src]\
                         .loc[(event.time - window):event.time]\
                         .empty
I think this is the most readable implementation yet, and it’s also the fastest, taking only 21.9 seconds to tag 100,000 netflow events (a 10x speedup over the original).
The moral of the story is that you should give a moment of thought to the structure of your data, and how you’ll be querying it. Don’t just let pandas or whatever framework decide for you what your indexing structure will be. The power is in your hands!
A Note on Vectorization
In the above cases, I’ve been constructing the tag column with the following:
DF_NETFLOW['red'] = pd.Series(map(netflow_is_bad, DF_NETFLOW.itertuples()))
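I found this faster than the more obvious row-wise alternative; for comparison, the apply version would look something like this (a sketch; with axis=1 each row arrives as a Series, so the same .time and .src attribute access works):

DF_NETFLOW['red'] = DF_NETFLOW.apply(netflow_is_bad, axis=1)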
However, the ideal implementation is vectorized and operates on the whole data frame at once:
DF_NETFLOW['red'] = netflow_is_bad(DF_NETFLOW)
I couldn’t work this out, though, since the red team query depends on the source of each individual netflow event, and I haven’t yet figured out how to express that at broadcast scale. If we didn’t care about source, the current time-based query structure would actually broadcast fine, but alas, that would be a meaningless heuristic.
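If I revisit this, pandas’ merge_asof looks like the most promising direction: it joins each row of one frame to the most recent matching row of another, within a tolerance window. A sketch (unbenchmarked, and rt_dest/rt_hit are names I made up):

import pandas as pd

# assumes DF_NETFLOW is already in time order, as flow logs usually are;
# reset_index() recovers 'dest' and 'time' from the red team MultiIndex
rt = (DF_REDTEAM.reset_index()[['time', 'dest']]
                .rename(columns={'dest': 'rt_dest'})
                .sort_values('time'))
rt['rt_hit'] = True

merged = pd.merge_asof(
    DF_NETFLOW, rt,
    on='time',
    left_by='src', right_by='rt_dest',  # auth dest must equal netflow src
    tolerance=pd.Timedelta('1 hour'),   # only look back one hour
    direction='backward'                # latest auth at or before the event
)
DF_NETFLOW['red'] = merged['rt_hit'].fillna(False).to_numpy()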