Intezer Community Tip: How to Optimize ssdeep Comparisons with ElasticSearch

Written by Jonathan Abrahamy


    Why Standard Hash Functions Aren’t Helpful In Memory

    At Intezer, we specialize in analyzing code from memory to deal with injections, process hollowing, and other in-memory attacks. When doing so, we extract different memory items that need to be investigated. Because code loaded into memory from a file differs somewhat from its original on-disk structure, conventional hash-based threat intelligence services (such as VirusTotal) and file reputation databases become virtually ineffective for investigating these items: a hash function produces a unique result for each memory item. As a result, it is impossible to identify memory items using a standard hash function like sha256 or md5.

    Locality-Sensitive Hashing (Fuzzy Hashing)

    When dealing with items from memory, a powerful alternative to standard hashing is locality-sensitive hashing. These hash algorithms consider the structure of the data, so similar items receive similar hash results. Examples of such algorithms include sdhash and ssdeep. Using them, it is possible to connect memory items to each other and treat them as a group, enabling us to collect more comprehensive intelligence about them in the process.

    ssdeep Hash

    The ssdeep hash algorithm splits the file into chunks and runs a hash function on each of them. The results of these per-chunk hashes are combined into the final hash value. If only a few bytes of the file change, the hash value changes only slightly.

    The ssdeep hash is formatted as follows:

    chunksize:chunk:double_chunk
    The chunksize is an integer that describes the size of the chunks in the following parts of the ssdeep hash. Each character of the chunk represents a part of the original file of length chunksize. The double_chunk is computed over the same data as chunk, but with a chunk size of chunksize * 2. This is an example of a typical ssdeep hash:

    768:v7XINhXznVJ8CC1rBXdo0zekXUd3CdPJxB7mNmDZkUKMKZQbFTiKKAZTy:ShT8C+fuioHq1KEFoAU
    The ssdeep library has a "compare" function for comparing two ssdeep strings and grading their similarity as a number between 0 and 100.
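
    For reference, this is roughly how hashing and comparison look with the python-ssdeep bindings; the file names below are placeholders:

    # A minimal sketch using the python-ssdeep bindings (pip install ssdeep);
    # the file names are placeholders.
    import ssdeep

    hash_a = ssdeep.hash_from_file("sample_a.bin")
    hash_b = ssdeep.hash_from_file("sample_b.bin")

    # compare() grades the similarity of two signatures from 0 to 100.
    print(hash_a)
    print(hash_b)
    print("similarity:", ssdeep.compare(hash_a, hash_b))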

    Using ssdeep at Scale

    Unfortunately, running the ssdeep compare function on a very large number of files and memory items does not scale at all. Brian Wallace’s excellent article in Virus Bulletin describes optimizations that drastically improve the performance of finding similarities with ssdeep at scale, including:

    1. Only examining items whose chunksize is equal to, double, or half of the chunksize of the ssdeep hash to compare (chunksize, chunksize * 2, or chunksize / 2).

    2. Only examining items that share a common seven-character substring in their chunk or double_chunk with the ssdeep hash to compare.

    Using these two optimization rules, it is possible to drastically improve performance when trying to find similarities using ssdeep. In order to do so, the ssdeep data needs to be stored in a database.
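
    Before involving any database, the two rules can be illustrated with a naive, plain-Python pre-filter; the sketch below is only for illustration and is not part of Intezer’s proof of concept:

    # Naive illustration of the two pre-filtering rules; all_items is a
    # hypothetical iterable of ssdeep strings, not part of Intezer's POC.
    def seven_grams(s):
        """All seven-character substrings of s (the 'sliding window')."""
        return {s[i:i + 7] for i in range(len(s) - 6)}

    def candidates(query_ssdeep, all_items):
        chunk_size, chunk, double_chunk = query_ssdeep.split(":")
        chunk_size = int(chunk_size)
        query_grams = seven_grams(chunk) | seven_grams(double_chunk)

        for item in all_items:
            item_size, item_chunk, item_double = item.split(":")
            # Rule 1: chunksize must be equal, double, or half.
            if int(item_size) not in (chunk_size, chunk_size * 2, chunk_size // 2):
                continue
            # Rule 2: at least one common seven-character substring.
            if query_grams & (seven_grams(item_chunk) | seven_grams(item_double)):
                yield item

    Running the real ssdeep compare function only on such candidates is exactly what the rest of this post sets out to do at scale with ElasticSearch.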

    A Short Introduction to ElasticSearch

    ElasticSearch is an open source, distributed, JSON-based search and analytics engine which provides fast and reliable search results. It excels in free-text searches and is designed for horizontal scalability. ElasticSearch’s text search capabilities can be very useful for implementing the optimizations described above for ssdeep hash comparison. To understand the rest of the post, some basic familiarity with ElasticSearch is needed.

    Implementation of Optimization With ElasticSearch

    In order to implement the search algorithm, create an ElasticSearch index that stores the ssdeep hash values of the items in a way that enables the desired optimizations. First, break down the ssdeep hash into three fields: chunk_size, chunk, and double_chunk:

         "properties": {
            "chunk_size": {
              "type": "integer"
            },
            "chunk": {
              "type": "text"
            },
            "double_chunk": {
              "type": "text"
            },
            "sha256": {
              "type": "keyword"
            },
            "ssdeep": {
              "type": "keyword"
            }
          }
    

    Save the ssdeep hash itself so it can be used later with the ssdeep compare function to get a similarity grade.

    In this example, the sha256 of the item is used as an identifier.
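
    As a small, hypothetical illustration of how one item maps onto these fields (not taken from the proof of concept):

    # A sketch of building the document for one item; the file name and
    # helper are hypothetical, not part of Intezer's POC.
    import hashlib
    import ssdeep

    def build_document(file_path):
        with open(file_path, "rb") as f:
            data = f.read()

        ssdeep_hash = ssdeep.hash(data)
        chunk_size, chunk, double_chunk = ssdeep_hash.split(":")

        return {
            "chunk_size": int(chunk_size),
            "chunk": chunk,
            "double_chunk": double_chunk,
            "ssdeep": ssdeep_hash,                        # kept for the final compare step
            "sha256": hashlib.sha256(data).hexdigest(),   # used as the document id
        }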

    Next, in order to query which items have a common seven-character substring, use an NGram Tokenizer. In short, it tells ElasticSearch to take a certain text field and break it down into substrings of a certain size for search, like a ‘sliding window’ over the text. For example, if the field has the text value “Intezer” and the chosen ngram size is 3, then queries containing any of [“Int”, “nte”, “tez”, “eze”, “zer”] will match the document. Therefore, if the NGram Tokenizer for the chunk and double_chunk fields is set with an ngram size of 7, items that match the second optimization condition can quickly be found.

    This is the updated mapping:

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "ssdeep_analyzer": {
              "tokenizer": "ssdeep_tokenizer"
            }
          },
          "tokenizer": {
            "ssdeep_tokenizer": {
              "type": "ngram",
              "min_gram": 7,
              "max_gram": 7
            }
          }
        }
      },
      "mappings": {
        "_default_": {
          "_all": {
            "enabled": false
          },
          "dynamic": "strict",
          "properties": {
            "chunk_size": {
              "type": "integer"
            },
            "chunk": {
              "analyzer": "ssdeep_analyzer",
              "type": "text"
            },
            "double_chunk": {
              "analyzer": "ssdeep_analyzer",
              "type": "text"
            },
            "sha256": {
              "type": "keyword"
            },
            "ssdeep": {
              "type": "keyword"
            }
          }
        },
        "record": {}
      }
    }
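
    Creating the index from this mapping with the official elasticsearch Python client could look roughly like the sketch below; the index and file names are assumptions, the _default_ mapping type above targets older ElasticSearch versions (5.x/6.x), and argument names vary slightly between client versions:

    # A sketch of creating the index from the mapping above; "ssdeep-index"
    # and "ssdeep_mapping.json" are assumptions, not names from Intezer's POC.
    import json
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    with open("ssdeep_mapping.json") as f:   # the mapping shown above
        mapping = json.load(f)

    es.indices.create(index="ssdeep-index", body=mapping)

    # Each item is then stored with es.index(), using its sha256 as the
    # document id and the fields shown earlier as the document body.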
    

    The first condition (chunk_size is equal to, double, or half of the chunk_size of the ssdeep hash to query) can also be added to the query. Eventually, when checking a certain ssdeep hash for similarity, the query looks like this (with chunk_size, chunk, and double_chunk taken from the hash being checked):

     {
      "query": {
        "bool": {
          "must": [
            {
              "terms": {
                "chunksize": [chunksize, chunksize * 2, chunksize / 2]
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "match": {
                      "chunk': {
                        "query": chunk
                      }
                    }
                  },
                  {
                    "match": {
                      "double_chunk": {
                        "query": double_chunk
                      }
                    }
                  }
                ],
                "minimum_should_match": 1
              }
            }
          ]
        }
      }
    }
    

    All that remains is to run the ssdeep compare function on the values returned from the query, a group significantly smaller than the set of all items.
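
    Putting it all together, a lookup could be sketched as follows, assuming the index above and the python-ssdeep bindings; the index name and similarity threshold are placeholders, and exact client arguments may differ between ElasticSearch client versions:

    # A sketch of the full lookup: run the filtering query above, then run
    # the real ssdeep comparison only on the returned candidates. The index
    # name and threshold are assumptions, not part of Intezer's POC.
    import ssdeep
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    def find_similar(query_ssdeep, index_name="ssdeep-index", threshold=1):
        chunk_size, chunk, double_chunk = query_ssdeep.split(":")
        chunk_size = int(chunk_size)

        body = {
            "query": {
                "bool": {
                    "must": [
                        {"terms": {"chunk_size": [chunk_size, chunk_size * 2, chunk_size // 2]}},
                        {
                            "bool": {
                                "should": [
                                    {"match": {"chunk": {"query": chunk}}},
                                    {"match": {"double_chunk": {"query": double_chunk}}},
                                ],
                                "minimum_should_match": 1,
                            }
                        },
                    ]
                }
            }
        }

        # A real implementation would also handle paging over large result sets.
        results = es.search(index=index_name, body=body)
        for hit in results["hits"]["hits"]:
            score = ssdeep.compare(query_ssdeep, hit["_source"]["ssdeep"])
            if score >= threshold:
                yield hit["_source"]["sha256"], score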

    Our team at Intezer has created a small proof of concept for this solution; you can find it in our GitHub repository. It contains the mapping and a Python implementation of the described solution. We’re happy for members of the information security and developer communities to use it.

    The Bottom Line

    Using ssdeep to find similarities between files can be quite effective when employing the right optimization methods. Using ElasticSearch to achieve these optimizations seems to be an easy and scalable solution, especially since ElasticSearch is doing all the ‘heavy lifting’ when it comes to searching and finding the relevant items.


    About Intezer:

    Through its ‘DNA mapping’ approach to code, Intezer provides enterprises with unparalleled threat detection that accelerates incident response and eliminates false positives, while protecting against fileless malware, APTs, code tampering and vulnerable software.

    Curious to learn what’s next for Intezer? Join us on our journey toward achieving these endeavors here on the blog or contact us.

    Jonathan Abrahamy

    Jonathan Abrahamy is a former officer in an elite information security unit of the Israeli Defense Forces. He currently serves as development manager and senior software engineer at Intezer, with eight years of experience in software design and development.
