Resource requirements

The TruffleHog scanner supports concurrency. By default, it uses a concurrency value equal to the number of CPU cores on your machine. The detection engine fully utilizes this concurrency, but only some source integrations are themselves concurrent. Sources that fetch data via APIs, such as Slack, Jira, and Confluence, may be rate-limited on the API server side and therefore may not saturate your CPU.
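For example, the default can be overridden with TruffleHog's --concurrency flag. A minimal sketch (the repository URL is a placeholder, and the trufflehog line is shown commented rather than executed):

```shell
# Detect the core count (TruffleHog's default concurrency) and pass it
# explicitly. getconf works on both Linux and macOS.
CORES=$(getconf _NPROCESSORS_ONLN)
echo "Detected ${CORES} cores"
# Placeholder repository URL; substitute your own target:
# trufflehog git https://github.com/example/repo --concurrency="${CORES}"
```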

  • CPU: 4 cores or more
  • Memory: 16GB or more
  • Storage: 10GB or more in the system’s temporary directory
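On Linux, a quick way to check your machine against these minimums (the commands assume a typical GNU userland with /proc available):

```shell
# CPU cores (aim for 4+)
getconf _NPROCESSORS_ONLN
# Total memory in GB (aim for 16+), read from /proc/meminfo
awk '/MemTotal/ {printf "%d GB\n", $2/1024/1024}' /proc/meminfo
# Free space in the system temporary directory (aim for 10GB+)
df -h "${TMPDIR:-/tmp}"
```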

Resource Calculator

To help you estimate the scan time for your specific setup, we’ve created a resource calculator spreadsheet. Enter your data size and desired scan interval, and the spreadsheet gives you a rough idea of the resources your scans will require.

Instructions:

  1. Click the link below to access the resource calculator spreadsheet:

  2. Important: Before using the calculator, make a copy of the spreadsheet for your own use.

  3. Enter interval (hours): Indicate the desired scan completion interval in hours.

  4. Enter size of source being scanned: Specify the size of the source data that will be scanned in gigabytes (GB).

Based on the combination of these two factors, the spreadsheet automatically calculates the CPU cores and memory required to complete the scan within the specified interval.
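As a rough sketch of the arithmetic behind such an estimate: required cores scale with data size divided by (interval × per-core throughput). The throughput figure below is invented purely for illustration; the spreadsheet uses its own calibrated numbers.

```shell
SIZE_GB=500           # size of source being scanned
INTERVAL_HOURS=24     # desired scan completion interval
GB_PER_CORE_HOUR=1    # assumed per-core throughput, illustration only
# Ceiling division: cores needed to finish within the interval
CORES=$(( (SIZE_GB + INTERVAL_HOURS * GB_PER_CORE_HOUR - 1) / (INTERVAL_HOURS * GB_PER_CORE_HOUR) ))
echo "Estimated cores: ${CORES}"   # prints "Estimated cores: 21"
```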

The color of the CPU cores and Memory cells represents the difficulty and cost of provisioning those resources, from easy to difficult.

Note that this is a rough and rather subjective estimate based on factors like:

  • Ease of acquiring additional compute resources
  • Typical ratios of CPU cores to memory
  • Scalability and expandability of existing systems
  • Organizational policies/restrictions around provisioning

Green: Provisioning these resources should be relatively easy and inexpensive.

Red: Provisioning these resources may be more difficult and/or more expensive.

Please note: these are rough estimates. Actual scan times may vary depending on several factors, including:

Hardware:

  • Storage type (SSD vs. HDD)
  • CPU architecture
  • Cache size

Software:

  • Operating system
  • Background applications

Network:

  • Network bandwidth
  • Network latency
  • Network congestion

Other:

  • Data type (compressed data may be slower)
  • Scan configuration (attachment scanning, user scanning in GitHub, etc.)

The temporary directory used for cloning repositories for scanning can be changed via the $TMPDIR environment variable on Linux and macOS (Darwin).
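For example, to point the clone scratch space at a dedicated directory (the path below is an arbitrary example; pick any location with 10GB+ free):

```shell
# Create a dedicated scratch directory and point TMPDIR at it before
# running trufflehog. The path is an arbitrary example.
export TMPDIR="${TMPDIR:-/tmp}/trufflehog-scratch"
mkdir -p "$TMPDIR"
# trufflehog git https://github.com/example/repo
```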

Dependencies

This setup requires specific tools for effective operation. Git is essential for repository management, while rpm2cpio, binutils, and cpio are necessary for extracting files from .rpm and .deb package formats.

  • Git: For cloning repositories.
  • rpm2cpio: To extract content from RPM packages.
  • binutils: Includes the “ar” tool, crucial for extracting contents from .deb files.
  • cpio: A versatile file archiver utility, compatible with various archive formats including .rpm and .deb.

Installing Dependencies on Ubuntu

To install these dependencies on an Ubuntu system, follow these steps:

  1. Open a terminal.
  2. Update your package lists to ensure you get the latest version available:
    sudo apt update
    
  3. Install the required packages:
    sudo apt install git rpm2cpio binutils cpio
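After installation, you can confirm each tool is on your PATH (note that .deb extraction uses the ar command, which the binutils package provides):

```shell
# Report which of the required tools are available.
for tool in git rpm2cpio ar cpio; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: ok"
  else
    echo "$tool: MISSING"
  fi
done
```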