
Repo Crawler

The Repo Crawler is Cyral's data and account discovery tool. Use it to scan your specified repositories to find:

  • data locations (for example, database tables and columns) that contain sensitive data; and
  • database accounts that have connected recently, and when they connected. This helps you detect, for example, accounts that should have been deprovisioned but remain active.

To use the Repo Crawler, the main steps are:

  • Create a Cyral API access key and gather database connection details.
  • Install and run the Repo Crawler.
  • Use the results:
    • Automatic data map suggestions: When the crawler finds a data location you might want to protect, it gives you the option to include that location in a Data Map protected by your policies.
    • Local account discovery: When the crawler finds database user accounts with access to your data, it shows the results in the repository's tab in the Cyral control plane UI.

Prerequisites

Cyral API access key for the crawler

Set up an API access key for the crawler. This is an account in Cyral that the Repo Crawler will use to connect to Cyral.

  1. In the Cyral control plane UI, click API Access Keys in the lower left.

  2. Click the ➕ button. Give the account a name, and give it the following permissions:

    • For all Repo Crawlers:
      • Repo Crawler permission
    • For automatic data map suggestions:
      • View Datamaps permission
      • Modify Policies permission
    • For local account discovery:
      • View Sidecars and Repositories permission
      • Modify Sidecars and Repositories permission

    Click Create.

  3. Copy or store the Client ID and Client Secret so that you can use them later in this procedure.

    Cyral recommends that you store them in AWS Secrets Manager. Note the ARN of the secret; you will pass this ARN later as the CyralSecretARN.

    Store the credentials in the format shown here:

    {
    "client-id": "...",
    "client-secret": "..."
    }
    note

    If you lose these values, you can generate new ones using the Rotate Client Secret button in the Edit API Client window. In the Cyral control plane UI, click API Access Keys, click the name of your API access key, and click Rotate Client Secret.
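
    For example, if you use the AWS CLI, the following is a minimal sketch of storing these credentials in AWS Secrets Manager. The secret name cyral/repo-crawler/api-creds is an arbitrary example; the ARN printed in the command output is the value you will pass later as the CyralSecretARN:

    aws secretsmanager create-secret \
      --name cyral/repo-crawler/api-creds \
      --description "Cyral API access key for the Repo Crawler" \
      --secret-string '{"client-id": "...", "client-secret": "..."}'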

Database connection setup for crawler

  1. Make sure the database service is connected to and accessible through Cyral, as explained in Track a repository.

  2. For each database service to be scanned, find or create an account on the database service. We call this the local database account. This account must have the following permissions:

    • read permissions on all tables and columns to be scanned

    • sufficient permissions to read all the user accounts across all databases on the database service to be scanned. This is required for local account discovery.

  3. Store the local database account credentials:

    • Store the local database account credentials in AWS Secrets Manager (see the format below and the example command after this list) and provide the secret's ARN as the RepoSecretARN in the Repo Crawler deployment template.

      {
      "username": "...",
      "password": "..."
      }
    • Alternatively, you can provide the local database account username and password directly as the RepoUsername and RepoPassword during crawler deployment, later in this procedure. When you use this option, the deployment creates an AWS Secrets Manager secret in the format shown above containing the provided credentials.

  4. Have ready the following connection details for your database:

    • RepoName: Name of the data repository, as saved in Cyral control plane
    • RepoType: Type of repository, like "PostgreSQL".
    • RepoHost, RepoPort: Cyral sidecar host and port where the crawler will connect to the repository.
    • RepoDatabase: Name of the database the Repo Crawler will connect to.
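
If you chose the AWS Secrets Manager option in step 3, the following is a minimal sketch of creating the secret with the AWS CLI. The secret name cyral/repo-crawler/repo-creds is an arbitrary example; the ARN printed in the command output is the value you will pass as the RepoSecretARN:

  aws secretsmanager create-secret \
    --name cyral/repo-crawler/repo-creds \
    --description "Local database account credentials for the Repo Crawler" \
    --secret-string '{"username": "...", "password": "..."}'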

Install and run the Repo Crawler

Cyral provides three deployment options for the Repo Crawler: CloudFormation, Terraform, and an express deployment.

The CloudFormation and Terraform approaches install the Repo Crawler as an AWS Lambda function, including all of its dependencies such as IAM permissions, while the express deployment installs the Repo Crawler as a Docker container on the host system.

Install and run the Repo Crawler using CloudFormation

  1. Plan your CloudFormation stack deployment in AWS. The VPC and subnet where you deploy must provide:

    • Access to the internet
    • Network access to the Cyral control plane
    • Network access to the repositories you will monitor
    • Network access to AWS APIs, including AWS Secrets Manager, S3, and DynamoDB.
  2. Create the CloudFormation stack in AWS.

    • For Template source, choose Amazon S3 URL

    • For Amazon S3 URL, specify the Cyral Repo Crawler template download URL as follows:

      https://cyral-public-assets-<region>.s3.<region>.amazonaws.com/cyral-repo-crawler/cyral-repo-crawler-cft-latest.yaml

      where region is one of us-east-1, us-east-2, us-west-1, or us-west-2

      note

      You also have the option to use a versioned path for the crawler template. Form the versioned URL according to the following general format:

      https://cyral-public-assets-<region>.s3.<region>.amazonaws.com/cyral-repo-crawler/<version>/cyral-repo-crawler-cft-<version>.yaml

      where version is your desired version as discussed with Cyral support, for example, v0.5.3

      For example, to get version v0.5.3 for running in us-east-2, you would use the URL,

      https://cyral-public-assets-us-east-2.s3.us-east-2.amazonaws.com/cyral-repo-crawler/v0.5.3/cyral-repo-crawler-cft-v0.5.3.yaml
  3. In the Specify stack details page, provide the following information:

    • Stack name: Give the crawler Lambda function a recognizable name, such as "Cyral-crawler".

    • ControlPlane: Hostname of your Cyral control plane

    • ControlPlaneRestPort: Keep the default value unless Cyral support advises otherwise. This is the REST API port number of your Cyral control plane.

    • ControlPlaneGrpcPort: Keep the default value unless Cyral support advises otherwise. This is the gRPC port number of your Cyral control plane.

    • CyralSecretARN, CyralClientId, CyralClientSecret: These fields provide the Cyral service user credentials for the crawler. There are two ways to set this up:

      • Store the credentials in AWS Secrets Manager and provide the secret's ARN here as the CyralSecretARN. See the earlier section, "Prerequisites" to learn how to format the secret.

        or

      • Leave CyralSecretARN blank, and provide the Cyral API client ID and client secret in the CyralClientId and CyralClientSecret fields. Get these values from the API Access Keys screen in Cyral, as described in the Prerequisites above.

  4. In Repository Configuration, provide the information the crawler will use to connect to your repository:

    • RepoName: Name of the data repository, as saved in Cyral control plane

    • RepoType: Type of repository. For example, PostgreSQL.

    • RepoHost, RepoPort: Host and port where the crawler will connect to the repository.

    • RepoSecretARN, RepoUsername, and RepoPassword: These fields provide the repository login credentials for the crawler. There are two ways to set this up:

      • Store the credentials in AWS Secrets Manager and provide the secret's ARN here as the RepoSecretARN. See the earlier section, "Prerequisites" to learn how to format the secret.

        or

      • Leave RepoSecretARN blank, and provide the username and password in the RepoUsername and RepoPassword fields.

    • RepoDatabase: Name of the database the Repo Crawler will connect to.

    • EnableDataClassification: Set this to true to enable automatic data map suggestions.

    • EnableAccountDiscovery: Set this to true to enable local account discovery.

    Optionally, configure the following advanced parameters, or just leave them at their default values:

    • QueryTimeout: The maximum time any query can take before being canceled, as a duration string, e.g. 10s or 5m. If zero or negative, there is no timeout.

    • MaxOpenConns: Maximum number of open connections to the database.

    • MaxConcurrency: Advanced option to configure the maximum number of concurrent query goroutines. Applies at the per-database level: each database crawled in parallel has its own set of concurrent queries, bounded by this limit. If zero, there is no limit.

    • RepoIncludePaths: A comma-separated list of glob patterns, in the format <database>.<schema>.<table>, which represent paths to include when crawling the database. If empty or "*" (default), all paths are included.

    • RepoExcludePaths: A comma-separated list of glob patterns, in the format <database>.<schema>.<table>, which represent paths to exclude when crawling the database. If empty (default), no paths are excluded.

  5. Snowflake configuration: If you'll scan a Snowflake repo, provide its connection details here. Otherwise, leave this section blank.

  6. Oracle configuration: If you'll scan an Oracle database, provide its connection details here. Otherwise, leave this section blank.

  7. Connection String configuration, ConnectionOpts: Any additional parameters that your repository requires when the crawler connects.

  8. Networking and Lambda configuration:

    • ScheduleExpression: How frequently the crawler will run, expressed in cron notation.
    • VPC and Subnets: You must deploy the crawler to a VPC and subnet that has access to the internet.
    • RepoCrawlerCodeS3Bucket: Leave blank. This is only used for custom crawler deployments.
  9. Create the stack. The Cyral requirements checker Lambda will verify that the VPC provides all needed resources. If the checker fails, the deployment attempt will be rolled back.

Once deployed, the crawler will run automatically, based on the cron schedule you set in the ScheduleExpression field of the template.
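
If you prefer the AWS CLI to the console, the following is a minimal sketch of creating the same stack. All values shown are placeholders, only a subset of the parameters is included (the networking parameters from step 8, for example, are also required), and the ScheduleExpression value uses the EventBridge cron() form; check the template for the exact parameter names and formats in your version:

  aws cloudformation create-stack \
    --stack-name Cyral-crawler \
    --template-url https://cyral-public-assets-us-east-1.s3.us-east-1.amazonaws.com/cyral-repo-crawler/cyral-repo-crawler-cft-latest.yaml \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
    --parameters \
      ParameterKey=ControlPlane,ParameterValue=mycompany.app.cyral.com \
      ParameterKey=CyralSecretARN,ParameterValue=arn:aws:secretsmanager:us-east-1:111122223333:secret:cyral/repo-crawler/api-creds-AbCdEf \
      ParameterKey=RepoName,ParameterValue=my-postgres \
      ParameterKey=RepoType,ParameterValue=postgresql \
      ParameterKey=RepoHost,ParameterValue=sidecar.mycompany.com \
      ParameterKey=RepoPort,ParameterValue=5432 \
      ParameterKey=RepoSecretARN,ParameterValue=arn:aws:secretsmanager:us-east-1:111122223333:secret:cyral/repo-crawler/repo-creds-AbCdEf \
      ParameterKey=RepoDatabase,ParameterValue=mydb \
      ParameterKey=EnableDataClassification,ParameterValue=true \
      ParameterKey=EnableAccountDiscovery,ParameterValue=true \
      "ParameterKey=ScheduleExpression,ParameterValue=cron(0 2 * * ? *)"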

To test your crawler, you can execute a manual test run in AWS and examine its logs in CloudWatch. To do this in the AWS console, navigate to your Repo Crawler Lambda, open the Test tab, and click Test to run the crawler.
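
Alternatively, a minimal sketch of triggering a run and tailing its logs with the AWS CLI (version 2), where the function name is a placeholder for the Lambda created by your stack:

  aws lambda invoke --function-name <your-repo-crawler-lambda> response.json
  aws logs tail /aws/lambda/<your-repo-crawler-lambda> --follow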

Use your Repo Crawler for multiple databases

The default Repo Crawler deployment scans a single database, but you can deploy additional triggers to scan more databases using the same Repo Crawler Lambda. To do this, deploy the triggers using this CloudFormation template:

https://cyral-public-assets-<region>.s3.<region>.amazonaws.com/cyral-repo-crawler/cyral-repo-crawler-event-cft-latest.yaml

where region is one of us-east-1, us-east-2, us-west-1, or us-west-2.

Deploy this as you deployed the Repo Crawler above, but supply the following parameters:

  • Repo Crawler Lambda configuration
    • RepoCrawlerLambdaArn: The ARN of the Repo Crawler Lambda. It must exist in the same AWS account and region as this trigger.
    • ScheduleExpression: How frequently the crawler will run, expressed in cron notation.
  • Repository Configuration
    • RepoName: The repository name in the Cyral Control Plane.
    • RepoType: The repository type in the Cyral Control Plane.
    • RepoHost: The hostname or host address of the database instance.
    • RepoPort: The port of the database service in the database instance.
    • RepoSecretArn: ARN of the entry in AWS Secrets Manager that stores the secret containing the credentials to connect to the repository. If empty, RepoUsername and RepoPassword must be provided, and a new secret will be created.
    • RepoUsername: The username to connect to the repository.
    • RepoPassword: The password to connect to the repository.
    • RepoDatabase: The database on the repository that the repo crawler will connect to.
    • SampleSize: Number of rows to sample from each table.
    • QueryTimeout: The maximum time any query can take before being canceled, as a duration string, e.g. 10s or 5m. If zero or negative, there is no timeout.
    • MaxOpenConns: Maximum number of open connections to the database.
    • MaxParallelDbs: Advanced option to configure the maximum number of databases to crawl in parallel. If zero, there is no limit.
    • MaxConcurrency: Advanced option to configure the maximum number of concurrent query goroutines. Applies at the per-database level: each database crawled in parallel has its own set of concurrent queries, bounded by this limit. If zero, there is no limit.
    • RepoIncludePaths: A comma-separated list of glob patterns, in the format <database>.<schema>.<table>, which represent paths to include when crawling the database. If empty or "*" (default), all paths are included.
    • RepoExcludePaths: A comma-separated list of glob patterns, in the format <database>.<schema>.<table>, which represent paths to exclude when crawling the database. If empty (default), no paths are excluded.
    • EnableDataClassification: Runs data classification mode, that is, samples and classifies data according to a set of existing labels.
    • EnableAccountDiscovery: Runs user account discovery mode, that is, queries and discovers all existing user accounts in the database.
    • For Snowflake:
      • SnowflakeAccount: The Snowflake account. Omit if not configuring a Snowflake repo.
      • SnowflakeRole: The Snowflake role. Omit if not configuring a Snowflake repo.
      • SnowflakeWarehouse: The Snowflake warehouse. Omit if not configuring a Snowflake repo
    • For Oracle:
      • OracleServiceName: The Oracle service name. Omit if not configuring an Oracle repo.
    • For PostgreSQL-like databases:
      • ConnectionOpts: The database connection options, in comma-separated key/value format, for example: "opt1=val1","opt2=val2". Note that the quotes around each pair are required. Omit if not configuring a PostgreSQL-like repo, that is, Redshift, Denodo, or PostgreSQL.
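
For example, the following is a minimal sketch of adding a trigger for a second database with the AWS CLI. All values are placeholders, only a subset of the parameters listed above is shown, and the ScheduleExpression value uses the EventBridge cron() form:

  aws cloudformation create-stack \
    --stack-name Cyral-crawler-analytics-trigger \
    --template-url https://cyral-public-assets-us-east-1.s3.us-east-1.amazonaws.com/cyral-repo-crawler/cyral-repo-crawler-event-cft-latest.yaml \
    --parameters \
      ParameterKey=RepoCrawlerLambdaArn,ParameterValue=arn:aws:lambda:us-east-1:111122223333:function:<your-repo-crawler-lambda> \
      "ParameterKey=ScheduleExpression,ParameterValue=cron(0 3 * * ? *)" \
      ParameterKey=RepoName,ParameterValue=analytics-db \
      ParameterKey=RepoType,ParameterValue=postgresql \
      ParameterKey=RepoHost,ParameterValue=sidecar.mycompany.com \
      ParameterKey=RepoPort,ParameterValue=5432 \
      ParameterKey=RepoSecretArn,ParameterValue=arn:aws:secretsmanager:us-east-1:111122223333:secret:cyral/repo-crawler/analytics-creds-AbCdEf \
      ParameterKey=RepoDatabase,ParameterValue=analytics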

Next step

To start scanning, you must specify data patterns to match for Automatic Data Map.