
API Reference

The OnCrawl REST API is used for accessing your crawl data as well as managing your projects and your crawls.

In order to use this API you need to have an OnCrawl account, an active subscription and an access token.

The current version of the web API is known as V2. Although we don’t expect it to change much, it is still considered under development.

We try to keep breaking changes to a minimum, but this is not 100% guaranteed.

Requests

All API requests should be made to the /api/v2 prefix, and will return JSON as the response.

HTTP Verbs

When applicable, the API tries to use the appropriate HTTP verb for each action:

Verb Description
GET Used for retrieving resources.
POST Used for creating resources.
PUT Used for updating resources.
DELETE Used for deleting resources.

Parameters and Data

curl "https://app.oncrawl.com/api/v2/projects" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
        "project": {
            "name": "Project name",
            "start_url": "https://www.oncrawl.com"
        }
    }
EOF
import requests

requests.post("https://app.oncrawl.com/api/v2/projects", json={
  "project": {
    "name": "Project name",
    "start_url": "https://www.oncrawl.com"
  }
})

Any parameters not included in the URL should be encoded as JSON with a Content-Type of application/json.

Additional parameters are sometimes specified via the querystring, even for POST, PUT and DELETE requests.

When a complex object is required to be passed via the querystring, the rison encoding format is used.
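As an illustration, here is a minimal sketch of rison-encoding an OQL filter before placing it in the querystring; it assumes the third-party prison package (any rison encoder can be used instead), and the filters parameter is described in the Pagination section below.

# Minimal sketch: rison-encode a complex object before putting it in the querystring.
# Assumes the third-party "prison" package (a rison encoder/decoder).
import prison
import requests

filters = {"field": ["name", "contains", "blog"]}

response = requests.get(
    "https://app.oncrawl.com/api/v2/projects",
    params={"filters": prison.dumps(filters)},  # rison-encoded in the querystring
    headers={"Authorization": "Bearer {ACCESS_TOKEN}"},
)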

Errors

Format of an error message

{
  "type": "error_type",
  "code": "error_code",
  "message": "Error message",
  "fields": [{
    "name": "parameter_name",
    "type": "field_error_type",
    "message": "Error message"
  }]
}

When an error occurs, the API returns a JSON object with the following properties:

Property Description
type An attribute that groups errors based on their nature.
code
optional
A more specific attribute to let you handle specific errors.
message
optional
A human readable message describing the error.
fields
optional
List of field-related errors.

Quota error message

{
 "type": "quota_error",
 "message": "Not enough quota" 
}

Forbidden error message

{
  "type": "forbidden",
  "code": "no_active_subscription"
}

Fields related errors

{
 "type": "invalid_request_parameters",
 "fields": [{
  "name": "start_url",
  "type": "required",
  "message": "The start URL is required."
 }]
}

Permissions errors

The following errors occur if you are not allowed to perform a request.

Type Description
unauthorized Returned when the request is not authenticated.
HTTP Code: 401
forbidden Returned when the request is authenticated but the action is not allowed.
HTTP Code: 403
quota_error Returned when the current quota does not allow the action to be performed.
HTTP Code: 403

The forbidden error is usually accompanied by a code key:

Validations errors

The following errors are caused by an invalid request. In most cases it means the request won’t be able to complete unless the parameters are changed.

Type Description
invalid_request Returned when the request has incompatible values or does not match the API specification.
HTTP Code: 400
invalid_request_parameters Returned when the value does not meet the required specification for the parameter.
HTTP Code: 400
resource_not_found Returned when any of the resources referred to in the request is not found.
HTTP Code: 404
duplicate_entry Returned when the request provides a duplicate value for an attribute that is specified as unique.
HTTP Code: 400

Operation failure errors

These errors are returned when the request was valid but the requested operation could not be completed.

Type Description
invalid_state_for_request Returned when the requested operation is not allowed for current state of the resource.
HTTP Code: 409
internal_error Returned when the request couldn’t be completed due to a bug on OnCrawl’s side.
HTTP Code: 500
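
As an illustration, the error structure above can be handled generically; a minimal sketch (the endpoint is just an example):

import requests

response = requests.get(
    "https://app.oncrawl.com/api/v2/projects",
    headers={"Authorization": "Bearer {ACCESS_TOKEN}"},
)

if not response.ok:
    error = response.json()
    # "type" is always present; "code", "message" and "fields" are optional.
    print("error type:", error["type"])
    print("error code:", error.get("code"))
    print("message:", error.get("message"))
    for field in error.get("fields", []):
        print("invalid parameter:", field["name"], "-", field.get("message"))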

Authentication

To authorize, use this code:

# With shell, you can just pass the correct header with each request
curl "https://app.oncrawl.com/api/v2/projects" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

response = requests.get("https://app.oncrawl.com/api/v2/projects",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

Make sure to replace {ACCESS_TOKEN} with your own access token.

OnCrawl uses access tokens to allow access to the API. You can create tokens from your settings panel if your subscription allows it.

OnCrawl expects the access token to be included in all API requests to the server.

An access token may be created with various scopes:

Scope Description
account:read Gives read access to all account-related data.
Examples: Profile, invoices, subscription.
account:write Gives write access to all account-related data.
Examples: Close account, update billing information.
projects:read Gives read access to all project and crawl data.
Examples: View crawl reports, export data.
projects:write Gives write access to all project and crawl data.
Examples: Launch crawl, create project.

OnCrawl Query Language

OnCrawl provides a JSON-style language that you can use to execute queries.

This is referred to as the OQL for OnCrawl Query Language.

An OQL query has a tree-like structure composed of nodes.

A node can be terminal and is referred to as a leaf, or be a compound of other nodes.

An OQL query must start with a single root node.

Leaf nodes

Example of OQL using a field node:

{
  "field": [ "field_name", "filter_type", "filter_value" ]
}
Node Description
field Apply a filter on a field.

The value of a field node is an array with 3 values: the field name, the filter type and the filter value.

Compound nodes

Example OQL using an and node:

{
  "and": [ {
    "field": [ "field_name", "filter_type", "filter_value" ]
  }, {
    "field": [ "field_name", "filter_type", "filter_value" ]   
  }]
}
Node Description
and Execute a list of nodes using the logical operator AND.
or Execute a list of nodes using the logical operator OR.

Common filters

OQL to retrieve pages found in the structure:

{
  "field": [ "depth", "has_value", "" ]
}
Filter type Description
has_no_value The field must have no value.
has_value The field must have any value.

String filters

OQL to retrieve pages with “cars” in title

{
  "field": [ "title", "contains", "cars" ]
}
Filter type Description
contains The field’s value must contain the filter value.
endswith The field’s value must end with the filter value.
startswith The field’s value must start with the filter value.
equals The field’s value must be strictly equal to the filter value.

Numeric filters

OQL to retrieve pages with less than 10 inlinks:

{
  "field": [ "follow_inlinks", "lt", "10" ]
}

OQL to retrieve pages between depth 1 and 4

{
  "field": [ "depth", "between", [ "1", "4" ]]
}
Filter type Description
gt The field’s value must be greater than the filter value.
gte The field’s value must be greater than or equal to the filter value.
lt The field’s value must be less than the filter value.
lte The field’s value must be less than or equal to the filter value.
between The field’s value must be between both filter values (lower inclusive, upper exclusive).

Filters options

OQL to retrieve urls within /blog/{year}/:

{
  "field": [ "urlpath", "startswith", "/blog/([0-9]+)/", { "regex": true } ]
}

The filters equals, contains, startswith and endswith can take options as the fourth parameter of the field node as a JSON object.

Property Description
ci
boolean
true if the match should be case insensitive.
regex
boolean
true if the filter value is a regex.

Pagination

The majority of endpoints returning resources such as projects and crawls are paginated.

HTTP request

Example of paginated query

curl "https://app.oncrawl.com/api/v2/projects?offset=50&limit=100&sort=name:desc" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

response = requests.get(
    "https://app.oncrawl.com/api/v2/projects?offset={offset}&limit={limit}&sort={sort}"
    .format(
        offset=50,
        limit=100,
        sort='name:desc'
    ),
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

The HTTP query expects the following parameters:

Parameter Description
offset
optional
The offset for matching items.
Defaults to 0.
limit
optional
The maximum number of matching items to return.
Defaults to 10.
sort
optional
How to sort matching items, order can be asc or desc.
Natural ordering is from most recent to least recent.
filters
optional
The OQL filters used for the query.
Defaults to null.

Because filters is a JSON object that needs to be passed in the querystring, the rison encoding format is used.

The sort parameter is expected to be in the format {name}:{order}, where {name} is the field to sort on and {order} is either asc or desc.

HTTP response

Example of paginated response

{
  "meta": {
    "offset": 0,
    "limit": 10,
    "total": 100,
    "filters": "<OQL>",
    "sort": [
      [ "name", "desc" ]
    ]
  },
  "projects": [ "..." ]
}

The HTTP response always follows the same pattern: a meta key and a key named after the requested resources (projects in this example) containing the list of matching items.

The meta key returns a JSON object that allows you to easily paginate through the resources:

Property Description
offset The offset used for the query.
Defaults to 0.
limit The limit used for the query.
Defaults to 10.
total The total number of matching items.
sort The sort used for the query.
Defaults to null.
filters The OQL filters used for the query.
Defaults to {}.
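
As an illustration, the meta object can be used to walk through all pages of a collection; a minimal sketch using the projects endpoint:

import requests

HEADERS = {"Authorization": "Bearer {ACCESS_TOKEN}"}

projects = []
offset, limit = 0, 100

while True:
    page = requests.get(
        "https://app.oncrawl.com/api/v2/projects",
        params={"offset": offset, "limit": limit, "sort": "name:asc"},
        headers=HEADERS,
    ).json()
    projects.extend(page["projects"])
    offset += limit
    # meta.total is the total number of matching items
    if offset >= page["meta"]["total"]:
        break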

Data API

The Data API allows you to explore, aggregate and export your data.

There are 3 main sources: the Crawl Report, the Crawl over Crawl and the Logs Monitoring.

Each source can have one or several data types behind it.

Data types

For Crawl Reports:

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/<data_type>" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

response = requests.get("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/<data_type>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

For Crawl over Crawl:

curl "https://app.oncrawl.com/api/v2/data/crawl_over_crawl/<coc_id>/<data_type>" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

response = requests.get("https://app.oncrawl.com/api/v2/data/crawl_over_crawl/<coc_id>/<data_type>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

For Log Monitoring (events):

curl "https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

response = requests.get("https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

For Log Monitoring (pages):

curl "https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>/<granularity>" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

response = requests.get("https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>/<granularity>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

A data type is the nature of the objects you are exploring, each data type has its own schema and purpose.

Source Data type Description
Crawl report pages Lists of crawled pages of your website.
Crawl report links Lists of all links of your website.
Crawl report clusters Lists of duplicate clusters of your website.
Crawl report structured_data Lists of structured data of your website.
Crawl over Crawl pages List of compared pages.
Logs monitoring pages Lists of all URLs.
Logs monitoring events Lists of all events.
Pages
Represents an HTML page of the website.
Links
Represents a link between two pages.
Example: an ‘href’ link to another page.
Clusters
Represents a cluster of pages that are considered similar.
A cluster has a size and an average similarity ratio.
Structured data
Represents a structured data item found on a page.
Supported formats are: JSON-LD, RDFa and microdata.
Events
Represents a single line of a log file.
Available only in logs monitoring.

Data granularity

A granularity, only available for pages in log monitoring, defines how the metrics will be aggregated for a page.

days
Data will be aggregated by days, a day field will be available with the format YYYY-MM-DD.
weeks
Data will be aggregated by weeks, a week field will be available with the format YYYY-[W]WW.
A week may start on Monday or Sunday depending on the project’s configuration.
months
Data will be aggregated by months, a month field will be available with the format YYYY-MM.

You can find more information on what is available using the /metadata endpoint.

HTTP Request

Example of HTTP request

curl "https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>/metadata" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

fields = requests.get("https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>/metadata",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

# For Logs Monitoring
GET /api/v2/data/project/<project_id>/log_monitoring/<data_type>/metadata

HTTP Response

Example of HTTP response

{
   "bot_kinds": [
      "seo",
      "vertical"
   ],
   "dates": [
      {
         "from": "2018-09-21",
         "granularity": "days",
         "to": "2019-11-18"
      },
      {
         "from": "2018-09-16",
         "granularity": "weeks",
         "to": "2019-11-16"
      },
      {
         "from": "2018-06-01",
         "granularity": "months",
         "to": "2019-10-31"
      }
   ],
   "search_engines": [
      "google"
   ],
   "week_definition": "sunday_start"
}
Property Description
bot_kinds Bot kind can be seo, sea or vertical
dates List of available granularities with their min/max date
search_engines Search engine can be google
week_definition Can be sunday_start or iso

Data Schema

HTTP Request

Example of field’s request

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/<data_type>/fields" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

fields = requests.get("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/<data_type>/fields",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json().get('fields', [])

HTTP Response

Example of HTTP response

{
  "fields": [{
    "name": "canonical_evaluation", 
    "type": "enum", 
    "arity": "one", 
    "values": [
         "matching", 
         "not_matching", 
         "not_set"
    ],
    "actions": [
     "has_no_value", 
     "not_equals", 
     "equals", 
     "has_value"
    ], 
    "agg_dimension": true, 
    "agg_metric_methods": [
     "value_count", 
     "cardinality"
    ], 
    "can_display": true, 
    "can_filter": true, 
    "can_sort": false,
    "user_select": true, 
    "category": "HTML Quality"
  }, "..."]
}
Property Description
name The name of the field
type The field’s type (natural, float, hash, enum, bool, string, percentage, object, date, datetime, ratio)
arity Whether the field holds a single value or multiple values; can be one or many.
values List of possible values for the enum type.
actions List of possible filters for this field.
agg_dimension true if the field can be used as a dimension in aggregate queries.
agg_metric_methods List of available aggregation methods for this field.
can_display true if the field can be retrieved in search or export queries.
can_filter true if the field can be used in filter queries.
can_sort true if the field can be sorted on in search or export queries.
category
deprecated
Do not use.
user_select
deprecated
Do not use.
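
Because the available fields differ from one data type to another, it can be useful to inspect the schema before building filters or aggregations; a minimal sketch, assuming a pages data type and a valid crawl ID:

import requests

fields = requests.get(
    "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/fields",
    headers={"Authorization": "Bearer {ACCESS_TOKEN}"},
).json().get("fields", [])

# Fields usable in OQL filters, with the filter types they accept
filterable = {f["name"]: f["actions"] for f in fields if f.get("can_filter")}

# Fields usable as aggregation metrics, with their available methods
aggregatable = {f["name"]: f["agg_metric_methods"] for f in fields if f.get("agg_metric_methods")}

print(filterable.get("status_code"))
print(aggregatable.get("load_time"))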

Search Queries

Search queries allow you to explore your data by filtering, sorting and paginating it.

HTTP Request

Search for crawled pages with a 301 or 404 HTTP status code.

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
        "offset": 0,
        "limit": 10,
        "fields": [ "url", "status_code" ],
        "sort": [
            { "field": "status_code", "order": "asc" }
        ],
        "oql": {
            "and":[
                {"field":["fetched","equals",true]},
                {"or":[
                    {"field":["status_code","equals",301]},
                    {"field":["status_code","equals",404]}
                ]}
            ]
        }
    }
EOF
import requests

response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "offset": 0,
      "limit": 10,
      "fields": [ "url", "status_code" ],
      "sort": [
          { "field": "status_code", "order": "asc" }
      ],
      "oql": {
        "and":[
            {"field":["fetched","equals",true]},
            {"or":[
                {"field":["status_code","equals",301]},
                {"field":["status_code","equals",404]}
            ]}
        ]}
      }
    }
).json()

The HTTP request expects a JSON object as its payload with the following properties:

Property Description
limit
optional
Maximum number of matching results to return.
offset
optional
An offset for the returned matching results.
oql
optional
An OnCrawl Query Language object.
fields
optional
List of fields to retrieve for each matching result.
sort
optional
Ordering of the returned matching results.

The sort parameter is expected to be an array of objects, each with a field key (the field to sort on) and an order key (asc or desc).

HTTP response

{
  "meta": {
    "columns": [
      "url", 
      "inrank", 
      "status_code", 
      "meta_robots", 
      "fetched"
    ], 
    "total_hits": 1, 
    "total_pages": 1
  }, 
  "oql": {
    "and": [
      { "field": [ "fetched",  "equals",  true ] }, 
      {
        "or": [
          { "field": [ "status_code", "equals", 301 ] }, 
          { "field": [ "status_code", "equals", 404 ] }
        ]
      }
    ]
  }, 
  "urls": [
    {
      "fetched": true, 
      "inrank": 8, 
      "meta_robots": null, 
      "status_code": 301, 
      "url": "http://www.website.com/redirect/"
    }
  ]
}

The response will be a JSON object with an urls key, an oql key and a meta key.

The urls key will contain an array of matching results.

The oql key will contain the OnCrawl Query Language object used for filtering.

The meta key will contain the following keys:

Property Description
columns List of returned fields. They are the keys used in urls objects.
total_hits Total number of matching results.
total_pages
deprecated
Total number of pages according to limit and total_hits.

Aggregate Queries

Average load time of crawled pages

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
      "aggs": [{
        "oql": {
          "field": ["fetched", "equals", "true"]
        },
        "value": "load_time:avg"
      }]
    }
EOF
import requests

response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "aggs": [{
        "oql": {
          "field": ["fetched", "equals", "true"]
        },
        "value": "load_time:avg"
      }]
    }
).json()

The returned JSON looks like:

{
  "aggs": [
    {
      "cols": [
        "load_time:avg"
      ],
      "rows": [
        [
          183.41091954022988
        ]
      ]
    }
  ]
}

HTTP Request

This HTTP endpoint expects a JSON object as its payload with a single aggs key and an array of aggregate queries as its value.

An aggregate query is an object with the following properties:

Property Description
oql
optional
An OnCrawl Query Language object to match a set of items.
By default it will match all items.
fields
optional
Specify how to create buckets of matching items.
value
optional
Specify how to aggregate matching items.
By default it will return the number of matching items.

How to aggregate items

By default an aggregate request will return the count of matching items, but you can also perform a different aggregation using the value parameter.

The expected format is <field_name>:<aggregation_type>.

For example: load_time:avg or inrank:avg.

But not all fields can be aggregated and not all aggregations are available on all fields.

To know which aggregations are available on a field you can check the agg_metric_methods value returned by the Data Schema endpoint.

The available methods are:

min
Returns the minimal value for this field.
max
Returns the maximal value for this field.
avg
Returns the average value for this field.
sum
Returns the sum of all the values for this field.
value_count
Returns how many items have a value for this field.
cardinality
Returns the number of different values for this field.

How to create simple buckets

Average inrank by depth

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
      "aggs": [{
        "fields": [{
            "name": "depth"
        }],
        "value": "inrank:avg"
      }]
    }
EOF
import requests

response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "aggs": [{
        "fields": [{
            "name": "depth"
        }],
        "value": "inrank:avg"
      }]
    }
).json()

Pages count by range of inlinks

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
      "aggs": [{
        "fields": [{
          "name": "nb_inlinks_range",
          "ranges": [
            {
              "name": "under_10",
              "to": 10
            },
            {
              "name": "10_50",
              "from": 10,
              "to": 51
            },
            {
              "name": "more_50",
              "from": 51
            }
          ]
        }]
      }]
    }
EOF
import requests

response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "aggs": [{
        "fields": [{
          "name": "nb_inlinks_range",
          "ranges": [
            {
              "name": "under_10",
              "to": 10
            },
            {
              "name": "10_50",
              "from": 10,
              "to": 51
            },
            {
              "name": "more_50",
              "from": 51
            }
          ]
        }]
      }]
    }
).json()

When performing an aggregation, you can create buckets for your matching items using the fields parameter which takes an array of JSON objects.

The simplest way is to use the field’s name like so: {"name": "field_name"}.

It will return the item count for each distinct value of field_name.

But not all fields can be used to create a bucket.

To know which fields are available as a bucket you can check the agg_dimension value returned by the Data Schema endpoint.

How to create ranges buckets

If field_name has too many distinct values, it can be useful to group them into ranges.

To do so you can add a ranges key that takes an array of ranges. A range is a JSON object with the following expected keys:

Property Description
name
required
The name that will be returned in the JSON response for this range.
from
optional
The lower bound of this range (inclusive).
to
optional
The upper bound of this range (exclusive).

Only numeric fields can be used with range buckets.

Export Queries

Export all pages from the structure.

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages?export=true" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
         "fields": ["url"],
         "oql": {
            "field":["depth","has_value", ""]
        }
    }
EOF > my_export.csv
import requests

response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages?export=true",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "fields": ["url"],
       "oql": {
          "field":["depth","has_value", ""]
      }
    }
)

An export query allows you to save the result of your search query as a CSV file.

It does not suffer from the 10K items limitation and allows you to export all of the matching results.

To export the result of your search query as csv, simply add ?export=true within the URL.

Property Description
file_type
optional
Can be csv or json (exported as JSONL), defaults to csv.

HTTP response

The response of the query will be a streamed csv file.
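
Since the file is streamed, it can be written to disk chunk by chunk instead of being loaded in memory; a minimal sketch:

import requests

response = requests.post(
    "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages?export=true",
    headers={"Authorization": "Bearer {ACCESS_TOKEN}"},
    json={
        "fields": ["url"],
        "oql": {"field": ["depth", "has_value", ""]},
    },
    stream=True,  # stream the CSV body instead of buffering it in memory
)

with open("my_export.csv", "wb") as output:
    for chunk in response.iter_content(chunk_size=8192):
        output.write(chunk)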

Projects API

The Projects API allows you to manage all your projects and your crawls.

With this API you can, for example, create projects, launch or schedule crawls, and manage your crawl configurations.

Projects

List projects

Get list of projects.

curl "https://app.oncrawl.com/api/v2/projects" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

projects = requests.get("https://app.oncrawl.com/api/v2/projects",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Request

The projects can be paginated and filtered using the parameters described in the pagination section.

The fields available for the sort and filters are:

Property Description
id The project ID.
name The project’s name.
start_url The project’s start URL.
features The project’s enabled features.

HTTP Response

{
   "meta":{
      "filters":{},
      "limit":100,
      "offset":0,
      "sort":null,
      "total":1
   },
   "projects": [
      "<Project Object>",
      "<Project Object>"
   ]
}

A JSON object with a meta key, described in the pagination section, and a projects key containing the list of projects.

Get a project

Get a project.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

project = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Response

{
  "project": {
    "id": "592c1e1cf2c3a42743d14350",
    "name": "OnCrawl",
    "start_url": "http://www.oncrawl.com/",
    "user_id": "54dce0f264b65e1eef3ef61b",
    "is_verified_by": "google_analytics",
    "domain": "oncrawl.com",
    "features": [
        "at_internet",
        "google_search_console"
    ],
    "last_crawl_created_at": 1522330515000,
    "last_crawl_id": "5abceb9303d27a70f93151cb",
    "limits": {
        "max_custom_dashboard_count": null,
        "max_group_count": null,
        "max_segmentation_count": null,
        "max_speed": 100
    },
    "log_monitoring_data_ready": true,
    "log_monitoring_processing_enabled": true,
    "log_monitoring_ready": true,
    "crawl_config_ids": [
        "5aa80a1303d27a729113bb2d"
    ],
    "crawl_ids": [
        "5abceb9303d27a70f93151cb"
    ],
    "crawl_over_crawl_ids": [
        "5abcf43203d27a1ecf100b2c"
    ]
  },
  "crawl_configs": [
    "<CrawlConfig Object>"
  ],
  "crawls": [
    "<Crawl Object>"
  ]
}

The HTTP response is a JSON object with three keys: project, crawl_configs and crawls.

The project’s properties are:

Property Description
id The project ID.
name The project’s name.
start_url The project’s start URL.
user_id The ID of the project’s owner.
is_verified_by Holds how the project’s ownership was verified.
Can be google_analytics, google_search_console, admin or null.
domain The start URL’s domain.
features List of project’s enabled features.
last_crawl_id The ID of the latest created crawl.
last_crawl_created_at UTC timestamp of the latest created crawl, in milliseconds.
Defaults to null.
limits An object with customized limits for this project.
log_monitoring_data_ready true if the project’s log monitoring index is ready to be searched.
log_monitoring_processing_enabled true if the project’s files for the log monitoring are automatically processed.
log_monitoring_ready true if the project’s log monitoring configuration was submitted.
crawl_over_crawl_ids The list of Crawl over Crawl IDs attached to this project.
crawl_ids The list of Crawl IDs for this project.
crawl_config_ids The list of Crawl configurations IDs for this project.

Create a project

Create a project.

curl -X POST "https://app.oncrawl.com/api/v2/projects" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
        "project": {
            "name": "Project name",
            "start_url": "https://www.oncrawl.com"
        }
    }
EOF
import requests

requests.post("https://app.oncrawl.com/api/v2/projects", json={
      "project": {
        "name": "Project name",
        "start_url": "https://www.oncrawl.com"
      }
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

Property Description
name
required
The project’s name, must be unique.
start_url
required
The project’s start URL, starting with http:// or https://.

HTTP Response

Examples of HTTP response

{
  "project": "<Project Object>"
}

An HTTP 200 status code is returned with the created project returned directly as the response within a project key.

Delete a project

Delete a project.

curl -X DELETE "https://app.oncrawl.com/api/v2/projects/<project_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

requests.delete("https://app.oncrawl.com/api/v2/projects/<project_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

No HTTP parameters.

HTTP Response

Returns an HTTP 204 status code if successful.

Scheduling

The scheduling of crawls allows you to start your crawl at a later date, run it automatically on a periodic basis, or both.

Schedule your crawls to be run every week or every month and never think about it again.

List scheduled crawls

Get list of scheduled crawls.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

projects = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Request

The scheduled crawls can be paginated using the parameters described in the pagination section.

There are no sort or filter options available.

HTTP Response

{
   "meta":{
      "filters": {},
      "limit":50,
      "offset":0,
      "sort": null,
      "total":1
   },
   "scheduled_crawls": [
      {
         "config_id":"59f3048cc87b4428618d7c44",
         "id":"5abdeb0f03d27a69ef169c52",
         "project_id":"592c1e1cf2c3a42743d14350",
         "recurrence":"week",
         "start_date":1522482300000
      }
   ]
}

A JSON object with a meta key, described in the pagination section, and a scheduled_crawls key containing the list of scheduled crawls for this project.

Create a scheduled crawl

HTTP request

Create a scheduled crawl.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
        "scheduled_crawl": {
            "config_id": "59f3048cc87b4428618d7c49",
            "recurrence": "week",
            "start_date": 1522482300000
        }
    }
EOF
import requests

requests.post("https://app.oncrawl.com/api/v2/projects", json={
    "scheduled_crawl": {
        "config_id": "59f3048cc87b4428618d7c49",
        "recurrence": "week",
        "start_date": 1522482300000
    }
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

The request is expected to be a JSON object with a scheduled_crawl key and the following properties:

Property Description
config_id
required
The ID of the crawl configuration to schedule.
recurrence
optional
Can be day, week, 2weeks or month.
start_date
required
A UTC timestamp in milliseconds for when to start the first crawl.

HTTP Response

Examples of HTTP response

{
   "scheduled_crawl":{
      "config_id":"59f3048cc87b4428618d7c29",
      "id":"5abdeb0f03d27a69ef169c53",
      "project_id":"592c1e1cf2c3a42743d14350",
      "recurrence":"week",
      "start_date":1522482300000
   }
}

An HTTP 200 status code is returned with the created scheduled crawl returned directly as the response within a scheduled_crawl key.

Delete a scheduled crawl

Delete a scheduled crawl.

curl -X DELETE "https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls/<scheduled_crawl_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

requests.delete("https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls/<scheduled_crawl_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

No HTTP parameters.

HTTP Response

Returns an HTTP 204 status code if successful.

Crawls

Launch a crawl

Launch a crawl.

curl -X POST "https://app.oncrawl.com/api/v2/projects/<project_id>/launch-crawl?configId=<crawl_config_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

requests.post("https://app.oncrawl.com/api/v2/projects/<project_id>/launch-crawl?configId=<crawl_config_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

You have to pass a configId parameter in the query string with the ID of the crawl configuration you want to launch.

HTTP Response

Example of HTTP response

{
  "crawl": "<Crawl Object>"
}

Returns an HTTP 200 status code if successful with the created crawl returned directly as the response within a crawl key.

List crawls

Get list of crawls.

curl "https://app.oncrawl.com/api/v2/crawls" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

crawls = requests.get("https://app.oncrawl.com/api/v2/crawls",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Request

The crawls can be paginated and filtered using the parameters described in the pagination section.

The fields available for the sort and filters are:

Property Description
id The crawl’s ID.
user_id The crawl’s owner ID.
project_id The crawl’s project ID.
status The crawl’s status.
Can be running, done, cancelled, terminating, pausing, paused, archiving, unarchiving, archived.
created_at The crawl’s creation date as UTC timestamp in milliseconds.

HTTP Response

{
   "meta":{
      "filters":{},
      "limit":100,
      "offset":0,
      "sort":null,
      "total":1
   },
   "crawls": [
      "<Crawl Object>",
      "<Crawl Object>"
   ]
}

A JSON object with a meta key, described in the pagination section, and a crawls key containing the list of crawls.

Get a crawl

Get a crawl.

curl "https://app.oncrawl.com/api/v2/crawls/<crawl_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

project = requests.get("https://app.oncrawl.com/api/v2/crawls/<crawl_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Response

{
   "crawl": {
     "id":"5a57819903d27a7faa253683",
     "project_id":"592c1e1cf2c3a42743d14341",
     "user_id":"54dce0f264b65e1eef3ef61b",
     "link_status":"live",
     "status":"done",
     "created_at":1515684249000,
     "ended_at":1515685455000,
     "fetched_urls":10,  
     "end_reason":"max_url_reached",
     "features":[
        "at_internet"
     ],
      "crawl_config": "<CrawlConfig Object>",
      "cross_analysis_access_logs": null,
      "cross_analysis_at_internet": {
         "dates":{
            "from":"2017-11-26",
            "to":"2018-01-10"
         }
      },
      "cross_analysis_google_analytics": {
        "error": "No quota remaining."
      },
      "cross_analysis_majestic_back_links": {
         "stores":[
            {
               "name":"www.oncrawl.com",
               "success": true,
               "sync_date":"2017-10-27"
            }
         ],
         "tld":{
            "citation_flow":35,
            "name":"oncrawl.com",
            "trust_flow":29
         }
      }
   }
}

The HTTP response is a JSON object with a single crawl key containing the crawl’s data.

The crawl’s properties are:

Property Description
id The crawl ID.
project_id The crawl’s project ID.
user_id The crawl’s owner ID.
link_status The links index status.
Can be live or archived.
status The crawl’s status.
Can be running, done, cancelled, terminating, pausing, paused, archiving, unarchiving, archived.
created_at Date of the crawl creation as a UTC timestamp in milliseconds.
ended_at Date of the crawl termination as a UTC timestamp in milliseconds.
fetched_urls Number of URLs that were fetched for this crawl.
last_depth At what depth the crawl ended.
end_reason A code describing why the crawl stopped.
This value may not be present.
features List of features available for this crawl.
crawl_config The crawl configuration object used for this crawl.
cross_analysis_access_logs Dates used by the Logs monitoring cross analysis.
null if no cross analysis were done.
cross_analysis_at_internet Dates used by the AT Internet cross analysis.
null if no cross analysis were done.
cross_analysis_google_analytics Dates used by the Google Analytics cross analysis.
null if no cross analysis were done.
cross_analysis_majestic_back_links Majestic cross analysis metadata.
null if no cross analysis available.

End reasons

ok
All the URLs of the structure have been crawled.
crawl_already_running
A crawl with the same configuration was already running.
quota_reached_before_start
When a scheduled crawl could not run because of missing quota.
quota_reached
When the URL quota was reached during the crawl.
max_url_reached
When the maximum number of URLs defined in the crawl configuration was reached.
max_depth_reached
When the maximum depth defined in the crawl configuration was reached.
user_cancelled
When the crawl was manually cancelled.
user_requested
When the crawl was manually terminated and a partial crawl report was produced.
no_active_subscription
When no active subscription was available.
stopped_progressing
Technical end reason: at the end of the crawl there are still unfetched URLs, but for some reason the crawler is unable to fetch them. To prevent the crawler from iterating indefinitely, we abort the fetch phase when, after three attempts, it still has not managed to crawl those pages.
max_iteration_reached
Technical end reason: the crawl progressed abnormally slowly. This can happen, for example, when the website server is very busy and randomly drops connections. We abort the fetch phase after 500 iterations when we detect this pathological server behavior.

Get a crawl progress

Get a crawl progress

curl "https://app.oncrawl.com/api/v2/crawls/<crawl_id>/progress" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

project = requests.get("https://app.oncrawl.com/api/v2/crawls/<crawl_id>/progress",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

You can call this endpoint on a running crawl in order to follow its progression.

It allows you, for example, to monitor whether the crawler encounters an abnormal number of errors.
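
For example, a script could poll this endpoint until the crawl is no longer running; a minimal sketch (the polling interval is arbitrary):

import time
import requests

HEADERS = {"Authorization": "Bearer {ACCESS_TOKEN}"}

while True:
    progress = requests.get(
        "https://app.oncrawl.com/api/v2/crawls/<crawl_id>/progress",
        headers=HEADERS,
    ).json()["progress"]
    counts = progress["crawler"]["counts"]
    print("status:", progress["status"], "- fetched 2xx:", counts["fetched_2xx"])
    if progress["status"] != "running":
        break
    time.sleep(60)  # arbitrary polling interval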

HTTP request

This endpoint takes no parameters.

HTTP Response

{
  "progress": {
    "crawler": {
      "by_depth": [
        {
          "depth": 1,
          "fetched_2xx": 1,
          "fetched_3xx": 1,
          "fetched_4xx": 0,
          "fetched_5xx": 0,
          "unfetched_exception": 0,
          "unfetched_robots_denied": 0
        },
        {
          "depth": 2,
          "fetched_2xx": 912,
          "fetched_3xx": 117,
          "fetched_4xx": 0,
          "fetched_5xx": 0,
          "unfetched_exception": 0,
          "unfetched_robots_denied": 161
        }
      ],
      "counts": {
        "fetched_2xx": 3,
        "fetched_3xx": 118,
        "fetched_4xx": 0,
        "fetched_5xx": 0,
        "queued_urls": 500,
        "unfetched_exception": 0,
        "unfetched_robots_denied": 161
      },
      "depth": 6,
      "samples": [
        {
          "error_code": "robots_denied",
          "fetch_date": "2022-08-20T06:58:26Z",
          "fetch_status": "unfetched_robots_denied",
          "url": "https://www.oncrawl.com/error"
        },
        {
          "fetch_date": "2022-08-20T06:58:14Z",
          "fetch_duration": 733,
          "fetch_status": "fetched_2xx",
          "status_code": 200,
          "url": "https://www.oncrawl.com/"
        }
      ],
      "status": "done"
    },
    "status": "running",
    "steps": [
      {
        "name": "fetch",
        "status": "done"
      },
      {
        "jobs": [
          {
            "name": "google_search_console",
            "status": "done"
          }
        ],
        "name": "connectors",
        "status": "done"
      },
      {
        "jobs": [
          {
            "name": "parse",
            "status": "running"
          },
          {
            "name": "inlinks",
            "status": "done"
          },
          {
            "name": "sitemaps",
            "status": "done"
          },
          {
            "name": "outlinks",
            "status": "done"
          },
          {
            "name": "redirect",
            "status": "done"
          },
          {
            "name": "scoring",
            "status": "done"
          },
          {
            "name": "top_ngrams",
            "status": "waiting"
          },
          {
            "name": "hreflang",
            "status": "waiting"
          },
          {
            "name": "duplicate_description",
            "status": "waiting"
          },
          {
            "name": "duplicate_title",
            "status": "waiting"
          },
          {
            "name": "duplicate_h1",
            "status": "waiting"
          },
          {
            "name": "duplicate_simhash",
            "status": "waiting"
          },
          {
            "name": "cluster_similarities",
            "status": "waiting"
          }
        ],
        "name": "analysis",
        "status": "running"
      },
      {
        "name": "cross_analysis",
        "status": "waiting"
      },
      {
        "name": "export",
        "status": "waiting"
      }
    ]
  },
  "timestamp": 1661159435000,
  "version": 3
}

The HTTP response is JSON object with a progress key containing the crawl’s progression.

The properties are:

Property Description
status The crawl’s status.
crawler.counts Crawler’s fetch progression.
crawler.by_depth A detailed progression per depth.
crawler.depth The current crawler’s depth.
crawler.status Crawler’s fetch status.
crawler.samples A list of URL samples per fetch status.
It varies during the crawl and may not include a sample for every status.
steps A detailed progression per step.

Fetch statuses

fetched_2xx
Status code between 200 and 299
fetched_3xx
Status code between 300 and 399
fetched_4xx
Status code between 400 and 499
fetched_5xx
Status code between 500 and 599
unfetched_robots_denied
URL access denied by robots.txt
unfetched_exception
Unable to fetch a URL (e.g. a server timeout).

Update crawl state

HTTP request

Pause a running crawl

curl "https://app.oncrawl.com/api/v2/crawls/<crawl_id>/pilot" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
        "command": "pause"
    }
EOF
import requests

requests.post("https://app.oncrawl.com/api/v2/crawls/<crawl_id>/pilot", json={
      "command": "pause"
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

You have to pass a JSON object with a command key and the desired state.

The crawl’s commands are:

Command Description
cancel Cancel the crawl. It won’t produce a report.
Crawl must be running or paused.
resume Resume a paused crawl.
Crawl must be paused.
pause Pause a crawl.
Crawl must be running.
terminate Terminate a crawl early. It will produce a report.
Crawl must be running or paused.
unarchive Un-archive all of the crawl’s data.
Crawl must be archived or link_status must be archived.
unarchive-fast Un-archive the crawl’s data except links.
Crawl must be archived.

HTTP Response

Example of HTTP response

{
  "crawl": "<Crawl Object>"
}

Returns an HTTP 200 status code if successful with the updated crawl returned directly as the response within a crawl key.

Delete a crawl

Delete a crawl.

curl -X DELETE "https://app.oncrawl.com/api/v2/crawls/<crawl_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

requests.delete("https://app.oncrawl.com/api/v2/crawls/<crawl_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

No HTTP parameters.

HTTP Response

Returns an HTTP 204 status code if successful.

Crawls Configurations

List configurations

Get list of crawl configurations.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

crawl_configs = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Request

The endpoint does not take any parameter.

HTTP Response

{
   "crawl_configs": [
      "<CrawlConfig Object>",
      "<CrawlConfig Object>"
   ]
}

A JSON object with a crawl_configs key containing the list of crawl configurations.

Get a configuration

Get a configuration.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/<crawl_config_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

crawl_config = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/<crawl_config_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Response

{
  "crawl_config": {
       "agent_kind":"web",
       "ajax_crawling":false,
       "allow_query_params":true,
       "alternate_start_urls": [],
       "at_internet_params": {},
       "crawl_subdomains":false,
       "custom_fields":[],
       "dns": [],
       "extra_headers": {},
       "filter_query_params":false,
       "google_analytics_params":{},
       "google_search_console_params":{},
       "http_auth":{ },
       "id":"592c1f53973cb53b75287a79",
       "js_rendering":false,
       "majestic_params":{},
       "max_depth":15,
       "max_speed":10,
       "max_url":2000000,
       "name":"default",
       "notifications": {
         "email_recipients": [],
         "custom_webhooks": []
       },
       "query_params_list":"",
       "resource_checker":false,
       "reuse_cookies":false,
       "robots_txt":[],
       "scheduling_period":null,
       "scheduling_start_date":null,
       "scheduling_timezone":"Europe/Paris",
       "sitemaps":[],
       "start_url":"http://www.oncrawl.com/",
       "strict_sitemaps":true,
       "trigger_coc":false,
       "use_cookies":true,
       "use_proxy":false,
       "user_agent":"OnCrawl",
       "user_input_files":[],
       "watched_resources":[],
       "whitelist_params_mode":true
    }
}

The HTTP response is JSON object with the crawl configuration inside a crawl_config key.

The crawl configuration’s base properties are:

Property Description
agent_kind The type of user agent.
Values are web or mobile.
ajax_crawling true if the website should be crawled as a pre-rendered JavaScript website, false otherwise.
allow_query_params true if the crawler should follow URL with query parameters, false otherwise.
alternate_start_urls List of alternate start URLs. All those URLs will start with a depth of 1.
They must all belong to the same domain.
at_internet_params Configuration for AT Internet cross analysis.
The AT Internet cross analysis feature is required.
crawl_subdomains true if the crawler should follow links of all the subdomains.
Example: http://blog.domain.com for http://www.domain.com.
custom_fields Configuration for custom fields scraping.
The Data Scraping feature is required.
dns Override the crawler’s default DNS.
extra_headers Defines additional headers for the HTTP requests done by the crawler.
filter_query_params true if the query string of URLs should be stripped.
google_analytics_params Configuration for the Google Analytics cross analysis.
The Google Analytics cross analysis feature is required.
google_search_console_params Configuration for the Google Search Console cross analysis.
The Google Search Console cross analysis feature is required.
http_auth Configuration for the HTTP authentication of the crawler.
id The ID of this crawl configuration.
js_rendering true if the crawler should render the crawled pages using JavaScript.
The Crawl JS feature is required.
majestic_params Configuration for the Majestic Back-links cross analysis.
The Majestic Back-Links feature is required.
max_depth The maximum depth after which the crawler will stop following links.
max_speed The maximum speed at which the crawler should go, in URLs per second. Valid values are 0.1, 0.2, 0.5, 1, 2, 5 then every multiple of 5 up to your maximum allowed crawl speed.
To crawl above 1 URL/s you need to verify the ownership of the project.
max_url The maximum number of fetched URLs after which the crawler will stop.
name The name of the configuration.
Only used as a label to easily identify it.
notifications The notification channels for crawls that ended or failed to start.
By default the owner of the workspace will receive the notifications.
query_params_list If filter_query_params is true, this is a comma-separated list of query parameter names to filter. The whitelist_params_mode parameter defines how they are filtered.
resource_checker true if the crawler should watch for requested resources during the crawl, false otherwise. This feature requires js_rendering:true.
reuse_cookies
deprecated
Not used anymore.
robots_txt List of configured virtual robots.txt.
The project’s ownership must be verified to use this option.
scheduling_period
deprecated
Not used anymore.
scheduling_start_date
deprecated
Not used anymore.
scheduling_timezone
deprecated
Not used anymore.
sitemaps List of sitemaps URLs.
start_url The start URL of the crawl.
This URL should not be a redirection to another URL.
strict_sitemaps true if the crawler should strictly follow the sitemaps protocol, false otherwise.
trigger_coc true if the crawler should automatically generate a Crawl over Crawl at the end.
The Crawl over Crawl feature is required.
use_cookies true if the crawler should keep the cookies returned by the server between requests, false otherwise.
use_proxy true if the crawler should use the OnCrawl proxy which allows it to keep a static range of IP addresses during its crawl.
user_agent Name of the crawler; this name will appear in the user agent sent by the crawler.
user_input_files List of ingested data files IDs to use in this crawl.
The Data Ingestion feature is required.
watched_resources List of patterns to watch if resource_checker is set to true.
webhooks
deprecated
List of webhooks V1 to call at the end of the crawl.
whitelist_params_mode true if the query_params_list should be used as a whitelist, false if it should be used as a blacklist.

Create a configuration

Create a crawl configuration.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
        "crawl_config": {
            "name": "New crawl configuration",
            "start_url": "https://www.oncrawl.com",
            "user_agent": "OnCrawl",
            "max_speed": 1
        }
    }
EOF
import requests

requests.post("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs", json={
        "crawl_config": {
            "name": "New crawl configuration",
            "start_url": "https://www.oncrawl.com",
            "user_agent": "OnCrawl",
            "max_speed": 1
        }
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

The expected HTTP request is exactly the same format as the response when you retrieve a crawl configuration.

The id is automatically generated by the API for any new crawl configuration and must not be part of the payload.

The only required fields are name, start_url, user_agent and max_speed.

HTTP Response

Examples of HTTP response

{
  "crawl_config": "<CrawlConfig Object>"
}

An HTTP 200 status code is returned with the created crawl configuration returned directly as the response within a crawl_config key.

AT Internet

{
  "at_internet_params": {
    "api_key": "YOUR_API_KEY",
    "site_id": "YOUR_SITE_ID"
  }
}

A subscription with the AT Internet feature is required to use this configuration.

You can request an API Key in API Accounts within the settings area of your AT Internet homepage.

This API Key is necessary to allow OnCrawl to access your AT Internet data.

The site_id specifies from which site we should collect the data.

The HTTP requests that you need to whitelist are:

Note: You must replace the {site_id} of both URLs with the actual site ID.

Without this, OnCrawl won’t be able to fetch the data.

Google Analytics

{
  "google_analytics_params": {
    "email": "local@domain.com",
    "account_id": "12345678",
    "website_id": "UA-12345678-9",
    "profile_id": "12345678"
  }
}

A subscription with the Google Analytics feature is required to use this configuration.

You have to provide the following properties:

Property Description
email Email of your Google account.
account_id ID of your Google Analytics account.
website_id ID of your website in Google Analytics.
profile_id ID of the website’s profile to use for cross analysis.

To use a Google Account you must first give access to your analytics data to OnCrawl using OAuth2.

For now you must use the OnCrawl web client to add your Google account.

Google Search Console

{
  "google_search_console_params": {
    "email": "local@domain.com",
    "websites": [
      "https://www.oncrawl.com"
    ],
    "branded_keywords": [
      "oncrawl",
      "on crawl",
      "oncrowl"
    ]
  }
}

A subscription with the Google Search Console feature is required to use this configuration.

You have to provide the following properties:

Property Description
email Email of your Google account.
websites List of the websites URLs from your Google Search Console to use.
branded_keywords List of keywords that the crawler should consider as part of a brand.

To use a Google Account you must first give access to your analytics data to OnCrawl using OAuth2.

For now you must use the OnCrawl web client to add your Google account.

Majestic

{
  "majestic_params": {
    "access_token": "ABCDEF1234"
  }
}

A subscription with the Majestic feature is required to use this configuration.

You have to provide the following properties:

Property Description
access_token An access token that the crawler can use to access your data.

You can create an access token authorizing OnCrawl to access your Majestic data here.

Custom fields

Documentation not available yet.

Notifications

{
  "notifications": {
    "email_recipients": ["<email1>", "<email2>"],
    "custom_webhooks": [{
      "url": "<webhook1_url>"
    }, {
      "url": "<webhook2_url>",
      "secret": "<webhook_secret>"
    }]
  }
}

A notification is sent when a crawl ends or when it fails to start.

The supported notification channels are emails and custom HTTP webhooks.

Emails

You can configure up to 10 recipients. If an empty list is provided, no emails are sent.

If you do not provide a notifications.email_recipients configuration, an email is sent to the workspace owner by default.

Verify webhook payload signature

import hashlib
import hmac

def verify_signature(webhook_payload, webhook_secret, signature_header):
    # webhook_payload: raw request body, as bytes
    # signature_header: value of the X-Oncrawl-Webhook-Signature request header
    if not signature_header:
        raise Exception("signature header is missing")
    hash_object = hmac.new(webhook_secret.encode('utf-8'), msg=webhook_payload, digestmod=hashlib.sha256)
    if not hmac.compare_digest(hash_object.hexdigest(), signature_header):
        raise Exception("Signatures didn't match")

Test a custom webhook endpoint

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/validate_custom_webhook" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -X POST \
    -d @- <<EOF
    {
        "url": "<webhook_url>",
        "secret": "<webhook_secret>"
    }
EOF
import requests

requests.post("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/validate_custom_webhook", json={
    "url": "<webhook_url>",
    "secret": "<webhook_secret>"
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

Custom HTTP Webhooks

You can configure up to 10 custom HTTP webhooks.

The URLs:

To protect your endpoint and verify that the request is coming from Oncrawl you can specify a secret with the webhook URL. This
secret can be any value between 8 and 100 characters.

By adding a secret, a request HTTP header X-Oncrawl-Webhook-Signature will be sent with the payload so you can verify the authenticity of the request.
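
As an illustration, the verify_signature helper above could be wired into a webhook receiver; a minimal sketch assuming Flask, with a hypothetical route and secret value:

from flask import Flask, abort, request

app = Flask(__name__)
WEBHOOK_SECRET = "my-webhook-secret"  # the secret configured for this webhook URL (hypothetical)

@app.route("/oncrawl-webhook", methods=["POST"])  # hypothetical route
def oncrawl_webhook():
    try:
        # verify_signature() is the helper defined above
        verify_signature(
            webhook_payload=request.get_data(),  # raw request body, as bytes
            webhook_secret=WEBHOOK_SECRET,
            signature_header=request.headers.get("X-Oncrawl-Webhook-Signature"),
        )
    except Exception:
        abort(400)
    return "", 204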

The API will not return the secret for the webhook URL in the crawl config; it only indicates whether a secret is configured, using secret_enabled.

To remove the secret associated with a webhook, pass null as the value for secret.

If you change the webhook URL itself, any previously associated secret will be removed.
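As a hedged sketch, removing a secret could be done by resending the crawl configuration with the webhook's secret set to null (None in Python), using the update endpoint described later in this section; the configuration values below are placeholders.

import requests

# Hypothetical sketch: clear a webhook secret by sending "secret": null as
# part of the crawl configuration update.
requests.put("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs", json={
    "crawl_config": {
        "name": "New crawl configuration",
        "start_url": "https://www.oncrawl.com",
        "notifications": {
            "custom_webhooks": [{
                "url": "<webhook1_url>",
                "secret": None  # serialized as null: removes the webhook's secret
            }]
        }
    }
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)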

To test your webhook endpoint you can use the /validate_custom_webhook endpoint.

DNS

{
  "dns": [{
    "host": "www.oncrawl.com",
    "ips": [ "82.34.10.20", "82.34.10.21" ]
  }, {
    "host": "fr.oncrawl.com",
    "ips": [ "82.34.10.20" ]
  }]
}

The dns configuration allows you to resolve one or several hosts to different IP addresses than they normally would.

This can be useful to crawl a website in pre-production as if it was already deployed on the real domain.

Extra HTTP headers

{
  "extra_headers": {
    "Cookie": "lang=fr;",
    "X-My-Token": "1234"
  }
}

The extra_headers configuration allows you to inject custom HTTP headers to each of the crawl’s HTTP requests.

HTTP Authentication

{
  "http_auth": {
    "username": "user",
    "password": "1234",
    "scheme": "Digest",
    "realm": null
  }
}

The http_auth configuration allows you to crawl sites behind HTTP authentication.

It can be useful to crawl a website in pre-production that is password protected before its release.

Property Description
username
required
Username to authenticate with.
password
required
Password to authenticate with.
scheme
required
How to authenticate. Available values are Basic, Digest and NTLM.
realm
optional
The authentication realm.
For NTLM, this corresponds to the domain.

Robots.txt

{
  "robots_txt": [{
    "host": "www.oncrawl.com",
    "content": "CONTENT OF YOUR ROBOTS.TXT"
  }]
}

The robots_txt configuration allows you to override, for a given host, its robots.txt.

It can be used to:

Because this lets the crawler ignore a website's robots.txt, you must verify ownership of the project to use this feature.

For now you can only verify the ownership using the OnCrawl application.

Update a configuration

Update a crawl configuration.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -X PUT \
    -d @- <<EOF
    {
        "crawl_config": {
            "name": "New crawl configuration",
            "start_url": "https://www.oncrawl.com",
            "user_agent": "OnCrawl",
            "max_speed": 1
        }
    }
EOF
import requests

requests.put("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs", json={
        "crawl_config": {
            "name": "New crawl configuration",
            "start_url": "https://www.oncrawl.com",
            "user_agent": "OnCrawl",
            "max_speed": 1
        }
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

It takes the same parameters as a crawl configuration creation, except that the name cannot be modified and must remain the same.

HTTP response

It returns the same response as a crawl configuration creation.

Delete a configuration

Delete a configuration.

curl -X DELETE "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/<crawl_config_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

requests.delete("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/<crawl_config_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

No HTTP parameters.

HTTP Response

Returns an HTTP 204 status code if successful.

Ingest data

Data ingestion is the process of integrating additional data for URLs in your analysis.

The ingest data API allows you to upload, delete and retrieve data files used by the Data Ingestion feature.

List ingest files

Get list of ingest files.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

projects = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Request

The ingest files can be paginated and filtered using the parameters described in the pagination section.

The fields available for the sort and filters are:

Property Description
id The file’s ID.
name The file’s name.
status The file’s status, can be UPLOADING, UPLOADED, PROCESSING, PROCESSED, ERROR.
kind The file’s kind, can be ingest, seed.
created_at The file’s creation date.
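As a minimal sketch, assuming the offset and limit querystring parameters described in the pagination section, fetching the second page of ingest files could look like this:

import requests

# Fetch the second page of 20 ingest files (offset/limit as in the pagination section).
response = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files",
    params={ "offset": 20, "limit": 20 },
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

for user_file in response.get("user_files", []):
    print(user_file["name"], user_file["status"])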

HTTP Response

{
  "meta": {
    "filters": {},
    "limit": 20,
    "offset": 0,
    "sort": null,
    "total": 10
  },
  "user_files": [
      "<Ingest File Object>",
      "<Ingest File Object>"
  ]
}

A JSON object with a meta key, described in the pagination section, and a user_files key containing the list of ingest files.

Get an ingest file

Get an ingest file.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files/<ingest_file_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

project = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files/<ingest_file_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Response

{
  "user_file": {
    "created_at": 1587473932000,
    "detailed_invalid_lines": {
      "invalid_url_format": 1
    },
    "error_message": null,
    "fields": {
      "median_position": "string",
      "sum_search_volume": "double",
      "total_keyword_count": "double"
    },
    "id": "5e9eee0c451c95288a2f8f8d",
    "invalid_lines": 1,
    "kind": "ingest",
    "lines_errors_messages": {
      "invalid_url_format": [
        "Error!"
      ]
    },
    "name": "some_ingest_file.zip",
    "project_id": "58fe0dd3451c9573f1d2adea",
    "size": 2828,
    "status": "PROCESSED",
    "valid_lines": 62
  }
}

The ingest file’s properties are:

Property Description
id The file’s ID.
project_id The project ID.
name The file’s name.
status The file’s status, can be UPLOADING, UPLOADED, PROCESSING, PROCESSED, ERROR.
kind The file’s kind, can be ingest, seed.
created_at The file’s creation date.
detailed_invalid_lines The detail of invalid lines.
error_message The error message.
fields The fields parsed in the file.
valid_lines The number of valid lines.
invalid_lines The number of invalid lines.
lines_errors_messages A map of error messages grouped by category.
size The number of characters in the file.
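Since a file moves through the UPLOADING, UPLOADED, PROCESSING and PROCESSED (or ERROR) statuses, a common pattern is to poll this endpoint until processing finishes. A minimal sketch (the polling interval is arbitrary):

import time
import requests

url = "https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files/<ingest_file_id>"
headers = { 'Authorization': 'Bearer {ACCESS_TOKEN}' }

# Poll until the file is fully processed or in error.
while True:
    user_file = requests.get(url, headers=headers).json()["user_file"]
    if user_file["status"] in ("PROCESSED", "ERROR"):
        break
    time.sleep(10)  # arbitrary interval

if user_file["status"] == "ERROR":
    print("Ingestion failed:", user_file["error_message"])
else:
    print("Valid lines:", user_file["valid_lines"], "Invalid lines:", user_file["invalid_lines"])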

Create an ingest file

Create an ingest file.

curl -X POST "https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: multipart/form-data" \
    -F "file=@<file_path>"
import requests

requests.post("https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files", files={
    "file": open("<file_path>", "rb")
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

Property Description
file Binary data.

HTTP Response

Returns an HTTP 204 status code if successful.

Delete an ingest file

Delete an ingest file.

curl -X DELETE "https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files/<ingest_file_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

requests.delete("https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files/<ingest_file_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

No HTTP parameters.

HTTP Response

Returns an HTTP 204 status code if successful.

Ranking Performance API

RP Quotas

While you are using Ranking Performance, you are subject to usage quotas. These quotas apply whether you use the web client or query the API directly. Most users will not hit these limits, but if you do, you will receive a “Ranking Performance query quota reached.” error message (HTTP 403).

The quotas work this way: the more data you request, the more quota you use. For that reason, it is important to narrow down your requests as much as you can. A tight filter on the date field greatly reduces the amount of data fetched and therefore consumes less of your quota.

More generally, if you hit the quotas too often, try using more specific filters in your queries.

The quota is project-based: if you have reached it for one of your projects, you can still perform requests for your other projects.
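A defensive sketch for detecting this quota error, assuming the error body follows the quota_error format described in the Errors section (the aggregate query itself is a placeholder):

import requests

response = requests.post(
    "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={ "aggs": [{ "oql": { "field": ["date", "lt", "2022-10-01"] } }] }
)

if response.status_code == 403 and response.json().get("type") == "quota_error":
    # Quota reached for this project: narrow the date filter or retry later.
    print("Ranking Performance query quota reached for this project")
else:
    results = response.json()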

RP Data Schema

HTTP Request

Example of fields request

curl "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/fields" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

fields = requests.get("https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/fields",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json().get('fields', [])

HTTP Response

Example of HTTP response

{
  "fields": [{
    "actions": [
      "equals",
      "not_equals"
    ],
    "agg_dimension": true,
    "agg_metric_methods": [
      "cardinality"
    ],
    "arity": "one",
    "can_display": true,
    "can_filter": true,
    "can_sort": true,
    "name": "device",
    "type": "enum"
  }, "..."]
}
Property Description
name The name of the field.
type The field’s type (natural, float, enum, bool, string, date).
arity Whether the field is multivalued; can be one or many.
values List of possible values for the enum type.
actions List of possible filters for this field.
agg_dimension true if the field can be used as a dimension in aggregate queries.
agg_metric_methods List of available aggregation methods for this field.
can_display true if the field can be retrieved in queries.
can_filter true if the field can be used in query filters.
can_sort true if the field can be sorted on in queries.
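For example, the schema can be used to build a quick lookup of which fields can be aggregated and which can serve as dimensions (a minimal sketch based on the response above):

import requests

fields = requests.get(
    "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/fields",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json().get('fields', [])

# Map each field name to its available aggregation methods.
agg_methods = {
    field["name"]: field["agg_metric_methods"]
    for field in fields
    if field.get("agg_metric_methods")
}

# Fields usable as dimensions (buckets) in aggregate queries.
dimensions = [field["name"] for field in fields if field.get("agg_dimension")]

print(agg_methods)
print(dimensions)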

RP Aggregate Queries

Sum of clicks for each url/query pair sorted on urls by alphabetical order

curl "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -d '{
      "aggs": [
        {
          "fields": [
            "url",
            "query"
          ],
          "value": [
            {
              "field": "clicks",
              "method": "sum",
              "alias": "clicks_sum"
            }
          ],
          "oql": {
            "field": [
              "date",
              "lt",
              "2022-10-01"
            ]
          },
          "sort": {
            "field": "url",
            "order": "asc"
          }
        }
      ]
    }'
import requests

response = requests.post("https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "aggs": [
        {
          "fields": [
            "url",
            "query"
          ],
          "value": [
            {
              "field": "clicks",
              "method": "sum",
              "alias": "clicks_sum"
            }
          ],
          "oql": {
            "field": [
              "date",
              "lt",
              "2022-10-01"
            ]
          },
          "sort": {
            "field": "url",
            "order": "asc"
          }
        }
      ]
    }
).json()

The returned JSON looks like:

{
  "aggs": [
        {
            "cols": [
                "url",
                "query",
                "clicks_sum"
            ],
            "rows": [
                [
                    "https://www.oncrawl.com/",
                    "analyse backlinks free",
                    0
                ],
                [
                    "https://www.oncrawl.com/",
                    "seo servers",
                    0
                ],
        ...
      ]
    }
  ]
}

Sum of impressions for each url where it is greater than 2000 sorted by descending order

curl "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -d '{
      "aggs": [
        {
          "fields": [
            "url"
          ],
          "oql": {
            "field": [
              "date",
              "lt",
              "2022-10-01"
            ]
          },
          "value": [
            {
              "field": "impressions",
              "method": "sum",
              "alias": "impressions_sum"
            }
          ],
          "sort": {
            "field": "impressions_sum",
            "order": "desc"
          },
          "post_aggs_oql": {
            "field": [
              "impressions_sum",
              "gt",
              2000
            ]
          }
        }
      ]
    }'
import requests

response = requests.post("https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "aggs": [
        {
          "fields": [
            "url"
          ],
          "oql": {
            "field": [
              "date",
              "lt",
              "2022-10-01"
            ]
          },
          "value": [
            {
              "field": "impressions",
              "method": "sum",
              "alias": "impressions_sum"
            }
          ],
          "sort": {
            "field": "impressions_sum",
            "order": "desc"
          },
          "post_aggs_oql": {
            "field": [
              "impressions_sum",
              "gt",
              2000
            ]
          }
        }
      ]
    }
).json()

The returned JSON looks like:

{
    "aggs": [
        {
            "cols": [
                "url",
                "impressions_sum"
            ],
            "rows": [
                [
                    "https://www.oncrawl.com/oncrawl-seo-thoughts/12-great-tools-for-keyword-tracking-campaigns/",
                    2445803
                ],
                [
                    "https://www.oncrawl.com/technical-seo/submit-website-bing-webmaster-tools/",
                    2256877
                ],
        ...
      ]
    }
  ]
}

HTTP Request

This HTTP endpoint expects a JSON object as its payload with a single aggs key and an array of aggregate queries as its value.

An aggregate query is an object with the following properties:

Property Description
fields
optional
Specify how to create buckets of matching items.
oql
optional
An OnCrawl Query Language object to filter on fields.
value
optional
Specify how to aggregate matching items.
By default it will return the number of matching items.
sort
optional
Ordering of the returned matching results.
post_aggs_oql
optional
An OnCrawl Query Language object to filter on metric aggregations.
limit
optional
Maximum number of matching results to return.
offset
optional
An offset for the returned matching results.
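For instance, limit and offset can be combined with the query shown earlier to page through large sets of buckets (a minimal sketch):

import requests

# Page through url/query buckets 1,000 rows at a time using limit and offset.
query = {
    "fields": ["url", "query"],
    "value": [{ "field": "clicks", "method": "sum", "alias": "clicks_sum" }],
    "oql": { "field": ["date", "lt", "2022-10-01"] },
    "sort": { "field": "url", "order": "asc" },
    "limit": 1000,
    "offset": 0
}

response = requests.post(
    "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={ "aggs": [query] }
).json()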

How to aggregate on items

Url cardinality for each query by cardinalities in descending order and by query in alphabetical order

curl "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -d '{
      "aggs": [
        {
          "fields": [
            "query"
          ],
          "value": [
            {
              "field": "url",
              "method": "cardinality",
              "alias": "url_cardinality"
            }
          ],
          "oql": {
            "field": [
              "date",
              "lt",
              "2022-10-01"
            ]
          },
          "sort": [
            {
              "field": "url_cardinality",
              "order": "desc"
            },
            {
              "field": "query",
              "order": "asc"
            }
          ]
        }
      ]
    }'
import requests

response = requests.post("https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "aggs": [
        {
          "fields": [
            "query"
          ],
          "value": [
            {
              "field": "url",
              "method": "cardinality",
              "alias": "url_cardinality"
            }
          ],
          "oql": {
            "field": [
              "date",
              "lt",
              "2022-10-01"
            ]
          },
          "sort": [
            {
              "field": "url_cardinality",
              "order": "desc"
            },
            {
              "field": "query",
              "order": "asc"
            }
          ]
        }
      ]
    }
).json()

The returned JSON looks like:

{
    "aggs": [
        {
            "cols": [
                "query",
                "url_cardinality"
            ],
            "rows": [
                [
                    "site:oncrawl.com",
                    370
                ],
                [
                    "site:www.oncrawl.com",
                    279
                ],
        ...
      ]
    }
  ]
}

Each entry of the value parameter uses the format {"field": <field_name>, "method": <aggregation_type>, "alias": <alias>}.

For example:

However, not all fields can be aggregated, and not all aggregation methods are available for every field.

To know which aggregations are available on a field you can check the agg_metric_methods value returned by the Data Schema endpoint.

The available methods are:

sum
Returns the sum of all the values for this field.
cardinality
Returns the number of different values for this field.
weighted_average
Returns the weighted average for this field (available only for position).
ctr
Returns the click-through rate (available only for clicks).

When performing an aggregation, you can create buckets for your matching items using the fields parameter which takes an array of JSON objects.

The simplest way is to use the field’s name.

But not all fields can be used to create a bucket.

To know which fields are available as a bucket you can check the agg_dimension value returned by the Data Schema endpoint.
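As another sketch, the weighted_average method (available only for position, as noted above) can be combined with a query bucket:

import requests

# Weighted average position per query, filtered on dates.
response = requests.post(
    "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "aggs": [{
        "fields": ["query"],
        "value": [{ "field": "position", "method": "weighted_average", "alias": "avg_position" }],
        "oql": { "field": ["date", "lt", "2022-10-01"] }
      }]
    }
).json()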

RP Export queries

HTTP Request

Example of async-export request

curl "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/async-export" \
  -H "Authorization: Bearer {ACCESS_TOKEN}" \
  -d '{
    "request": {
        "oql": {
            "and": [
                {
                    "field": [
                        "date",
                        "between",
                        [
                            "2023-09-18",
                            "2023-11-02"
                        ]
                    ]
                },
                {
                    "field": [
                        "url",
                        "equals",
                        "https://www.oncrawl.com"
                    ]
                }
            ]
        },
        "post_aggs_oql": {
            "field": [
                "nb_of_ranking_queries",
                "gt",
                0
            ]
        },
        "value": [
            {
                "field": "url",
                "method": "cardinality",
                "alias": "nb_of_ranking_pages"
            },
            {
                "field": "query",
                "method": "cardinality",
                "alias": "nb_of_ranking_queries"
            }
        ],
        "fields": [
            {
                "name": "query"
            }
        ]
    },
    "output_format": "csv",
    "output_format_parameters": {
        "csv_delimiter": ","
    }
}'
import requests

response = requests.post("https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/async-export",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
    "request": {
        "oql": {
            "and": [
                {
                    "field": [
                        "date",
                        "between",
                        [
                            "2023-09-18",
                            "2023-11-02"
                        ]
                    ]
                },
                {
                    "field": [
                        "url",
                        "equals",
                        "https://www.oncrawl.com"
                    ]
                }
            ]
        },
        "post_aggs_oql": {
            "field": [
                "nb_of_ranking_queries",
                "gt",
                0
            ]
        },
        "value": [
            {
                "field": "url",
                "method": "cardinality",
                "alias": "nb_of_ranking_pages"
            },
            {
                "field": "query",
                "method": "cardinality",
                "alias": "nb_of_ranking_queries"
            }
        ],
        "fields": [
            {
                "name": "query"
            }
        ]
    },
    "output_format": "csv",
    "output_format_parameters": {
        "csv_delimiter": ","
    }
}
).json()

The returned JSON looks like:

{
    "data_export": {
        "data_type": "keyword",
        "expiration_date": 1701526732000,
        "export_failure_reason": null,
        "id": "00000020f51bb4362eee2a4c",
        "output_format": "csv",
        "output_format_parameters": null,
        "output_row_count": null,
        "output_size_in_bytes": null,
        "requested_at": 1698934732000,
        "resource_id": "00000020f51bb4362eee2a4d",
        "status": "REQUESTED",
        "target": "download_center",
        "target_parameters": {}
    }
}

When performing an async-export request, the request oql must contain a filter on dates.

The available “request” properties can be found in the RP Aggregate Queries section.

The other properties are:

Property Description
output_format
Specify the output format. Can be either “csv” or “json”.
output_format_parameters
optional
Specify parameters for the output format. A property csv_delimiter can be defined. The supported values for that property are “,”, “;”, “\t”.
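The export is processed asynchronously (its status starts as REQUESTED). A hedged sketch of following its progress, assuming the returned id can be looked up with the Account API’s Get a data export endpoint described later in this document:

import time
import requests

headers = { 'Authorization': 'Bearer {ACCESS_TOKEN}' }

# export_id comes from the async-export response above.
export_id = response["data_export"]["id"]

while True:
    export = requests.get(
        "https://app.oncrawl.com/api/v2/account/data_exports/" + export_id,
        headers=headers
    ).json()["data_export"]
    if export["status"] == "DONE" or export["export_failure_reason"]:
        break
    time.sleep(30)  # arbitrary interval

print(export["status"], export["export_failure_reason"])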

Account API

The Account API allows you to manage account settings and data.

With this API you can, for example:

Secrets

List secrets

Get a list of secrets.

curl "https://app.oncrawl.com/api/v2/account/secrets" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

secrets = requests.get("https://app.oncrawl.com/api/v2/account/secrets",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Request

The secrets can be paginated and filtered using the parameters described in the pagination section.

The fields available for the sort and filters are:

Property Description
name Name of the secret
type Can be gcs_credentials or s3_credentials.

HTTP Response

{
   "meta":{
      "filters":{},
      "limit":100,
      "offset":0,
      "sort":null,
      "total":1
   },
   "secrets": [
      "<Secret Object>",
      "<Secret Object>"
   ]
}

A JSON object with a meta key, described in the pagination section, and a secrets key containing the list of secrets.

Create a secret

Create a secret.

curl -X POST "https://app.oncrawl.com/api/v2/account/secrets" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
      "secret": {
        "name": "secret_name",
        "type": "gcs_credentials",
        "value": "secret value"
      }
    }
EOF
import requests

requests.post("https://app.oncrawl.com/api/v2/account/secrets", json={
    "secret": {
      "name": "secret_name",
      "type": "gcs_credentials",
      "value": "secret value"
    }
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP Request

Property Description
name Must be unique and match ^[a-z-A-Z][a-zA-Z0-9_-]{2,63}$.
type Can be gcs_credentials or s3_credentials.
value The secret’s value is a JSON string encoded in base64.
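For example, to create a gcs_credentials secret you could base64-encode a credentials JSON file before sending it (a minimal sketch; the file name is a placeholder and it assumes your GCS credentials are stored as a JSON key file):

import base64
import requests

# Base64-encode the JSON credentials to use as the secret's value.
with open("<service_account_key>.json", "rb") as f:
    encoded_value = base64.b64encode(f.read()).decode("ascii")

requests.post("https://app.oncrawl.com/api/v2/account/secrets", json={
    "secret": {
      "name": "my_gcs_secret",
      "type": "gcs_credentials",
      "value": encoded_value
    }
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)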

Secret value

HTTP Response

Examples of HTTP response

{
  "secret": {
    "creation_date": 1623320422000,
    "id": "6039166bcde251bfdcf624aa",
    "name": "my_secret",
    "owner_id": "60c1e79b554c6c975f218bad",
    "type": "gcs_credentials"
  }
}

An HTTP 200 status code is returned, with the created secret returned directly as the response within a secret key.

Delete a secret

Delete a secret.

curl -X DELETE "https://app.oncrawl.com/api/v2/account/secrets/<secret_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

requests.delete("https://app.oncrawl.com/api/v2/account/secrets/<secret_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

No HTTP parameters.

HTTP Response

Returns an HTTP 204 status code if successful.

Data exports

List data exports

Get a list of data exports.

curl "https://app.oncrawl.com/api/v2/account/data_exports" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

account = requests.get("https://app.oncrawl.com/api/v2/account/data_exports",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Request

The data exports can be paginated and filtered using the parameters described in the pagination section.

The fields available for the sort and filters are:

Property Description
id The export ID.
status The status of the export.
requested_at The date of the export request.
created_at The creation date of the export.
size_bytes The size of the export in bytes.
row_count The number of produced rows.
data_type The data type of the export, can be page, link.
output_format The output format, can be json, csv, parquet.
resource_id The ID of the crawl object that was exported.
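As a sketch, you can also filter the list client-side, assuming each entry of the account list is a Data Export Object as described in the next section:

import requests

# List data exports and keep only the finished ones (client-side filtering).
exports = requests.get("https://app.oncrawl.com/api/v2/account/data_exports",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json().get("account", [])

done_exports = [e for e in exports if e["status"] == "DONE"]
for export in done_exports:
    print(export["id"], export["output_format"], export["output_row_count"])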

HTTP Response

{
   "meta":{
      "filters":{},
      "limit":100,
      "offset":0,
      "sort":null,
      "total":1
   },
   "account": [
      "<Data Export Object>",
      "<Data Export Object>"
   ]
}

A JSON object with a meta key, described in the pagination section, and an account key containing the list of data exports.

Get a data export

Get a data export.

curl "https://app.oncrawl.com/api/v2/account/data_exports/<data_export_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

export = requests.get("https://app.oncrawl.com/api/v2/account/data_exports/<data_export_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Response

{
  "data_export": {
    "data_type": "page",
    "export_failure_reason": null,
    "id": "6039166bcde251bfdcf624aa",
    "output_format": "parquet",
    "output_format_parameters": null,
    "output_row_count": 1881.0,
    "output_size_in_bytes": 517706.0,
    "requested_at": 1614354027000,
    "resource_id": "5feeac5c5567fd69557a1855",
    "status": "DONE",
    "target": "gcs",
    "target_parameters": {}
  }
}

The data_export’s properties are:

Property Description
id The unique identifier of the export.
data_type The data type of the export, can be page, link.
export_failure_reason The reason for the export failure, if any.
output_format The output format, can be csv, json, parquet.
output_format_parameters Parameters for the output format; currently the delimiter to use when the format is CSV.
output_row_count Number of items that were exported.
output_size_in_bytes Total size of the exported data in bytes.
requested_at The date of the export request.
resource_id The ID of the crawl object that was exported.
status The status of the export.
target The destination, can be gcs, s3.
target_parameters An object with the configuration for the selected target.

Create a data export

S3

Create an S3 data export.

curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
      "data_export": {
        "data_type": "page",
        "resource_id": "666f6f2d6261722d71757578",
        "output_format": "csv",
        "output_format_parameters": {
          "csv_delimiter": ";"
        },
        "target": "s3",
        "target_parameters": {
          "s3_credentials": "secrets://60c1dc72d61c55b9a313e5b4/my_secret",
          "s3_bucket": "my-bucket",
          "s3_prefix": "some-prefix",
          "s3_region": "us-west-2"
        },
        "include_all_page_group_lists": true
      }
    }
EOF
import requests

requests.post("https://app.oncrawl.com/api/v2/account/data_exports", json={
    "data_export": {
      "data_type": "page",
      "resource_id": "666f6f2d6261722d71757578",
      "output_format": "csv",
      "output_format_parameters": {
        "csv_delimiter": ";"
      },
      "target": "s3",
      "target_parameters": {
        "s3_credentials": "secrets://60c1dc72d61c55b9a313e5b4/my_secret",
        "s3_bucket": "my-bucket",
        "s3_prefix": "some-prefix",
        "s3_region": "us-west-2"
      },
      "include_all_page_group_lists": True
    }
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

The target parameters for S3 buckets are:

Property Description
s3_credentials URI of the secret.
s3_bucket Name of the bucket to upload data to.
s3_region Valid S3 region where the bucket is located.
s3_prefix Path on the bucket where the files will be uploaded.

GCS

To export data to a GCS bucket, you must allow our service account to write to the desired bucket.

Our service account is `oncrawl-data-transfer@oncrawl.iam.gserviceaccount.com`.

You MUST grant the following roles to our service account (with IAM):

Create a GCS data export.

curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
      "data_export": {
        "data_type": "page",
        "resource_id": "666f6f2d6261722d71757578",
        "output_format": "csv",
        "output_format_parameters": {
          "csv_delimiter": ";"
        },
        "target": "gcs",
        "target_parameters": {
          "gcs_bucket": "test_bucket",
          "gcs_prefix": "some_bucket_prefix"
        },
        "include_all_page_group_lists": true
      }
    }
EOF
import requests

requests.post("https://app.oncrawl.com/api/v2/account/data_exports", json={
    "data_export": {
      "data_type": "page",
      "resource_id": "666f6f2d6261722d71757578",
      "output_format": "csv",
      "output_format_parameters": {
        "csv_delimiter": ";"
      },
      "target": "gcs",
      "target_parameters": {
        "gcs_bucket": "test_bucket",
        "gcs_prefix": "some_bucket_prefix"
      },
      "include_all_page_group_lists": True
    }
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

The target parameters for GCS buckets are:

Property Description
gcs_bucket Name of the bucket to upload data to.
gcs_prefix Path on the bucket where the files will be uploaded.

HTTP request

Property Description
data_type The data type of the export, can be page, link.
output_format The output format, can be csv, json, parquet.
output_format_parameters Parameters for the output format; currently the delimiter to use when the format is CSV.
resource_id The ID of the crawl object to export.
target The destination, can be gcs, s3.
target_parameters An object with the configuration for the selected target.
page_group_lists_included A list of segmentations to export.
include_all_page_group_lists Whether or not all segmentations should be exported, can be true or false (overrides page_group_lists_included if true, defaults to false).

HTTP Response

If successful, the API responds with the same output as the Get a data export endpoint.

Fields

This is the list of OnCrawl fields that are exported while using our Data Studio connector.

They are listed below by category. For each field you’ll find the following information: name, definition, type and arity.

The OnCrawl field type can be one of the following:

Type Definition
integer integer number
natural non-negative integer (>= 0)
float floating-point number
percentage floating-point number between 0 and 1
string a sequence of characters, text
enum a string from a defined list of values
bool boolean (true or false)
datetime a timestamp, in the following format: yyyy/MM/dd HH:mm:ss z
date a date in the following format: yyyy-MM-dd
object a raw JSON object
hash a hashed string (the output of a hash function)

Note: These types are the ones exposed by the OnCrawl API. The underlying storage for the Data Studio connector may use slightly different type names or date/time formats.

There are two possible values for arity:

Arity Definition
one the field holds a single value
many the field holds a list of values

Content

Field name Definition Type Arity
language Language code, in ISO 639 two-letter format. Either parsed from the HTML or detected from the text. string one
text_to_code Number of text characters divided by the total number of characters in the HTML. percentage one
word_count The number of words on the page. natural one

Duplicate content

Field name Definition Type Arity
clusters OnCrawl IDs of the groups of URLs with similar content that this URL belongs to. hash many
nearduplicate_content Whether this page's content is very similar to another page, according to our SimHash-based algorithm. bool one
nearduplicate_content_similarity Highest ratio of content similarity, compared to other pages in the cluster. percentage one
duplicate_description_status Status of duplication issues for the group of pages with the same meta description as this page: canonical_ok (duplication is correctly handled using canonical declarations), hreflang_ok (duplication is correctly handled using hreflang declarations), canonical_not_matching (canonical declarations within the group do not match), hreflang_error (the implementation of hreflang declarations within the group has errors), canonical_not_set (no hreflang or canonical declarations) enum one
duplicate_h1_status Status of duplication issues for the group of pages with the same H1 as this page: canonical_ok (duplication is correctly handled using canonical declarations), hreflang_ok (duplication is correctly handled using hreflang declarations), canonical_not_matching (canonical declarations within the group do not match), hreflang_error (the implementation of hreflang declarations within the group has errors), canonical_not_set (no hreflang or canonical declarations) enum one
duplicate_title_status Status of duplication issues for the group of pages with the same title tag as this page: canonical_ok (duplication is correctly handled using canonical declarations), hreflang_ok (duplication is correctly handled using hreflang declarations), canonical_not_matching (canonical declarations within the group do not match), hreflang_error (the implementation of hreflang declarations within the group has errors), canonical_not_set (no hreflang or canonical declarations) enum one
has_duplicate_description_issue Whether there are duplications of this page's meta description on other pages that are not handled using canonical or hreflang declarations. 'true' if problems remain and 'false' if correctly handled or if there's no duplication. bool one
has_duplicate_h1_issue Whether there are duplications of this page's H1 on other pages that are not handled using canonical or hreflang declarations. 'true' if problems remain and 'false' if correctly handled or if there's no duplication. bool one
has_duplicate_title_issue Whether there are duplications of this page's title tag on other pages that are not handled using canonical or hreflang declarations. 'true' if problems remain and 'false' if correctly handled or if there's no duplication. bool one
has_nearduplicate_issue Whether the similarity of this page with other pages is handled using canonical or hreflang descriptions. 'true' if problems remain and 'false' if correctly handled or if there's no duplication. bool one
nearduplicate_status Status of duplication issues for the group of pages with similar content to this one: canonical_ok (duplication is correctly handled using canonical declarations), hreflang_ok (duplication is correctly handled using hreflang declarations), canonical_not_matching (canonical declarations within the group do not match), hreflang_error (the implementation of hreflang declarations within the group has errors), canonical_not_set (no hreflang or canonical declarations) enum one

Hreflang errors

Field name Definition Type Arity
hreflang_cluster_id OnCrawl ID of the group of pages that reference one another through hreflang declarations. hash one

Indexability

Field name Definition Type Arity
meta_robots List of values in the meta robots tag. string many
meta_robots_follow Whether the links on the page should be followed (true) or not (false) according to the meta robots. bool one
meta_robots_index Whether the page should be indexed (true) or not (false) according to the meta robots. bool one
robots_txt_denied Whether the crawler was denied by the robots.txt file while visiting this page. bool one

Linking & popularity

Field name Definition Type Arity
depth Page depth in number of clicks from the crawl's Start URL. natural one
external_follow_outlinks Number of followable outlinks to pages on other domains. natural one
external_nofollow_outlinks Number of nofollow outlinks to pages on other domains. natural one
external_outlinks Number of outlinks to other domains. natural one
external_outlinks_range Range of the number of outlinks to pages on other domains: 0-50, 50-100, 100-150, 150-200, >200 enum one
follow_inlinks Number of followable links pointing to a URL from other pages on the same site. natural one
inrank Whole number from 0-10 indicating the URL's relative PageRank within the site. Higher numbers indicate better popularity. natural one
inrank_decimal Decimal number from 0-10 indicating the URL's relative PageRank within the site. Higher numbers indicate better popularity. float one
internal_follow_outlinks Number of followable links from this page to other pages on the same site. natural one
internal_nofollow_outlinks Number of nofollow links from this page to other pages on the same site. natural one
internal_outlinks Number of outlinks to pages on the same site. natural one
internal_outlinks_range Range of the number of outlinks to pages on the same site: 0-50, 50-100, 100-150, 150-200, >200 enum one
nb_inlinks Number of links pointing to this page from other pages on this site. natural one
nb_inlinks_range Range of values that the number of links to this page from other pages on the site falls into. Ranges are: 0-50, 50-100, 100-150, 150-200, >200 enum one
nb_outlinks_range Range of values that the number of links from this page falls into. Ranges are: 0-50, 50-100, 100-150, 150-200, >200 enum one
nofollow_inlinks Number of links pointing to this page with a rel="nofollow" tag. natural one

OnCrawl bot

Field name Definition Type Arity
fetch_date Date on which the OnCrawl bot obtained the URL's source code expressed as yyyy/MM/dd HH:mm:ss z datetime one
fetch_status Whether the OnCrawl bot successfully obtained the URL's source code. Indicates "success" when true. string one
fetched Whether the OnCrawl bot obtained the URL's source code (true) or not (false). bool one
parsed_html Whether the OnCrawl bot was able to obtain an HTTP status and textual content for this page. bool one
sources List of sources for this page. Sources may be: OnCrawl bot, at_internet, google_analytics, google_search_console, ingest_data, logs_cross_analysis, majestic, adobe_analytics, sitemaps string many

Payload

Field name Definition Type Arity
load_time Time (in milliseconds) it took to fetch the entire HTML of the page, excluding external resources. Also known as "time to last byte" (TTLB). natural one
weight The size of the page in KB, excluding resources. natural one

Core Web Vitals

Field name Definition Type Arity
cwv_bytes_saving The number of bytes which can be saved by optimising the page natural one
cwv_cls Cumulative Layout Shift reported by Lighthouse float one
cwv_fcp First Contentful Paint reported by Lighthouse natural one
cwv_lcp Largest Contentful Paint reported by Lighthouse natural one
cwv_performance_score Performance score reported by Lighthouse float one
cwv_si Speed index reported by Lighthouse natural one
cwv_tbt Total blocking time reported by Lighthouse natural one
cwv_tti Time to interactive reported by Lighthouse natural one
cwv_time_saving The time which can be saved by optimising the page natural one

Redirect chains & loops

Field name Definition Type Arity
final_redirect_location Final URL reached after following a chain of one or more 3xx redirects. string one
final_redirect_status HTTP status code of the final URL reached after following a chain of one or more 3xx redirects. natural one
is_redirect_loop Whether the chain of redirects loops back to a URL in the chain. bool one
is_too_many_redirects Whether the chain contains more than 16 redirects. bool one
redirect_cluster_id The OnCrawl ID of this page's redirect cluster. The redirect cluster is the group of pages found in all branches of a redirect chain or loop. hash one
redirect_count Number of redirects needed from this page to reach the final target in the redirect chain. natural one

Rel alternate

Field name Definition Type Arity
canonical_evaluation Canonical status of the URL: matching (declares itself as canonical), not_matching (declares a different page as canonical), not_set (has no canonical declaration) enum one
rel_canonical URL declared in the rel canonical tag. string one
rel_next URL declared in the rel next tag. string one
rel_prev URL declared in the rel prev tag. string one

Scraping

Field name Definition Type Arity
custom_qsdd Custom field created through user-defined scraping rules. string one

SEO tags

Field name Definition Type Arity
description_evaluation Duplication status of the page's meta description: unique, duplicated (another URL has the same meta description), not_set enum one
description_length Length of the URL's meta description in number of characters. natural one
description_length_range Evaluation of the URL's meta description length: perfect (135-159), good (110-134 or 160-169), too short (<110), too long (>=170) enum one
h1 First H1 on the page. string one
h1_evaluation Duplication status of the page's H1 text: unique, duplicated (another URL has the same H1), not_set enum one
meta_description Meta description for this page. string one
num_h1 Number of H1 tags on this page. natural one
num_h2 Number of H2 tags on this page. natural one
num_h3 Number of H3 tags on this page. natural one
num_h4 Number of H4 tags on this page. natural one
num_h5 Number of H5 tags on this page. natural one
num_h6 Number of H6 tags on this page. natural one
num_img Number of images on this page. natural one
num_img_alt Number of image 'alt' attributes on this page. natural one
num_img_range Whether the page contains no images, one image, or more than one. enum one
num_missing_alt Number of missing 'alt' attributes for images on this page. natural one
semantic_item_count Number of semantic tags on the page. natural one
semantic_types List of semantic tags found on the page. string many
title Page title found in the <title> tag. string one
title_evaluation Duplication status of the title tag: unique, duplicated (another page has the same title), not_set. enum one
title_length Length of the title tag in characters. natural one

Sitemaps

Field name Definition Type Arity
sitemaps_file_origin List of URLs of the Sitemaps files where this page was found. string many
sitemaps_num_alternate Number of alternates to this page that were found in the sitemaps. natural one
sitemaps_num_images Number of images for this page that were found in the sitemaps. natural one
sitemaps_num_news Number of news publications for this page that were found in the sitemaps. natural one
sitemaps_num_videos Number of videos for this page that were found in the sitemaps. natural one

Status code

Field name Definition Type Arity
redirect_location URL this page redirects to. string one
status_code HTTP status code returned by the server when crawling the page. natural one
status_code_range HTTP status code class. Classes are: ok, redirect, client_error, server_error. enum one

URL

Field name Definition Type Arity
querystring_key List of keys found in the querystring of this page's URL. string one
querystring_keyvalue List of key-value pairs found in the querystring of this page's URL. string one
url Full URL including the protocol (https://). string one
url_ext URL's file extension. string one
url_first_path First directory following the URL's domain, or / if there is no directory. string one
url_has_params Whether the URL has query parameters. bool one
url_host Hostname or subdomain found in the URL. string one
urlpath URL path. string one