Private cloud integration

Before adding your cloud data to a dataset, you need to integrate your cloud storage with Encord. Please see the Data integrations section to learn how to create integrations for AWS S3, Azure blob, GCP storage, or Open Telekom Cloud.

To add your cloud data to a dataset:

Turn on the Private cloud toggle in the 'Create dataset' part of the data creation flow when creating a new dataset.
Select the relevant integration using the Select integration drop down.

Upload an appropriately formatted JSON or CSV file specifying the data you would like to add to the dataset.

note

Please see below on how to format an appropriate JSON or CSV file.

Click the upload area, or drag-and-drop your files from your storage container. Your stored objects may contain files which are not supported by Encord and which may produce errors on upload - toggle the Ignore individual file errors toggle to ignore these. Click Add data when you're ready.

note

The data will now be fetched from your cloud storage and processed asynchronously. This processing involves fetching appropriate metadata and other file information to help us render the files appropriately and to check for any framerate inconsistencies. We do not store your files in any way.

Checking upload status

You can check the progress of the processing job by clicking in the top right. A spinning progress indicator will indicate the processing job is still in progress.

If successful, the processing will complete with a success icon.
If unsuccessful, an error icon will be shown instead.

If this is the case, please check that your provider permissions have been set correctly, that the object data format is supported, and that the JSON or CSV file is correctly formatted.

Check which files failed to upload by clicking the icon to download a CSV log file. Every row in the CSV will correspond to a file which failed to be uploaded.

note

You will only see failed uploads if the Ignore individual file errors toggle wasn't enabled when uploading your data.

JSON format

The JSON file format is a JSON object with top-level keys specifying the type of data and object URLs of the content you wish to add to the dataset. Object URLs must not contain any whitespace. You can add one data type at a time, or combine multiple data types in one JSON file according to your preferences or development flows. The supported top-level keys are: videos, image_groups, image_sequences, images, and dicom_series. The details for each data format are given in the sections below.
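
As a rough pre-upload check, the rules above (supported top-level keys, no whitespace in object URLs) can be sketched in Python. This is a minimal sketch, not an Encord tool: the function name find_spec_problems is hypothetical, and it only checks the two rules stated in this section.

```python
# Top-level data keys listed in this section.
SUPPORTED_KEYS = {"videos", "image_groups", "image_sequences", "images", "dicom_series"}
# Top-level flags that may appear next to the data keys.
SUPPORTED_FLAGS = {"skip_duplicate_urls"}


def find_spec_problems(spec):
    """Return a list of human-readable problems found in an upload spec dict."""
    problems = []
    for key, value in spec.items():
        if key in SUPPORTED_FLAGS:
            continue
        if key not in SUPPORTED_KEYS:
            problems.append(f"unsupported top-level key: {key}")
            continue
        for index, item in enumerate(value):
            for field, field_value in item.items():
                # Applies to both objectUrl and objectUrl_{position_number} fields.
                if field.startswith("objectUrl") and any(ch.isspace() for ch in str(field_value)):
                    problems.append(f"{key}[{index}].{field} contains whitespace")
    return problems
```

Load your file with json.load and pass the resulting dict to the function; an empty list means neither rule was violated.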

Videos

Each object in the videos array is a JSON object with the key objectUrl specifying the full URL of where to find the video resource. The title field is optional. If not specified, the video's file name will be used.

Video metadata (separate from client metadata) may be specified for videos. See the 'Specifying video metadata' section below.
If skip_duplicate_urls is set to true, videos that have been previously uploaded to the dataset with the same object URL will be skipped.

Key or Flag	Required?	Default value
"objectUrl"	Yes
"title"	No	<`file title`>
"clientMetadata"	No
"skip_duplicate_urls"	No	false
"createVideo"	No	false

See the sample below.

{
  "videos": [
    {
      "objectUrl": "<object url_1>"
    },
    {
      "objectUrl": "<object url_2>",
      "title": "my-custom-video-title.mp4",
      "clientMetadata": {"optional": "metadata"}
    }
  ],
  "skip_duplicate_urls": true
}

Single images

The JSON structure for single images parallels that of videos.

The title field is optional.
If not specified, the file name of the image will be used.
If skip_duplicate_urls is set to true, images that have been previously uploaded to the dataset with the same object URL will be skipped.

Key or Flag	Required?	Default value
"objectUrl"	Yes
"title"	No	<`file title`>
"clientMetadata"	No
"skip_duplicate_urls"	No	false
"createVideo"	No	false

See the sample below.

{
  "images": [
    {
      "objectUrl": "<object url>"
    },
    {
      "objectUrl": "<object url>",
      "title": "my-custom-image-title.jpeg",
      "clientMetadata": {"optional": "metadata"}
    }
  ]
}

Image groups

Image groups are collections of images that are processed as one annotation task.
Images within image groups remain unaltered, meaning that images of different sizes and resolutions can form an image group without the loss of data.
Image groups do not require 'write' permissions to your cloud storage.
Custom client metadata is defined per image group, not per image.
If skip_duplicate_urls is set to true, image groups where all object URLs exactly match an existing image group in the dataset will be skipped.

Key or Flag	Required?	Default value
"objectUrl"	Yes
"title"	Yes	<`file title`>
"clientMetadata"	No
"skip_duplicate_urls"	No	false
"createVideo"	No	false

note

The position of each image within the sequence needs to be specified in the key - e.g. objectUrl_{position_number} as seen in the example below.

{
  "image_groups": [
    {
      "title": "<title 1>",
      "createVideo": false,
      "objectUrl_0": "<object url>"
    },
    {
      "title": "<title 2>",
      "createVideo": false,
      "objectUrl_0": "<object url>",
      "objectUrl_1": "<object url>",
      "objectUrl_2": "<object url>",
      "clientMetadata": {"optional": "metadata"}
    }
  ]
}
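
If you generate these entries from a list of URLs, the numbered keys can be produced programmatically. The helper below is a minimal sketch; build_image_group is a hypothetical name, and it assumes position numbers start at 0 and follow the order of the list, as in the example above.

```python
def build_image_group(title, object_urls, client_metadata=None):
    """Build one image_groups entry with objectUrl_{position_number} keys."""
    if not object_urls:
        raise ValueError("an image group needs at least one object URL")
    group = {"title": title, "createVideo": False}
    # The list index becomes the position number in the key.
    for position, url in enumerate(object_urls):
        group[f"objectUrl_{position}"] = url
    if client_metadata is not None:
        # Client metadata is attached to the group, not to individual images.
        group["clientMetadata"] = client_metadata
    return group
```

Wrapping the results as {"image_groups": [group_1, group_2]} and writing them with json.dump gives a file of the shape shown above.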

Image sequences

Image sequences are collections of images that are processed as one annotation task and represented as a video.
Images within image sequences may be altered, as images of varying sizes and resolutions are made to match that of the first image in the sequence.
Creating Image sequences from cloud storage requires 'write' permissions, as new files have to be created in order to be read as a video.
Each object in the image_sequences array represents a single image sequence.
Custom client metadata is defined per image sequence, not per image.
If skip_duplicate_urls is set to true, image sequences where all object URLs exactly match an existing image sequence in the dataset will be skipped.

Key or Flag	Required?	Default value
"objectUrl"	Yes
"title"	Yes	<file title>
"clientMetadata"	No
"skip_duplicate_urls"	No	false
"createVideo"	Yes	false

note

The position of each image within the sequence needs to be specified in the key - e.g. objectUrl_{position_number}. See the example below.

{
  "image_sequences": [
    {
      "title": "<title 1>",
      "createVideo": true,
      "objectUrl_0": "<object url>"
    },
    {
      "title": "<title 2>",
      "createVideo": true,
      "objectUrl_0": "<object url>",
      "objectUrl_1": "<object url>",
      "objectUrl_2": "<object url>",
      "clientMetadata": {"optional": "metadata"}
    }
  ]
}

DICOM

Like image_groups and image_sequences, the dicom_series elements require a title and a sequenced object URL. See the sample below.
Custom client metadata is defined per DICOM series.
If skip_duplicate_urls is set to true, DICOM series where all object URLs exactly match an existing DICOM series in the dataset will be skipped.

Key or Flag	Required?	Default value
"objectUrl"	Yes
"title"	Yes	<file title>
"clientMetadata"	No
"skip_duplicate_urls"	No	false
"createVideo"	Yes	false

{
  "dicom_series": [
    {
      "title": "<title 1>",
      "objectUrl_0": "<object url>"
    },
    {
      "title": "<title 2>",
      "objectUrl_0": "<object url>",
      "objectUrl_1": "<object url>",
      "objectUrl_2": "<object url>",
      "clientMetadata": {"optional": "metadata"}
    }
  ]
}

Multiple file types

You can upload multiple file types using a single JSON file. The example below shows 1 image, 2 videos, 2 image sequences, and 1 image group.

{
  "images": [
    {
      "objectUrl": "https://cord-dev.s3.eu-west-2.amazonaws.com/Image1.png"
    }
  ],
  "videos": [
    {
      "objectUrl": "https://cord-dev.s3.eu-west-2.amazonaws.com/Cooking.mp4"
    },
    {
      "objectUrl": "https://cord-dev.s3.eu-west-2.amazonaws.com/Oranges.mp4"
    }
  ],
  "image_sequences": [
    {
      "title": "apple-samsung-light",
      "createVideo": true,
      "objectUrl_0": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(32).jpg",
      "objectUrl_1": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(33).jpg",
      "objectUrl_2": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(34).jpg",
      "objectUrl_3": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(35).jpg"
    },
    {
      "title": "apple-samsung-dark",
      "createVideo": true,
      "objectUrl_0": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(32).jpg",
      "objectUrl_1": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(33).jpg",
      "objectUrl_2": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(34).jpg",
      "objectUrl_3": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(35).jpg"
    }
  ],
  "image_groups": [
    {
      "title": "apple-ios-light",
      "createVideo": false,
      "objectUrl_0": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/3-IOS-4-Light+Environment/3+(32).jpg",
      "objectUrl_1": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/3-IOS-4-Light+Environment/3+(33).jpg"
    }
  ]
}

Specifying video metadata

The JSON format allows you to specify metadata for video files. Metadata is essential information used by the label editor and crucial for aligning annotations to the correct frame.

Example JSON including video metadata

{
  "videos": [
    {
      "objectUrl": "video_file.mp4",
      "videoMetadata": {
        "fps": 23.98,
        "duration": 29.09,
        "width": 1280,
        "height": 720,
        "file_size": 5468354,
        "mime_type": "video/mp4"
      }
    }
  ]
}

fps: Frames per second.
duration: Duration of the video (in seconds).
width / height: Dimensions of the video (in pixels).
file_size: The size of the file (in bytes).
mime_type: Specifies the file type extension according to the MIME standard.

When videos are supplied with video metadata, Encord assumes the metadata to be correct and our servers will neither download nor pre-process your data. This may be a particularly useful feature for customers with strict data compliance concerns.

One way to find the necessary metadata is shown below. Run the following commands in your terminal.

ffmpeg -i 'video_title.mp4' to retrieve fps, duration, width, and height.

ls -l 'video_title.mp4' to retrieve the file size (in bytes).
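
Once you have read fps, duration, width, and height from the ffmpeg output, the remaining fields can be filled in with the Python standard library. The sketch below is illustrative only: build_video_entry is a hypothetical helper, it assumes you have a local copy of the video, and it guesses the MIME type from the file extension rather than inspecting the file contents.

```python
import mimetypes
import os


def build_video_entry(object_url, local_path, fps, duration, width, height):
    """Build one videos entry, including a videoMetadata block."""
    # Guess the MIME type from the file extension (e.g. .mp4 -> video/mp4).
    mime_type, _ = mimetypes.guess_type(local_path)
    if mime_type is None:
        raise ValueError(f"could not guess a MIME type for {local_path}")
    return {
        "objectUrl": object_url,
        "videoMetadata": {
            "fps": fps,
            "duration": duration,
            "width": width,
            "height": height,
            # Same number that ls -l reports: the file size in bytes.
            "file_size": os.path.getsize(local_path),
            "mime_type": mime_type,
        },
    }
```

Because supplied metadata is assumed to be correct, double-check the values you pass in against the ffmpeg output before uploading.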

note

You can optionally add some custom client metadata per data item in the clientMetadata field. See examples below on how to add this. Note that client metadata is separate from video metadata above, and is intended as an arbitrary store of data you would like to associate with any particular file.

It is important to know that we enforce a 10MB limit on the client metadata per data item. Also, this metadata is being stored as a PostgreSQL jsonb type internally. Please read the relevant PostgreSQL docs about the jsonb type and its behaviours. For example, jsonb type will not preserve key order or duplicate keys.
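
To catch oversized client metadata before uploading, you can estimate its serialized size locally. This is a rough sketch under stated assumptions: the helper names are hypothetical, "10MB" is read conservatively as 10,000,000 bytes, and the size is measured on compact JSON text, which may differ from how the limit is measured on the server.

```python
import json

# Assumption: 10MB interpreted conservatively as 10 * 1000 * 1000 bytes.
MAX_CLIENT_METADATA_BYTES = 10 * 1000 * 1000


def client_metadata_size(metadata):
    """Return the size in bytes of the metadata serialized as compact JSON."""
    return len(json.dumps(metadata, separators=(",", ":")).encode("utf-8"))


def check_client_metadata(metadata):
    """Raise ValueError if the estimated size exceeds the assumed limit."""
    size = client_metadata_size(metadata)
    if size > MAX_CLIENT_METADATA_BYTES:
        raise ValueError(f"client metadata is {size} bytes, above the assumed limit")
    return size
```

Since jsonb does not preserve key order and keeps only the last value of a duplicated key, avoid relying on either when you read the metadata back.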

note

Add the "skip_duplicate_urls": true flag at the top level to make the uploads idempotent. Skipping URLs already in the dataset can help complete large upload operations that may have been interrupted, for example by an unstable network. Since previously processed assets don't have to be uploaded again, you can simply retry the failed operation without editing the upload specification file. This flag defaults to false.

This feature is currently only supported for the JSON uploads.

When using a Multi-Region Access Point

When using a Multi-Region Access Point for your AWS S3 buckets the JSON file will have to be slightly different than the examples provided. Instead of an object's URL, objects are specified using the ARN of the Multi-Region Access Point followed by the object name. The example below shows how video files from a Multi-Region Access Point would be specified.

{
  "videos": [
    {
      "objectUrl": "Multi-Region-Access-Point-ARN + <object name_1>"
    },
    {
      "objectUrl": "Multi-Region-Access-Point-ARN + <object name_2>",
      "title": "my-custom-video-title.mp4",
      "clientMetadata": {"optional": "metadata"}
    }
  ],
  "skip_duplicate_urls": true
}

CSV format

The CSV file format is a CSV file where the columns specify the type of data and object URLs of the content you wish to add to the dataset. Object URLs must not contain any whitespace. You can add one data type at a time, or combine multiple data types in one CSV file according to your preferences or development flows.

The details for each data format are given in the sections below.

note

We do not currently support DICOM, or single image files in the CSV format.

Videos

A CSV file containing videos should be structured with three columns with the following headings: 'ObjectURL', 'Title', and 'Create video'.

The 'ObjectURL' column contains the objectUrl. It specifies the full URL of where to find the video resource.
The 'Title' column contains the video_title. Entering a video_title is optional; if left blank, it will default to the video's file name.
The 'Create video' column contains the value false. If left blank, it will default to false.

ObjectUrl	Title	Create video
`<object url>`	`<video_title>`	`false`

Image groups

A CSV file containing image groups should be structured with three columns with the following headings: 'ObjectURL', 'Title', and 'Create video'.

The 'ObjectURL' column contains the objectUrl. It specifies the full URL of where to find the image resource.
Entering an image group title is mandatory, as it specifies the name of the image group an image should be assigned to.
The 'Create video' column contains the value false. If left blank, it will default to false.

ObjectUrl	Image group title	Create video
`<object url>`	`<image_group_title>`	`false`

note

Image groups do not require 'write' permissions.

Image sequences

A CSV file containing image sequences should be structured with three columns with the following headings: 'ObjectURL', 'Title', and 'Create video'.

The 'ObjectURL' column contains the objectUrl. It specifies the full URL of where to find the image resource.
Entering an image sequence title is mandatory, as it specifies the name of the image sequence an image should be assigned to.
The 'Create video' column contains the value true. This is the only file type for which you can't leave this column blank.

ObjectUrl	Image sequence title	Create video
`<object url>`	`<image_sequence_title>`	`true`

note

Image sequences require 'write' permissions against your storage bucket to save the compressed video.

Multiple file types

You can upload multiple file types with a single CSV file. The example below shows 2 videos, 1 single image, 2 image sequences, and an image group.

ObjectUrl	Title	Create video
`<object url>`	`<video_title 1>`
`<object url>`	`<video_title 2>`
`<object url>`	`<image_title 1>`
`<object url>`	`<image_sequence_title 1>`	`true`
`<object url>`	`<image_sequence_title 2>`	`true`
`<object url>`	`<image_group_title 1>`	`false`
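
A file like the one above can be generated with Python's csv module. The sketch below rests on several assumptions that you should verify against your own setup: the header spelling is copied from the table above ('ObjectUrl'), the file is comma-separated, and build_upload_csv is a hypothetical helper name.

```python
import csv
import io

# Assumption: header names copied from the table above.
HEADER = ["ObjectUrl", "Title", "Create video"]


def build_upload_csv(rows):
    """Return CSV text for rows of (object_url, title, create_video).

    create_video is True, False, or None to leave the column blank.
    """
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    for object_url, title, create_video in rows:
        # Object URLs must not contain any whitespace.
        if any(ch.isspace() for ch in object_url):
            raise ValueError(f"object URL contains whitespace: {object_url!r}")
        flag = "" if create_video is None else str(create_video).lower()
        writer.writerow([object_url, title, flag])
    return buffer.getvalue()
```

Rows that share a title and have 'Create video' set to true or false are intended to form one image sequence or image group; rows with a blank 'Create video' column fall back to the default of false.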

Helpful Scripts and Examples

Use the following examples and helpful scripts to quickly learn how to create JSON and CSV files formatted for the dataset creation process, by constructing the URLs from the specified path in your private storage.

AWS S3

AWS S3 object URLs can follow a few set patterns:

Virtual-hosted style: https://<bucket-name>.s3.<region>.amazonaws.com/<key-name>
Path-style: https://s3.<region>.amazonaws.com/<bucket-name>/<key-name>
S3 protocol: s3://<bucket-name>/<key-name>
Legacy: those without regions or those with s3-<region> in the URL

AWS best practice is to use Virtual-hosted style. Path-style is planned to be deprecated and the legacy URLs are already deprecated.

We support Virtual-hosted style, Path-style and S3 protocol object URLs. We recommend you use Virtual-hosted style object URLs wherever possible.
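
As an illustration of the recommended pattern, a Virtual-hosted style URL can be assembled from its parts as follows. This is a minimal sketch; virtual_hosted_url is a hypothetical helper, and it does not percent-encode the key, it only rejects results that contain whitespace.

```python
def virtual_hosted_url(bucket_name, region, key):
    """Build https://<bucket-name>.s3.<region>.amazonaws.com/<key-name>."""
    # Drop a leading slash so the URL does not contain a double slash.
    url = f"https://{bucket_name}.s3.{region}.amazonaws.com/{key.lstrip('/')}"
    # Object URLs must not contain any whitespace.
    if any(ch.isspace() for ch in url):
        raise ValueError(f"object URL contains whitespace: {url!r}")
    return url
```

For example, virtual_hosted_url('my-bucket', 'eu-west-2', 'dir/Image1.png') yields a URL in the same style as the cord-dev examples earlier on this page.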

Object URLs can be found in the Properties tab of the object in question. Navigate to AWS S3 > bucket > object > Properties to find the Object URL.

Here's a Python script which creates a JSON file for single images by constructing the URLs from the specified path in a given S3 bucket. You'll need to configure the following variables to match your setup.

region: needs to be the AWS resource region you intend to use. For S3, it's the region where your bucket is.
aws_profile: is the name of the profile in the AWS ~/.aws/credentials file. See AWS Credentials Documentation to set up the credentials file properly.
bucket_name: the name of your S3 bucket you want to pull files from.
s3_directory: the path to the directory where your files are stored inside the S3 bucket. Include all slashes except the final one. For example:

# my file is at my-bucket/some_top_level_dir/video_files/my_video.mp4
# then set s3 directory as follows
s3_directory = 'some_top_level_dir/video_files'

And the script itself:

import json

import boto3

region = 'FILL_ME_IN'
aws_profile = 'FILL_ME_IN'
bucket_name = 'FILL_ME_IN'
s3_directory = 'FILL_ME_IN'

# Path-style root URL of the bucket
domain = f's3.{region}.amazonaws.com'
root_url = f'https://{domain}/{bucket_name}'

# Create the S3 resource from the profile's session so that the
# credentials of aws_profile are actually used
session = boto3.Session(profile_name=aws_profile)
s3 = session.resource('s3')
bucket = s3.Bucket(bucket_name)

images = []
# List only the keys under s3_directory instead of the whole bucket
for object_summary in bucket.objects.filter(Prefix=f'{s3_directory}/'):
    split_key = object_summary.key.split('/')

    # Keep files located directly in s3_directory; skip nested
    # directories and the empty "directory/" placeholder key
    if split_key[-1] and '/'.join(split_key[:-1]) == s3_directory:
        object_url = f'{root_url}/{object_summary.key}'
        images.append({'objectUrl': object_url})

outer_json_dict = {
    "images": images
}

# Replace slashes so a nested s3_directory still gives a valid file name
output_name = f"upload_images_{s3_directory.replace('/', '_')}.json"
with open(output_name, 'w') as output_file:
    json.dump(outer_json_dict, output_file, indent=4)

Azure blob

{
    "videos": [
        {
            "objectUrl": "https://myaccount.blob.core.windows.net/myblob"
        },
        {
            "objectUrl": "https://myaccount.blob.core.windows.net/mycontainer/myblob.jpg"
        },
        {
            "objectUrl": "https://myaccount.blob.core.windows.net/mycontainer/myblobs/myblob.jpg"
        }
    ],
    "image_groups": [
      {
        "title": "image_group_1",
        "objectUrl_0": "https://myaccount.blob.core.windows.net/mycontainer/myblob1.jpg",
        "objectUrl_1": "https://myaccount.blob.core.windows.net/mycontainer/myblob2.jpg"
      },
      {
        "title": "image_group_2",
        "objectUrl_0": "https://myaccount.blob.core.windows.net/mycontainer/myblob3.jpg",
        "objectUrl_1": "https://myaccount.blob.core.windows.net/mycontainer/myblob4.jpg"
      }
    ]
}

GCP storage

{
    "videos": [
        {
            "objectUrl": "gs://example-url/object.mp4"
        }
    ],
    "image_groups": [
      {
        "title": "image_group_1",
        "objectUrl_0": "https://storage.cloud.google.com/example-image-bucket/object_1.jpg",
        "objectUrl_1": "https://storage.cloud.google.com/example-image-bucket/object_2.jpg"
      },
      {
        "title": "image_group_2",
        "objectUrl_0": "https://storage.cloud.google.com/example-image-bucket/object_3.jpg",
        "objectUrl_1": "https://storage.cloud.google.com/example-image-bucket/object_4.jpg"
      }
    ]
}

Open Telekom Cloud OSS

{
  "dicom_series": [
    {
      "title": "OPEN_TELEKOM_DICOM_SERIES",
      "objectUrl_0": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-0",
      "objectUrl_1": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-1",
      "objectUrl_2": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-2",
      "objectUrl_3": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-3"
    }
  ]
}