Markdown Converter
Agent skill for markdown-converter
Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication is required for data access.
Primary tool: `idc-index` (GitHub)
Check current data scale for the latest version:

    from idc_index import IDCClient

    client = IDCClient()

    # Get IDC data version
    print(client.get_idc_version())

    # Get collection count and total series
    stats = client.sql_query("""
        SELECT
            COUNT(DISTINCT collection_id) as collections,
            COUNT(DISTINCT analysis_result_id) as analysis_results,
            COUNT(DISTINCT PatientID) as patients,
            COUNT(DISTINCT StudyInstanceUID) as studies,
            COUNT(DISTINCT SeriesInstanceUID) as series,
            SUM(instanceCount) as instances,
            SUM(series_size_MB)/1000000 as size_TB
        FROM index
    """)
    print(stats)

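Core workflow:
1. Query metadata: `client.sql_query()`
2. Download: `client.download_from_selection()`
3. Visualize in the browser: `client.get_viewer_URL(seriesInstanceUID=...)`

IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance):
- Collection (`collection_id`): a dataset grouping (e.g., `tcga_luad`, `nlst`). A patient belongs to exactly one collection.
- Analysis result (`analysis_result_id`): a derived dataset (annotations, segmentations).

Use `collection_id` to find original imaging data, which may include annotations deposited along with the images; use `analysis_result_id` to find AI-generated or expert annotations.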
Key identifiers for queries:
| Identifier | Scope | Use for |
|---|---|---|
| `collection_id` | Dataset grouping | Filtering by project/study |
| `PatientID` | Patient | Grouping images by patient |
| `StudyInstanceUID` | DICOM study | Grouping of related series, visualization |
| `SeriesInstanceUID` | DICOM series | Grouping of related series, visualization |
The `idc-index` package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames.
Important: Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types; always query it when writing SQL or exploring data structure.
| Table | Row Granularity | Loaded | Description |
|---|---|---|---|
| `index` | 1 row = 1 DICOM series | Auto | Primary metadata for all current IDC data |
| `prior_versions_index` | 1 row = 1 DICOM series | Auto | Series from previous IDC releases; for downloading deprecated data |
| `collections_index` | 1 row = 1 collection | fetch_index() | Collection-level metadata and descriptions |
| `analysis_results_index` | 1 row = 1 analysis result collection | fetch_index() | Metadata about derived datasets (annotations, segmentations) |
| `clinical_index` | 1 row = 1 clinical data column | fetch_index() | Dictionary mapping clinical table columns to collections |
| `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata |
| `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy |
| `seg_index` | 1 row = 1 DICOM Segmentation series | fetch_index() | Segmentation metadata: algorithm, segment count, reference to source image series |
Auto = loaded automatically when `IDCClient()` is instantiated
fetch_index() = requires `client.fetch_index("table_name")` to load
Key columns are not explicitly labeled; the following is a subset that can be used in joins.
| Join Column | Tables | Use Case |
|---|---|---|
| `collection_id` | index, prior_versions_index, collections_index, clinical_index | Link series to collection metadata or clinical data |
| `SeriesInstanceUID` | index, prior_versions_index, sm_index, sm_instance_index | Link series across tables; connect to slide microscopy details |
| `StudyInstanceUID` | index, prior_versions_index | Link studies across current and historical data |
| `PatientID` | index, prior_versions_index | Link patients across current and historical data |
| `analysis_result_id` | index, analysis_results_index | Link series to analysis result metadata (annotations, segmentations) |
| `source_DOI` | index, analysis_results_index | Link by publication DOI |
| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
| `Modality` | index, prior_versions_index | Filter by imaging modality |
| `SeriesInstanceUID` | index, seg_index | Link segmentation series to its index metadata |
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join `seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID`) |
Note: `Subjects`, `Updated`, and `Description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).
Example joins:

    from idc_index import IDCClient

    client = IDCClient()

    # Join index with collections_index to get cancer types
    client.fetch_index("collections_index")
    result = client.sql_query("""
        SELECT i.SeriesInstanceUID, i.Modality, c.CancerTypes, c.TumorLocations
        FROM index i
        JOIN collections_index c ON i.collection_id = c.collection_id
        WHERE i.Modality = 'MR'
        LIMIT 10
    """)

    # Join index with sm_index for slide microscopy details
    client.fetch_index("sm_index")
    result = client.sql_query("""
        SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
        FROM index i
        JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
        LIMIT 10
    """)

    # Join seg_index with index to find segmentations and their source images
    client.fetch_index("seg_index")
    result = client.sql_query("""
        SELECT
            s.SeriesInstanceUID as seg_series,
            s.AlgorithmName,
            s.total_segments,
            src.collection_id,
            src.Modality as source_modality,
            src.BodyPartExamined
        FROM seg_index s
        JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
        WHERE s.AlgorithmType = 'AUTOMATIC'
        LIMIT 10
    """)

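To try the join pattern without network access, the sketch below reproduces the `seg_index` → `index` join on two tiny in-memory tables. It uses Python's built-in `sqlite3` rather than whatever engine backs `client.sql_query()`, and every row and UID is an invented placeholder; only the column names are taken from the join table above.

```python
import sqlite3

# Toy stand-ins for the idc-index tables; every value here is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "index" (SeriesInstanceUID TEXT, collection_id TEXT, Modality TEXT);
    CREATE TABLE seg_index (
        SeriesInstanceUID TEXT,
        segmented_SeriesInstanceUID TEXT,
        AlgorithmName TEXT
    );
    INSERT INTO "index" VALUES ('1.2.3', 'toy_collection', 'CT');
    INSERT INTO "index" VALUES ('1.2.4', 'toy_collection', 'SEG');
    INSERT INTO seg_index VALUES ('1.2.4', '1.2.3', 'toy_algorithm');
""")

# Same shape as the seg_index -> index join: link each segmentation
# series to the image series it was derived from.
rows = conn.execute("""
    SELECT s.SeriesInstanceUID AS seg_series,
           src.SeriesInstanceUID AS source_series,
           src.Modality AS source_modality
    FROM seg_index s
    JOIN "index" src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
""").fetchall()

print(rows)
```

Note that `index` is a reserved word in SQLite, so the toy table name is quoted here; the queries in this document use it unquoted with `client.sql_query()`.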
Via SQL (recommended for filtering/aggregation):

    from idc_index import IDCClient

    client = IDCClient()

    # Query the primary index (always available)
    results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")

    # Fetch and query additional indices
    client.fetch_index("collections_index")
    collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")

    client.fetch_index("analysis_results_index")
    analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")

As pandas DataFrames (direct access):

    # Primary index (always available after client initialization)
    df = client.index

    # Fetch and access on-demand indices
    client.fetch_index("sm_index")
    sm_df = client.sm_index

The `indices_overview` dictionary contains complete schema information for all tables. Always consult this when writing queries or exploring data structure.
DICOM attribute mapping: Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or a reference to a DICOM tag). This allows leveraging DICOM knowledge when querying: standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, and `BodyPartExamined` work as expected.

    from idc_index import IDCClient

    client = IDCClient()

    # List all available indices with descriptions
    for name, info in client.indices_overview.items():
        print(f"\n{name}:")
        print(f"  Installed: {info['installed']}")
        print(f"  Description: {info['description']}")

    # Get complete schema for a specific index (columns, types, descriptions)
    schema = client.indices_overview["index"]["schema"]
    print(f"\nTable: {schema['table_description']}")
    print("\nColumns:")
    for col in schema['columns']:
        desc = col.get('description', 'No description')
        # Description indicates if column is from DICOM attribute
        print(f"  {col['name']} ({col['type']}): {desc}")

    # Find columns that are DICOM attributes (check description for "DICOM" reference)
    dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()]
    print(f"\nDICOM-sourced columns: {dicom_cols}")

Alternative: use the `get_index_schema()` method:

    schema = client.get_index_schema("index")
    # Returns same schema dict: {'table_description': ..., 'columns': [...]}

`index` Table
Most common columns for queries (use `indices_overview` for the complete list and descriptions):
| Column | Type | DICOM | Description |
|---|---|---|---|
| `collection_id` | STRING | No | IDC collection identifier |
| `analysis_result_id` | STRING | No | If applicable, indicates which analysis results collection the series is part of |
| `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) |
| `PatientID` | STRING | Yes | Patient identifier |
| `StudyInstanceUID` | STRING | Yes | DICOM Study UID |
| `SeriesInstanceUID` | STRING | Yes | DICOM Series UID; use for downloads/viewing |
| `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, etc.) |
| `BodyPartExamined` | STRING | Yes | Anatomical region |
| `SeriesDescription` | STRING | Yes | Description of the series |
| `Manufacturer` | STRING | Yes | Equipment manufacturer |
| `StudyDate` | STRING | Yes | Date study was performed |
| `PatientSex` | STRING | Yes | Patient sex |
| `PatientAge` | STRING | Yes | Patient age at time of study |
| `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) |
| `series_size_MB` | FLOAT | No | Size of series in megabytes |
| `instanceCount` | INTEGER | No | Number of DICOM instances in series |
DICOM = Yes: Column value extracted from the DICOM attribute with the same name. Refer to the DICOM standard for numeric tag mappings. Use standard DICOM knowledge for expected values and formats.
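As one example of applying DICOM knowledge to a column value, the sketch below converts a DICOM Age String (the AS value representation: three digits followed by `D`, `W`, `M`, or `Y`) to approximate years. It assumes `PatientAge` holds raw AS values; that is an assumption to verify against the actual column contents, and `age_in_years` is a hypothetical helper, not part of `idc-index`.

```python
import re
from typing import Optional

# DICOM Age String (AS): three digits followed by D, W, M, or Y.
_AGE_PATTERN = re.compile(r"^(\d{3})([DWMY])$")
# Approximate day counts per unit (average month and year lengths).
_DAYS_PER_UNIT = {"D": 1.0, "W": 7.0, "M": 30.4375, "Y": 365.25}


def age_in_years(age_string: Optional[str]) -> Optional[float]:
    """Convert a DICOM AS value such as '045Y' to approximate years."""
    if not age_string:
        return None
    match = _AGE_PATTERN.match(age_string.strip())
    if match is None:
        return None  # Unexpected format: let the caller decide what to do
    count, unit = int(match.group(1)), match.group(2)
    return count * _DAYS_PER_UNIT[unit] / 365.25


print(age_in_years("045Y"))
```

Returning `None` for empty or malformed values keeps the helper safe to apply across a whole DataFrame column where some entries are missing.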

    # Fetch clinical index (also downloads clinical data tables)
    client.fetch_index("clinical_index")

    # Query clinical index to find available tables and their columns
    tables = client.sql_query("SELECT DISTINCT table_name, column_label FROM clinical_index")

    # Load a specific clinical table as DataFrame
    clinical_df = client.get_clinical_table("table_name")

See `references/clinical_data_guide.md` for detailed workflows including value mapping patterns and joining clinical data with imaging.
| Method | Auth Required | Best For |
|---|---|---|
| idc-index | No | Key queries and downloads (recommended) |
| IDC Portal | No | Interactive exploration, manual selection, browser-based download |
| BigQuery | Yes (GCP account) | Complex queries, full DICOM metadata |
| DICOMweb proxy | No | Tool integration via DICOMweb API |
| Cloud storage (S3/GCS) | No | Direct file access, bulk downloads, custom pipelines |
Cloud storage organization
IDC maintains all DICOM files in public cloud storage buckets mirrored between AWS S3 and Google Cloud Storage. Files are organized by CRDC UUIDs (not DICOM UIDs) to support versioning.
| Bucket (AWS / GCS) | License | Content |
|---|---|---|
/ | No commercial restriction | >90% of IDC data |
/ | No commercial restriction | Collections with potential head scans |
/ | Commercial use restricted (CC BY-NC) | ~4% of data |
Files are stored as `<crdc_series_uuid>/<crdc_instance_uuid>.dcm`. Access is free (no egress fees) via AWS CLI, gsutil, or s5cmd with anonymous access. Use the `series_aws_url` column from the index for S3 URLs; GCS uses the same path structure.
See `references/cloud_storage_guide.md` for bucket details, access commands, UUID mapping, and versioning.
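The path layout described above can be sketched as a small URL builder. `instance_url` is a hypothetical helper written for illustration, the instance UUID below is a placeholder, and the bucket name is taken from the manifest example in this document; for real data, prefer the `series_aws_url` column over hand-built paths.

```python
def instance_url(bucket: str, crdc_series_uuid: str, crdc_instance_uuid: str,
                 scheme: str = "s3") -> str:
    """Build an object URL following <crdc_series_uuid>/<crdc_instance_uuid>.dcm."""
    if scheme not in ("s3", "gs"):
        raise ValueError(f"unsupported scheme: {scheme}")
    return f"{scheme}://{bucket}/{crdc_series_uuid}/{crdc_instance_uuid}.dcm"


# Series UUID copied from the manifest example; instance UUID is a placeholder.
url = instance_url(
    bucket="idc-open-data",
    crdc_series_uuid="cb09464a-c5cc-4428-9339-d7fa87cfe837",
    crdc_instance_uuid="00000000-0000-0000-0000-000000000000",
)
print(url)
```

The same function covers GCS by passing `scheme="gs"`, provided the GCS bucket name is supplied; bucket names can differ between the two providers.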
DICOMweb access
IDC data is available via DICOMweb interface (Google Cloud Healthcare API implementation) for integration with PACS systems and DICOMweb-compatible tools.
| Endpoint | Auth | Use Case |
|---|---|---|
| Public proxy | No | Testing, moderate queries, daily quota |
| Google Healthcare | Yes (GCP) | Production use, higher quotas |
See `references/dicomweb_guide.md` for endpoint URLs, code examples, supported operations, and implementation details.
Required (for basic access):
pip install --upgrade idc-index
Important: A new IDC data release always triggers a new version of `idc-index`. Always use the `--upgrade` flag when installing, unless an older version is needed for reproducibility.
Tested with: idc-index 0.11.7 (IDC data version v23)
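Because the package version tracks the IDC data release, it can help to record which version is installed alongside any results. A minimal sketch using only the standard library follows; `installed_version` is a hypothetical helper name, and it returns `None` when the package is not installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "idc-index") -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(f"idc-index version: {installed_version()}")
```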
Optional (for data analysis):
pip install pandas numpy pydicom
Discover what imaging collections and data are available in IDC:

    from idc_index import IDCClient

    client = IDCClient()

    # Get summary statistics from primary index
    query = """
    SELECT
        collection_id,
        COUNT(DISTINCT PatientID) as patients,
        COUNT(DISTINCT SeriesInstanceUID) as series,
        SUM(series_size_MB) as size_mb
    FROM index
    GROUP BY collection_id
    ORDER BY patients DESC
    """
    collections_summary = client.sql_query(query)

    # For richer collection metadata, use collections_index
    client.fetch_index("collections_index")
    collections_info = client.sql_query("""
        SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData
        FROM collections_index
    """)

    # For analysis results (annotations, segmentations), use analysis_results_index
    client.fetch_index("analysis_results_index")
    analysis_info = client.sql_query("""
        SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities
        FROM analysis_results_index
    """)

`collections_index` provides curated metadata per collection: cancer types, tumor locations, species, subject counts, and supporting data types, without needing to aggregate from the primary index.
`analysis_results_index` lists derived datasets (AI segmentations, expert annotations, radiomics features) with their source collections and modalities.
Query the IDC mini-index using SQL to find specific datasets.
First, explore available values for filter columns:

    from idc_index import IDCClient

    client = IDCClient()

    # Check what Modality values exist
    modalities = client.sql_query("""
        SELECT DISTINCT Modality, COUNT(*) as series_count
        FROM index
        GROUP BY Modality
        ORDER BY series_count DESC
    """)
    print(modalities)

    # Check what BodyPartExamined values exist for MR modality
    body_parts = client.sql_query("""
        SELECT DISTINCT BodyPartExamined, COUNT(*) as series_count
        FROM index
        WHERE Modality = 'MR' AND BodyPartExamined IS NOT NULL
        GROUP BY BodyPartExamined
        ORDER BY series_count DESC
        LIMIT 20
    """)
    print(body_parts)

Then query with validated filter values:

    # Find breast MRI scans (use actual values from exploration above)
    results = client.sql_query("""
        SELECT collection_id, PatientID, SeriesInstanceUID, Modality, SeriesDescription, license_short_name
        FROM index
        WHERE Modality = 'MR' AND BodyPartExamined = 'BREAST'
        LIMIT 20
    """)

    # Access results as pandas DataFrame
    for idx, row in results.iterrows():
        print(f"Patient: {row['PatientID']}, Series: {row['SeriesInstanceUID']}")

To filter by cancer type, join with `collections_index`:

    client.fetch_index("collections_index")
    results = client.sql_query("""
        SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality
        FROM index i
        JOIN collections_index c ON i.collection_id = c.collection_id
        WHERE c.CancerTypes LIKE '%Breast%' AND i.Modality = 'MR'
        LIMIT 20
    """)

Available metadata fields (use `client.indices_overview` for complete list):
Note: Cancer type is in `collections_index.CancerTypes`, not in the primary `index` table.
Download imaging data efficiently from IDC's cloud storage:
Download entire collection:

    from idc_index import IDCClient

    client = IDCClient()

    # Download small collection (RIDER Pilot ~1GB)
    client.download_from_selection(
        collection_id="rider_pilot",
        downloadDir="./data/rider"
    )

Download specific series:

    # First, query for series UIDs
    series_df = client.sql_query("""
        SELECT SeriesInstanceUID
        FROM index
        WHERE Modality = 'CT'
          AND BodyPartExamined = 'CHEST'
          AND collection_id = 'nlst'
        LIMIT 5
    """)

    # Download only those series
    client.download_from_selection(
        seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
        downloadDir="./data/lung_ct"
    )

Custom directory structure:
Default `dirTemplate`: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`

    # Simplified hierarchy (omit StudyInstanceUID level)
    client.download_from_selection(
        collection_id="tcga_luad",
        downloadDir="./data",
        dirTemplate="%collection_id/%PatientID/%Modality"
    )
    # Results in: ./data/tcga_luad/TCGA-05-4244/CT/

    # Flat structure (all files in one directory)
    client.download_from_selection(
        seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
        downloadDir="./data/flat",
        dirTemplate=""
    )
    # Results in: ./data/flat/*.dcm

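To reason about what a given template will produce before downloading, the placeholder expansion can be pictured with the sketch below. This is an illustration of the `%name` convention only, not the library's implementation; `expand_dir_template` is a hypothetical helper and the study and series UIDs in the second example are placeholders.

```python
import re


def expand_dir_template(template: str, attributes: dict) -> str:
    """Replace each %name placeholder with the matching attribute value."""
    def substitute(match):
        name = match.group(1)
        if name not in attributes:
            raise KeyError(f"no value for placeholder %{name}")
        return str(attributes[name])

    # Names may contain inner underscores (collection_id) but a trailing
    # underscore is treated as a literal separator (%Modality_%SeriesInstanceUID).
    return re.sub(r"%([A-Za-z]+(?:_[A-Za-z]+)*)", substitute, template)


series = {
    "collection_id": "tcga_luad",
    "PatientID": "TCGA-05-4244",
    "StudyInstanceUID": "1.2",       # placeholder UID
    "Modality": "CT",
    "SeriesInstanceUID": "1.2.3",    # placeholder UID
}
simplified = expand_dir_template("%collection_id/%PatientID/%Modality", series)
default = expand_dir_template(
    "%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID", series
)
print(simplified)
print(default)
```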
The `idc download` command provides command-line access to download functionality without writing Python code. Available after installing `idc-index`.
Auto-detects input type: manifest file path, or identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid).

    # Download entire collection
    idc download rider_pilot --download-dir ./data

    # Download specific series by UID
    idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data

    # Download multiple items (comma-separated)
    idc download "tcga_luad,tcga_lusc" --download-dir ./data

    # Download from manifest file (auto-detected)
    idc download manifest.txt --download-dir ./data

Options:
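| Option | Description |
|---|---|
| `--download-dir` | Output directory (default: current directory) |
| `--dir-template` | Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`) |
| `--log-level` | Verbosity: debug, info, warning, error, critical |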
Manifest files:
Manifest files contain S3 URLs (one per line) and can be:
Format (one S3 URL per line):

    s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
    s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*

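A manifest in this format is easy to sanity-check before downloading. The sketch below splits each line into its bucket and series UUID using the standard library; `parse_manifest_line` is a hypothetical helper, and it assumes every line has the `s3://<bucket>/<series-uuid>/*` shape shown above.

```python
from urllib.parse import urlparse


def parse_manifest_line(line: str):
    """Split 's3://bucket/<series-uuid>/*' into (bucket, series_uuid)."""
    parsed = urlparse(line.strip())
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {line!r}")
    series_uuid = parsed.path.strip("/").split("/")[0]
    return parsed.netloc, series_uuid


manifest_lines = [
    "s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*",
    "s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*",
]
entries = [parse_manifest_line(line) for line in manifest_lines]
print(entries)
```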
Example: Generate manifest from Python query:

    from idc_index import IDCClient

    client = IDCClient()

    # Query for series URLs
    results = client.sql_query("""
        SELECT series_aws_url
        FROM index
        WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
    """)

    # Save as manifest file
    with open('ct_manifest.txt', 'w') as f:
        for url in results['series_aws_url']:
            f.write(url + '\n')

Then download:

    idc download ct_manifest.txt --download-dir ./ct_data

View DICOM data in browser without downloading:

    from idc_index import IDCClient
    import webbrowser

    client = IDCClient()

    # First query to get valid UIDs
    results = client.sql_query("""
        SELECT SeriesInstanceUID, StudyInstanceUID
        FROM index
        WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
        LIMIT 1
    """)

    # View single series
    viewer_url = client.get_viewer_URL(seriesInstanceUID=results.iloc[0]['SeriesInstanceUID'])
    webbrowser.open(viewer_url)

    # View all series in a study (useful for multi-series exams like MRI protocols)
    viewer_url = client.get_viewer_URL(studyInstanceUID=results.iloc[0]['StudyInstanceUID'])
    webbrowser.open(viewer_url)

The method automatically selects OHIF v3 for radiology or SLIM for slide microscopy. Viewing by study is useful when a DICOM Study contains multiple Series (e.g., T1, T2, DWI sequences from a single MRI session).
Check data licensing before use (critical for commercial applications):

    from idc_index import IDCClient

    client = IDCClient()

    # Check licenses for all collections
    query = """
    SELECT DISTINCT
        collection_id,
        license_short_name,
        COUNT(DISTINCT SeriesInstanceUID) as series_count
    FROM index
    GROUP BY collection_id, license_short_name
    ORDER BY collection_id
    """
    licenses = client.sql_query(query)
    print(licenses)

License types in IDC:
Important: Always check the license before using IDC data in publications or commercial applications. Each DICOM file is tagged with its specific license in metadata.
The `source_DOI` column contains DOIs linking to publications describing how the data was generated. To satisfy attribution requirements, use `citations_from_selection()` to generate properly formatted citations:

    from idc_index import IDCClient

    client = IDCClient()

    # Get citations for a collection (APA format by default)
    citations = client.citations_from_selection(collection_id="rider_pilot")
    for citation in citations:
        print(citation)

    # Get citations for specific series
    results = client.sql_query("""
        SELECT SeriesInstanceUID FROM index
        WHERE collection_id = 'tcga_luad' LIMIT 5
    """)
    citations = client.citations_from_selection(
        seriesInstanceUID=list(results['SeriesInstanceUID'].values)
    )

    # Alternative format: BibTeX (for LaTeX documents)
    bibtex_citations = client.citations_from_selection(
        collection_id="tcga_luad",
        citation_format=IDCClient.CITATION_FORMAT_BIBTEX
    )

Parameters:
- `collection_id`: Filter by collection(s)
- `patientId`: Filter by patient ID(s)
- `studyInstanceUID`: Filter by study UID(s)
- `seriesInstanceUID`: Filter by series UID(s)
- `citation_format`: Use `IDCClient.CITATION_FORMAT_*` constants:
  - `CITATION_FORMAT_APA` (default) - APA style
  - `CITATION_FORMAT_BIBTEX` - BibTeX for LaTeX
  - `CITATION_FORMAT_JSON` - CSL JSON
  - `CITATION_FORMAT_TURTLE` - RDF Turtle

Best practice: When publishing results using IDC data, include the generated citations to properly attribute the data sources and satisfy license requirements.
Process large datasets efficiently with filtering:

    from idc_index import IDCClient
    import pandas as pd

    client = IDCClient()

    # Find chest CT scans from GE scanners
    query = """
    SELECT
        SeriesInstanceUID,
        PatientID,
        collection_id,
        ManufacturerModelName
    FROM index
    WHERE Modality = 'CT'
      AND BodyPartExamined = 'CHEST'
      AND Manufacturer = 'GE MEDICAL SYSTEMS'
      AND license_short_name = 'CC BY 4.0'
    LIMIT 100
    """
    results = client.sql_query(query)

    # Save manifest for later
    results.to_csv('lung_ct_manifest.csv', index=False)

    # Download in batches to avoid timeout
    batch_size = 10
    for i in range(0, len(results), batch_size):
        batch = results.iloc[i:i+batch_size]
        client.download_from_selection(
            seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
            downloadDir=f"./data/batch_{i//batch_size}"
        )

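The batching arithmetic in the loop above can be checked in isolation. The sketch below uses placeholder identifiers instead of real series UIDs and only plans the batches (directory name and size) without downloading anything; `batches` is a hypothetical helper.

```python
def batches(items, batch_size):
    """Yield (batch_number, chunk) pairs covering items in order."""
    for start in range(0, len(items), batch_size):
        yield start // batch_size, items[start:start + batch_size]


uids = [f"uid-{n}" for n in range(25)]  # placeholder identifiers
plan = [(f"./data/batch_{number}", len(chunk)) for number, chunk in batches(uids, 10)]
print(plan)
```

With 25 items and a batch size of 10, the last batch is shorter than the others, which is the case worth confirming before starting a long download.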
For queries requiring full DICOM metadata, complex JOINs, clinical data tables, or private DICOM elements, use Google BigQuery. Requires GCP account with billing enabled.
Quick reference:
- `bigquery-public-data.idc_current.*`
- `dicom_all` (combined metadata)
- `dicom_metadata` (all DICOM tags)
- `OtherElements` column (vendor-specific tags like diffusion b-values)

See `references/bigquery_guide.md` for setup, table schemas, query patterns, private element access, and cost optimization.
| Task | Tool | Reference |
|---|---|---|
| Programmatic queries & downloads | idc-index | This document |
| Interactive exploration | IDC Portal | https://portal.imaging.datacommons.cancer.gov/ |
| Complex metadata queries | BigQuery | `references/bigquery_guide.md` |
| 3D visualization & analysis | SlicerIDCBrowser | https://github.com/ImagingDataCommons/SlicerIDCBrowser |
Default choice: Use `idc-index` for most tasks (no auth, easy API, batch downloads).
Integrate IDC data into imaging analysis workflows:
Read downloaded DICOM files:

    import pydicom
    import os

    # Read DICOM files from downloaded series
    series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."
    dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
                   if f.endswith('.dcm')]

    # Load first image
    ds = pydicom.dcmread(dicom_files[0])
    print(f"Patient ID: {ds.PatientID}")
    print(f"Modality: {ds.Modality}")
    print(f"Image shape: {ds.pixel_array.shape}")

Build 3D volume from CT series:

    import pydicom
    import numpy as np
    from pathlib import Path

    def load_ct_series(series_path):
        """Load CT series as 3D numpy array"""
        files = sorted(Path(series_path).glob('*.dcm'))
        slices = [pydicom.dcmread(str(f)) for f in files]

        # Sort by slice location
        slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))

        # Stack into 3D array
        volume = np.stack([s.pixel_array for s in slices])

        return volume, slices[0]  # Return volume and first slice for metadata

    volume, metadata = load_ct_series("./data/lung_ct/series_dir")
    print(f"Volume shape: {volume.shape}")  # (z, y, x)

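Stacking slices into a volume implicitly assumes they are evenly spaced along the slice axis. A short check on the sorted z positions (the third component of `ImagePositionPatient`) can flag missing or duplicated slices before further processing. The sketch below is standalone and uses made-up positions; `slice_spacing` is a hypothetical helper, and it assumes an axial acquisition where z is the slice axis.

```python
def slice_spacing(z_positions, tolerance=1e-3):
    """Return the uniform spacing between sorted slices, or None if irregular."""
    ordered = sorted(z_positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return None  # fewer than two slices, spacing is undefined
    first = gaps[0]
    if any(abs(gap - first) > tolerance for gap in gaps):
        return None  # gaps differ: a slice is likely missing or duplicated
    return first


regular = slice_spacing([5.0, 0.0, 2.5, 7.5])   # made-up z positions in mm
irregular = slice_spacing([0.0, 2.5, 7.5])      # one slice missing
print(regular, irregular)
```

In the loader above, the input would be `[float(s.ImagePositionPatient[2]) for s in slices]`.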
Integrate with SimpleITK:

    import SimpleITK as sitk

    # Read DICOM series
    series_path = "./data/ct_series"
    reader = sitk.ImageSeriesReader()
    dicom_names = reader.GetGDCMSeriesFileNames(series_path)
    reader.SetFileNames(dicom_names)
    image = reader.Execute()

    # CurvatureFlow expects a floating-point image; CT is usually stored as integers
    image = sitk.Cast(image, sitk.sitkFloat32)

    # Apply processing
    smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5)

    # Save as NIfTI
    sitk.WriteImage(smoothed, "processed_volume.nii.gz")

Objective: Build training dataset of lung CT scans from NLST collection
Steps:

    from idc_index import IDCClient

    client = IDCClient()

    # 1. Query for lung CT scans with specific criteria
    query = """
    SELECT
        PatientID,
        SeriesInstanceUID,
        SeriesDescription
    FROM index
    WHERE collection_id = 'nlst'
      AND Modality = 'CT'
      AND BodyPartExamined = 'CHEST'
      AND license_short_name = 'CC BY 4.0'
    ORDER BY PatientID
    LIMIT 100
    """
    results = client.sql_query(query)
    print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")

    # 2. Download data organized by patient
    client.download_from_selection(
        seriesInstanceUID=list(results['SeriesInstanceUID'].values),
        downloadDir="./training_data",
        dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
    )

    # 3. Save manifest for reproducibility
    results.to_csv('training_manifest.csv', index=False)

Objective: Compare image quality across different MRI scanner manufacturers
Steps:

    from idc_index import IDCClient
    import pandas as pd

    client = IDCClient()

    # Query for brain MRI grouped by manufacturer
    query = """
    SELECT
        Manufacturer,
        ManufacturerModelName,
        COUNT(DISTINCT SeriesInstanceUID) as num_series,
        COUNT(DISTINCT PatientID) as num_patients
    FROM index
    WHERE Modality = 'MR'
      AND BodyPartExamined LIKE '%BRAIN%'
    GROUP BY Manufacturer, ManufacturerModelName
    HAVING num_series >= 10
    ORDER BY num_series DESC
    """
    manufacturers = client.sql_query(query)
    print(manufacturers)

    # Download sample from each manufacturer for comparison
    for _, row in manufacturers.head(3).iterrows():
        mfr = row['Manufacturer']
        model = row['ManufacturerModelName']

        query = f"""
        SELECT SeriesInstanceUID
        FROM index
        WHERE Manufacturer = '{mfr}'
          AND ManufacturerModelName = '{model}'
          AND Modality = 'MR'
          AND BodyPartExamined LIKE '%BRAIN%'
        LIMIT 5
        """

        series = client.sql_query(query)
        client.download_from_selection(
            seriesInstanceUID=list(series['SeriesInstanceUID'].values),
            downloadDir=f"./quality_study/{mfr.replace(' ', '_')}"
        )

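One caveat with building SQL through f-strings, as in the loop above: a manufacturer or model name containing a single quote would break the query. A minimal sketch of the standard SQL escaping rule (double each single quote inside a string literal) follows; `sql_literal` is a hypothetical helper and the manufacturer name is invented.

```python
def sql_literal(value: str) -> str:
    """Quote a string for use as a SQL literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


manufacturer = "O'Example Imaging"  # invented name containing a quote
query = (
    "SELECT SeriesInstanceUID FROM index "
    f"WHERE Manufacturer = {sql_literal(manufacturer)}"
)
print(query)
```

With this helper the f-string would use `{sql_literal(mfr)}` in place of `'{mfr}'`, so the surrounding quotes come from the helper rather than the template.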
Objective: Preview imaging data before committing to download

    from idc_index import IDCClient
    import webbrowser

    client = IDCClient()

    series_list = client.sql_query("""
        SELECT SeriesInstanceUID, PatientID, SeriesDescription
        FROM index
        WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
        LIMIT 10
    """)

    # Preview each in browser
    for _, row in series_list.iterrows():
        viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
        print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
        print(f"  View at: {viewer_url}")
        # webbrowser.open(viewer_url)  # Uncomment to open automatically

For additional visualization options, see the IDC Portal getting started guide or SlicerIDCBrowser for 3D Slicer integration.
Objective: Download only CC-BY licensed data suitable for commercial applications
Steps:

    from idc_index import IDCClient

    client = IDCClient()

    # Query ONLY for CC BY licensed data (allows commercial use with attribution)
    query = """
    SELECT
        SeriesInstanceUID,
        collection_id,
        PatientID,
        Modality
    FROM index
    WHERE license_short_name LIKE 'CC BY%'
      AND license_short_name NOT LIKE '%NC%'
      AND Modality IN ('CT', 'MR')
      AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
    LIMIT 200
    """
    cc_by_data = client.sql_query(query)

    print(f"Found {len(cc_by_data)} CC BY licensed series")
    print(f"Collections: {cc_by_data['collection_id'].unique()}")

    # Download with license verification
    client.download_from_selection(
        seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
        downloadDir="./commercial_dataset",
        dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
    )

    # Save license information
    cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)

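The two `LIKE` conditions in the query above can be mirrored in plain Python, which is handy for re-checking a saved manifest or a DataFrame column after the fact. This is a sketch of the same string rule only, not legal guidance; `allows_commercial_use` is a hypothetical helper, and the license names are those mentioned in this document plus one illustrative variant.

```python
def allows_commercial_use(license_short_name: str) -> bool:
    """Mirror the SQL filter: starts with 'CC BY' and does not contain 'NC'."""
    return license_short_name.startswith("CC BY") and "NC" not in license_short_name


for name in ["CC BY 4.0", "CC BY-NC 4.0", "CC BY 3.0"]:
    print(f"{name}: {allows_commercial_use(name)}")
```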
- Check the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
- Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications
- Use a `LIMIT` clause when exploring to avoid long downloads and understand data structure
- Directory template: `%collection_id/%PatientID/%Modality`

Issue: ModuleNotFoundError: No module named 'idc_index'
- `pip install --upgrade idc-index`

Issue: Download fails with connection timeout
- Use `dirTemplate` to organize downloads by batch

Issue: BigQuery quota exceeded or billing errors
- See `references/bigquery_guide.md` for cost optimization tips

Issue: Series UID not found or no data returned
- Use `LIMIT 5` to test the query first

Issue: Downloaded DICOM files won't open
- `pydicom.dcmread(file, force=True)`

Quick reference for common queries. For detailed examples with context, see the Core Capabilities section above.

    # What modalities exist?
    client.sql_query("SELECT DISTINCT Modality FROM index")

    # What body parts for a specific modality?
    client.sql_query("""
        SELECT DISTINCT BodyPartExamined, COUNT(*) as n
        FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
        GROUP BY BodyPartExamined ORDER BY n DESC
    """)

    # What manufacturers for MR?
    client.sql_query("""
        SELECT DISTINCT Manufacturer, COUNT(*) as n
        FROM index WHERE Modality = 'MR'
        GROUP BY Manufacturer ORDER BY n DESC
    """)

Note: Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.

    # Find ALL segmentations and structure sets by DICOM Modality
    # SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
    client.sql_query("""
        SELECT collection_id, Modality, COUNT(*) as series_count
        FROM index
        WHERE Modality IN ('SEG', 'RTSTRUCT')
        GROUP BY collection_id, Modality
        ORDER BY series_count DESC
    """)

    # Find segmentations for a specific collection (includes non-analysis-result items)
    client.sql_query("""
        SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
        FROM index
        WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
    """)

    # List analysis result collections (curated derived datasets)
    client.fetch_index("analysis_results_index")
    client.sql_query("""
        SELECT analysis_result_id, analysis_result_title, Collections, Modalities
        FROM analysis_results_index
    """)

    # Find analysis results for a specific source collection
    client.sql_query("""
        SELECT analysis_result_id, analysis_result_title
        FROM analysis_results_index
        WHERE Collections LIKE '%tcga_luad%'
    """)

    # Use seg_index for detailed DICOM Segmentation metadata
    client.fetch_index("seg_index")

    # Get segmentation statistics by algorithm
    client.sql_query("""
        SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
        FROM seg_index
        WHERE AlgorithmName IS NOT NULL
        GROUP BY AlgorithmName, AlgorithmType
        ORDER BY seg_count DESC
        LIMIT 10
    """)

    # Find segmentations for specific source images (e.g., chest CT)
    client.sql_query("""
        SELECT
            s.SeriesInstanceUID as seg_series,
            s.AlgorithmName,
            s.total_segments,
            s.segmented_SeriesInstanceUID as source_series
        FROM seg_index s
        JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
        WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
        LIMIT 10
    """)

    # Find TotalSegmentator results with source image context
    client.sql_query("""
        SELECT
            seg_info.collection_id,
            COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
            SUM(s.total_segments) as total_segments
        FROM seg_index s
        JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
        WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
        GROUP BY seg_info.collection_id
        ORDER BY seg_count DESC
    """)


    # sm_index has detailed metadata; join with index for collection_id
    client.fetch_index("sm_index")
    client.sql_query("""
        SELECT i.collection_id, COUNT(*) as slides,
               MIN(s.min_PixelSpacing_2sf) as min_resolution
        FROM sm_index s
        JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
        GROUP BY i.collection_id
        ORDER BY slides DESC
    """)


    # Size for specific criteria
    client.sql_query("""
        SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
        FROM index
        WHERE collection_id = 'nlst' AND Modality = 'CT'
    """)


    client.fetch_index("clinical_index")

    # Find collections with clinical data and their tables
    client.sql_query("""
        SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
        FROM clinical_index
        GROUP BY collection_id, table_name
        ORDER BY collection_id
    """)

See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection.
The following skills complement IDC workflows for downstream analysis and visualization:
Always use `client.indices_overview` for current column schemas. This ensures accuracy with the installed idc-index version:

    # Get all column names and types for any table
    schema = client.indices_overview["index"]["schema"]
    columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['columns']]

idc download, idc download-from-manifest, idc download-from-selection)
This skill version is available in skill metadata. To check for updates: