CNVfinder’s code documentation

cnvdetector

class cnvfinder.cnvdetector.CNVTest(nrrtest: cnvfinder.nrrhandler.nrrhandler.NRRTest = None, vcftest: cnvfinder.vcfhandler.vcfhandler.VCFTest = None, path: str = 'results')[source]

Detect and call CNVs based on read depth and/or variant info

Parameters:
  • nrrtest (NRRTest) – read depth data. When it’s None, analysis employing “read depth” won’t be performed
  • vcftest (VCFTest) – variant data. When it’s None, analysis employing variant (B-allele frequency) data won’t be performed
  • path (str) – path to output directory
analyze_plot()[source]

Detect CNVs applying ratio and/or BAF data. Output (list and plots) will be saved at self.path

create_ideogram()[source]

Create ideograms for the current test

save()[source]

Save list of detected CNVs

summary()[source]

Write a summary of the tests at self.path/summary.txt
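
A minimal sketch of the CNVTest workflow described above. It assumes an NRRTest and a VCFTest have already been built (see the nrrhandler and vcfhandler sections below); the variable names are illustrative.

    from cnvfinder.cnvdetector import CNVTest

    # 'nrrtest' (NRRTest) and 'vcftest' (VCFTest) are assumed to have been
    # built beforehand, as described in the nrrhandler and vcfhandler sections
    cnvtest = CNVTest(nrrtest=nrrtest, vcftest=vcftest, path='results')

    cnvtest.analyze_plot()     # detect CNVs; list and plots are saved under 'results'
    cnvtest.save()             # save the list of detected CNVs
    cnvtest.summary()          # write results/summary.txt
    cnvtest.create_ideogram()  # draw ideograms for the current test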

class cnvfinder.cnvdetector.CNVConfig(filename: str, ratio: bool = True, variant: bool = True)[source]

A wrapper for CNVTest. This class loads test parameters from a configuration file

Parameters:
  • filename (str) – path to configuration file
  • ratio (bool) – specify the usage of read depth data
  • variant (bool) – specify the usage of variant data
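
A minimal sketch of driving an analysis from a configuration file through CNVConfig; the configuration file name is illustrative.

    from cnvfinder.cnvdetector import CNVConfig

    # load test parameters from a configuration file (illustrative name) and
    # enable both read depth (ratio) and variant (BAF) based analyses
    config = CNVConfig('cnvfinder.cfg', ratio=True, variant=True)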

nrrhandler

class cnvfinder.nrrhandler.NRR(bedfile: str = None, bamfile: str = None, region: str = None, counters: list = [], bed: Union[cnvfinder.bedloader.bedloader.ROI, cnvfinder.tsvparser.tsvparser.CoverageFileParser] = None, parallel: bool = True, to_label: bool = False, covfile: str = None)[source]

NRR stands for “Number of Reads in Region” loaded from a BAM file

Parameters:
  • bedfile (str) – path to the BED file in which amplicons are listed
  • bamfile (str) – path to alignment file (bam format)
  • region (str) – limit target definition to a given region. It should be in the form: chr1:10000-90000
  • counters (list) – list of read depth counters
  • bed (ROI) – amplicons already loaded in memory
  • parallel (bool) – whether to count target read depth in parallel
  • to_label (bool) – whether to label each target according to its read depth relative to the mean
  • covfile (str) – path to amplicon.cov file
count(cores: int = None, parallel: bool = False)[source]

For each defined target, count read depth

Parameters:
  • cores (int) – number of cores to be used when counting in parallel mode. Default: all available
  • parallel (bool) – whether to count read depth in parallel
count_label_by_pool() → pandas.core.frame.DataFrame[source]

Count the number of targets arranged by label in each pool

Returns:dataframe with the number of targets per label in each pool (pools x labels)
load(filename: str) → Optional[int][source]

Load a single NRR from a text file

Parameters:filename (str) – path to count file. Normally something like: bamfile.bam.txt
Returns:1 when data loading is successful, None otherwise
save(filename: str = None)[source]

Save a single NRR on a text file

Parameters:filename (str) – path to output
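
A sketch of building an NRR from a BED/BAM pair, counting read depth, and persisting the counts; file names are illustrative.

    from cnvfinder.nrrhandler import NRR

    # define targets from a BED file and attach the alignment (BAM) file
    nrr = NRR(bedfile='amplicons.bed', bamfile='sample.bam', parallel=True)

    # count read depth for every defined target, using 4 cores
    nrr.count(cores=4, parallel=True)

    # persist the counts; load() expects names like 'sample.bam.txt'
    nrr.save('sample.bam.txt')

    # later, reload the counts instead of recounting (returns 1 on success)
    nrr.load('sample.bam.txt')
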
class cnvfinder.nrrhandler.NRRList(bedfile: str = None, bamfiles: list = None, region: str = None, bed: cnvfinder.bedloader.bedloader.ROI = None, parallel: bool = True, to_classify: bool = False, covfiles: list = None)[source]

NRR object list management

Parameters:
  • bedfile (str) – path to the BED file in which amplicons are listed
  • bamfiles (list) – list of paths to alignment (BAM) files
  • region (str) – limit target definition to a given region. It should be in the form: chr1:10000-90000
  • bed (ROI) – amplicons already loaded into memory
  • parallel (bool) – whether to count read depth of the defined targets in parallel
  • to_classify (bool) – whether to classify the defined targets according to their read depth
  • covfiles (list) – list of paths to amplicon.cov files
compute_metrics()[source]

Compute baseline read depth median and IQR for each defined target

make_mean()[source]

Compute baseline read depth mean for each defined target
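
A sketch of assembling a baseline from several control BAM files and computing the per-target baseline statistics; paths are illustrative.

    from cnvfinder.nrrhandler import NRRList

    # baseline built from several control BAM files (illustrative paths)
    baseline = NRRList(bedfile='amplicons.bed',
                       bamfiles=['control1.bam', 'control2.bam', 'control3.bam'],
                       parallel=True)

    baseline.compute_metrics()  # per-target baseline median and IQR
    baseline.make_mean()        # per-target baseline mean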

class cnvfinder.nrrhandler.NRRTest(baseline: cnvfinder.nrrhandler.nrrhandler.NRRList, sample: cnvfinder.nrrhandler.nrrhandler.NRR, path: str = 'results', size: int = 200, step: int = 10, metric: str = 'IQR', interval_range: float = 1.5, minread: int = 25, below_cutoff: float = 0.7, above_cutoff: float = 1.3, maxdist: int = 15000000, cnv_like_range: float = 0.7, bins=500, method='chr_group')[source]

Hold information about tests between an NRR baseline (NRRList) and an NRR test sample

Parameters:
  • baseline (NRRList) – represents bamfiles of the baseline
  • sample (NRR) – represents the bamfile of a test sample
  • path (str) – output directory path
  • size (int) – size of the sliding-window block
  • step (int) – step size of the sliding window
  • metric (str) – which metric to use: ‘std’ or ‘IQR’
  • interval_range (float) – value to multiply metric by
  • minread (int) – minimum number of reads used to filter targets
  • below_cutoff (float) – filter out data (ratios) below this cutoff
  • above_cutoff (float) – filter out data (ratios) above this cutoff
  • maxdist (int) – maximum distance allowed between a CNV-like block and its closest CNV block for the former to be called a CNV as well
  • cnv_like_range (float) – value to multiply interval_range by in order to detect CNV-like blocks
  • bins (int) – number of bins to use when plotting ratio data
  • method (str) – method used to group ratios when plotting
filter(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Filter dataframe by mean >= minread

Parameters:df (DataFrame) – dataframe to be filtered
Returns:filtered dataframe
make_ratio()[source]

Compute ratio between the standardized read depth values from a test sample and the mean baseline read depth

merge(potential_cnvs: list) → list[source]

Merge CNV blocks

Parameters:potential_cnvs (list) – list of potential CNV blocks
Returns:list of CNV blocks merged
plot()[source]

Plot all defined target ratios

summary(npools: int = 12) → pandas.core.frame.DataFrame[source]

Create a summary of NRRTest samples (baseline and test)

Parameters:npools (int) – number of pools
Returns:summary as a dataframe
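
A sketch of a read depth test that reuses the baseline and sample objects from the sketches above; the remaining parameters shown are the documented defaults.

    from cnvfinder.nrrhandler import NRRTest

    # 'baseline' (NRRList) and 'sample' (NRR) are assumed to come from the
    # NRR and NRRList sketches above
    nrrtest = NRRTest(baseline, sample, path='results',
                      size=200, step=10, metric='IQR', interval_range=1.5)

    nrrtest.make_ratio()                     # sample vs. baseline read depth ratios
    nrrtest.plot()                           # plot ratios for all defined targets
    summary_df = nrrtest.summary(npools=12)  # summary of baseline and test samples
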
class cnvfinder.nrrhandler.NRRConfig(filename: str)[source]

Detect CNVs based on read depth data

Parameters:filename (str) – path to configuration file
cnvfinder.nrrhandler.readcount(region_list: list, filename: str) → list

Count the number of reads in each region using the pysam.AlignmentFile.count method

Parameters:
  • region_list (list) – regions in the format: ‘chr1:1000-10000’
  • filename (str) – path to bamfile
Return counters: list of the number of reads in each region
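
A sketch of calling readcount directly; region strings follow the chr:start-end form used throughout this documentation, and keyword arguments are used here so no particular argument order is assumed.

    from cnvfinder.nrrhandler import readcount

    regions = ['chr1:1000-10000', 'chr2:5000-25000']

    # count reads falling in each region of an (illustrative) BAM file
    counters = readcount(region_list=regions, filename='sample.bam')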

vcfhandler

class cnvfinder.vcfhandler.VCF(vcffile: str)[source]

Hold information on variants loaded from a VCF file using pysam.VariantFile

Parameters:vcffile (str) – path to vcf file
static calc_baf(info: pandas.core.series.Series) → float[source]

Compute the frequency of the first alternative allele

Parameters:info (Series) – single variant data
Returns:computed B-allele frequency
class cnvfinder.vcfhandler.VCFList(vcffiles: list)[source]

VCF object list management

Parameters:vcffiles (list) – list of paths to variant (VCF) files
class cnvfinder.vcfhandler.VCFTest(baseline: cnvfinder.vcfhandler.vcfhandler.VCFList, sample: cnvfinder.vcfhandler.vcfhandler.VCF, metric: str = 'IQR', interval_range: float = 1.5, size: int = 400, step: int = 40, cnv_like_range: float = 0.7, maxdist: int = 15000000, path: str = 'results')[source]

Hold information about tests between a VCF baseline (VCFList) and a VCF test sample

Parameters:
  • baseline (VCFList) – represents baseline’s VCF files
  • sample (VCF) – represents sample’s VCF file
  • metric (str) – which metric to use; currently only ‘IQR’ is available for BAF data
  • interval_range (float) – value to multiply metric by
  • size (int) – size of the sliding-window block
  • step (int) – step size of the sliding window
  • cnv_like_range (float) – value to multiply interval_range by in order to detect CNV-like blocks
  • maxdist (int) – maximum distance allowed between a CNV-like block and its closest CNV block for the former to be called a CNV as well
  • path (str) – output directory path
cluster_baf(region: str = None, mode: str = None, n_clusters: int = 2) → tuple[source]

Cluster BAF values in ‘region’, considering ‘mode’, and applying the KMeans method.

Parameters:
  • region (str) – it should be in the form: chr1:10000-90000
  • mode – homozygous or heterozygous
  • n_clusters (int) – number of expected clusters
Returns:

labels (defaultdict), cluster centers (list), and silhouette score (float)
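
A sketch of building a VCFTest and clustering BAF values with cluster_baf; file paths and the region are illustrative.

    from cnvfinder.vcfhandler import VCF, VCFList, VCFTest

    # baseline VCFs and a test sample VCF (illustrative paths)
    baseline = VCFList(['control1.vcf.gz', 'control2.vcf.gz'])
    sample = VCF('sample.vcf.gz')
    vcftest = VCFTest(baseline, sample, path='results')

    # cluster heterozygous BAF values of a region into two groups
    labels, centers, score = vcftest.cluster_baf(region='chr1:10000-90000',
                                                 mode='heterozygous',
                                                 n_clusters=2)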

compare(nmin: int = None, maxdiff: float = 1e-05)[source]

Apply filters on sample variants considering variants from baseline files. Current filters:

  • similar BAF values: a given variant is filtered out if its BAF is identical across all (or at least ‘nmin’) baseline samples
Parameters:
  • nmin (int) – minimum number of baseline samples for a variant to be considered a false positive
  • maxdiff (float) – maximum difference between two BAF values for them to be considered equal
eliminate_vars()[source]

Filter out the given variants from self.df

filter(mindp: int = 60)[source]

Apply filters on variants located at self.df

Parameters:mindp (int) – minimum number of reads to pass the filter
getin(region: str = None, column: str = None, mode: str = None) → Union[list, None, pandas.core.frame.DataFrame][source]

Get data in ‘column’ for variants that are located in region considering ‘mode’. Available columns:

  • chrom
  • start
  • stop
  • alleles
  • alts
  • contig
  • id
  • info
  • pos
  • qual
  • ref
  • rid
  • rlen
  • dp
  • baf
Parameters:
  • region (str) – it should be in the form: chr1:10000-90000
  • column (str) – column’s name
  • mode – define which view should be returned: homozygous or heterozygous
Returns:

dataframe containing the view, or a list if ‘column’ is passed

getview(mode: str = None) → pandas.core.frame.DataFrame[source]

Create a view of self.df depending on the mode. Available modes:

  • homozygous: create view with homozygous variants
  • heterozygous: create a view with heterozygous variants
Parameters:mode (str) – view mode
Returns:dataframe view, or the whole dataframe if mode is None
iterchroms(mode: str = None, df: pandas.core.frame.DataFrame = None) → list[source]

Group variants by chromosome

Parameters:
  • mode (str) – homozygous or heterozygous
  • df (Dataframe) – dataframe to split
Returns:

list of dictionary groups {‘id’: ‘chrom:chromStart-chromEnd’, ‘df’: dataframe}

load_filtered_out()[source]

Load filtered out variants from file. Path to file is defined as: self.vcffile + ‘.txt’

save_filtered_out()[source]

Save filtered out variants of the test sample. Output file is defined as: self.vcffile + ‘.txt’

split()[source]

Split self.df into two dataframes

  • self.het_vars for heterozygous variants and
  • self.hom_vars for homozygous variants
vcfplot(region: str = None, mode: str = None, filename: str = None, auto_open: bool = False)[source]

Plot variants within region

Parameters:
  • region – it should be in the form: chr1:10000-90000
  • mode (str) – homozygous or heterozygous
  • filename (str) – path to output file
  • auto_open (bool) – whether to automatically open the resulting plot
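
A sketch of a typical VCFTest filtering and inspection workflow using the methods documented above; the output file name and extension are illustrative, and the region is the example form used throughout.

    # 'vcftest' is the VCFTest instance from the cluster_baf sketch above
    vcftest.filter(mindp=60)  # keep variants with at least 60 supporting reads
    vcftest.compare()         # filter out variants whose BAF is identical across baseline samples
    vcftest.split()           # build self.het_vars and self.hom_vars

    # inspect heterozygous variants and plot their BAFs in a region
    het_view = vcftest.getview(mode='heterozygous')
    bafs = vcftest.getin(region='chr1:10000-90000', column='baf',
                         mode='heterozygous')
    vcftest.vcfplot(region='chr1:10000-90000', mode='heterozygous',
                    filename='results/chr1_baf.html')
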
class cnvfinder.vcfhandler.VCFConfig(filename: str, tofilter: bool = True)[source]

Detect CNVs based on B-allele frequency (BAF) data

Parameters:
  • filename (str) – path to configuration file
  • tofilter (bool) – whether to apply filters on variants

bedloader

class cnvfinder.bedloader.ROI(bedfile: str, spacing: int = 20, mindata: int = 50, maxpool: int = None, lines: list = None)[source]

This class stores “regions of interest” loaded from a BED file

  • self.amplicons is a list of Amplicon objects
  • self.targets is a DataFrame of valid targets for CNV analysis
Parameters:
  • bedfile (str) – path to file in which sequencing amplicons are listed (BED)
  • spacing (int) – number of nucleotides to ignore at amplicon start and end, to avoid overlapping reads
  • mindata (int) – minimum number of nucleotides for a valid target
  • maxpool (int) – maximum number of origin pools allowed for a valid target
  • lines (list) – optional list of lines containing BED data
define_targets(spacing, min_data) → pandas.core.frame.DataFrame[source]

Return valid targets for further analysis. Each target is in the form: (chromosome, start, end, Amplicon)

Parameters:
  • spacing (int) – number of nucleotides to ignore at amplicon start and end, to avoid overlapping reads
  • min_data (int) – minimum number of nucleotides for a valid target
Returns:

a dataframe of targets

static load_amplicons(lines: list, maxpool: int, pool_loc: int) → list[source]

Return a sorted list of amplicons

Parameters:
  • pool_loc – location of pool data in each line
  • lines (list) – list of lines loaded from bed file
  • maxpool (int) – maximum number of origin pools allowed for a valid target
Returns:

sorted list of amplicons
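
A sketch of loading regions of interest from a BED file and inspecting the attributes documented above; the file name is illustrative.

    from cnvfinder.bedloader import ROI

    # load amplicons from a BED file, ignoring 20 bp at each amplicon end and
    # requiring at least 50 nucleotides per valid target
    roi = ROI('amplicons.bed', spacing=20, mindata=50)

    print(len(roi.amplicons))  # list of Amplicon objects
    print(roi.targets.head())  # DataFrame of valid targets for CNV analysis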

tsvparser

class cnvfinder.tsvparser.CoverageFileParser(filename: str)[source]

Parses an amplicon coverage file loaded by BedFileLoader

Parameters:filename (str) – path to amplicon.cov file
static create_column_map(columns) → collections.defaultdict[source]

Create a dict based on columns

Parameters:columns (list) – list of columns
Returns:a dictionary mapping ‘columns’ values to their indexes
define_targets(lines, columns) → tuple[source]

Extract columns from lines based on the columns of interest and split them into two entities: one DataFrame representing the actual targets [chrom, chromStart, chromEnd…] and a list holding the number of reads for each target.

Parameters:
  • lines (list) – actual data
  • columns (list) – list of columns of interest
Returns:

a DataFrame describing the targets and a list of counters
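
A sketch of using an amplicon coverage file instead of a BED/BAM pair; the file path is illustrative, and whether a coverage file alone is sufficient for NRR is an assumption based on the ‘covfile’ and ‘bed’ parameters documented above.

    from cnvfinder.tsvparser import CoverageFileParser
    from cnvfinder.nrrhandler import NRR

    # parse an amplicon coverage file directly (illustrative path)
    covparser = CoverageFileParser('amplicon.cov')

    # alternatively, NRR accepts a coverage file via 'covfile', or a
    # pre-built parser via its 'bed' parameter
    nrr = NRR(covfile='amplicon.cov')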

ideogram

class cnvfinder.ideogram.Ideogram(file: str = None, chroms: list = None, chrom_height: float = 1, chrom_spacing: float = 1.5, fig_size: tuple = None, colors: dict = None, to_log=False)[source]

Create ideograms

Parameters:
  • file (str) – file from which to load chromosome band data. Default: ‘cytoBand’ table from https://genome.ucsc.edu/cgi-bin/hgTables.
  • chroms (list) – plot only chromosomes that are in this list. Default: [‘chr%s’ % i for i in list(range(1, 23)) + [‘M’, ‘X’, ‘Y’]]
  • chrom_height (float) – height of each ideogram
  • chrom_spacing (float) – spacing between consecutive ideograms
  • fig_size (tuple) – width and height in inches
  • colors (dict) – colors for different chromosome stains
  • to_log (bool) – whether to print log info
add_chromosomes()[source]

Add chromosome ideograms

Returns:fig and ax
add_data(df: pandas.core.frame.DataFrame, height: float = 0.5, padding: float = 0.1, color: str = '#2243a8', alpha: float = 0.5, linewidths: float = 0, **kwargs)[source]

Add (genomic) data in the plot

Parameters:
  • df (DataFrame) – data
  • height (float) – height of genomic track. Should be smaller than ‘chrom_spacing’
  • padding (float) – padding between the top of a genomic track and its corresponding ideogram
  • color (str) – track color, used when ‘colors’ is not in df.columns
  • alpha (float) – alpha value used for blending
  • linewidths (float) – line widths
  • kwargs – are passed to BrokenBarHCollection
add_data_above(df: pandas.core.frame.DataFrame, color: str = None)[source]

Wrapper for adding data above ideograms

Parameters:
  • color (str) – bars color
  • df (DataFrame) – data
add_data_below(df: pandas.core.frame.DataFrame, color: str = None)[source]

Wrapper for adding data below ideograms

Parameters:
  • color (str) – bars color
  • df (DataFrame) – data
add_legend(to_patches: list, loc='lower right', **kwargs)[source]

Create a legend based on the to_patches list

Parameters:
  • to_patches (list) – list of dict -> {color: color, label: label}
  • loc (str) – legend location
  • kwargs – are passed to pyplot.legend
save(filename: str, **kwargs)[source]

Save ideograms in a file

Parameters:
  • filename (str) – filename
  • kwargs – are passed to pyplot.savefig
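
A sketch of building and saving an ideogram figure with the methods documented above; the chromosome list, the DataFrame column names (chrom, chromStart, chromEnd), and the file name are illustrative assumptions.

    import pandas as pd
    from cnvfinder.ideogram import Ideogram

    # draw ideograms for a subset of chromosomes
    ideo = Ideogram(chroms=['chr1', 'chr2', 'chrX'])
    fig, ax = ideo.add_chromosomes()

    # genomic intervals to highlight; these column names are an assumption
    cnvs = pd.DataFrame({'chrom': ['chr1', 'chr2'],
                         'chromStart': [1000000, 5000000],
                         'chromEnd': [2000000, 6000000]})
    ideo.add_data_above(cnvs, color='#2243a8')
    ideo.add_legend([{'color': '#2243a8', 'label': 'CNV'}])
    ideo.save('ideogram.png', dpi=300)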