CNVfinder’s code documentation

cnvdetector

class cnvfinder.cnvdetector.CNVTest(nrrtest: cnvfinder.nrrhandler.nrrhandler.NRRTest = None, vcftest: cnvfinder.vcfhandler.vcfhandler.VCFTest = None, path: str = 'results')[source]

Detect and call CNVs based on read depth and/or variant info

Parameters:
  • nrrtest (NRRTest) – read depth data. When it’s None, analysis employing “read depth” won’t be performed
  • vcftest (VCFTest) – variant data. When it’s None, analysis employing variant (B-allele frequency) data won’t be performed
  • path (str) – path to output directory
analyze_plot()[source]

Detect CNVs applying ratio and/or BAF data. Output (list and plots) will be saved at self.path

create_ideogram()[source]

Create ideograms for the current test

save()[source]

Save list of detected CNVs

summary()[source]

Write a summary of the tests at self.path/summary.txt
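
A minimal sketch of the CNVTest workflow described above. It assumes an NRRTest and a VCFTest have already been built (see the nrrhandler and vcfhandler sections below); the variable names are illustrative.

    from cnvfinder.cnvdetector import CNVTest

    # 'nrrtest' (NRRTest) and 'vcftest' (VCFTest) are assumed to have been
    # built beforehand, as described in the nrrhandler and vcfhandler sections
    cnvtest = CNVTest(nrrtest=nrrtest, vcftest=vcftest, path='results')

    cnvtest.analyze_plot()     # detect CNVs; list and plots are saved under 'results'
    cnvtest.save()             # save the list of detected CNVs
    cnvtest.summary()          # write results/summary.txt
    cnvtest.create_ideogram()  # draw ideograms for the current test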

class cnvfinder.cnvdetector.CNVConfig(filename: str, ratio: bool = True, variant: bool = True)[source]

A wrapper for CNVTest. This class loads test parameters from a configuration file

Parameters:
  • filename (str) – path to configuration file
  • ratio (bool) – specify the usage of read depth data
  • variant (bool) – specify the usage of variant data
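
A minimal sketch of driving an analysis from a configuration file through CNVConfig; the configuration file name is illustrative.

    from cnvfinder.cnvdetector import CNVConfig

    # load test parameters from a configuration file (illustrative name) and
    # enable both read depth (ratio) and variant (BAF) based analyses
    config = CNVConfig('cnvfinder.cfg', ratio=True, variant=True)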

nrrhandler

class cnvfinder.nrrhandler.NRR(bedfile: str = None, bamfile: str = None, region: str = None, counters: list = [], bed: Union[cnvfinder.bedloader.bedloader.ROI, cnvfinder.tsvparser.tsvparser.CoverageFileParser] = None, parallel: bool = True, to_label: bool = False, covfile: str = None)[source]

NRR stands for “Number of Reads in Region” loaded from a BAM file

Parameters:
  • bedfile (str) – path to the BED file in which amplicons are listed
  • bamfile (str) – path to alignment file (bam format)
  • region (str) – limit target definition to a given region. It should be in the form: chr1:10000-90000
  • counters (list) – list of read depth counters
  • bed (ROI) – amplicons already loaded in memory
  • parallel (bool) – whether to count target read depth in parallel
  • to_label (bool) – whether to label each target according to its read depth relative to the mean
  • covfile (str) – path to amplicon.cov file
count(cores: int = None, parallel: bool = False)[source]

For each defined target, count read depth

Parameters:
  • cores (int) – number of cores to be used when counting in parallel mode. Default: all available
  • parallel (bool) – whether to count read depth in parallel
count_label_by_pool() → pandas.core.frame.DataFrame[source]

Count the number of targets arranged by label in each pool

Returns:dataframe with the number of targets per label in each pool (pools x labels)
load(filename: str) → Optional[int][source]

Load a single NRR from a text file

Parameters:filename (str) – path to count file. Normally something like: bamfile.bam.txt
Returns:1 when data loading is successful, None otherwise
save(filename: str = None)[source]

Save a single NRR on a text file

Parameters:filename (str) – path to output
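
A sketch of building an NRR from a BED/BAM pair, counting read depth, and persisting the counts; file names are illustrative.

    from cnvfinder.nrrhandler import NRR

    # define targets from a BED file and attach the alignment (BAM) file
    nrr = NRR(bedfile='amplicons.bed', bamfile='sample.bam', parallel=True)

    # count read depth for every defined target, using 4 cores
    nrr.count(cores=4, parallel=True)

    # persist the counts; load() expects names like 'sample.bam.txt'
    nrr.save('sample.bam.txt')

    # later, reload the counts instead of recounting (returns 1 on success)
    nrr.load('sample.bam.txt')
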
class cnvfinder.nrrhandler.NRRList(bedfile: str = None, bamfiles: list = None, region: str = None, bed: cnvfinder.bedloader.bedloader.ROI = None, parallel: bool = True, to_classify: bool = False, covfiles: list = None)[source]

NRR object list management

Parameters:
  • bedfile (str) – path to the BED file in which amplicons are listed
  • bamfiles (list) – list of paths to alignment (BAM) files
  • region (str) – limit target definition to a given region. It should be in the form: chr1:10000-90000
  • bed (ROI) – amplicons already loaded into memory
  • parallel (bool) – whether to count read depth of the defined targets in parallel
  • to_classify (bool) – whether to classify the defined targets according to their read depth
  • covfiles (list) – list of paths to amplicon.cov files
compute_metrics()[source]

Compute baseline read depth median and IQR for each defined target

make_mean()[source]

Compute baseline read depth mean for each defined target
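
A sketch of assembling a baseline from several control BAM files and computing the per-target baseline statistics; paths are illustrative.

    from cnvfinder.nrrhandler import NRRList

    # baseline built from several control BAM files (illustrative paths)
    baseline = NRRList(bedfile='amplicons.bed',
                       bamfiles=['control1.bam', 'control2.bam', 'control3.bam'],
                       parallel=True)

    baseline.compute_metrics()  # per-target baseline median and IQR
    baseline.make_mean()        # per-target baseline mean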

class cnvfinder.nrrhandler.NRRTest(baseline: cnvfinder.nrrhandler.nrrhandler.NRRList, sample: cnvfinder.nrrhandler.nrrhandler.NRR, path: str = 'results', size: int = 200, step: int = 10, metric: str = 'IQR', interval_range: float = 1.5, minread: int = 25, below_cutoff: float = 0.7, above_cutoff: float = 1.3, maxdist: int = 15000000, cnv_like_range: float = 0.7, bins=500, method='chr_group')[source]

Hold information about tests between an NRR baseline (NRRList) and an NRR test sample

Parameters:
  • baseline (NRRList) – represents bamfiles of the baseline
  • sample (NRR) – represents the bamfile of a test sample
  • path (str) – output directory path
  • size (int) – size of the sliding-window block
  • step (int) – step size of the sliding window
  • metric (str) – which metric to use: ‘std’ or ‘IQR’
  • interval_range (float) – value to multiply metric by
  • minread (int) – minimum number of reads used to filter targets
  • below_cutoff (float) – filter out data (ratios) below this cutoff
  • above_cutoff (float) – filter out data (ratios) above this cutoff
  • maxdist (int) – maximum distance allowed between a CNV-like block and its closest CNV block for the former to be called a CNV as well
  • cnv_like_range (float) – value to multiply interval_range by in order to detect CNV-like blocks
  • bins (int) – number of bins to use when plotting ratio data
  • method (str) – method used to group ratios when plotting
filter(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Filter dataframe by mean >= minread

Parameters:df (DataFrame) – dataframe to be filtered
Returns:filtered dataframe
make_ratio()[source]

Compute ratio between the standardized read depth values from a test sample and the mean baseline read depth

merge(potential_cnvs: list) → list[source]

Merge CNV blocks

Parameters:potential_cnvs (list) – list of potential CNV blocks
Returns:list of CNV blocks merged
plot()[source]

Plot all defined target ratios

summary(npools: int = 12) → pandas.core.frame.DataFrame[source]

Create a summary of NRRTest samples (baseline and test)

Parameters:npools (int) – number of pools
Returns:summary as a dataframe
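
A sketch of a read depth test that reuses the baseline and sample objects from the sketches above; the remaining parameters shown are the documented defaults.

    from cnvfinder.nrrhandler import NRRTest

    # 'baseline' (NRRList) and 'sample' (NRR) are assumed to come from the
    # NRR and NRRList sketches above
    nrrtest = NRRTest(baseline, sample, path='results',
                      size=200, step=10, metric='IQR', interval_range=1.5)

    nrrtest.make_ratio()                     # sample vs. baseline read depth ratios
    nrrtest.plot()                           # plot ratios for all defined targets
    summary_df = nrrtest.summary(npools=12)  # summary of baseline and test samples
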
class cnvfinder.nrrhandler.NRRConfig(filename: str)[source]

Detect CNVs based on read depth data

Parameters:filename (str) – path to configuration file
cnvfinder.nrrhandler.readcount(region_list: list, filename: str) → list

Count the number of reads in each region using the pysam.AlignmentFile.count method

Parameters:
  • region_list (list) – regions in the format: ‘chr1:1000-10000’
  • filename (str) – path to bamfile
Return counters: list of the number of reads in each region
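
A sketch of calling readcount directly; region strings follow the chr:start-end form used throughout this documentation, and keyword arguments are used here so no particular argument order is assumed.

    from cnvfinder.nrrhandler import readcount

    regions = ['chr1:1000-10000', 'chr2:5000-25000']

    # count reads falling in each region of an (illustrative) BAM file
    counters = readcount(region_list=regions, filename='sample.bam')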

vcfhandler

class cnvfinder.vcfhandler.VCF(vcffile: str)[source]

Hold information on variants loaded from a VCF file using pysam.VariantFile

Parameters:vcffile (str) – path to vcf file
static calc_baf(info: pandas.core.series.Series) → float[source]

Compute the frequency of the first alternative allele

Parameters:info (Series) – single variant data
Returns:computed B-allele frequency
class cnvfinder.vcfhandler.VCFList(vcffiles: list)[source]

VCF object list management

Parameters:vcffiles (list) – list of paths to variant (VCF) files
class cnvfinder.vcfhandler.VCFTest(baseline: cnvfinder.vcfhandler.vcfhandler.VCFList, sample: cnvfinder.vcfhandler.vcfhandler.VCF, metric: str = 'IQR', interval_range: float = 1.5, size: int = 400, step: int = 40, cnv_like_range: float = 0.7, maxdist: int = 15000000, path: str = 'results')[source]

Hold information about tests between a VCF baseline (VCFList) and a VCF test sample

Parameters:
  • baseline (VCFList) – represents baseline’s VCF files
  • sample (VCF) – represents sample’s VCF file
  • metric (str) – which metric to use; currently only ‘IQR’ is available for BAF data
  • interval_range (float) – value to multiply metric by
  • size (int) – size of the sliding-window block
  • step (int) – step size of the sliding window
  • cnv_like_range (float) – value to multiply interval_range by in order to detect CNV-like blocks
  • maxdist (int) – maximum distance allowed between a CNV-like block and its closest CNV block for the former to be called a CNV as well
  • path (str) – output directory path
cluster_baf(region: str = None, mode: str = None, n_clusters: int = 2) → tuple[source]

Cluster BAF values in ‘region’, considering ‘mode’, and applying the KMeans method.

Parameters:
  • region (str) – it should be in the form: chr1:10000-90000
  • mode – homozygous or heterozygous
  • n_clusters (int) – number of expected clusters
Returns:

labels (defaultdict), cluster centers (list), and silhouette score (float)
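
A sketch of building a VCFTest and clustering BAF values with cluster_baf; file paths and the region are illustrative.

    from cnvfinder.vcfhandler import VCF, VCFList, VCFTest

    # baseline VCFs and a test sample VCF (illustrative paths)
    baseline = VCFList(['control1.vcf.gz', 'control2.vcf.gz'])
    sample = VCF('sample.vcf.gz')
    vcftest = VCFTest(baseline, sample, path='results')

    # cluster heterozygous BAF values of a region into two groups
    labels, centers, score = vcftest.cluster_baf(region='chr1:10000-90000',
                                                 mode='heterozygous',
                                                 n_clusters=2)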

compare(nmin: int = None, maxdiff: float = 1e-05)[source]

Apply filters on sample variants considering variants from baseline files. Current filters:

  • similar BAF values: a given variant is filtered out if its BAF is identical across all (or at least ‘nmin’) baseline samples
Parameters:
  • nmin (int) – minimum number of baseline samples for a variant to be considered a false positive
  • maxdiff (float) – maximum difference between two BAF values for them to be considered equal
eliminate_vars()[source]

Filter out the given variants from self.df

filter(mindp: int = 60)[source]

Apply filters on variants located at self.df

Parameters:mindp (int) – minimum number of reads to pass the filter
getin(region: str = None, column: str = None, mode: str = None) → Union[list, None, pandas.core.frame.DataFrame][source]

Get data in ‘column’ for variants that are located in region considering ‘mode’. Available columns:

  • chrom
  • start
  • stop
  • alleles
  • alts
  • contig
  • id
  • info
  • pos
  • qual
  • ref
  • rid
  • rlen
  • dp
  • baf
Parameters:
  • region (str) – it should be in the form: chr1:10000-90000
  • column (str) – column’s name
  • mode – define which view should be returned: homozygous or heterozygous
Returns:

dataframe containing the view, or a list if ‘column’ is passed

getview(mode: str = None) → pandas.core.frame.DataFrame[source]

Create a view of self.df depending on the mode. Available modes:

  • homozygous: create view with homozygous variants
  • heterozygous: create a view with heterozygous variants
Parameters:mode (str) – view mode
Returns:dataframe view, or the whole dataframe if mode is None
iterchroms(mode: str = None, df: pandas.core.frame.DataFrame = None) → list[source]

Group variants by chromosome

Parameters:
  • mode (str) – homozygous or heterozygous
  • df (Dataframe) – dataframe to split
Returns:

list of dictionary groups {‘id’: ‘chrom:chromStart-chromEnd’, ‘df’: dataframe}

load_filtered_out()[source]

Load filtered out variants from file. Path to file is defined as: self.vcffile + ‘.txt’

save_filtered_out()[source]

Save filtered out variants of the test sample. Output file is defined as: self.vcffile + ‘.txt’

split()[source]

Split self.df into two dataframes

  • self.het_vars for heterozygous variants and
  • self.hom_vars for homozygous variants
vcfplot(region: str = None, mode: str = None, filename: str = None, auto_open: bool = False)[source]

Plot variants within region

Parameters:
  • region – it should be in the form: chr1:10000-90000
  • mode (str) – homozygous or heterozygous
  • filename (str) – path to output file
  • auto_open (bool) – whether to automatically open the resulting plot
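
A sketch of a typical VCFTest filtering and inspection workflow using the methods documented above; the output file name and extension are illustrative, and the region is the example form used throughout.

    # 'vcftest' is the VCFTest instance from the cluster_baf sketch above
    vcftest.filter(mindp=60)  # keep variants with at least 60 supporting reads
    vcftest.compare()         # filter out variants whose BAF is identical across baseline samples
    vcftest.split()           # build self.het_vars and self.hom_vars

    # inspect heterozygous variants and plot their BAFs in a region
    het_view = vcftest.getview(mode='heterozygous')
    bafs = vcftest.getin(region='chr1:10000-90000', column='baf',
                         mode='heterozygous')
    vcftest.vcfplot(region='chr1:10000-90000', mode='heterozygous',
                    filename='results/chr1_baf.html')
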
class cnvfinder.vcfhandler.VCFConfig(filename: str, tofilter: bool = True)[source]

Detect CNVs based on B-allele frequency (BAF) data

Parameters:
  • filename (str) – path to configuration file
  • tofilter (bool) – whether to apply filters on variants

bedloader

class cnvfinder.bedloader.ROI(bedfile: str, spacing: int = 20, mindata: int = 50, maxpool: int = None, lines: list = None)[source]

This class stores “regions of interest” loaded from a BED file

  • self.amplicons is a list of Amplicon objects
  • self.targets is a DataFrame of valid targets for CNV analysis
Parameters:
  • bedfile (str) – path to file in which sequencing amplicons are listed (BED)
  • spacing (int) – number of nucleotides to ignore at amplicon start and end, to avoid overlapping reads
  • mindata (int) – minimum number of nucleotides for a valid target
  • maxpool (int) – maximum number of origin pools allowed for a valid target
  • lines (list) – optional list of lines containing BED data
define_targets(spacing, min_data) → pandas.core.frame.DataFrame[source]

Return valid targets for further analysis. Each target is in the form: (chromosome, start, end, Amplicon)

Parameters:
  • spacing (int) – number of nucleotides to ignore at amplicon start and end, to avoid overlapping reads
  • min_data (int) – minimum number of nucleotides for a valid target
Returns:

a dataframe of targets

static load_amplicons(lines: list, maxpool: int, pool_loc: int) → list[source]

Return a sorted list of amplicons

Parameters:
  • pool_loc – location of pool data in each line
  • lines (list) – list of lines loaded from bed file
  • maxpool (int) – maximum number of origin pools allowed for a valid target
Returns:

sorted list of amplicons
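
A sketch of loading regions of interest from a BED file and inspecting the attributes documented above; the file name is illustrative.

    from cnvfinder.bedloader import ROI

    # load amplicons from a BED file, ignoring 20 bp at each amplicon end and
    # requiring at least 50 nucleotides per valid target
    roi = ROI('amplicons.bed', spacing=20, mindata=50)

    print(len(roi.amplicons))  # list of Amplicon objects
    print(roi.targets.head())  # DataFrame of valid targets for CNV analysis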

tsvparser

class cnvfinder.tsvparser.CoverageFileParser(filename: str)[source]

Parses an amplicon coverage file loaded by BedFileLoader

Parameters:filename (str) – path to amplicon.cov file
static create_column_map(columns) → collections.defaultdict[source]

Create a dict based on columns

Parameters:columns (list) – list of columns
Returns:a dictionary mapping ‘columns’ values to their indexes
define_targets(lines, columns) → tuple[source]

Extract columns from lines based on the columns of interest and split them into two entities: one DataFrame representing the actual targets [chrom, chromStart, chromEnd…] and a list holding the number of reads for each target.

Parameters:
  • lines (list) – actual data
  • columns (list) – list of columns of interest
Returns:

a DataFrame describing the targets and a list of counters
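
A sketch of using an amplicon coverage file instead of a BED/BAM pair; the file path is illustrative, and whether a coverage file alone is sufficient for NRR is an assumption based on the ‘covfile’ and ‘bed’ parameters documented above.

    from cnvfinder.tsvparser import CoverageFileParser
    from cnvfinder.nrrhandler import NRR

    # parse an amplicon coverage file directly (illustrative path)
    covparser = CoverageFileParser('amplicon.cov')

    # alternatively, NRR accepts a coverage file via 'covfile', or a
    # pre-built parser via its 'bed' parameter
    nrr = NRR(covfile='amplicon.cov')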

ideogram

class cnvfinder.ideogram.Ideogram(file: str = None, chroms: list = None, chrom_height: float = 1, chrom_spacing: float = 1.5, fig_size: tuple = None, colors: dict = None, to_log=False)[source]

Create ideograms

Parameters:
  • file (str) – file from which to load chromosome band data. Default: ‘cytoBand’ table from https://genome.ucsc.edu/cgi-bin/hgTables.
  • chroms (list) – plot only chromosomes that are in this list. Default: [‘chr%s’ % i for i in list(range(1, 23)) + [‘M’, ‘X’, ‘Y’]]
  • chrom_height (float) – height of each ideogram
  • chrom_spacing (float) – spacing between consecutive ideograms
  • fig_size (tuple) – width and height in inches
  • colors (dict) – colors for different chromosome stains
  • to_log (bool) – whether to print log info
add_chromosomes()[source]

Add chromosome ideograms

Returns:fig and ax
add_data(df: pandas.core.frame.DataFrame, height: float = 0.5, padding: float = 0.1, color: str = '#2243a8', alpha: float = 0.5, linewidths: float = 0, **kwargs)[source]

Add (genomic) data in the plot

Parameters:
  • df (DataFrame) – data
  • height (float) – height of genomic track. Should be smaller than ‘chrom_spacing’
  • padding (float) – padding between the top of a genomic track and its corresponding ideogram
  • color (str) – track color, used when ‘colors’ is not in df.columns
  • alpha (float) – alpha value used for blending
  • linewidths (float) – line widths
  • kwargs – are passed to BrokenBarHCollection
add_data_above(df: pandas.core.frame.DataFrame, color: str = None)[source]

Wrapper for adding data above ideograms

Parameters:
  • color (str) – bars color
  • df (DataFrame) – data
add_data_below(df: pandas.core.frame.DataFrame, color: str = None)[source]

Wrapper for adding data below ideograms

Parameters:
  • color (str) – bars color
  • df (DataFrame) – data
add_legend(to_patches: list, loc='lower right', **kwargs)[source]

Create a legend based on the to_patches list

Parameters:
  • to_patches (list) – list of dict -> {color: color, label: label}
  • loc (str) – legend location
  • kwargs – are passed to pyplot.legend
save(filename: str, **kwargs)[source]

Save ideograms in a file

Parameters:
  • filename (str) – filename
  • kwargs – are passed to pyplot.savefig
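
A sketch of building and saving an ideogram figure with the methods documented above; the chromosome list, the DataFrame column names (chrom, chromStart, chromEnd), and the file name are illustrative assumptions.

    import pandas as pd
    from cnvfinder.ideogram import Ideogram

    # draw ideograms for a subset of chromosomes
    ideo = Ideogram(chroms=['chr1', 'chr2', 'chrX'])
    fig, ax = ideo.add_chromosomes()

    # genomic intervals to highlight; these column names are an assumption
    cnvs = pd.DataFrame({'chrom': ['chr1', 'chr2'],
                         'chromStart': [1000000, 5000000],
                         'chromEnd': [2000000, 6000000]})
    ideo.add_data_above(cnvs, color='#2243a8')
    ideo.add_legend([{'color': '#2243a8', 'label': 'CNV'}])
    ideo.save('ideogram.png', dpi=300)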