CNVfinder’s code documentation
cnvdetector
class cnvfinder.cnvdetector.CNVTest(nrrtest: cnvfinder.nrrhandler.nrrhandler.NRRTest = None, vcftest: cnvfinder.vcfhandler.vcfhandler.VCFTest = None, path: str = 'results')
Detect and call CNVs based on read depth and/or variant info
Parameters:
- nrrtest (NRRTest) – read depth data. When it’s None, analysis employing read depth won’t be performed
- vcftest (VCFTest) – variant data. When it’s None, analysis employing variant data won’t be performed
- path (str) – path to output directory
nrrhandler
class cnvfinder.nrrhandler.NRR(bedfile: str = None, bamfile: str = None, region: str = None, counters: list = [], bed: Union[cnvfinder.bedloader.bedloader.ROI, cnvfinder.tsvparser.tsvparser.CoverageFileParser] = None, parallel: bool = True, to_label: bool = False, covfile: str = None)
NRR stands for “Number of Reads in Region”, loaded from a BAM file
Parameters:
- bedfile (str) – path to the BED file in which amplicons are listed
- bamfile (str) – path to alignment file (BAM format)
- region (str) – limit target definition to a given region. It should be in the form: chr1:10000-90000
- counters (list) – list of read depth counters
- bed (ROI) – amplicons already loaded in memory
- parallel (bool) – whether to count target read depth in parallel
- to_label (bool) – whether to label each target according to its read depth in comparison to the mean
- covfile (str) – path to amplicon.cov file
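The region format documented above (chr1:10000-90000) can be split into its parts as follows. This is a minimal sketch; parse_region and Region are hypothetical names used for illustration and are not part of CNVfinder.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_region(region: str) -> Region:
    """Split a 'chrom:start-end' string into its three components."""
    chrom, colon, span = region.partition(":")
    start, dash, end = span.partition("-")
    if not (chrom and colon and dash):
        raise ValueError(f"expected 'chrom:start-end', got {region!r}")
    return Region(chrom, int(start), int(end))


print(parse_region("chr1:10000-90000"))
```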
count(cores: int = None, parallel: bool = False)
For each defined target, count read depth
count_label_by_pool() → pandas.core.frame.DataFrame
Count the number of targets arranged by label in each pool
Returns: ldf, the number of targets per pool and label (pools x labels)
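A pools x labels count can be sketched in plain Python as below. The label names ("low", "normal", "high"), the dictionary layout, and the helper name count_labels are assumptions made for illustration; the real method returns a pandas DataFrame.

```python
from collections import Counter


def count_labels(targets):
    """Count targets per (pool, label) and return a nested dict: pool -> label -> count."""
    counts = Counter((t["pool"], t["label"]) for t in targets)
    pools = sorted({pool for pool, _ in counts})
    labels = sorted({label for _, label in counts})
    return {pool: {label: counts.get((pool, label), 0) for label in labels} for pool in pools}


# Hypothetical targets, each assigned to an origin pool and a read depth label
targets = [
    {"pool": 1, "label": "normal"},
    {"pool": 1, "label": "normal"},
    {"pool": 1, "label": "low"},
    {"pool": 2, "label": "high"},
]
table = count_labels(targets)
print(table)
```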
class cnvfinder.nrrhandler.NRRList(bedfile: str = None, bamfiles: list = None, region: str = None, bed: cnvfinder.bedloader.bedloader.ROI = None, parallel: bool = True, to_classify: bool = False, covfiles: list = None)
NRR object list management
Parameters:
- bedfile (str) – path to the BED file in which the amplicons are listed
- bamfiles (list) – list of paths to alignment (BAM) files
- region (str) – limit target definition to a given region. It should be in the form: chr1:10000-90000
- bed (ROI) – amplicons already loaded into memory
- parallel (bool) – whether to count the read depth of defined targets in parallel
- to_classify (bool) – whether to classify defined targets according to their read depth
- covfiles (list) – list of paths to amplicon.cov files
class cnvfinder.nrrhandler.NRRTest(baseline: cnvfinder.nrrhandler.nrrhandler.NRRList, sample: cnvfinder.nrrhandler.nrrhandler.NRR, path: str = 'results', size: int = 200, step: int = 10, metric: str = 'IQR', interval_range: float = 1.5, minread: int = 25, below_cutoff: float = 0.7, above_cutoff: float = 1.3, maxdist: int = 15000000, cnv_like_range: float = 0.7, bins=500, method='chr_group')
Hold information about tests between an NRR baseline (NRRList) and an NRR test sample
Parameters:
- baseline (NRRList) – represents bamfiles of the baseline
- sample (NRR) – represents the bamfile of a test sample
- path (str) – output directory path
- size (int) – block size of the sliding window
- step (int) – step size of the sliding window
- metric (str) – metric to use: ‘std’ or ‘IQR’
- interval_range (float) – value to multiply metric by
- minread (int) – minimum number of reads used to filter targets
- below_cutoff (float) – filter out data (ratios) below this cutoff
- above_cutoff (float) – filter out data (ratios) above this cutoff
- maxdist (int) – maximum distance allowed between a cnv-like block and its closest cnv block for the cnv-like block to be called a cnv as well
- cnv_like_range (float) – value to multiply interval_range by in order to detect cnv-like blocks
- bins (int) – number of bins to use when plotting ratio data
- method (str) – method used to group ratios when plotting
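The size, step, metric, interval_range and cnv_like_range parameters can be illustrated with a small sketch. It assumes that blocks are consecutive windows of size targets advanced by step targets, and that the accepted interval is the median plus or minus interval_range times the IQR. Both rules are assumptions for illustration, and sliding_blocks and iqr_bounds are hypothetical helpers, not CNVfinder functions.

```python
from statistics import median, quantiles


def sliding_blocks(values, size, step):
    """Return windows of `size` items, advancing `step` items each time."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, step)]


def iqr_bounds(values, interval_range=1.5):
    """Return (lower, upper) as median -/+ interval_range * IQR (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = interval_range * (q3 - q1)
    center = median(values)
    return center - spread, center + spread


ratios = [1, 2, 3, 4, 5, 6, 7]  # toy values, not real read depth ratios
blocks = sliding_blocks(ratios, size=4, step=2)
cnv_bounds = iqr_bounds(ratios, interval_range=1.5)
# Assumed: the cnv-like interval scales interval_range by cnv_like_range (0.7)
cnv_like_bounds = iqr_bounds(ratios, interval_range=1.5 * 0.7)
print(blocks, cnv_bounds, cnv_like_bounds)
```

Under these assumptions the cnv-like interval is narrower than the cnv interval, so a block can fall outside the former while still inside the latter.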
filter(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Filter dataframe by mean >= minread
Parameters: df (DataFrame) – dataframe to be filtered
Returns: filtered dataframe
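The mean >= minread rule can be sketched without pandas as follows. The sketch assumes the mean is taken over the baseline read counts of each target; the field name baseline_counts and the helper filter_by_minread are hypothetical.

```python
from statistics import mean


def filter_by_minread(targets, minread=25):
    """Keep targets whose mean baseline read count is at least `minread`."""
    return [t for t in targets if mean(t["baseline_counts"]) >= minread]


targets = [
    {"id": "t1", "baseline_counts": [30, 40, 50]},  # mean 40, kept
    {"id": "t2", "baseline_counts": [10, 20, 30]},  # mean 20, dropped
    {"id": "t3", "baseline_counts": [25, 25, 25]},  # mean 25, kept (boundary)
]
kept = filter_by_minread(targets, minread=25)
print([t["id"] for t in kept])
```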
make_ratio()
Compute ratio between the standardized read depth values from a test sample and the mean baseline read depth
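The ratio computation can be sketched as below. The sketch assumes that “standardized” means dividing each target count by the total read count of its sample; that interpretation and the helper names are assumptions, not taken from the CNVfinder source.

```python
def standardize(counts):
    """Scale counts so they sum to 1 (assumed meaning of 'standardized')."""
    total = sum(counts)
    return [c / total for c in counts]


def ratio_to_baseline(sample_counts, baseline_counts):
    """Divide standardized sample depth by the mean standardized baseline depth, per target."""
    sample_std = standardize(sample_counts)
    baseline_std = [standardize(counts) for counts in baseline_counts]
    mean_baseline = [sum(col) / len(col) for col in zip(*baseline_std)]
    return [s / m for s, m in zip(sample_std, mean_baseline)]


sample = [100, 200, 100]                       # toy counts for three targets
baseline = [[100, 100, 200], [200, 200, 400]]  # two baseline samples
ratios = ratio_to_baseline(sample, baseline)
print(ratios)
```

With the default cutoffs documented for NRRTest (below_cutoff=0.7, above_cutoff=1.3), a ratio near 1 would be treated as unremarkable, while the second and third toy targets would stand out.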
vcfhandler
class cnvfinder.vcfhandler.VCF(vcffile: str)
Hold information on variants loaded from a VCF file using pysam.VariantFile
Parameters: vcffile (str) – path to VCF file
class cnvfinder.vcfhandler.VCFList(vcffiles: list)
VCF object list management
Parameters: vcffiles (list) – list of paths to variant (VCF) files
class cnvfinder.vcfhandler.VCFTest(baseline: cnvfinder.vcfhandler.vcfhandler.VCFList, sample: cnvfinder.vcfhandler.vcfhandler.VCF, metric: str = 'IQR', interval_range: float = 1.5, size: int = 400, step: int = 40, cnv_like_range: float = 0.7, maxdist: int = 15000000, path: str = 'results')
Hold information about tests between a VCF baseline (VCFList) and a VCF test sample
Parameters:
- baseline (VCFList) – represents baseline’s VCF files
- sample (VCF) – represents sample’s VCF file
- metric (str) – metric to use. Currently, only ‘IQR’ is available for BAF
- interval_range (float) – value to multiply metric by
- size (int) – block size of the sliding window
- step (int) – step size of the sliding window
- cnv_like_range (float) – value to multiply interval_range by in order to detect cnv-like blocks
- maxdist (int) – maximum distance allowed between a cnv-like block and its closest cnv block for the cnv-like block to be called a cnv as well
- path (str) – output directory path
cluster_baf(region: str = None, mode: str = None, n_clusters: int = 2) → tuple
Cluster BAF values in ‘region’, considering ‘mode’, and applying the KMeans method.
Returns: labels (defaultdict), cluster centers (list), and silhouette score (float)
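To show what clustering BAF values into two groups looks like, here is a minimal one-dimensional k-means (Lloyd's iteration) in pure Python. It is a stand-in for illustration only: it does not use the KMeans implementation that cluster_baf relies on, it initializes centers evenly between the minimum and maximum value, and it omits the silhouette score. The function name kmeans_1d is hypothetical.

```python
def kmeans_1d(values, n_clusters=2, max_iter=100):
    """Cluster scalar values; return (labels, centers)."""
    lo, hi = min(values), max(values)
    if n_clusters == 1:
        centers = [(lo + hi) / 2]
    else:
        # Deterministic start: centers spread evenly over the value range
        centers = [lo + i * (hi - lo) / (n_clusters - 1) for i in range(n_clusters)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center
        labels = [min(range(n_clusters), key=lambda k: abs(v - centers[k])) for v in values]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for k in range(n_clusters):
            members = [v for v, label in zip(values, labels) if label == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


baf = [0.48, 0.50, 0.52, 0.98, 1.00, 0.99]  # toy BAF values
labels, centers = kmeans_1d(baf, n_clusters=2)
print(labels, centers)
```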
compare(nmin: int = None, maxdiff: float = 1e-05)
Apply filters on sample variants considering variants from baseline files. Current filters:
- similar ratios: given a variant, filter it out if BAF values are identical considering all or ‘nmin’ samples
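The “similar ratios” filter can be sketched as follows. The sketch assumes that two BAF values count as identical when they differ by at most maxdiff, and that a variant is filtered out when at least nmin baseline samples (or all of them, when nmin is None) match the test sample. These semantics and the helper name has_similar_ratios are assumptions for illustration.

```python
def has_similar_ratios(sample_baf, baseline_bafs, nmin=None, maxdiff=1e-05):
    """Return True when the variant should be filtered out under the assumed rule."""
    matches = sum(1 for b in baseline_bafs if abs(sample_baf - b) <= maxdiff)
    required = len(baseline_bafs) if nmin is None else nmin
    return matches >= required


# Same BAF (within maxdiff) in every baseline sample: filtered out
print(has_similar_ratios(0.5, [0.5, 0.5, 0.500001]))
# One baseline sample differs: kept unless nmin lowers the requirement
print(has_similar_ratios(0.5, [0.5, 0.6, 0.5]))
print(has_similar_ratios(0.5, [0.5, 0.6, 0.5], nmin=2))
```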
filter(mindp: int = 60)
Apply filters on variants located at self.df
Parameters: mindp (int) – minimum number of reads to pass the filter
getin(region: str = None, column: str = None, mode: str = None) → Union[list, None, pandas.core.frame.DataFrame]
Get data in ‘column’ for variants that are located in ‘region’, considering ‘mode’. Available columns:
- chrom
- start
- stop
- alleles
- alts
- contig
- id
- info
- pos
- qual
- ref
- rid
- rlen
- dp
- baf
Returns: dataframe containing the view, or a list if ‘column’ is passed
getview(mode: str = None) → pandas.core.frame.DataFrame
Create a view of self.df depending on the mode. Available modes:
- homozygous: create a view with homozygous variants
- heterozygous: create a view with heterozygous variants
Parameters: mode (str) – view mode
Returns: dataframe’s view or dataframe, if mode = None
iterchroms(mode: str = None, df: pandas.core.frame.DataFrame = None) → list
Group variants by chromosome
Parameters:
- mode (str) – homozygous or heterozygous
- df (DataFrame) – dataframe to split
Returns: list of dictionary groups {‘id’: ‘chrom:chromStart-chromEnd’, ‘df’: dataframe}
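The shape of the returned groups can be sketched with plain lists standing in for dataframes. The sketch assumes that the group id spans from the smallest start to the largest stop among the variants of a chromosome; that detail and the helper name group_by_chrom are assumptions.

```python
from collections import defaultdict


def group_by_chrom(variants):
    """Group variant rows by chromosome into {'id': ..., 'df': ...} dictionaries."""
    by_chrom = defaultdict(list)
    for variant in variants:
        by_chrom[variant["chrom"]].append(variant)
    groups = []
    for chrom, rows in by_chrom.items():
        start = min(row["start"] for row in rows)
        stop = max(row["stop"] for row in rows)
        # 'df' holds a list of rows here; the real method stores a DataFrame
        groups.append({"id": f"{chrom}:{start}-{stop}", "df": rows})
    return groups


variants = [
    {"chrom": "chr1", "start": 100, "stop": 101},
    {"chrom": "chr1", "start": 500, "stop": 501},
    {"chrom": "chr2", "start": 50, "stop": 51},
]
groups = group_by_chrom(variants)
print([g["id"] for g in groups])
```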
load_filtered_out()
Load filtered out variants from file. Path to file is defined as: self.vcffile + ‘.txt’
save_filtered_out()
Save filtered out variants of the test sample. Output file is defined as: self.vcffile + ‘.txt’
split()
Split self.df into two dataframes:
- self.het_vars for heterozygous variants and
- self.hom_vars for homozygous variants
bedloader
class cnvfinder.bedloader.ROI(bedfile: str, spacing: int = 20, mindata: int = 50, maxpool: int = None, lines: list = None)
This class stores the “regions of interest” listed in the .bed file
- self.amplicons is a list of Amplicon objects
- self.targets is a DataFrame of valid targets for CNV analysis
Parameters:
- bedfile (str) – path to file in which sequencing amplicons are listed (BED)
- spacing (int) – number of nucleotides to ignore at amplicon start and end, to avoid overlapping reads
- mindata (int) – minimum number of nucleotides for a valid target
- maxpool (int) – maximum number of origin pools allowed for a valid target
- lines (list) – optional list of lines containing BED data
define_targets(spacing, min_data) → pandas.core.frame.DataFrame
Return valid targets for further analysis. Each target is in the form: (chromosome, start, end, Amplicon)
Returns: a dataframe of targets
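How spacing and mindata could turn amplicons into targets is sketched below. It assumes a target is the amplicon trimmed by spacing nucleotides on each side, kept only when at least mindata nucleotides remain; the maxpool check is left out. This rule and the helper name trim_amplicons are assumptions, not the CNVfinder implementation.

```python
def trim_amplicons(amplicons, spacing=20, mindata=50):
    """Trim each (chrom, start, end) amplicon and keep those that remain long enough."""
    targets = []
    for chrom, start, end in amplicons:
        t_start, t_end = start + spacing, end - spacing
        if t_end - t_start >= mindata:
            targets.append((chrom, t_start, t_end))
    return targets


amplicons = [
    ("chr1", 100, 300),  # 160 nucleotides left after trimming, kept
    ("chr1", 400, 480),  # 40 nucleotides left after trimming, dropped
]
targets = trim_amplicons(amplicons)
print(targets)
```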
tsvparser
class cnvfinder.tsvparser.CoverageFileParser(filename: str)
Parses an amplicon coverage file loaded by BedFileLoader
Parameters: filename (str) – path to amplicon.cov file
static create_column_map(columns) → collections.defaultdict
Create a dict based on columns
Parameters: columns (list) – list of columns
Returns: a dictionary mapping ‘columns’ values and indexes
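A mapping from column names to their positions can be sketched as below. The default factory (None for unknown columns), the helper name column_map, and the sample column names are assumptions; the documentation only states that a defaultdict is returned.

```python
from collections import defaultdict


def column_map(columns):
    """Map each column name to its index; unknown names yield None (assumed default)."""
    mapping = defaultdict(lambda: None)
    for index, name in enumerate(columns):
        mapping[name] = index
    return mapping


# Hypothetical header of a coverage file
cmap = column_map(["chrom", "chromStart", "chromEnd", "total_reads"])
print(cmap["chromStart"], cmap["missing"])
```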
define_targets(lines, columns) → tuple
Extract columns from lines based on columns of interest and split them in two entities: one DataFrame representing actual targets [chrom, chromStart, chromEnd…] and a list representing the number of reads for each target.
Returns: a DataFrame describing the targets and a list of counters
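The split into targets and counters can be sketched on tab-separated lines as follows. The column names (chrom, chromStart, chromEnd, total_reads), the presence of a header line, and the helper name split_targets_and_counters are hypothetical; the real method returns a DataFrame for the targets.

```python
def split_targets_and_counters(lines, target_columns, count_column):
    """Return (targets, counters) extracted from tab-separated lines with a header."""
    header = lines[0].rstrip("\n").split("\t")
    index = {name: i for i, name in enumerate(header)}
    targets, counters = [], []
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        targets.append([fields[index[name]] for name in target_columns])
        counters.append(int(fields[index[count_column]]))
    return targets, counters


lines = [
    "chrom\tchromStart\tchromEnd\tgc\ttotal_reads\n",
    "chr1\t100\t300\t55\t420\n",
    "chr2\t500\t700\t48\t380\n",
]
targets, counters = split_targets_and_counters(
    lines, ["chrom", "chromStart", "chromEnd"], "total_reads"
)
print(targets, counters)
```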
-
static
ideogram
class cnvfinder.ideogram.Ideogram(file: str = None, chroms: list = None, chrom_height: float = 1, chrom_spacing: float = 1.5, fig_size: tuple = None, colors: dict = None, to_log=False)
Create ideograms
Parameters:
- file (str) – file to load chromosome bands data. Default: ‘cytoBand’ table from https://genome.ucsc.edu/cgi-bin/hgTables.
- chroms (list) – plot only chromosomes that are in this list. Default: [‘chr%s’ % i for i in list(range(1, 23)) + [‘M’, ‘X’, ‘Y’]]
- chrom_height (float) – height of each ideogram
- chrom_spacing (float) – spacing between consecutive ideograms
- fig_size (tuple) – width and height in inches
- colors (dict) – colors for different chromosome stains
- to_log (bool) – whether to print log info
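The default value documented for chroms is a list comprehension; evaluating it on its own shows which chromosome names are plotted when no list is given. Only the expression itself comes from the documentation above.

```python
# Default chromosome list as written in the parameter description
chroms = ['chr%s' % i for i in list(range(1, 23)) + ['M', 'X', 'Y']]

print(len(chroms))
print(chroms[:3], chroms[-3:])
```

Passing a shorter list, for example the chromosomes covered by a panel, restricts the plot to those ideograms.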
add_data(df: pandas.core.frame.DataFrame, height: float = 0.5, padding: float = 0.1, color: str = '#2243a8', alpha: float = 0.5, linewidths: float = 0, **kwargs)
Add (genomic) data to the plot
Parameters:
- df (DataFrame) – data
- height (float) – height of genomic track. Should be smaller than ‘chrom_spacing’
- padding (float) – padding between the top of a genomic track and its corresponding ideogram
- color (str) – track’s color. Used when ‘colors’ is not in df.columns
- alpha (float) – alpha value used for blending
- linewidths (float) – line widths
- kwargs – passed to BrokenBarHCollection
add_data_above(df: pandas.core.frame.DataFrame, color: str = None)
Wrapper for adding data above ideograms
Parameters:
- df (DataFrame) – data
- color (str) – bar color
add_data_below(df: pandas.core.frame.DataFrame, color: str = None)
Wrapper for adding data below ideograms
Parameters:
- df (DataFrame) – data
- color (str) – bar color