from k1lib.imports import *
Here's what it kinda looks like:
cat("covid.gb") | headOut()
LOCUS NC_045512 29903 bp ss-RNA linear VRL 18-JUL-2020 DEFINITION Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome. ACCESSION NC_045512 VERSION NC_045512.2 DBLINK BioProject: PRJNA485481 KEYWORDS RefSeq. SOURCE Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) ORGANISM Severe acute respiratory syndrome coronavirus 2 Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;
And the end:
cat("covid.gb") | rows()[-10:] | headOut()
29341 tattgacgca tacaaaacat tcccaccaac agagcctaaa aaggacaaaa agaagaaggc 29401 tgatgaaact caagccttac cgcagagaca gaagaaacag caaactgtga ctcttcttcc 29461 tgctgcagat ttggatgatt tctccaaaca attgcaacaa tccatgagca gtgctgactc 29521 aactcaggcc taaactcatg cagaccacac aaggcagatg ggctatataa acgttttcgc 29581 ttttccgttt acgatatata gtctactctt gtgcagaatg aattctcgta actacatagc 29641 acaagtagat gtagttaact ttaatctcac atagcaatct ttaatcagtg tgtaacatta 29701 gggaggactt gaaagagcca ccacattttc accgaggcca cgcggagtac gatcgagtgt 29761 acagtgaaca atgctaggga gagctgccta tatggaagag ccctaatgtg taaaattaat 29821 tttagtagtg ctatccccat gtgattttaa tagcttctta ggagaatgac aaaaaaaaaa 29881 aaaaaaaaaa aaaaaaaaaa aaa
So, 29903 nucleotides in total, just as advertised. The last nucleotide section always starts with "ORIGIN", so let's look for that:
cat("covid.gb") | grep("ORIGIN", after=1e9) | headOut(3)
ORIGIN 1 attaaaggtt tataccttcc caggtaacaa accaaccaac tttcgatctc ttgtagatct 61 gttctctaaa cgaactttaa aatctgtgtg gctgtcactc ggctgcatgc ttagtgcact
Nice. Let's extract everything out:
cat("covid.gb") | grep("ORIGIN", after=1e9) | ~head(1) | op().strip().all() | op().split(" ").all() | cut()[1:] | join("").all() | join("") | op()[:100]
'attaaaggtttataccttcccaggtaacaaaccaaccaactttcgatctcttgtagatctgttctctaaacgaactttaaaatctgtgtggctgtcactc'
This is rather long, so there's a built in operation for that
# hide behind a wrapper cause I get annoyed at Jupyter Lab's contextual help displaying the huge text
nt = k1lib.Wrapper(cat("covid.gb") | gb.origin())
nt()[:100]
'attaaaggtttataccttcccaggtaacaaaccaaccaactttcgatctcttgtagatctgttctctaaacgaactttaaaatctgtgtggctgtcactc'
Before ORIGIN "section", there's the FEATURES section that looks like this:
cat("covid.gb") | grep("FEATURES", after=1e9) | headOut(20)
FEATURES Location/Qualifiers source 1..29903 /organism="Severe acute respiratory syndrome coronavirus 2" /mol_type="genomic RNA" /isolate="Wuhan-Hu-1" /host="Homo sapiens" /db_xref="taxon:2697049" /country="China" /collection_date="Dec-2019" 5'UTR 1..265 gene 266..21555 /gene="ORF1ab" /locus_tag="GU280_gp01" /db_xref="GeneID:43740578" CDS join(266..13468,13468..21555) /gene="ORF1ab" /locus_tag="GU280_gp01" /ribosomal_slippage /note="pp1ab; translated by -1 ribosomal frameshift"
As you can see, there are multiple features, like source
, 5'UTR
, gene
, CDS
, and whatnot. Of course, you can extract these on your own, but builtin functions already have something like that:
feats = cat("covid.gb") | gb.feats() | deref()
feats | rows()[:3] | deref()
[[' source 1..29903', ' /organism="Severe acute respiratory syndrome coronavirus', ' 2"', ' /mol_type="genomic RNA"', ' /isolate="Wuhan-Hu-1"', ' /host="Homo sapiens"', ' /db_xref="taxon:2697049"', ' /country="China"', ' /collection_date="Dec-2019"'], [" 5'UTR 1..265"], [' gene 266..21555', ' /gene="ORF1ab"', ' /locus_tag="GU280_gp01"', ' /db_xref="GeneID:43740578"']]
Say you want to search the features for a frameshift event, you can do something like this:
feats | gb.feats.filt("frameshift", "CDS") | item() | headOut()
CDS join(266..13468,13468..21555) /gene="ORF1ab" /locus_tag="GU280_gp01" /ribosomal_slippage /note="pp1ab; translated by -1 ribosomal frameshift" /codon_start=1 /product="ORF1ab polyprotein" /protein_id="YP_009724389.1" /db_xref="GeneID:43740578" /translation="MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQ
So apparently, there's a frameshift at nucleotide 13468, where it gets repeated twice. Let's check if that's correct. First, let's grab the protein:
orf1ab = feats | gb.feats.filt("frameshift", "CDS") | item() | gb.feats.tags("translation") | op()[0][1].replace(" ", "")
len(orf1ab), len(orf1ab)*3 / len(nt()) * 100
(7096, 71.19018158713173)
ORF1ab is quite a chunky boi. Over 7k length, or 71% of the genome. The nucleotides of interest are:
nt()[13465:][:20]
'aacgggtttgcggtgtaagt'
So, the shifted nt sequence must be "AACCGG", or:
"AACCGG" | translate() | item()
'NR'
orf1ab[(13468-266+1)//3-1:][:20]
'NRVCGVSAARLTPCGTGTST'
Yep, bingo! Peptide sequence starts with NR
s = feats | gb.feats.filt("spike", "CDS") | item() | gb.feats.tags("translation") | op()[0][1].replace(" ", "")
Also in the news before delta variant times, I've heard they talk a lot about "D614G" variant, I wonder what's that all about, then discovered this:
s[613:][:10], "DG" | longAa() | item()
('DVNCTEVPVA', 'AsparticAcid Glycine')
Yeah this checks out. So "D614G" mutation just means at position 614 on the spike protein, a D (aspartic acid) has become G (glycine).
orf3a = feats | gb.feats.filt("ORF3a", "CDS") | item() | gb.feats.tags("translation") | op()[0][1].replace(" ", "")
Let's try again at a different spot. I grabbed a random mutation with this code name: "hCoV-19/Japan/PG-69007/2021: ORF3a L275F"
orf3a[274:], "LF" | longAa() | item()
('L', 'Leucine Phenylalanine')
Lmao, the change is right at the last amino acid
genes = ["ORF1ab", "S", "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF10"]
proteinLengths = feats | oneToMany(*(gb.feats.filt(f"/gene=\"{g}\"") for g in genes))\
| (gb.feats.filt(" CDS ") | item() | gb.feats.tags("translation") | op()[0][1].replace(" ", "")).all()\
| shape(0).all() | deref()
Let's see the distribution of all genes:
proteinLengths | wrapList() | transpose() | insertColumn(genes) | ~sort(1) | display(None)
ORF1ab 7096 S 1273 N 419 ORF3a 275 M 222 ORF7a 121 ORF8 121 E 75 ORF6 61 ORF7b 43 ORF10 38
And how much of the genome are the proteins themselves?
sum(proteinLengths) * 3 / len(nt()) * 100
97.75607798548641
All proteins combined take up like 97.7% of the genome. Quite densely packed, unlike eukaryote genomes.
How about utr regions? Do they take up much? Let's quickly search for them:
feats | gb.feats.filt("UTR") | item().all() | deref()
[" 5'UTR 1..265", ' stem_loop 29609..29644', ' stem_loop 29629..29657', " 3'UTR 29675..29903"]
utr = feats | gb.feats.filt("UTR") | item().all() | rows(0, 3) | op().split("R")[1].all()\
| (op().split("..") | toInt() | toList() | ~aS(lambda x, y: y - x)).all() | toSum()
(sum(proteinLengths) * 3 + utr) / len(nt()) * 100
99.40139785305823
Really close to 100% now