CoV2D Browser™

Parikh vectors
3KAB_1 1CET_1 9GSX_1 Letter Amino acid
7 12 24 P Proline
24 16 69 S Serine
13 6 12 R Arginine
5 21 76 N Asparagine
6 16 60 D Aspartic acid
15 15 30 E Glutamic acid
4 32 47 V Valine
8 8 47 Q Glutamine
10 26 48 K Lysine
4 10 23 M Methionine
3 8 27 Y Tyrosine
9 32 49 L Leucine
6 14 67 T Threonine
3 1 3 W Tryptophan
11 26 48 A Alanine
2 4 0 C Cysteine
5 9 5 H Histidine
8 27 45 I Isoleucine
17 26 48 G Glycine
7 7 22 F Phenylalanine
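A Parikh vector simply counts how often each letter occurs in a word. The following is a minimal sketch, assuming the column order of the table above; the names `AMINO_ACIDS` and `parikh_vector` are illustrative and not taken from the CoV2D code.

```python
from collections import Counter

# Letter order copied from the table above (an assumption about the site's convention).
AMINO_ACIDS = "PSRNDEVQKMYLTWACHIGF"

def parikh_vector(sequence, alphabet=AMINO_ACIDS):
    """Return the letter counts of `sequence`, one entry per letter of `alphabet`."""
    counts = Counter(sequence)
    return [counts[letter] for letter in alphabet]

if __name__ == "__main__":
    # First ten residues of 3KAB_1.
    print(parikh_vector("GSHGMADEEK"))
```

The entries of a Parikh vector always sum to the sequence length, which gives a quick sanity check: the 3KAB_1 column above sums to 167.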

>3KAB_1|Chain A|Peptidyl-prolyl cis-trans isomerase NIMA-interacting 1|Homo sapiens (9606)
>1CET_1|Chain A|PROTEIN (L-LACTATE DEHYDROGENASE)|Plasmodium falciparum (5833)
>9GSX_1|Chains A, B, C, D, E, F, G, H, I, J, K|Flagellin|Campylobacter jejuni (197)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
3KAB , Knot 78 167 0.79 40 122 161
GSHGMADEEKLPPGWEKAMSRSSGRVYYFNHITNASQWERPSGNSSSGGKNGQGEPARVRCSHLLVKHSQSRRPSSWRQEKITRTKEEALELINGYIQKIKSGEEDFESLASQFSDCSSAKARGDLGAFSRGQMQKPFEDASFALRTGEMSGPVFTDSGIHIILRTE
1CET , Knot 137 316 0.83 40 178 302
MAPKAKIVLVGSGMIGGVMATLIVQKNLGDVVLFDIVKNMPHGKALDTSHTNVMAYSNCKVSGSNTYDDLAGSDVVIVTAGFTKAPGKSDKEWNRLDLLPLNNKIMIEIGGHIKKNCPNAFIIVVTNPVDVMVQLLHQHSGVPKNKIIGLGGVLDTSRLKYYISQKLNVCPRDVNAHIVGAHGNKMVLLKRYITVGGIPLQEFINNKLISDAELEAIFDRTVNTALEIVNLHASPYVAPAAAIIEMAESYLKDLKKVLICSTLLEGQYGHSDIFGGTPVVLGANGVEQVIELQLNSEEKAKFDEAIAETKRMKALA
9GSX , Knot 277 750 0.81 38 250 663
MRITNKLNFTNSVNNSMGGQSALYQISQQLASGLKIQNSYEDASTYIDNTRLEYEIKTLEQVKESTSRAQEMTQNSMKALQDMVKLLEDFKVKVTQAASDSNSQTSREAIAKELERIKESIVQLANTSVNGQYLFAGSQVANKPFDSNGNYYGDKNNINVVTGAGTESPYNIPGWDLFFKADGDYKKQISTNVSFTDNRWDLNKDPDKTKYLTGDSKWQQLIGQGYVKDNSLDADKDFEYDDSKLDFPPTTLYVQGTKPDGTSFKSAVLVKPEDTLEDVMENIGALYGNTPNNKVVEVSMNDSGQIQITDLKQGNNKLDFHAVAFTPQADTKDELKNIIEAANQEGISMNEVTNRVMQASTAAPSNGDITKLNNPVTVTINNQQFTIDLKQTDFIKSKMTDTDGNAANGADYDNVYFEKNGNTVYGNVSQVIKGSNAYATDSTKLSEVMAGDSLNGTTLNLKVNSKGGNSYDVTINLQTSTVSYPDPNNPGQTISFPIMHTNPATGNSGVVTGSNDITYGQINDIIGMFAADKIPTTTIQANNGQINNADYTQIQQLMKDSQATVDVSMDYKGRISVTDKLSSGTNIEISLSDSQSGQFPAPPFTTTSTVQNGPNFSFSANNSLTIDEPNVDIIKDLDSMIDAVLKGNMRADSESENPRNTGMQGALERLDHLADHVSKLNTTMGAYHNTIEGVNTRTSFLSVNVQSIKSNVIDVDYGEAMMNLMQVQLAYQASLKASTTISQLSLLNYM
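The normalised column \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be recomputed from the LZ-complexity and length columns. The sketch below also includes one common formulation of the Lempel-Ziv (1976) phrase count; whether the site uses exactly this variant is an assumption, and the function names are illustrative.

```python
import math

def lz76(s):
    """Count phrases in an LZ76-style exhaustive parsing of s.

    Each phrase is extended while it still occurs starting at an earlier
    position (overlap with the phrase itself is allowed).
    """
    i, phrases, n = 0, 0, len(s)
    while i < n:
        length = 1
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases

def normalised_lz(lz_value, n, alphabet_size=20):
    """Return LZ(w) / (n / log_q n) for alphabet size q."""
    return lz_value / (n / math.log(n, alphabet_size))

if __name__ == "__main__":
    for code, lz_value, n in [("3KAB", 78, 167), ("1CET", 137, 316), ("9GSX", 277, 750)]:
        print(code, normalised_lz(lz_value, n))
```

The computed ratios agree with the table to within 0.01; the table entries look truncated rather than rounded to two decimals.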

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)=|P_w(n)|\) be its cardinality. Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
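As an illustration of these definitions, here is a minimal sketch of \(P_w(n)\) and \(p_w(n)\), reading "subword" as a contiguous block of letters. The function names are illustrative. The table's values may follow a different convention (for instance, its \(p_w(1)\) entries exceed the 20-letter alphabet size), so this is not claimed to reproduce them.

```python
def subwords(w, n):
    """Return the set of distinct contiguous subwords of length n in w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def subword_complexity(w, n):
    """Return the number of distinct contiguous subwords of length n in w."""
    return len(subwords(w, n))

if __name__ == "__main__":
    print(sorted(subwords("abab", 2)))
    print(subword_complexity("abab", 3))
```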

\(|P_{f(3KAB_1)}(2) \setminus P_{f(1CET_1)}(2)|=58\), \(|P_{f(1CET_1)}(2) \setminus P_{f(3KAB_1)}(2)|=114\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\). For example, \(Z_2(f(3KAB_1), f(1CET_1)) = 58 + 114 = 172\).

Hydrophobic-polar version of Sequence 1:
10011100001111100110000101001001001001001010000110010101101000011100000001001000010000001101101010010010001001100100000101010111100101001100101110010101111000110111000
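The hydrophobic-polar string can be reproduced with a simple per-residue mapping. The sketch below assumes the hydrophobic set is A, F, G, I, L, M, P, V, W. This set is inferred by comparing the 3KAB_1 sequence with the binary string on this page and is not stated by the site; other hydrophobic-polar conventions classify some residues differently.

```python
# Residues encoded as 1; inferred from this page's output, not documented by the site.
HYDROPHOBIC = set("AFGILMPVW")

def hp_encode(sequence):
    """Map each residue to '1' (hydrophobic) or '0' (polar)."""
    return "".join("1" if residue in HYDROPHOBIC else "0" for residue in sequence)

if __name__ == "__main__":
    # First twenty residues of 3KAB_1.
    print(hp_encode("GSHGMADEEKLPPGWEKAMS"))
```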
Pair \(Z_2\) Length of longest common subword
3KAB_1,1CET_1 172 3
3KAB_1,9GSX_1 192 4
1CET_1,9GSX_1 154 4
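The \(Z_k\) column is the size of the symmetric difference of the two sets of length-\(k\) subwords. A minimal sketch, with illustrative function names and contiguous subwords assumed:

```python
def subwords(w, k):
    """Return the set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x, y, k):
    """Return |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

if __name__ == "__main__":
    # P_x(2) = {ab, ba}, P_y(2) = {ab, bb, ba}: only "bb" is unshared.
    print(z_distance("abab", "abba", 2))
```

Dividing by the size of the union \(|P_x(k)\cup P_y(k)|\) would turn this numerator into a Jaccard distance.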

Newick tree

 
(3KAB_1:95.38,(1CET_1:77,9GSX_1:77):18.38);

Let \(d\) and \(d_1\) be the Otu–Sayood distances of the same names. (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)(\frac{483 }{\log_{20} 483}-\frac{167}{\log_{20}167})=92.7\)
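The arithmetic of this estimate can be checked directly. In the sketch below the factors 0.85 and 0.8 are taken from the formula as given; the page does not explain them. The value 483 equals 167 + 316, the combined length of the two query sequences. The helper name `lz_scale` is illustrative.

```python
import math

def lz_scale(n, alphabet_size=20):
    """Return n / log_q(n), the normalising scale used for LZ-complexity."""
    return n / math.log(n, alphabet_size)

# Constants copied from the formula above; their origin is not stated on the page.
expected = 0.85 * 0.8 * (lz_scale(483) - lz_scale(167))

if __name__ == "__main__":
    print(round(expected, 1))
```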
Status Protein1 Protein2 d d1/2
Query variables 3KAB_1 1CET_1 116 88.5

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \mathrm{min} + (1-\alpha) \mathrm{max}= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
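The interpolation can be sketched as follows, assuming the usual Otu–Sayood definitions in which \(d\) is the maximum and \(d_1\) the sum of the two complexity increments \(\mathrm{LZ}(xy)-\mathrm{LZ}(x)\) and \(\mathrm{LZ}(yx)-\mathrm{LZ}(y)\). The function names are illustrative, and the complexity function is passed in as a parameter. The increments 116 and 61 in the usage example are back-computed from the table values \(d=116\) and \(d_1/2=88.5\); they are not reported by the page.

```python
def delta(alpha, a, b):
    """Return alpha * min(a, b) + (1 - alpha) * max(a, b)."""
    return alpha * min(a, b) + (1 - alpha) * max(a, b)

def otu_sayood(lz, x, y):
    """Return (d, d1) for words x, y under a complexity function lz (assumed definitions)."""
    a = lz(x + y) - lz(x)  # extra complexity from appending y to x
    b = lz(y + x) - lz(y)  # extra complexity from appending x to y
    return max(a, b), a + b

if __name__ == "__main__":
    print(delta(0, 116, 61))    # alpha = 0 gives the maximum, i.e. d
    print(delta(0.5, 116, 61))  # alpha = 1/2 gives the mean, i.e. d1 / 2
```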