CoV2D BrowserTM

Parikh vectors
1EEI_1 3RJY_1 8CUB_1 Letter Amino acid
5 17 56 S Serine
10 11 38 T Threonine
3 17 47 G Glycine
7 17 24 N Asparagine
7 28 27 E Glutamic acid
4 8 25 H Histidine
3 5 19 M Methionine
3 12 38 R Arginine
6 18 84 L Leucine
3 20 37 F Phenylalanine
5 7 29 Q Glutamine
4 17 21 D Aspartic acid
2 2 13 C Cysteine
10 24 40 I Isoleucine
9 27 24 K Lysine
3 16 27 P Proline
1 13 5 W Tryptophan
3 17 19 Y Tyrosine
11 17 37 A Alanine
4 27 54 V Valine
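A Parikh vector records how often each letter occurs in a word, ignoring order. The following is a minimal Python sketch, applied to the 1EEI_1 sequence listed further down this page. The function name `parikh_vector` and the constant `LETTERS` are illustrative choices, not names used by the site.

```python
from collections import Counter

# One-letter amino-acid codes, in the row order of the table above.
LETTERS = "STGNEHMRLFQDCIKPWYAV"

def parikh_vector(seq: str) -> dict:
    """Return the number of occurrences of each amino-acid letter in seq."""
    counts = Counter(seq)
    return {letter: counts.get(letter, 0) for letter in LETTERS}

# Sequence of 1EEI_1 as displayed on this page (split only for readability).
seq_1eei = (
    "TPQNITDLCAEYHNTQIHTLNDKIFSYTESLAGKREMAIITFKNGATFQVEVPGSQHIDSQKKAIERMKD"
    "TLRIAYLTEAKVEKLCVWNNKTPHAIAAISMAN"
)

vec = parikh_vector(seq_1eei)
print(vec["W"], vec["C"], sum(vec.values()))  # prints: 1 2 103
```

The entries of the vector sum to the sequence length, which gives a quick consistency check on any column of the table.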

>1EEI_1|Chains A[auth D], B[auth E], C[auth F], D[auth G], E[auth H]|PROTEIN (CHOLERA TOXIN B)|Vibrio cholerae (666)
>3RJY_1|Chain A|Endoglucanase FnCel5A|Fervidobacterium nodosum (381764)
>8CUB_1|Chains A, C|ATP-binding cassette sub-family G member 5|Homo sapiens (9606)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
1EEI , Knot 55 103 0.82 40 89 101
TPQNITDLCAEYHNTQIHTLNDKIFSYTESLAGKREMAIITFKNGATFQVEVPGSQHIDSQKKAIERMKDTLRIAYLTEAKVEKLCVWNNKTPHAIAAISMAN
3RJY , Knot 144 320 0.86 40 202 309
MDQSVSNVDKMSAFEYNKMIGHGINMGNALEAPVEGSWGVYIEDEYFKIIKERGFDSVRIPIRWSAHISEKYPYEIDKFFLDRVKHVVDVALKNDLVVIINCHHFEELYQAPDKYGPVLVEIWKQVAQAFKDYPDKLFFEIFNEPAQNLTPTKWNELYPKVLGEIRKTNPSRIVIIDVPNWSNYSYVRELKLVDDKNIIVSFHYYEPFNFTHQGAEWVSPTLPIGVKWEGKDWEVEQIRNHFKYVSEWAKKNNVPIFLGEFGAYSKADMESRVKWTKTVRRIAEEFGFSLAYWEFCAGFGLYDRWTKTWIEPLTTSALGK
8CUB , Knot 257 664 0.83 40 279 612
DLSSLTPGGSMGLQVNRGSQSSLEGAPATAPEPHSLGILHASYSVSHRVRPWWDITSCRQQWTRQILKDVSLYVESGQIMCILGSSGSGKTTLLDAMSGRLGRAGTFLGEVYVNGRALRREQFQDCFSYVLQSDTLLSSLTVRETLHYTALLAIRRGNPGSFQKKVEAVMAELSLSHVADRLIGNYSLGGISTGERRRVSIAAQLLQDPKVMLFDEPTTGLDCMTANQIVVLLVELARRNRIVVLTIHQPRSELFQLFDKIAILSFGELIFCGTPAEMLDFFNDCGYPCPEHSNPFDFYMDLTSVDTQSKEREIETSKRVQMIESAYKKSAICHKTLKNIERMKHLKTLPMVPFKTKDSPGVFSKLGVLLRRVTRNLVRNKLAVITRLLQNLIMGLFLLFFVLRVRSNVLKGAIQDRVGLLYQFVGATPYTGMLNAVNLFPVLRAVSDQESQDGLYQKWQMMLAYALHVLPFSVVATMIFSSVCYWTLGLHPEVARFGYFSAALLAPHLIGEFLTLVLLGIVQNPNIVNSVVALLSIAGVLVGSGFLRNIQEMPIPFKIISYFTFQKYCSEILVVNEFYGLNFTCGSSNVSVTTNPMCAFTQGIQFIEKTCPGATSRFTMNFLILYSFIPALVILGIVVFKIRDHLISRGSHHHHHHGHHHHHH
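The normalized column \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be recomputed from the LZ-complexity and length columns alone. The sketch below does that arithmetic; the helper name `normalized_lz` is hypothetical, and the base 20 is taken from the column header (the size of the amino-acid alphabet). The results appear to agree with the table once truncated to two decimals.

```python
import math

def normalized_lz(lz: int, n: int, alphabet_size: int = 20) -> float:
    """LZ(w) divided by n / log_b(n), where b is the alphabet size."""
    return lz / (n / math.log(n, alphabet_size))

# (code, LZ(w), n) as listed in the table above.
rows = [("1EEI", 55, 103), ("3RJY", 144, 320), ("8CUB", 257, 664)]
for code, lz, n in rows:
    ratio = normalized_lz(lz, n)
    # Truncate (not round) to two decimals for comparison with the table.
    print(code, int(ratio * 100) / 100)
```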

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)=|P_w(n)|\) be its cardinality. Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
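As an illustration of these definitions, here is a small Python sketch that collects the length-\(k\) subwords of a word and counts them. The function names are hypothetical, and the site's own counting conventions are not stated, so this is only one plausible reading.

```python
def subwords(w: str, k: int) -> set:
    """P_w(k): the set of distinct contiguous subwords of w of length k."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def subword_complexity(w: str, k: int) -> int:
    """p_w(k): the number of distinct length-k subwords of w."""
    return len(subwords(w, k))

# Toy example: ABAB contains the 2-subwords AB and BA only.
print(sorted(subwords("ABAB", 2)))     # ['AB', 'BA']
print(subword_complexity("ABAB", 3))   # ABA, BAB -> 2
```

A word of length \(n\) has at most \(n-k+1\) distinct subwords of length \(k\), so `subword_complexity` can never exceed that bound.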

\(|P_{f(1EEI_1)}(2) \setminus P_{f(3RJY_1)}(2)|=37\), \(|P_{f(3RJY_1)}(2) \setminus P_{f(1EEI_1)}(2)|=150\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\); for this pair, \(Z_2 = 37 + 150 = 187\).

Hydrophobic-polar version of Sequence 1 (1EEI_1): 0100100101000000100100011000001110001111010011010101110001000001100100010110100101001011000010111110110
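The page does not say which residues count as hydrophobic. The sketch below assumes the set {A, F, G, I, L, M, P, V, W}, which was inferred by comparing the displayed binary string with the 1EEI_1 sequence letter by letter; treat both the set and the name `hp_encode` as assumptions rather than the site's definition.

```python
# Assumed hydrophobic residues (inferred from the displayed string, not stated by the page).
HYDROPHOBIC = set("AFGILMPVW")

def hp_encode(seq: str) -> str:
    """Map each residue to '1' if assumed hydrophobic, otherwise '0' (polar)."""
    return "".join("1" if aa in HYDROPHOBIC else "0" for aa in seq)

seq_1eei = (
    "TPQNITDLCAEYHNTQIHTLNDKIFSYTESLAGKREMAIITFKNGATFQVEVPGSQHIDSQKKAIERMKD"
    "TLRIAYLTEAKVEKLCVWNNKTPHAIAAISMAN"
)
print(hp_encode(seq_1eei))
```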
Pair \(Z_2\) Length of longest common subword
1EEI_1,3RJY_1 187 3
1EEI_1,8CUB_1 228 5
3RJY_1,8CUB_1 181 4
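Both columns of this table can be sketched in a few lines of Python. \(Z_k\) is the size of the symmetric difference of the two subword sets. The third column is read here as the length of the longest common contiguous subword; that reading is an assumption, made because values as small as 3 to 5 would be implausibly short for a non-contiguous subsequence of sequences this long. All function names are illustrative.

```python
def subwords(w: str, k: int) -> set:
    """P_w(k): distinct contiguous subwords of w of length k."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_k(x: str, y: str, k: int) -> int:
    """Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

def longest_common_subword(x: str, y: str) -> int:
    """Length of the longest contiguous block occurring in both x and y."""
    best = 0
    prev = [0] * (len(y) + 1)  # prev[j]: common suffix length ending at y[j-1]
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

print(z_k("ABAB", "ABBA", 2))                    # only BB differs -> 1
print(longest_common_subword("ABCDE", "XBCDY"))  # BCD -> 3
```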

Newick tree

 
[
	1EEI_1:10.45,
	[
		3RJY_1:90.5,8CUB_1:90.5
	]:17.95
]

Let \(d\) denote the Otu–Sayood distance \(d\).
Let \(d_1\) denote the Otu–Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)(\frac{423}{\log_{20} 423}-\frac{103}{\log_{20}103})=97.2\), where \(423 = 103 + 320\).
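The arithmetic behind this estimate can be checked directly. In the sketch below the constants 0.85 and 0.8 are copied from the formula as given (the page does not explain them), and the helper name `n_over_log` is hypothetical.

```python
import math

def n_over_log(n: int, base: int = 20) -> float:
    """n / log_base(n), the normalizing quantity used for LZ-complexity on this page."""
    return n / math.log(n, base)

# 423 = 103 + 320, the lengths of 1EEI_1 and 3RJY_1 added together.
expected = 0.85 * 0.8 * (n_over_log(423) - n_over_log(103))
print(round(expected, 1))  # 97.2
```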
Status Protein1 Protein2 d d1/2
Query variables 1EEI_1 3RJY_1 124 81
Was not able to put for d
Was not able to put for d1

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \min + (1-\alpha) \max= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2. \end{cases} \]
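To make the interpolation concrete, here is a hedged Python sketch. It assumes the usual Otu–Sayood definitions, \(d(x,y)=\max\{\mathrm{LZ}(xy)-\mathrm{LZ}(x),\ \mathrm{LZ}(yx)-\mathrm{LZ}(y)\}\) and \(d_1\) equal to the sum of the same two differences, and it assumes that min and max in the display refer to those two differences. The `lz76` function is a simple exhaustive-history parser written for illustration; the LZ variant actually used by the site is not specified, so its numbers need not match the tables.

```python
def lz76(s: str) -> int:
    """Number of components in an LZ76-style exhaustive parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the component while it can be copied from a start position before i.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def lz_differences(x: str, y: str) -> tuple:
    """The two one-sided increments LZ(xy) - LZ(x) and LZ(yx) - LZ(y)."""
    return lz76(x + y) - lz76(x), lz76(y + x) - lz76(y)

def otu_sayood(x: str, y: str) -> tuple:
    """Return (d, d1) under the assumed definitions: max and sum of the increments."""
    a, b = lz_differences(x, y)
    return max(a, b), a + b

def delta(x: str, y: str, alpha: float) -> float:
    """alpha * min + (1 - alpha) * max of the two increments."""
    a, b = lz_differences(x, y)
    return alpha * min(a, b) + (1 - alpha) * max(a, b)

# Classic example: 0|001|10|100|1000|101 has 6 components.
print(lz76("0001101001000101"))  # 6
```

With these definitions, `delta(x, y, 0)` equals \(d\) and `delta(x, y, 0.5)` equals \(d_1/2\), matching the two cases in the display.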