CoV2D Browser™

Parikh vectors
4EXQ_1 1MXS_1 3ZVL_1 Letter Amino acid
30 14 25 R Arginine
8 6 12 N Asparagine
14 7 24 F Phenylalanine
21 12 18 T Threonine
11 3 7 Y Tyrosine
1 5 7 C Cysteine
37 21 36 L Leucine
7 7 6 M Methionine
6 7 24 K Lysine
21 14 36 P Proline
5 3 6 W Tryptophan
51 31 35 A Alanine
25 11 20 D Aspartic acid
10 1 8 H Histidine
17 19 14 I Isoleucine
13 10 29 S Serine
25 14 31 V Valine
11 5 18 Q Glutamine
18 15 28 E Glutamic acid
37 20 32 G Glycine
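The Parikh vector of a sequence simply counts how often each letter occurs, so each column of the table above sums to the length of the corresponding sequence (368 for 4EXQ_1). A minimal sketch follows; the names `AMINO_ACIDS` and `parikh_vector` are illustrative and not taken from the CoV2D code.

```python
from collections import Counter

# The 20 standard amino-acid one-letter codes (ordering is arbitrary here).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def parikh_vector(sequence, alphabet=AMINO_ACIDS):
    """Return a dict mapping each letter of `alphabet` to its count in `sequence`."""
    counts = Counter(sequence)
    return {letter: counts.get(letter, 0) for letter in alphabet}


# First ten residues of the 4EXQ_1 sequence, used as a small example.
example = parikh_vector("GPGSMAQTLI")
print(example["G"], sum(example.values()))
```

Applying `parikh_vector` to a full FASTA sequence should reproduce one column of the table, assuming the sequence contains only the 20 standard letters.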

>4EXQ_1|Chain A|Uroporphyrinogen decarboxylase|Burkholderia thailandensis (271848)
>1MXS_1|Chain A|KDPG Aldolase|Pseudomonas putida (303)
>3ZVL_1|Chain A|BIFUNCTIONAL POLYNUCLEOTIDE PHOSPHATASE/KINASE|MUS MUSCULUS (10090)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
4EXQ , Knot 154 368 0.82 40 188 345
GPGSMAQTLINDTFLRALLREPTDYTPIWLMRQAGRYLPEYNATRARAGSFLGLAKHPDYATEVTLQPLERFPLDAAILFSDILTIPDAMGLGLDFAAGEGPKFAHPVRTEADVAKLAVPDIGATLGYVTDAVREIRRALTDGEGRQRVPLIGFSGSPWTLACYMVEGGGSDDFRTVKSMAYARPDLMHRILDVNAQAVAAYLNAQIEAGAQAVMIFDTWGGALADGAYQRFSLDYIRRVVAQLKREHDGARVPAIAFTKGGGLWLEDLAATGVDAVGLDWTVNLGRARERVAGRVALQGNLDPTILFAPPEAIRAEARAVLDSYGNHPGHVFNLGHGISQFTPPEHVAELVDEVHRHSRAIRSGTGS
1MXS , Knot 96 225 0.77 40 140 212
TTLERPQPKLSMADKAARIDAICEKARILPVITIAREEDILPLADALAAGGIRTLEVTLRSQHGLKAIQVLREQRPELCVGAGTVLDRSMFAAVEAAGAQFVVTPGITEDILEAGVDSEIPLLPGISTPSEIMMGYALGYRRFKLFPAEISGGVAAIKAFGGPFGDIRFCPTGGVNPANVRNYMALPNVMCVGTTWMLDSSWIKNGDWARIEACSAEAIALLDAN
3ZVL , Knot 174 416 0.84 40 223 394
GPHMTSGSQPDAPPDTPGDPEEGEDTEPQKKRVRKSSLGWESLKKLLVFTASGVKPQGKVAAFDLDGTLITTRSGKVFPTSPSDWRILYPEIPKKLQELAAEGYKLVIFTNQMGIGRGKLPAEVFKGKVEAVLEKLGVPFQVLVATHAGLNRKPVSGMWDHLQEQANEGIPISVEDSVFVGDAAGRLANWAPGRKKKDFSCADRLFALNVGLPFATPEEFFLKWPAARFELPAFDPRTISSAGPLYLPESSSLLSPNPEVVVAVGFPGAGKSTFIQEHLVSAGYVHVNRDTLGSWQRCVSSCQAALRQGKRVVIDNTNPDVPSRARYIQCAKDAGVPCRCFNFCATIEQARHNNRFREMTDPSHAPVSDMVMFSYRKQFEPPTLAEGFLEILEIPFRLQEHLDPALQRLYRQFSEG
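The normalized column \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be recomputed from the LZ-complexity and length columns. The sketch below does that, and also includes one common formulation of Lempel-Ziv (1976) phrase counting. Whether the page uses exactly this LZ variant is an assumption, so `lz76_complexity` may not reproduce the tabulated values 154, 96 and 174. Both function names are illustrative.

```python
import math


def lz76_complexity(w):
    """Count phrases in an LZ76-style parsing of w (assumed variant).

    Each phrase is extended while it still occurs as a subword of the text
    that precedes its last symbol (overlap with the phrase itself allowed).
    """
    i, phrases, n = 0, 0, len(w)
    while i < n:
        length = 1
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def normalized_lz(lz, n, alphabet_size=20):
    """Return LZ(w) / (n / log_b n), with b the alphabet size."""
    return lz / (n / math.log(n, alphabet_size))


# (LZ-complexity, length) pairs taken from the table rows above.
for code, lz, n in [("4EXQ", 154, 368), ("1MXS", 96, 225), ("3ZVL", 174, 416)]:
    print(code, normalized_lz(lz, n))
```

The table's two-decimal values (0.82, 0.77, 0.84) appear to be truncated rather than rounded, so compare against them with a small tolerance.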

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)=|P_w(n)|\) be its cardinality. Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
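As a small illustration of these definitions, the sketch below builds \(P_w(n)\) as a Python set of length-\(n\) slices and takes its size to get \(p_w(n)\). The function names are illustrative only.

```python
def subwords(w, n):
    """Return P_w(n): the set of distinct length-n subwords (intervals) of w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def subword_complexity(w, n):
    """Return p_w(n), the number of distinct length-n subwords of w."""
    return len(subwords(w, n))


print(sorted(subwords("ABAB", 2)), subword_complexity("ABAB", 2))
```

For a word of length \(|w|\) there are at most \(|w|-n+1\) subwords of length \(n\), which bounds \(p_w(n)\) from above.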

\(|P_{f(4EXQ_1)}(2) \setminus P_{f(1MXS_1)}(2)|=99\), \(|P_{f(1MXS_1)}(2) \setminus P_{f(4EXQ_1)}(2)|=51\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\); for this pair, \(Z_2=99+51=150\).

Hydrophobic-polar version of Sequence 1:
11101100110001101110010000111110011001100010010110111110010010010101100111011111001101101111110111101101101100010110111101110110100110010011001010001111110101101100110111000100100110101011001101010111101010101110111110011111101100010100100111010000011011111100111111001110110111101010110100011101110101010111111011010101110001001101101101100101100110110010000011001010

Pair \(Z_2\) Length of longest common subsequence
4EXQ_1,1MXS_1 150 4
4EXQ_1,3ZVL_1 157 4
1MXS_1,3ZVL_1 163 3
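The \(Z_k\) column is the size of the symmetric difference of the two subword sets. A minimal sketch, with illustrative function names, is:

```python
def subwords(w, k):
    """Return the set of distinct length-k subwords of w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def z_distance(x, y, k):
    """Return Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)


# Toy words: P("ABAB", 2) = {AB, BA}, P("ABBA", 2) = {AB, BB, BA}.
print(z_distance("ABAB", "ABBA", 2))
```

Called on the two FASTA sequences with `k=2`, this would be expected to give the 99 + 51 = 150 reported for the pair 4EXQ_1, 1MXS_1, provided the page computes \(Z_2\) exactly as defined above.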

Newick tree

 
[
	3ZVL_1:81.61,
	[
		4EXQ_1:75,1MXS_1:75
	]:6.61
]

Let d be the Otu–Sayood distance d.
Let d1 be the Otu–Sayood distance d1. (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)(\frac{593 }{\log_{20} 593}-\frac{225}{\log_{20}225})\approx 104\), where \(593=368+225\) is the combined length of the two sequences.
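The arithmetic in this estimate can be checked directly. In the sketch below the constants 0.85 and 0.8 are taken from the text as given (the page does not explain them), and the function names are illustrative.

```python
import math


def lz_scale(n, alphabet_size=20):
    """Return n / log_b(n), the scale used to normalize LZ-complexity."""
    return n / math.log(n, alphabet_size)


def expected_distance(n_concat, n_single, c1=0.85, c2=0.8):
    """Evaluate c1 * c2 * (n_concat/log n_concat - n_single/log n_single)."""
    return c1 * c2 * (lz_scale(n_concat) - lz_scale(n_single))


# 593 = 368 + 225 is the combined length of 4EXQ_1 and 1MXS_1.
value = expected_distance(593, 225)
print(value)
```

The result lies between 104 and 105, consistent with the figure of 104 quoted above.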
Status Protein1 Protein2 d d1/2
Query variables 4EXQ_1 1MXS_1 130 102.5
Was not able to put for d
Was not able to put for d1

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \mathrm{min} + (1-\alpha) \mathrm{max}= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
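The interpolation can be checked against the query row above (d = 130, d1/2 = 102.5). The sketch assumes that min and max refer to two directed quantities whose maximum is d and whose sum is d1; this reading is inferred from the two cases of the formula, not stated on the page. Under that assumption the smaller quantity is 2(102.5) - 130 = 75.

```python
def delta(alpha, a, b):
    """Return alpha * min(a, b) + (1 - alpha) * max(a, b)."""
    return alpha * min(a, b) + (1 - alpha) * max(a, b)


# Assumed directed quantities for the pair 4EXQ_1, 1MXS_1:
# max = d = 130, and min = d1 - d = 205 - 130 = 75.
smaller, larger = 75, 130

print(delta(0, smaller, larger))    # alpha = 0 recovers d
print(delta(0.5, smaller, larger))  # alpha = 1/2 recovers d1 / 2
```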