CoV2D BrowserTM

Parikh vectors
4TQU_1 4YGE_1 5YKZ_1 Letter Amino acid
8 17 5 D Aspartic acid
1 2 0 C Cysteine
18 25 8 G Glycine
18 15 7 F Phenylalanine
19 12 19 T Threonine
9 5 0 W Tryptophan
8 17 5 N Asparagine
4 9 0 H Histidine
10 4 1 M Methionine
13 12 6 P Proline
23 17 12 V Valine
23 22 8 A Alanine
10 11 4 Q Glutamine
36 12 8 L Leucine
10 12 6 K Lysine
17 13 12 S Serine
16 8 2 Y Tyrosine
13 10 1 R Arginine
9 9 6 E Glutamic acid
36 14 9 I Isoleucine
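
The columns above are Parikh vectors: for each sequence, the number of occurrences of each amino-acid letter. A minimal sketch of that count, assuming the one-letter alphabet in the row order of the table; the names `AMINO_ACIDS` and `parikh_vector` are illustrative and not taken from the CoV2D code.

```python
from collections import Counter

# One-letter amino-acid codes, in the row order of the table above.
AMINO_ACIDS = "DCGFTWNHMPVAQLKSYREI"


def parikh_vector(seq, alphabet=AMINO_ACIDS):
    """Return the letter counts of `seq`, one entry per letter of `alphabet`."""
    counts = Counter(seq)
    return [counts.get(letter, 0) for letter in alphabet]


if __name__ == "__main__":
    # First 14 residues of the 4TQU_1 sequence, used only as a short example.
    prefix = "MERLWKDIKRDWLL"
    for letter, count in zip(AMINO_ACIDS, parikh_vector(prefix)):
        if count:
            print(letter, count)
```

The entries of a Parikh vector always sum to the sequence length, which gives a quick consistency check against the Length column reported for each protein.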

>4TQU_1|Chain A[auth M]|AlgM1|Sphingomonas sp. (28214)
>4YGE_1|Chains A, C, E|Protein ERGIC-53|Homo sapiens (9606)
>5YKZ_1|Chains A, B|Capsid protein|Penaeus vannamei nodavirus (430911)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
4TQU , Knot 130 301 0.82 40 180 284
MERLWKDIKRDWLLYAMLLPTIIWFLIFLYKPMIGLQMAFKQYSAWKGIAGSPWIGFDHFVTLFQSEQFIRAIKNTLTLSGLSLLFGFPMPILLALMINEVYSKGYRKAVQTIVYLPHFISIVIVAGLVVTFLSPSTGVVNNMLSWIGLDRVYFLTQPEWFRPIYISSNIWKEAGFDSIVYLAAIMSINPALYESAQVDGATRWQMITRITLPCIVPTIAVLLVIRLGHILEVGFEYIILLYQPTTYETADVISTYIYRLGLQGARYDIATAAGIFNAVVALVIVLFANHMSRRITKTGVF
4YGE , Knot 114 246 0.85 40 175 240
MNHKVHMDGVGGDPAVALPHRRFEYKYSFKGPHLVQSDGTVPFWAHAGNAIPSSDQIRVAPSLKSQRGSVWTKTKAAFENWEVEVTFRVTGRGRIGADGLAIWYAENQGLEGPVFGSADLWNGVGIFFDSFDNDGKKNNPAIVIIGNNGQIHYDHQNDGASQALASCQRDFRNKPYPVRAKITYYQNTLTVMINNGFTPDKNDYEFCAKVENMIIPAQGHFGISAATGGLADDHDVLSFLTFQLTE
5YKZ , Knot 61 119 0.81 34 92 114
NPTSLTDVRVDKAVNFIKPEVSGVAEIQTVTGLSPSTSYLLTPAFLEQNFQSEAGIYILSATPVEGEGTISINMDPTVTTVSGFIKVKTDTFGTFDLSVVLTTASKKQTTGFNIIAATS
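
The normalized column \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be recomputed from the tabulated LZ-complexity and length. A minimal sketch, assuming the normalization is exactly the formula in the header with alphabet size 20; `normalized_lz` is a hypothetical helper name, and the LZ-complexity values themselves are taken from the table rather than recomputed.

```python
import math


def normalized_lz(lz, n, alphabet_size=20):
    """Return LZ(w) / (n / log_b n) for a word of length n over b letters."""
    return lz / (n / math.log(n, alphabet_size))


if __name__ == "__main__":
    # (LZ-complexity, length) pairs copied from the table above.
    for code, lz, n in [("4TQU", 130, 301), ("4YGE", 114, 246)]:
        print(code, f"{normalized_lz(lz, n):.4f}")
```

For 4TQU and 4YGE this gives values that agree with the tabulated 0.82 and 0.85 to two decimals.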

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)=|P_w(n)|\) be its cardinality. (Here \(n\) is the subword length, not the word length \(|w|\) from the table above.) Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
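
As an illustration of these definitions, the following sketch enumerates the distinct contiguous subwords of a given length and counts them. It assumes that \(P_w(n)\) collects the subwords of length \(n\); the function names `subwords` and `subword_complexity` are illustrative only.

```python
def subwords(w, k):
    """Return the set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def subword_complexity(w, k):
    """Return the number of distinct length-k subwords of w."""
    return len(subwords(w, k))


if __name__ == "__main__":
    w = "ABAB"  # toy word, not a protein sequence
    for k in (1, 2, 3):
        print(k, sorted(subwords(w, k)), subword_complexity(w, k))
```

On the toy word `ABAB` the length-2 subwords are `AB` and `BA`, so the count is 2 even though the word has three length-2 windows.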

\(|P_{f(4TQU_1)}(2) \setminus P_{f(4YGE_1)}(2)|=89\), \(|P_{f(4YGE_1)}(2) \setminus P_{f(4TQU_1)}(2)|=84\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be the size of the symmetric difference of \(P_x(k)\) and \(P_y(k)\), an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
1001100100011101111101111111100111110111000011011110111110011011000011011000101011011111111111111100100010001100110110110111111111011010011100110111100101100101101101000110011100110111110101110001010110010110010110111011111110110110111001111001000001011000100111011000110111110111111111110010001000111
Pair \(Z_2\) Length of longest common subword (substring)
4TQU_1,4YGE_1 173 4
4TQU_1,5YKZ_1 142 5
4YGE_1,5YKZ_1 159 3
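
The \(Z_2\) column is the quantity \(Z_k(x,y)\) defined above with \(k=2\); for the first pair it is \(89+84=173\). A minimal sketch of the computation on toy words, assuming subwords are contiguous and of length exactly \(k\); `subwords` and `z_distance` are hypothetical names.

```python
def subwords(w, k):
    """Return the set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def z_distance(x, y, k):
    """Return |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)


if __name__ == "__main__":
    # Toy words: "BB" is the only length-2 subword not shared by both.
    print(z_distance("ABAB", "BABB", 2))
```

Dividing \(Z_k(x,y)\) by \(|P_x(k)\cup P_y(k)|\) would give the Jaccard distance itself; the table reports only the numerator.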

Newick tree

 
(
	4YGE_1:86.72,
	(
		4TQU_1:71,5YKZ_1:71
	):15.72
);

Let \(d\) and \(d_1\) denote the Otu--Sayood distances of those names. (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)\left(\frac{547}{\log_{20} 547}-\frac{246}{\log_{20}246}\right)\approx 85.7\), where \(547=301+246\) is the combined length of the two query sequences.
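
The arithmetic behind this estimate can be reproduced directly. The sketch below only evaluates the stated expression; the factors 0.85 and 0.8 are copied from the text as given, and `n_over_log` is a hypothetical helper name.

```python
import math


def n_over_log(n, base=20):
    """Return n / log_base(n), the normalizing quantity used above."""
    return n / math.log(n, base)


# Lengths of the two query sequences, from the table of sequences.
len_4tqu, len_4yge = 301, 246

estimate = 0.85 * 0.8 * (n_over_log(len_4tqu + len_4yge) - n_over_log(len_4yge))

if __name__ == "__main__":
    print(f"{estimate:.1f}")
```

The result agrees with the quoted 85.7 to one decimal place.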
Status Protein1 Protein2 d d1/2
Query variables 4TQU_1 4YGE_1 108 99.5

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta = \alpha \min + (1-\alpha) \max = \begin{cases} d & \alpha=0,\\ d_1/2 & \alpha=1/2 \end{cases} \]
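
The interpolation can be made concrete with a small sketch. It assumes that min and max refer to the smaller and larger of two directional quantities `a` and `b`, so that \(d\) is the max and \(d_1\) is their sum; the function name `delta` and the pair (91, 108) are illustrative, the pair being derived under that assumption from the tabulated \(d=108\) and \(d_1/2=99.5\).

```python
def delta(a, b, alpha):
    """Interpolate between max(a, b) at alpha = 0 and min(a, b) at alpha = 1."""
    return alpha * min(a, b) + (1 - alpha) * max(a, b)


if __name__ == "__main__":
    # Hypothetical directional values: max = d = 108 and (min + max) / 2 = 99.5
    # together force min = 91 under the assumption stated above.
    a, b = 91, 108
    print(delta(a, b, 0))    # the case alpha = 0, i.e. d
    print(delta(a, b, 0.5))  # the case alpha = 1/2, i.e. d_1 / 2
```

With these values the two cases of the display reproduce the row for 4TQU_1 and 4YGE_1.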