CoV2D Browser™

Parikh vectors
2HVA_1 5FEE_1 3ZRH_1 Letter Amino acid
13 20 33 A Alanine
2 6 10 C Cysteine
9 33 60 L Leucine
7 13 6 M Methionine
8 16 30 R Arginine
9 21 30 D Aspartic acid
21 21 23 G Glycine
4 6 14 W Tryptophan
18 23 28 V Valine
5 8 11 N Asparagine
1 8 12 H Histidine
5 25 20 I Isoleucine
13 18 14 P Proline
13 12 23 T Threonine
7 12 24 Q Glutamine
16 26 32 E Glutamic acid
13 23 21 K Lysine
7 8 19 F Phenylalanine
11 16 29 S Serine
10 13 15 Y Tyrosine
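The Parikh vector of a sequence is its vector of letter counts, so each column of the table above can be recomputed from the corresponding FASTA sequence. A minimal sketch, assuming the row order of the table (the names `AMINO_ACIDS` and `parikh_vector` are illustrative, not taken from the page):

```python
from collections import Counter

# One-letter codes in the row order of the table above (assumed ordering).
AMINO_ACIDS = "ACLMRDGWVNHIPTQEKFSY"

def parikh_vector(seq: str) -> dict[str, int]:
    """Count how often each amino-acid letter occurs in seq."""
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) for aa in AMINO_ACIDS}

# First ten residues of the 2HVA_1 sequence, used here as a small example.
prefix = "GSMLGMIRNS"
print(parikh_vector(prefix))
```

The entries of a Parikh vector sum to the length of the sequence; the 2HVA_1 column sums to 192.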

>2HVA_1|Chain A|Heme-binding protein 1|Mus musculus (10090)
>5FEE_1|Chain A|Epidermal growth factor receptor|Homo sapiens (9606)
>3ZRH_1|Chain A|UBIQUITIN THIOESTERASE ZRANB1|HOMO SAPIENS (9606)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
2HVA , Knot 91 192 0.83 40 138 188
GSMLGMIRNSLFGSVETWPWQVLSTGGKEDVSYEERACEGGKFATVEVTDKPVDEALREAMPKIMKYVGGTNDKGVGMGMTVPVSFAVFPNEDGSLQKKLKVWFRIPNQFQGSPPAPSDESVKIEEREGITVYSTQFGGYAKEADYVAHATQLRTTLEGTPATYQGDVYYCAGYDPPMKPYGRRNEVWLVKA
5FEE , Knot 146 328 0.86 40 210 322
GGEAPNQALLRILKETEFKKIKVLGSGAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICLTSTVQLIMQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEYLIPQQG
3ZRH , Knot 184 454 0.82 40 233 417
ALEVDFKKLKQIKNRMKKTDWLFLNACVGVVEGDLAAIEAYKSSGGDIARQLTADEVRLLNRPSAFDVGYTLVHLAIRFQRQDMLAILLTEVSQQAAKCIPAMVCPELTEQIRREIAASLHQRKGDFACYFLTDLVTFTLPADIEDLPPTVQEKLFDEVLDRDVQKELEEESPIINWSLELATRLDSRLYALWNRTAGDCLLDSVLQATWGIYDKDSVLRKALHDSLHDCSHWFYTRWKDWESWYSQSFGLHFSLREEQWQEDWAFILSLASQPGASLEQTHIFVLAHILRRPIIVYGVKYYKSFRGETLGYTRFQGVYLPLLWEQSFCWKSPIALGYTRGHFSALVAMENDGYGNRGAGANLNTDDDVTITFLPLVDSERKLLHVHFLSAQELGNEEQQEKLLREWLDCCVTEGGVLVAMQKSSRRRNHPLVTQMVEKWLDRYRQIRPCTSLS
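The page does not state which Lempel-Ziv variant \(\mathrm{LZ}(w)\) refers to. The sketch below assumes the LZ76 exhaustive-history parsing (each new phrase is the shortest prefix of the remaining text that has not occurred earlier, overlap allowed) and also shows the normalization \(\mathrm{LZ}(w)/(n/\log_{20} n)\) from the table header. The function names are illustrative, and the counts it produces for the sequences above have not been checked against the table.

```python
import math

def lz76_complexity(w: str) -> int:
    """Number of phrases in an LZ76-style exhaustive-history parsing of w."""
    n, i, phrases = len(w), 0, 0
    while i < n:
        length = 1
        # Extend the phrase while it is still a copy of earlier material
        # (the copy may overlap the phrase itself).
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases

def normalized_lz(lz: int, n: int, alphabet_size: int = 20) -> float:
    """LZ(w) divided by n / log_alphabet_size(n)."""
    return lz / (n / math.log(n, alphabet_size))

print(lz76_complexity("0001101001000101"))  # parsed as 0|001|10|100|1000|101
print(normalized_lz(91, 192))               # the 2HVA row: LZ = 91, n = 192
```

The quadratic substring search is fine for sequences of a few hundred residues; longer inputs would call for a suffix-structure based implementation.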

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\). Let \(p_w(n)\) be the cardinality of \(P_w(n)\). Let \(f(c)\) be the sequence in FASTA with 4-symbol Protein Data Bank code \(c\).
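As a small illustration of these definitions, the following sketch computes \(P_w(n)\) as a set of contiguous length-\(n\) substrings and \(p_w(n)\) as its size. It assumes "subword" means a contiguous interval of \(w\); the function names are illustrative.

```python
def subwords(w: str, n: int) -> set[str]:
    """P_w(n): the distinct contiguous subwords of length n in w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def subword_complexity(w: str, n: int) -> int:
    """p_w(n): the number of distinct subwords of length n in w."""
    return len(subwords(w, n))

# Example on the first ten residues of the 2HVA_1 sequence.
print(sorted(subwords("GSMLGMIRNS", 2)))
print(subword_complexity("GSMLGMIRNS", 2))
```

For a word of length \(m\), \(p_w(n)\) is at most \(m-n+1\), which is why \(p_w(3)\) stays just below the sequence length in the table.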

\(|P_{f(2HVA_1)}(2) \setminus P_{f(5FEE_1)}(2)|=55\), \(|P_{f(5FEE_1)}(2) \setminus P_{f(2HVA_1)}(2)|=127\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1: 101111100011101001110110011000100000100110110101000110011001110110011100001111110111011111000101000101110110010101111000010100001101000011101001001101001000101011000101000110011101010000111101
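The hydrophobic-polar string can be reproduced, at least for its opening residues, by mapping each amino acid to 1 (hydrophobic) or 0 (polar). The page does not list the mapping; the set below is inferred from the first 50 residues of Sequence 1 and matches the common nonpolar grouping (A, G, I, L, M, F, P, W, V). Its treatment of residues that do not occur in that stretch, such as H, is an assumption.

```python
# Residues encoded as 1; inferred from the start of the binary string, not stated on the page.
HYDROPHOBIC = set("AGILMFPWV")

def hp_encode(seq: str) -> str:
    """Map each residue to '1' (hydrophobic) or '0' (polar)."""
    return "".join("1" if aa in HYDROPHOBIC else "0" for aa in seq)

# First ten residues of 2HVA_1 and the first ten bits of the string above.
print(hp_encode("GSMLGMIRNS"))
```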
Pair \(Z_2\) Length of longest common subword
2HVA_1,5FEE_1 182 4
2HVA_1,3ZRH_1 197 4
5FEE_1,3ZRH_1 169 4
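Both columns of this table can be computed directly from two sequences. The sketch below implements \(Z_k\) as the size of the symmetric difference of the two subword sets. For the last column it computes the length of the longest common contiguous subword; that reading of the column is an assumption, made because a value of 4 is far too small for a non-contiguous common subsequence of sequences this long. Function names are illustrative.

```python
def subwords(w: str, k: int) -> set[str]:
    """P_w(k): distinct contiguous subwords of length k."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x: str, y: str, k: int) -> int:
    """Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

def longest_common_subword_length(x: str, y: str) -> int:
    """Length of the longest contiguous block shared by x and y (row-by-row DP)."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1  # extend the match ending at the previous pair
                best = max(best, cur[j])
        prev = cur
    return best

print(z_distance("ABAB", "ABBA", 2))
print(longest_common_subword_length("GSMLG", "AMLGK"))
```

For the first row, the two set differences reported earlier on the page (55 and 127) add up to the tabulated \(Z_2 = 182\).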

Newick tree

 
(2HVA_1:98.02,(5FEE_1:84.5,3ZRH_1:84.5):13.52);

Let \(d\) denote the distance that Otu and Sayood call \(d\).
Let \(d_1\) denote the distance that Otu and Sayood call \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)(\frac{520}{\log_{20} 520}-\frac{192}{\log_{20}192})\approx 94.9\)
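The arithmetic in this estimate can be checked with a few lines. The sketch takes the constants 0.85 and 0.8 and the lengths 520 and 192 from the formula as given; reading 520 as the concatenated length 192 + 328 is an assumption, and the function name is illustrative.

```python
import math

def expected_distance(n_concat: int, n_single: int,
                      c1: float = 0.85, c2: float = 0.8,
                      alphabet_size: int = 20) -> float:
    """c1 * c2 * (n_concat / log(n_concat) - n_single / log(n_single)), logs to base alphabet_size."""
    def scale(n: int) -> float:
        return n / math.log(n, alphabet_size)
    return c1 * c2 * (scale(n_concat) - scale(n_single))

# Lengths from the formula above: 520 (assumed to be 192 + 328) and 192.
print(expected_distance(520, 192))
```

The result is just under 95, which agrees with the quoted 94.9 up to truncation.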
Status Protein1 Protein2 d d1/2
Query variables 2HVA_1 5FEE_1 126 97

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \mathrm{min} + (1-\alpha) \mathrm{max}= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
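The interpolation can be made concrete with a short sketch. It assumes that min and max are taken over two one-sided quantities \(a\) and \(b\) whose maximum is \(d\) and whose sum is \(d_1\), as in Otu and Sayood's definitions where \(a\) and \(b\) are the complexity increments of the two concatenation orders. The function name and the example pair are illustrative.

```python
def delta(alpha: float, a: float, b: float) -> float:
    """alpha * min(a, b) + (1 - alpha) * max(a, b)."""
    return alpha * min(a, b) + (1 - alpha) * max(a, b)

# Under the assumption above, the row d = 126, d1/2 = 97 for (2HVA_1, 5FEE_1)
# corresponds to one-sided values 126 and 2 * 97 - 126 = 68.
a, b = 68, 126
print(delta(0, a, b))    # alpha = 0 recovers d
print(delta(0.5, a, b))  # alpha = 1/2 recovers d1 / 2
```

At \(\alpha=1\) the same formula returns the smaller of the two one-sided values.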