CoV2D BrowserTM

Parikh vectors
2IIH_1 4CCE_1 7THE_1 Letter Amino acid
0 30 17 N Asparagine
6 32 9 D Aspartic acid
12 62 15 G Glycine
4 39 7 I Isoleucine
6 29 11 P Proline
16 34 6 E Glutamic acid
18 59 13 L Leucine
1 39 15 Y Tyrosine
21 37 12 A Alanine
9 29 9 R Arginine
2 3 8 C Cysteine
4 21 6 Q Glutamine
3 25 1 H Histidine
3 29 14 F Phenylalanine
1 25 2 W Tryptophan
16 35 16 V Valine
13 26 8 K Lysine
5 12 0 M Methionine
4 42 15 S Serine
13 46 11 T Threonine
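
The Parikh vector of a sequence simply records how often each letter occurs. A minimal sketch of how one column of the table above could be computed is given here; the names `AMINO_ACIDS` and `parikh_vector` are hypothetical and not taken from the CoV2D code.

```python
from collections import Counter

# The 20 standard one-letter amino acid codes (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def parikh_vector(w, alphabet=AMINO_ACIDS):
    """Return a dict mapping each letter of the alphabet to its count in w."""
    counts = Counter(w)
    # Letters that never occur get an explicit 0, as in the table.
    return {a: counts.get(a, 0) for a in alphabet}

if __name__ == "__main__":
    # A short prefix of the 2IIH_1 sequence, used only as sample input.
    prefix = "MDLTHFQDGRPRMVDV"
    for letter, count in parikh_vector(prefix).items():
        print(letter, count)
```

The counts over the whole alphabet sum to the sequence length whenever the sequence uses only the 20 standard letters.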

>2IIH_1|Chain A|Molybdenum cofactor biosynthesis protein C|Thermus thermophilus (300852)
>4CCE_1|Chain A|GALACTOCEREBROSIDASE|MUS MUSCULUS (10090)
>7THE_1|Chain A|Spike protein S1|Severe acute respiratory syndrome coronavirus 2 (2697049)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
2IIH , Knot 73 157 0.78 38 107 150
MDLTHFQDGRPRMVDVTEKPETFRTATAEAFVELTEEALSALEKGGVGKGDPLVVAQLAGILAAKKTADLIPLCHPLPLTGVEVRVELLKAEKRVRIEATVKTKAETGVEMEAMTACAVAALTVYDMLKAASKGLVISQVRLLHKAGGKSGEWRREQ
4CCE , Knot 256 654 0.84 40 278 606
HHHHHHIEGRGAYVLDDSDGLGREFDGIGAVSGGGATSRLLVNYPEPYRSEILDYLFKPNFGASLHILKVEIGGDGQTTDGTEPSHMHYELDENYFRGYEWWLMKEAKKRNPDIILMGLPWSFPGWLGKGFSWPYVNLQLTAYYVVRWILGAKHYHDLDIDYIGIWNERPFDANYIKELRKMLDYQGLQRVRIIASDNLWEPISSSLLLDQELWKVVDVIGAHYPGTYTVWNAKMSGKKLWSSEDFSTINSNVGAGCWSRILNQNYINGNMTSTIAWNLVASYYEELPYGRSGLMTAQEPWSGHYVVASPIWVSAHTTQFTQPGWYYLKTVGHLEKGGSYVALTDGLGNLTIIIETMSHQHSMCIRPYLPYYNVSHQLATFTLKGSLREIQELQVWYTKLGTPQQRLHFKQLDTLWLLDGSGSFTLELEEDEIFTLTTLTTGRKGSYPPPPSSKPFPTNYKDDFNVEYPLFSEAPNFADQTGVFEYYMNNEDREHRFTLRQVLNQRPITWAADASSTISVIGDHHWTNMTVQCDVYIETPRSGGVFIAGRVNKGGILIRSATGVFFWIFANGSYRVTADLGGWITYASGHADVTAKRWYTLTLGIKGYFAFGMLNGTILWKNVRVKYPGHGWAAIGTHTFEFAQFDNFRVEAAR
7THE , Knot 91 195 0.82 38 138 188
TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGP
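
The LZ-complexity column and the normalized ratio \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be sketched as follows. This assumes the exhaustive-history parsing of Lempel and Ziv (1976); the page does not say which variant it uses, so the sketch is not guaranteed to reproduce the tabulated LZ values. The function names `lz76` and `normalized_lz` are hypothetical.

```python
import math

def lz76(s):
    """Number of components in an exhaustive-history (LZ76-style) parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the component while it can be copied from a start position
        # strictly before i (overlap with the component itself is allowed).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def normalized_lz(lz, n, sigma=20):
    """LZ(w) divided by n / log_sigma(n), for an alphabet of size sigma."""
    return lz / (n / math.log(n, sigma))

if __name__ == "__main__":
    # Values for 2IIH taken from the table: LZ(w) = 73, n = 157.
    print(round(normalized_lz(73, 157), 2))
```

The quadratic substring search is chosen for readability; for sequences of a few hundred letters it is fast enough.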

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\). Let \(p_w(n)\) be the cardinality of \(P_w(n)\). Let \(f(c)\) be the sequence in FASTA format with 4-symbol Protein Data Bank code \(c\).
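
The definitions of \(P_w(n)\) and \(p_w(n)\) translate directly into a few lines of code. This is a minimal sketch, assuming that a subword means a contiguous block of letters; the names `subwords` and `p` are hypothetical.

```python
def subwords(w, n):
    """P_w(n): the set of distinct contiguous subwords of length n in w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def p(w, n):
    """p_w(n): the number of distinct subwords of length n in w."""
    return len(subwords(w, n))

if __name__ == "__main__":
    w = "abab"
    print(sorted(subwords(w, 2)))  # the distinct length-2 blocks of w
    print(p(w, 2))
```

For a word over a 20-letter alphabet, \(p_w(2)\) can never exceed 400, and it can never exceed \(|w|-1\).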

\(|P_{f(2IIH_1)}(2) \setminus P_{f(4CCE_1)}(2)|=30\), \(|P_{f(4CCE_1)}(2) \setminus P_{f(2IIH_1)}(2)|=201\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
1010010010101101000100100101011101000110110011110101111101111111000101111001111011010101101000101010100010011010110101111101001101100111100101100111001010000
Pair \(Z_2\) Length of longest common subword
2IIH_1,4CCE_1 231 5
2IIH_1,7THE_1 159 3
4CCE_1,7THE_1 200 4
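
The quantity \(Z_k(x,y)\) is the size of the symmetric difference of the two subword sets, so for the first pair it is \(30+201=231\). A minimal sketch, with hypothetical names `subwords` and `Z`:

```python
def subwords(w, k):
    """P_w(k): the set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def Z(x, y, k):
    """Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

if __name__ == "__main__":
    # Toy input: "abba" has the extra block "bb" that "abab" lacks.
    print(Z("abab", "abba", 2))
```

Dividing \(Z_k(x,y)\) by \(|P_x(k)\cup P_y(k)|\) would give the Jaccard distance itself; the table reports only the numerator.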

Newick tree

 
(4CCE_1:11.98,(2IIH_1:79.5,7THE_1:79.5):36.48);

Let \(d\) denote the Otu--Sayood distance \(d\).
Let \(d_1\) denote the Otu--Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, the expected distance is \((0.85)(0.8)\left(\frac{811}{\log_{20} 811}-\frac{157}{\log_{20}157}\right)\approx 183.\)
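
The arithmetic in this estimate can be checked numerically. The sketch below only evaluates the displayed expression; the helper name `capacity` is hypothetical, and the observation that \(811 = 157 + 654\) (the combined length of the two query sequences) is an inference from the table, not something the page states.

```python
import math

def capacity(n, sigma=20):
    """n / log_sigma(n), the normalizing quantity used for LZ-complexity."""
    return n / math.log(n, sigma)

def expected_distance(n_concat, n_single, c1=0.85, c2=0.8):
    """Evaluate c1 * c2 * (capacity(n_concat) - capacity(n_single))."""
    return c1 * c2 * (capacity(n_concat) - capacity(n_single))

if __name__ == "__main__":
    # 811 and 157 are the lengths appearing in the displayed formula.
    print(round(expected_distance(811, 157)))
```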
Status Protein1 Protein2 d d1/2
Query variables 2IIH_1 4CCE_1 235 142.5

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \min + (1-\alpha) \max= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2. \end{cases} \]
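
The interpolation \(\delta\) can be sketched as follows. This assumes the usual Otu--Sayood definitions, \(d=\max\{c(xy)-c(x),\,c(yx)-c(y)\}\) and \(d_1\) equal to the sum of the same two terms, for some complexity function \(c\); that reading is consistent with the displayed cases but is not spelled out on the page. The names `otu_sayood_terms` and `delta` are hypothetical.

```python
def otu_sayood_terms(c, x, y):
    """Return the two increments c(xy) - c(x) and c(yx) - c(y)."""
    return c(x + y) - c(x), c(y + x) - c(y)

def delta(a, b, alpha):
    """alpha * min(a, b) + (1 - alpha) * max(a, b)."""
    return alpha * min(a, b) + (1 - alpha) * max(a, b)

if __name__ == "__main__":
    # Toy complexity function (number of distinct letters), for illustration only.
    c = lambda w: len(set(w))
    a, b = otu_sayood_terms(c, "a", "bcd")
    d = delta(a, b, 0)           # alpha = 0 gives d, the larger increment
    half_d1 = delta(a, b, 0.5)   # alpha = 1/2 gives d1 / 2
    print(d, half_d1)
```

Under this reading, the query row with \(d=235\) and \(d_1/2=142.5\) corresponds to increments 235 and 50, since \((235+50)/2=142.5\).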