CoV2D Browser™

Parikh vectors
5OKG_1 9GFW_1 3PRW_1 Letter Amino acid
22 12 9 F Phenylalanine
17 7 9 W Tryptophan
40 22 39 G Glycine
19 12 5 H Histidine
23 9 14 I Isoleucine
34 26 38 L Leucine
11 2 6 M Methionine
21 12 25 T Threonine
34 17 42 V Valine
1 1 0 C Cysteine
18 11 17 Q Glutamine
29 13 16 E Glutamic acid
20 24 12 K Lysine
30 7 13 R Arginine
35 19 26 D Aspartic acid
25 17 12 P Proline
19 18 33 S Serine
29 8 9 Y Tyrosine
33 13 32 A Alanine
25 10 20 N Asparagine
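
The Parikh vector of a sequence records how often each letter occurs, so its entries sum to the sequence length. A minimal sketch in Python, assuming the row order of the table above; the name `parikh_vector` is illustrative and not part of the CoV2D site:

```python
from collections import Counter

# Letter order taken from the rows of the table above.
ALPHABET = "FWGHILMTVCQEKRDPSYAN"

def parikh_vector(seq, alphabet=ALPHABET):
    """Return the letter counts of seq, listed in the order of alphabet."""
    counts = Counter(seq)
    return [counts[a] for a in alphabet]

# Sanity check on the 5OKG_1 column: the counts sum to 485,
# the length reported for 5OKG further down the page.
col_5okg = [22, 17, 40, 19, 23, 34, 11, 21, 34, 1,
            18, 29, 20, 30, 35, 25, 19, 29, 33, 25]
assert sum(col_5okg) == 485
```

The other two columns sum to 260 and 377 in the same way.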

>5OKG_1|Chains A, B, C, D|Putative 6-phospho-beta-galactobiosidase|Geobacillus stearothermophilus (1422)
>9GFW_1|Chain A|Carbonic anhydrase 2|Homo sapiens (9606)
>3PRW_1|Chain A|Lipoprotein yfgL|Escherichia coli (83333)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
5OKG , Knot 196 485 0.83 40 247 453
MIHHHHHHEHRHLKPFPPEFLWGAASAAYQVEGAWNEDGKGLSVWDVFAKQPGRTFKGTNGDVAVDHYHRYQEDVALMAEMGLKAYRFSVSWSRVFPDGNGAVNEKGLDFYDRLIEELRNHGIEPIVTLYHWDVPQALMDAYGAWESRRIIDDFDRYAVTLFQRFGDRVKYWVTLNQQNIFISFGYRLGLHPPGVKDMKRMYEANHIANLANAKVIQSFRHYVPDGKIGPSFAYSPMYPYDSRPENVLAFENAEEFQNHWWMDVYAWGMYPQAAWNYLESQGLEPTVAPGDWELLQAAKPDFMGVNYYQTTTVEHNPPDGVGEGVMNTTGKKGTSTSSGIPGLFKTVRNPHVDTTNWDWAIDPVGLRIGLRRIANRYQLPILITENGLGEFDTLEPGDIVNDDYRIDYLRRHVQEIQRAITDGVDVLGYCAWSFTDLLSWLNGYQKRYGFVYVNRDDESEKDLRRIKKKSFYWYQRVIETNGAEL
9GFW , Knot 112 260 0.79 40 176 249
MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKPLSVSYDQATSLRILNNGHAFNVEFDDSQDKAVLKGGPLDGTYRLIQFHFHWGSLDGQGSEHTVDKKKYAAELHLVHWNTKYGDFGKAVQQPDGLAVLGIFLKVGSAKPGLQKVVDVLDSIKTKGKSADFTNFDPRGLLPESLDYWTYPGSLTTPPLLECVTWIVLKEPISVSSEQVLKFRKLNFNGEGEPEELMVDNWRPAQPLKNRQIKASFK
3PRW , Knot 154 377 0.80 38 197 351
GPLGSSLFNSEEDVVKMSPLPTVENQFTPTTAWSTSVGSGIGNFYSNLHPALADNVVYAADRAGLVKALNADDGKEIWSVSLAEKDGWFSKEPALLSGGVTVSGGHVYIGSEKAQVYALNTSDGTVAWQTKVAGEALSRPVVSDGLVLIHTSNGQLQALNEADGAVKWTVNLDMPSLSLRGESAPTTAFGAAVVGGDNGRVSAVLMEQGQMIWQQRISQATGSTEIDRLSDVDTTPVVVNGVVFALAYNGNLTALDLRSGQIMWKRELGSVNDFIVDGNRIYLVDQNDRVMALTIDGGVTLWTQSDLLHRLLTSPVLYNGNLVVGDSEGYLHWINVEDGRFVAQQKVDSSGFQTEPVAADGKLLIQAKDGTVYSITR
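
The LZ-complexity and its normalization in the table above can be sketched as follows. This is one common form of the Lempel-Ziv (1976) exhaustive parsing. Whether it matches the variant used to produce the listed values is an assumption, and the function names are illustrative.

```python
import math

def lz76_complexity(s):
    """Number of phrases in a Lempel-Ziv (1976) exhaustive parsing of s."""
    i, phrases, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the phrase while it already occurs in the text
        # that precedes its last letter (overlap is allowed).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases

def normalized_lz(lz, n, alphabet_size=20):
    """LZ(w) / (n / log_20 n), the ratio in the fourth column."""
    return lz / (n / math.log(n, alphabet_size))

print(lz76_complexity("0001101001000101"))  # 6 phrases: 0|001|10|100|1000|101
print(normalized_lz(196, 485))              # about 0.834, listed as 0.83
```

For 9GFW and 3PRW the same ratio comes out near 0.7996 and 0.8089, so the listed 0.79 and 0.80 appear to be truncated rather than rounded.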

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)\) be the cardinality of \(P_w(n)\). Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
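
As an illustration of these definitions, here is a short Python sketch that enumerates the length-\(k\) subwords of a word. The helper names `subwords` and `subword_complexity` are hypothetical, chosen only for this example.

```python
def subwords(w, k):
    """P_w(k): the set of distinct contiguous subwords of w of length k."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def subword_complexity(w, k):
    """p_w(k): the cardinality of P_w(k)."""
    return len(subwords(w, k))

print(subword_complexity("abab", 2))  # 2, from {"ab", "ba"}
```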

\(|P_{f(5OKG_1)}(2) \setminus P_{f(9GFW_1)}(2)|=127\), \(|P_{f(9GFW_1)}(2) \setminus P_{f(5OKG_1)}(2)|=56\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
11000000000010111101111110110010111000101101101110011001010010111000000000111110111010010101001110101110001101000110010001101110100101101110101110000110010001101100110010011010000111011001110111100100100100110110101100100011010111011001101000010011110010010001110101111010111001000110101111010110110101111000000010001101110111000100100000111111001001010000101110111101110011000011111000111010010110110000010010001001001100110111001101001101101000001110100000000010010000101000110001101
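
The hydrophobic-polar encoding can be sketched in Python. The page does not state which residues count as hydrophobic. The set below is an assumption, inferred by comparing the start and end of Sequence 1 with its encoding, and `hp_string` is an illustrative name.

```python
# Assumed hydrophobic residues (encoded as 1), inferred from Sequence 1 and
# its encoding; every other residue is treated as polar (encoded as 0).
HYDROPHOBIC = set("AFGILMPVW")

def hp_string(seq):
    """Map each residue to '1' (hydrophobic) or '0' (polar)."""
    return "".join("1" if aa in HYDROPHOBIC else "0" for aa in seq)

# First 20 residues of 5OKG_1 and the first 20 symbols of its encoding.
print(hp_string("MIHHHHHHEHRHLKPFPPEF"))  # 11000000000010111101
```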
Pair \(Z_2\) Length of longest common subword
5OKG_1,9GFW_1 183 4
5OKG_1,3PRW_1 164 4
9GFW_1,3PRW_1 157 3
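
The two columns of this table can be sketched in Python. The values 3 and 4 in the last column are read here as lengths of contiguous common subwords, which is an assumption. The function names are illustrative.

```python
def subwords(w, k):
    """Set of distinct contiguous subwords of w of length k."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x, y, k):
    """Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

def longest_common_substring_len(x, y):
    """Length of the longest contiguous subword shared by x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, 1):
            if a == b:
                # Extend the common run ending at the previous letters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

print(z_distance("abab", "abba", 2))                 # 1, only "bb" differs
print(longest_common_substring_len("abab", "abba"))  # 2, e.g. "ab"

# The first row is consistent with the two set differences quoted above.
assert 127 + 56 == 183
```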

Newick tree

 
[
	5OKG_1:89.49,
	[
		3PRW_1:78.5,9GFW_1:78.5
	]:10.99
]

Let d be the Otu-Sayood distance d.
Let d1 be the Otu-Sayood distance d1. (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)(\frac{745}{\log_{20} 745}-\frac{260}{\log_{20}260})\approx 134.\)
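
The arithmetic behind this estimate can be checked with a few lines of Python. Note that 745 = 485 + 260, the sum of the two sequence lengths. The helper name `capacity` is hypothetical.

```python
import math

def capacity(n, alphabet_size=20):
    """n / log_20 n, the same normalizer used for the LZ-complexity ratio."""
    return n / math.log(n, alphabet_size)

expected = 0.85 * 0.8 * (capacity(745) - capacity(260))
print(round(expected))  # 134
```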
Status Protein1 Protein2 d d1/2
Query variables 5OKG_1 9GFW_1 172 131.5

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \min + (1-\alpha) \max= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2. \end{cases} \]
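
A small Python sketch of this interpolation, using the query row above. The page does not define min and max. The value min = 91 is not stated on the page either: it is inferred from d = 172 and d1/2 = 131.5 under the formula above.

```python
def delta(alpha, smaller, larger):
    """Interpolate between the smaller and larger of the two quantities."""
    return alpha * smaller + (1 - alpha) * larger

d = 172             # listed value of d, equal to max at alpha = 0
d1_half = 131.5     # listed value of d1/2, equal to (min + max) / 2
smaller = 2 * d1_half - d   # inferred min: 91.0

print(delta(0, smaller, d))    # 172.0
print(delta(0.5, smaller, d))  # 131.5
```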