CoV2D BrowserTM

Parikh vectors
3AQN_1 7HAT_1 5KKW_1 Letter Amino acid
14 16 21 S Serine
23 17 19 T Threonine
17 9 7 Q Glutamine
20 5 10 P Proline
19 10 9 F Phenylalanine
23 11 18 V Valine
17 8 12 N Asparagine
24 16 15 D Aspartic acid
6 1 5 W Tryptophan
29 26 13 E Glutamic acid
20 17 30 K Lysine
3 0 0 C Cysteine
24 15 15 G Glycine
9 10 6 H Histidine
20 20 11 I Isoleucine
48 19 20 L Leucine
12 7 7 M Methionine
33 15 17 A Alanine
40 8 5 R Arginine
14 7 9 Y Tyrosine
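As a minimal sketch of how one column of the table above could be recomputed, the Parikh vector of a sequence is just its vector of letter counts. The names `LETTERS` and `parikh_vector` are hypothetical, chosen for illustration; the letter order follows the rows of the table.

```python
from collections import Counter

# The 20 amino-acid letters, in the row order of the table above.
LETTERS = "STQPFVNDWEKCGHILMARY"

def parikh_vector(sequence: str) -> dict[str, int]:
    """Return how often each letter of LETTERS occurs in `sequence`."""
    counts = Counter(sequence)
    return {letter: counts[letter] for letter in LETTERS}

# Example on the first ten residues of 7HAT_1.
vector = parikh_vector("MDQPMEEEEV")
print(vector["E"], vector["M"], vector["C"])
```

The entries of a Parikh vector sum to the sequence length, which gives a quick consistency check against the Length column further down.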

>3AQN_1|Chains A, B|Poly(A) polymerase|Escherichia coli (536056)
>7HAT_1|Chain A|Heat shock protein HSP 90-alpha|Homo sapiens (9606)
>5KKW_1|Chain A|Cyclohexadienyl dehydratase|Pelagibacter ubique (strain HTCC1062) (335992)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
3AQN , Knot 175 415 0.84 40 228 391
QVTVIPREQHAISRKDISENALKVMYRLNKAGYEAWLVGGGVRDLLLGKKPKDFDVTTNATPEQVRKLFRNCRLVGRRFRLAHVMFGPEIIEVATFRGHHEGNVSDRTTSQRGQNGMLLRDNIFGSIEEDAQRRDFTINSLYYSVADFTVRDYVGGMKDLKDGVIRLIGNPETRYREDPVRMLRAVRFAAKLGMRISPETAEPIPRLATLLNDIPPAHLFEESLKLLQAGYGYETYKLLCEYHLFQPLFPTITRYFTENGDSPMERIIEQVLKNTDTRIHNDMRVNPAFLFAAMFWYPLLETAQKIAQESGLTYHDAFALAMNDVLDEACRSLAIPKRLTTLTRDIWQLQLRMSRRQGKRAWKLLEHPKFRAAYDLLALRAEVERNAELQRLVKWWGEFQVSAPPDQKGMLNELD
7HAT , Knot 105 237 0.80 38 157 225
MDQPMEEEEVETFAFQAEIAQLMSLIINTFYSNKEIFLRELISNSSDALDKIRYESLTDPSKLDSGKELHINLIPNKQDRTLTIVDTGIGMTKADLINNLGTIAKSGTKAFMEALQAGADISMIGQFGVGFYSAYLVAEKVTVITKHNDDEQYAWESSAGGSFTVRTDTGEPMGRGTKVILHLKEDQTEYLEERRIKEIVKKHSQFIGYPITLFVEKERDKEVSDDEAELEHHHHHH
5KKW , Knot 109 249 0.80 38 154 236
MRGSHHHHHHIAESKLDQILSSGELKVGTTGDWDPMAMKDPATNKYKGFDIDVMQELAKDMGVKITFVPTEWKTIVSGITAGRYDISTSVTKTPKRAEVAGFTDSYYKYGTVPLVLKKNLKKYSTWKSLNNKDVTIATTLGTSQEEKAKEFFPLSKLQSVESPARDFQEVLAGRADGNITSSTEANKLVVKYPQLAIVPDGEKNPAFLAMMVSKNDQVWNDYVNEWIKSKKSSGFFNKLLAKYNLKSLL
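The normalized column \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be reproduced from the LZ-complexity and length columns. The sketch below also includes one possible way to count LZ components, assuming the LZ76 exhaustive-history parsing; the page does not state which variant it uses, so `lz76_complexity` is an assumption and may not reproduce the table's LZ values exactly. Both function names are hypothetical.

```python
import math

def lz76_complexity(w: str) -> int:
    """Count components of the LZ76 exhaustive-history parsing of `w` (assumed variant)."""
    i, count, n = 0, 0, len(w)
    while i < n:
        length = 1
        # Grow the component while it can still be copied from earlier text.
        # The copy may overlap the component, but must start before position i.
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def normalized_lz(lz: int, n: int, alphabet_size: int = 20) -> float:
    """Return LZ(w) / (n / log_alphabet_size(n)); requires n > 1."""
    return lz / (n / math.log(n, alphabet_size))

# Normalization applied to the tabulated values for 3AQN (LZ = 175, n = 415).
ratio_3aqn = normalized_lz(175, 415)
```

The substring search makes this quadratic or worse in the sequence length, which is acceptable for sequences of a few hundred residues.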

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)=|P_w(n)|\) be its cardinality. Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
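A minimal sketch of these definitions, with hypothetical function names `subwords` and `subword_complexity`:

```python
def subwords(w: str, n: int) -> set[str]:
    """Return P_w(n): the distinct contiguous subwords of length n in w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def subword_complexity(w: str, n: int) -> int:
    """Return p_w(n), the number of distinct subwords of length n."""
    return len(subwords(w, n))

# Toy example: "abab" has two distinct subwords of length 2.
print(sorted(subwords("abab", 2)))
```

When \(n\) exceeds the length of \(w\), the range is empty and the sketch returns the empty set.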

\(|P_{f(3AQN_1)}(2) \setminus P_{f(7HAT_1)}(2)|=116\), \(|P_{f(7HAT_1)}(2) \setminus P_{f(3AQN_1)}(2)|=45\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
0101110000110000100011011001001100111111110011110010010100010100100110000111001011011111011011010100010100000000100111100011101000100001010010001101010001111001001110111010000000110110110111011101010010111011011001111011000101101101000001100001101111010001000100110011001100000010001010111111111101110010011000110000111111001100100011110010010001101010100001001101100101011001111010100010100110111010101110001110010
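The page does not say which residues count as hydrophobic. Comparing the start of Sequence 1 with the start of the binary string suggests that A, F, G, I, L, M, P, V and W map to 1 and all other residues to 0. That classification is an inference from the data shown here, not a documented rule, and the names `HYDROPHOBIC` and `hp_encode` are hypothetical.

```python
# Assumption: this hydrophobic set is inferred from the data on this page,
# not taken from any stated definition.
HYDROPHOBIC = set("AFGILMPVW")

def hp_encode(sequence: str) -> str:
    """Map each residue to '1' (hydrophobic) or '0' (polar)."""
    return "".join("1" if residue in HYDROPHOBIC else "0" for residue in sequence)

# First 20 residues of 3AQN_1.
prefix = hp_encode("QVTVIPREQHAISRKDISEN")
print(prefix)
```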
Pair \(Z_2\) Length of longest common subword
3AQN_1,7HAT_1 161 4
3AQN_1,5KKW_1 166 5
7HAT_1,5KKW_1 157 6
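The two columns of this table could be computed along the following lines. This is a sketch with hypothetical function names; it assumes the last column measures the longest contiguous common subword, which is what values as small as 4 to 6 suggest for sequences of this length.

```python
def subwords(w: str, k: int) -> set[str]:
    """Distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x: str, y: str, k: int) -> int:
    """Z_k(x, y): size of the symmetric difference of P_x(k) and P_y(k)."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

def longest_common_subword_length(x: str, y: str) -> int:
    """Length of the longest contiguous word occurring in both x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend the common run ending at the previous positions.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

For the first row, the two one-sided differences reported above, 116 and 45, add up to the tabulated \(Z_2 = 161\).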

Newick tree

 
[
	3AQN_1:82.81,
	[
		7HAT_1:78.5,5KKW_1:78.5
	]:4.31
]
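The bracketed structure above is not itself Newick syntax. A sketch of how it could be serialized into a Newick string follows, assuming a hypothetical in-memory representation in which a leaf is `(name, length)` and an internal node is `(list_of_children, length)`.

```python
def to_newick(node) -> str:
    """Serialize a nested (label_or_children, branch_length) tree to Newick text."""
    label, length = node
    if isinstance(label, list):
        text = "(" + ",".join(to_newick(child) for child in label) + ")"
    else:
        text = label
    # The root carries no branch length.
    return text if length is None else f"{text}:{length}"

# The tree shown above, with branch lengths copied from the page.
tree = ([("3AQN_1", 82.81), ([("7HAT_1", 78.5), ("5KKW_1", 78.5)], 4.31)], None)
newick = to_newick(tree) + ";"
print(newick)
```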

Let \(d\) be the Otu--Sayood distance \(d\).
Let \(d_1\) be the Otu--Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)\left(\frac{652}{\log_{20} 652}-\frac{237}{\log_{20} 237}\right)\approx 116\), where \(652=415+237\) is the length of the concatenation of 3AQN_1 and 7HAT_1.
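The arithmetic in this estimate can be checked with a few lines. The factors 0.85 and 0.8 are taken from the page as given; the function name and parameters are hypothetical.

```python
import math

def expected_distance(n_concat: int, n_single: int,
                      factor: float = 0.85 * 0.8, alphabet_size: int = 20) -> float:
    """Evaluate factor * (n_concat / log n_concat - n_single / log n_single)."""
    def scale(n: int) -> float:
        return n / math.log(n, alphabet_size)
    return factor * (scale(n_concat) - scale(n_single))

# Lengths from the table: 3AQN_1 has 415 residues, 7HAT_1 has 237.
estimate = expected_distance(415 + 237, 237)
print(round(estimate, 1))
```

The value falls between 116 and 117, in line with the figure quoted above.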
Status Protein1 Protein2 d d1/2
Query variables 3AQN_1 7HAT_1 149 114

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \mathrm{min} + (1-\alpha) \mathrm{max}= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
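A sketch of how \(\delta\) interpolates between the two distances follows. It assumes the usual Otu--Sayood definitions, \(d(x,y)=\max\{c(xy)-c(x),\,c(yx)-c(y)\}\) and \(d_1(x,y)=(c(xy)-c(x))+(c(yx)-c(y))\), with \(\min\) and \(\max\) in the formula above taken over those two increments, and it uses an LZ76 component count for \(c\). Both the choice of \(c\) and all function names are assumptions, so the output need not match the table's values.

```python
def lz76_complexity(w: str) -> int:
    """LZ76 exhaustive-history component count (assumed complexity measure c)."""
    i, count, n = 0, 0, len(w)
    while i < n:
        length = 1
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def increments(x: str, y: str) -> tuple[int, int]:
    """Return (c(xy) - c(x), c(yx) - c(y))."""
    c = lz76_complexity
    return c(x + y) - c(x), c(y + x) - c(y)

def delta(x: str, y: str, alpha: float) -> float:
    """alpha * min + (1 - alpha) * max of the two increments."""
    low, high = sorted(increments(x, y))
    return alpha * low + (1 - alpha) * high

def d(x: str, y: str) -> int:
    return max(increments(x, y))

def d1(x: str, y: str) -> int:
    return sum(increments(x, y))
```

Under these assumptions, `delta(x, y, 0)` equals `d(x, y)` and `delta(x, y, 0.5)` equals `d1(x, y) / 2`, matching the two cases displayed above.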