CoV2D BrowserTM

Parikh vectors
7BZF_1 4GEO_1 9ICD_1 Letter Amino acid
56 14 15 N Asparagine
28 15 11 Q Glutamine
33 21 37 I Isoleucine
51 19 31 K Lysine
43 19 17 R Arginine
29 4 6 C Cysteine
45 15 40 G Glycine
83 42 31 L Leucine
30 17 20 P Proline
74 22 30 V Valine
64 31 38 A Alanine
31 23 35 E Glutamic acid
37 17 5 H Histidine
25 10 13 M Methionine
57 13 10 F Phenylalanine
53 20 13 S Serine
61 19 18 T Threonine
9 5 6 W Tryptophan
75 27 25 D Aspartic acid
58 14 15 Y Tyrosine
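
A Parikh vector simply counts how often each letter occurs in a word, so each column of the table sums to the sequence length (for example, the 7BZF_1 column sums to 942 and the 4GEO_1 column to 367). A minimal sketch of the computation follows; the function name `parikh_vector` and the constant `AMINO_ACIDS` are illustrative choices, not part of the CoV2D code.

```python
from collections import Counter

# The 20 amino-acid letters, in the row order used by the table above.
AMINO_ACIDS = "NQIKRCGLPVAEHMFSTWDY"

def parikh_vector(sequence: str) -> dict[str, int]:
    """Count the occurrences of each amino-acid letter in `sequence`."""
    counts = Counter(sequence)
    return {letter: counts.get(letter, 0) for letter in AMINO_ACIDS}

# Toy input (a short made-up fragment, not one of the page's sequences).
toy = parikh_vector("MHHHHHHAS")
```

The entries of the returned dictionary always add up to the number of standard letters in the input, which gives a quick sanity check against the Length column.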

>7BZF_1|Chain A|RNA-directed RNA polymerase|Severe acute respiratory syndrome coronavirus 2 (2697049)
>4GEO_1|Chain A|Mitogen-activated protein kinase 14|Homo sapiens (9606)
>9ICD_1|Chain A|ISOCITRATE DEHYDROGENASE|Escherichia coli (562)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
7BZF , Knot 354 942 0.85 40 326 862
SADAQSFLNRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKTNCCRFQEKDEDDNLIDSYFVVKRHTFSNYQHEETIYNLLKDCPAVAKHDFFKFRIDGDMVPHISRQRLTKYTMADLVYALRHFDEGNCDTLKEILVTYNCCDDDYFNKKDWYDFVENPDILRVYANLGERVRQALLKTVQFCDAMRNAGIVGVLTLDNQDLNGNWYDFGDFIQTTPGSGVPVVDSYYSLLMPILTLTRALTAESHVDTDLTKPYIKWDLLKYDFTEERLKLFDRYFKYWDQTYHPNCVNCLDDRCILHCANFNVLFSTVFPPTSFGPLVRKIFVDGVPFVVSTGYHFRELGVVHNQDVNLHSSRLSFKELLVYAADPAMHAASGNLLLDKRTTCFSVAALTNNVAFQTVKPGNFNKDFYDFAVSKGFFKEGSSVELKHFFFAQDGNAAISDYDYYRYNLPTMCDIRQLLFVVEVVDKYFDCYDGGCINANQVIVNNLDKSAGFPFNKWGKARLYYDSMSYEDQDALFAYTKRNVIPTITQMNLKYAISAKNRARTVAGVSICSTMTNRQFHQKLLKSIAATRGATVVIGTSKFYGGWHNMLKTVYSDVENPHLMGWDYPKCDRAMPNMLRIMASLVLARKHTTCCSLSHRFYRLANECAQVLSEMVMCGGSLYVKPGGTSSGDATTAYANSVFNICQAVTANVNALLSTDGNKIADKYVRNLQHRLYECLYRNRDVDTDFVNEFYAYLRKHFSMMILSDDAVVCFNSTYASQGLVASIKNFKSVLYYQNNVFMSEAKCWTETDLTKGPHEFCSQHTMLVKQGDDYVYLPYPDPSRILGAGCFVDDIVKTDGTLMIERFVSLAIDAYPLTKHPNQEYADVFHLYLQYIRKLHDELTGHMLDMYSVMLTNDNTSRYWEPEFYEAMYTPHTVLQHHHHHHHHHH
4GEO , Knot 159 367 0.85 40 227 350
MHHHHHHASQERPTFYRQELNKTIWEVPERYQNLSPVGSGAYGSVCAAFDTKTGLRVAVKKLSRPFQSIIHAKRTYRELRLLKHMKHENVIGLLDVFTPARSLEEFNDVYLVTHLMGADLNNIVKCQKLTDDHVQFLIYQILRGLKYIHSADIIHRDLKPSNLAVNEDCELKILDFGLARHTDDEMTGYVATRWYRAPEIAANWMHYNQTVDIWSVGCIMAELLTGRTLFPGTDAADQLKLILRLVGTPGAELLKKISSESARNAIQSLTQMPKMNFANVFIGANPLAVDLLEKMLVLDSDKRITAAQALAHAYFAQYHDPDDEPVADPYDQSFESRDLLIDEWKSLTYDEVISFVPPPLDQEEMES
9ICD , Knot 175 416 0.84 40 228 391
MESKVVVPAQGKKITLQNGKLNVPENPIIPYIEGDGIGVDVTPAMLKVVDAAVEKAYKGERKISWMEIYTGEKSTQVYGQDVWLPAETLDLIREYRVAIKGPLTTPVGGGIRSLNVALRQELDLYICLRPVRYYQGTPSPVKHPELTDMVIFRENSEDIYAGIEWKADSADAEKVIKFLREEMGVKKIRFPEHCGIGIKPCSEEGTKRLVRAAIEYAIANDRDSVTLVHKGNIMKFTEGAFKDWGYQLAREEFGGELIDGGPWLKVKNPNTGKEIVIKDVIADAFLQQILLRPAEYDVIACMNLNGDYISDALAAQVGGIGIAPGANIGDECALFEATHGTAPKYAGQDKVNPGSIILSAEMMLRHMGWTEAADLIVKGMEGAINAKTVTYDFERLMDGAKLLKCSEFGDAIIENM
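
The normalized column \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be rechecked from the LZ-complexity and length columns alone. The sketch below does this for the three rows; the helper name `normalized_lz` is an assumption made for illustration. The computed values are consistent with the table if the page truncates to two decimals rather than rounding, which is an inference from the numbers, not something the page states.

```python
import math

def normalized_lz(lz: int, n: int, alphabet_size: int = 20) -> float:
    """Return LZ(w) / (n / log_b n), with b the alphabet size (20 amino acids)."""
    return lz / (n / math.log(n, alphabet_size))

# (LZ-complexity, length) pairs copied from the table above.
rows = {"7BZF": (354, 942), "4GEO": (159, 367), "9ICD": (175, 416)}
ratios = {code: normalized_lz(lz, n) for code, (lz, n) in rows.items()}
```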

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\). Let \(p_w(n)\) be the cardinality of \(P_w(n)\). Let \(f(c)\) be the sequence in FASTA with 4-symbol Protein Data Bank code \(c\).
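
As an illustration of these definitions, the set of subwords of a fixed length can be collected with a sliding window. This is a minimal sketch on plain Python strings; the names `subwords` and `subword_complexity` are hypothetical and do not come from the CoV2D code.

```python
def subwords(w: str, k: int) -> set[str]:
    """Distinct contiguous subwords of length k in w (the set P_w(k))."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def subword_complexity(w: str, k: int) -> int:
    """Number of distinct length-k subwords of w (the value p_w(k))."""
    return len(subwords(w, k))
```

For example, the word `ABAB` has the two length-2 subwords `AB` and `BA`, so its complexity at length 2 is 2.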

\(|P_{f(7BZF_1)}(2) \setminus P_{f(4GEO_1)}(2)|=123\), \(|P_{f(4GEO_1)}(2) \setminus P_{f(7BZF_1)}(2)|=24\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
010100110010110110101010100001100110100001111101100000010000000011000111000010000000010011000111100011010101011101000010000110110110010010000100111000000000100001001100101101010110010011100101001100111111101000010101001101100011011111000001111110100110100010001001010101100010000101100010010000010010010000110010101110011110011111001110111111001001001111000010100001010011101101110110101110000001011110001110010110100010011100111001001010011110010111000000000110100100111110110001000011010100111001000111110011010100001000000111100000111010010100110100010011110100010000100011001110011011110001011100110010001001011110010000111011011101111000000001000100110001011001110110101011100010100101001101001101010111000100110001001000100010000010001100101010001011110001110100001001111010010011000001110010010000100110010000011100100010110101001111101100110001011100110111010110001000010110101001001000101011010011100000000101010011001001100000000000
Pair \(Z_2\) Length of longest common subword
7BZF_1,4GEO_1 147 6
7BZF_1,9ICD_1 152 4
4GEO_1,9ICD_1 171 4
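
Both columns of this table can be sketched in a few lines. \(Z_k\) is the size of the symmetric difference of the two subword sets, so the first row is \(123+24=147\). The last column appears to count contiguous matches: 7BZF_1 and 4GEO_1 both contain the run HHHHHH, which agrees with the value 6. The function names below are illustrative assumptions, and the second function uses a standard dynamic-programming scheme that is not necessarily what the page runs.

```python
def subwords(w: str, k: int) -> set[str]:
    """Distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x: str, y: str, k: int) -> int:
    """Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

def longest_common_subword(x: str, y: str) -> int:
    """Length of the longest contiguous block shared by x and y."""
    best = 0
    prev = [0] * (len(y) + 1)  # prev[j]: match length ending at previous x letter and y[j-1]
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

The dynamic program takes time proportional to the product of the two lengths, which is acceptable for sequences of a few hundred to a thousand residues.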

Newick tree

 
[
	9ICD_1:83.20,
	[
		7BZF_1:73.5,4GEO_1:73.5
	]:9.70
]

Let d be the Otu–Sayood distance d.
Let d1 be the Otu–Sayood distance d1. (This makes the 4TYN sequence AAAAAA a close match...)
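
The page does not spell out the two distances. The sketch below assumes the usual Otu–Sayood definitions, \(d(S,Q)=\max\{c(SQ)-c(S),\,c(QS)-c(Q)\}\) and \(d_1(S,Q)=c(SQ)-c(S)+c(QS)-c(Q)\), where \(c\) counts the components of the LZ76 exhaustive history. The function names and the quadratic-time parser are illustrative assumptions and may differ from the LZ variant the page uses.

```python
def lz76(s: str) -> int:
    """Number of components in the LZ76 exhaustive history of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the phrase while it can still be copied from earlier text
        # (the copied source is allowed to overlap the phrase itself).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def otu_sayood(s: str, q: str) -> tuple[int, int]:
    """Return (d, d1) under the definitions assumed in the text above."""
    gain_sq = lz76(s + q) - lz76(s)  # new components needed for q after s
    gain_qs = lz76(q + s) - lz76(q)  # new components needed for s after q
    return max(gain_sq, gain_qs), gain_sq + gain_qs
```

A highly repetitive word adds very few components when appended to another word, which is one way to read the remark about the sequence AAAAAA.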
Roughly speaking, an expected distance is \((0.85)(0.8)\left(\frac{1309}{\log_{20} 1309}-\frac{367}{\log_{20} 367}\right)=244\).
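
The arithmetic in this estimate can be rechecked directly. Note that \(1309 = 942 + 367\), the combined length of the two query sequences. The sketch below takes the factors 0.85 and 0.8 as given; the helper name `n_over_log` is an assumption. The expression evaluates to roughly 244.9, which matches the stated 244 if the page truncates.

```python
import math

def n_over_log(n: int, base: int = 20) -> float:
    """Return n / log_base(n), the scale used to normalize LZ-complexity."""
    return n / math.log(n, base)

n1, n2 = 942, 367  # lengths of 7BZF_1 and 4GEO_1 from the table
estimate = 0.85 * 0.8 * (n_over_log(n1 + n2) - n_over_log(n2))
```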
Status Protein1 Protein2 d d1/2
Query variables 7BZF_1 4GEO_1 309 211.5
Was not able to put for d
Was not able to put for d1

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \min + (1-\alpha) \max= \begin{cases} d & \alpha=0,\\ d_1/2 & \alpha=1/2. \end{cases} \]
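
The case distinction says that \(d\) is the larger of two quantities and \(d_1/2\) is their average, so the smaller quantity can be recovered as \(d_1 - d\). A short sketch using the query values from the table (\(d=309\), \(d_1/2=211.5\)) follows; the names `delta`, `lo`, and `hi` are illustrative.

```python
def delta(alpha: float, lo: float, hi: float) -> float:
    """Interpolate between the smaller (lo) and larger (hi) quantity."""
    return alpha * lo + (1 - alpha) * hi

d, half_d1 = 309, 211.5  # values from the query row above
hi = d                   # alpha = 0 gives the maximum, which is d
lo = 2 * half_d1 - d     # min = d1 - max, here 114.0
```

With these values, `delta(0, lo, hi)` returns 309 and `delta(0.5, lo, hi)` returns 211.5, reproducing the two tabulated columns.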