CoV2D Browser™

Parikh vectors
8ZQD_1 3FTS_1 2YIC_1 Letter Amino acid
27 40 44 T Threonine
28 23 62 R Arginine
32 21 28 N Asparagine
68 41 68 E Glutamic acid
49 69 80 L Leucine
17 11 19 M Methionine
33 34 59 D Aspartic acid
14 26 36 Q Glutamine
61 30 68 G Glycine
61 31 44 I Isoleucine
37 45 47 S Serine
48 38 74 A Alanine
30 27 35 F Phenylalanine
6 13 9 W Tryptophan
18 22 24 Y Tyrosine
20 11 5 C Cysteine
12 16 26 H Histidine
56 40 39 K Lysine
18 35 40 P Proline
39 38 61 V Valine
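
The Parikh vector of a sequence counts how often each letter occurs, so its entries sum to the sequence length (for 8ZQD_1 the first column sums to 674). A minimal sketch, assuming the row order of the table above; the names `AMINO_ACIDS` and `parikh_vector` are illustrative and not taken from the CoV2D code:

```python
from collections import Counter

# Letter order copied from the rows of the Parikh-vector table (an assumption
# about presentation only; any fixed order of the 20 letters would do).
AMINO_ACIDS = "TRNELMDQGISAFWYCHKPV"

def parikh_vector(seq, alphabet=AMINO_ACIDS):
    """Return the list of letter counts of seq, in the order given by alphabet."""
    counts = Counter(seq)
    return [counts[a] for a in alphabet]

# First six residues of 8ZQD_1, used here only as a small example.
vec = parikh_vector("MGDNKK")
print(dict(zip(AMINO_ACIDS, vec)))
```

Summing the vector recovers the length of the input whenever every letter of the sequence belongs to the alphabet.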

>8ZQD_1|Chains A, B|[FeFe]-hydrogenase|Clostridium beijerinckii (1520)
>3FTS_1|Chain A|Leukotriene A-4 hydrolase|Homo sapiens (9606)
>2YIC_1|Chains A, B, C, D|2-OXOGLUTARATE DECARBOXYLASE|MYCOBACTERIUM SMEGMATIS (1772)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
8ZQD , Knot 264 674 0.85 40 278 609
MGDNKKSFIQSALGSVFSVFSEEELKELSNGRKIAICGKVNNPGIIEVPEGATLNEIIQLCGGLINKSNFKAAQIGLPFGGFLTEDSLDKEFDFGIFYENIARTIIVLSQEDCIIQFEKFYIEYLLAKIKDGSYKNYEVVKEDITEMFNILNRISKGVSNMREIYLLRNLAVTVKSKMNQKHNIMEEIIDKFYEEIEEHIEEKKCYTSQCNHLVKLTITKKCIGCGACKRACPVDCINGELKKKHEIDYNRCTHCGACVSACPVDAISAGDNTMLFLRDLATPNKVVITQMAPAVRVAIGEAFGFEPGENVEKKIAAGLRKLGVDYVFDTSWGADLTIMEEAAELQERLERHLAGDESVKLPILTSCCPSWIKFIEQNYGDMLDVPSSAKSPMEMFAIVAKEIWAKEKGLSRDEVTSVAIMPCIAKKYEASRAEFSVDMNYDVDYVITTRELIKIFENSGINLKEIEDEEIDTVMGEYTGAGIIFGRTGGVIEAATRTALEKMTGERFDNIEFEGLRGWDGFRVCELEAGDIKLRIGVAHGLREAAKMLDKIRSGEEFFHAIEIMACVGGCIGGGGQPKTKGNKQAALQKRAEGLNNIDRSKTLRRSNENPEVLAIYEKYLDHPLSNKAHELLHTVYFPRVKKDDIWSVGVKLFGGGSGGGSGGGSWSHPQFEK
3FTS , Knot 247 611 0.86 40 286 571
MPEIVDTCSLASPASVCRTKHLHLRCSVDFTRRTLTGTAALTVQSQEDNLRSLVLDTKDLTIEKVVINGQEVKYALGERQSYKGSPMEISLPIALSKNQEIVIEISFETSPKSSALQWLTPEQTSGKEHPYLFSQCQAIHCRAILPCQDTPSVKLTYTAEVSVPKELVALMSAIRDGETPDPEDPSRKIYKFIQKVPIPCYLIALVVGALESRQIGPRTLVWSEKEQVEKSAYEFSETESMLKIAEDLGGPYVWGQYDLLVLPPSFPYGGMENPCLTFVTPTLLAGDKSLSNVIAHEISHSWTGNLVTNKTWDHFWLNEGHTVYLERHICGRLFGEKFRHFNALGGWGELQNSVKTFGETHPFTKLVVDLTDIDPDVAYSSVPYEKGFALLFYLEQLLGGPEIFLGFLKAYVEKFSYKSITTDDWKDFLYSYFKDKVDVLNQVDWNAWLYSPGLPPIKPNYDMTLTNACIALSQRWITAKEDDLNSFNATDLKDLSSHQLNEFLAQTLQRAPLPLGHIKRMQEVYNFNAINNSEIRFRWLRLCIQSKWEDAIPLALKMATEQGRMKFTRPLFKDLAAFDKSHDQAVRTYQEHKASMHPVTAMLVGKDLKVD
2YIC , Knot 326 868 0.84 40 298 784
GDSIEDKNARVIELIAAYRNRGHLMADIDPLRLDNTRFRSHPDLDVNSHGLTLWDLDREFKVDGFAGVQRKKLRDILSVLRDAYCRHVGVEYTHILEPEQQRWIQERVETKHDKPTVAEQKYILSKLNAAEAFETFLQTKYVGQKRFSLEGAETVIPMMDAVIDQCAEHGLDEVVIAMPHRGRLNVLANIVGKPYSQIFSEFEGNLNPSQAHGSGDVKYHLGATGTYIQMFGDNDIEVSLTANPSHLEAVDPVLEGLVRAKQDLLDTGEEGSDNRFSVVPLMLHGDAAFAGQGVVAETLNLALLRGYRTGGTIHIVVNNQIGFTTAPTDSRSSEYCTDVAKMIGAPIFHVNGDDPEACAWVARLAVDFRQAFKKDVVIDMLCYRRRGHNEGDDPSMTQPYMYDVIDTKRGSRKAYTEALIGRGDISMKEAEDALRDYQGQLERVFNEVRELEKHEIEPSESVEADQQIPSKLATAVDKAMLQRIGDAHLALPEGFTVHPRVRPVLEKRREMAYEGRIDWAFAELLALGSLIAEGKLVRLSGQDTQRGTFTQRHAVIVDRKTGEEFTPLQLLATNPDGTPTGGKFLVYNSALSEFAAVGFEYGYSVGNPDAMVLWEAQFGDFVNGAQSIIDEFISSGEAKWGQLSDVVLLLPHGHEGQGPDHTSGRIERFLQLWAEGSMTIAMPSTPANYFHLLRRHGKDGIQRPLIVFTPKSMLRNKAAVSDIRDFTESKFRSVLEEPMYTDGEGDRNKVTRLLLTSGKIYYELAARKAKENREDVAIVRIEQLAPLPRRRLAETLDRYPNVKEKFWVQEEPANQGAWPSFGLTLPEILPDHFTGLKRISRRAMSAPSSGSSKVHAVEQQEILDTAFG
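
The fourth column divides the LZ-complexity by \(n/\log_{20} n\). A minimal sketch of both quantities follows, assuming the exhaustive-history parse of Lempel and Ziv (1976); the page does not say which variant it uses, so counts from this code may differ from the table. The names `lz76_complexity` and `normalized_lz` are illustrative.

```python
import math

def lz76_complexity(s):
    """Number of phrases in an LZ76-style parse of s (assumed variant).

    Each phrase is the shortest block starting at position i that does not
    occur in the text preceding its own last symbol (overlaps allowed).
    """
    i, phrases, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from earlier text.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases

def normalized_lz(lz, n, alphabet_size=20):
    """LZ(w) divided by n / log_b(n), with b the alphabet size."""
    return lz * math.log(n, alphabet_size) / n

print(lz76_complexity("0001101001000101"))  # 6, parsed as 0|001|10|100|1000|101
print(normalized_lz(264, 674))              # about 0.85, as in the 8ZQD row
```

The quadratic-time substring search keeps the sketch short; for long sequences a suffix-based index would be the usual replacement.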

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)=|P_w(n)|\) be its cardinality. Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
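
As a sketch of these definitions, the set of length-\(k\) subwords can be collected with a sliding window; `subwords` and `subword_complexity` are illustrative names, not part of the CoV2D code:

```python
def subwords(w, k):
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def subword_complexity(w, k):
    """Number of distinct length-k subwords of w."""
    return len(subwords(w, k))

# First six residues of 8ZQD_1: the subwords are MG, GD, DN, NK, KK.
print(sorted(subwords("MGDNKK", 2)))
print(subword_complexity("MGDNKK", 2))  # 5
```

If \(k\) exceeds the length of the word, the window range is empty and the complexity is 0.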

\(|P_{f(8ZQD_1)}(2) \setminus P_{f(3FTS_1)}(2)|=59\), \(|P_{f(3FTS_1)}(2) \setminus P_{f(8ZQD_1)}(2)|=67\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
11000001100111011011000010010010011101010011110110110100110101111000010110111111111000010001011110001100111100000110100101001110100100000011000100110110010011001001011001110100010000011001100100010001000000000000110101000011011000101100101010000010000000011010101101101100011110011010011100111110111101111011001000111110011100110001110101100110100010001110001011110000101101100001011011001001101111110011100011000010011111011000010010101010001001100001101100011010010000100111000111111100111101100011001010010010101101101101001011010101111011001101100100100110110111011101111101000100011100010110010000010000001011110000100110001001100101101000011011101111101110111010010100
Pair \(Z_2\) Length of longest common substring
8ZQD_1,3FTS_1 126 4
8ZQD_1,2YIC_1 120 4
3FTS_1,2YIC_1 116 4
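
The two columns of this table can be sketched as follows. \(Z_k\) is the size of the symmetric difference of the two length-\(k\) subword sets, so for the first pair \(59+67=126\). The last column is read here as the longest common contiguous subword, which is an assumption: a value of 4 is too small to be a non-contiguous subsequence of sequences this long. Function names are illustrative.

```python
def subwords(w, k):
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x, y, k):
    """Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

def longest_common_substring_length(x, y):
    """Length of the longest contiguous block shared by x and y (row-by-row DP)."""
    best = 0
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0] * (len(y) + 1)
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                # Extend the common block ending at the previous positions.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

print(z_distance("ABAB", "ABBA", 2))                   # 1: only BB is unshared
print(longest_common_substring_length("ABAB", "ABBA"))  # 2: AB (or BA)
```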

Newick tree

 
(8ZQD_1:62.64,(2YIC_1:58,3FTS_1:58):4.64);

Let \(d\) denote the Otu–Sayood distance of the same name.
Let \(d_1\) denote the Otu–Sayood distance of the same name. (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)\left(\frac{1285}{\log_{20} 1285}-\frac{611}{\log_{20} 611}\right)\approx 171\), where \(1285=674+611\) is the combined length of the two query sequences.
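
The arithmetic of this estimate can be checked with a short sketch. The factors 0.85 and 0.8 are taken from the page as they stand, and the function names are illustrative:

```python
import math

def lz_scale(n, base=20):
    """The normalizing quantity n / log_base(n)."""
    return n / math.log(n, base)

def expected_distance(n_concat, n_single, ratio=0.85, factor=0.8):
    """ratio * factor * (scale of the concatenation - scale of one sequence)."""
    return ratio * factor * (lz_scale(n_concat) - lz_scale(n_single))

# 674 and 611 are the lengths of 8ZQD_1 and 3FTS_1 from the table above.
print(expected_distance(674 + 611, 611))
```

This evaluates to roughly 171.6, which matches the reported 171 if the page truncates rather than rounds.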
Status Protein1 Protein2 d d1/2
Query variables 8ZQD_1 3FTS_1 215 204

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \min + (1-\alpha) \max= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2. \end{cases} \]
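
A minimal sketch of this interpolation, assuming that min and max range over the two one-sided complexity increases \(c(xy)-c(x)\) and \(c(yx)-c(y)\) as in the Otu–Sayood definitions. The value 193 below is not stated on the page: it is inferred from the query row, since \(d_1 = 2\cdot 204 = 408\) and \(d = 215\) give \(408-215=193\).

```python
def delta(a, b, alpha):
    """alpha * min(a, b) + (1 - alpha) * max(a, b)."""
    lo, hi = min(a, b), max(a, b)
    return alpha * lo + (1 - alpha) * hi

# a, b: the two one-sided increases for 8ZQD_1 and 3FTS_1
# (215 is d from the table; 193 is inferred as described above).
a, b = 193, 215
print(delta(a, b, 0))    # 215, the distance d
print(delta(a, b, 0.5))  # 204.0, the value d1/2
```

Values of \(\alpha\) between 0 and 1/2 move continuously from the max-based distance to the averaged one.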