CoV2D BrowserTM

Parikh vectors
4UEG_1 3AVX_1 7TSM_1 Letter Amino acid
12 36 14 Y Tyrosine
13 113 28 G Glycine
8 85 17 I Isoleucine
14 75 12 K Lysine
8 97 27 E Glutamic acid
19 32 11 H Histidine
16 71 20 T Threonine
20 97 26 V Valine
15 116 39 A Alanine
13 41 15 N Asparagine
20 76 18 D Aspartic acid
15 29 26 Q Glutamine
11 54 43 P Proline
4 28 6 M Methionine
18 52 17 F Phenylalanine
26 80 28 S Serine
6 8 12 W Tryptophan
9 75 29 R Arginine
4 20 10 C Cysteine
37 104 42 L Leucine
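
A Parikh vector simply records how often each letter occurs in a sequence. The following is a minimal sketch of that count, not the browser's own code; the function name `parikh_vector` and the alphabet ordering are assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino-acid letters; this ordering is arbitrary (an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def parikh_vector(seq, alphabet=AMINO_ACIDS):
    """Return {letter: number of occurrences in seq} for every letter of alphabet."""
    counts = Counter(seq)
    return {a: counts.get(a, 0) for a in alphabet}

if __name__ == "__main__":
    # First 15 residues of the 4UEG_1 sequence shown on this page.
    pv = parikh_vector("MTDQAFVTLATNDIY")
    print(pv["T"], pv["D"], pv["W"])
```

Applied to a full FASTA sequence, each entry of the returned dictionary corresponds to one cell of a column in the table above.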

>4UEG_1|Chains A, B|GLYCOGENIN-2|HOMO SAPIENS (9606)
>3AVX_1|Chain A|Elongation factor Ts, Elongation factor Tu, LINKER, Q beta replicase|Escherichia coli O157:H7 (83334)
>7TSM_1|Chains A, B, C, D|Nitric oxide synthase, endothelial|Homo sapiens (9606)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
4UEG , Knot 129 288 0.84 40 184 277
MTDQAFVTLATNDIYCQGALVLGQSLRRHRLTRKLVVLITPQVSDLLRRILSKVFDEVIEVNLIDSADYIHLAFLKRPELGLTLTKLHCWTLTHYSKCVFLDADTLVLSNVDELFDRGEFSAAPDPGWPDCFNSGVFVFQPSLHTHKLLLQHAMEHGSFDGADQGLLNSFFRNWSTTDIHKHLPFIYNLSSNTMYTYSPAFKQFGSSAKVVHFLGSMKPWNYKYNPQSGSVLEQGSVSSSQHQAAFLHLWWTVYQNNVLPLYKSVQAENLYFQSHHHHHHDYKDDDDK
3AVX , Knot 456 1289 0.84 40 336 1104
MAEITASLVKELRERTGAGMMDCKKALTEANGDIELAIENMRKSGAIKAAKKAGNVAADGVIKTKIDGNYGIILEVNCQTDFVAKDAGFQAFADKVLDAAVAGKITDVEVLKAQFEEERVALVAKIGENINIRRVAALEGDVLGSYQHGARIGVLVAAKGADEELVKHIAMHVAASKPEFIKPEDVSAEVVEKEYQVQLDIAMQSGKPKEIAEKMVEGRMKKFTGEVSLTGQPFVMEPSKTVGQLLKEHNAEVTGFIRFEVGEGIEKVETDFAAEVAAMSKQSHMSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAARAFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFLNKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALEGDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRGTVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGRAGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMVVTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLSGASGAAGGGGSGGGGSMSKTASSRNSLSAQLRRAANTRIEVEGNLALSIANDLLLAYGQSPFNSEAECISFSPRFDGTPDDFRINYLKAEIMSKYDDFSLGIDTEAVAWEKFLAAEAECALTNARLYRPDYSEDFNFSLGESCIHMARRKIAKLIGDVPSVEGMLRHCRFSGGATTTNNRSYGHPSFKFALPQACTPRALKYVLALRASTHFDIRISDISPFNKAVTVPKNSKTDRCIAIEPGWNMFFQLGIGGILRDRLRCWGIDLNDQTINQRRAHEGSVTNNLATVDLSAASDSISLALCELLLPPGWFEVLMDLRSPKGRLPDGSVVTYEKISSMGNGYTFELESLIFASLARSVCEILDLDSSEVTVYGDDIILPSCAVPALREVFKYVGFTTNTKKTFSEGPFRESCGKHYYSGVDVTPFYIRHRIVSPADLILVLNNLYRWATIDGVWDPRAHSVYLKYRKLLPKQLQRNTIPDGYGDGALVGSVLINPFAKNRGWIRYVPVITDHTRDRERAELGSYLYDLFSRCLSESNDGLPLRGPSGCDSADLFAIDQLICRSNPTKISRSTGKFDIQYIACSSRVLAPYGVFQGTKVASLHEAHHHHHH
7TSM , Knot 186 440 0.85 40 246 410
APASLLPPAPEHSPPSSPLTQPPEGPKFPRVKNWEVGSITYDTLSAQAQQDGPCTPRRCLGSLVFPRKLQGRPSPGPPAPEQLLSQARDFINQYYSSIKRSGSQAHEQRLQEVEAEVAATGTYQLRESELVFGAKQAWRNAPRCVGRIQWGKLQVFDARDCRSAQEMFTYICNHIKYATNRGNLRSAITVFPQRCPGRGDFRIWNSQLVRYAGYRQQDGSVRGDPANVEITELCIQHGWTPGNGRFDVLPLLLQAPDEPPELFLLPPELVLEVPLEHPTLEWFAALGLRWYALPAVSNMLLEIGGLEFPAAPFSGWYMSTEIGTRNLCDPHRYNILEDVAVCMDLDTRTTSSLWKDKAAVEINVAVLHSYQLAKVTIVDHHAATASFMKHLENEQKARGGCPADWAWIVPPISGSLTPVFHQEMVNYFLSPAFRYQPDPW
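
The normalized column \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be checked directly from the LZ-complexity and length columns. The sketch below does that, and also includes one common LZ76-style phrase count (each new phrase is the shortest word not yet seen as a subword of the preceding text). The page does not state which LZ variant it uses, so `lz76_phrases` is an assumption and need not reproduce the tabulated values; both function names are hypothetical.

```python
import math

def normalized_lz(lz, n, alphabet_size=20):
    """LZ(w) / (n / log_b n), with b the alphabet size (20 amino acids)."""
    return lz / (n / math.log(n, alphabet_size))

def lz76_phrases(w):
    """Number of phrases in an LZ76-style parse of w (assumed variant)."""
    i, count, n = 0, 0, len(w)
    while i < n:
        length = 1
        # Extend the phrase while it still occurs in the text that precedes
        # its own last letter (overlapping occurrences are allowed).
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

if __name__ == "__main__":
    # (LZ-complexity, length) pairs taken from the table above.
    for code, lz, n in [("4UEG", 129, 288), ("3AVX", 456, 1289), ("7TSM", 186, 440)]:
        print(code, normalized_lz(lz, n))
```

The ratios computed this way agree with the tabulated 0.84, 0.84 and 0.85 if the page truncates to two decimals.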

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)\) be the cardinality of \(P_w(n)\). Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
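
As an illustration of these definitions, here is a minimal sketch that collects the distinct subwords of a given length with a sliding window and counts them. The names `subwords` and `subword_complexity` are hypothetical, and the subword length is passed as `k` to avoid confusion with the sequence length.

```python
def subwords(w, k):
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def subword_complexity(w, k):
    """Number of distinct length-k subwords of w."""
    return len(subwords(w, k))

if __name__ == "__main__":
    print(sorted(subwords("banana", 2)))
    print(subword_complexity("banana", 3))
```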

\(|P_{f(4UEG_1)}(2) \setminus P_{f(3AVX_1)}(2)|=24\), \(|P_{f(3AVX_1)}(2) \setminus P_{f(4UEG_1)}(2)|=176\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
100011101100010001111110010000100011111010100110011001100110101100100101111001011101001001010000001110100111001001100101011101111001001111101010000111001100101011001110011001000010001111001000010000111001100101101110101100000100101100101000000111101110100001111000101001010000000000000000
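
The hydrophobic-polar string can be reproduced by mapping each residue to 1 or 0. The page does not list which residues it treats as hydrophobic; the set `AFGILMPVW` below is inferred by aligning the displayed binary string with the 4UEG_1 sequence, so treat it as an assumption. The name `hp_encode` is hypothetical.

```python
# Residues encoded as 1; inferred from the displayed string, not stated by the source.
HYDROPHOBIC = frozenset("AFGILMPVW")

def hp_encode(seq, hydrophobic=HYDROPHOBIC):
    """Map each residue to '1' (hydrophobic) or '0' (polar)."""
    return "".join("1" if a in hydrophobic else "0" for a in seq)

if __name__ == "__main__":
    # First 24 residues of the 4UEG_1 sequence.
    print(hp_encode("MTDQAFVTLATNDIYCQGALVLGQ"))
```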
Pair \(Z_2\) Length of longest common subword
4UEG_1,3AVX_1 200 6
4UEG_1,7TSM_1 192 4
3AVX_1,7TSM_1 144 5
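
Both columns of this table can be sketched in a few lines. `z_distance` is the symmetric-difference count \(Z_k\) defined above; for the first pair it corresponds to \(24+176=200\). The second helper computes the length of the longest common contiguous subword, which is an assumption about what the last column measures; it is consistent with the run `HHHHHH` that appears in both the 4UEG_1 and 3AVX_1 sequences. All function names are hypothetical.

```python
def subwords(w, k):
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x, y, k):
    """Z_k(x, y): size of the symmetric difference of the length-k subword sets."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

def longest_common_subword(x, y):
    """Length of the longest contiguous word occurring in both x and y."""
    best = 0
    prev = [0] * (len(y) + 1)  # prev[j]: common suffix length ending at y[j-1]
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, 1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

if __name__ == "__main__":
    print(z_distance("abab", "abba", 2))
    print(longest_common_subword("xxHHHHHHyy", "zzHHHHHH"))
```

The dynamic program uses time proportional to the product of the two lengths, which is adequate for sequences of the sizes shown here.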

Newick tree

 
[
	4UEG_1:10.27,
	[
		7TSM_1:72,3AVX_1:72
	]:33.27
]

Let d be the Otu–Sayood distance \(d\).
Let d1 be the Otu–Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)\left(\frac{1577}{\log_{20} 1577}-\frac{288}{\log_{20} 288}\right)\approx 332.\)
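
The arithmetic in this estimate can be checked numerically. In the sketch below, 1577 is the combined length \(288+1289\) of the two query sequences; the factors 0.85 and 0.8 are taken from the formula as given, and their origin is not stated on the page. The function names are hypothetical.

```python
import math

def lz_scale(n, alphabet_size=20):
    """n / log_b(n), the normalizing quantity used for LZ-complexity."""
    return n / math.log(n, alphabet_size)

def expected_distance(n_short, n_long, c1=0.85, c2=0.8):
    """c1 * c2 * (scale of the concatenation minus scale of the shorter word)."""
    return c1 * c2 * (lz_scale(n_short + n_long) - lz_scale(n_short))

if __name__ == "__main__":
    print(expected_distance(288, 1289))
```

The result lies between 332 and 333, matching the quoted value when truncated to an integer.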
Status Protein1 Protein2 d d1/2
Query variables 4UEG_1 3AVX_1 421 255.5
Was not able to put for d
Was not able to put for d1

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta = \alpha \min + (1-\alpha) \max = \begin{cases} d & \alpha=0,\\ d_1/2 & \alpha=1/2 \end{cases} \]
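
A small numerical check of this interpolation, as a sketch under stated assumptions: the min and max are taken over the two one-sided complexity increments \(c(xy)-c(x)\) and \(c(yx)-c(y)\), with \(d\) their maximum and \(d_1\) their sum, which is the usual form of the Otu–Sayood distances. The page does not list the increments themselves. The value 90 below is back-computed from the tabulated \(d=421\) and \(d_1/2=255.5\) under that assumption. Function names are hypothetical.

```python
def otu_sayood(inc_xy, inc_yx):
    """Return (d, d1) from the two increments c(xy)-c(x) and c(yx)-c(y)."""
    return max(inc_xy, inc_yx), inc_xy + inc_yx

def delta(alpha, inc_xy, inc_yx):
    """alpha * min + (1 - alpha) * max of the two increments."""
    lo, hi = min(inc_xy, inc_yx), max(inc_xy, inc_yx)
    return alpha * lo + (1 - alpha) * hi

if __name__ == "__main__":
    d, d1 = otu_sayood(90, 421)  # 90 is inferred, not read from the page
    print(d, d1 / 2)
    print(delta(0, 90, 421), delta(0.5, 90, 421))
```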