CoV2D Browser™

Parikh vectors
2CMY_1 8TAQ_1 8XVM_1 Letter Amino acid
34 0 97 S Serine
10 3 95 T Threonine
14 8 75 A Alanine
15 0 73 I Isoleucine
14 0 65 K Lysine
3 0 23 H Histidine
3 0 77 F Phenylalanine
8 0 61 P Proline
10 0 55 Y Tyrosine
2 0 39 R Arginine
4 0 47 E Glutamic acid
25 6 83 G Glycine
6 0 61 D Aspartic acid
12 4 30 C Cysteine
4 0 11 W Tryptophan
2 0 11 M Methionine
17 0 88 V Valine
16 0 83 N Asparagine
10 0 61 Q Glutamine
14 0 100 L Leucine
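
The counts in this table can be reproduced with a short sketch. It assumes that the Parikh vector of a sequence is its per-letter count; the name `parikh_vector` and the constant `ALPHABET` (which follows the row order of the table) are illustrative and do not come from the page.

```python
from collections import Counter

# Letters in the row order of the table above (an illustrative choice).
ALPHABET = "STAIKHFPYREGDCWMVNQL"

def parikh_vector(w: str) -> dict:
    """Return the number of occurrences of each alphabet letter in w."""
    counts = Counter(w)
    return {letter: counts.get(letter, 0) for letter in ALPHABET}

# The 8TAQ_1 sequence as listed on this page.
seq_8taq = "GAGCAGACCTGACGGAAATTA"
pv = parikh_vector(seq_8taq)

# Show only the nonzero entries.
print({letter: n for letter, n in pv.items() if n})
```

For 8TAQ_1 only the letters T, A, G and C have nonzero counts, which matches the zeros in the middle column of the table.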

>2CMY_1|Chain A|CATIONIC TRYPSIN|BOS TAURUS (9913)
>8TAQ_1|Chain A|DNA (5'-D(*GP*AP*GP*CP*AP*GP*AP*CP*CP*TP*GP*AP*CP*GP*GP*AP*AP*AP*TP*TP*A*(HT1))-3')|synthetic construct (32630)
>8XVM_1|Chain A|Spike glycoprotein|Severe acute respiratory syndrome coronavirus 2 (2697049)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
2CMY , Knot 101 223 0.81 40 140 216
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEGNEQFISASKSIVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKAPILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVCNYVSWIKQTIASN
8TAQ , Knot 10 21 0.48 8 14 18
GAGCAGACCTGACGGAAATTA
8XVM , Knot 447 1235 0.86 40 332 1079
SSQCVMPLFNLITTTQSYTNSFTRGVYYPDKVFRSSVLHLTQDLFLPFFSNVTWFHAISGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVFIKVCEFQFCNDPFLDVYHKNNKSWMESESGVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPIIGRDFPQGFSALEPLVDLPIGINITRFQTLLALNRSYLTPGDSSSGWTAGAADYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNVTNLCPFHEVFNATRFASVYAWNRTRISNCVADYSVLYNFAPFFAFKCYGVSPTKLNDLCFTNVYADSFVIKGNEVSQIAPGQTGNIADYNYKLPDDFTGCVIAWNSNKLDSKHSGNYDYWYRLFRKSKLKPFERDISTEIYQAGNKPCKGKGPNCYFPLQSYGFRPTYGVGHQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTKSNKKFLPFQQFGRDIVDTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVSVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEYVNNSYECDIPIGAGICASYQTQTKSRGSAGSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLKRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKYFGGFNFSQILPDPSKPSKRSPIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGPALQIPFPMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLFSTPSALGKLQDVVNHNAQALNTLVKQLSSKFGAISSVLNDILSRLDPPEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQLELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIASSGYIPEAPRDGQAYVRKDGEWVLLSTFLEGTKHHHHHH
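
As a rough illustration of the LZ-complexity and normalized columns, the sketch below counts the components of a Lempel-Ziv (1976) exhaustive parsing. It assumes that each new component is the shortest prefix of the remaining text that has not already occurred starting at an earlier position (overlap allowed); the page may use a different variant. The names `lz76_complexity` and `normalized_lz` are hypothetical.

```python
import math

def lz76_complexity(w: str) -> int:
    """Number of components in an LZ76-style exhaustive parsing of w."""
    i, count, n = 0, 0, len(w)
    while i < n:
        l = 1
        # Extend the candidate while w[i:i+l] already occurs at an
        # earlier start position (the copy may overlap position i).
        while i + l <= n and w.find(w[i:i + l], 0, i + l - 1) != -1:
            l += 1
        count += 1
        i += l
    return count

def normalized_lz(w: str, alphabet_size: int = 20) -> float:
    """LZ(w) divided by n / log_b(n), with b the alphabet size."""
    n = len(w)
    return lz76_complexity(w) / (n / math.log(n, alphabet_size))

seq_8taq = "GAGCAGACCTGACGGAAATTA"
print(lz76_complexity(seq_8taq), round(normalized_lz(seq_8taq), 2))
```

Under this assumption the parsing of the 8TAQ sequence is G, A, GC, AGA, CC, T, GACG, GAA, AT, TA, which gives the 10 components listed in the table.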

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)=|P_w(n)|\) be its cardinality. Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
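
The definition of \(p_w(n)\) can be sketched directly, reading "subword" as a contiguous block of the given length. The function names `subwords` and `subword_complexity` are illustrative.

```python
def subwords(w: str, k: int) -> set:
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def subword_complexity(w: str, k: int) -> int:
    """Cardinality of the set of length-k subwords of w."""
    return len(subwords(w, k))

seq_8taq = "GAGCAGACCTGACGGAAATTA"
print(subword_complexity(seq_8taq, 2), subword_complexity(seq_8taq, 3))
```

For the 8TAQ sequence this gives 14 and 18, which agree with the \(p_w(2)\) and \(p_w(3)\) entries in the table.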

\(|P_{f(2CMY_1)}(2) \setminus P_{f(8TAQ_1)}(2)|=133\), \(|P_{f(8TAQ_1)}(2) \setminus P_{f(2CMY_1)}(2)|=7\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
1111000110011001010010010110110001110110000011010110001011010001101000110100000010001111010011010001101011000101100011011100000100010110010111100000001011010001101101011000001001111100101011101101010000111000100010110001100
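
A hydrophobic-polar encoding of this kind can be sketched as a per-letter lookup. The hydrophobic set used below is an assumption inferred from the start of the string above (I, V, G, A, P, L, F and W map to 1; Y, T, C, N, Q, S, H, K, R, E and D map to 0); the placement of M in the hydrophobic class is a guess, and `HYDROPHOBIC` and `hp_encode` are hypothetical names.

```python
# Assumed hydrophobic class; M is a guess, the rest is inferred from
# the beginning of the binary string shown on this page.
HYDROPHOBIC = set("AFGILMPVW")

def hp_encode(w: str) -> str:
    """Map each residue to '1' (hydrophobic) or '0' (polar)."""
    return "".join("1" if letter in HYDROPHOBIC else "0" for letter in w)

# First 16 residues of 2CMY_1 as listed on this page.
prefix_2cmy = "IVGGYTCGANTVPYQV"
print(hp_encode(prefix_2cmy))
```

With this assumed classification, the first 16 residues of 2CMY_1 encode to the first 16 symbols of the string above.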
Pair \(Z_2\) Length of longest common subword
2CMY_1,8TAQ_1 140 3
2CMY_1,8XVM_1 218 4
8TAQ_1,8XVM_1 322 3
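
The two columns of this table can be sketched as follows. \(Z_k\) is computed as the size of the symmetric difference of the two subword sets, which equals the sum in its definition. The last column is read here as the length of the longest common contiguous subword; that reading is an assumption based on the small values reported. The function names are illustrative.

```python
def subwords(w: str, k: int) -> set:
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x: str, y: str, k: int) -> int:
    """|P_x(k) minus P_y(k)| + |P_y(k) minus P_x(k)|."""
    return len(subwords(x, k) ^ subwords(y, k))

def longest_common_subword(x: str, y: str) -> int:
    """Length of the longest contiguous block shared by x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, 1):
            if a == b:
                # Extend the common block ending at the previous letters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

print(z_distance("AAB", "ABB", 2), longest_common_subword("ABCDE", "XBCDY"))
```

In the toy example, the 2-letter subword sets {AA, AB} and {AB, BB} differ in two elements, and the longest shared block is BCD.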

Newick tree

 
(8XVM_1:15.51,(2CMY_1:70,8TAQ_1:70):83.51);

Let d be the Otu–Sayood distance of the same name.
Let d1 be the Otu–Sayood distance of the same name. (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)(\frac{244}{\log_{20} 244}-\frac{21}{\log_{20}21})\approx 76.3\)
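
The arithmetic in the expected-distance estimate can be checked with a few lines. The constants 0.85 and 0.8 are taken from the formula as given; 244 appears to be the combined length 223 + 21 of the two query sequences, which is an inference rather than something the page states. The variable names are illustrative.

```python
import math

def log20(x: float) -> float:
    """Logarithm to base 20, the size of the amino acid alphabet."""
    return math.log(x, 20)

n_combined = 244  # presumably 223 + 21 (assumption)
n_short = 21      # length of the 8TAQ_1 sequence

expected = 0.85 * 0.8 * (n_combined / log20(n_combined)
                         - n_short / log20(n_short))
print(round(expected, 2))
```

The result is within a tenth of the value 76.3 quoted above.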
Status Protein1 Protein2 d d1/2
Query variables 2CMY_1 8TAQ_1 96 52
Was not able to put for d
Was not able to put for d1

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \min + (1-\alpha) \max= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
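
The interpolation above can be sketched under the following assumptions: \(d(S,Q)=\max\{c(SQ)-c(S),\, c(QS)-c(Q)\}\) and \(d_1(S,Q)\) is the sum of the same two terms, where \(c\) is an LZ76-style complexity. With that reading, "min" and "max" in the formula are the smaller and larger of the two terms, so \(\alpha=1/2\) gives \(d_1/2\). The LZ variant and all function names here are illustrative and may differ from what the page computes.

```python
def lz76_complexity(w: str) -> int:
    """Number of components in an LZ76-style exhaustive parsing of w."""
    i, count, n = 0, 0, len(w)
    while i < n:
        l = 1
        # Extend while w[i:i+l] already occurs at an earlier start.
        while i + l <= n and w.find(w[i:i + l], 0, i + l - 1) != -1:
            l += 1
        count += 1
        i += l
    return count

def otu_sayood_terms(s: str, q: str) -> tuple:
    """The two complexity increments c(SQ) - c(S) and c(QS) - c(Q)."""
    c = lz76_complexity
    return c(s + q) - c(s), c(q + s) - c(q)

def delta(s: str, q: str, alpha: float) -> float:
    """alpha * min + (1 - alpha) * max of the two increments."""
    a, b = otu_sayood_terms(s, q)
    return alpha * min(a, b) + (1 - alpha) * max(a, b)

d = delta("ACGT", "ACGT", 0)          # alpha = 0: the distance d
half_d1 = delta("ACGT", "ACGT", 0.5)  # alpha = 1/2: d1 / 2
print(d, half_d1)
```

For two copies of ACGT, appending the second copy adds a single component, so both increments are 1 and the two values coincide.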