CoV2D Browser™

Parikh vectors
1ILG_1 1QFY_1 3ZSG_1 Letter Amino acid
22 17 21 S Serine
16 16 19 T Threonine
17 16 25 A Alanine
6 12 14 N Asparagine
22 8 15 Q Glutamine
23 26 23 E Glutamic acid
15 15 22 I Isoleucine
15 14 17 P Proline
6 5 4 C Cysteine
20 25 16 G Glycine
40 23 43 L Leucine
19 14 13 F Phenylalanine
3 6 5 W Tryptophan
11 25 22 V Valine
17 9 19 R Arginine
12 18 27 D Aspartic acid
14 4 13 H Histidine
14 11 10 M Methionine
17 34 19 K Lysine
7 10 15 Y Tyrosine
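
The table above lists, for each sequence, how often each amino-acid letter occurs (its Parikh vector). A minimal sketch of that count follows; the function name `parikh_vector` and the constant `AMINO_ACIDS` are illustrative and not taken from the CoV2D code, and the letter order is simply the row order of the table.

```python
from collections import Counter

# Letter order copied from the rows of the table above (S, T, A, ..., Y).
AMINO_ACIDS = "STANQEIPCGLFWVRDHMKY"

def parikh_vector(sequence: str) -> list[int]:
    """Return the number of occurrences of each amino-acid letter in `sequence`."""
    counts = Counter(sequence)
    return [counts[letter] for letter in AMINO_ACIDS]

# Toy input: the first 12 residues of the 1ILG_1 sequence shown on this page.
prefix = "MKKGHHHHHHGS"
print(dict(zip(AMINO_ACIDS, parikh_vector(prefix))))
```

Applied to a full sequence, the entries sum to the sequence length \(n=|w|\).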

>1ILG_1|Chain A|ORPHAN NUCLEAR RECEPTOR PXR|Homo sapiens (9606)
>1QFY_1|Chains A, B|PROTEIN (FERREDOXIN: NADP+ REDUCTASE)|Pisum sativum (3888)
>3ZSG_1|Chain A|MITOGEN-ACTIVATED PROTEIN KINASE 14|HOMO SAPIENS (9606)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
1ILG , Knot 139 316 0.84 40 201 300
MKKGHHHHHHGSERTGTQPLGVQGLTEEQRMMIRELMDAQMKTFDTTFSHFKNFRLPGVLSSGCELPESLQAPSREEAAKWSQVRKDLCSLKVSLQLRGEDGSVWNYKPPADSGGKEIFSLLPHMADMSTYMFKGIISFAKVISYFRDLPIEDQISLLKGAAFELCQLRFNTVFNAETGTWECGRLSYCLEDTAGGFQQLLLEPMLKFHYMLKKLQLHEEEYVLMQAISLFSPDRPGVLQHRVVDQLQEQFAITLKSYIECNRPQPAHRFLFLKIMAMLTELRSINAQHTQRLLRIQDIHPFATPLMQELFGITGS
1QFY , Knot 136 308 0.84 40 195 294
QVTTEAPAKVVKHSKKQDENIVVNKFKPKEPYVGRCLLNTKITGDDAPGETWHMVFSTEGEVPYREGQSIGIVPDGIDKNGKPHKLRLYSIASSAIGDFGDSKTVSLCVKRLVYTNDAGEVVKGVCSNFLCDLKPGSEVKITGPVGKEMLMPKDPNATVIMLGTGTGIAPFRSFLWKMFFEKHEDYQFNGLAWLFLGVPTSSSLLYKEEFEKMKEKAPENFRLDFAVSREQVNDKGEKMYIQTRMAQYAEELWELLKKDNTFVYMCGLKGMEKGIDDIMVSLAAKDGIDWIEYKRTLKKAEQWNVEVS
3ZSG , Knot 159 362 0.86 40 224 348
GSHSQERPTFYRQELNKTIWEVPERYQNLSPVGSGAYGSVCAAFDTKTGLRVAVKKLSRPFQSIIHAKRTYRELRLLKHMKHENVIGLLDVFTPARSLEEFNDVYLVTHLMGADLNNIVKCQKLTDDHVQFLIYQILRGLKYIHSADIIHRDLKPSNLAVNEDCELKILDFGLARHTDDEMTGYVATRWYRAPEIMLNWMHYNQTVDIWSVGCIMAELLTGRTLFPGTDHIDQLKLILRLVGTPGAELLKKISSESARNYIQSLTQMPKMNFANVFIGANPLAVDLLEKMLVLDSDKRITAAQALAHAYFAQYHDPDDEPVADPYDQSFESRDLLIDEWKSLTYDEVISFVPPPLDQEEMES
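
The LZ-complexity column and its normalisation \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be sketched as follows. This assumes the exhaustive-history parsing of Lempel and Ziv (1976); the page does not state which variant it uses, so `lz76` is not guaranteed to reproduce the tabulated values. Both function names are illustrative.

```python
import math

def lz76(w: str) -> int:
    """Number of phrases in an LZ76-style parsing of w (assumed variant).

    Each phrase is the shortest prefix of the remaining text that does not
    already occur as a subword of everything before its own last letter.
    A trailing incomplete phrase is counted as one phrase.
    """
    i, phrases, n = 0, 0, len(w)
    while i < n:
        length = 1
        # Extend the phrase while it is still reproducible from earlier text.
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases

def normalized_lz(lz: int, n: int, alphabet_size: int = 20) -> float:
    """LZ(w) divided by n / log_alphabet_size(n), as in the table header."""
    return lz / (n / math.log(n, alphabet_size))

# Normalisation applied to the tabulated values for 1ILG (LZ = 139, n = 316).
print(normalized_lz(139, 316))
```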

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\). Let \(p_w(n)\) be the cardinality of \(P_w(n)\). Let \(f(c)\) be the sequence in FASTA with 4-symbol Protein Data Bank code \(c\).
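
As a small illustration of these definitions, reading \(P_w(n)\) as the set of subwords of length \(n\), the sketch below computes \(P_w(n)\) and \(p_w(n)\) for a short word. The function names are illustrative and not part of the CoV2D code.

```python
def subwords(w: str, n: int) -> set[str]:
    """P_w(n): the set of distinct contiguous subwords of length n in w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def subword_complexity(w: str, n: int) -> int:
    """p_w(n): the number of distinct subwords of length n in w."""
    return len(subwords(w, n))

# Toy input: the first 12 residues of the 1ILG_1 sequence shown on this page.
w = "MKKGHHHHHHGS"
print(subword_complexity(w, 1))  # 5 distinct letters: M, K, G, H, S
print(subword_complexity(w, 2))  # 7 distinct pairs: MK, KK, KG, GH, HH, HG, GS
```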

\(|P_{f(1ILG_1)}(2) \setminus P_{f(1QFY_1)}(2)|=92\), \(|P_{f(1QFY_1)}(2) \setminus P_{f(1ILG_1)}(2)|=86\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be the numerator of an LZ76-style (set-of-subwords) Jaccard distance for \(P(k)\). For example, \(Z_2(f(1ILG_1),f(1QFY_1))=92+86=178\).

Hydrophobic-polar version of sequence 1 (1ILG_1):
1001000000100001001111011000001110011010100100010010010111110010011001011000011010010001001010101010010110001110011001101110110100011011101101100100111000101101111010010100110100101001010001000111100111011101001100101000001110110110100111100011001000111010001000010110011110111110010010100000110100101110111001111010
Pair \(Z_2\) Length of longest common substring (contiguous)
1ILG_1,1QFY_1 178 4
1ILG_1,3ZSG_1 189 3
1QFY_1,3ZSG_1 187 5
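
The quantity \(Z_k(x,y)\) tabulated above is the size of the symmetric difference of the two sets of length-\(k\) subwords. A minimal sketch, with illustrative function names that are not taken from the CoV2D code:

```python
def subwords(w: str, k: int) -> set[str]:
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z(x: str, y: str, k: int) -> int:
    """Z_k(x, y) = |P_x(k) minus P_y(k)| + |P_y(k) minus P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

# Toy example: the pairs {MK, KK, KG} and {KK, KG, GH} differ in MK and GH.
print(z("MKKG", "KKGH", 2))  # 2
```

Dividing by the size of the union \(|P_x(k)\cup P_y(k)|\) would give the Jaccard distance itself; the table reports only the numerator.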

Newick tree

 
(3ZSG_1:95.61,(1ILG_1:89,1QFY_1:89):6.61);

Let \(d\) be the Otu--Sayood distance \(d\), and let \(d_1\) be the Otu--Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)(\frac{624 }{\log_{20} 624}-\frac{308}{\log_{20}308})=88.0\), where \(624=316+308\) is the combined length of 1ILG_1 and 1QFY_1.
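
The arithmetic in the estimate above can be checked with a short sketch. The function and parameter names are illustrative; the constants 0.85 and 0.8 are taken from the formula as given, and the page does not explain their origin.

```python
import math

def expected_distance(n_concat: int, n_single: int,
                      c1: float = 0.85, c2: float = 0.8,
                      alphabet_size: int = 20) -> float:
    """c1 * c2 * (n_concat / log_a(n_concat) - n_single / log_a(n_single))."""
    def scale(n: int) -> float:
        # n / log_a(n), the same normaliser used for the LZ-complexity ratio.
        return n / math.log(n, alphabet_size)
    return c1 * c2 * (scale(n_concat) - scale(n_single))

# 624 = 316 + 308 is the length of the two sequences concatenated.
print(round(expected_distance(624, 308), 1))  # 88.0
```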
Status Protein1 Protein2 d d1/2
Query variables 1ILG_1 1QFY_1 108 108

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \mathrm{min} + (1-\alpha) \mathrm{max}= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
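
Reading the two cases off the display above, and assuming that \(\min\) and \(\max\) denote the smaller and larger of the two one-sided quantities being combined (the page does not spell this out), the two distances can be written as
\[ d = \delta\big|_{\alpha=0} = \max, \qquad \frac{d_1}{2} = \delta\big|_{\alpha=1/2} = \frac{\min+\max}{2}, \quad\text{so}\quad d_1 = \min + \max. \]
Under this reading \(d_1/2 \le d\) always holds, with equality exactly when \(\min=\max\), which is consistent with the values \(d = d_1/2 = 108\) reported for the pair 1ILG_1, 1QFY_1.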