CoV2D BrowserTM

Parikh vectors
4FHP_1 7UWH_1 4RCP_1 Letter Amino acid
20 16 10 G Glycine
40 0 31 L Leucine
2 0 3 W Tryptophan
6 0 5 M Methionine
22 0 7 F Phenylalanine
19 10 12 A Alanine
19 0 18 R Arginine
20 0 14 D Aspartic acid
22 0 11 I Isoleucine
26 0 11 K Lysine
16 0 7 P Proline
32 0 23 S Serine
14 12 11 T Threonine
18 0 13 Y Tyrosine
17 0 10 V Valine
12 0 10 N Asparagine
4 21 6 C Cysteine
10 0 7 Q Glutamine
22 0 13 E Glutamic acid
8 0 6 H Histidine
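
A Parikh vector simply counts how often each letter occurs in a sequence. The following is a minimal sketch of that count, not the site's own code; the function name `parikh_vector` and the constant `AMINO_ACIDS` are illustrative assumptions, and the letter order is copied from the table rows above.

```python
from collections import Counter

# Letter order taken from the rows of the table above (assumed, for illustration).
AMINO_ACIDS = "GLWMFARDIKPSTYVNCQEH"

def parikh_vector(w, alphabet=AMINO_ACIDS):
    """Return the number of occurrences of each alphabet letter in w."""
    counts = Counter(w)
    return [counts.get(letter, 0) for letter in alphabet]

# The 7UWH_1 sequence as listed on this page.
seq_7uwh = "GCCGTAGGCGGGCTACCTCTCCATGACGGCGAATACCCTCCCAGGCCTGCTGGTAATCT"
vector = dict(zip(AMINO_ACIDS, parikh_vector(seq_7uwh)))
print({letter: n for letter, n in vector.items() if n})
```

For 7UWH_1 only G, A, T and C are nonzero, which is why its column in the table is mostly zeros: the entry is a DNA 59-mer read over the amino-acid alphabet.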

>4FHP_1|Chain A|Poly(A) RNA polymerase protein cid1|Schizosaccharomyces pombe (284812)
>7UWH_1|Chain A|DNA (59-MER)|Escherichia coli (562)
>4RCP_1|Chain A|Serine/threonine-protein kinase PLK1|Homo sapiens (9606)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
4FHP , Knot 154 349 0.86 40 208 332
GSHMSYQKVPNSHKEFTKFCYEVYNEIKISDKEFKEKRAALDTLRLCLKRISPDAELVAFGSLESGLALKNSDMDLCVLMDSRVQSDTIALQFYEELIAEGFEGKFLQRARIPIIKLTSDTKNGFGASFQCDIGFNNRLAIHNTLLLSSYTKLDARLKPMVLLVKHWAKRKQINSPYFGTLSSYGYVLMVLYYLIHVIKPPVFPNLLLSPLKQEKIVDGFDVGFDDKLEDIPPSQNYSSLGSLLHGFFRFYAYKFEPREKVVTFRRPDGYLTKQEKGWTSATEHTGSADQIIKDRYILAIEDPFEISHNVGRTVSSSGLYRIRGEFMAASRLLNSRSYPIPYDSLFEEA
7UWH , Knot 22 59 0.50 8 15 36
GCCGTAGGCGGGCTACCTCTCCATGACGGCGAATACCCTCCCAGGCCTGCTGGTAATCT
4RCP , Knot 110 228 0.87 40 162 220
CHLSDMLQQLHSVNASKPSERGLVRQEEAEDPACIPIFWVSKWVDYSDKYGLGYQLCDNSVGVLFNDSTRLILYNDGDSLQYIERDGTESYLTVSSHPNSLMKKITLLKYFRNYMSEHLLKAGANITPREGDELARLPYLRTWFRTRSAIILHLSNGSVQINFFQDHTKLILCPLMAAVTYIDEKRDFRTYRLSLLEEYGCCKELASRLRYARTMVDKLLSSRSASNR
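
The LZ-complexity and normalized-ratio columns can be illustrated with a short sketch. It assumes the classical Lempel-Ziv (1976) exhaustive-history parsing; the page does not say which variant it uses, so the counts produced here are not guaranteed to match the LZ(w) column. The names `lz76` and `normalized` are illustrative.

```python
import math

def lz76(w):
    """Number of components in the Lempel-Ziv (1976) parsing of w.

    Each component is the shortest prefix of the remaining text that cannot
    be copied from the text seen so far (overlapping copies are allowed).
    A final, incomplete component is also counted.
    """
    i, count, n = 0, 0, len(w)
    while i < n:
        length = 1
        # Extend while the candidate still occurs in the history w[:i+length-1].
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def normalized(lz, n, base=20):
    """LZ(w) divided by n / log_base(n), as in the table header (needs n >= 2)."""
    return lz / (n / math.log(n, base))

print(lz76("0001101001000101"))       # classic example: 0|001|10|100|1000|101
print(round(normalized(154, 349), 3))  # table row 4FHP lists 0.86
```

Feeding the table's own values into `normalized` gives about 0.862, 0.508 and 0.874 for the three rows, so the listed 0.86, 0.50 and 0.87 appear to be truncated rather than rounded.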

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)=|P_w(n)|\) be its cardinality. Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
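
As a small illustration of these definitions, the sketch below collects the distinct contiguous subwords of a given length with a sliding window. The function names `subwords` and `subword_count` are assumptions made for this example.

```python
def subwords(w, k):
    """Set of distinct contiguous subwords of length k in w (empty if k > len(w))."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def subword_count(w, k):
    """Cardinality of the set of distinct length-k subwords of w."""
    return len(subwords(w, k))

print(sorted(subwords("GCCG", 2)))
print(subword_count("GCCG", 2))
```

For the word GCCG the length-2 subwords are GC, CC and CG, so the count is 3.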

\(|P_{f(4FHP_1)}(2) \setminus P_{f(7UWH_1)}(2)|=203\), \(|P_{f(7UWH_1)}(2) \setminus P_{f(4FHP_1)}(2)|=10\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be the size of the symmetric difference of \(P_x(k)\) and \(P_y(k)\), that is, the numerator of an LZ76-style (set-of-subwords) Jaccard distance for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
1001000011000001001000100010100001000011100101010010101011111010011110000101011100010000111010001110110101100101111010000001111010001110001110001110000010101011111100110000100101101000101111100110110111110111011000011011011100010011100000011011011101010010100011010010101000001100100001010011000011110011010001100100011001010111100110000011100011001
Pair \(Z_2\) Length of longest common subword (contiguous)
4FHP_1,7UWH_1 213 2
4FHP_1,4RCP_1 166 4
7UWH_1,4RCP_1 163 3
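
The two columns of this table can be sketched as follows. \(Z_k\) is the size of the symmetric difference of the two subword sets; the last column is computed here on the assumption that it counts contiguous matches, which the small values (2 to 4) suggest. The function names are illustrative, not the site's.

```python
def subwords(w, k):
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z(x, y, k):
    """Z_k(x, y): subwords of length k occurring in exactly one of x and y."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

def longest_common_subword(x, y):
    """Length of the longest contiguous word occurring in both x and y."""
    k = 0
    # A common subword of length k+1 implies one of length k, so we can scan upward.
    while k < min(len(x), len(y)) and subwords(x, k + 1) & subwords(y, k + 1):
        k += 1
    return k

print(z("GCCG", "CCGT", 2))
print(longest_common_subword("GCCG", "CCGT"))
```

In the toy example, GC occurs only in the first word and GT only in the second, so \(Z_2=2\), and the longest shared subword is CCG. The first row of the table is consistent with the counts quoted earlier: 203 + 10 = 213.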

Newick tree

 
(4FHP_1:99.69,(4RCP_1:81.5,7UWH_1:81.5):18.19);

Let \(d\) and \(d_1\) be the Otu--Sayood distances of those names. (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, the expected distance is \((0.85)(0.8)(\frac{408 }{\log_{20} 408}-\frac{59}{\log_{20}59})\approx 108.\)
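
The arithmetic in this estimate can be checked directly. The sketch below just evaluates the expression as written; the factors 0.85 and 0.8 are taken from the page without further interpretation, and the helper name `length_term` is an assumption.

```python
import math

def length_term(n, base=20):
    """n / log_base(n), the normalizing quantity used in the LZ ratio column."""
    return n / math.log(n, base)

# 408 = 349 + 59 is the length of the concatenated 4FHP_1 and 7UWH_1 sequences.
expected = 0.85 * 0.8 * (length_term(408) - length_term(59))
print(round(expected, 2))
```

The value comes out slightly below 109, so the figure 108 on the page looks like a truncation rather than a rounding.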
Status Protein1 Protein2 d d1/2
Query variables 4FHP_1 7UWH_1 151 85

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \min + (1-\alpha) \max= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
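
The interpolation can be made concrete with a short sketch. It assumes the usual Otu--Sayood construction, in which the two terms are \(c(xy)-c(x)\) and \(c(yx)-c(y)\) for a complexity measure \(c\), with \(d\) their maximum and \(d_1\) their sum; the page does not spell this out. The function names, the use of a classical LZ76 parsing for \(c\), and the toy input strings are all illustrative assumptions.

```python
def lz76(w):
    """Number of components in the Lempel-Ziv (1976) parsing of w (assumed variant)."""
    i, count, n = 0, 0, len(w)
    while i < n:
        length = 1
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def otu_sayood_terms(x, y, c=lz76):
    """The two one-sided terms c(xy) - c(x) and c(yx) - c(y)."""
    return c(x + y) - c(x), c(y + x) - c(y)

def delta(lo, hi, alpha):
    """alpha * min + (1 - alpha) * max, for lo <= hi."""
    return alpha * lo + (1 - alpha) * hi

# Toy strings, chosen only to exercise the functions.
a, b = otu_sayood_terms("GCCGTAGG", "AAAAAA")
lo, hi = min(a, b), max(a, b)
print(delta(lo, hi, 0.0), delta(lo, hi, 0.5))  # d and d1/2 for the toy pair

# The table lists d = 151 and d1/2 = 85, which implies a smaller term of 2*85 - 151 = 19.
print(delta(19, 151, 0.0), delta(19, 151, 0.5))
```

With \(\alpha=0\) only the larger term survives, giving \(d\); with \(\alpha=1/2\) the two terms are averaged, giving \(d_1/2\).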