CoV2D Browser™

Parikh vectors
8XLQ_1 3PVY_1 6KFK_1 Letter Amino acid
26 12 12 P Proline
7 14 44 T Threonine
21 24 6 R Arginine
14 14 15 K Lysine
11 11 16 F Phenylalanine
8 21 24 Q Glutamine
22 24 9 E Glutamic acid
17 13 34 S Serine
8 9 11 Y Tyrosine
25 14 29 V Valine
23 32 41 A Alanine
19 17 20 D Aspartic acid
7 5 0 C Cysteine
7 12 46 N Asparagine
10 18 17 I Isoleucine
9 12 8 M Methionine
5 10 2 W Tryptophan
24 20 39 G Glycine
8 8 2 H Histidine
40 21 27 L Leucine
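The Parikh vector of a sequence is its vector of letter counts. As a minimal sketch, the counts in the table above could be reproduced as follows; the function name `parikh_vector` and the alphabet ordering are illustrative choices, not taken from the CoV2D code.

```python
from collections import Counter

# The 20 standard one-letter amino-acid codes (ordering is arbitrary here).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def parikh_vector(seq: str) -> dict:
    """Return the count of each amino-acid letter in seq, zeros included."""
    counts = Counter(seq)
    return {letter: counts.get(letter, 0) for letter in AMINO_ACIDS}

# Example on the first ten residues of the 8XLQ_1 sequence.
example = parikh_vector("GPLLAGLVSL")
nonzero = {letter: n for letter, n in example.items() if n}
```

The entries of a Parikh vector sum to the sequence length, which gives a quick consistency check: each of the first two columns of the table sums to 311.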

>8XLQ_1|Chain A|Fibroblast growth factor receptor 4|Homo sapiens (9606)
>3PVY_1|Chain A|Phenylacetic acid degradation protein paaA|Escherichia coli (511145)
>6KFK_1|Chain A|Flagellar hook protein FlgE|Salmonella enterica subsp. enterica serovar Typhimurium (90371)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
8XLQ , Knot 136 311 0.83 40 193 294
GPLLAGLVSLDLPLDPLWEFPRDRLVLGKPLGEGCFGQVVRAEAFGMDPARPDQASTVAVKMLKDNASDKDLADLVSEMEVMKLIGRHKNIINLLGVCTQEGPLYVIVECAAKGNLREFLRARRPPGPDLSPDGPRSSEGPLSFPVLVSCAYQVARGMQYLESRKCIHRDLAARNVLVTEDNVMKIADFGLARGVHHIDYYKKTSNGRLPVKWMAPEALFDEVYTHQSDVWSFGILLWEIFTLGGSPYPGIPVEELFSLLREGHRMDRPPHCPPELYGLMRECWHAAPSQRPTFKQLVEALDKVLLAVSEE
3PVY , Knot 136 311 0.83 40 204 302
MRSTQEERFEQRIAQETAIEPQDWMPDAYRKTLIRQIGQHAHSEIVGMLPEGNWITRAPTLRRKAILLAKVQDEAGHGLYLYSAAETLGCAREDIYQKMLDGRMKYSSIFNYPTLSWADIGVIGWLVDGAAIVNQVALCRTSYGPYARAMVKICKEESFHQRQGFEACMALAQGSEAQKQMLQDAINRFWWPALMMFGPNDDNSPNSARSLTWKIKRFTNDELRQRFVDNTVPQVEMLGMTVPDPDLHFDTESGHYRFGEIDWQEFNEVINGRGICNQERLDAKRKAWEEGTWVREAALAHAQKQHARKVA
6KFK , Knot 161 402 0.80 38 189 367
SFSQAVSGLNAAATNLDVIGNNIANSATYGFKSGTASFADMFAGSKVGLGVKVAGITQDFTDGTTTNTGRGLDVAISQNGFFRLVDSNGSVFYSRNGQFKLDENRNLVNMQGMQLTGYPATGTPPTIQQGANPAPITIPNTLMAAKSTTTASMQINLNSTDPVPSKTPFSVSDADSYNKKGTVTVYDSQGNAHDMNVYFVKTKDNEWAVYTHDSSDPAATAPTTASTTLKFNENGILESGGTVNITTGTINGATAATFSLSFLNSMQQNTGANNIVATNQNGYKPGDLVSYQINNDGTVVGNYSNEQEQVLGQIVLANFANNEGLASQGDNVWAATQASGVALLGTAGSGNFGKLTNGALEASNVDLSKELVNMIVAQRNYQSNAQTIKTQDQILNTLVNLR
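The page does not say which Lempel-Ziv variant produces the LZ-complexity column. The sketch below assumes the LZ76 (exhaustive history) phrase count and pairs it with the normalization \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) from the table header. Both function names are hypothetical, and the phrase counter is not claimed to reproduce the tabulated values.

```python
import math

def lz76_complexity(w: str) -> int:
    """Number of phrases in the LZ76 exhaustive-history parsing of w."""
    i, phrases, n = 0, 0, len(w)
    while i < n:
        length = 1
        # Extend the phrase while it can be copied from a start position < i
        # (overlap with the phrase itself is allowed, as in LZ76).
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases

def normalized_lz(lz: int, n: int, alphabet_size: int = 20) -> float:
    """LZ(w) divided by n / log_A(n), with A the alphabet size."""
    return lz / (n / math.log(n, alphabet_size))

ratio_8xlq = normalized_lz(136, 311)  # tabulated LZ and length for 8XLQ
ratio_6kfk = normalized_lz(161, 402)  # tabulated LZ and length for 6KFK
```

With the tabulated values, the ratios come out just above 0.83 and 0.80, which agrees with the fourth column if that column is truncated to two decimals.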

Let \(P_w(k)\) be the set of distinct subwords (contiguous intervals) of length \(k\) in a word \(w\), and let \(p_w(k)=|P_w(k)|\) be its cardinality. Let \(f(c)\) be the FASTA sequence of the entry with 4-symbol Protein Data Bank code \(c\).
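As a small illustration of these definitions, the set of subwords of a fixed length and its cardinality can be computed directly. The names `subwords` and `subword_complexity` are illustrative only.

```python
def subwords(w: str, k: int) -> set:
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def subword_complexity(w: str, k: int) -> int:
    """Cardinality of the set of length-k subwords of w."""
    return len(subwords(w, k))

# "GPLLAG" has the length-2 subwords GP, PL, LL, LA, AG.
pairs = subwords("GPLLAG", 2)
```

For a word of length \(n\) over 20 letters, the count for subword length \(k\) is at most \(\min(20^k,\, n-k+1)\).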

\(|P_{f(8XLQ_1)}(2) \setminus P_{f(3PVY_1)}(2)|=85\), \(|P_{f(3PVY_1)}(2) \setminus P_{f(8XLQ_1)}(2)|=96\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
11111111010111011101100011110111010110110101111011010010011101100010000110110010110111000011011110000111011100110101001101001111010101100001110111110010011011001000001000111001110000110110111101100100000000101110111101110010000001101111110110111010111110011011001001001100110101110001011100010100110110011111000
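The page does not state which residues count as hydrophobic. Comparing the start of the 8XLQ_1 sequence with the binary string above suggests that A, F, G, I, L, M, P, V and W map to 1 and all other residues to 0. The sketch below uses that inferred set as an assumption; `hp_encode` and `HYDROPHOBIC` are hypothetical names.

```python
# Assumption: hydrophobic set inferred from the displayed encoding,
# not documented by the page itself.
HYDROPHOBIC = frozenset("AFGILMPVW")

def hp_encode(seq: str, hydrophobic=HYDROPHOBIC) -> str:
    """Map each residue to '1' (hydrophobic) or '0' (polar)."""
    return "".join("1" if aa in hydrophobic else "0" for aa in seq)

# First twenty residues of the 8XLQ_1 sequence.
prefix_code = hp_encode("GPLLAGLVSLDLPLDPLWEF")
```

Passing a different `hydrophobic` set makes it easy to test other hydrophobicity conventions against the displayed string.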
Pair \(Z_2\) Length of longest common subword (contiguous)
8XLQ_1,3PVY_1 181 3
8XLQ_1,6KFK_1 168 4
3PVY_1,6KFK_1 195 3
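The quantity \(Z_k\) is the size of the symmetric difference of the two subword sets. A minimal sketch, with illustrative function names, is:

```python
def subwords(w: str, k: int) -> set:
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x: str, y: str, k: int) -> int:
    """Number of length-k subwords in exactly one of x and y."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

# Toy example: {AA, AB} versus {AB, BB} differ in AA and BB.
toy = z_distance("AAB", "ABB", 2)
```

For the pair 8XLQ_1, 3PVY_1 the two one-sided counts quoted earlier, 85 and 96, add up to the tabulated \(Z_2 = 181\).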

Newick tree

 
[
	3PVY_1:97.18,
	[
		8XLQ_1:84,6KFK_1:84
	]:13.18
]

Let \(d\) be the Otu–Sayood distance \(d\).
Let \(d_1\) be the Otu–Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
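For reference, the unnormalized Otu–Sayood distances are usually written as \(d(x,y)=\max\{c(xy)-c(x),\, c(yx)-c(y)\}\) and \(d_1(x,y)=c(xy)-c(x)+c(yx)-c(y)\), where \(c\) is an LZ phrase count. The sketch below assumes those definitions together with an LZ76 phrase counter; whether the page uses exactly this variant of \(c\) is not stated.

```python
def lz76_complexity(w: str) -> int:
    """Number of phrases in the LZ76 exhaustive-history parsing of w."""
    i, phrases, n = 0, 0, len(w)
    while i < n:
        length = 1
        # Grow the phrase while it is copyable from an earlier start position.
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases

def otu_sayood(x: str, y: str) -> tuple:
    """Return (d, d1) for the assumed unnormalized Otu-Sayood distances."""
    a = lz76_complexity(x + y) - lz76_complexity(x)  # new phrases y adds to x
    b = lz76_complexity(y + x) - lz76_complexity(y)  # new phrases x adds to y
    return max(a, b), a + b

d_toy, d1_toy = otu_sayood("AB", "CD")
```

Under these definitions \(d \ge d_1/2\) always holds, which is consistent with the values 116 and 114 reported for 8XLQ_1 and 3PVY_1.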
Roughly speaking, an expected distance is \((0.85)(0.8)\left(\frac{622}{\log_{20} 622}-\frac{311}{\log_{20} 311}\right)\approx 86.5\)
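The arithmetic in this estimate can be checked directly. The sketch reads 311 as the common sequence length and 622 as the length of the concatenation, and takes the factors 0.85 and 0.8 as given; `lz_scale` is a hypothetical helper name.

```python
import math

def lz_scale(n: int, alphabet_size: int = 20) -> float:
    """The scale n / log_A(n) used to normalize LZ-complexity."""
    return n / math.log(n, alphabet_size)

# Difference in scale between the concatenation (622) and one sequence (311),
# multiplied by the two factors quoted in the text.
estimate = 0.85 * 0.8 * (lz_scale(622) - lz_scale(311))
```

Evaluating `estimate` gives a value between 86 and 87, in line with the figure quoted above.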
Status Protein1 Protein2 d d1/2
Query variables 8XLQ_1 3PVY_1 116 114

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \mathrm{min} + (1-\alpha) \mathrm{max}= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
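To make the interpolation concrete, the sketch below assumes that min and max refer to the two one-sided increments \(c(xy)-c(x)\) and \(c(yx)-c(y)\); the page does not spell this out. Under that reading, the tabulated values \(d=116\) and \(d_1/2=114\) for 8XLQ_1 and 3PVY_1 determine both increments.

```python
def delta(alpha: float, lo: float, hi: float) -> float:
    """Convex combination alpha * min + (1 - alpha) * max."""
    return alpha * lo + (1 - alpha) * hi

d, half_d1 = 116, 114      # values from the table for 8XLQ_1, 3PVY_1
hi = d                     # alpha = 0 gives the max, i.e. d
lo = 2 * half_d1 - hi      # since (min + max) / 2 = d1 / 2

at_zero = delta(0, lo, hi)
at_half = delta(0.5, lo, hi)
```

Here the smaller increment works out to 112, and the two special cases \(\alpha=0\) and \(\alpha=1/2\) return 116 and 114 respectively.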