CoV2D Browser™

Parikh vectors
6HBG_1 4FNR_1 7DBH_1 Letter Amino acid
11 16 3 H Histidine
21 27 7 I Isoleucine
8 23 13 K Lysine
32 35 10 T Threonine
5 15 0 W Tryptophan
14 30 1 N Asparagine
4 3 5 C Cysteine
13 49 7 E Glutamic acid
12 32 4 F Phenylalanine
21 34 6 P Proline
20 51 5 V Valine
17 58 11 R Arginine
16 40 5 D Aspartic acid
11 38 4 Y Tyrosine
13 56 18 A Alanine
13 32 8 Q Glutamine
10 21 2 M Methionine
23 46 9 S Serine
15 55 8 G Glycine
8 68 13 L Leucine
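The Parikh vector of a sequence records how often each letter occurs in it. A minimal Python sketch, assuming the sequence is a plain string of one-letter amino acid codes; the names `parikh_vector` and `ROW_ORDER` (the letter order of the rows above) are illustrative and not taken from the CoV2D code:

```python
from collections import Counter

# Letter order of the rows in the table above (illustrative constant).
ROW_ORDER = "HIKTWNCEFPVRDYAQMSGL"


def parikh_vector(sequence, alphabet=ROW_ORDER):
    """Return the number of occurrences of each letter of `alphabet` in `sequence`."""
    counts = Counter(sequence)
    # Counter returns 0 for letters that never occur.
    return [counts[letter] for letter in alphabet]


if __name__ == "__main__":
    # First 20 residues of 7DBH_1, used only as a small demonstration input.
    prefix = "GSHMARTKQTARKSTGDKAP"
    for letter, count in zip(ROW_ORDER, parikh_vector(prefix)):
        print(letter, count)
```

Passing a full FASTA sequence instead of the prefix gives one column of the table.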

>6HBG_1|Chain A|Echovirus 18 viral protein 1|Echovirus E18 (47506)
>4FNR_1|Chains A, B, C, D|Alpha-galactosidase AgaA|Geobacillus stearothermophilus (1422)
>7DBH_1|Chains A, E|Histone H3mm18|Mus musculus (10090)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
6HBG , Knot 130 287 0.85 40 197 273
GDNQDRTVANTQPSGPSNSTEIPALTAVETGHTSQVDPSDTIQTRHVVNFHSRSESTIENFMGRAACVFMDQYKINGEETSTDRFAVWTINIREMAQLRRKCEMFTYMRFDIEMTMVITSCQDQGTILDQDMPVLTHQIMYVPPGGPIPAKVDGYEWQTSTNPSVFWTEGNAPPRISIPFISVGNAYSSFYDGWSHFTQDGTYGYTTLNAMGKLYIRHVNRSSPHQITSTIRVYFKPKHIKAWVPRPPRLCPYINKRDVNFVVTEITDSRTSITDTPHPEHSVLATH
4FNR , Knot 286 729 0.86 40 293 674
MSVAYNPQTKQFHLRAGKASYVMQLFRSGYLAHVYWGKAVRDVRGARAFPRLDRAFSPNPDPSDRTFSLDTLLQEYPAYGNTDFRAPAYQVQLENGSTVTDLRYKTHRIYKGKPRLNGLPATYVEHEQEAETLEIVLGDALIGLEVTLQYTAYEKWNVITRSARFENKGGERLKLLRALSMSVDFPTADYDWIHLPGAWGRERWIERRPLVTGVQAAESRRGASSHQQNPFIALVAKNADEHQGEVYGFSFVYSGNFLAQIEVDQFGTARVSMGINPFDFTWLLQPGESFQTPEVVMVYSDQGLNGMSQTYHELYRTRLARGAFRDRERPILINNWEATYFDFNEEKIVNIARTAAELGIELVVLDDGWFGERDDDRRSLGDWIVNRRKLPNGLDGLAKQVNELGLQFGLWVEPEMVSPNSELYRKHPDWCLHVPNRPRSEGRNQLVLDYSREDVCDYIIETISNVLASAPITYVKWDMNRHMTEIGSSALPPERQRETAHRYMLGLYRVMDEITSRFPHILFESCSGGGGRFDPGMLYYMPQTWTSDNTDAVSRLKIQYGTSLVYPISAMGAHVSAVPNHQVGRVASLKTRGHVAMSGNFGYELDITKLTETEKQMMKQQVAFYKDVRRLVQFGTFYRLLSPFEGNEAAWMFVSADRSEALVAYFRVLAEANAPLSYLRLKGLDSNQDYEIEGLGVYGGDELMYAGVALPYRSSDFISMMWRLKAVQQ
7DBH , Knot 72 139 0.85 38 109 133
GSHMARTKQTARKSTGDKAPRKQLATKAARKSAPSTGGVKKPHCYRPGTVALREIRSYQKSSELLIRKLPFQRLVLEIAQDFKTDLCFQSAAIGALQEASEAYLVGLFEDTNLCAIHAKRVTIMPKDTQLAGYICRECA
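The fourth column divides \(\mathrm{LZ}(w)\) by \(n/\log_{20} n\). The Python sketch below assumes that LZ means the number of components in the Lempel–Ziv (1976) exhaustive history. The page does not state its exact variant, so `lz76` may not reproduce the LZ column. `normalized_lz` only performs the arithmetic of the fourth column; the tabulated values appear to be truncated, not rounded, to two decimals. Both function names are illustrative.

```python
import math


def lz76(w):
    """Number of components in the LZ76 exhaustive history of w (simple, unoptimized)."""
    i, components, n = 0, 0, len(w)
    while i < n:
        length = 1
        # Extend the component while it can still be copied from earlier text
        # (the copied source may overlap the component itself).
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        components += 1
        i += length
    return components


def normalized_lz(lz, n, alphabet_size=20):
    """Compute LZ(w) / (n / log_b n) with b = alphabet_size."""
    return lz / (n / math.log(n, alphabet_size))


if __name__ == "__main__":
    # (code, LZ, n) triples copied from the table above.
    for code, lz, n in [("6HBG", 130, 287), ("4FNR", 286, 729), ("7DBH", 72, 139)]:
        print(code, math.floor(100 * normalized_lz(lz, n)) / 100)
```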

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)\) be the cardinality of \(P_w(n)\). Let \(f(c)\) be the sequence in FASTA with 4-symbol Protein Data Bank code \(c\).
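As an illustration of these definitions, here is a short Python sketch that reads \(P_w(n)\) as the set of distinct length-\(n\) intervals of \(w\). The function names `subwords` and `subword_complexity` are illustrative, and the sketch is not claimed to reproduce the tabulated \(p_w\) columns.

```python
def subwords(w, n):
    """Set of distinct contiguous subwords of length n in w (empty if n > len(w))."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def subword_complexity(w, n):
    """Cardinality p_w(n) of the set of distinct length-n subwords."""
    return len(subwords(w, n))


if __name__ == "__main__":
    w = "abab"
    for n in (1, 2, 3):
        print(n, sorted(subwords(w, n)), subword_complexity(w, n))
```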

\(|P_{f(6HBG_1)}(2) \setminus P_{f(4FNR_1)}(2)|=35\), \(|P_{f(4FNR_1)}(2) \setminus P_{f(6HBG_1)}(2)|=131\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
10000001100010110000011110110010000101000100001101000000010011101101110000101000000011110101001101000001100101010101110000001011000111100011011111111101010010000010111001011101011110110100010011001000100100010111010100100001001000101010100101111011010101000010111001000000100010100011100
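The binary string is consistent with writing 1 for a hydrophobic residue and 0 for a polar one. The Python sketch below assumes the hydrophobic set {A, F, G, I, L, M, P, V, W}. That set is not stated on the page; it is inferred by comparing the first 90 residues of 6HBG_1 with the first 90 bits. The names `HYDROPHOBIC` and `hp_encode` are illustrative.

```python
# Assumed hydrophobic residues, inferred from the 6HBG_1 example (not stated by the source).
HYDROPHOBIC = frozenset("AFGILMPVW")


def hp_encode(sequence):
    """Map each residue to '1' if it is in HYDROPHOBIC and to '0' otherwise."""
    return "".join("1" if residue in HYDROPHOBIC else "0" for residue in sequence)


if __name__ == "__main__":
    # First 20 residues of 6HBG_1.
    print(hp_encode("GDNQDRTVANTQPSGPSNST"))
```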
Pair \(Z_2\) Length of longest common subword
6HBG_1,4FNR_1 166 4
6HBG_1,7DBH_1 192 3
4FNR_1,7DBH_1 232 4
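Both columns can be computed directly from the sequences. In the Python sketch below, `z_distance` follows the definition of \(Z_k\). The last column is read here as the length of the longest contiguous subword shared by both sequences, an assumption suggested by the small values 3 and 4. All function names are illustrative.

```python
def subwords(w, k):
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def z_distance(x, y, k):
    """Z_k(x, y): size of the symmetric difference of the length-k subword sets."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)


def longest_common_subword_length(x, y):
    """Length of the longest contiguous subword occurring in both x and y."""
    best = 0
    previous = [0] * (len(y) + 1)
    for a in x:
        current = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend the common suffix ending at the previous pair of positions.
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(z_distance("abab", "abba", 2))
    print(longest_common_subword_length("abab", "abba"))
```

The dynamic program uses time proportional to the product of the two lengths, which is fine for sequences of a few hundred residues.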

Newick tree

 
(
	7DBH_1:11.21,
	(
		6HBG_1:83,4FNR_1:83
	):30.21
);

Let \(d\) be the Otu–Sayood distance \(d\).
Let \(d_1\) be the Otu–Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)(\frac{1016}{\log_{20} 1016}-\frac{287}{\log_{20} 287}) \approx 195.\)
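The estimate can be checked numerically. In this Python sketch the factors 0.85 and 0.8 are taken from the text as given, and 1016 = 287 + 729 is the length of the concatenation of 6HBG_1 and 4FNR_1. The helper name `lz_scale` is illustrative.

```python
import math


def lz_scale(n, alphabet_size=20):
    """The normalizing quantity n / log_b n with b = alphabet_size."""
    return n / math.log(n, alphabet_size)


if __name__ == "__main__":
    expected = 0.85 * 0.8 * (lz_scale(1016) - lz_scale(287))
    # Prints a value between 195 and 196; the text reports it truncated.
    print(expected)
```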
Status Protein1 Protein2 d d1/2
Query variables 6HBG_1 4FNR_1 254 175
Was not able to put for d
Was not able to put for d1

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta = \alpha \min + (1-\alpha) \max = \begin{cases} d & \alpha=0,\\ d_1/2 & \alpha=1/2. \end{cases} \]
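A Python sketch of this interpolation. It assumes, following Otu and Sayood's definitions, that \(d\) is the larger and \(d_1\) the sum of the two complexity increments \(c(xy)-c(x)\) and \(c(yx)-c(y)\). The increments 254 and 96 are inferred from the query row (\(d=254\), \(d_1/2=175\)), assuming 175 is exact; they are not stated on the page. The function name `delta` is illustrative.

```python
def delta(increment_xy, increment_yx, alpha):
    """alpha * min + (1 - alpha) * max of the two complexity increments."""
    smaller = min(increment_xy, increment_yx)
    larger = max(increment_xy, increment_yx)
    return alpha * smaller + (1 - alpha) * larger


if __name__ == "__main__":
    # Increments inferred from the query row for 6HBG_1 and 4FNR_1 (an assumption).
    a, b = 254, 96
    print(delta(a, b, 0))    # alpha = 0 gives d
    print(delta(a, b, 0.5))  # alpha = 1/2 gives d_1 / 2
```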