CoV2D Browser™

Parikh vectors
3FST_1 3LRD_1 6MRV_1 Letter Amino acid
29 17 36 A Alanine
3 0 3 C Cysteine
9 10 17 M Methionine
15 7 18 F Phenylalanine
14 5 22 P Proline
12 11 32 N Asparagine
23 2 32 D Aspartic acid
9 12 21 Q Glutamine
13 1 24 H Histidine
19 6 33 I Isoleucine
15 2 37 K Lysine
20 22 42 S Serine
14 11 37 T Threonine
3 1 12 W Tryptophan
7 0 14 Y Tyrosine
19 5 44 V Valine
17 4 22 E Glutamic acid
17 11 40 G Glycine
26 9 45 L Leucine
20 1 22 R Arginine
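
The letter counts in the table above can be reproduced with a short sketch like the following. The function name `parikh_vector` and the constant `AMINO_ACIDS` are illustrative names, not part of the CoV2D site; the letter order simply follows the rows of the table.

```python
from collections import Counter

# The 20 amino-acid letters, in the row order used by the table above.
AMINO_ACIDS = "ACMFPNDQHIKSTWYVEGLR"


def parikh_vector(sequence: str) -> list[int]:
    """Return the number of occurrences of each amino-acid letter.

    The result is ordered like AMINO_ACIDS; letters that do not occur
    in the sequence are reported as 0.
    """
    counts = Counter(sequence)
    return [counts[letter] for letter in AMINO_ACIDS]


if __name__ == "__main__":
    # Toy input: the first six residues of the 3FST_1 sequence.
    vector = parikh_vector("MSFFHA")
    for letter, count in zip(AMINO_ACIDS, vector):
        if count:
            print(letter, count)
```

Applied to a full FASTA sequence, the entries of the vector sum to the sequence length.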

>3FST_1|Chains A, B[auth C], C[auth E]|5,10-methylenetetrahydrofolate reductase|Escherichia coli K-12 (83333)
>3LRD_1|Chains A, B|Major ampullate spidroin 1|Euprosthenops australis (332052)
>6MRV_1|Chains A, B|Sialidase26|unidentified bacterium (1869227)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
3FST , Knot 133 304 0.83 40 193 293
MSFFHASQRDALNQSLAEVQGQINVSFEFFPPRTSEMEQTLWNSIDRLSSLKPKFVSVTYGANSGERDRTHSIIKGIKDRTGLEAAPHLTCIDATPDELRTIARDYWNNGIRHIVALRGDLPPGSGKPEMYASDLVTLLKEVADFDISVAAYPEVHPEAKSAQADLLNLKRKVDAGANRAITQFFFDVESYLRFRDRCVSAGIDVEIIPGILPVSNFKQAKKLADMTNVRIPAWMAQMFDGLDDDAETRKLVGANIAMDMVKILSREGVKDFHFYTLNRAEMSYAICHTLGVRPGLLEHHHHHH
3LRD , Knot 63 137 0.75 36 93 133
GSGNSHTTPWTNPGLAENFMNSFMQGLSSMPGFTASQLDNMSTIAQSMVQSIQSLAAQGRTSPNKLQALNMAFASSMAEIAASQEGGGSLSTKTSSIASAMSNAFLQTTGVVNQPFINEITQLVSMFAQAGMNDVSA
6MRV , Knot 224 553 0.85 40 270 518
MKKNLFLSIIFSFCVILQAFASDTVFVRETQIPVLIERQDNVLFMLRLNAKESHTLDEVVLNFGKDVNMSDIQSVKLYYSGTEARQNYGKNFFAPVSYISSHTPGKTLAANPSYSINKSQVNNPKRKVALKANQKLFPGINYFWISLQMKPDASLLDKVAAKIAAIKVDNKEALMHTVSPENIVHRVGVGVRHAGDDGSASFRIPGLVTTNKGTLLGVYDVRYNNSADLQEHVDIGLSRSVDGGKTWEKMRLPLAFGETGDLPAAQNGVGDPSILVDTKTNTVWVVAAWTHGMGNQRAWWSSYPGMDMNHTAQLVLSKSTDDGKTWSKPINITEQVKDPSWYFLLQGPGRGITMQDGTLVFPIQFIDSTRVPNAGIMYSKDRGETWKIHNYARTNTTEAQVAEVEPGVLMLNMRDNRGGSRAISTTKDLGKTWTEHSSSRKALQEPVCMASLISVKAKDNVLNKDILLFSNPNTVKGRHHITIKASLDGGVTWLPEHQVMLDEGEGWGYSCLTMIDKETIGILYESSVAHMTFQAVQLRDIIKHHHHHHHHHH
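
As a rough illustration of the LZ-complexity column and its normalization by \(n/\log_{20} n\), here is a minimal sketch. It assumes the exhaustive-history parsing of Lempel and Ziv (1976), in which each phrase is the shortest prefix of the remaining text that has not occurred earlier; the page does not say which variant it uses, so the counts it reports may differ. The names `lz76_complexity` and `normalized_lz` are hypothetical.

```python
import math


def lz76_complexity(w: str) -> int:
    """Count phrases in an LZ76-style parsing of w (assumed variant)."""
    i, phrases, n = 0, 0, len(w)
    while i < n:
        length = 1
        # Extend the phrase while it still occurs starting before position i
        # (the earlier occurrence may overlap the phrase itself).
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def normalized_lz(lz: int, n: int, alphabet_size: int = 20) -> float:
    """Return LZ(w) / (n / log_b n) for an alphabet of size b."""
    return lz / (n / math.log(n, alphabet_size))


if __name__ == "__main__":
    # Classic binary example, parsed as 0|001|10|100|1000|101.
    print(lz76_complexity("0001101001000101"))
    # Normalization of the 3FST row: LZ = 133, n = 304.
    print(normalized_lz(133, 304))
```

For the 3FST row, the normalization gives about 0.835, in line with the 0.83 shown in the table.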

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)\) be the cardinality of \(P_w(n)\). Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
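
The subword sets and their cardinalities can be sketched as follows. The names `subwords` and `subword_complexity` are illustrative, and the sketch takes subwords to be contiguous, following the word "intervals" in the definition.

```python
def subwords(w: str, k: int) -> set[str]:
    """Return the set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def subword_complexity(w: str, k: int) -> int:
    """Return the number of distinct subwords of length k in w."""
    return len(subwords(w, k))


if __name__ == "__main__":
    # Toy input: the first six residues of the 3FST_1 sequence.
    w = "MSFFHA"
    for k in (1, 2, 3):
        print(k, sorted(subwords(w, k)))
```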

\(|P_{f(3FST_1)}(2) \setminus P_{f(3LRD_1)}(2)|=141\), \(|P_{f(3LRD_1)}(2) \setminus P_{f(3FST_1)}(2)|=41\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set-of-subwords) numerator of the Jaccard distance for \(P(k)\), that is, the size of the symmetric difference of \(P_x(k)\) and \(P_y(k)\).

Hydrophobic-polar version of Sequence 1:
1011010000110001101010101010111100001000110010010010101101001100100000001101100001101110100101010010011000100110011110101111010101010011011001101010111010101010010101101000101110011001110100010100001011101011111111001001001101001011111101101100010000111101110110110001100101001001010011000111011110000000
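
The hydrophobic-polar string can be approximated with a letter-by-letter mapping. The page does not state which residues it treats as hydrophobic; the set used below (A, F, G, I, L, M, P, V, W mapped to 1, everything else to 0) is an assumption inferred by comparing the opening residues of the 3FST_1 sequence with the binary string shown, and `hydrophobic_polar` is a hypothetical name.

```python
# Assumed hydrophobic residues, inferred from the displayed binary string;
# the site's actual classification is not documented here.
HYDROPHOBIC = set("AFGILMPVW")


def hydrophobic_polar(sequence: str) -> str:
    """Map each residue to '1' (hydrophobic) or '0' (polar)."""
    return "".join("1" if residue in HYDROPHOBIC else "0" for residue in sequence)


if __name__ == "__main__":
    # First 20 residues of the 3FST_1 sequence.
    print(hydrophobic_polar("MSFFHASQRDALNQSLAEVQ"))
```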
Pair \(Z_2\) Length of longest common subword
3FST_1,3LRD_1 182 4
3FST_1,6MRV_1 169 6
3LRD_1,6MRV_1 207 4
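
Both columns of this table can be sketched in a few lines. \(Z_k\) is the size of the symmetric difference of the two subword sets, so for the first pair \(141+41=182\). The small values in the last column (4, 6, 4) suggest contiguous common subwords, for example the run HHHHHH shared by the ends of 3FST_1 and 6MRV_1, and the sketch assumes that reading. The function names are illustrative.

```python
def subwords(w: str, k: int) -> set[str]:
    """Return the set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def z_distance(x: str, y: str, k: int) -> int:
    """Return |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    return len(subwords(x, k) ^ subwords(y, k))


def longest_common_subword_length(x: str, y: str) -> int:
    """Return the length of the longest contiguous subword shared by x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend the common run ending at the previous positions.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(z_distance("ABAB", "ABBA", 2))
    print(longest_common_subword_length("LLEHHHHHH", "IIKHHHHHHHHHH"))
```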

Newick tree

 
[
	3LRD_1:10.40,
	[
		3FST_1:84.5,6MRV_1:84.5
	]:16.90
]

Let d be the Otu--Sayood distance d.
Let d1 be the Otu--Sayood distance d1. (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)(\frac{441}{\log_{20} 441}-\frac{137}{\log_{20}137})=90.8\).
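
The arithmetic behind this estimate can be checked directly. Note that 441 = 304 + 137, the combined length of the two sequences. The factors 0.85 and 0.8 are taken from the formula as given; the function and parameter names below are hypothetical.

```python
import math


def expected_distance(n_total: int, n_short: int,
                      c1: float = 0.85, c2: float = 0.8) -> float:
    """Evaluate c1 * c2 * (n_total / log20(n_total) - n_short / log20(n_short))."""
    def scaled(n: int) -> float:
        return n / math.log(n, 20)

    return c1 * c2 * (scaled(n_total) - scaled(n_short))


if __name__ == "__main__":
    # 441 and 137 are the lengths used in the formula above.
    print(round(expected_distance(441, 137), 1))
```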
Status Protein1 Protein2 d d1/2
Query variables 3FST_1 3LRD_1 113 81.5
Was not able to put for d
Was not able to put for d1

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \mathrm{min} + (1-\alpha) \mathrm{max}= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
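
The interpolation can be illustrated with the query row above. If the formula holds for that row, then max = d = 113 and (min + max)/2 = d1/2 = 81.5, which would make min = 2(81.5) - 113 = 50. That value of 50 is derived here and does not appear on the page; the function name `delta` and its parameters are illustrative.

```python
def delta(alpha: float, smaller: float, larger: float) -> float:
    """Return alpha * min + (1 - alpha) * max for the two given quantities."""
    lo, hi = min(smaller, larger), max(smaller, larger)
    return alpha * lo + (1 - alpha) * hi


if __name__ == "__main__":
    # Values for the pair 3FST_1, 3LRD_1; the 50 is inferred, not reported.
    print(delta(0.0, 50, 113))   # alpha = 0 recovers d
    print(delta(0.5, 50, 113))   # alpha = 1/2 recovers d1/2
```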