CoV2D Browser™

Parikh vectors
2KBX_1 5SRM_1 3BOJ_1 Letter Amino acid
16 15 18 A Alanine
7 3 9 Q Glutamine
4 6 9 T Threonine
4 7 3 Y Tyrosine
13 13 24 G Glycine
6 5 11 P Proline
4 11 9 S Serine
12 2 6 R Arginine
13 16 6 N Asparagine
14 9 16 D Aspartic acid
7 6 8 H Histidine
16 19 14 L Leucine
6 2 7 M Methionine
4 3 3 C Cysteine
10 6 13 E Glutamic acid
7 6 12 I Isoleucine
8 13 14 K Lysine
4 5 9 F Phenylalanine
4 0 3 W Tryptophan
12 22 19 V Valine
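
As a minimal sketch of how one column of the table above could be computed, the following counts each amino-acid letter in a sequence. The names `AMINO_ACIDS` and `parikh_vector` are illustrative and not part of the page; the letter order is simply copied from the rows of the table, and the input is the first 21 residues of the 2KBX_1 sequence.

```python
from collections import Counter

# Letter order copied from the rows of the table (an assumption of this sketch).
AMINO_ACIDS = "AQTYGPSRNDHLMCEIKFWV"

def parikh_vector(w, alphabet=AMINO_ACIDS):
    """Return the number of occurrences of each alphabet letter in w."""
    counts = Counter(w)
    return [counts[a] for a in alphabet]

# First 21 residues of the 2KBX_1 sequence, used as a short example.
prefix = "MDDIFTQCREGNAVAVRLWLD"
print(dict(zip(AMINO_ACIDS, parikh_vector(prefix))))
```

The entries of a Parikh vector sum to the length of the word, which gives a quick sanity check against the length column reported for each sequence.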

>2KBX_1|Chain A|Integrin-linked protein kinase|Homo sapiens (9606)
>5SRM_1|Chains A, B|Non-structural protein 3|Severe acute respiratory syndrome coronavirus 2 (2697049)
>3BOJ_1|Chain A|Cadmium-specific carbonic anhydrase|Thalassiosira weissflogii (1577725)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
2KBX , Knot 83 171 0.83 40 126 162
MDDIFTQCREGNAVAVRLWLDNTENDLNQGDDHGFSPLHWACREGRSAVVEMLIMRGARINVMNRGDDTPLHLAASHGHRDIVQKLLQYKADINAVNEHGNVPLHYACFWGQDQVAEDLVANGALVSICNKYGEMPVDKAKAPLRELLRERAEKMGQNLNRIPYKDTFWKG
5SRM , Knot 80 169 0.81 38 119 162
SMVNSFSGYLKLTDNVYIKNADIVEEAKKVKPTVVVNAANVYLKHGGGVAGALNKATNNAMQVESDDYIATNGPLKVGGSCVLSGHNLAKHCLHVVGPNVNKGEDIQLLKSAYENFNQHEVLLAPLLSAGIFGADPIHSLRVCVDTVRTNVYLAVFDKNLYDKLVSSFL
3BOJ , Knot 99 213 0.83 40 150 205
SHMSLTPDQIVAALQERGWQAEIVTEFSLLNEMVDVDPQGILKCVDGRGSDNTQFCGPKMPGGIYAIAHNRGVTTLEGLKQITKEVASKGHVPSVHGDHSSDMLGCGFFKLWVTGRFDDMGYPRPQFDADQGAKAVENAGGVIEMHHGSHAEKVVYINLVENKTLEPDEDDQRFIVDGWAAGKFGLDVPKFLIAAAATVEMLGGPKKAKIVIP
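
The complexity columns can be illustrated with a short sketch. It assumes the LZ76 (Lempel and Ziv, 1976) exhaustive-history parsing; the page may use a different variant, so the function is not guaranteed to reproduce the LZ values listed here. The names `lz76` and `normalized_lz` are hypothetical. The normalization is the one in the table header, \(\mathrm{LZ}(w)/(n/\log_{20} n)\).

```python
import math

def lz76(s):
    """Number of components in the LZ76 parsing of s (assumed variant)."""
    i, components, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the component while it can be copied from earlier text
        # (the copy may overlap the component itself).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        components += 1
        i += length
    return components

def normalized_lz(lz, n, alphabet_size=20):
    """LZ(w) divided by n / log_alphabet_size(n)."""
    return lz / (n / math.log(n, alphabet_size))

print(lz76("0001101001000101"))            # parses as 0|001|10|100|1000|101
print(round(normalized_lz(83, 171), 2))    # values from the 2KBX row
```

Plugging the reported pairs (83, 171), (80, 169) and (99, 213) into `normalized_lz` gives 0.83, 0.81 and 0.83 after rounding, in agreement with the ratio column.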

Let \(P_w(k)\) be the set of distinct subwords (intervals) of length \(k\) in a word \(w\). Let \(p_w(k)\) be the cardinality of \(P_w(k)\). Let \(f(c)\) be the sequence in FASTA with 4-symbol Protein Data Bank code \(c\).
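
A minimal sketch of these definitions, with illustrative function names that do not come from the page: the set of distinct contiguous subwords of a given length is collected with a sliding window, and its size is the subword complexity.

```python
def subwords(w, k):
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def subword_complexity(w, k):
    """Cardinality of the set of distinct length-k subwords of w."""
    return len(subwords(w, k))

print(subwords("abab", 2))            # the two subwords 'ab' and 'ba'
print(subword_complexity("abab", 2))  # 2
```

A word of length \(n\) has at most \(n-k+1\) distinct subwords of length \(k\), so the set stays small for the sequences considered here.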

\(|P_{f(2KBX_1)}(2) \setminus P_{f(5SRM_1)}(2)|=80\), \(|P_{f(5SRM_1)}(2) \setminus P_{f(2KBX_1)}(2)|=73\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
100110000010111101110000001001000110110110001001110111101101011001000110111001000110011000101011000101110010111000110011101111010000101110010111001100010011001001100001101
Pair \(Z_2\) Length of longest common subword
2KBX_1,5SRM_1 153 3
2KBX_1,3BOJ_1 156 4
5SRM_1,3BOJ_1 157 3
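
The \(Z_k\) column can be sketched as the size of a symmetric difference of subword sets. The function names below are illustrative only, and the inputs are toy strings rather than the protein sequences.

```python
def subwords(w, k):
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x, y, k):
    """|P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)| for words x and y."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

# 'abab' has subwords {ab, ba}; 'abba' has {ab, bb, ba}; only 'bb' differs.
print(z_distance("abab", "abba", 2))  # 1
```

For the first pair in the table, the two one-sided differences reported in the text are 80 and 73, and their sum 153 is the tabulated \(Z_2\).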

Newick tree

 
(3BOJ_1:78.82,(2KBX_1:76.5,5SRM_1:76.5):2.32);

Let \(d\) denote the distance \(d\) of Otu and Sayood.
Let \(d_1\) denote the Otu–Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, the expected distance is \((0.85)(0.8)(\frac{340 }{\log_{20} 340}-\frac{169}{\log_{20}169})\approx 51.7\).
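
The arithmetic in the expected-distance estimate can be checked with a few lines. The constants 0.85, 0.8, 340 and 169 are taken as given from the formula; the helper name `n_over_log` is made up for this sketch.

```python
import math

def n_over_log(n, base=20):
    """n / log_base(n), the normalizing term used in the estimate."""
    return n / math.log(n, base)

expected = 0.85 * 0.8 * (n_over_log(340) - n_over_log(169))
print(round(expected, 1))  # 51.7
```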
Status Protein1 Protein2 d d1/2
Query variables 2KBX_1 5SRM_1 67 65.5

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \min + (1-\alpha) \max= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
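
The interpolation can be sketched as follows. The function name `delta` and its arguments are illustrative. The two inputs stand for the quantities whose minimum and maximum appear in the formula; the values 64 and 67 are not stated on the page but are inferred from the reported \(d=67\) and \(d_1/2=65.5\), on the assumption that \(d\) is the maximum and \(d_1\) is the sum of the two.

```python
def delta(a, b, alpha):
    """alpha * min(a, b) + (1 - alpha) * max(a, b)."""
    lo, hi = min(a, b), max(a, b)
    return alpha * lo + (1 - alpha) * hi

# Inferred inputs (an assumption): max = d = 67, sum = d1 = 131, so min = 64.
print(delta(64, 67, 0))    # recovers d
print(delta(64, 67, 0.5))  # recovers d1 / 2
```

Values of \(\alpha\) strictly between 0 and 1/2 give distances between \(d_1/2\) and \(d\).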