CoV2D Browser™

Parikh vectors
4WOP_1 3MYM_1 5LLY_1 Letter Amino acid
4 5 11 M Methionine
11 2 22 P Proline
14 7 25 T Threonine
6 3 40 I Isoleucine
13 7 33 R Arginine
3 1 5 C Cysteine
6 7 39 E Glutamic acid
2 1 12 W Tryptophan
1 5 14 Y Tyrosine
14 11 29 D Aspartic acid
7 7 25 Q Glutamine
23 8 24 G Glycine
27 12 54 L Leucine
3 11 19 K Lysine
2 10 18 F Phenylalanine
7 12 38 S Serine
26 8 27 V Valine
51 12 48 A Alanine
3 6 27 N Asparagine
2 2 19 H Histidine
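
A Parikh vector counts how often each letter occurs in a word. As a minimal sketch, assuming the sequence is available as a plain string of one-letter amino acid codes (the name `parikh_vector` is illustrative and not taken from the CoV2D code), the counts could be computed as follows:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def parikh_vector(sequence: str) -> dict[str, int]:
    """Count occurrences of each standard amino acid letter in `sequence`."""
    counts = Counter(sequence)
    # Report every letter, including those that do not occur (count 0).
    return {letter: counts.get(letter, 0) for letter in AMINO_ACIDS}

# Toy input: the first ten residues of the 3MYM_1 sequence.
vector = parikh_vector("GFKQDIATIR")
print(vector)
```

Applied to a full sequence, each column of the table corresponds to one such vector.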

>4WOP_1|Chains A, B, C, D|ATP-dependent dethiobiotin synthetase BioD|Mycobacterium tuberculosis (419947)
>3MYM_1|Chains A, B|Dehaloperoxidase A|Amphitrite ornata (129555)
>5LLY_1|Chains A, B, C, D|Diguanylate cyclase (GGDEF) domain-containing protein|Idiomarina sp. A28L (1036674)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
4WOP , Knot 89 225 0.71 40 119 190
ATILVVTGTGTGVGKTVVCAALASAARQAGIDVAVCKPVQTGTARGDDDLAEVGRLAGVTQLAGLARYPQPMAPAAAAEHAGMALPARDQIVRLIADLDRPGRLTLVEGAGGLLVELAEPGVTLRDVAVDVAAAALVVVTADLGTLNHTKLTLEALAAQQVSCAGLVIGSWPDPPGLVAASNRSALARIAMVRAALPAGAASLDAGDFAAMSAAAFDRNWVAGLV
3MYM , Knot 68 137 0.81 40 109 134
GFKQDIATIRGDLRTYAQDIFLAFLNKYPDERRYFKNYVGKSDQELKSMAKFGDHTEKVFNLMMEVADRATDCVPLASDANTLVQEKQHSSLTTGNFEKLFVALVEYMRASGQSFDSQSWDRFGKNLVSALSSAGMK
5LLY , Knot 210 529 0.83 40 254 502
GAMAADLGSDDISKLIAACDQEPIHIPNAIQPFGAMLIVEKDTQQIVYASANSAEYFSVADNTIHELSDIKQANINSLLPEHLISGLASAIRENEPIWVETDRLSFLGWRHENYYIIEVERYHVQTSNWFEIQFQRAFQKLRNCKTHNDLINTLTRLIQEISGYDRVMIYQFDPEWNGRVIAESVRQLFTSMLNHHFPASDIPAQARAMYSINPIRIIPDVNAEPQPLHMIHKPQNTEAVNLSSGVLRAVSPLHMQYLRNFGVSASTSIGIFNEDELWGIVACHHTKPRAIGRRIRRLLVRTVEFAAERLWLIHSRNVERYMVTVQAAREQLSTTADDKHSSHEIVIEHAADWCKLFRCDGIGYLRGEELTTYGETPDQTTINKLVEWLEENGKKSLFWHSHMLKEDAPGLLPDGSRFAGLLAIPLKSDADLFSYLLLFRVAQNEVRTWAGKPEKLSVETSTGTMLGPRKSFEAWQDEVSGKSQPWRTAQLYAARDIARDLLIVADSMQLNLLNDQLADANENLEKLAS
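
The normalized column \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be recomputed from the LZ-complexity and length columns. The sketch below takes the \(\mathrm{LZ}(w)\) values from the table as given (it does not recompute the LZ-complexity itself); the function name `normalized_lz` is illustrative. The results agree with the table if the displayed values are truncated, rather than rounded, to two decimals.

```python
import math

def normalized_lz(lz: int, n: int, alphabet_size: int = 20) -> float:
    """Return LZ(w) / (n / log_a n) for a word of length n over a letters."""
    return lz / (n / math.log(n, alphabet_size))

# (code, LZ(w), n) triples copied from the table above.
rows = [("4WOP", 89, 225), ("3MYM", 68, 137), ("5LLY", 210, 529)]
ratios = {code: normalized_lz(lz, n) for code, lz, n in rows}
for code, value in ratios.items():
    print(code, f"{value:.3f}")
```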

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)=|P_w(n)|\) be its cardinality. Let \(f(c)\) be the sequence in FASTA format with 4-symbol Protein Data Bank code \(c\).
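
As a minimal sketch of these definitions, assuming words are plain Python strings (the names `subwords` and `subword_count` are illustrative), \(P_w(n)\) and \(p_w(n)\) can be computed with a sliding window:

```python
def subwords(w: str, n: int) -> set[str]:
    """Return P_w(n): the set of distinct contiguous length-n subwords of w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def subword_count(w: str, n: int) -> int:
    """Return p_w(n), the number of distinct length-n subwords of w."""
    return len(subwords(w, n))

# Toy input: the first ten residues of the 4WOP_1 sequence.
prefix = "ATILVVTGTG"
print(subword_count(prefix, 1), subword_count(prefix, 2))
```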

\(|P_{f(4WOP_1)}(2) \setminus P_{f(3MYM_1)}(2)|=77\), \(|P_{f(3MYM_1)}(2) \setminus P_{f(4WOP_1)}(2)|=67\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be the size of the symmetric difference of \(P_x(k)\) and \(P_y(k)\), that is, the numerator of an LZ76-style (set-of-subwords) Jaccard distance for \(P(k)\).

Hydrophobic-polar version of sequence 1 (4WOP_1):
101111010101110011011110110011101110011001010100011011011110011111001011111111001111111000110111010011010110111111101101110100111011111111101011010000101011110010011111101101111111000011101111011111111010110111101111000111111
Pair \(Z_2\) Length of longest common subword
4WOP_1,3MYM_1 144 3
4WOP_1,5LLY_1 189 4
3MYM_1,5LLY_1 207 3
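
The quantity \(Z_k(x,y)\) is the size of the symmetric difference of the two subword sets; for the first pair, \(77+67=144\), matching the table. A minimal sketch, assuming words are plain Python strings (the function names are illustrative):

```python
def subwords(w: str, k: int) -> set[str]:
    """Return the set of distinct contiguous length-k subwords of w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z(x: str, y: str, k: int) -> int:
    """Return Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

# Toy inputs: six-residue prefixes of the 4WOP_1 and 3MYM_1 sequences.
print(z("ATILVV", "GFKQDI", 2))
```

Calling `z` with two full sequences and `k=2` gives the \(Z_2\) column.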

Newick tree

 
(5LLY_1:10.61,(4WOP_1:72,3MYM_1:72):34.61);

Let \(d\) denote the Otu–Sayood distance \(d\).
Let \(d_1\) denote the Otu–Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)\left(\frac{362}{\log_{20} 362}-\frac{137}{\log_{20}137}\right)=68.4\), where \(362=225+137\).
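
The arithmetic behind this estimate can be checked directly. The sketch below assumes that 362 is the length of the concatenation of the two query sequences (225 + 137) and takes the factors 0.85 and 0.8 from the formula as given; the helper name `scale` is illustrative.

```python
import math

def scale(n: int, alphabet_size: int = 20) -> float:
    """Return n / log_a n, the normalizing term used in the LZ ratio."""
    return n / math.log(n, alphabet_size)

n_x, n_y = 225, 137  # lengths of 4WOP_1 and 3MYM_1
# Assumption: 362 is the length of the concatenated word.
expected = 0.85 * 0.8 * (scale(n_x + n_y) - scale(n_y))
print(round(expected, 1))
```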
Status Protein1 Protein2 d d1/2
Query variables 4WOP_1 3MYM_1 78 65.5

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \mathrm{min} + (1-\alpha) \mathrm{max}= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
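
As a worked example of this interpolation, the sketch below assumes that min and max denote the smaller and larger of the two quantities being combined (the page does not spell them out). From the query row, \(d=78\) and \(d_1/2=65.5\), so the formula would imply max = 78 and min = 2(65.5) - 78 = 53.

```python
def delta(alpha: float, smaller: float, larger: float) -> float:
    """Return alpha * min + (1 - alpha) * max."""
    return alpha * smaller + (1 - alpha) * larger

d, d1_half = 78, 65.5        # values from the query row
larger = d                   # alpha = 0 gives the max
smaller = 2 * d1_half - d    # solve (min + max) / 2 = d1 / 2 for min
print(smaller, delta(0, smaller, larger), delta(0.5, smaller, larger))
```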