CoV2D Browser™

Parikh vectors
3JPR_1 8GFF_1 2GTP_1 Letter Amino acid
12 50 14 N Asparagine
27 28 24 E Glutamic acid
9 14 6 H Histidine
7 14 7 M Methionine
14 22 4 P Proline
22 15 16 G Glycine
26 55 25 L Leucine
36 56 27 K Lysine
21 42 16 S Serine
18 33 26 A Alanine
21 28 23 D Aspartic acid
14 33 19 F Phenylalanine
23 42 25 I Isoleucine
19 21 20 T Threonine
1 3 3 W Tryptophan
12 35 13 Y Tyrosine
18 13 18 V Valine
19 17 15 R Arginine
3 2 9 C Cysteine
13 21 13 Q Glutamine
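Each column of the table above is a tally of letters in the corresponding sequence. The following is a minimal Python sketch of that tally; the function name `parikh_vector` and the alphabetical letter order are illustrative assumptions, not something the page specifies.

```python
from collections import Counter

# The 20 standard amino-acid letters. The alphabetical order is an assumption
# made for illustration; the table above lists the rows in a different order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def parikh_vector(sequence):
    """Return a dict mapping each amino-acid letter to its count in `sequence`."""
    counts = Counter(sequence)  # a Counter returns 0 for letters that never occur
    return {letter: counts[letter] for letter in AMINO_ACIDS}

# Toy input: the first ten residues of the 3JPR_1 sequence.
vector = parikh_vector("MSKRKAPQET")
print(vector["K"], sum(vector.values()))  # prints: 2 10
```

Applied to a full sequence, the entries sum to the sequence length; the 3JPR_1 column above, for instance, sums to 335.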

>3JPR_1|Chain A|DNA polymerase beta|Homo sapiens (9606)
>8GFF_1|Chain A|Lytic transglycosylase domain-containing protein|Campylobacter jejuni (197)
>2GTP_1|Chains A, B|Guanine nucleotide-binding protein G(i), alpha-1 subunit|Homo sapiens (9606)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
3JPR , Knot 146 335 0.84 40 199 321
MSKRKAPQETLNGGITDMLTELANFEKNVSQAIHKYNAYRKAASVIAKYPHKIKSGAEAKKLPGVGTKIAEKIDEFLATGKLRKLEKIRQDDTSSSINFLTRVSGIGPSAARKFVDEGIKTLEDLRKNEDKLNHHQRIGLKYFGDFEKRIPREEMLQMQDIVLNEVKKVDSEYIATVCGSFRRGAESSGDMDVLLTHPSFTSESTKQPKLLHQVVEQLQKVHFITDTLSKGETKFMGVCQLPSKNDEKEYPHRRIDIRLIPKDQYYCGVLYFTGSDIFNKNMRAHALEKGFTINEYTIRPLGVTGVAGEPLPVDSEKDIFDYIQWKYREPKDRSE
8GFF , Knot 210 544 0.81 40 235 497
MGSSHHHHHHSSGLVPRGSHMQYSIEKLKKEENSLAKDYYIYRLLEKNKISKKDAQDLNSHIFRYIGKIKSELEKIIPLKPYINPKYAKCYTYTANTILDANLTCQSVRLNSLVFIASLNSKDRTTLAQTFKNQRPDLTNLLLAFNTSDPMSYIVQKEDINGFFKLYNYSKKYDLDLNTSLVNKLPNHIGFKDFAQNIIIKKENPKFRHSMLEINPENVSEDSAFYLGVNALTYDKTELAYDFFKKAAQSFKSQSNKDNAIFWMWLIKNNEEDLKTLSQSSSLNIYSLYAKELTNTPFPKIESLNPSKKKNNFNMQDPFAWQKINKQIRDANASQLDVLAKEFDTQETLPIYAYILERKNNFKKHYFIMPYYDNIKDYNKTRQALILAIARQESRFIPTAISVSYALGMMQFMPFLANHIGEKELKIPNFDQDFMFKPEIAYYFGNYHLNYLESRLKSPLFVAYAYNGGIGFTNRMLARNDMFKTGKFEPFLSMELVPYQESRIYGKKVLANYIVYRHLLNDSIKISDIFENLIQNKANDLNKS
2GTP , Knot 138 323 0.82 40 198 306
REVKLLLLGAGESGKSTIVKQMKIIHEAGYSEEECKQYKAVVYSNTIQSIIAIIRAMGRLKIDFGDSARADDARQLFVLAGAAEEGFMTAELAGVIKRLWKDSGVQACFNRSREYQLNDSAAYYLNDLDRIAQPNYIPTQQDVLRTRVKTTGIVETHFTFKDLHFKMFDVGGQRSERKKWIHCFEGVTAIIFCVALSDYDLVLAEDEEMNRMHESMKLFDSICNNKWFTDTSIILFLNKKDLFEEKIKKSPLTICYPEYAGSNTYEEAAAYIQCQFEDLNKRKDTKEIYTHFTCATDTKNVQFVFDAVTDVIIKNNLKDCGLF
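The normalized column \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be recomputed from the LZ-complexity and length columns. The sketch below does only that arithmetic (it does not compute the LZ-complexity itself); the helper name `normalized_lz` is hypothetical. With the values from the table it agrees with the listed ratios to within 0.01.

```python
import math

def normalized_lz(lz, n, alphabet_size=20):
    """Return LZ(w) / (n / log_b n), where b is the alphabet size."""
    return lz / (n / math.log(n, alphabet_size))

# (LZ-complexity, length) pairs copied from the table above.
rows = {"3JPR": (146, 335), "8GFF": (210, 544), "2GTP": (138, 323)}
ratios = {code: normalized_lz(lz, n) for code, (lz, n) in rows.items()}

for code, ratio in ratios.items():
    print(code, round(ratio, 3))
```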

Let \(P_w(k)\) be the set of distinct subwords (intervals) of length \(k\) in a word \(w\), and let \(p_w(k)=|P_w(k)|\) be its cardinality. Let \(f(c)\) be the FASTA sequence for the 4-symbol Protein Data Bank code \(c\).
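As a small illustration of these definitions, the set of distinct subwords of a given length and its cardinality can be computed with a sliding window. This is a minimal sketch; the function names are hypothetical.

```python
def subwords(w, k):
    """Return the set of distinct length-k subwords (contiguous intervals) of w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def subword_complexity(w, k):
    """Return the number of distinct length-k subwords of w."""
    return len(subwords(w, k))

# Toy word: "ABAB" has two distinct subwords of length 2, "AB" and "BA".
print(sorted(subwords("ABAB", 2)))     # prints: ['AB', 'BA']
print(subword_complexity("ABAB", 3))   # prints: 2
```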

\(|P_{f(3JPR_1)}(2) \setminus P_{f(8GFF_1)}(2)|=64\), \(|P_{f(8GFF_1)}(2) \setminus P_{f(3JPR_1)}(2)|=100\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\); for this pair, \(Z_2 = 64+100 = 164\).

Hydrophobic-polar version of Sequence 1:
10000110001011100110011010001001100001000110111001001001101001111100110010011101010010010000000010110010111101100110011001001000000100000111001101000110001101001110010010000110101010011000101011100101000000010110011001001011000100100011110011000000001000101011100000011101010011000101011001101000010111101111011110000011001010000100000
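The page does not state which residues count as hydrophobic. Comparing the binary string above with the 3JPR_1 sequence suggests that A, F, G, I, L, M, P, V and W map to 1 and every other residue maps to 0. The sketch below encodes a sequence under that inferred assumption; the names `HYDROPHOBIC` and `hp_encode` are hypothetical.

```python
# Residues mapped to 1. This set is an assumption inferred by comparing the
# binary string with the 3JPR_1 sequence; the page does not state the classification.
HYDROPHOBIC = set("AFGILMPVW")

def hp_encode(sequence):
    """Map each residue to '1' (hydrophobic) or '0' (polar)."""
    return "".join("1" if residue in HYDROPHOBIC else "0" for residue in sequence)

# First ten residues of 3JPR_1; the result matches the first ten digits above.
encoded = hp_encode("MSKRKAPQET")
print(encoded)  # prints: 1000011000
```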
Pair \(Z_2\) Length of longest common subword
3JPR_1,8GFF_1 164 4
3JPR_1,2GTP_1 149 3
8GFF_1,2GTP_1 173 4
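The two columns of this table can be sketched in a few lines of Python. `z_distance` follows the definition of \(Z_k\) directly, and `jaccard_distance` divides it by the size of the union, which is the usual Jaccard normalization (an assumption here, since the page reports only the numerator). The last column is read as the length of the longest common contiguous subword, which is what values as small as 3 or 4 suggest; all function names are hypothetical.

```python
def subwords(w, k):
    """Return the set of distinct length-k subwords (contiguous intervals) of w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x, y, k):
    """Z_k(x, y): size of the symmetric difference of the length-k subword sets."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

def jaccard_distance(x, y, k):
    """Z_k(x, y) divided by the size of the union of the two subword sets."""
    union = subwords(x, k) | subwords(y, k)
    return z_distance(x, y, k) / len(union) if union else 0.0

def longest_common_subword_length(x, y):
    """Length of the longest contiguous subword shared by x and y (row-by-row DP)."""
    best = 0
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0] * (len(y) + 1)
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                # Extend the common suffix that ends just before both characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

# Toy words: the subword sets are {AB, BA} and {AB, BB}.
z = z_distance("ABAB", "ABBB", 2)
print(z, longest_common_subword_length("ABAB", "ABBB"))  # prints: 2 2
```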

Newick tree

 
(
	8GFF_1:87.29,
	(
		3JPR_1:74.5,2GTP_1:74.5
	):12.79
);

Let \(d\) and \(d_1\) be the Otu–Sayood distances denoted by the same symbols. (The latter makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)\left(\frac{879}{\log_{20} 879}-\frac{335}{\log_{20}335}\right)\approx 146\), where \(879 = 335 + 544\) is the combined length of 3JPR_1 and 8GFF_1.
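The arithmetic in this estimate can be checked directly. In the sketch below the factors 0.85 and 0.8 are taken as given from the formula, and 879 is treated as the sum of the two sequence lengths 335 and 544; the helper name `scaled_length` is hypothetical.

```python
import math

def scaled_length(n, alphabet_size=20):
    """Return n / log_b n, where b is the alphabet size."""
    return n / math.log(n, alphabet_size)

n_x, n_y = 335, 544  # lengths of 3JPR_1 and 8GFF_1
estimate = 0.85 * 0.8 * (scaled_length(n_x + n_y) - scaled_length(n_x))
print(round(estimate, 2))
```

The result is a little under 147, which matches the stated value of 146 if the page truncates rather than rounds.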
Status Protein1 Protein2 d d1/2
Query variables 3JPR_1 8GFF_1 187 150

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \min + (1-\alpha) \max= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2. \end{cases} \]
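To make the interpolation concrete, the sketch below evaluates \(\delta\) for two one-directional quantities \(a\) and \(b\). It assumes that the min and max in the formula range over those two quantities, so that \(d=\max(a,b)\) and \(d_1=a+b\). The values 187 and 113 are not reported by the page as such: they are back-solved from the query row (\(d=187\), \(d_1/2=150\)) under that assumption.

```python
def delta(alpha, a, b):
    """Interpolate between max(a, b) at alpha = 0 and min(a, b) at alpha = 1."""
    return alpha * min(a, b) + (1 - alpha) * max(a, b)

# Hypothetical one-directional values, back-solved from d = 187 and d1/2 = 150.
a, b = 187, 113

print(delta(0, a, b))    # prints: 187  (the max, i.e. d)
print(delta(0.5, a, b))  # prints: 150.0  (the average, i.e. d1/2)
```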