Not AI
I like big .vimrc and I cannot lie
Sofia, Bulgaria · (UTC +03:00) · https://ggerganov.com · @ggerganov · user/ggerganov
1,557 contributions in the last year
Contribution activity
April 2023
Created 152 commits in 5 repositories
Created a pull request in ggerganov/llama.cpp that received 26 comments
Add Q8_0 quantization for intermediate results
ref #909 This is an implementation of mode (E) from the referenced issue. Basically, we quantize the intermediate results to 8-bits, instead of 4-b…
+442 −18 · 26 comments
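The idea behind the PR above, quantizing intermediate results to 8 bits instead of 4, can be sketched roughly as follows. This is a simplified illustration of Q8_0-style block quantization (one scale per block of 32 values plus signed 8-bit integers), not the actual ggml implementation; the block size and scale handling here are assumptions for illustration.

```python
# Simplified sketch of Q8_0-style block quantization: each block of 32
# floats is stored as one float scale plus 32 signed 8-bit integers.
# Illustrative only -- not the actual ggml code or data layout.

BLOCK_SIZE = 32  # assumed block size

def quantize_q8_0(values):
    """Quantize a block of floats to (scale, list of int8 values)."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    # map each value into [-127, 127] relative to the block maximum
    return scale, [max(-127, min(127, round(v / scale))) for v in values]

def dequantize_q8_0(scale, q):
    """Recover approximate float values from a quantized block."""
    return [scale * x for x in q]

block = [0.5 * i - 8.0 for i in range(BLOCK_SIZE)]
scale, q = quantize_q8_0(block)
restored = dequantize_q8_0(scale, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

The rounding error per value is bounded by half the block scale, which is why 8-bit intermediates lose far less precision than 4-bit ones.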
Opened 25 other pull requests in 3 repositories
ggerganov/llama.cpp
14 merged · 3 closed · 2 open
- ggml : fix #if for f32_f32 mul_mat (CLBlast)
- Adjust mul_mat_f16 work memory
- common : change default parameters to pre-#1126
- ggml : add Q5_0 and Q5_1 quantization
- ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON)
- ggml : export symbols
- llama : refactor get / set state + remove redundant kv cache API
- ggml : alternative Q4_3 implementation using modified Q8_0
- ggml : alternative Q4_3 format + implementation
- llama : quantize attention results
- Add Q4_3 quantization (ARM NEON)
- ggml : use 8-bit precision for Q4_1 intermediate results
- ggml : Q4_2 ARM
- ggml : test dot product q4_0 x f32
- New Q4_0 implementation using 2x F16 instead of 1x F32
- Speed-up ggml_vec_dot_q4_1() ARM_NEON
- ggml : multi-thread ggml_rope() (~3-4 times faster on M1)
- Demo usage of Flash Attention
- Avoid heavy V transpose operation + improvements
ggerganov/ggml
2 merged · 2 open
ggerganov/whisper.cpp
2 merged
Reviewed 130 pull requests in 3 repositories
ggerganov/llama.cpp
25 pull requests
- Various fixes to mat_mul benchmark
- CLBlast: q5_0, q5_1, q8_0 dequant kernels
- Add git-based build information for better issue tracking
- Remove Q4_3 which is no better than Q5
- Sample interface, new samplers,
- Created a Server example
- Jeopardy Example Script
- read chat prompts from a template file
- Save and restore prompt evaluation state for much faster startup times
- CLBlast support
- cuBLAS: use host pinned memory and dequantize while copying
- Q5: Slightly faster AVX2 implementation
- AVX2 optimizations for Q5_0, Q5_1
- Allow setting the rng seed after initialization.
- ggml : add Q5_0 and Q5_1 quantization
- ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON)
- Update SHA256SUMS after quantization change
- Use full range for q4_0 quantization
- Implement scalar sum over all rows in ggml_compute_forward_sum_f32
- Fix build for gcc 8 and test in CI
- Fix cuda compilation
- add save_load_state example
- Improve AVX2 for vec_dot_q4_3_q8_0
- Trigger CI for drafts, but not most PR actions
- Fix CI: quantization unit tests, editorconfig
Some pull request reviews not shown.
ggerganov/whisper.cpp
25 pull requests
- whisper: Use correct seek_end when offset is used
- add some tips in the readme of the android project folder
- Optionally allow a Core ML build of Whisper to work with or without Core ML models
- C++11 style
- Escape quotes in csv output
- Flush upon finishing inference
- Allow duration knob to work correctly with speed_up knob
- examples : add missing #include <cstdint>
- Updated escape_double_quotes() Function
- ggml : fix build on whisper.android (ARM_NEON)
- Do not launch threads for log_mel_spectrogram when single-threaded
- Fix the bug related to word splitting errors in the "tokenize" function.
- readme: Add alternate swift bindings
- fix potential memory leaks
- Update LICENSE
- Fix typos in whisper.h
- Update stream.cpp
- readme : add Unity3d bindings
- talk/talk-llama: add basic example script for eleven-labs tts
- Changed convert-pt-to-ggml.py to use .tiktoken tokenizer files
- Add msvc compiler args /utf-8 fix error C3688
- Corrects default speak.sh path in talk-llama
- Add lrc output support
- Making the quick start instructions clearer.
- Makefile: disable avx in case f16c is not available
Some pull request reviews not shown.
Created an issue in ggerganov/llama.cpp that received 7 comments
Investigate alternative ggml_compute_forward_mul_mat_q_f32() implementation
This is the most computationally significant call in the entire transformer evaluation, so we have to be sure that it is running optimally. It comp…
7 comments
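The computation the issue above refers to can be illustrated with a toy sketch: quantize both operands into 8-bit blocks, then accumulate the dot product block by block, applying the two stored scales once per block rather than per element. The function names and block size below are illustrative assumptions, not the ggml API.

```python
# Toy sketch of a quantized dot product, the core of a quantized
# mul_mat: integer multiply-accumulate within each block, scaled back
# to float once per block. Illustrative only -- not ggml's kernel.

BLOCK_SIZE = 4  # toy block size; ggml uses larger blocks in practice

def quantize_block(values):
    """One scale per block; values stored as signed 8-bit integers."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    return scale, [round(v / scale) for v in values]

def dot_quantized(x, y):
    """Approximate dot(x, y) using per-block 8-bit quantization."""
    total = 0.0
    for i in range(0, len(x), BLOCK_SIZE):
        sx, qx = quantize_block(x[i:i + BLOCK_SIZE])
        sy, qy = quantize_block(y[i:i + BLOCK_SIZE])
        # integer MAC, converted to float once per block
        total += sx * sy * sum(a * b for a, b in zip(qx, qy))
    return total

x = [1.0, 2.0, 3.0, 4.0, 0.5, -0.5, 1.5, -1.5]
y = [0.1, 0.2, 0.3, 0.4, 1.0, 1.0, 1.0, 1.0]
approx = dot_quantized(x, y)
exact = sum(a * b for a, b in zip(x, y))
```

Keeping the inner loop in integers is what makes this routine a target for SIMD and multi-threading optimizations, hence its significance for overall transformer evaluation speed.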
Opened 10 other issues in 1 repository
ggerganov/llama.cpp
10 closed
- No cuBLAS performance gain for F16
- llama.cpp + Final Jeopardy
- Try to use quantized ggml_mul_mat in attention layer
- Multi-thread the Q8_0 quantization in ggml_compute_forward_mul_mat_q_f32()
- Measure perplexity delta between Q4_0 and F16 "output" tensor
- Investigate the performance (speed and perplexity) of Q4_0 with 2x F16 factors
- Investigate storing results from ggml operations in F16 format
- Add GPU support to ggml
- Fix quantize_row_q4_1() with ARM_NEON
- Multi-thread ggml_cpy()
Started 4 discussions in 2 repositories
ggerganov/llama.cpp
- Roadmap May 2023 (Apr 28)
- Add GPU support to ggml (Apr 12)
- Roadmap Apr 2023 (Apr 5)
ggerganov/whisper.cpp
- v1.3.0 (Apr 15)