A Comprehensive Guide to Laptop-friendly Gemma
Gemma falls into the category of foundation models, alongside text-to-image, text-to-code, and speech-to-text systems. Foundation models are large-scale neural network architectures trained on vast amounts of data to understand and generate human-like outputs. These models serve as the backbone, or foundation, upon which more specific natural language processing (NLP) tasks and applications can be built.

Gemma is a family of lightweight, open models built from the same research and technology behind Google's Gemini models, and a notable addition to Google's array of AI models. Gemma has been pre-trained on a diverse range of text corpora, such as web documents, code, and mathematics, enabling it to handle a wide variety of tasks and text formats. With Gemma, Google released a laptop-friendly open model based on Gemini technology that can be used to build content generation tools and chatbots.

According to an analysis by Awni Hannun, a machine learning research scientist at Apple, Gemma is optimized to be highly efficient in a way that makes it suitable for low-resource environments. Hannun observed that Gemma has a vocabulary of 250,000 (250k) tokens, versus 32k for comparable models. This matters because Gemma can recognize and process a wider variety of words, allowing it to handle tasks involving complex language. His analysis suggests that this extensive vocabulary enhances the model's versatility across different types of content, and he believes it may also help with math, code, and other modalities.

He also noted that the "embedding weights" are massive, at roughly 750 million. Embedding weights are the parameters that map words to representations of their meanings and relationships. An important feature he called out is that these embedding weights, which encode detailed information about word meanings and relationships, are used not just in processing the input but also in generating the model's output.
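To see why a 250k-token vocabulary leads to such a large embedding table, a back-of-the-envelope calculation helps: the parameter count is simply vocabulary size times embedding dimension. The sketch below uses the article's rounded vocabulary figures; the embedding dimension of 3,000 is an assumption chosen purely to illustrate the scale, not an official Gemma specification.

```python
# Embedding parameter count = vocabulary size x embedding dimension.
# Vocabulary sizes are the article's figures; embed_dim is assumed.
vocab_size = 250_000   # Gemma's reported vocabulary
embed_dim = 3_000      # illustrative assumption, not an official figure
gemma_embed_params = vocab_size * embed_dim

small_vocab = 32_000   # vocabulary of comparable models
comparable_embed_params = small_vocab * embed_dim

print(f"Gemma embedding params:      {gemma_embed_params:,}")       # 750,000,000
print(f"Comparable embedding params: {comparable_embed_params:,}")  # 96,000,000
```

Under these assumptions the embedding table alone accounts for three quarters of a billion parameters, roughly eight times larger than that of a 32k-vocabulary model with the same embedding dimension.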
This sharing improves the model's efficiency by allowing it to better leverage its understanding of language when producing text. For end users, this means more accurate, relevant, and contextually appropriate responses from the model, which makes it more useful for content generation, chatbots, and translation.
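The sharing described above is commonly known as weight tying: one embedding matrix serves both the input lookup and the output projection, so the same parameters are learned from both directions. The NumPy sketch below is a minimal illustration of the general technique with toy shapes, not Gemma's actual implementation.

```python
import numpy as np

# Weight tying: a single embedding matrix E is used on both the
# input side (token -> vector) and the output side (vector -> logits).
rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4                      # toy sizes for illustration
E = rng.standard_normal((vocab_size, embed_dim))   # shared embedding table

def embed(token_ids):
    # Input side: look up the embedding vector for each token id.
    return E[token_ids]

def output_logits(hidden_state):
    # Output side: reuse the SAME matrix (transposed) to score every
    # vocabulary token against the model's final hidden state.
    return hidden_state @ E.T

hidden = embed([3])[0]          # stand-in for a transformer's final hidden state
logits = output_logits(hidden)
print(logits.shape)             # (10,) -- one score per vocabulary token
```

Because the output projection reuses `E` instead of allocating a separate `vocab_size × embed_dim` matrix, tying roughly halves the parameters spent on the vocabulary, which is a significant saving when the embedding table is as large as Gemma's.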